Using singularity images¶

The CQLS has installed Singularity on the compute nodes to support containerized computing. Please let us know whether the current Singularity deployment works for you so we can better serve your needs as researchers.

Tip

We recommend using nextflow -profile singularity when an nf-core pipeline is available. The singularity profile significantly simplifies running and maintaining nextflow pipelines.

Using `singularity` instead of `docker`¶

One of the reasons Singularity was developed was to allow containers to run in multi-user environments, such as the CQLS Wildwood HPC.

For many docker workflows, migrating to Singularity is relatively simple. Often you can download a docker image with singularity pull and then run the converted .sif file directly. Some docker images ship with wrapper scripts that need small modifications, such as additional bind paths.
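As a rough sketch of that pull-and-run pattern, the snippet below converts a docker image to a .sif file and runs a command from it. The image URI is only an example taken from the module list on this page; the helper name default_sif_name is our own invention, and the `<name>_<tag>.sif` naming is what singularity pull normally uses when no output name is given. Substitute the image your workflow actually needs.

```shell
# Derive the file name that `singularity pull` uses by default: <name>_<tag>.sif
# (hypothetical helper, not part of Singularity itself).
default_sif_name() {
    name_tag=${1##*/}                    # strip the scheme, registry, and namespace
    case "$name_tag" in
        *:*) ;;                          # a tag is already present
        *)   name_tag="${name_tag}:latest" ;;
    esac
    printf '%s.sif\n' "$(printf '%s' "$name_tag" | tr ':' '_')"
}

image_uri="docker://teambraker/braker3:latest"   # example image; substitute your own
sif_file=$(default_sif_name "$image_uri")

if command -v singularity >/dev/null 2>&1; then
    # Download and convert the docker image once, then run a command inside it.
    [ -f "$sif_file" ] || singularity pull "$sif_file" "$image_uri"
    singularity exec "$sif_file" braker.pl --version
else
    echo "singularity not found; would pull $image_uri into $sif_file"
fi
```

If the image is already available as a module (see the next section), prefer the module so you do not keep a second copy of a large image.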

Please let us know if this is not working for you.

Finding pre-downloaded singularity images¶

We manage Singularity images through the Lmod module system. This lets users discover Singularity images without having to know where the images are stored on disk.

➜ ml avail

-------------------------------------- /local/cqls/singularity/shpc/modules --------------------------------------
   gitlab-registry.in2p3.fr/phoogle/pelican/1.0.8/module    teambraker/braker3/latest/module
   python/3.13-rc/module

------------ /fs1/local/cqls/opt/bootstrap/lmod/x86_64/.pixi/envs/default/lmod/lmod/modulefiles/Core -------------
   lmod    settarg

Module defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.
See https://lmod.readthedocs.io/en/latest/060_locating.html for details.

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

Using the ml avail command, you can see which Singularity images have been downloaded. To get more information about an image, run the ml spider command, optionally followed by a module (image) name. For example, to get more information about the braker3 install:

➜ ml spider braker3

--------------------------------------------------------------------------------------------------------------
  teambraker/braker3: teambraker/braker3/latest/module
--------------------------------------------------------------------------------------------------------------

    This module can be loaded directly: module load teambraker/braker3/latest/module

    Help:
      This module is a singularity container wrapper for teambraker/braker3:latest vlatest


      Container (available through variable SINGULARITY_CONTAINER):

       - /local/cqls/singularity/shpc/images/teambraker/braker3/latest/teambraker-braker3-latest-sha256:8e8cc01384971a9cf04a4dc519faf3df947ed62b2bae2ce8be3075cb5b5e1e1e.sif

      Commands include:

       - braker3-run:
             singularity run -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container> "$@"
       - braker3-shell:
             singularity shell -s /bin/sh -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container>
       - braker3-exec:
             singularity exec -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container> "$@"
       - braker3-inspect-runscript:
             singularity inspect -r <container>
       - braker3-inspect-deffile:
             singularity inspect -d <container>
       - braker3-container:
             echo "$SINGULARITY_CONTAINER"

       - braker:
             singularity exec -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container> braker.pl "$@"


      For each of the above, you can export:

       - SINGULARITY_OPTS: to define custom options for singularity (e.g., --debug)
       - SINGULARITY_COMMAND_OPTS: to define custom options for the command (e.g., -b)
       - SINGULARITY_CONTAINER: full path to the Singularity Container

Make note of the line that shows how to load the module ("This module can be loaded directly: ...").

Loading modules¶

Let's load a module:

ml load teambraker/braker3

And let's make sure it loaded properly:

➜ which braker
/local/cqls/singularity/shpc/wrappers/teambraker/braker3/latest/bin/braker
➜ braker -help

DESCRIPTION

braker.pl   Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
            RNA-Seq and/or proteins

SYNOPSIS

braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

INPUT FILE OPTIONS

--genome=genome.fa                  fasta file with DNA sequences
--bam=rnaseq.bam                    bam file with spliced alignments from
                                    RNA-Seq
--prot_seq=prot.fa                  A protein sequence file in multi-fasta
                                    format used to generate protein hints.
                                    Unless otherwise specified, braker.pl will
                                    run in "EP mode" or "ETP mode which uses
                                    ProtHint to generate protein hints and
                                    GeneMark-EP+ or GeneMark-ETP to
                                    train AUGUSTUS.
--hints=hints.gff                   Alternatively to calling braker.pl with a
                                    bam or protein fasta file, it is possible to
                                    call it with a .gff file that contains
                                    introns extracted from RNA-Seq and/or
                                    protein hints (most frequently coming
                                    from ProtHint). If you wish to use the
                                    ProtHint hints, use its
                                    "prothint_augustus.gff" output file.
                                    This flag also allows the usage of hints
                                    from additional extrinsic sources for gene
                                    prediction with AUGUSTUS. To consider such
                                    additional extrinsic information, you need
                                    to use the flag --extrinsicCfgFiles to
                                    specify parameters for all sources in the
                                    hints file (including the source "E" for
                                    intron hints from RNA-Seq).
                                    In ETP mode, this option can be used together
                                    with --geneMarkGtf and --traingenes to provide
                                    BRAKER with results of a previous GeneMark-ETP
                                    run, so that the GeneMark-ETP step can be
                                    skipped. In this case, specify the hintsfile of
                                    a previous BRAKER run here, or generate a
                                    hintsfile from the GeneMark-ETP working
                                    directory with the script get_etp_hints.py.
--rnaseq_sets_ids=SRR1111,SRR1115   IDs of RNA-Seq sets that are either in
                                    one of the directories specified with
                                    --rnaseq_sets_dir, or that can be downloaded
                                    from SRA. If you want to use local files, you
                                    can use unaligned reads in FASTQ format
                                    (they have to be named ID.fastq if unpaired or
                                    ID_1.fastq, ID_2.fastq if paired), or aligned reads
                                    as a BAM file (named ID.bam).
--rnaseq_sets_dir=/path/to/rna_dir1 Locations where the local files of RNA-Seq data
                                    reside that were specified with --rnaseq_sets_ids.

FREQUENTLY USED OPTIONS

--species=sname                     Species name. Existing species will not be
                                    overwritten. Uses Sp_1 etc., if no species
                                    is assigned
--AUGUSTUS_ab_initio                output ab initio predictions by AUGUSTUS
                                    in addition to predictions with hints by
                                    AUGUSTUS
--softmasking_off                   Turn off softmasking option (enables by
                                    default, discouraged to disable!)
--esmode                            Run GeneMark-ES (genome sequence only) and
                                    train AUGUSTUS on long genes predicted by
                                    GeneMark-ES. Final predictions are ab initio
--gff3                              Output in GFF3 format (default is gtf
                                    format)
--threads                           Specifies the maximum number of threads that
                                    can be used during computation. Be aware:
                                    optimize_augustus.pl will use max. 8
                                    threads; augustus will use max. nContigs in
                                    --genome=file threads.
--workingdir=/path/to/wd/           Set path to working directory. In the
                                    working directory results and temporary
                                    files are stored
--nice                              Execute all system calls within braker.pl
                                    and its submodules with bash "nice"
                                    (default nice value)
--alternatives-from-evidence=true   Output alternative transcripts based on
                                    explicit evidence from hints (default is
                                    true).
--fungus                            GeneMark-EX option: run algorithm with
                                    branch point model (most useful for fungal
                                    genomes)
--crf                               Execute CRF training for AUGUSTUS;
                                    resulting parameters are only kept for
                                    final predictions if they show higher
                                    accuracy than HMM parameters.
--keepCrf                           keep CRF parameters even if they are not
                                    better than HMM parameters
--makehub                           Create track data hub with make_hub.py
                                    for visualizing BRAKER results with the
                                    UCSC GenomeBrowser
--busco_lineage=lineage             If you provide a BUSCO lineage, BRAKER will
                                    run compleasm on genome level to generate hints
                                    from BUSCO to enhance BUSCO discovery in the
                                    protein set. Also, if you provide a BUSCO
                                    lineage, BRAKER will run compleasm to assess
                                    the protein sets that go into TSEBRA combination,
                                    and will determine the TSEBRA mode to maximize
                                    BUSCO. Do not provide a busco_lineage if you
                                    want to determina natural BUSCO sensivity of
                                    BRAKER!
--email                             E-mail address for creating track data hub
--version                           Print version number of braker.pl
--help                              Print this help message

CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)

--AUGUSTUS_CONFIG_PATH=/path/       Set path to config directory of AUGUSTUS
                                    (if not specified as environment
                                    variable). BRAKER1 will assume that the
                                    directories ../bin and ../scripts of
                                    AUGUSTUS are located relative to the
                                    AUGUSTUS_CONFIG_PATH. If this is not the
                                    case, please specify AUGUSTUS_BIN_PATH
                                    (and AUGUSTUS_SCRIPTS_PATH if required).
                                    The braker.pl commandline argument
                                    --AUGUSTUS_CONFIG_PATH has higher priority
                                    than the environment variable with the
                                    same name.
--AUGUSTUS_BIN_PATH=/path/          Set path to the AUGUSTUS directory that
                                    contains binaries, i.e. augustus and
                                    etraining. This variable must only be set
                                    if AUGUSTUS_CONFIG_PATH does not have
                                    ../bin and ../scripts of AUGUSTUS relative
                                     to its location i.e. for global AUGUSTUS
                                    installations. BRAKER1 will assume that
                                    the directory ../scripts of AUGUSTUS is
                                    located relative to the AUGUSTUS_BIN_PATH.
                                    If this is not the case, please specify
                                    --AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/      Set path to AUGUSTUS directory that
                                    contains scripts, i.e. splitMfasta.pl.
                                    This variable must only be set if
                                    AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH
                                    do not contains the ../scripts directory
                                    of AUGUSTUS relative to their location,
                                    i.e. for special cases of a global
                                    AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/           Set path to bamtools (if not specified as
                                    environment BAMTOOLS_PATH variable). Has
                                    higher priority than the environment
                                    variable.
--GENEMARK_PATH=/path/to/           Set path to GeneMark-ET (if not specified
                                    as environment GENEMARK_PATH variable).
                                    Has higher priority than environment
                                    variable.
--SAMTOOLS_PATH=/path/to/           Optionally set path to samtools (if not
                                    specified as environment SAMTOOLS_PATH
                                    variable) to fix BAM files automatically,
                                    if necessary. Has higher priority than
                                    environment variable.
--PROTHINT_PATH=/path/to/           Set path to the directory with prothint.py.
                                    (if not specified as PROTHINT_PATH
                                    environment variable). Has higher priority
                                    than environment variable.
--DIAMOND_PATH=/path/to/diamond     Set path to diamond, this is an alternative
                                    to NCIB blast; you only need to specify one
                                    out of DIAMOND_PATH or BLAST_PATH, not both.
                                    DIAMOND is a lot faster that BLAST and yields
                                    highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall      Set path to NCBI blastall and formatdb
                                    executables if not specified as
                                    environment variable. Has higher priority
                                    than environment variable.
--COMPLEASM_PATH=/path/to/compleasm Set path to compleasm (if not specified as
                                    environment variable). Has higher priority
                                    than environment variable.
--PYTHON3_PATH=/path/to             Set path to python3 executable (if not
                                    specified as envirnonment variable and if
                                    executable is not in your $PATH).
--JAVA_PATH=/path/to                Set path to java executable (if not
                                    specified as environment variable and if
                                    executable is not in your $PATH), only
                                    required with flags --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to               Set path to gushr.py exectuable (if not
                                    specified as an environment variable and if
                                    executable is not in your $PATH), only required
                                    with the flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to             Set path to make_hub.py (if option --makehub
                                    is used).
--CDBTOOLS_PATH=/path/to            cdbfasta/cdbyank are required for running
                                    fix_in_frame_stop_codon_genes.py. Usage of
                                    that script can be skipped with option
                                    '--skip_fixing_broken_genes'.


EXPERT OPTIONS

--augustus_args="--some_arg=bla"    One or several command line arguments to
                                    be passed to AUGUSTUS, if several
                                    arguments are given, separate them by
                                    whitespace, i.e.
                                    "--first_arg=sth --second_arg=sth".
--skipGeneMark-ES                   Skip GeneMark-ES and use provided
                                    GeneMark-ES output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ET                   Skip GeneMark-ET and use provided
                                    GeneMark-ET output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-EP                   Skip GeneMark-EP and use provided
                                    GeneMark-EP output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP                  Skip GeneMark-ETP and use provided
                                    GeneMark-ETP output (e.g. provided with
                                    --gmetp_results_dir=GeneMark-ETP/)
--geneMarkGtf=file.gtf              If skipGeneMark-ET is used, braker will by
                                    default look in the working directory in
                                    folder GeneMarkET for an already existing
                                    gtf file. Instead, you may provide such a
                                    file from another location. If geneMarkGtf
                                    option is set, skipGeneMark-ES/ET/EP/ETP is
                                    automatically also set. Note that gene and
                                    transcript ids in the final output may not
                                    match the ids in the input genemark.gtf
                                    because BRAKER internally re-assigns these
                                    ids.
                                    In ETP mode, this option hast to be used together
                                    with --traingenes and --hints to provide BRAKER
                                    with results of a previous GeneMark-ETP run.
--gmetp_results_dir                 Location of results from a previous
                                    GeneMark-ETP run, which will be used to
                                    skip the GeneMark-ETP step. This option
                                    can be used instead of --geneMarkGtf,
                                    --traingenes, and --hints to skip GeneMark.
--rounds                            The number of optimization rounds used in
                                    optimize_augustus.pl (default 5)
--skipAllTraining                   Skip GeneMark-EX (training and
                                    prediction), skip AUGUSTUS training, only
                                    runs AUGUSTUS with pre-trained and already
                                    existing parameters (not recommended).
                                    Hints from input are still generated.
                                    This option automatically sets
                                    --useexisting to true.
--useexisting                       Use the present config and parameter files
                                    if they exist for 'species'; will overwrite
                                    original parameters if BRAKER performs
                                    an AUGUSTUS training.
--filterOutShort                    It may happen that a "good" training gene,
                                    i.e. one that has intron support from
                                    RNA-Seq in all introns predicted by
                                    GeneMark-EX, is in fact too short. This flag
                                    will discard such genes that have
                                    supported introns and a neighboring
                                    RNA-Seq supported intron upstream of the
                                    start codon within the range of the
                                    maximum CDS size of that gene and with a
                                    multiplicity that is at least as high as
                                    20% of the average intron multiplicity of
                                    that gene.
--skipOptimize                      Skip optimize parameter step (not
                                    recommended).
--skipIterativePrediction           Skip iterative prediction in --epmode (does
                                    not affect other modes, saves a bit of runtime)
--skipGetAnnoFromFasta              Skip calling the python3 script
                                    getAnnoFastaFromJoingenes.py from the
                                    AUGUSTUS tool suite. This script requires
                                    python3, biopython and re (regular
                                    expressions) to be installed. It produces
                                    coding sequence and protein FASTA files
                                    from AUGUSTUS gene predictions and provides
                                    information about genes with in-frame stop
                                    codons. If you enable this flag, these files
                                    will not be produced and python3 and
                                    the required modules will not be necessary
                                    for running brkaker.pl.
--skip_fixing_broken_genes          If you do not have python3, you can choose
                                    to skip the fixing of stop codon including
                                    genes (not recommended).
--eval=reference.gtf                Reference set to evaluate predictions
                                    against (using evaluation scripts from GaTech)
--eval_pseudo=pseudo.gff3           File with pseudogenes that will be excluded
                                    from accuracy evaluation (may be empty file)
--AUGUSTUS_hints_preds=s            File with AUGUSTUS hints predictions; will
                                    use this file as basis for UTR training;
                                    only UTR training and prediction is
                                    performed if this option is given.
--flanking_DNA=n                    Size of flanking region, must only be
                                    specified if --AUGUSTUS_hints_preds is given
                                    (for UTR training in a separate braker.pl
                                    run that builds on top of an existing run)
--verbosity=n                       0 -> run braker.pl quiet (no log)
                                    1 -> only log warnings
                                    2 -> also log configuration
                                    3 -> log all major steps
                                    4 -> very verbose, log also small steps
--downsampling_lambda=d             The distribution of introns in training
                                    gene structures generated by GeneMark-EX
                                    has a huge weight on single-exon and
                                    few-exon genes. Specifying the lambda
                                    parameter of a poisson distribution will
                                    make braker call a script for downsampling
                                    of training gene structures according to
                                    their number of introns distribution, i.e.
                                    genes with none or few exons will be
                                    downsampled, genes with many exons will be
                                    kept. Default value is 2.
                                    If you want to avoid downsampling, you have
                                    to specify 0.
--checkSoftware                     Only check whether all required software
                                    is installed, no execution of BRAKER
--nocleanup                         Skip deletion of all files that are typically not
                                    used in an annotation project after
                                    running braker.pl. (For tracking any
                                    problems with a braker.pl run, you
                                    might want to keep these files, therefore
                                    nocleanup can be activated.)


DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)

--splice_sites=patterns             list of splice site patterns for UTR
                                    prediction; default: GTAG, extend like this:
                                    --splice_sites=GTAG,ATAC,...
                                    this option only affects UTR training
                                    example generation, not gene prediction
                                    by AUGUSTUS
--overwrite                         Overwrite existing files (except for
                                    species parameter files) Beware, currently
                                    not implemented properly!
--extrinsicCfgFiles=file1,file2,... Depending on the mode in which braker.pl
                                    is executed, it may require one ore several
                                    extrinsicCfgFiles. Don't use this option
                                    unless you know what you are doing!
--stranded=+,-,+,-,...              If UTRs are trained, i.e.~strand-specific
                                    bam-files are supplied and coverage
                                    information is extracted for gene prediction,
                                    create stranded ep hints. The order of
                                    strand specifications must correspond to the
                                    order of bam files. Possible values are
                                    +, -, .
                                    If stranded data is provided, ONLY coverage
                                    data from the stranded data is used to
                                    generate UTR examples! Coverage data from
                                    unstranded data is used in the prediction
                                    step, only.
                                    The stranded label is applied to coverage
                                    data, only. Intron hints are generated
                                    from all libraries treated as "unstranded"
                                    (because splice site filtering eliminates
                                    intron hints from the wrong strand, anyway).
--optCfgFile=ppx.cfg                Optional custom config file for AUGUSTUS
                                    for running PPX (currently not
                                    implemented)
--grass                             Switch this flag on if you are using braker.pl
                                    for predicting genes in grasses with
                                    GeneMark-EX. The flag will enable
                                    GeneMark-EX to handle GC-heterogenicity
                                    within genes more properly.
                                    NOTHING IMPLEMENTED FOR GRASS YET!
--transmasked_fasta=file.fa         Transmasked genome FASTA file for GeneMark-EX
                                    (to be used instead of the regular genome
                                    FASTA file).
--min_contig=INT                    Minimal contig length for GeneMark-EX, could
                                    for example be set to 10000 if transmasked_fasta
                                    option is used because transmasking might
                                    introduce many very short contigs.
--translation_table=INT             Change translation table from non-standard
                                    to something else.
                                    DOES NOT WORK YET BECAUSE BRAKER DOESNT
                                    SWITCH TRANSLATION TABLE FOR GENEMARK-EX, YET!
--gc_probability=DECIMAL            Probablity for donor splice site pattern GC
                                    for gene prediction with GeneMark-EX,
                                    default value is 0.001
--gm_max_intergenic=INT             Adjust maximum allowed size of intergenic
                                    regions in GeneMark-EX. If not used, the value
                                    is automatically determined by GeneMark-EX.
--traingenes=file.gtf               Training genes that are used instead of training
                                    genes generated with GeneMark.
                                    In ETP mode, this option can be used together
                                    with --geneMarkGtf and --hints to provide BRAKER
                                    with results of a previous GeneMark-ETP run, so
                                    that the GeneMark-ETP step can be skipped.
                                    In this case, use training.gtf from that run as
                                    argument.
--UTR=on                            create UTR training examples from RNA-Seq
                                    coverage data; requires options
                                    --bam=rnaseq.bam.
                                    Alternatively, if UTR parameters already
                                    exist, training step will be skipped and
                                    those pre-existing parameters are used.
                                    DO NOT USE IN CONTAINER!
                                    TRY NOT TO USE AT ALL!
--addUTR=on                         Adds UTRs from RNA-Seq coverage data to
                                    augustus.hints.gtf file. Does not perform
                                    training of AUGUSTUS or gene prediction with
                                    AUGUSTUS and UTR parameters.
                                    DO NOT USE IN CONTAINER!
                                    TRY NOT TO USE AT ALL!


EXAMPLE

To run with RNA-Seq

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=rnaseq.gff

To run with protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=prothint_augustus.gff

To run with RNA-Seq and protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa --rnaseq_sets_ids=id_rnaseq1,id_rnaseq2 \
    --rnaseq_sets_dir=/path/to/local/rnaseq/files
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa --bam=id_rnaseq1.bam,id_rnaseq2.bam

Tip

You can find the outputs of the test braker3 runs in /local/cqls/software/test/braker3. To replicate a test run, copy the justfile from that directory, request a CPU with salloc, load the braker3 module, and then run just cp followed by just run.

Accessing your folders and files in Singularity¶

Singularity does not expose the host's entire directory structure to a container; only certain directories are bound. In general, we recommend setting your $SINGULARITY_BIND variable this way:

bash SINGULARITY_BIND

export SINGULARITY_BIND="${PWD},/scratch,/scratch:/tmp"

tcsh SINGULARITY_BIND

setenv SINGULARITY_BIND ${PWD},/scratch,/scratch:/tmp

This setting binds $PWD (the present working directory, .) and the local /scratch directory, and it also maps /scratch to /tmp inside the container. Most programs respect the $TMPDIR setting, but some do not, and the /tmp volumes on our compute nodes are intentionally small, so Singularity jobs that write to /tmp can fill it quickly. Mapping /scratch to /tmp sends those writes to /scratch instead.
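If you want to see how much room a directory such as /tmp has before starting a job, a small check like the one below can help. This is only a sketch: free_kib is a helper name we made up, and it relies on the POSIX output format of df -P.

```shell
# Print the free space, in KiB, of the filesystem that holds a directory
# (hypothetical helper; reads the "Available" column of `df -Pk`).
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "/tmp has $(free_kib /tmp) KiB free"
```

On a compute node you can run the same function against /scratch to compare the two.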

If you have other directories that need to be mounted, e.g. raw data directories that are symlinks, or sequence databases in /local/cqls/db, those would also need to be provided as comma-delimited values.
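One way to assemble such a comma-delimited value is sketched below. The function name build_bind is hypothetical, and /local/cqls/db is simply the database directory mentioned above; pass whichever directories your job needs. Symlinks are resolved with readlink -f so that the real target directory is what gets bound.

```shell
# Build a comma-delimited SINGULARITY_BIND value: the recommended defaults
# plus any extra directories given as arguments (hypothetical helper).
build_bind() {
    bind="${PWD},/scratch,/scratch:/tmp"
    for dir in "$@"; do
        # Resolve symlinks so the real target is bound; if the path cannot
        # be resolved, fall back to the path exactly as given.
        real=$(readlink -f "$dir" 2>/dev/null) || real=$dir
        [ -n "$real" ] || real=$dir
        bind="${bind},${real}"
    done
    printf '%s\n' "$bind"
}

export SINGULARITY_BIND="$(build_bind /local/cqls/db)"
echo "$SINGULARITY_BIND"
```

Called without arguments, build_bind returns just the recommended default value.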

Running loaded Singularity images¶

Once the modules are loaded, you can use the commands as you would any other command, and submit them through the queuing system with hqsub or sbatch. Please let us know if you run into issues using the Singularity images.
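As an illustration, the sketch below writes a small batch script and submits it with sbatch when that command is available. Everything specific here is an assumption for the example: the job name, the CPU count, the input file names (borrowed from the braker help text), the species name, and that the module command is available inside batch shells. hqsub options are not shown because they are specific to our site tooling.

```shell
# Write a batch script; the quoted 'EOF' keeps variables unexpanded until the job runs.
cat > braker3_job.sh <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=braker3
#SBATCH --cpus-per-task=8
#SBATCH --output=braker3_%j.log

# Load the container module and bind the directories the job needs.
module load teambraker/braker3
export SINGULARITY_BIND="${PWD},/scratch,/scratch:/tmp"

# Placeholder inputs; replace with your own genome and protein files.
braker --genome=genome.fa --prot_seq=proteins.fa \
    --species=my_species --threads="${SLURM_CPUS_PER_TASK}"
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch braker3_job.sh
else
    echo "sbatch not found; wrote braker3_job.sh without submitting"
fi
```

Setting SINGULARITY_BIND inside the batch script keeps the job self-contained, so it does not depend on what happens to be exported in your login shell.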

Unloading modules¶

If you need to unload modules, you can either run ml unload MODULE_NAME to unload a single module, or run ml purge to unload all modules. In general, you don't need to unload modules after using them, but you may want to unload one if you are testing a Singularity image whose programs are also installed on the system paths.
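When a program exists both on the system paths and in a container module, it helps to check which one your shell will actually run. The sketch below assumes that module wrappers live under /local/cqls/singularity/shpc/wrappers, as in the which braker output earlier on this page; is_shpc_wrapper is a made-up helper name.

```shell
# Succeed only if a command resolves to a container wrapper script
# under the shpc wrappers directory (hypothetical helper).
is_shpc_wrapper() {
    path=$(command -v "$1" 2>/dev/null) || return 1
    case "$path" in
        /local/cqls/singularity/shpc/wrappers/*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_shpc_wrapper braker; then
    echo "braker runs from a loaded container module"
else
    echo "braker is not provided by a loaded container module"
fi
```

Running this before and after ml unload or ml purge shows whether the command now falls back to a system install or is no longer found.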