Using singularity images
The CQLS has installed singularity on compute nodes to facilitate containerized computing. Please let us know whether the current singularity deployment is working for you so we can better serve your needs as researchers.
Tip
We recommend using nextflow -profile singularity when an nf-core pipeline is available. Using the singularity profile significantly simplifies the use and maintenance of nextflow pipelines.
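As a rough sketch of what such a call can look like: the pipeline name (nf-core/rnaseq), the extra test profile, and the output directory below are illustrative assumptions, not CQLS recommendations. The snippet only prints the command unless DRY_RUN=0 is set.

```shell
# Illustrative values only: swap in the nf-core pipeline you actually need.
PIPELINE="nf-core/rnaseq"
PROFILES="test,singularity"   # "test" is the small built-in nf-core test profile
OUTDIR="results"              # hypothetical output directory

# Dry run by default: print the command. Set DRY_RUN=0 to actually launch it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "nextflow run $PIPELINE -profile $PROFILES --outdir $OUTDIR"
else
    nextflow run "$PIPELINE" -profile "$PROFILES" --outdir "$OUTDIR"
fi
```

With the singularity profile, nextflow pulls and runs the container for each pipeline step itself, so no module needs to be loaded for the pipeline's tools.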
Finding pre-downloaded singularity images
We have implemented singularity image management through the Lmod module system. This lets users discover the available singularity images without having to know where the images are stored on disk.
➜ ml avail
-------------------------------------- /local/cqls/singularity/shpc/modules --------------------------------------
gitlab-registry.in2p3.fr/phoogle/pelican/1.0.8/module teambraker/braker3/latest/module
python/3.13-rc/module
------------ /fs1/local/cqls/opt/bootstrap/lmod/x86_64/.pixi/envs/default/lmod/lmod/modulefiles/Core -------------
lmod settarg
Module defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.
See https://lmod.readthedocs.io/en/latest/060_locating.html for details.
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
Using the ml avail command, one can see which singularity images have been downloaded. To get more information about the images, one can run the ml spider command, optionally providing a module (image) name. As an example, to get more information about the braker3 install:
➜ ml spider braker3
--------------------------------------------------------------------------------------------------------------
teambraker/braker3: teambraker/braker3/latest/module
--------------------------------------------------------------------------------------------------------------
This module can be loaded directly: module load teambraker/braker3/latest/module
Help:
This module is a singularity container wrapper for teambraker/braker3:latest vlatest
Container (available through variable SINGULARITY_CONTAINER):
- /local/cqls/singularity/shpc/images/teambraker/braker3/latest/teambraker-braker3-latest-sha256:8e8cc01384
971a9cf04a4dc519faf3df947ed62b2bae2ce8be3075cb5b5e1e1e.sif
Commands include:
- braker3-run:
singularity run -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container> "$@"
- braker3-shell:
singularity shell -s /bin/sh -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container>
- braker3-exec:
singularity exec -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container> "$@"
- braker3-inspect-runscript:
singularity inspect -r <container>
- braker3-inspect-deffile:
singularity inspect -d <container>
- braker3-container:
echo "$SINGULARITY_CONTAINER"
- braker:
singularity exec -B <wrapperDir>/99-shpc.sh:/.singularity.d/env/99-shpc.sh <container> braker.pl "$@"
For each of the above, you can export:
- SINGULARITY_OPTS: to define custom options for singularity (e.g., --debug)
- SINGULARITY_COMMAND_OPTS: to define custom options for the command (e.g., -b)
- SINGULARITY_CONTAINER: full path to the Singularity Container
Make note of the line that begins "This module can be loaded directly": it gives the exact command for loading the module.
Loading modules
Let's load a module:
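Going by the ml spider output above, the load command for the braker3 image would look like the following sketch. The check for the module command is an addition for illustration, so that the snippet also runs on machines without Lmod.

```shell
# Module name copied from the "ml spider braker3" output above.
MODULE_NAME="teambraker/braker3/latest/module"

# Guard added for illustration: only attempt the load where Lmod is available.
if command -v module >/dev/null 2>&1; then
    module load "$MODULE_NAME"
else
    echo "Lmod not available here; would run: module load $MODULE_NAME"
fi
```

On the compute nodes, ml load followed by the same module name should be equivalent, since ml is Lmod's shorthand for the module command.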
And let's make sure it loaded properly:
➜ which braker
/local/cqls/singularity/shpc/wrappers/teambraker/braker3/latest/bin/braker
➜ braker -help
DESCRIPTION
braker.pl Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
RNA-Seq and/or proteins
SYNOPSIS
braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}
INPUT FILE OPTIONS
--genome=genome.fa fasta file with DNA sequences
--bam=rnaseq.bam bam file with spliced alignments from
RNA-Seq
--prot_seq=prot.fa A protein sequence file in multi-fasta
format used to generate protein hints.
Unless otherwise specified, braker.pl will
run in "EP mode" or "ETP mode which uses
ProtHint to generate protein hints and
GeneMark-EP+ or GeneMark-ETP to
train AUGUSTUS.
--hints=hints.gff Alternatively to calling braker.pl with a
bam or protein fasta file, it is possible to
call it with a .gff file that contains
introns extracted from RNA-Seq and/or
protein hints (most frequently coming
from ProtHint). If you wish to use the
ProtHint hints, use its
"prothint_augustus.gff" output file.
This flag also allows the usage of hints
from additional extrinsic sources for gene
prediction with AUGUSTUS. To consider such
additional extrinsic information, you need
to use the flag --extrinsicCfgFiles to
specify parameters for all sources in the
hints file (including the source "E" for
intron hints from RNA-Seq).
In ETP mode, this option can be used together
with --geneMarkGtf and --traingenes to provide
BRAKER with results of a previous GeneMark-ETP
run, so that the GeneMark-ETP step can be
skipped. In this case, specify the hintsfile of
a previous BRAKER run here, or generate a
hintsfile from the GeneMark-ETP working
directory with the script get_etp_hints.py.
--rnaseq_sets_ids=SRR1111,SRR1115 IDs of RNA-Seq sets that are either in
one of the directories specified with
--rnaseq_sets_dir, or that can be downloaded
from SRA. If you want to use local files, you
can use unaligned reads in FASTQ format
(they have to be named ID.fastq if unpaired or
ID_1.fastq, ID_2.fastq if paired), or aligned reads
as a BAM file (named ID.bam).
--rnaseq_sets_dir=/path/to/rna_dir1 Locations where the local files of RNA-Seq data
reside that were specified with --rnaseq_sets_ids.
FREQUENTLY USED OPTIONS
--species=sname Species name. Existing species will not be
overwritten. Uses Sp_1 etc., if no species
is assigned
--AUGUSTUS_ab_initio output ab initio predictions by AUGUSTUS
in addition to predictions with hints by
AUGUSTUS
--softmasking_off Turn off softmasking option (enables by
default, discouraged to disable!)
--esmode Run GeneMark-ES (genome sequence only) and
train AUGUSTUS on long genes predicted by
GeneMark-ES. Final predictions are ab initio
--gff3 Output in GFF3 format (default is gtf
format)
--threads Specifies the maximum number of threads that
can be used during computation. Be aware:
optimize_augustus.pl will use max. 8
threads; augustus will use max. nContigs in
--genome=file threads.
--workingdir=/path/to/wd/ Set path to working directory. In the
working directory results and temporary
files are stored
--nice Execute all system calls within braker.pl
and its submodules with bash "nice"
(default nice value)
--alternatives-from-evidence=true Output alternative transcripts based on
explicit evidence from hints (default is
true).
--fungus GeneMark-EX option: run algorithm with
branch point model (most useful for fungal
genomes)
--crf Execute CRF training for AUGUSTUS;
resulting parameters are only kept for
final predictions if they show higher
accuracy than HMM parameters.
--keepCrf keep CRF parameters even if they are not
better than HMM parameters
--makehub Create track data hub with make_hub.py
for visualizing BRAKER results with the
UCSC GenomeBrowser
--busco_lineage=lineage If you provide a BUSCO lineage, BRAKER will
run compleasm on genome level to generate hints
from BUSCO to enhance BUSCO discovery in the
protein set. Also, if you provide a BUSCO
lineage, BRAKER will run compleasm to assess
the protein sets that go into TSEBRA combination,
and will determine the TSEBRA mode to maximize
BUSCO. Do not provide a busco_lineage if you
want to determina natural BUSCO sensivity of
BRAKER!
--email E-mail address for creating track data hub
--version Print version number of braker.pl
--help Print this help message
CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)
--AUGUSTUS_CONFIG_PATH=/path/ Set path to config directory of AUGUSTUS
(if not specified as environment
variable). BRAKER1 will assume that the
directories ../bin and ../scripts of
AUGUSTUS are located relative to the
AUGUSTUS_CONFIG_PATH. If this is not the
case, please specify AUGUSTUS_BIN_PATH
(and AUGUSTUS_SCRIPTS_PATH if required).
The braker.pl commandline argument
--AUGUSTUS_CONFIG_PATH has higher priority
than the environment variable with the
same name.
--AUGUSTUS_BIN_PATH=/path/ Set path to the AUGUSTUS directory that
contains binaries, i.e. augustus and
etraining. This variable must only be set
if AUGUSTUS_CONFIG_PATH does not have
../bin and ../scripts of AUGUSTUS relative
to its location i.e. for global AUGUSTUS
installations. BRAKER1 will assume that
the directory ../scripts of AUGUSTUS is
located relative to the AUGUSTUS_BIN_PATH.
If this is not the case, please specify
--AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/ Set path to AUGUSTUS directory that
contains scripts, i.e. splitMfasta.pl.
This variable must only be set if
AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH
do not contains the ../scripts directory
of AUGUSTUS relative to their location,
i.e. for special cases of a global
AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/ Set path to bamtools (if not specified as
environment BAMTOOLS_PATH variable). Has
higher priority than the environment
variable.
--GENEMARK_PATH=/path/to/ Set path to GeneMark-ET (if not specified
as environment GENEMARK_PATH variable).
Has higher priority than environment
variable.
--SAMTOOLS_PATH=/path/to/ Optionally set path to samtools (if not
specified as environment SAMTOOLS_PATH
variable) to fix BAM files automatically,
if necessary. Has higher priority than
environment variable.
--PROTHINT_PATH=/path/to/ Set path to the directory with prothint.py.
(if not specified as PROTHINT_PATH
environment variable). Has higher priority
than environment variable.
--DIAMOND_PATH=/path/to/diamond Set path to diamond, this is an alternative
to NCIB blast; you only need to specify one
out of DIAMOND_PATH or BLAST_PATH, not both.
DIAMOND is a lot faster that BLAST and yields
highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall Set path to NCBI blastall and formatdb
executables if not specified as
environment variable. Has higher priority
than environment variable.
--COMPLEASM_PATH=/path/to/compleasm Set path to compleasm (if not specified as
environment variable). Has higher priority
than environment variable.
--PYTHON3_PATH=/path/to Set path to python3 executable (if not
specified as envirnonment variable and if
executable is not in your $PATH).
--JAVA_PATH=/path/to Set path to java executable (if not
specified as environment variable and if
executable is not in your $PATH), only
required with flags --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to Set path to gushr.py exectuable (if not
specified as an environment variable and if
executable is not in your $PATH), only required
with the flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to Set path to make_hub.py (if option --makehub
is used).
--CDBTOOLS_PATH=/path/to cdbfasta/cdbyank are required for running
fix_in_frame_stop_codon_genes.py. Usage of
that script can be skipped with option
'--skip_fixing_broken_genes'.
EXPERT OPTIONS
--augustus_args="--some_arg=bla" One or several command line arguments to
be passed to AUGUSTUS, if several
arguments are given, separate them by
whitespace, i.e.
"--first_arg=sth --second_arg=sth".
--skipGeneMark-ES Skip GeneMark-ES and use provided
GeneMark-ES output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--skipGeneMark-ET Skip GeneMark-ET and use provided
GeneMark-ET output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--skipGeneMark-EP Skip GeneMark-EP and use provided
GeneMark-EP output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP Skip GeneMark-ETP and use provided
GeneMark-ETP output (e.g. provided with
--gmetp_results_dir=GeneMark-ETP/)
--geneMarkGtf=file.gtf If skipGeneMark-ET is used, braker will by
default look in the working directory in
folder GeneMarkET for an already existing
gtf file. Instead, you may provide such a
file from another location. If geneMarkGtf
option is set, skipGeneMark-ES/ET/EP/ETP is
automatically also set. Note that gene and
transcript ids in the final output may not
match the ids in the input genemark.gtf
because BRAKER internally re-assigns these
ids.
In ETP mode, this option hast to be used together
with --traingenes and --hints to provide BRAKER
with results of a previous GeneMark-ETP run.
--gmetp_results_dir Location of results from a previous
GeneMark-ETP run, which will be used to
skip the GeneMark-ETP step. This option
can be used instead of --geneMarkGtf,
--traingenes, and --hints to skip GeneMark.
--rounds The number of optimization rounds used in
optimize_augustus.pl (default 5)
--skipAllTraining Skip GeneMark-EX (training and
prediction), skip AUGUSTUS training, only
runs AUGUSTUS with pre-trained and already
existing parameters (not recommended).
Hints from input are still generated.
This option automatically sets
--useexisting to true.
--useexisting Use the present config and parameter files
if they exist for 'species'; will overwrite
original parameters if BRAKER performs
an AUGUSTUS training.
--filterOutShort It may happen that a "good" training gene,
i.e. one that has intron support from
RNA-Seq in all introns predicted by
GeneMark-EX, is in fact too short. This flag
will discard such genes that have
supported introns and a neighboring
RNA-Seq supported intron upstream of the
start codon within the range of the
maximum CDS size of that gene and with a
multiplicity that is at least as high as
20% of the average intron multiplicity of
that gene.
--skipOptimize Skip optimize parameter step (not
recommended).
--skipIterativePrediction Skip iterative prediction in --epmode (does
not affect other modes, saves a bit of runtime)
--skipGetAnnoFromFasta Skip calling the python3 script
getAnnoFastaFromJoingenes.py from the
AUGUSTUS tool suite. This script requires
python3, biopython and re (regular
expressions) to be installed. It produces
coding sequence and protein FASTA files
from AUGUSTUS gene predictions and provides
information about genes with in-frame stop
codons. If you enable this flag, these files
will not be produced and python3 and
the required modules will not be necessary
for running brkaker.pl.
--skip_fixing_broken_genes If you do not have python3, you can choose
to skip the fixing of stop codon including
genes (not recommended).
--eval=reference.gtf Reference set to evaluate predictions
against (using evaluation scripts from GaTech)
--eval_pseudo=pseudo.gff3 File with pseudogenes that will be excluded
from accuracy evaluation (may be empty file)
--AUGUSTUS_hints_preds=s File with AUGUSTUS hints predictions; will
use this file as basis for UTR training;
only UTR training and prediction is
performed if this option is given.
--flanking_DNA=n Size of flanking region, must only be
specified if --AUGUSTUS_hints_preds is given
(for UTR training in a separate braker.pl
run that builds on top of an existing run)
--verbosity=n 0 -> run braker.pl quiet (no log)
1 -> only log warnings
2 -> also log configuration
3 -> log all major steps
4 -> very verbose, log also small steps
--downsampling_lambda=d The distribution of introns in training
gene structures generated by GeneMark-EX
has a huge weight on single-exon and
few-exon genes. Specifying the lambda
parameter of a poisson distribution will
make braker call a script for downsampling
of training gene structures according to
their number of introns distribution, i.e.
genes with none or few exons will be
downsampled, genes with many exons will be
kept. Default value is 2.
If you want to avoid downsampling, you have
to specify 0.
--checkSoftware Only check whether all required software
is installed, no execution of BRAKER
--nocleanup Skip deletion of all files that are typically not
used in an annotation project after
running braker.pl. (For tracking any
problems with a braker.pl run, you
might want to keep these files, therefore
nocleanup can be activated.)
DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)
--splice_sites=patterns list of splice site patterns for UTR
prediction; default: GTAG, extend like this:
--splice_sites=GTAG,ATAC,...
this option only affects UTR training
example generation, not gene prediction
by AUGUSTUS
--overwrite Overwrite existing files (except for
species parameter files) Beware, currently
not implemented properly!
--extrinsicCfgFiles=file1,file2,... Depending on the mode in which braker.pl
is executed, it may require one ore several
extrinsicCfgFiles. Don't use this option
unless you know what you are doing!
--stranded=+,-,+,-,... If UTRs are trained, i.e.~strand-specific
bam-files are supplied and coverage
information is extracted for gene prediction,
create stranded ep hints. The order of
strand specifications must correspond to the
order of bam files. Possible values are
+, -, .
If stranded data is provided, ONLY coverage
data from the stranded data is used to
generate UTR examples! Coverage data from
unstranded data is used in the prediction
step, only.
The stranded label is applied to coverage
data, only. Intron hints are generated
from all libraries treated as "unstranded"
(because splice site filtering eliminates
intron hints from the wrong strand, anyway).
--optCfgFile=ppx.cfg Optional custom config file for AUGUSTUS
for running PPX (currently not
implemented)
--grass Switch this flag on if you are using braker.pl
for predicting genes in grasses with
GeneMark-EX. The flag will enable
GeneMark-EX to handle GC-heterogenicity
within genes more properly.
NOTHING IMPLEMENTED FOR GRASS YET!
--transmasked_fasta=file.fa Transmasked genome FASTA file for GeneMark-EX
(to be used instead of the regular genome
FASTA file).
--min_contig=INT Minimal contig length for GeneMark-EX, could
for example be set to 10000 if transmasked_fasta
option is used because transmasking might
introduce many very short contigs.
--translation_table=INT Change translation table from non-standard
to something else.
DOES NOT WORK YET BECAUSE BRAKER DOESNT
SWITCH TRANSLATION TABLE FOR GENEMARK-EX, YET!
--gc_probability=DECIMAL Probablity for donor splice site pattern GC
for gene prediction with GeneMark-EX,
default value is 0.001
--gm_max_intergenic=INT Adjust maximum allowed size of intergenic
regions in GeneMark-EX. If not used, the value
is automatically determined by GeneMark-EX.
--traingenes=file.gtf Training genes that are used instead of training
genes generated with GeneMark.
In ETP mode, this option can be used together
with --geneMarkGtf and --hints to provide BRAKER
with results of a previous GeneMark-ETP run, so
that the GeneMark-ETP step can be skipped.
In this case, use training.gtf from that run as
argument.
--UTR=on create UTR training examples from RNA-Seq
coverage data; requires options
--bam=rnaseq.bam.
Alternatively, if UTR parameters already
exist, training step will be skipped and
those pre-existing parameters are used.
DO NOT USE IN CONTAINER!
TRY NOT TO USE AT ALL!
--addUTR=on Adds UTRs from RNA-Seq coverage data to
augustus.hints.gtf file. Does not perform
training of AUGUSTUS or gene prediction with
AUGUSTUS and UTR parameters.
DO NOT USE IN CONTAINER!
TRY NOT TO USE AT ALL!
EXAMPLE
To run with RNA-Seq
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--hints=rnaseq.gff
To run with protein sequences
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--hints=prothint_augustus.gff
To run with RNA-Seq and protein sequences
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--prot_seq=proteins.fa --rnaseq_sets_ids=id_rnaseq1,id_rnaseq2 \
--rnaseq_sets_dir=/path/to/local/rnaseq/files
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--prot_seq=proteins.fa --bam=id_rnaseq1.bam,id_rnaseq2.bam
Tip
You can find the outputs of the test braker3 runs here: /local/cqls/software/test/braker3. If you'd like to replicate a test run, you can copy the justfile that is in that directory, check out a cpu with salloc, load the braker3 module, and then run just cp and just run.
Accessing your folders and files in Singularity
Singularity has protections in place such that a container cannot access the entire directory structure; only certain directories are bound. In general, we recommend you set your $SINGULARITY_BIND variable in this way:
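A minimal sketch of such a setting, assuming the three bindings described in the next paragraph (the exact value recommended by the CQLS may differ):

```shell
# SINGULARITY_BIND takes comma-separated src[:dest] entries.
# Assumed entries: the current directory, the node-local /scratch,
# and /scratch mapped onto /tmp inside the container.
export SINGULARITY_BIND="$PWD,/scratch,/scratch:/tmp"

# Show the value that singularity will pick up.
echo "$SINGULARITY_BIND"
```

Because $PWD is expanded when the variable is set, re-export it after changing directories if the new directory also needs to be visible inside the container.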
This setting binds $PWD (the present working directory, .) and the local /scratch directory, and it also maps /scratch to /tmp inside the container. Most programs respect the $TMPDIR setting, but some do not, and the /tmp volumes on our compute nodes are intentionally relatively small, so singularity processing can easily fill them up. Mapping /scratch to /tmp avoids this.
If you have other directories that need to be mounted, e.g. raw data directories that are symlinks, or sequence databases in /local/cqls/db, add them to $SINGULARITY_BIND as additional comma-delimited values.
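One way to assemble the longer list is sketched below. The helper build_bind_list and the $HOME/raw_data directory are hypothetical names used for illustration; the base value assumes the three bindings described above. Symlinks are resolved with readlink -f so that the real target directory is what gets bound.

```shell
# build_bind_list BASE DIR...
# Appends each existing DIR (with symlinks resolved) to BASE, comma-delimited.
build_bind_list() {
    bind="$1"
    shift
    for dir in "$@"; do
        # Skip paths that do not exist on this node.
        [ -d "$dir" ] || continue
        # Bind the real location that a symlinked directory points to.
        real=$(readlink -f "$dir")
        bind="$bind,$real"
    done
    printf '%s\n' "$bind"
}

# /local/cqls/db comes from the text above; $HOME/raw_data is a hypothetical
# symlinked raw data directory.
SINGULARITY_BIND=$(build_bind_list "$PWD,/scratch,/scratch:/tmp" /local/cqls/db "$HOME/raw_data")
export SINGULARITY_BIND
echo "$SINGULARITY_BIND"
```

Skipping missing directories keeps the bind list valid on nodes where a given path is not mounted, since singularity would otherwise fail on a bind source that does not exist.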
Running loaded Singularity images
Once the modules are loaded, you can use the commands as you would any other command, and submit them through the queuing system with hqsub or sbatch. Please let us know if you run into issues using the singularity images.
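As a sketch of an sbatch submission, the snippet below writes a small batch script and prints the command to submit it. The job name, CPU count, time limit, log file name, and input file names are hypothetical placeholders; the braker options are taken from the help output above. Whether the module command is available inside batch shells depends on the site configuration.

```shell
# Write a batch script; the quoted 'EOF' keeps $PWD and $SLURM_* unexpanded
# until the job actually runs.
cat > braker3_job.sh <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=braker3
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00
#SBATCH --output=braker3_%j.log

# Load the container wrapper and set the bind paths inside the job.
module load teambraker/braker3/latest/module
export SINGULARITY_BIND="$PWD,/scratch,/scratch:/tmp"

# genome.fa and proteins.fa are placeholder inputs in the submit directory.
braker --genome=genome.fa --prot_seq=proteins.fa \
    --threads="$SLURM_CPUS_PER_TASK" --workingdir="$PWD/braker_out"
EOF

echo "Wrote braker3_job.sh; submit it with: sbatch braker3_job.sh"
```

Loading the module and exporting $SINGULARITY_BIND inside the script keeps the job independent of whatever happens to be set in the submitting shell.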
Unloading modules
If you need to unload modules, you can either run ml unload MODULE_NAME to unload a single module, or run ml purge to unload all modules. In general, you don't need to unload modules after using them, but you may want to unload one if you are testing a singularity image that provides programs that also exist on the system paths.