New User 'onboarding' / FAQs¶
Assumptions¶
You have experience using the command line; if not, see Training below.
Accounts¶
How do I log in? (SSH)¶
When using macOS or Linux-based operating systems, you should have access to a terminal and the ssh program that will let you connect to the infrastructure. On Windows, you will need to use the Windows Subsystem for Linux (WSL) or PuTTY.
What server do I login to?¶
You should log in to shell-hpc.cqls.oregonstate.edu using the information provided upon signing up. To eliminate the need to use Duo, you can set up SSH keys that help confirm your identity.
How do I change my password?¶
The passwd program allows you to change your password. You’ll have to enter your current (even temporary) password before entering your new password twice. If you have SSH keys set up, entering your password to log in is not required. Your password is still useful for signing in to https://gitlab.cqls.oregonstate.edu
What if I forget my password?¶
You have to submit a support ticket at https://shell.cqls.oregonstate.edu/support/
What am I allowed to do on the shell server?¶
On shell-hpc.cqls.oregonstate.edu, you can edit text files, submit jobs to our queuing system (Slurm and/or SGE), and do basic text processing. Jobs requiring lots of processors and/or memory will be killed.
What am I NOT allowed to do on the shell server? Why?¶
Most processing jobs will be killed. This is so that everyone has equal access for logging in to shell-hpc.cqls and so that processing jobs do not slow down the shell machine. If all processors on the shell-hpc.cqls machine were used, users would have a difficult time logging in, and currently logged-in users would have difficulty submitting jobs to the queuing system. If your command on shell-hpc.cqls gets killed, please submit the job using hqsub, or check out an interactive node using salloc.
What is the default shell?¶
The default shell is tcsh for legacy users. Users can request a change of default shell to bash by submitting a ticket.
New users will be set up with bash shell.
What is a shell?¶
The shell is a command-line interface between you (the user) and the computer or server. The shell interprets what you type as commands so that the computer or server understands what you want to do.
Do I have a quota for my $HOME directory?¶
Users have a 75GB quota for their home directories.
How do I check my quota?¶
Use the quota -s command to see your current usage and quota.
What do I do if I go over my quota?¶
If you exceed your quota, you will need to remove files until you are under the 75GB limit.
What should I store in my $HOME directory?¶
Minimal configuration and other files should be stored in your $HOME directory. All processing should be done on networked filesystem drives and the local /scratch drives of the processing machines.
How do I edit my $PATH variable and save it across log-ins?¶
The exact changes you need to make depend on your $SHELL (either bash, tcsh, or zsh). I suggest making the change temporarily first, and then changing your configuration file (~/.bashrc for bash, ~/.cshrc for tcsh, and ~/.zshrc for zsh).
You'll need the full path to the directory that contains the program(s) you want to add to your $PATH. To temporarily add the programs to your $PATH, run the appropriate command below (export for bash, setenv for tcsh):
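# bash/zsh (the directory path below is a placeholder; use your own)
export PATH="/full/path/to/programs:${PATH}"
# tcsh
setenv PATH /full/path/to/programs:${PATH}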
After you test out the new command, you can add those lines to your config file:
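# append (>>) so you don't overwrite the file; single quotes keep ${PATH} literal
echo 'export PATH="/full/path/to/programs:${PATH}"' >> ~/.bashrc
# tcsh equivalent
echo 'setenv PATH /full/path/to/programs:${PATH}' >> ~/.cshrc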
Keys to doing this properly are:
- Make sure to use the >> append redirect so you don't overwrite your config (> will overwrite the file).
- Make sure to use single quotes ' for your echo command, otherwise the ${PATH} variable will get expanded unnecessarily.
You can also just edit the appropriate file with a text editor (vim, emacs, nano) as you feel comfortable.
After you add the changes to your config files, the updated $PATH will get loaded on every new log-in shell.
File Transfers¶
What server should I use to transfer files via SFTP/SCP?¶
You must use files-hpc.cqls.oregonstate.edu for file transfers. File transfers are disabled on shell-hpc.cqls.oregonstate.edu.
What is SFTP?¶
SFTP stands for Secure File Transfer Protocol. SFTP allows for file transfers, both to and from the infrastructure, using the same security provided by ssh.
What is SCP?¶
SCP stands for Secure Copy Protocol. The scp program allows secure file transfers between the infrastructure and your own computer.
Can I use FTP?¶
FTP (File Transfer Protocol) is insecure and CQLS servers will not host files over FTP. However, you can use the ftp program from files-hpc.cqls.oregonstate.edu to transfer files externally, e.g. to NCBI, if necessary.
Why can’t I transfer files via SFTP to shell-hpc.cqls.oregonstate.edu?¶
The shell-hpc.cqls.oregonstate.edu machine should be used for submitting processing jobs only, not file transfers.
Can I transfer files using a Windows drive share to the infrastructure?¶
TBD
I want to publish data via the web, how can I do this?¶
TBD
Can I access my files via the web?¶
TBD
Storage¶
What is ZFS / NFS?¶
ZFS is a file system and volume management technology that scales indefinitely and emphasizes zero data loss. We use ZFS on our networked file system (NFS) drives.
What is DFS / Quobyte?¶
DFS is a distributed file system. Its use has been discontinued at the CQLS.
What is stored on NFS, what should I be using it for?¶
All data and outputs should be stored on the NFS. We recommend using the /scratch drives, which are specific to each compute node, for doing the analysis and then copying the results back onto an NFS location.
Does my lab have NFS space?¶
You may have access to NFS; you will need to ask the post-doc/professor in your lab who will know.
I accidentally deleted a file; are my files backed up?¶
The $HOME directory is backed up. Each NFS location may or may not be backed up, depending on whether your lab pays for storage backup. Contact Support if you need to start backups on your space or if you require recovery.
What is tape backup?¶
Tape backup is long-term storage backup for recovery after some disaster. All sequencing runs are copied to tape prior to deletion. Each lab is still responsible for copying and maintaining raw sequencing data; tape backup is used for emergencies only and is not guaranteed.
How do I get more space added to our NFS space?¶
Please contact Support to purchase more NFS disk space.
Batch Processing¶
What is batch processing?¶
A batch process is one that can run without human interaction. When we submit processes in a non-interactive mode on the infrastructure, we are submitting batch processes.
What is SGE?¶
SGE is Son of Grid Engine, which is a queuing system. SGE allows us to submit batch jobs to different compute nodes across the infrastructure, such that each job runs when resources permit.
What is Slurm?¶
Slurm is a newer queuing system that has (mostly) replaced SGE at the CQLS.
What is a Slurm partition (SGE queue)?¶
We have different partitions (aka queues) available on the infrastructure so that each lab may have different resources available at any given time.
What partitions are available to me?¶
Not all labs/colleges have access to the same resources. You can see which resources are available to you by running hqavail.
How do I submit jobs?¶
There are multiple ways to submit jobs. A single job can be run with the hqsub command. More information about queueing systems can be found here.
How do I check out a compute node for interactive use?¶
Use the salloc command.
What is Slurm (or SGE) Array?¶
hqsub allows easy submission of array jobs, which are most commonly used when a user has a command that they want to run on many individual inputs (10s-1000s). Instead of submitting hundreds or thousands of jobs, you can submit a single array job and control how many tasks are running at once.
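As a generic illustration with plain Slurm (not the hqsub interface; the program and file names are placeholders), an array of 100 tasks with at most 10 running at once might look like:
# %10 caps the number of simultaneously running tasks at 10
sbatch --array=1-100%10 --wrap 'my_program input_${SLURM_ARRAY_TASK_ID}.fa'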
How do I check the status of jobs?¶
The hqstat command allows you to see which jobs are running. You can look at a single job by running scontrol show job $JOBID.
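The standard Slurm commands work as well; for example (the job ID below is a placeholder):
# list your own queued/running jobs
squeue -u $USER
# detailed information for one job
scontrol show job 123456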
How do I kill jobs?¶
You kill jobs with the scancel $JOBID command.
How can I use multiple CPUs?¶
Make sure your program has the capability of using multiple CPUs; examine the help file for your program for more details. To check out multiple CPUs using hqsub, use the -p flag and provide the number of CPUs you want to check out. You can use the $NPROCS variable in your hqsub commands, e.g. blastp -num_threads $NPROCS, to make sure they are synced.
For an interactive job, you can salloc -c $NUM where $NUM is the number of CPUs to check out.
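For example, a sketch requesting 8 CPUs (the query and database names are placeholders):
# batch: -p requests 8 CPUs and hqsub sets $NPROCS to match
hqsub -p 8 'blastp -query proteins.fa -db mydb -num_threads $NPROCS -out hits.tsv'
# interactive: check out 8 CPUs on a compute node
salloc -c 8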
How do I check out GPUs?¶
See which partitions have GPUs available to you (hqavail --gpu) and submit a job to the appropriate partition. You can use hqsub --opts='--gpus=$NUM' where $NUM is the number of GPUs you want to reserve.
Note
A built-in GPU flag will be added to hqsub after some testing.
For an interactive job, you can salloc -p $GPU_QUEUE --gpus $NUM.
How do I know how long my bioinformatics job will run?¶
We suggest running a small test dataset through your pipeline(s) to determine an expected amount of processing time/resource utilization. The user is responsible for ensuring that their jobs are only using the amount of resources requested in the queuing system. Please monitor your jobs (you can salloc to the same node that your job is running on to check the health of the machine using e.g. htop) to ensure everything is going as requested.
How does my lab obtain more processing resources?¶
Most colleges have access to shared computing resources; if you are a member of a college where you think you should have access to machines and they are not available, please submit a support request. If you or your college does not have resources available, we have machines available to rent for up to 6 months. If you have needs beyond that, you can email Support to discuss other options, including current costs of purchasing machines.
Another lab has asked me to collaborate with them, but I cannot access their files or compute resources, what do I do?¶
Email Support and cc the appropriate collaborators to get access to their files.
Support Tickets¶
How do I follow up to obtain further support?¶
To request general support, use the support form. Please use the 'cgrb-support' option for general questions and the 'cgrb-software' option for software requests.
How do I accurately describe my issue?¶
Please provide information regarding:
- What machine you are having an issue with (use qstat to see the node)
- What software you are trying to run
- What your expected output is, and what the observed output is
- A copy/paste of any error messages you may have
- How to reproduce the issue
- What you may have tried to resolve the issue
- Any links to the software or software help pages that might help
If you are submitting a software install request, please provide a link to the GitHub page or other source material.
How do I check on the status of my support ticket?¶
Please follow up by emailing Support for support requests, and Ed Davis for software requests.
Training¶
How can I learn more about using the command line?¶
We offer 'Intro to Unix/Linux' and 'Command-line data analysis' courses - see the Workshops page for more information. We also offer one-on-one training for an hourly fee; please email the bioinformatics team.
Software¶
Conda¶
How do I get conda set up?¶
Information about conda can be found here.
How can I use pixi?¶
See information about configuring here and using pixi here.
Why is my pixi saying 'Set PIXI_CACHE_DIR to continue.'?¶
You probably need to set up your configuration.
Tip
Make sure you exec $SHELL to reload your shell and enable the settings.
If you see a directory name when you echo $PIXI_CACHE_DIR, pixi should be working.
If you don't want to use hpcman to automatically manage your pixi cache dir, you can manually set it to a directory you own on an NFS drive to continue, e.g. export PIXI_CACHE_DIR=/nfs/dept/lab/user/opt/pixi/cache.
We require PIXI_CACHE_DIR to be set because otherwise the default cache home directory will fill up with conda packages, which can take up considerable space.
If this has already happened, clean your cache:
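# a sketch, assuming a recent pixi release; see pixi clean --help to confirm
pixi clean cache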
How do I fix my broken login/configs?¶
The raw configuration files can be found here:
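/local/cqls/etc/inits/bashrc
/local/cqls/etc/inits/cshrc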
You can make a backup of your current file and then copy the raw configuration files into your home directory. You can also remove a ~/.tcshrc file if it's present.
mv ~/.bashrc ~/.bashrc.bak
mv ~/.cshrc ~/.cshrc.bak
rm -f ~/.tcshrc
cp /local/cqls/etc/inits/bashrc ~/.bashrc
cp /local/cqls/etc/inits/cshrc ~/.cshrc
Then log out and log back in. If your configuration seems fixed, you can add some of the modifications from your backups to the newly refreshed config files.
Where can I learn more about conda env activation?¶
See this link from the conda documentation.
For most conda environments on our infrastructure, I run the scripts in /local/cluster/conda/conda_*_setup.sh to resolve version mismatches.
R isn't working in my conda env, why?¶
You likely have $R_LIBS
or $R_LIBS_USER
set and the R is pulling libraries from your home directory or other
location that are incompatible with the R environment. You can manually unset those env vars or go to the base env of
your conda directory and run bash /local/cluster/conda/conda_R_setup.sh
to automatically set up the appropriate env
vars on conda activate
and conda deactivate
.
For a copy/paste option if the conda env is active:
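# a sketch of the manual option: clear the R library env vars for this session
unset R_LIBS R_LIBS_USER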
See above for more information
Python is not working or has version mismatches¶
You may need to unalias python (unalias python); you can add that to your ~/.bashrc or ~/.cshrc files as well. You will have to fully type out /local/cluster/bin/python2 or /local/cluster/bin/python3, or add /local/cluster/bin to your $PATH upstream of /usr/bin, e.g. export PATH=/local/cluster/bin:${PATH}, so you don't have to type it out fully.
Your python is probably pulling from your .local install. You can manually unset those env vars or go to the base env of your conda directory and run bash /local/cluster/conda/conda_python_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate.
For a copy/paste option if the conda env is active:
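# a sketch of the manual option: ignore ~/.local packages for this session
export PYTHONNOUSERSITE=1
unset PYTHONPATH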
See above for more information
Perl is not working or has version mismatches¶
Your perl is probably pulling from your PERL5LIB env var. You can manually unset those env vars or go to the base env of your conda directory and run bash /local/cluster/conda/conda_perl_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate.
For a copy/paste option if the conda env is active:
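# a sketch of the manual option: clear the Perl library path for this session
unset PERL5LIB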
See above for more information
Software is not working due to a mismatch in linked libraries (lib.so missing)¶
The compiler on our infrastructure is old (gcc 4.8.5), and does not provide the most up-to-date linked libraries. The conda LD_LIBRARY_PATH is not getting set properly. You can manually set your LD_LIBRARY_PATH to include $CONDA_PREFIX/lib, or you can go to the base env of your conda directory and run bash /local/cluster/conda/conda_perl_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate. For a copy/paste option if the conda env is active:
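# a sketch of the manual option: put the active env's libs first on the search path
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"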
See above for more information
What software do I use for…¶
Adapter trimming¶
For automated adapter trimming, we currently recommend fastp. fastp is a good option for situations where you have adapters on the 3’ end of reads due to read-through of short inserts into the sequencing adapter. For trimming of primer sequences or other custom sequences, we suggest using bbduk.sh or cutadapt.
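As a minimal sketch (file names are placeholders; see fastp --help for the full option list):
# paired-end adapter trimming; fastp auto-detects standard adapters
fastp -i R1.fastq.gz -I R2.fastq.gz -o R1.trim.fastq.gz -O R2.trim.fastq.gz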
Short read alignment¶
bwa mem
Spliced alignment¶
STAR or hisat2
Long read alignment¶
minimap2
Genome assembly (Illumina)¶
SPAdes
Genome assembly (long read)¶
flye or nextdenovo
Genome annotation (prokaryote)¶
Bakta
RNA-Seq quantification¶
salmon
Differential gene expression analysis¶
deseq2
Pairwise sequence alignment¶
blast or diamond
Multiple sequence alignment¶
mafft --auto
Phylogenetic tree construction¶
IQ-TREE; fasttree can be useful for preliminary analysis
Orthologous group calculation (Prokaryote)¶
PIRATE for cultured organisms; PPanGGOLiN for MAGs/SAGs
anvio is useful for pangenome analysis as well
Principal component analysis or other ordination/dimensional reduction¶
Using R: vegan handles all types of ordinations. For nonmetric multidimensional scaling (NMDS), use the metaMDS function. For principal component analysis (PCA) use the rda function with no constraints. For principal coordinate analysis (PCoA), use the wcmdscale function. For constrained ordinations, use capscale (or dbRDA, constrained versions of PCoA) or rda (constrained PCA).
How do I…¶
Figure out how to run a program¶
- Use -h, e.g. $program -h
- Use --help, e.g. $program --help
- Use help, e.g. $program help
- Use man (this works for system-installed things like cat, mkdir, ls), e.g. man $program
- Use tldr (works for common programs: awk, sed, tar, wget), e.g. tldr $program
- Examine the script with less, e.g. which $program; less -S /local/cluster/bin/$program (Note: does not work for compiled software)
- Search the program name on Google
- Search the program name on the updates website https://software.cqls.oregonstate.edu/updates/tags
You can also try helpme for some curated help data.
Display a formatted markdown file on the command line¶
Use glow
Download reads from NCBI¶
Use prefetch and fasterq-dump. See here for more info.
Link to SRA toolkit under construction.
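A minimal sketch (the accession is an example):
# fetch the run from SRA, then convert it to FASTQ
prefetch SRR000001
fasterq-dump --split-files SRR000001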
Download genomes from NCBI¶
You can use the data-hub to get genome data. Use files-hpc.cqls.oregonstate.edu for downloads.
You can also use the get_assemblies program. Try this:
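# the exact flags vary by version; start with the built-in help
get_assemblies --help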
Generate a BLASTDB¶
Use makeblastdb -in INPUT.fasta -dbtype [nucl|prot] to generate your BLASTDB.
Please submit using hqsub 'makeblastdb ...'.
Do a BLASTN or BLASTP search¶
Use blastp or blastn. Use the -help flag for options. Do not use blastall as it is old and now unsupported.
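For example (the file and database names are placeholders):
# tabular (-outfmt 6) nucleotide search against a local database
blastn -query contigs.fa -db mydb -outfmt 6 -out hits.tsv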
Miscellaneous¶
My terminal output is garbled¶
Run reset. This should reset the output on your screen and you should be able to continue as normal.
Do my sequences have adapters?¶
Index sequences will not be included in the raw reads. Adapters could be on the 3' end of reads, depending on the library prep type. Please check the FastQC reports sent by Matthew with each sequencing run to determine if your reads have adapters. Use the fastp program to remove them.
What is going on with my 16S sequencing results?¶
See this page for some information regarding the 16S preps provided at the CQLS.