Kai_Qi • 3.9 years ago
Hi All:
I just learned how to use sbatch to map RNA-seq data. My sbatch script is:
#!/bin/bash
#SBATCH --job-name=starmapping_sbatch
#SBATCH --output=starmapping_sbatch.out
#SBATCH --error=starmapping_sbatch.err
#SBATCH --time=02:00:00
#SBATCH --partition=bigmem2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=20G
module load STAR
STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp \
--runThreadN 10 \
--readFilesIn SRR5048027_1.fastq SRR5048027_2.fastq \
--outFileNamePrefix GRCm38E18SRR5048027 \
--twopassMode Basic \
--sjdbOverhang 125 \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.06 \
--alignIntronMin 70 \
--alignIntronMax 500000 \
--alignMatesGapMax 500000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outSAMstrandField intronMotif \
--outFilterType BySJout
For the `--readFilesIn` step I need to change the file names every time I submit the job. Is there a way to supply them on the command line each time instead of editing the .sh file?
Thank you for any advice,
Use a loop. Instead of using a script like this you can submit jobs directly on the command line. A very general idea is like this:
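A sketch of such a loop (file names and options are illustrative; the `echo` makes it a dry run, so remove it to actually submit):

```shell
#!/bin/bash
# Dry run: print one sbatch command per sample. Remove "echo" to submit.
# Assumes paired-end files named like SRR5048027_1.fastq / SRR5048027_2.fastq.
for i in *_1.fastq; do
    name=$(basename "$i" _1.fastq)   # SRR5048027_1.fastq -> SRR5048027
    echo sbatch --job-name="$name" --partition=bigmem2 --time=02:00:00 \
        --nodes=1 --ntasks-per-node=10 --mem-per-cpu=20G \
        --wrap="STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp --runThreadN 10 --readFilesIn ${name}_1.fastq ${name}_2.fastq --outFileNamePrefix ${name} --outSAMtype BAM SortedByCoordinate"
done
```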
Thank you for your advice. It is much easier for me to understand and practice. Do you think it is possible to put your loop in a bash file and then directly type `bash file.sh`?
Thank you,
Doing it this way submits individual jobs for each of your samples. This allows you to make the best use of your cluster compute resources by parallelization. You can submit any number of jobs and have them run in sequence based on job slots allocated to your account.
You could put that `for` loop inside a script like the one above, but then a single job would be running that script and you would never get through all of your samples in the time allotted for the parent job.

I appreciate your reply a lot, but I still cannot fully comprehend it. Let's say I have 2 replicates inside a folder; they are paired-end sequenced fastq files: (SRR1_1.fastq, SRR1_2.fastq) and (SRR2_1.fastq, SRR2_2.fastq). Using your advice:
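(roughly, with everything hard-coded for one sample - the options shown are illustrative:)

```shell
sbatch --partition=bigmem2 --ntasks-per-node=10 --mem-per-cpu=20G \
    --wrap="STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp --runThreadN 10 --readFilesIn SRR1_1.fastq SRR1_2.fastq --outFileNamePrefix SRR1"
```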
But in this case, when I am about to map SRR2_1.fastq/SRR2_2.fastq, I still need to retype this command. Also it looks like, done this way, I have to wait until the job is finished and then retype it again?
I know I did not understand it correctly. Can you help me understand it so I can use it on my samples?
Thanks again
You're missing the `basename` command. `basename $i .fastq` (note the space between `$i` and `.fastq`) will remove `.fastq` from the name. The rest will work out, as `$name` will be defined properly. Please pay attention to each command and experiment on your own - that's how we all learn shell scripting.

Everything that starts with `#SBATCH` in your script is an option for SLURM. When you submit a job on the command line you just put those options one after another: `sbatch option1 option2` and so on. `--wrap` actually wraps the program you are trying to run, in this case `STAR`. All options for `STAR` go next, with the entire unix command enclosed in `"`.

Run this part to understand how the names of the samples are being extracted from the files.
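For example, a self-contained demonstration of the name extraction (the `touch` just creates empty stand-ins; with real data, run the loop in the directory holding your fastq files):

```shell
#!/bin/bash
# Demonstrate how basename strips the given suffix from each file name.
touch SRR1_1.fastq SRR1_2.fastq SRR2_1.fastq SRR2_2.fastq
for i in *.fastq; do
    name=$(basename "$i" .fastq)   # SRR1_1.fastq -> SRR1_1
    echo "$name"
done
```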
Thank you for your patient explanation. I will try it today. Last night I modified the script in this way:
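(roughly like this - the key change is that the sample prefix comes in as the first command-line argument, `$1`; the STAR options shown are abbreviated:)

```shell
#!/bin/bash
#SBATCH --job-name=starmapping_sbatch
#SBATCH --output=starmapping_sbatch.out
#SBATCH --error=starmapping_sbatch.err
#SBATCH --time=02:00:00
#SBATCH --partition=bigmem2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=20G
module load STAR
# $1 is the sample prefix given on the command line, e.g. SRR1
STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp \
    --runThreadN 10 \
    --readFilesIn ${1}_1.fastq ${1}_2.fastq \
    --outFileNamePrefix ${1} \
    --outSAMtype BAM SortedByCoordinate
```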
Then all I need to do is:
sbatch file.sh SRR1
Thank you all very much. I will look up the methods you mentioned here for further learning.

I use an advanced version of this with HEREDOCs. Sample:
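Trimmed down, the idea looks like this (sample names are hard-coded here for illustration; in practice they would come from the fastq files via `basename` as above, and the `sbatch` call is commented out so you can inspect the generated scripts first):

```shell
#!/bin/bash
# Create a per-sample submission script via a heredoc, then submit it.
# Sample names hard-coded for illustration; in practice derive them, e.g.
#   for i in *_1.fastq; do name=$(basename "$i" _1.fastq); ...
for name in SRR1 SRR2; do
cat > ${name}.sh <<EOF
#!/bin/bash
#SBATCH --job-name=star_${name}
#SBATCH --time=02:00:00
#SBATCH --partition=bigmem2
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=20G
module load STAR
STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp \\
     --runThreadN 10 \\
     --readFilesIn ${name}_1.fastq ${name}_2.fastq \\
     --outFileNamePrefix ${name}
EOF
    # sbatch ${name}.sh   # uncomment to actually submit
done
```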
I've removed a bunch of other things from the loop above, but the premise is simple: Create a script for each sample and run it. The advantage is that I can customize multiple parameters for each run. Of course, a workflow manager would be a lot more elegant but this is easier to understand for most shell programmers including myself.
Thank you so much, but I am a little overwhelmed. It looks like I will have to submit the sbatch files one by one for a while, until I can understand your script. It is good to know where to look next time.
Thanks a lot for your comments
Use a workflow manager like snakemake or nextflow.
For casual users/non-programmers the activation energy needed is too high to use these effectively. I am not saying they should not use workflow managers, but we should be careful about recommending them as the solution.
Thank you for the advice. Frankly speaking I have not heard these 2 before. But I will keep it in mind.
See "Use a workflow management tool to manage your computational pipelines", "Snakemake vs. Nextflow: strengths and weaknesses", etc.