Kai_Qi • 3.9 years ago
Hi All:
I just learned how to use sbatch to map RNA-seq data. My sbatch script is:
#!/bin/bash
#SBATCH --job-name=starmapping_sbatch
#SBATCH --output=starmapping_sbatch.out
#SBATCH --error=starmapping_sbatch.err
#SBATCH --time=02:00:00
#SBATCH --partition=bigmem2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=20G
module load STAR
STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp \
--runThreadN 10 \
--readFilesIn SRR5048027_1.fastq SRR5048027_2.fastq \
--outFileNamePrefix GRCm38E18SRR5048027 \
--twopassMode Basic \
--sjdbOverhang 125 \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.06 \
--alignIntronMin 70 \
--alignIntronMax 500000 \
--alignMatesGapMax 500000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outSAMstrandField intronMotif \
--outFilterType BySJout
For the `--readFilesIn` step I need to change the file names every time I submit the job. Is there a way to supply them on the command line each time instead of editing the .sh file?
Thank you for any advice,
Use a loop. Instead of using a script like this you can submit jobs directly on the command line. A very general idea is like this:
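A sketch of such a loop (file names and options are illustrative; the `echo` makes it a dry run, so remove it to actually submit):

```shell
#!/bin/bash
# Dry run: print one sbatch command per sample. Remove "echo" to submit.
# Assumes paired-end files named like SRR5048027_1.fastq / SRR5048027_2.fastq.
for i in *_1.fastq; do
    name=$(basename "$i" _1.fastq)   # SRR5048027_1.fastq -> SRR5048027
    echo sbatch --job-name="$name" --partition=bigmem2 --time=02:00:00 \
        --nodes=1 --ntasks-per-node=10 --mem-per-cpu=20G \
        --wrap="STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp --runThreadN 10 --readFilesIn ${name}_1.fastq ${name}_2.fastq --outFileNamePrefix ${name} --outSAMtype BAM SortedByCoordinate"
done
```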
Thank you for your advice. It is much easier for me to understand and practice. Do you think it is possible to put your loop in a bash file and then directly type `bash file.sh`?
Thank you,
Doing it this way submits individual jobs for each of your samples. This allows you to make the best use of your cluster compute resources by parallelization. You can submit any number of jobs and have them run in sequence based on job slots allocated to your account.
You could put that `for` loop inside a script like the one above, but then a single job would be running that script and you would never get through all of your samples in the time allotted for the parent job.

I appreciate your reply a lot, but I still cannot fully comprehend it. Let's say I have 2 replicates inside a folder; they are paired-end sequenced fastq files: (SRR1_1.fastq, SRR1_2.fastq) and (SRR2_1.fastq, SRR2_2.fastq). Using your advice:
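(roughly, with everything hard-coded for one sample - the options shown are illustrative:)

```shell
sbatch --partition=bigmem2 --ntasks-per-node=10 --mem-per-cpu=20G \
    --wrap="STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp --runThreadN 10 --readFilesIn SRR1_1.fastq SRR1_2.fastq --outFileNamePrefix SRR1"
```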
But in this case, when I am about to map SRR2_1.fastq/SRR2_2.fastq, I still need to retype this command. Also it looks like, done this way, I have to wait until the job is finished and then retype it again?
I know I did not understand it correctly. Can you help me understand it so I can use it on my samples?
Thanks again
You're missing the `basename` command. `basename $i .fastq` (note the space between `$i` and `.fastq`) will remove `.fastq` from the name. The rest will work out, as `$name` will be defined properly. Please pay attention to each command and experiment on your own - that's how we all learn shell scripting.

Everything that starts with `#SBATCH` in your script is an option for SLURM. When you submit a job on the command line you just put those options one after another: `sbatch option1 option2` and so on. `--wrap` actually wraps the program you are trying to run, in this case `STAR`. All options for `STAR` go next, with the entire unix command enclosed in `"`.

Run this part to understand how the names of the samples are being extracted from the files.
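For example, a self-contained demonstration of the name extraction (the `touch` just creates empty stand-ins; with real data, run the loop in the directory holding your fastq files):

```shell
#!/bin/bash
# Demonstrate how basename strips the given suffix from each file name.
touch SRR1_1.fastq SRR1_2.fastq SRR2_1.fastq SRR2_2.fastq
for i in *.fastq; do
    name=$(basename "$i" .fastq)   # SRR1_1.fastq -> SRR1_1
    echo "$name"
done
```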
Thank you for your patient explanation. I will try it today. Last night I modified the script in this way:
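(roughly like this - the key change is that the sample prefix comes in as the first command-line argument, `$1`; the STAR options shown are abbreviated:)

```shell
#!/bin/bash
#SBATCH --job-name=starmapping_sbatch
#SBATCH --output=starmapping_sbatch.out
#SBATCH --error=starmapping_sbatch.err
#SBATCH --time=02:00:00
#SBATCH --partition=bigmem2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=20G
module load STAR
# $1 is the sample prefix given on the command line, e.g. SRR1
STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp \
    --runThreadN 10 \
    --readFilesIn ${1}_1.fastq ${1}_2.fastq \
    --outFileNamePrefix ${1} \
    --outSAMtype BAM SortedByCoordinate
```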
Then all I need to do is:
sbatch file.sh SRR1
Thank you all very much. I will look up the methods you mentioned here for further learning.

I use an advanced version of this with HEREDOCs. Sample:
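Trimmed down, the idea looks like this (sample names are hard-coded here for illustration; in practice they would come from the fastq files via `basename` as above, and the `sbatch` call is commented out so you can inspect the generated scripts first):

```shell
#!/bin/bash
# Create a per-sample submission script via a heredoc, then submit it.
# Sample names hard-coded for illustration; in practice derive them, e.g.
#   for i in *_1.fastq; do name=$(basename "$i" _1.fastq); ...
for name in SRR1 SRR2; do
cat > ${name}.sh <<EOF
#!/bin/bash
#SBATCH --job-name=star_${name}
#SBATCH --time=02:00:00
#SBATCH --partition=bigmem2
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=20G
module load STAR
STAR --genomeDir /scratch/midway2/caiqi/GRCm38_star_index_125bp \\
     --runThreadN 10 \\
     --readFilesIn ${name}_1.fastq ${name}_2.fastq \\
     --outFileNamePrefix ${name}
EOF
    # sbatch ${name}.sh   # uncomment to actually submit
done
```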
I've removed a bunch of other things from the loop above, but the premise is simple: Create a script for each sample and run it. The advantage is that I can customize multiple parameters for each run. Of course, a workflow manager would be a lot more elegant but this is easier to understand for most shell programmers including myself.
Thank you so much, but I am a little overwhelmed. It looks like I will have to submit the sbatch files one by one for a while, until I can understand your script. It is good to know where to look next time.
Thanks a lot for your comments
Use a workflow manager like snakemake or nextflow.
For casual users/non-programmers the activation energy needed is too high to use these effectively. I am not saying they should not use workflow managers, but we should be careful about recommending them as the solution.
Thank you for the advice. Frankly speaking I have not heard these 2 before. But I will keep it in mind.
See "Use a workflow management tool to manage your computational pipelines", "Snakemake vs. Nextflow: strengths and weaknesses", etc.