Question: Trimmomatic job script to run on multiple pair end read file
0
gravatar for Bioinfonext
19 months ago by
Bioinfonext150
Korea
Bioinfonext150 wrote:

I run below Trimmomatic job script on single pair end read file at HPC SERVER, How I can run it as loop if there are multiple pair end read file like this:

41_R1.fastq                                  41_R2.fastq

42_R1.fastq                                  42_R2.fastq

43_R2.fastq                                  43_R2.fastq



 #!/bin/bash

#$ -N trimmomtic
**strong text**
#$ -o /mnt/scratch/users/3052771/testdata/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic 

#$ -wd /mnt/scratch/users/3052771/testdata

trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
rna-seq • 4.4k views
ADD COMMENTlink modified 3 months ago by lakhujanivijay4.3k • written 19 months ago by Bioinfonext150

A: Using trimmomatic on multiple paired-end read files

ADD REPLYlink written 19 months ago by genomax70k
2
gravatar for genomax
19 months ago by
genomax70k
United States
genomax70k wrote:

One way to do this.

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done

This will save all command lines in file cmd_file. Which will look like this.

$ cat cmd_file
trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 42_R1.fastq 42_R2.fastq 42_R1_paired.fq.gz 42_R1_unpaired.fq.gz 42_R2_paired.fq.gz 42_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Use the lines to replace in the scheduler script above. Adjust the log output files as needed.

ADD COMMENTlink modified 19 months ago • written 19 months ago by genomax70k

Thanks a lot.

what if my input files is in fq.gz formats, can I change fastq to fq.gz?

for i in `ls -1 *R1*.fa.gz | sed 's/\_R1.fa.gz//'`; do echo trimmomatic PE -phred33 $i\_R1.fa.gz $i\_R2.fa.gz $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done
ADD REPLYlink written 19 months ago by Bioinfonext150

That is correct. Replace the part that is common to all files. I am going to assume that the fa.gz above is an error. Trimmomatic will expect `fastq format files as input.

ADD REPLYlink written 19 months ago by genomax70k

This script is working perfectly on one pair end reads:

#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-job/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

But this loop is not working, but above script is giving output: Could you please suggest where I am making mistake:

Input file names are like this:

SRR4104639_R1.fastq                  SRR4104639_R2.fastq                               

SRR4104640_R1.fastq                  SRR4104640_R2.fastq                               

SRR4104637_R1.fastq                  SRR4104637_R2.fastq 




#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-jbo/trimread.out
#$ -pe smp-verbose 20
module add apps/trimmomatic
#$ -wd /mnt/scratch/users/3052771/testdata/test-job

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 >> cmd_file; done
ADD REPLYlink modified 19 months ago • written 19 months ago by Bioinfonext150

It is not a good idea to spawn multiple jobs from inside a main job. You may need to look into job arrays if you want to do that. What job scheduler are you using?

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax70k

I am not sure how to use task array:

HPC cluster admin suggest this:

Used to run an array of serial jobs
–Input and output files can be separated
–Individual tasks are independent; may run anywhere
–Use the “-t <start>-<end>” option

#!/bin/bash
#$ -t 1-100 –cwd–o ~/results/taskjob.$JOB_ID
module load apps/fastapp
Fastapp–i~/inputdata/intput.$SGE_TASK_ID–o \~/results/out.$SGE_TASK_ID
ADD REPLYlink written 19 months ago by Bioinfonext150

I am using Grid Engine Scheduler.

ADD REPLYlink written 19 months ago by Bioinfonext150

I run above loop and all cmd command put in the above script but it is not working:

#!/bin/bash

#$ -N trimmomtic

#$ -o /mnt/scratch/users/3052771/testdata/test-job

#$ -pe smp-verbose 20

module load apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104639_R1.fastq SRR4104639_R2.fastq SRR4104639_R1_paired.fq.gz SRR4104639_R1_unpaired.fq.gz SRR4104639_R2_paired.fq.gz SRR4104639_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104640_R1.fastq SRR4104640_R2.fastq SRR4104640_R1_paired.fq.gz SRR4104640_R1_unpaired.fq.gz SRR4104640_R2_paired.fq.gz SRR4104640_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104641_R1.fastq SRR4104641_R2.fastq SRR4104641_R1_paired.fq.gz SRR4104641_R1_unpaired.fq.gz SRR4104641_R2_paired.fq.gz SRR4104641_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104642_R1.fastq SRR4104642_R2.fastq SRR4104642_R1_paired.fq.gz SRR4104642_R1_unpaired.fq.gz SRR4104642_R2_paired.fq.gz SRR4104642_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104643_R1.fastq SRR4104643_R2.fastq SRR4104643_R1_paired.fq.gz SRR4104643_R1_unpaired.fq.gz SRR4104643_R2_paired.fq.gz SRR4104643_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104644_R1.fastq SRR4104644_R2.fastq SRR4104644_R1_paired.fq.gz SRR4104644_R1_unpaired.fq.gz SRR4104644_R2_paired.fq.gz SRR4104644_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104645_R1.fastq SRR4104645_R2.fastq SRR4104645_R1_paired.fq.gz SRR4104645_R1_unpaired.fq.gz SRR4104645_R2_paired.fq.gz SRR4104645_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
ADD REPLYlink written 19 months ago by Bioinfonext150

Create a separate file for each job with just one trimmomatic command in it.

ADD REPLYlink written 19 months ago by genomax70k

Hi Genomax,

Above script is also working, I just got confused because job was in queue, now it is generated all result files. Now could you also suggest loop to save hisat command line in cmd file for the file so that I can run together like trimmomatic.

files names is like this:

SRR4104637_R1_paired.fq.gz
SRR4104637_R2_paired.fq.gz


SRR4104638_R1_paired.fq.gz
SRR4104638_R2_paired.fq.gz

Hisat2 command for single pair end files:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam
ADD REPLYlink modified 19 months ago • written 19 months ago by Bioinfonext150
1

This time the common part in the file names is _paired.fq.gz so you can chop that off with sed and then create a new loop. Command will be saved in cmd_file as before.

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x genome_index -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd_file
ADD REPLYlink modified 19 months ago • written 19 months ago by genomax70k

Hi genomax:

Loop for hisat2 generating files like this:

hisat2 -x rice_index -1 SRR4104637_R1_R1_paired.fq.gz -2 SRR4104637_R1_R2_paired.fq.gz -S SRR4104637_R1.sam
hisat2 -x rice_index -1 SRR4104638_R1_R1_paired.fq.gz -2 SRR4104638_R1_R2_paired.fq.gz -S SRR4104638_R1.sam

Can you correct it so that it can generate files like this:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam
ADD REPLYlink modified 19 months ago • written 19 months ago by Bioinfonext150

Try the modified code above.

ADD REPLYlink written 19 months ago by genomax70k

Thank you very much. Now its working perfectly.

ADD REPLYlink written 19 months ago by Bioinfonext150

Hi Genomax,

I am trying to run the hisat2 loop but not getting it correct: How I can modify command to remove extra R1

I am getting output like this:

hisat2 -x chrX_tran -1 Leaf_T1_F_R10_S1_L001_R1_R1_paired.fq.gz -2 Leaf_T1_F_R10_S1_L001_R1_R2_paired.fq.gz -S Leaf_T1_F_R10_S1_L001_R1.sam

command:

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x chrX_tran -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd2_file;done

input files are like this:

Root_T3_S_R2_S60_L001_R2_paired.fq.gz
Root_T3_S_R3_S59_L001_R1_paired.fq.gz
Root_T3_S_R3_S59_L001_R2_paired.fq.gz
Root_T3_S_R5_S58_L001_R1_paired.fq.gz
Root_T3_S_R5_S58_L001_R2_paired.fq.gz

I will be thankful for your time and advise.

Thanks Bioinfonext

ADD REPLYlink written 4 days ago by Bioinfonext150

I am not sure what you mean by extra R1? That loop should work with the example file names posted. Here I am just printing the name prefixes.

$ for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo ${i};done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001

You could also do :

$ for i in *R1*.gz; do name=$(basename ${i} _R1_paired.fq.gz); echo ${name}; done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001
ADD REPLYlink modified 4 days ago • written 4 days ago by genomax70k
1
gravatar for lakhujanivijay
3 months ago by
lakhujanivijay4.3k
India
lakhujanivijay4.3k wrote:

ADD COMMENTlink written 3 months ago by lakhujanivijay4.3k

Dear Vijay,

Thanks for the above script. what about this adapter fasta file: ILLUMINACLIP:/data/results/NEB_adapter.fasta

do we need to change the path for this in the script and from where we can get this adapter file?

thanks bioinfonext

ADD REPLYlink written 22 days ago by Bioinfonext150

Use the adapter sequence file for your own data.

ADD REPLYlink written 22 days ago by genomax70k

Thank you very much genomax!

ADD REPLYlink written 22 days ago by Bioinfonext150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1313 users visited in the last hour