Question: Trimmomatic job script to run on multiple pair end read file
0
gravatar for Bioinfonext
2.8 years ago by
Bioinfonext250
Korea
Bioinfonext250 wrote:

I run below Trimmomatic job script on single pair end read file at HPC SERVER, How I can run it as loop if there are multiple pair end read file like this:

41_R1.fastq                                  41_R2.fastq

42_R1.fastq                                  42_R2.fastq

43_R2.fastq                                  43_R2.fastq



 #!/bin/bash

#$ -N trimmomtic
**strong text**
#$ -o /mnt/scratch/users/3052771/testdata/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic 

#$ -wd /mnt/scratch/users/3052771/testdata

trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
rna-seq • 11k views
ADD COMMENTlink modified 18 months ago by lakhujanivijay5.3k • written 2.8 years ago by Bioinfonext250

A: Using trimmomatic on multiple paired-end read files

ADD REPLYlink written 2.8 years ago by genomax91k
3
gravatar for genomax
2.8 years ago by
genomax91k
United States
genomax91k wrote:

One way to do this.

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done

This will save all command lines in file cmd_file. Which will look like this.

$ cat cmd_file
trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 42_R1.fastq 42_R2.fastq 42_R1_paired.fq.gz 42_R1_unpaired.fq.gz 42_R2_paired.fq.gz 42_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Use the lines to replace in the scheduler script above. Adjust the log output files as needed.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by genomax91k

Thanks a lot.

what if my input files is in fq.gz formats, can I change fastq to fq.gz?

for i in `ls -1 *R1*.fa.gz | sed 's/\_R1.fa.gz//'`; do echo trimmomatic PE -phred33 $i\_R1.fa.gz $i\_R2.fa.gz $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done
ADD REPLYlink written 2.8 years ago by Bioinfonext250

That is correct. Replace the part that is common to all files. I am going to assume that the fa.gz above is an error. Trimmomatic will expect `fastq format files as input.

ADD REPLYlink written 2.8 years ago by genomax91k

This script is working perfectly on one pair end reads:

#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-job/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

But this loop is not working, but above script is giving output: Could you please suggest where I am making mistake:

Input file names are like this:

SRR4104639_R1.fastq                  SRR4104639_R2.fastq                               

SRR4104640_R1.fastq                  SRR4104640_R2.fastq                               

SRR4104637_R1.fastq                  SRR4104637_R2.fastq 




#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-jbo/trimread.out
#$ -pe smp-verbose 20
module add apps/trimmomatic
#$ -wd /mnt/scratch/users/3052771/testdata/test-job

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 >> cmd_file; done
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Bioinfonext250

It is not a good idea to spawn multiple jobs from inside a main job. You may need to look into job arrays if you want to do that. What job scheduler are you using?

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by genomax91k

I am not sure how to use task array:

HPC cluster admin suggest this:

Used to run an array of serial jobs
–Input and output files can be separated
–Individual tasks are independent; may run anywhere
–Use the “-t <start>-<end>” option

#!/bin/bash
#$ -t 1-100 –cwd–o ~/results/taskjob.$JOB_ID
module load apps/fastapp
Fastapp–i~/inputdata/intput.$SGE_TASK_ID–o \~/results/out.$SGE_TASK_ID
ADD REPLYlink written 2.8 years ago by Bioinfonext250

I am using Grid Engine Scheduler.

ADD REPLYlink written 2.8 years ago by Bioinfonext250

can you please help me with, that which file is to be taken for the further analysis?

ADD REPLYlink written 9 months ago by vishalchanda36420

I run above loop and all cmd command put in the above script but it is not working:

#!/bin/bash

#$ -N trimmomtic

#$ -o /mnt/scratch/users/3052771/testdata/test-job

#$ -pe smp-verbose 20

module load apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104639_R1.fastq SRR4104639_R2.fastq SRR4104639_R1_paired.fq.gz SRR4104639_R1_unpaired.fq.gz SRR4104639_R2_paired.fq.gz SRR4104639_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104640_R1.fastq SRR4104640_R2.fastq SRR4104640_R1_paired.fq.gz SRR4104640_R1_unpaired.fq.gz SRR4104640_R2_paired.fq.gz SRR4104640_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104641_R1.fastq SRR4104641_R2.fastq SRR4104641_R1_paired.fq.gz SRR4104641_R1_unpaired.fq.gz SRR4104641_R2_paired.fq.gz SRR4104641_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104642_R1.fastq SRR4104642_R2.fastq SRR4104642_R1_paired.fq.gz SRR4104642_R1_unpaired.fq.gz SRR4104642_R2_paired.fq.gz SRR4104642_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104643_R1.fastq SRR4104643_R2.fastq SRR4104643_R1_paired.fq.gz SRR4104643_R1_unpaired.fq.gz SRR4104643_R2_paired.fq.gz SRR4104643_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104644_R1.fastq SRR4104644_R2.fastq SRR4104644_R1_paired.fq.gz SRR4104644_R1_unpaired.fq.gz SRR4104644_R2_paired.fq.gz SRR4104644_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104645_R1.fastq SRR4104645_R2.fastq SRR4104645_R1_paired.fq.gz SRR4104645_R1_unpaired.fq.gz SRR4104645_R2_paired.fq.gz SRR4104645_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
ADD REPLYlink written 2.8 years ago by Bioinfonext250

Create a separate file for each job with just one trimmomatic command in it.

ADD REPLYlink written 2.8 years ago by genomax91k

Hi Genomax,

Above script is also working, I just got confused because job was in queue, now it is generated all result files. Now could you also suggest loop to save hisat command line in cmd file for the file so that I can run together like trimmomatic.

files names is like this:

SRR4104637_R1_paired.fq.gz
SRR4104637_R2_paired.fq.gz


SRR4104638_R1_paired.fq.gz
SRR4104638_R2_paired.fq.gz

Hisat2 command for single pair end files:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Bioinfonext250
1

This time the common part in the file names is _paired.fq.gz so you can chop that off with sed and then create a new loop. Command will be saved in cmd_file as before.

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x genome_index -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd_file
ADD REPLYlink modified 2.7 years ago • written 2.8 years ago by genomax91k

Hi genomax:

Loop for hisat2 generating files like this:

hisat2 -x rice_index -1 SRR4104637_R1_R1_paired.fq.gz -2 SRR4104637_R1_R2_paired.fq.gz -S SRR4104637_R1.sam
hisat2 -x rice_index -1 SRR4104638_R1_R1_paired.fq.gz -2 SRR4104638_R1_R2_paired.fq.gz -S SRR4104638_R1.sam

Can you correct it so that it can generate files like this:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Bioinfonext250

Try the modified code above.

ADD REPLYlink written 2.7 years ago by genomax91k

Thank you very much. Now its working perfectly.

ADD REPLYlink written 2.7 years ago by Bioinfonext250

Hi Genomax,

I am trying to run the hisat2 loop but not getting it correct: How I can modify command to remove extra R1

I am getting output like this:

hisat2 -x chrX_tran -1 Leaf_T1_F_R10_S1_L001_R1_R1_paired.fq.gz -2 Leaf_T1_F_R10_S1_L001_R1_R2_paired.fq.gz -S Leaf_T1_F_R10_S1_L001_R1.sam

command:

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x chrX_tran -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd2_file;done

input files are like this:

Root_T3_S_R2_S60_L001_R2_paired.fq.gz
Root_T3_S_R3_S59_L001_R1_paired.fq.gz
Root_T3_S_R3_S59_L001_R2_paired.fq.gz
Root_T3_S_R5_S58_L001_R1_paired.fq.gz
Root_T3_S_R5_S58_L001_R2_paired.fq.gz

I will be thankful for your time and advise.

Thanks Bioinfonext

ADD REPLYlink written 14 months ago by Bioinfonext250

I am not sure what you mean by extra R1? That loop should work with the example file names posted. Here I am just printing the name prefixes.

$ for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo ${i};done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001

You could also do :

$ for i in *R1*.gz; do name=$(basename ${i} _R1_paired.fq.gz); echo ${name}; done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001
ADD REPLYlink modified 14 months ago • written 14 months ago by genomax91k

Dear genomax,

thanks for your all help. Can I run same loop for HTseq, I got alignment files and now now to run the loop to generate the read count for each alignment sam file saperetly. alignment file are like this:

 Root_T3_F_R9_S27_L001.sam

 Root_T3_S_R2_S60_L001.sam

 Root_T3_S_R3_S59_L001.sam

 Leaf_T1_F_R10_S1_L001.sam

 Leaf_T1_F_R2_S5_L001.sam

 Leaf_T1_F_R3_S4_L001.sam

for single file HTseq command:

htseq-count  Leaf_T1_F_R3_S4_L001.sam rice.gff  > Leaf_T1_F_R3_S4.count.txt

How can use above loop to generate HTseq command for all sam files.

I am thinking to use this, but how I can add count.txt to save output files.

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >> cmd2_file;done

above command generated output like this:

htseq-count  rice.gff  Leaf_T3_S_R6_S42_L001.sam
htseq-count  rice.gff   Leaf_T3_S_R7_S41_L001.sam

But if it can write script like this will be a great help:

htseq-count  rice.gff  Leaf_T3_S_R6_S42_L001.sam >  Leaf_T3_S_R6_S42_L001.count.txt

Thanks bioinfonext

ADD REPLYlink modified 14 months ago • written 14 months ago by Bioinfonext250

I got out put file like txt by using below command:

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >$i\_L001.COUNT.txt>> cmd2_file;done

Please let me know if it is ok Thanks Bioinfonext

ADD REPLYlink written 14 months ago by Bioinfonext250
1

That looks ok. Instead of using using bare $i, it is a better practice to use ${i} to reference a variable.

ADD REPLYlink written 14 months ago by genomax91k

Hi Genomax,

HTSeq loop command:

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >$i\_L001.COUNT.txt>> cmd2_file;done

above command generate cmd file like this, it do not direct output to txt file in cmd file; instead of adding output txt file to cmd command, it generate empty txt file according to sample name.

htseq-count Leaf_T1_F_R10_S1_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

htseq-count Leaf_T1_F_R2_S5_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

htseq-count Leaf_T1_F_R3_S4_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

It will be great if it can also add output count txt file according to sample name in the command.

ADD REPLYlink written 14 months ago by Bioinfonext250

it will be great if can generate cmd file like this:

htseq-count Leaf_T1_F_R10_S1_L001.sam Oryza_indica.ASM465v1.44.chr.gtf >Leaf_T1_F_R10_S1_L001.count.txt
ADD REPLYlink written 14 months ago by Bioinfonext250

Use this:

$ for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo "htseq-count ${i}_L001.sam  Oryza_indica.ASM465v1.44.chr.gtf  >${i}_L001.COUNT.txt" >> cmd2_file;done
ADD REPLYlink modified 14 months ago • written 14 months ago by genomax91k
1
gravatar for lakhujanivijay
18 months ago by
lakhujanivijay5.3k
India/Ahmedabad
lakhujanivijay5.3k wrote:

ADD COMMENTlink written 18 months ago by lakhujanivijay5.3k

Dear Vijay,

Thanks for the above script. what about this adapter fasta file: ILLUMINACLIP:/data/results/NEB_adapter.fasta

do we need to change the path for this in the script and from where we can get this adapter file?

thanks bioinfonext

ADD REPLYlink written 15 months ago by Bioinfonext250

Use the adapter sequence file for your own data.

ADD REPLYlink written 15 months ago by genomax91k

Thank you very much genomax!

ADD REPLYlink written 15 months ago by Bioinfonext250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 831 users visited in the last hour