Question

Trimmomatic job script to run on multiple pair end read file

0

Entering edit mode

7.5 years ago

Bioinfonext ▴ 470

I run below Trimmomatic job script on single pair end read file at HPC SERVER, How I can run it as loop if there are multiple pair end read file like this:

41_R1.fastq                                  41_R2.fastq

42_R1.fastq                                  42_R2.fastq

43_R2.fastq                                  43_R2.fastq



 #!/bin/bash

#$ -N trimmomtic
**strong text**
#$ -o /mnt/scratch/users/3052771/testdata/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic 

#$ -wd /mnt/scratch/users/3052771/testdata

trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

RNA-Seq • 32k views

ADD COMMENT • link updated 6.2 years ago by lakhujanivijay 5.9k • written 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

A: Using trimmomatic on multiple paired-end read files

ADD REPLY • link 7.5 years ago by GenoMax 152k

1

Entering edit mode

6.2 years ago

lakhujanivijay 5.9k

Automation of trimmomatic

What is Trimmomatic?

Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.

For details, read the manual here

✂️ About the script

The script can be used for running trimmomatic automatically for N no.of samples regardless of the file extension. Read below to understand the functionalities, usage and reasons for failure.

The script is accessible here

🚩 Why should one use this?

This script directly works with compressed/uncompressed fastq files exploiting the functionalities of trimmomatic.
Obviously, it will save a lot of time as trimming is one of the routine tasks and it ought to be automated.
It is robust and intuitive and errors while execution (if any) are self-explanatory as explained below.

🔧 Using the script:

Consider you have a directory with a mixture of fastq files with different file extensions

    |-- all_R1.fq
    |-- all_R1.fq.gz
    |-- all_R2.fq
    |-- all_R2.fq.gz
    |-- demo_R1.fq
    |-- demo_R2.fq
    |-- make_2_R1.fastq
    |-- make_2_R1.fastq.gz
    |-- make_2_R2.fastq
    |-- make_2_R2.fastq.gz

Example (1): Running script only for .fq.gz files:

$ sh auto_trimmomatic.sh *.fq.gz
Example (2): Running script only for .fq files:

$ sh auto_trimmomatic.sh *.fq
Example (3): Running script only for .fastq files:

$ sh auto_trimmomatic.sh *.fastq
Example (4): Running script only for .fastq.gz files:

$ sh auto_trimmomatic.sh *.fastq.gz
Example (5): Running script only for all files in the directory:

$ sh auto_trimmomatic.sh *

ℹ️ Invoking help

$ sh auto_trimmomatic.sh --help OR

$ sh auto_trimmomatic.sh -h

⚠️ Error handling

No parameters passed to script

[mypc]$ sh auto_trimmomatic.sh 
Error: No parameter(s) provided
Usage: sh auto_trim [*.extension]
       extension: <fq> or <fastq> or <fq.gz> or <fastq.gz> or <*>
       example: sh auto_trim.sh *.fq.gz or sh auto_trim.sh *
 
Help:  sh auto_trimmomatic.sh -h or --help

No files in the directory with user provided extension

[mypc]$ sh auto_trimmomatic.sh *.fq

FileNotFoundError: No such file with extension *.fq found!
Supported extensions are: <.fq> or <.fastq> or <.fq.gz> or <.fastq.gz>

Fastq file names do not have R1 - R2 naming conventions. Say if you have files like these - demo_1.fq, demo_2.fq, the script will fail:

[mypc]$ sh auto_trimmomatic.sh *.fq

Filename Error: Paired end file names should contain _R1 _R2
Example: test_R1.fq.gz, test_R2.fq.gz

ℹ️ Rename the fastq files as `demo_R1.fq`, `demo_R2.fq`. This checkpoint was essential to maintain integrity of the script.

view raw auto-trimmomatic.md hosted with ❤ by GitHub

ADD COMMENT • link 6.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Dear Vijay,

Thanks for the above script. what about this adapter fasta file: ILLUMINACLIP:/data/results/NEB_adapter.fasta

do we need to change the path for this in the script and from where we can get this adapter file?

thanks bioinfonext

ADD REPLY • link 6.0 years ago by Bioinfonext ▴ 470

0

Entering edit mode

Use the adapter sequence file for your own data.

ADD REPLY • link 6.0 years ago by GenoMax 152k

0

Entering edit mode

Thank you very much genomax!

ADD REPLY • link 6.0 years ago by Bioinfonext ▴ 470

score 5 · Accepted Answer · 2018-01-23

5

Entering edit mode

7.5 years ago

GenoMax 152k

One way to do this.

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done

This will save all command lines in file cmd_file. Which will look like this.

$ cat cmd_file
trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 42_R1.fastq 42_R2.fastq 42_R1_paired.fq.gz 42_R1_unpaired.fq.gz 42_R2_paired.fq.gz 42_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Use the lines to replace in the scheduler script above. Adjust the log output files as needed.

ADD COMMENT • link 7.5 years ago by GenoMax 152k

0

Entering edit mode

Thanks a lot.

what if my input files is in fq.gz formats, can I change fastq to fq.gz?

for i in `ls -1 *R1*.fa.gz | sed 's/\_R1.fa.gz//'`; do echo trimmomatic PE -phred33 $i\_R1.fa.gz $i\_R2.fa.gz $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

That is correct. Replace the part that is common to all files. I am going to assume that the fa.gz above is an error. Trimmomatic will expect `fastq format files as input.

ADD REPLY • link 7.5 years ago by GenoMax 152k

0

Entering edit mode

This script is working perfectly on one pair end reads:

#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-job/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

But this loop is not working, but above script is giving output: Could you please suggest where I am making mistake:

Input file names are like this:

SRR4104639_R1.fastq                  SRR4104639_R2.fastq                               

SRR4104640_R1.fastq                  SRR4104640_R2.fastq                               

SRR4104637_R1.fastq                  SRR4104637_R2.fastq 




#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-jbo/trimread.out
#$ -pe smp-verbose 20
module add apps/trimmomatic
#$ -wd /mnt/scratch/users/3052771/testdata/test-job

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 >> cmd_file; done

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

It is not a good idea to spawn multiple jobs from inside a main job. You may need to look into job arrays if you want to do that. What job scheduler are you using?

ADD REPLY • link 7.5 years ago by GenoMax 152k

0

Entering edit mode

I am not sure how to use task array:

HPC cluster admin suggest this:

Used to run an array of serial jobs
–Input and output files can be separated
–Individual tasks are independent; may run anywhere
–Use the “-t <start>-<end>” option

#!/bin/bash
#$ -t 1-100 –cwd–o ~/results/taskjob.$JOB_ID
module load apps/fastapp
Fastapp–i~/inputdata/intput.$SGE_TASK_ID–o \~/results/out.$SGE_TASK_ID

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

I am using Grid Engine Scheduler.

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

can you please help me with, that which file is to be taken for the further analysis?

ADD REPLY • link 5.5 years ago by vishalchanda364 ▴ 20

0

Entering edit mode

I run above loop and all cmd command put in the above script but it is not working:

#!/bin/bash

#$ -N trimmomtic

#$ -o /mnt/scratch/users/3052771/testdata/test-job

#$ -pe smp-verbose 20

module load apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104639_R1.fastq SRR4104639_R2.fastq SRR4104639_R1_paired.fq.gz SRR4104639_R1_unpaired.fq.gz SRR4104639_R2_paired.fq.gz SRR4104639_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104640_R1.fastq SRR4104640_R2.fastq SRR4104640_R1_paired.fq.gz SRR4104640_R1_unpaired.fq.gz SRR4104640_R2_paired.fq.gz SRR4104640_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104641_R1.fastq SRR4104641_R2.fastq SRR4104641_R1_paired.fq.gz SRR4104641_R1_unpaired.fq.gz SRR4104641_R2_paired.fq.gz SRR4104641_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104642_R1.fastq SRR4104642_R2.fastq SRR4104642_R1_paired.fq.gz SRR4104642_R1_unpaired.fq.gz SRR4104642_R2_paired.fq.gz SRR4104642_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104643_R1.fastq SRR4104643_R2.fastq SRR4104643_R1_paired.fq.gz SRR4104643_R1_unpaired.fq.gz SRR4104643_R2_paired.fq.gz SRR4104643_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104644_R1.fastq SRR4104644_R2.fastq SRR4104644_R1_paired.fq.gz SRR4104644_R1_unpaired.fq.gz SRR4104644_R2_paired.fq.gz SRR4104644_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104645_R1.fastq SRR4104645_R2.fastq SRR4104645_R1_paired.fq.gz SRR4104645_R1_unpaired.fq.gz SRR4104645_R2_paired.fq.gz SRR4104645_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

Create a separate file for each job with just one trimmomatic command in it.

ADD REPLY • link 7.5 years ago by GenoMax 152k

0

Entering edit mode

Hi Genomax,

Above script is also working, I just got confused because job was in queue, now it is generated all result files. Now could you also suggest loop to save hisat command line in cmd file for the file so that I can run together like trimmomatic.

files names is like this:

SRR4104637_R1_paired.fq.gz
SRR4104637_R2_paired.fq.gz


SRR4104638_R1_paired.fq.gz
SRR4104638_R2_paired.fq.gz

Hisat2 command for single pair end files:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

1

Entering edit mode

This time the common part in the file names is _paired.fq.gz so you can chop that off with sed and then create a new loop. Command will be saved in cmd_file as before.

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x genome_index -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd_file

ADD REPLY • link 7.5 years ago by GenoMax 152k

0

Entering edit mode

Hi genomax:

Loop for hisat2 generating files like this:

hisat2 -x rice_index -1 SRR4104637_R1_R1_paired.fq.gz -2 SRR4104637_R1_R2_paired.fq.gz -S SRR4104637_R1.sam
hisat2 -x rice_index -1 SRR4104638_R1_R1_paired.fq.gz -2 SRR4104638_R1_R2_paired.fq.gz -S SRR4104638_R1.sam

Can you correct it so that it can generate files like this:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

Try the modified code above.

ADD REPLY • link 7.5 years ago by GenoMax 152k

0

Entering edit mode

Thank you very much. Now its working perfectly.

ADD REPLY • link 7.5 years ago by Bioinfonext ▴ 470

0

Entering edit mode

Hi Genomax,

I am trying to run the hisat2 loop but not getting it correct: How I can modify command to remove extra R1

I am getting output like this:

hisat2 -x chrX_tran -1 Leaf_T1_F_R10_S1_L001_R1_R1_paired.fq.gz -2 Leaf_T1_F_R10_S1_L001_R1_R2_paired.fq.gz -S Leaf_T1_F_R10_S1_L001_R1.sam

command:

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x chrX_tran -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd2_file;done

input files are like this:

Root_T3_S_R2_S60_L001_R2_paired.fq.gz
Root_T3_S_R3_S59_L001_R1_paired.fq.gz
Root_T3_S_R3_S59_L001_R2_paired.fq.gz
Root_T3_S_R5_S58_L001_R1_paired.fq.gz
Root_T3_S_R5_S58_L001_R2_paired.fq.gz

I will be thankful for your time and advise.

Thanks Bioinfonext

ADD REPLY • link 5.9 years ago by Bioinfonext ▴ 470

0

Entering edit mode

I am not sure what you mean by extra R1? That loop should work with the example file names posted. Here I am just printing the name prefixes.

$ for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo ${i};done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001

You could also do :

$ for i in *R1*.gz; do name=$(basename ${i} _R1_paired.fq.gz); echo ${name}; done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001

ADD REPLY • link 5.9 years ago by GenoMax 152k

0

Entering edit mode

Dear genomax,

thanks for your all help. Can I run same loop for HTseq, I got alignment files and now now to run the loop to generate the read count for each alignment sam file saperetly. alignment file are like this:

 Root_T3_F_R9_S27_L001.sam

 Root_T3_S_R2_S60_L001.sam

 Root_T3_S_R3_S59_L001.sam

 Leaf_T1_F_R10_S1_L001.sam

 Leaf_T1_F_R2_S5_L001.sam

 Leaf_T1_F_R3_S4_L001.sam

for single file HTseq command:

htseq-count  Leaf_T1_F_R3_S4_L001.sam rice.gff  > Leaf_T1_F_R3_S4.count.txt

How can use above loop to generate HTseq command for all sam files.

I am thinking to use this, but how I can add count.txt to save output files.

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >> cmd2_file;done

above command generated output like this:

htseq-count  rice.gff  Leaf_T3_S_R6_S42_L001.sam
htseq-count  rice.gff   Leaf_T3_S_R7_S41_L001.sam

But if it can write script like this will be a great help:

htseq-count  rice.gff  Leaf_T3_S_R6_S42_L001.sam >  Leaf_T3_S_R6_S42_L001.count.txt

Thanks bioinfonext

ADD REPLY • link 5.9 years ago by Bioinfonext ▴ 470

0

Entering edit mode

I got out put file like txt by using below command:

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >$i\_L001.COUNT.txt>> cmd2_file;done

Please let me know if it is ok Thanks Bioinfonext

ADD REPLY • link 5.9 years ago by Bioinfonext ▴ 470

1

Entering edit mode

That looks ok. Instead of using using bare $i, it is a better practice to use ${i} to reference a variable.

ADD REPLY • link 5.9 years ago by GenoMax 152k

0

Entering edit mode

Hi Genomax,

HTSeq loop command:

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >$i\_L001.COUNT.txt>> cmd2_file;done

above command generate cmd file like this, it do not direct output to txt file in cmd file; instead of adding output txt file to cmd command, it generate empty txt file according to sample name.

htseq-count Leaf_T1_F_R10_S1_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

htseq-count Leaf_T1_F_R2_S5_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

htseq-count Leaf_T1_F_R3_S4_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

It will be great if it can also add output count txt file according to sample name in the command.

ADD REPLY • link 5.9 years ago by Bioinfonext ▴ 470

0

Entering edit mode

it will be great if can generate cmd file like this:

htseq-count Leaf_T1_F_R10_S1_L001.sam Oryza_indica.ASM465v1.44.chr.gtf >Leaf_T1_F_R10_S1_L001.count.txt

ADD REPLY • link 5.9 years ago by Bioinfonext ▴ 470

0

Entering edit mode

Use this:

$ for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo "htseq-count ${i}_L001.sam  Oryza_indica.ASM465v1.44.chr.gtf  >${i}_L001.COUNT.txt" >> cmd2_file;done

ADD REPLY • link 5.9 years ago by GenoMax 152k

Automation of trimmomatic

What is Trimmomatic?

✂️ About the script

🚩 Why should one use this?

🔧 Using the script:

ℹ️ Invoking help

⚠️ Error handling

ℹ️ Rename the fastq files as demo_R1.fq, demo_R2.fq. This checkpoint was essential to maintain integrity of the script.

ℹ️ Rename the fastq files as `demo_R1.fq`, `demo_R2.fq`. This checkpoint was essential to maintain integrity of the script.