Question

automated script to trim paired end RNA seq reads using Trimmomatic

1

Entering edit mode

6.7 years ago

nikelle.petrillo ▴ 110

Hi all,

I am using an automated script to trim paired end fastq.gz files using Trimmomatic. I have this code:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
TrimLog $f1 $f2 -baseout filtered.fq.gz SLIDINGWINDOW:5:20
done

I would like to name my output files according to the input files. For example: Say I have these 2 input files: 12345.1.fastq.gz and 12345.2.fastq.gz. I would like the output files (using the -baseout flag) to be named: filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz filtered.12345.2P.fastq.gz filtered.12345.2U.fastq.gz

(Trimmomatic spits out 4 output files, and automatically adds The 1P, 1U, 2P, 2U, parts of the output file names. ) Does anyone know how to edit my script to do this? Thanks!

RNA-Seq trimming trimmomatic loops • 5.1k views

ADD COMMENT • link updated 6.7 years ago by h.mon 35k • written 6.7 years ago by nikelle.petrillo ▴ 110

1

Entering edit mode

You can specify the outputs in the command:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

So, for your code, I think it would be something like this, using basename to get each file prefix:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
TrimLog $f1 $f2 filtered.$(basename $f1 .1.fastq.gz).1P.fastq.gz filtered.$(basename $f1 .1.fastq.gz).1U.fast.gz filtered.$(basename $f2 .2.fastq.gz).2P.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2U.fastq.gz SLIDINGWINDOW:5:20
done

Alternatively, you can create a tab-delimited file, and use xargs:

prefixes.txt:

12345.1        12345.2
67890.1        67890.2
...

Save as trimmomatic.sh:

#!/usr/bin/bash
 java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
    TrimLog $1 $2 filtered."$1".1P.fastq.gz filtered."$1".1U.fast.gz filtered."$2".2P.fastq.gz filtered."$2".2U.fastq.gz SLIDINGWINDOW:5:20

Run as:

cat prefixes.txt  | xargs -n 2 bash trimmomatic.sh

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thanks for your reply! I do have another question regarding using the basename to get each file prefix. Your code outputs my files as: filtered.12345.1.fastq.gz.1P.fastq.gz filtered.12345.1.fastq.gz.1U.fastq.gz filtered.12345.1.fastq.gz.2P.fastq.gz filtered.12345.1.fastq.gz.2U.fastq.gz

Is there a way to edit your script so that it is just filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz etc etc ? So, just taking out the extra fastq.gz part?

ADD REPLY • link 6.7 years ago by nikelle.petrillo ▴ 110

0

Entering edit mode

I'm not sure why you're getting those outputs. Running this test, with 12345.1.fastq.gz in a dir:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    echo $f2     # double check f2 filename
    echo filtered.$(basename $f1 .1.fastq.gz).1P.fastq.gz filtered.$(basename $f1 .1.fastq.gz).1U.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2P.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2U.fastq.gz
done

Yields:

12345.2.fastq.gz
filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz filtered.12345.2P.fastq.gz filtered.12345.2U.fastq.gz

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

that is very weird. I ran that exact script and this is what I got: ( i know the file names are different than 12345.1... just used that for simplicity. Maybe these different file names are the cause of the problem? )

CTG-0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz
filtered.CTG-0011_ACAGTG_AC62F2ANXX_L003_001.R1.fastq.gz.1P.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R1.fastq.gz.1U.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz.2P.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz.2U.fastq.gz

ADD REPLY • link 6.7 years ago by nikelle.petrillo ▴ 110

0

Entering edit mode

I believe I got it to work by adding:

(basename -s .fastq.gz $f1) so it strips off the suffix of $f1. Thanks for all the help.

ADD REPLY • link 6.7 years ago by nikelle.petrillo ▴ 110

score 0 · Answer 1 · 2017-08-23

0

Entering edit mode

6.7 years ago

h.mon 35k

(In my opinion) your question isn't really a bioinformatics question, it is a bash scripting question. If you ask this on StackOverflow you will get tons of answers (or your question will be closed because variations of it have been asked before).

#!/bin/bash
for f1 in *1.fastq.gz
do
    base=${f1%%1.fastq.gz}
    echo "Sample is $base"
    echo "Original fastq files are ${base}1.fastq.gz  ${base}2.fastq.gz"
    echo "Output fastq files are ${base}_filtered.1.fastq.gz ${base}_filtered.2.fastq.gz"
done

ADD COMMENT • link 6.7 years ago by h.mon 35k

0

Entering edit mode

In my opinion, this is a bioinformatics question because it involves scripting and the original poster is trying to solve a bioinformatics problem.

ADD REPLY • link 6.7 years ago by Istvan Albert 100k

0

Entering edit mode

Thats fine, I may be wrong on this one. For the record, I just stated my opinion, I didn't close the question nor moderated it in any other way.

ADD REPLY • link 6.7 years ago by h.mon 35k