Question: automated script to trim paired end RNA seq reads using Trimmomatic
1
gravatar for nikelle.petrillo
2.8 years ago by
Providence College, Providence, RI
nikelle.petrillo100 wrote:

Hi all,

I am using an automated script to trim paired end fastq.gz files using Trimmomatic. I have this code:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
TrimLog $f1 $f2 -baseout filtered.fq.gz SLIDINGWINDOW:5:20
done

I would like to name my output files according to the input files. For example: Say I have these 2 input files: 12345.1.fastq.gz and 12345.2.fastq.gz. I would like the output files (using the -baseout flag) to be named: filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz filtered.12345.2P.fastq.gz filtered.12345.2U.fastq.gz

(Trimmomatic spits out 4 output files, and automatically adds The 1P, 1U, 2P, 2U, parts of the output file names. ) Does anyone know how to edit my script to do this? Thanks!

ADD COMMENTlink modified 2.8 years ago by h.mon29k • written 2.8 years ago by nikelle.petrillo100
1

You can specify the outputs in the command:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

So, for your code, I think it would be something like this, using basename to get each file prefix:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
TrimLog $f1 $f2 filtered.$(basename $f1 .1.fastq.gz).1P.fastq.gz filtered.$(basename $f1 .1.fastq.gz).1U.fast.gz filtered.$(basename $f2 .2.fastq.gz).2P.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2U.fastq.gz SLIDINGWINDOW:5:20
done

Alternatively, you can create a tab-delimited file, and use xargs:

prefixes.txt:

12345.1        12345.2
67890.1        67890.2
...

Save as trimmomatic.sh:

#!/usr/bin/bash
 java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
    TrimLog $1 $2 filtered."$1".1P.fastq.gz filtered."$1".1U.fast.gz filtered."$2".2P.fastq.gz filtered."$2".2U.fastq.gz SLIDINGWINDOW:5:20

Run as:

cat prefixes.txt  | xargs -n 2 bash trimmomatic.sh
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by st.ph.n2.5k

Thanks for your reply! I do have another question regarding using the basename to get each file prefix. Your code outputs my files as: filtered.12345.1.fastq.gz.1P.fastq.gz filtered.12345.1.fastq.gz.1U.fastq.gz filtered.12345.1.fastq.gz.2P.fastq.gz filtered.12345.1.fastq.gz.2U.fastq.gz

Is there a way to edit your script so that it is just filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz etc etc ? So, just taking out the extra fastq.gz part?

ADD REPLYlink written 2.8 years ago by nikelle.petrillo100

I'm not sure why you're getting those outputs. Running this test, with 12345.1.fastq.gz in a dir:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    echo $f2     # double check f2 filename
    echo filtered.$(basename $f1 .1.fastq.gz).1P.fastq.gz filtered.$(basename $f1 .1.fastq.gz).1U.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2P.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2U.fastq.gz
done

Yields:

12345.2.fastq.gz
filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz filtered.12345.2P.fastq.gz filtered.12345.2U.fastq.gz
ADD REPLYlink written 2.8 years ago by st.ph.n2.5k

that is very weird. I ran that exact script and this is what I got: ( i know the file names are different than 12345.1... just used that for simplicity. Maybe these different file names are the cause of the problem? )

CTG-0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz
filtered.CTG-0011_ACAGTG_AC62F2ANXX_L003_001.R1.fastq.gz.1P.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R1.fastq.gz.1U.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz.2P.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz.2U.fastq.gz
ADD REPLYlink written 2.8 years ago by nikelle.petrillo100

I believe I got it to work by adding:

(basename -s .fastq.gz $f1) so it strips off the suffix of $f1. Thanks for all the help.

ADD REPLYlink written 2.8 years ago by nikelle.petrillo100
0
gravatar for h.mon
2.8 years ago by
h.mon29k
Brazil
h.mon29k wrote:

(In my opinion) your question isn't really a bioinformatics question, it is a bash scripting question. If you ask this on StackOverflow you will get tons of answers (or your question will be closed because variations of it have been asked before).

#!/bin/bash
for f1 in *1.fastq.gz
do
    base=${f1%%1.fastq.gz}
    echo "Sample is $base"
    echo "Original fastq files are ${base}1.fastq.gz  ${base}2.fastq.gz"
    echo "Output fastq files are ${base}_filtered.1.fastq.gz ${base}_filtered.2.fastq.gz"
done
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by h.mon29k

In my opinion, this is a bioinformatics question because it involves scripting and the original poster is trying to solve a bioinformatics problem.

ADD REPLYlink written 2.8 years ago by Istvan Albert ♦♦ 84k

Thats fine, I may be wrong on this one. For the record, I just stated my opinion, I didn't close the question nor moderated it in any other way.

ADD REPLYlink written 2.8 years ago by h.mon29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1341 users visited in the last hour