automated script to trim paired end RNA seq reads using Trimmomatic
1
1
Entering edit mode
3.7 years ago

Hi all,

I am using an automated script to trim paired end fastq.gz files using Trimmomatic. I have this code:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
TrimLog $f1 $f2 -baseout filtered.fq.gz SLIDINGWINDOW:5:20
done

I would like to name my output files according to the input files. For example: Say I have these 2 input files: 12345.1.fastq.gz and 12345.2.fastq.gz. I would like the output files (using the -baseout flag) to be named: filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz filtered.12345.2P.fastq.gz filtered.12345.2U.fastq.gz

(Trimmomatic spits out 4 output files, and automatically adds The 1P, 1U, 2P, 2U, parts of the output file names. ) Does anyone know how to edit my script to do this? Thanks!

RNA-Seq trimming trimmomatic loops • 2.7k views
ADD COMMENT
1
Entering edit mode

You can specify the outputs in the command:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

So, for your code, I think it would be something like this, using basename to get each file prefix:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
TrimLog $f1 $f2 filtered.$(basename $f1 .1.fastq.gz).1P.fastq.gz filtered.$(basename $f1 .1.fastq.gz).1U.fast.gz filtered.$(basename $f2 .2.fastq.gz).2P.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2U.fastq.gz SLIDINGWINDOW:5:20
done

Alternatively, you can create a tab-delimited file, and use xargs:

prefixes.txt:

12345.1        12345.2
67890.1        67890.2
...

Save as trimmomatic.sh:

#!/usr/bin/bash
 java -jar /Trimmomatic-0.36/trimmomatic-0.36.jar PE -trimlog 
    TrimLog $1 $2 filtered."$1".1P.fastq.gz filtered."$1".1U.fast.gz filtered."$2".2P.fastq.gz filtered."$2".2U.fastq.gz SLIDINGWINDOW:5:20

Run as:

cat prefixes.txt  | xargs -n 2 bash trimmomatic.sh
ADD REPLY
0
Entering edit mode

Thanks for your reply! I do have another question regarding using the basename to get each file prefix. Your code outputs my files as: filtered.12345.1.fastq.gz.1P.fastq.gz filtered.12345.1.fastq.gz.1U.fastq.gz filtered.12345.1.fastq.gz.2P.fastq.gz filtered.12345.1.fastq.gz.2U.fastq.gz

Is there a way to edit your script so that it is just filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz etc etc ? So, just taking out the extra fastq.gz part?

ADD REPLY
0
Entering edit mode

I'm not sure why you're getting those outputs. Running this test, with 12345.1.fastq.gz in a dir:

#!/bin/bash
for f1 in *1.fastq.gz
do
    f2=${f1%%1.fastq.gz}"2.fastq.gz"
    echo $f2     # double check f2 filename
    echo filtered.$(basename $f1 .1.fastq.gz).1P.fastq.gz filtered.$(basename $f1 .1.fastq.gz).1U.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2P.fastq.gz filtered.$(basename $f2 .2.fastq.gz).2U.fastq.gz
done

Yields:

12345.2.fastq.gz
filtered.12345.1P.fastq.gz filtered.12345.1U.fastq.gz filtered.12345.2P.fastq.gz filtered.12345.2U.fastq.gz
ADD REPLY
0
Entering edit mode

that is very weird. I ran that exact script and this is what I got: ( i know the file names are different than 12345.1... just used that for simplicity. Maybe these different file names are the cause of the problem? )

CTG-0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz
filtered.CTG-0011_ACAGTG_AC62F2ANXX_L003_001.R1.fastq.gz.1P.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R1.fastq.gz.1U.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz.2P.fastq.gz filtered.CTG-
0011_ACAGTG_AC62F2ANXX_L003_001.R2.fastq.gz.2U.fastq.gz
ADD REPLY
0
Entering edit mode

I believe I got it to work by adding:

(basename -s .fastq.gz $f1) so it strips off the suffix of $f1. Thanks for all the help.

ADD REPLY
0
Entering edit mode
3.7 years ago
h.mon 32k

(In my opinion) your question isn't really a bioinformatics question, it is a bash scripting question. If you ask this on StackOverflow you will get tons of answers (or your question will be closed because variations of it have been asked before).

#!/bin/bash
for f1 in *1.fastq.gz
do
    base=${f1%%1.fastq.gz}
    echo "Sample is $base"
    echo "Original fastq files are ${base}1.fastq.gz  ${base}2.fastq.gz"
    echo "Output fastq files are ${base}_filtered.1.fastq.gz ${base}_filtered.2.fastq.gz"
done
ADD COMMENT
0
Entering edit mode

In my opinion, this is a bioinformatics question because it involves scripting and the original poster is trying to solve a bioinformatics problem.

ADD REPLY
0
Entering edit mode

Thats fine, I may be wrong on this one. For the record, I just stated my opinion, I didn't close the question nor moderated it in any other way.

ADD REPLY

Login before adding your answer.

Traffic: 1724 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6