Question

Running Trimmomatic for bulk RNAseq data

0

Entering edit mode

4.8 years ago

Farah ▴ 80

Hello,

I performed trimmomatic for two pairs of my *sortmerna_1.fq.gz and *sortmerna_2.fq.gz files as follows:

cd /scratch/fs/trimmomatic
find /scratch/fs/sortmerna -name "*sortmerna_[12].fq.gz" | sort | head -n 4 | while read FW_READ
do
 read RV_READ
 FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
 trimmomatic PE -threads 16 -phred64 "$FW_READ" "$RV_READ" \
 "$FILEBASE-trimmomatic_1.fq.gz" "$FILEBASE-trimmomatic-unpaired_1.fq.gz" \
 "$FILEBASE-trimmomatic_2.fq.gz" "$FILEBASE-trimmomatic-unpaired_2.fq.gz"  \
 ILLUMINACLIP:"/home/local/software/biobuilds/2017.11/libexec/trimmomatic-0.36/adapters/TruSeq3-PE-2.fa":2:30:10 \
 SLIDINGWINDOW:5:20 MINLEN:50 2> "$FILEBASE-timmomatic.log"
done

It produced 10 files (each file with size of 1) including *-sortmerna-timmomatic.log , *-sortmerna-trimmomatic-unpaired_1.fq.gz , *-sortmerna-trimmomatic-unpaired_2.fq.gz , *-sortmerna-trimmomatic_1.fq.gz, *-sortmerna-trimmomatic_2.fq.gz for each pair.

But, when I performed fastqc, as below, on the produced trimmomatic files, It failed to do that.

fastqc -o /scratch/fs/qa/trimmomatic -t 4 /scratch/fs/trimmomatic/*trimmomatic_[12].fq.gz

Failure message is as bellow:

Started analysis of TM_17-sortmerna-trimmomatic_1.fq.gz
Analysis complete for TM _17-sortmerna-trimmomatic_1.fq.gz
Started analysis of TM_17-sortmerna-trimmomatic_2.fq.gz
Analysis complete for TM_17-sortmerna-trimmomatic_2.fq.gz
Failed to process file TM_17-sortmerna-trimmomatic_1.fq.gz
Failed to process file TM_17-sortmerna-trimmomatic_2.fq.gz
java.lang.ArrayIndexOutOfBoundsException: -1
       at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.calculateDistribution(SequenceLengthDistribution.java:100)
       at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.raisesError(SequenceLengthDistribution.java:184)
       at uk.ac.babraham.FastQC.Report.HTMLReportArchive.startDocument(HTMLReportArchive.java:336)
       at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:84)
       at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:155)
       at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
       at java.lang.Thread.run(Thread.java:745)

Would you please advise me what I did wrong? and how to fix it? Thank you very much.

RNA-Seq Trimmomatic FastQC • 1.8k views

ADD COMMENT • link 4.8 years ago by Farah ▴ 80

0

Entering edit mode

It produced 10 files (each file with size of 1)

Is that 1 byte? That means you trimming did not work at all.

ADD REPLY • link 4.8 years ago by GenoMax 142k

0

Entering edit mode

Yes, each file is 1 kb. So, if the trimming did not work, would you please let me know why it did not work? what I did wrong in that scripts? Thanks a lot.

ADD REPLY • link 4.8 years ago by Farah ▴ 80

1

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

As for why the trimming did not work I suggest you start looking into the log file $FILEBASE-timmomatic.log for clues first.

Note: If you have data of recent vintage then it most certainly is not going to be -phred64. It should be phread+33.

ADD REPLY • link 4.8 years ago by GenoMax 142k

0

Entering edit mode

Sure. Thanks. As you suggested, I first looked into $FILEBASE-timmomatic.log which is showing the following:

TrimmomaticPE: Started with arguments:
 -threads 16 -phred64 /scratch/fs/sortmerna/TM_17-sortmerna_1.fq.gz /scratch/fs/sortmerna/TM_17-sortmerna_2.fq.gz TM_17-sortmerna-trimmomatic_1.fq.gz TM_17-sortmerna-trimmomatic-unpaired_1.fq.gz TM_17-sortmerna-trimmomatic_2.fq.gz TM_17-sortmerna-trimmomatic-unpaired_2.fq.gz ILLUMINACLIP:/home/local/software/biobuilds/2017.11/libexec/trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:5:20 MINLEN:50
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 49410501 Both Surviving: 0 (0.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 49410501 (100.00%)
TrimmomaticPE: Completed successfully

From the above, I do not know what is the problem. Also, the data was obtained on 2015. I am not sure whether I should use -phred64 or phread+33.

May I kindly ask for your advise. Thank you.

ADD REPLY • link updated 4.8 years ago by GenoMax 142k • written 4.8 years ago by Farah ▴ 80

1

Entering edit mode

It looks like all of your reads are getting discarded. Are they shorter than 50 bp? If they are then you should remove MINLEN:50 as well. You can also remove -phred64 while you are at it.

I will also recommend that you try bbduk.sh from BBMap suite, as an alternative (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/ ) . Simple to understand options, easy to use and fast.

ADD REPLY • link 4.8 years ago by GenoMax 142k

0

Entering edit mode

Thanks. The sequence length is 101 bp.

ADD REPLY • link 4.8 years ago by Farah ▴ 80

1

Entering edit mode

Run without -phred64 and let us know what happens. It would be highly unlikely that your entire dataset only has adapter dimers etc.

ADD REPLY • link 4.8 years ago by GenoMax 142k

0

Entering edit mode

Thank you for your advise. I ran without -phred64. It could finish the analysis only for one pair of my *18-sortmerna_1.fq.gz and *18-sortmerna_2.fq.gz files. While another pair remains without trimmomatic.

I then ran fastqc to check trimmomatic output. It looks good in "per base sequence quality", but now basic statistic section of fastqc shows the sequence length of 50-101 bp, while for my raw fq.gz data and also sortmerna-filtered data it is 101 bp. Is it a potential problem or it is normal?

Also, FastQC shows Encoding of my reads as: Sanger / Illumina 1.9. Which I do not know it should be -phred64 or phred+33.

At this step and condition, would you please guide me how can I improve it to do trimmomatic for all of my files (and not only for one pair).

Many thanks for your help.

ADD REPLY • link 4.8 years ago by Farah ▴ 80

1

Entering edit mode

While another pair remains without trimmomatic.

Does it mean there is an error or there is no output?

now basic statistic section of fastqc shows the sequence length of 50-101 bp

That is expected because now your reads have been trimmed so they are no longer all of equal original length. Since you had minlen:50 in there reads that went below that size were dropped from output.

FastQC shows Encoding of my reads as: Sanger / Illumina 1.9

So -phred33 is correct encoding as I had expected.

ADD REPLY • link 4.8 years ago by GenoMax 142k

0

Entering edit mode

Thanks a lot for your great help. The size of my standard error output (stderr) from my job submission is zero. So, It did not produce any error. But. it just successfully performed trimmomatic for one pair of my fq.gz files, and there is no output for another pair (I mean, in the output file, there are only five generated files which all belong to the first pair of my files), while the scripts is a loop that should perform it for all of my fq.gz files.

ADD REPLY • link 4.8 years ago by Farah ▴ 80

0

Entering edit mode

Many thanks to genomax for the helpful guide. I could successfully run trimmomatic for all of my samples. Thanks again.

ADD REPLY • link 4.8 years ago by Farah ▴ 80

0

Entering edit mode

Try running your loop with echo in front of trimmomatic. It will show you how the command would look like. See if that's correct.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

0

Entering edit mode

As you suggested, I put echo in front of trimmomatic as below:

....

 FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
 trimmomatic echo PE -threads 16 -phred64 "$FW_READ" "$RV_READ" \
 "$FILEBASE-trimmomatic_1.fq.gz" "$FILEBASE-trimmomatic-unpaired_1.fq.gz" \
 "$FILEBASE-trimmomatic_2.fq.gz" "$FILEBASE-trimmomatic-unpaired_2.fq.gz"  \
 ILLUMINACLIP:"/home/local/software/biobuilds/2017.11/libexec/trimmomatic-0.36/adapters/TruSeq3-PE-2.fa":2:30:10 \
 SLIDINGWINDOW:5:20 MINLEN:50 2> "$FILEBASE-timmomatic.log"
done

But it just produced log files, no other files.

ADD REPLY • link updated 4.8 years ago by GenoMax 142k • written 4.8 years ago by Farah ▴ 80