Question: Running Trimmomatic for bulk RNAseq data
0
gravatar for F. Golestan
12 days ago by
F. Golestan10
F. Golestan10 wrote:

Hello,

I performed trimmomatic for two pairs of my *sortmerna_1.fq.gz and *sortmerna_2.fq.gz files as follows:

cd /scratch/fs/trimmomatic
find /scratch/fs/sortmerna -name "*sortmerna_[12].fq.gz" | sort | head -n 4 | while read FW_READ
do
 read RV_READ
 FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
 trimmomatic PE -threads 16 -phred64 "$FW_READ" "$RV_READ" \
 "$FILEBASE-trimmomatic_1.fq.gz" "$FILEBASE-trimmomatic-unpaired_1.fq.gz" \
 "$FILEBASE-trimmomatic_2.fq.gz" "$FILEBASE-trimmomatic-unpaired_2.fq.gz"  \
 ILLUMINACLIP:"/home/local/software/biobuilds/2017.11/libexec/trimmomatic-0.36/adapters/TruSeq3-PE-2.fa":2:30:10 \
 SLIDINGWINDOW:5:20 MINLEN:50 2> "$FILEBASE-timmomatic.log"
done

It produced 10 files (each file with size of 1) including *-sortmerna-timmomatic.log , *-sortmerna-trimmomatic-unpaired_1.fq.gz , *-sortmerna-trimmomatic-unpaired_2.fq.gz , *-sortmerna-trimmomatic_1.fq.gz, *-sortmerna-trimmomatic_2.fq.gz for each pair.

But, when I performed fastqc, as below, on the produced trimmomatic files, It failed to do that.

fastqc -o /scratch/fs/qa/trimmomatic -t 4 /scratch/fs/trimmomatic/*trimmomatic_[12].fq.gz

Failure message is as bellow:

Started analysis of TM_17-sortmerna-trimmomatic_1.fq.gz
Analysis complete for TM _17-sortmerna-trimmomatic_1.fq.gz
Started analysis of TM_17-sortmerna-trimmomatic_2.fq.gz
Analysis complete for TM_17-sortmerna-trimmomatic_2.fq.gz
Failed to process file TM_17-sortmerna-trimmomatic_1.fq.gz
Failed to process file TM_17-sortmerna-trimmomatic_2.fq.gz
java.lang.ArrayIndexOutOfBoundsException: -1
       at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.calculateDistribution(SequenceLengthDistribution.java:100)
       at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.raisesError(SequenceLengthDistribution.java:184)
       at uk.ac.babraham.FastQC.Report.HTMLReportArchive.startDocument(HTMLReportArchive.java:336)
       at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:84)
       at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:155)
       at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
       at java.lang.Thread.run(Thread.java:745)

Would you please advise me what I did wrong? and how to fix it? Thank you very much.

fastqc rna-seq trimmomatic • 174 views
ADD COMMENTlink modified 12 days ago • written 12 days ago by F. Golestan10

It produced 10 files (each file with size of 1)

Is that 1 byte? That means you trimming did not work at all.

ADD REPLYlink written 12 days ago by genomax70k

Yes, each file is 1 kb. So, if the trimming did not work, would you please let me know why it did not work? what I did wrong in that scripts? Thanks a lot.

ADD REPLYlink written 12 days ago by F. Golestan10
1

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

As for why the trimming did not work I suggest you start looking into the log file $FILEBASE-timmomatic.log for clues first.

Note: If you have data of recent vintage then it most certainly is not going to be -phred64. It should be phread+33.

ADD REPLYlink modified 12 days ago • written 12 days ago by genomax70k

Sure. Thanks. As you suggested, I first looked into $FILEBASE-timmomatic.log which is showing the following:

TrimmomaticPE: Started with arguments:
 -threads 16 -phred64 /scratch/fs/sortmerna/TM_17-sortmerna_1.fq.gz /scratch/fs/sortmerna/TM_17-sortmerna_2.fq.gz TM_17-sortmerna-trimmomatic_1.fq.gz TM_17-sortmerna-trimmomatic-unpaired_1.fq.gz TM_17-sortmerna-trimmomatic_2.fq.gz TM_17-sortmerna-trimmomatic-unpaired_2.fq.gz ILLUMINACLIP:/home/local/software/biobuilds/2017.11/libexec/trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:5:20 MINLEN:50
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 49410501 Both Surviving: 0 (0.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 49410501 (100.00%)
TrimmomaticPE: Completed successfully

From the above, I do not know what is the problem. Also, the data was obtained on 2015. I am not sure whether I should use -phred64 or phread+33.

May I kindly ask for your advise. Thank you.

ADD REPLYlink modified 11 days ago by genomax70k • written 11 days ago by F. Golestan10
1

It looks like all of your reads are getting discarded. Are they shorter than 50 bp? If they are then you should remove MINLEN:50 as well. You can also remove -phred64 while you are at it.

I will also recommend that you try bbduk.sh from BBMap suite, as an alternative (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/ ) . Simple to understand options, easy to use and fast.

ADD REPLYlink modified 11 days ago • written 11 days ago by genomax70k

Thanks. The sequence length is 101 bp.

ADD REPLYlink written 11 days ago by F. Golestan10
1

Run without -phred64 and let us know what happens. It would be highly unlikely that your entire dataset only has adapter dimers etc.

ADD REPLYlink modified 11 days ago • written 11 days ago by genomax70k

Thank you for your advise. I ran without -phred64. It could finish the analysis only for one pair of my *18-sortmerna_1.fq.gz and *18-sortmerna_2.fq.gz files. While another pair remains without trimmomatic.

I then ran fastqc to check trimmomatic output. It looks good in "per base sequence quality", but now basic statistic section of fastqc shows the sequence length of 50-101 bp, while for my raw fq.gz data and also sortmerna-filtered data it is 101 bp. Is it a potential problem or it is normal?

Also, FastQC shows Encoding of my reads as: Sanger / Illumina 1.9. Which I do not know it should be -phred64 or phred+33.

At this step and condition, would you please guide me how can I improve it to do trimmomatic for all of my files (and not only for one pair).

Many thanks for your help.

ADD REPLYlink written 10 days ago by F. Golestan10
1

While another pair remains without trimmomatic.

Does it mean there is an error or there is no output?

now basic statistic section of fastqc shows the sequence length of 50-101 bp

That is expected because now your reads have been trimmed so they are no longer all of equal original length. Since you had minlen:50 in there reads that went below that size were dropped from output.

FastQC shows Encoding of my reads as: Sanger / Illumina 1.9

So -phred33 is correct encoding as I had expected.

ADD REPLYlink modified 10 days ago • written 10 days ago by genomax70k

Thanks a lot for your great help. The size of my standard error output (stderr) from my job submission is zero. So, It did not produce any error. But. it just successfully performed trimmomatic for one pair of my fq.gz files, and there is no output for another pair (I mean, in the output file, there are only five generated files which all belong to the first pair of my files), while the scripts is a loop that should perform it for all of my fq.gz files.

ADD REPLYlink written 10 days ago by F. Golestan10

Many thanks to genomax for the helpful guide. I could successfully run trimmomatic for all of my samples. Thanks again.

ADD REPLYlink written 7 days ago by F. Golestan10

Try running your loop with echo in front of trimmomatic. It will show you how the command would look like. See if that's correct.

ADD REPLYlink written 12 days ago by WouterDeCoster40k

As you suggested, I put echo in front of trimmomatic as below:

....

 FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
 trimmomatic echo PE -threads 16 -phred64 "$FW_READ" "$RV_READ" \
 "$FILEBASE-trimmomatic_1.fq.gz" "$FILEBASE-trimmomatic-unpaired_1.fq.gz" \
 "$FILEBASE-trimmomatic_2.fq.gz" "$FILEBASE-trimmomatic-unpaired_2.fq.gz"  \
 ILLUMINACLIP:"/home/local/software/biobuilds/2017.11/libexec/trimmomatic-0.36/adapters/TruSeq3-PE-2.fa":2:30:10 \
 SLIDINGWINDOW:5:20 MINLEN:50 2> "$FILEBASE-timmomatic.log"
done

But it just produced log files, no other files.

ADD REPLYlink modified 11 days ago by genomax70k • written 11 days ago by F. Golestan10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2121 users visited in the last hour