Trimmomatic low PE returns
9.6 years ago
oliver.tills ▴ 10

Hi,

I have a query about using Trimmomatic on PE MiSeq data. I am getting what seem to be quite low paired returns: Input Read Pairs: 982783; Both Surviving: 346732 (35.28%); Forward Only Surviving: 635840 (64.70%); Reverse Only Surviving: 17 (0.00%); Dropped: 194 (0.02%). Is this typical, and can anyone suggest what the problem might be? See the run below. The problem appears to be the adapter trimming rather than the quality trimming: if I remove the adapter trimming step, PE returns increase to > 60%.

Can anyone offer any advice?

Thanks, Oli

TrimmomaticPE: Started with arguments: /home/otills/data/all/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_1.sanfastq.gz \
/home/otills/data/all/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_2.sanfastq.gz \
/home/otills/data/all/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_1.sanfastq.fq.gz \
/home/otills/data/all/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_1.sanfastq.bd.fq.gz \
/home/otills/data/all/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_2.sanfastq.fq.gz \
/home/otills/data/all/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_2.sanfastq.bd.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:2 MINLEN:20

Multiple cores found: Using 8 threads
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 982783 Both Surviving: 346732 (35.28%) Forward Only Surviving: 635840 (64.70%) Reverse Only Surviving: 17 (0.00%) Dropped: 194 (0.02%)
TrimmomaticPE: Completed successfully
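
For anyone who wants to check the numbers in that summary line, here is a minimal Python sketch that pulls the counts out and recomputes the fraction of pairs that lost only the reverse read. The function name parse_summary and the regular expression are my own and assume the summary format shown in the log above; they are not part of Trimmomatic.

```python
import re

# Summary line copied from the Trimmomatic log above.
SUMMARY = ("Input Read Pairs: 982783 Both Surviving: 346732 (35.28%) "
           "Forward Only Surviving: 635840 (64.70%) "
           "Reverse Only Surviving: 17 (0.00%) Dropped: 194 (0.02%)")


def parse_summary(line):
    """Return the read-pair counts from a TrimmomaticPE summary line."""
    pattern = (r"Input Read Pairs: (\d+) Both Surviving: (\d+) .*?"
               r"Forward Only Surviving: (\d+) .*?"
               r"Reverse Only Surviving: (\d+) .*?Dropped: (\d+)")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a TrimmomaticPE summary line")
    keys = ("input", "both", "forward_only", "reverse_only", "dropped")
    return dict(zip(keys, map(int, match.groups())))


counts = parse_summary(SUMMARY)
# Fraction of pairs in which only the reverse read was discarded.
reverse_lost = counts["forward_only"] / counts["input"]
print(counts)
print(f"pairs that lost only the reverse read: {reverse_lost:.2%}")
```

The four outcome counts add up to the input count, so almost every pair that failed to survive as a pair failed because the reverse read alone was removed.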
RNA-Seq • 5.2k views

Run FastQC on the files and see if anything is obviously wrong with them.

9.6 years ago
oliver.tills ▴ 10

I don't think there is anything obviously wrong with the files - https://www.dropbox.com/s/gz49ljhbxs95ont/140331_C1CR_M01145_0120_000000000-A6UJV_1_IL-TP-014_1.sanfastq_fastqc.html?dl=0.

However, I think I've figured out what is happening. The adapter trimming is running in 'palindrome' mode, and the Trimmomatic manual says this about the keepBothReads parameter: 'After read-through has been detected by palindrome mode, and the adapter sequence removed, the reverse read contains the same sequence information as the forward read, albeit in reverse complement. For this reason, the default behaviour is to entirely drop the reverse read. By specifying "true" for this parameter, the reverse read will also be retained, which may be useful e.g. if the downstream tools cannot handle a combination of paired and unpaired reads.'
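
To convince myself of the "same sequence information, albeit in reverse complement" part, here is a toy Python sketch. It is not how Trimmomatic detects read-through: the fragment and the short adapter stub are made up for illustration, the reads are error-free, and the clipping step simply assumes the insert length is already known.

```python
# Toy sequences, invented for illustration; not real reads.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def sequence_pair(fragment, adapter_fwd, adapter_rev, read_len):
    """Simulate error-free paired reads from one fragment.

    Each read starts at one end of the fragment and runs into the
    adapter on the far side if the fragment is shorter than read_len.
    """
    forward = (fragment + adapter_fwd)[:read_len]
    reverse = (revcomp(fragment) + adapter_rev)[:read_len]
    return forward, reverse


fragment = "ATGGCGTACCTGAAGTCA"   # hypothetical 18 bp insert
adapter_fwd = "AGATCGGAAGAGC"     # placeholder adapter stub (assumption)
adapter_rev = "AGATCGGAAGAGC"     # placeholder adapter stub (assumption)
forward, reverse = sequence_pair(fragment, adapter_fwd, adapter_rev, read_len=25)

# Clip everything past the insert, as adapter removal would.
forward_clipped = forward[:len(fragment)]
reverse_clipped = reverse[:len(fragment)]

# After clipping, the reverse read is the reverse complement of the forward read.
print(forward_clipped == revcomp(reverse_clipped))
```

With an insert shorter than the read length, both reads cover the whole insert, so the clipped reverse read adds no new bases, which is the manual's justification for dropping it.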

If I set this to true, I get ~99% of read pairs surviving. I am not sure I fully understand what is happening here, but I guess that because the MiSeq reads are quite long (250 bp), I am getting high levels of adapter read-through.
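
In case it helps anyone else, this is roughly how the step looks with the extra fields. As I read the manual, the order is ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>, and keepBothReads is positional, so minAdapterLength has to be given as well. The helper below is a hypothetical Python sketch that only builds the string; the value 8 for minAdapterLength is the default I understand the manual to state, so treat it as an assumption and check your version.

```python
def illuminaclip_step(adapters, seed_mismatches, palindrome_threshold,
                      simple_threshold, min_adapter_length=8,
                      keep_both_reads=True):
    """Build an ILLUMINACLIP step string for the Trimmomatic command line.

    Hypothetical helper: it formats the colon-separated fields only and
    does not run Trimmomatic or validate the values.
    """
    fields = [adapters, seed_mismatches, palindrome_threshold,
              simple_threshold, min_adapter_length,
              "true" if keep_both_reads else "false"]
    return "ILLUMINACLIP:" + ":".join(str(field) for field in fields)


# Same settings as the run above, with keepBothReads switched on.
step = illuminaclip_step("TruSeq3-PE.fa", 2, 30, 10)
print(step)  # ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true
```

The resulting string replaces ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 in the original command; the other steps stay as they were.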


Interesting observation, thanks for sharing. Your data does have plenty going on, including lots of adapter contamination, but clearly not at a level that should remove so much of it.

The fact that Trimmomatic drops the reverse read of palindromic pairs is pretty crazy, and I had never realized it myself. Usually I don't even care about palindromes. The palindrome behavior is a nice trick for detecting very short read-through, but removing data should be explicitly specified by the user.

The reads are not incorrect, perhaps redundant but the solution is not to destroy pairing information.


Wow, I never knew about that palindrome option (I don't normally use Trimmomatic). The default should really be true for that.


Two questions linked to this: i) is this situation normal for PE MiSeq reads, and ii) what are people's opinions on the best way to prepare these data for assembly and mapping?


When reads overlap substantially, it may be best to merge them into a single read (using tools like FLASH and many others). Typical paired-end data has a "gap" between the two reads. Once reads overlap, each fragment will produce regions that are measured twice, redundantly, by both reads, and regions that are covered only once, by a single read. This will alter the assumptions of just about any mathematical or statistical model that downstream tools may rely on. In that case it is best to merge the reads.

On the other hand there are tools that require paired end reads as input.

Long story short, data analysis gets a little more complicated, as if it wasn't complicated enough already.
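
To put rough numbers on the gap versus overlap picture above, here is a small Python sketch of the arithmetic. It assumes both reads have the same length and are error-free; the 250 bp read length comes from this thread, while the fragment lengths are hypothetical examples.

```python
def pair_geometry(fragment_len, read_len):
    """Describe how two equal-length paired reads cover one fragment.

    Returns (covered_twice, covered_once, gap, adapter_bases_per_read),
    all in bases.
    """
    sequenced = min(read_len, fragment_len)   # insert bases seen by each read
    covered_twice = max(0, 2 * sequenced - fragment_len)
    covered_once = 2 * sequenced - 2 * covered_twice
    gap = max(0, fragment_len - 2 * read_len)
    read_through = max(0, read_len - fragment_len)
    return covered_twice, covered_once, gap, read_through


READ_LEN = 250  # MiSeq read length mentioned in this thread
for fragment_len in (600, 400, 200):  # hypothetical fragment sizes
    twice, once, gap, adapter = pair_geometry(fragment_len, READ_LEN)
    print(f"fragment {fragment_len} bp: {twice} bp covered twice, "
          f"{once} bp covered once, {gap} bp gap, "
          f"{adapter} adapter bases per read")
```

A 600 bp fragment leaves a 100 bp gap, a 400 bp fragment is covered twice over its middle 100 bp, and a 200 bp fragment is covered twice end to end with 50 adapter bases in each read, which is the read-through case discussed earlier in the thread.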
