Comparing mapping results across different Trimmomatic SLIDINGWINDOW threshold values
6.8 years ago
Hughie ▴ 80

Hello!

Recently, I trimmed my poor-tail .fastq dataset with Trimmomatic's SLIDINGWINDOW parameter, using threshold values from 15 to 30 (i.e. SLIDINGWINDOW:4:15 through SLIDINGWINDOW:4:30). Afterwards, I mapped the reads with STAR, but I found that as the threshold increased, the total, uniquely, and multiply mapped reads all decreased across my 5 test samples.

So I'm puzzled by this situation: shouldn't it be normal for my total or uniquely mapped reads to increase as the trimming threshold increases?

[Image: mapping results across trimming thresholds]

I would really appreciate your help!

RNA-Seq next-gen alignment
6.8 years ago

Sliding-window trimming is not optimal. Since there is an optimal algorithm for this case (the Phred algorithm), I suggest you never use a sliding window. The Phred algorithm is implemented in seqtk and BBDuk.
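To make the suggestion concrete, here is a minimal sketch of the Phred-style (Mott) trimming idea: score each base as (error limit minus its error probability) and keep the maximal-scoring contiguous segment. The function name and the 0.05 error limit are my own illustrative choices, not taken from seqtk's or BBDuk's actual implementations.

```python
def phred_trim(quals, error_limit=0.05):
    """Return (start, end) of the maximal-scoring segment of a read.

    Each base scores (error_limit - p_error), where p_error = 10^(-Q/10);
    the best contiguous segment is found with Kadane's algorithm, so a
    run of low-quality bases anywhere in the read gets cut away.
    """
    best_start = best_end = 0
    best_score = 0.0
    cur_start = 0
    cur_score = 0.0
    for i, q in enumerate(quals):
        cur_score += error_limit - 10 ** (-q / 10)
        if cur_score <= 0:
            # Segment so far is a net loss: restart after this base.
            cur_score = 0.0
            cur_start = i + 1
        elif cur_score > best_score:
            best_score = cur_score
            best_start, best_end = cur_start, i + 1
    return best_start, best_end

# High-quality start, collapsing tail: the bad tail is trimmed off.
quals = [35] * 40 + [2] * 10
print(phred_trim(quals))  # keeps the first 40 bases: (0, 40)
```

Unlike a sliding window, which makes a local decision at each window position, this finds the globally best segment for the whole read.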

Different aligners handle low-quality bases differently. Quality-trimming will almost universally decrease the proportion of uniquely-mapped reads, from a given set of reads that map whether trimmed or untrimmed. Decreasing the rate of total mapped reads generally means that you are quality-trimming too aggressively, or simply using an aligner that is not sensitive enough. STAR is very fast, but not as sensitive as some other aligners; perhaps it does not handle the shorter reads resulting from quality-trimming very well, or perhaps it is very tolerant of low-quality bases (local aligners tend to be).

Anyway, the image would be more useful if you used optimal quality trimming and included thresholds below Q15, which is higher than is generally recommended for mapping (particularly for anything quantitative like RNA-seq). Q10 and Q5 are worth including, and from there you can bisect if you want the optimal value.
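For perspective on those cutoffs, remember that a Phred score Q corresponds to an error probability of 10^(-Q/10). This is pure arithmetic, just to show why Q15 is already on the strict side for pre-mapping trimming:

```python
# Phred quality Q -> per-base error probability p = 10^(-Q/10).
for q in (5, 10, 15, 20, 30):
    print(f"Q{q:>2}: p_error = {10 ** (-q / 10):.4f}")
# Q5 tolerates ~32% error per base, Q15 only ~3%.
```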


Thanks, Brian! Sorry, I mistakenly replied in the wrong place below instead of commenting on your answer.

6.8 years ago
Hughie ▴ 80

Thanks, Brian Bushnell!

Actually, I also used MINLEN:35 to discard reads shorter than 35 bp, following an answer to my last post. I'm not sure whether this noticeably affected my results, because I had also trimmed my data before without MINLEN (i.e. keeping all reads after trimming), and I plotted a similar image, shown below.

You can see that two of my samples increased their unique and total mapping rates at trimmed_15.

I will try Q10 and Q5 later; thanks for your advice. Also, if you saw tail quality as bad as in the plot below, would you trim it? (If yes, which package do you usually choose?)

[Image: per-base quality plot with a degraded tail]

Last, I really appreciate your help!


It really depends on the aligner. BBMap does not care about quality very much. I'm not sure about the latest versions of bwa and bowtie; earlier versions did not handle low-quality reads well, but at least for bwa, the latest versions seem to do a very good job.

I've never benchmarked STAR because the last time I tested it, it core-dumped and I couldn't get it to produce any output, so I lost interest. But according to your results, it seems like STAR is fairly tolerant of low-quality data. Considering that in some cases it improves things, it looks like your quality-trimming is simply not optimal - using the Phred algorithm and adjusting the cutoff will probably improve your results.

In this case it is very informative that you only have 50bp (or maybe 49bp) reads, and are tossing reads shorter than 35bp after trimming. That will discard a lot of reads in this odd situation where you have terrible quality after ~40bp and a lot of the reads sink to zero quality after ~34bp. I suggest you set the cutoff lower, at maybe 30bp, for this data. There is obviously something seriously wrong with this run and the best thing to do is to rerun it for free, because it looks like a massive failure on the part of the sequencing machine. But if it is data that cannot be reproduced, then you are stuck with these horrible results from a sequencing failure. In that case I suggest you use Illumina's software to see if there is a positional component to the bad reads; if so, filtering out the junk is fairly easy.

Edit - the default or recommended min-length filter of tools like Trimmomatic and BBDuk tends to assume you have much longer reads than you are using (typically, at least 100bp). When using really short reads like 50bp you need to reduce it (for example, if you had 30bp reads and told the program to discard everything shorter than 35bp... you'd end up with nothing). Your comparison results are somewhat misleading when the methodology discards reads that are only slightly shorter than your maximum read length. It is not useful to describe the number of reads that uniquely map somewhere as a function of preprocessing when the preprocessing delivers different numbers of reads; that's only enlightening when all reads are retained.
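To put rough numbers on this, here is a hypothetical simulation (not your actual data: I simply assume the trim point lands uniformly between positions 30 and 45 of a 50bp read, roughly where your quality collapses) showing how much a MINLEN of 35 versus 30 discards:

```python
import random

random.seed(1)

# Hypothetical: position where quality collapses and trimming cuts the read.
trim_points = [random.randint(30, 45) for _ in range(10_000)]

def survivors(min_len):
    """Count reads whose post-trim length passes the min-length filter."""
    return sum(t >= min_len for t in trim_points)

for min_len in (35, 30):
    kept = survivors(min_len)
    print(f"MINLEN={min_len}: kept {kept}/{len(trim_points)} reads "
          f"({100 * kept / len(trim_points):.1f}%)")
```

Under this assumption, MINLEN:35 throws away roughly a third of the reads while MINLEN:30 keeps all of them, which is why a length filter so close to the read length distorts any mapping-rate comparison.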

That said, it's still odd that your read quality is so low. What platform is this?


Thanks! I'm going to try the Phred algorithm.
