Question: Why Must Longer Reads Be Trimmed Or Divided Into 36 bp?
Liuyunlong (130), Kunming China, wrote 8.8 years ago:

In the paper "Diversity of Human Copy Number Variation and Multicopy Genes", the read preprocessing pipeline includes this step before mapping reads to the reference genome: "All reads exceeding 36 base pairs (bp) in length were truncated to 36 bp, or divided into their constituent nonoverlapping 36-bp sequences to eliminate potential mapping biases between genomes sequenced at different read lengths." Is this necessary, and why 36 bp? If I have a dataset in which most reads in every sample are about 95-100 bp after QC, can I just trim all reads to a uniform length such as 95 bp? If I use mrFAST and divide longer reads into 36-bp pieces, which tools can handle paired-end reads so that the mates stay paired after splitting?
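For illustration only, the truncate-or-divide step quoted above could be sketched in Python as follows. This is not the paper's code or any existing tool: the function names, the `_partN` naming scheme, and the rule of pairing chunk i of mate 1 with chunk i of mate 2 are all assumptions made for the sketch. Note that pairing chunks this way changes the apparent insert size of each pair.

```python
def split_read(name, seq, qual, k=36):
    """Split one read into nonoverlapping k-bp chunks; a tail shorter than k is dropped."""
    chunks = []
    for start in range(0, len(seq) - k + 1, k):
        index = start // k
        # Hypothetical naming scheme: original read name plus a chunk index.
        chunks.append((f"{name}_part{index}", seq[start:start + k], qual[start:start + k]))
    return chunks


def split_pair(name, mate1, mate2, k=36):
    """Split both mates (each a (seq, qual) tuple) and pair chunks by index.

    Only chunk indices present in both mates are kept, so every output
    record still has a partner with the same name.
    """
    chunks1 = split_read(name, mate1[0], mate1[1], k)
    chunks2 = split_read(name, mate2[0], mate2[1], k)
    n = min(len(chunks1), len(chunks2))
    return [(chunks1[i], chunks2[i]) for i in range(n)]
```

Truncating to 36 bp instead of dividing corresponds to keeping only the first chunk of each read.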

Or is it necessary only because of a limitation of the alignment tool, mrFAST? BWA handles reads of different lengths dynamically and doesn't have this problem, right? Any advice would be helpful. Thanks.

short aligner cnv • 2.2k views
written 8.8 years ago by Liuyunlong (130)

What is it that you want to do with the mapped reads?

written 8.8 years ago by Sean Davis (26k)

To predict copy number variation with read-depth-based methods.

written 8.8 years ago by Liuyunlong (130)

Do they have 50 bp SOLiD reads? Maybe they just wanted to make sure they had gotten rid of all the adapters, so they chose a fixed, restrictive length of 36 bp for everything.

written 8.8 years ago by Damian Kao (15k)
Chris Miller (21k), Washington University in St. Louis, MO, wrote 8.8 years ago:

From a quick glance at the abstract of that paper, I'm guessing that they wanted to be able to directly compare the results across many samples that were sequenced with different read lengths. Under typical circumstances, there shouldn't be any reason to split your reads up. In fact, longer reads allow you to map into repetitive regions that shorter reads can't access. This enhances your ability to detect CNV in these potentially unstable regions.
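A toy illustration of the point about repetitive regions, under stated assumptions: the sequence below is made up, and counting exact k-mers only stands in for real read mapping (no mismatches, no reverse strand). It measures how many start positions give a k-mer that occurs exactly once, for a few read lengths.

```python
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once in genome."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


# Made-up toy sequence: the same 12-bp repeat placed twice between unique flanks.
repeat = "ACGTTGCAAGCT"
genome = "GGATCCATAG" + repeat + "TTCAGGCTAA" + repeat + "CCGTAATCGG"

for k in (6, 10, 16):
    print(k, round(unique_kmer_fraction(genome, k), 2))
```

Reads shorter than the 12-bp repeat can fall entirely inside it and match both copies, while 16-bp reads always reach into a unique flank.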

written 8.8 years ago by Chris Miller (21k)

Thanks, agreed. On the other hand, if reads of different lengths from different samples (such as 36, 76, 100, or 120 bp) are mapped and used to compute read depth, will the predicted CNV results have any bias other than the potential mapping biases?

written 8.8 years ago by Liuyunlong (130)

I'm not sure exactly what you mean by that. The read depth in repetitive regions is going to be affected by the length of the reads. If this isn't explicitly corrected for, you might inadvertently call CNAs that don't exist. In your case, where all reads are 95-100 bp, I wouldn't worry about that slight difference, but I would still choose an algorithm that does explicit correction for mappability.
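As a rough sketch of what an explicit mappability correction can look like (not the method of any particular CNV caller): divide each window's read count by the fraction of the window that is uniquely mappable at your read length, and mask windows that are mostly unmappable. The function name, the input lists, and the 0.25 cutoff are assumptions for illustration.

```python
def corrected_depth(read_counts, mappable_fraction, min_mappable=0.25):
    """Scale per-window read counts by each window's mappable fraction.

    Windows whose mappable fraction is below min_mappable are returned as
    None (masked) instead of being inflated by dividing by a tiny number.
    """
    corrected = []
    for count, fraction in zip(read_counts, mappable_fraction):
        if fraction >= min_mappable:
            corrected.append(count / fraction)
        else:
            corrected.append(None)
    return corrected


# Hypothetical windows: fully mappable, half mappable, mostly repetitive.
print(corrected_depth([100, 50, 5], [1.0, 0.5, 0.1]))
```

Without the correction, the second window would look like a deletion even though its depth per mappable base matches the first. The mappable fraction depends on read length, so it would have to be computed separately for each read length being compared.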

written 8.8 years ago by Chris Miller (21k)
Gustavo (530) wrote 8.8 years ago:

Another possible consideration: certain short read mappers can accept only a small number of mismatches to the reference before they fail to map the read. Longer reads have a higher probability of accruing mismatches for a given error rate... and for some technologies the error rate increases with read length.

It is possible for longer reads to have a lower mapping rate than shorter ones (of course, reads that are too short have higher mapping ambiguity). A simple method for making samples comparable is to trim the reads as described.
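To put rough numbers on the mismatch argument, here is a small sketch assuming independent errors at a constant per-base rate (a simplification, since error rates usually rise toward the end of a read, and true variants are ignored). The 1% error rate and the limit of 2 mismatches are illustrative values, not the settings of any specific mapper.

```python
from math import comb


def prob_too_many_mismatches(read_len, error_rate, max_mismatches):
    """P(more than max_mismatches errors in a read) under a binomial error model."""
    p_within_limit = sum(
        comb(read_len, k) * error_rate**k * (1 - error_rate) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )
    return 1 - p_within_limit


# Compare a 36-bp read with a 100-bp read at the same assumed error rate.
for read_len in (36, 100):
    print(read_len, prob_too_many_mismatches(read_len, 0.01, 2))
```

With a fixed mismatch limit, the longer read is rejected more often, which is the effect described above. Mappers that scale the allowed mismatches with read length reduce this difference.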

written 8.8 years ago by Gustavo (530)