Question: Is It Safe To Remove Exact Duplicate Reads In De Novo Transcriptome Assembly?
lwc628 (220), United States, wrote 7.4 years ago:

I use Trinity/Oases for de novo transcriptome assembly. In my pipeline, I remove exact duplicate reads (on both the forward and reverse strands) because I believe that duplicates don't add any information to the assembly, and removing them reduces the input size and thus speeds up the downstream analysis. But is my assumption correct?

I am also confused about this because in the Oases paper, the authors say "assemblies with longer k values perform best on high expression genes, but poorly on low expression genes". But if we remove duplicates and thus have only a unique set of reads, don't we lose the expression-level information?
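
For readers unfamiliar with the step described in the question, here is a minimal sketch of exact-duplicate removal that treats a read and its reverse complement as the same sequence. It is an illustration only, not the poster's actual pipeline: the function names and the toy reads are made up, and it assumes single-end reads held in memory as (name, sequence) pairs.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def dedupe_exact(reads):
    """Keep the first occurrence of each sequence, strand-insensitively.

    `reads` is an iterable of (name, sequence) pairs. Two reads count as
    duplicates if their sequences are identical or are reverse complements
    of each other. Hypothetical helper, not part of Trinity or Oases.
    """
    seen = set()
    kept = []
    for name, seq in reads:
        seq_upper = seq.upper()
        # Use the lexicographically smaller strand as the lookup key,
        # so both orientations map to the same entry.
        key = min(seq_upper, reverse_complement(seq_upper))
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


# Toy input, invented for illustration.
reads = [
    ("r1", "AACCGGTA"),
    ("r2", "AACCGGTA"),  # exact copy of r1
    ("r3", "TACCGGTT"),  # reverse complement of r1
    ("r4", "GGGTTTAA"),  # unrelated read
]

for name, seq in dedupe_exact(reads):
    print(name, seq)
```

For paired-end data one would presumably key on both mates together rather than on each read separately; that variant is not shown here.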


Indeed, I'm not sure about the answer, but I do think removing duplicates would affect your estimates of expression levels. Therefore, if the de novo assembler happens to use expression levels as a sort of support for the assembled contigs (by mapping the reads back to the contigs), removing duplicates may actually affect your assembly.

Reply written 7.4 years ago by Vitis (2.4k)
swbarnes2 (8.6k), United States, wrote 7.4 years ago:

There will always be some number of duplicates that are PCR artifacts, and some number that are "real", that is, you were unlucky, and two distinct molecules of DNA broke in exactly the same way.

Keeping the former overestimates transcript abundance; getting rid of the latter underestimates apparent abundance. So the question is, of all the duplicates you see, what is the ratio of PCR artifacts to genuinely separate, but identical, reads?

Unless your coverage is extremely high, I think most of your duplicates will be the former, so getting rid of them will give you more accurate results. Only once coverage climbs into the hundreds do genuinely distinct but identical-looking molecules start being independently generated.

Or to put it another way, removing duplicates puts a hard ceiling on the maximum coverage you can possibly get for a given sequence. If your sequence has a higher coverage than that ceiling, you will lose your ability to know exactly how high. But that ceiling is awfully high, and duplicate removal is likely the right thing to do for samples whose coverage is well below that ceiling.
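
The ceiling and the rise of coincidental duplicates with coverage can be sketched with a toy model. The assumptions here are strong and are not from the thread: single-end reads of one fixed length, no sequencing errors, strands collapsed, no repeats within the transcript, and fragment start positions drawn uniformly at random. Real RNA-Seq fragmentation is not uniform, and paired-end reads, variable read lengths and sequencing errors all raise the ceiling, so treat the numbers as rough intuition only. The function names are hypothetical.

```python
def max_depth_after_dedup(transcript_len, read_len):
    """Highest per-base depth possible once every start position is used once.

    After exact-duplicate removal at most one read survives per start
    position (under the assumptions above), so we place one read at each
    start and report the deepest position.
    """
    depth = [0] * transcript_len
    for start in range(transcript_len - read_len + 1):
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return max(depth)


def expected_duplicate_fraction(transcript_len, read_len, coverage):
    """Expected share of reads that duplicate another read purely by chance.

    Draws n_reads start positions uniformly from the available starts and
    uses the standard occupancy formula for the expected number of
    distinct positions hit.
    """
    starts = transcript_len - read_len + 1
    n_reads = coverage * transcript_len / read_len
    expected_distinct = starts * (1 - (1 - 1 / starts) ** n_reads)
    return 1 - expected_distinct / n_reads


# Illustrative parameters: a 2000 bp transcript and 100 bp reads.
for coverage in (10, 100, 500):
    frac = expected_duplicate_fraction(2000, 100, coverage)
    print(f"{coverage:>4}x coverage: about {frac:.0%} coincidental duplicates")

print("coverage ceiling after dedup:", max_depth_after_dedup(2000, 100))
```

In this simplified setting the ceiling equals the read length, and the expected share of chance duplicates grows steadily as coverage approaches it.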

I figure that an assembler wants all the regions in a contig to have about the same coverage, so if PCR duplicates are throwing that off, fixing that is probably the right thing to do.


For mRNA-Seq experiments, the ceiling could be really high, as the dynamic range of gene expression is quite big. Also, I have never seen even coverage for a transcript in mRNA-Seq experiments; there are always highs and lows, even within an exon. I think both factors make de novo assembly of a transcriptome harder than that of a genome.

Reply written 7.4 years ago by Vitis (2.4k)

So if I do a de novo transcriptome assembly, is duplicate removal recommended or not?

Reply written 7.4 years ago by lwc628 (220)

It looks like there is no good answer to this. Do two assemblies, one with and one without the duplicates, and compare the contigs. If you happen to have some sort of annotation to work with (usually not, otherwise you wouldn't be doing a de novo assembly), it will be easier to evaluate the two assemblies.
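
One simple way to start comparing the two assemblies is to compute basic contig statistics for each. The sketch below is only a starting point: the helper names and the toy FASTA records are invented for illustration, and N50 is a crude metric for transcriptomes, so it should not be the only basis for choosing between assemblies.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = 0
    in_record = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                lengths.append(current)
            current = 0
            in_record = True
        elif line and in_record:
            current += len(line)
    if in_record:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths):
    """Collect a few headline numbers for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Toy stand-ins for the two assemblies; in practice pass open file handles,
# e.g. read_fasta_lengths(open(path)) for each contig file.
with_duplicates = [">c1", "ACGTACGTAC", ">c2", "ACGT"]
without_duplicates = [">c1", "ACGTACGT", ">c2", "ACGTAC", ">c3", "AC"]

for label, fasta in (("with duplicates", with_duplicates),
                     ("without duplicates", without_duplicates)):
    print(label, summarize(read_fasta_lengths(fasta)))
```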

Reply written 7.4 years ago by Vitis (2.4k)