Question

Removing redundant transcripts from a transcriptome assembly

7.6 years ago

402374688 ▴ 30

I de novo assembled several transcriptomes of the same organism and found that as the number of reads (samples) increases, the resulting assembly gets larger and larger. To my knowledge, this means it contains redundant transcripts, right? What I want to ask is why this happens and how to remove these redundant transcripts. One more concern: when removing redundancy, is it possible to lose some genes of the same family, or could the downstream quantification be distorted within a family? As far as I know, the following steps may help: when assembling, use --normalize_reads to limit the maximum read coverage; after the Trinity assembly, use TGICL to extend the transcripts and CD-HIT to remove highly similar sequences. Are there other effective tools or strategies that can help with this?

rna-seq Assembly • 3.6k views

updated 7.6 years ago by Antonio R. Franco ★ 5.1k • written 7.6 years ago by 402374688 ▴ 30

score 0 · Answer 1 · 2016-09-03

Yes. You will most probably get redundant transcripts in the FASTA output, even if you used the read normalization option in Trinity.

Some of these contigs correspond to the same gene: they are not necessarily different isoforms, just alternative assemblies of the same transcript.
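To make the idea of "alternative assemblies of the same transcript" concrete, here is a minimal Python sketch. It assumes the simplest possible case, where a redundant contig is an exact copy or an exact fragment of a longer contig, on either strand. The function names are hypothetical, and this is not what CD-HIT or TGICL do internally: real tools cluster by a similarity threshold and tolerate mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(table)[::-1]


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def remove_contained(records):
    """Drop contigs that are exactly contained in a longer kept contig.

    Contigs are visited longest first, so a fragment is always compared
    against the longer sequences that could contain it. Both strands
    are checked. This is quadratic and only meant as an illustration.
    """
    kept = []
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        rc = reverse_complement(seq)
        if any(seq in k or rc in k for _, k in kept):
            continue  # exact duplicate or fragment of a kept contig
        kept.append((name, seq))
    return kept
```

Note that exact containment never merges two members of a gene family that differ by even one base, which is the safe extreme. Threshold-based clustering sits at the other end: the lower the identity cutoff, the more likely close paralogs are collapsed into one representative.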

Now, depending on the size of your assembled transcriptome, you have several choices.

One is CD-HIT (cd-hit-est for nucleotide sequences), but with a large transcriptome you can be limited by memory and run time.
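As a rough sketch of the CD-HIT option, the Python helper below builds a `cd-hit-est` command line. The file names, the helper names, and the 0.95 identity cutoff are assumptions for illustration, not recommendations: `-c` is the identity threshold, `-n` the word size (10 is in the range documented for thresholds of 0.95 and above), `-M 0` lifts the memory limit and `-T 0` uses all available cores.

```python
import shutil
import subprocess


def cd_hit_est_command(fasta_in, fasta_out, identity=0.95, word_size=10,
                       memory_mb=0, threads=0):
    """Build the argument list for a cd-hit-est run (nothing is executed)."""
    return [
        "cd-hit-est",
        "-i", fasta_in,          # input FASTA (assembled transcripts)
        "-o", fasta_out,         # output FASTA (cluster representatives)
        "-c", str(identity),     # sequence identity threshold
        "-n", str(word_size),    # word size, must match the threshold range
        "-M", str(memory_mb),    # memory limit in MB, 0 = unlimited
        "-T", str(threads),      # number of threads, 0 = all cores
    ]


def run_cd_hit_est(fasta_in, fasta_out, **options):
    """Run cd-hit-est if it is installed; raise a clear error otherwise."""
    cmd = cd_hit_est_command(fasta_in, fasta_out, **options)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("cd-hit-est was not found on PATH")
    subprocess.run(cmd, check=True)
```

A call such as `run_cd_hit_est("Trinity.fasta", "Trinity.cdhit95.fasta")` would keep one representative per cluster (both paths are placeholders). Since members of a gene family can be nearly identical, it is probably worth comparing cluster membership at a couple of cutoffs before committing to one for quantification.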

Another choice is to use the MIRA or even the CAP3 assembler, which can generate contigs from your transcripts. A different approach is taken by programs like IDBA-tran.

I am sure there are other alternatives as well. Here you have a nice paper dealing with this subject