Question: Best Strategy For De Novo Assembly Of A Reference Transcriptome Without A Genome
7.6 years ago by
Wrf210 wrote:

I have sequences from several tissues of the same animal. I'd like to generate a reference transcriptome, then map my reads onto it and search for differential expression. There is no genome for this animal, not even for anything closely related.

The most obvious strategy would be to assemble each tissue de novo, then combine them and remove duplicate sequences. Is there any reason why this would not be the best way?
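To make the "combine them and remove duplicate sequences" step concrete, here is a minimal sketch, assuming each per-tissue assembly has already been loaded into a dict of contig ID to sequence. The function names and the toy contigs are hypothetical, and only exact duplicates (including reverse complements) are collapsed.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(comp)[::-1]


def canonical(seq):
    """Pick one strand-independent representative of a sequence."""
    s = seq.upper()
    return min(s, reverse_complement(s))


def merge_assemblies(assemblies):
    """Collapse identical contigs across tissues.

    assemblies: dict mapping tissue name -> dict of contig_id -> sequence
    (hypothetical layout). Returns a dict mapping each unique canonical
    sequence to the list of "tissue:contig_id" labels that produced it.
    """
    merged = {}
    for tissue, contigs in assemblies.items():
        for contig_id, seq in contigs.items():
            merged.setdefault(canonical(seq), []).append(f"{tissue}:{contig_id}")
    return merged


# Toy data: the gill contig is the reverse complement of muscle contig c1.
assemblies = {
    "muscle": {"c1": "ATGCCC", "c2": "AAAT"},
    "gill": {"c9": "GGGCAT"},
}
merged = merge_assemblies(assemblies)
```

Note that exact matching like this leaves a shorter isoform in place even when it is a subsequence of a longer one, which is the behaviour asked for further down in the question.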

Does anyone know of a data structure or program that could include one "gene" and all its exon combinations for mapping, so I could clearly see that reads are mapping to splice variants of one gene rather than to possibly unrelated contigs? For example, a gene with 3 exons (1, 2, 3) might have two transcripts (isoform A: 1+2, isoform B: 1+2+3). While the first is a subsequence of the second, I don't want to remove the first, since the inclusion of the C-terminal exon might be biologically important in one of the tissues. If I were to then map the reads with bowtie, some of them would hit isoform B and some isoform A. Since they are the same gene, at some level I would just want to know that, and could possibly disregard the cassette exons.
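One possible shape for such a data structure, as a rough sketch only: a gene-to-isoform table where each isoform is a tuple of exon numbers, plus a lookup that rolls transcript-level hits up to the gene. The names `GENE_ISOFORMS`, `gene_level_counts` and `cassette_exons` are made up for illustration, and the read hits are assumed to have been parsed from the aligner output beforehand.

```python
from collections import Counter

# Hypothetical annotation: gene -> isoform -> ordered exon numbers.
GENE_ISOFORMS = {
    "geneX": {"isoform_A": (1, 2), "isoform_B": (1, 2, 3)},
}


def transcript_to_gene(gene_isoforms):
    """Invert the annotation into an isoform -> gene lookup."""
    return {iso: gene for gene, isos in gene_isoforms.items() for iso in isos}


def gene_level_counts(read_hits, gene_isoforms):
    """Count each read once per gene, however many isoforms it hit.

    read_hits: dict mapping read_id -> list of transcript names it aligned to.
    Reads hitting unknown transcripts or more than one gene are skipped.
    """
    lookup = transcript_to_gene(gene_isoforms)
    counts = Counter()
    for transcripts in read_hits.values():
        genes = {lookup[t] for t in transcripts if t in lookup}
        if len(genes) == 1:
            counts[genes.pop()] += 1
    return counts


def cassette_exons(isoforms):
    """Exons present in some isoforms of a gene but not in all of them."""
    exon_sets = [set(exons) for exons in isoforms.values()]
    return sorted(set.union(*exon_sets) - set.intersection(*exon_sets))


hits = {"r1": ["isoform_A", "isoform_B"], "r2": ["isoform_B"], "r3": ["unknown"]}
counts = gene_level_counts(hits, GENE_ISOFORMS)
variable = cassette_exons(GENE_ISOFORMS["geneX"])
```

With this layout both isoforms stay in the reference, while the gene-level count treats a read hitting A, B, or both as one hit on geneX.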

assembly rna-seq transcriptome • 4.1k views

For the transcriptome assembly I would personally recommend pooling the reads from all the different tissues: the more information an assembler has available, the better it can perform.

written 7.6 years ago by Sebastian Kurscheid300
7.6 years ago by
Wrf210 wrote:

I guess I can answer my own question a bit...

I had just exchanged some emails with Daniel Zerbino, the creator of Velvet. He said that in a case with the same 3 exons, if one tissue had 1+2 and another had 2+3, Velvet/Oases would make a final transcript of 1+2+3, even though this never occurs in the real animal. This is probably true of most assemblers.
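To illustrate the effect described here, a toy sketch follows. It is not how Velvet/Oases works internally; it only assumes that an assembler joins two paths whenever the end of one overlaps the start of the other, with exons standing in for the shared sequence. The function name is hypothetical.

```python
def merge_on_shared_exon(path_a, path_b):
    """Join two exon paths if a suffix of path_a equals a prefix of path_b.

    Returns the merged path, or None when there is no overlap.
    """
    for k in range(min(len(path_a), len(path_b)), 0, -1):
        if path_a[-k:] == path_b[:k]:
            return path_a + path_b[k:]
    return None


tissue_1 = (1, 2)  # transcript seen only in the first tissue
tissue_2 = (2, 3)  # transcript seen only in the second tissue

# Pooled reads: the shared exon 2 links the two paths into 1+2+3.
pooled = merge_on_shared_exon(tissue_1, tissue_2)
```

Here `pooled` comes out as `(1, 2, 3)`, a transcript that neither tissue contains on its own.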

As far as I can tell, that is a reason specifically NOT to pool the reads. One would never want to pool them and end up with more than the sum of the two individually. In fact since housekeeping genes should be common for both tissues, I would suspect that the combined set should necessarily be smaller than the sum.

He also pointed to this program: which supposedly can use the read counts to generate the splice variants.


This does not make sense to me.

The way I would approach this analysis (painting with the broadest brush here) is:

1) create a reference transcriptome based on reads obtained from all tissues
2) create an annotation of the transcriptome, including identification of putative splice sites
3) perform alignment of the same reads (from step 1) to this reference, but this time doing it for each library (tissue) separately

written 7.6 years ago by Sebastian Kurscheid300

So then step 1 is not really a reference "transcriptome", since it contains non-real transcripts; it's an all-introns-removed genome. That might work. My complaint is still that the 'reference transcriptome' might be treated as real when it is not proven to be real.

written 7.5 years ago by Wrf210