Question

Validation of De Novo RNA-Seq Assembly

4


11.5 years ago

Pasta ★ 1.3k

Hi,

I recently performed a de novo assembly of plant RNA-seq data using Velvet/Oases with the multiple k-mer approach (merging several sub-assemblies). The assembled transcripts were annotated and filtered for redundancy using CD-HIT-EST.

However, I still have a lot of different isoforms, and I've been asked to confirm that these isoforms are real and not due to assembly errors. Of course, the assembly being de novo, I don't have a genome sequence to compare against. I could still use the nucleotide sequence of a closely related plant of the same family, but I am afraid that the evolutionary distance between the two plants is too great for this to really help validate my assembly.

How would you proceed in this case?

Thank you for your answers.

velvet rna-seq • 5.7k views

ADD COMMENT • link updated 11.4 years ago by Wrf ▴ 210 • written 11.5 years ago by Pasta ★ 1.3k

score 3 · Answer 1 · 2013-03-13



11.5 years ago

Mikael Huss 4.8k

Your related organism may be more helpful than you think in assessing your assembly. For some inspiration, look at the paper "Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species" by Hornett and Wheat.


0


Thank you for the link, I will have a look.

ADD REPLY • link 11.5 years ago by Pasta ★ 1.3k

score 2 · Answer 2 · 2013-04-04

There is no standard way of doing this validation. You can try any or all of the following:

1. Paired-end reads. If your protocol is paired-end, align all reads to the transcripts using Bowtie, BWA, SOAP, BLAST, etc., and look at the ratio [paired reads] / [total reads]. The higher the value, the higher we would expect the quality of the transcript to be. This works well for transcripts that do not share exons with others.
2. Run blastx. If the transcript is real, it will likely align to some existing proteins, which may come from related species. This helps to validate complex splicing patterns.
3. Validate the transcripts in the wet lab.
4. Go case by case. It is tedious, but you can align reads to a given transcript, collect the spliced reads, and study the alternative splicing patterns.
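As a rough illustration of the paired-end check, here is a minimal Python sketch that reads SAM-formatted alignment lines and reports, per transcript, the fraction of mapped reads whose FLAG has the "properly paired" bit (0x2) set. The function name and the toy records are made up for this example, and the ratio is computed over mapped primary alignments only, which is one possible reading of [paired reads] / [total reads] rather than the definitive one.

```python
from collections import defaultdict

def proper_pair_fraction(sam_lines):
    """Return {transcript: fraction of mapped primary reads flagged as properly paired}."""
    mapped = defaultdict(int)
    proper = defaultdict(int)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        ref = fields[2]         # column 3: reference (here, transcript) name
        if flag & 0x4 or ref == "*":
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        mapped[ref] += 1
        if flag & 0x2:
            proper[ref] += 1    # aligner considers the pair properly aligned
    return {ref: proper[ref] / mapped[ref] for ref in mapped}

# Hypothetical toy records: tx1 has one proper pair plus one read whose mate
# is unmapped; tx2 has a pair that maps but is not flagged as proper.
example_sam = [
    "@SQ\tSN:tx1\tLN:1000",
    "r1\t99\ttx1\t10\t60\t50M\t=\t200\t240\t*\t*",
    "r1\t147\ttx1\t200\t60\t50M\t=\t10\t-240\t*\t*",
    "r2\t73\ttx1\t30\t60\t50M\t=\t30\t0\t*\t*",
    "r2\t133\ttx1\t30\t0\t*\t=\t30\t0\t*\t*",
    "r3\t65\ttx2\t5\t60\t50M\t=\t900\t0\t*\t*",
    "r3\t129\ttx2\t900\t60\t50M\t=\t5\t0\t*\t*",
]
fractions = proper_pair_fraction(example_sam)
```

Transcripts with a low fraction are candidates for a closer look, keeping in mind the caveat above about transcripts that share exons.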

Finally, I would not say that filtering with CD-HIT is a good idea, unless you do not mind removing some real isoforms that are highly similar to others.

score 1 · Answer 3 · 2013-04-10

CD-HIT will remove real isoforms. If you have a transcript with exons 1+2+3 and another transcript with exons 1+2, then even though the second is a subsequence of the first, it may still be real and biologically relevant.
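To make the exons 1+2+3 versus 1+2 case concrete, here is a toy Python sketch of a containment-based redundancy filter. This is not CD-HIT's actual algorithm (which clusters by a similarity threshold); it only assumes the simplest possible rule, "drop a sequence if it is fully contained in a longer one", and the exon sequences are invented.

```python
def drop_contained(transcripts):
    """Toy redundancy filter: drop any sequence that is a substring of a longer kept one."""
    kept = {}
    # Visit the longest sequences first so shorter ones are compared against them.
    by_length = sorted(transcripts.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if any(seq in longer for longer in kept.values()):
            continue  # fully contained in an already kept transcript
        kept[name] = seq
    return kept

# Invented exon sequences, for illustration only.
exon1, exon2, exon3 = "ATGGCCAAGT", "TTCGGATCCA", "GGTACCTGAA"
transcripts = {
    "iso_123": exon1 + exon2 + exon3,  # exons 1+2+3
    "iso_12": exon1 + exon2,           # exons 1+2, a prefix of iso_123
    "iso_13": exon1 + exon3,           # exon 2 skipped, not a substring
}
survivors = drop_contained(transcripts)
```

Here iso_12 is discarded even though it could be a genuine shorter isoform, while the exon-skipping iso_13 survives only because it is not an exact subsequence.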

If you want 'canonical' transcripts, then the Oases confidence value is probably the way to go, as the transcript with the highest confidence contains the 'bulk' of the contigs for that locus.

I often find that, when translating or running blastx, the real sequences tend to be the more common ones.
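If blastx was run with tabular output (BLAST's format 6, with the default twelve columns), a short Python sketch like the following could list which transcripts have at least one protein hit. The function name, the 1e-5 e-value cutoff, and the example hit lines are assumptions made for illustration, not values taken from this thread.

```python
def best_hits(blast_tab_lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the best-scoring hit per query."""
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.rstrip("\n").split("\t")
        # Default format 6 columns: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        query, subject = f[0], f[1]
        evalue, bitscore = float(f[10]), float(f[11])
        if evalue > max_evalue:
            continue  # too weak to count as support (cutoff is an assumption)
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Hypothetical hit lines for three transcripts.
example_hits = [
    "tx1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t350.0",
    "tx1\tprotB\t60.0\t150\t60\t2\t1\t450\t10\t160\t1e-20\t120.0",
    "tx2\tprotC\t35.0\t40\t26\t1\t1\t120\t5\t45\t0.5\t25.0",
]
all_transcripts = ["tx1", "tx2", "tx3"]
hits = best_hits(example_hits)
without_hit = [t for t in all_transcripts if t not in hits]
```

Transcripts without a hit are not necessarily wrong (they may be non-coding or lineage-specific), so this is better treated as supporting evidence than as a filter.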

You can look at the AMOS file from Oases (generated on request) for one k-mer to get a sense of how the reads map. Presumably, a good transcript would have a fairly even distribution of reads, while a bad one would stack a bunch of closely related repeats somewhere, or have a very thin band in the middle. Unfortunately, I don't know of a good way to automate that.
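One possible starting point for automating this, offered only as a rough heuristic: build a per-base depth profile from read placements and summarise how even it is. The Python sketch below does not parse the AMOS file; it assumes you have already extracted (start, length) pairs for the reads on one transcript. The function names, the edge trimming, and the toy read layouts are all hypothetical.

```python
def depth_profile(transcript_length, alignments):
    """Per-base read depth from (0-based start, aligned length) pairs."""
    depth = [0] * transcript_length
    for start, aln_len in alignments:
        for pos in range(max(start, 0), min(start + aln_len, transcript_length)):
            depth[pos] += 1
    return depth

def coverage_summary(depth, edge=0):
    """Summarise evenness, ignoring `edge` bases at each end where depth ramps up."""
    interior = depth[edge:len(depth) - edge] if edge else depth
    if not interior:
        return {"mean": 0.0, "cv": 0.0, "min_over_mean": 0.0}
    mean = sum(interior) / len(interior)
    if mean == 0:
        return {"mean": 0.0, "cv": 0.0, "min_over_mean": 0.0}
    variance = sum((d - mean) ** 2 for d in interior) / len(interior)
    return {
        "mean": mean,
        "cv": variance ** 0.5 / mean,            # coefficient of variation
        "min_over_mean": min(interior) / mean,   # low value suggests a thin band
    }

# Toy layouts on a 100-base transcript: evenly tiled reads versus two stacks
# of reads joined by a single read across the middle.
even_reads = [(start, 20) for start in range(0, 81, 5)]
thin_reads = [(0, 45)] * 8 + [(55, 45)] * 8 + [(40, 20)]
even = coverage_summary(depth_profile(100, even_reads), edge=15)
thin = coverage_summary(depth_profile(100, thin_reads), edge=15)
```

A low min_over_mean or a high cv would flag a transcript for manual inspection. Any threshold would need tuning on real data, since expression level and read length both affect these numbers.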