Question: Validation of de novo RNA-seq assembly
7.4 years ago, Pasta (1.3k) wrote:


I recently performed a de novo assembly of plant RNA-seq data using Velvet/Oases with the multiple k-mer approach (merging several sub-assemblies). The assembled transcripts were annotated and filtered for redundancy using CD-HIT-EST.

However, I still have a lot of different isoforms, and I've been asked to confirm that these isoforms are real and not due to some assembly error. Of course, since the assembly is de novo, I don't have a genome sequence to compare against. I could still use the nucleotide sequence of a closely related plant of the same family, but I am afraid that the evolutionary distance between the two plants is too great for it to really help validate my assembly.

How would you proceed in this case?

Thank you for your answers.

velvet rna-seq • 4.3k views
7.4 years ago, Mikael Huss (4.7k) wrote:

Your related organism may be more helpful than you think in assessing your assembly. Look at this paper, "Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species" by Hornett and Wheat, for some inspiration.


Thank you for the link, I will have a look.

Reply written 7.4 years ago by Pasta (1.3k)

7.4 years ago, Shaojiang Cai (100) wrote:

There is no standard way of doing this validation. You can try any or all of the following:

  1. Paired-end reads. If your protocol is paired-end, align all reads to the transcripts using Bowtie, BWA, SOAP, BLAST, etc., and look at the ratio [paired reads] / [total reads]. The higher this value, the higher the expected quality of the transcript. This works well for transcripts that do not share exons with others.

  2. Do blastx. If the transcript is real, it is likely to align to some existing proteins, possibly from related species. This helps to validate complex splicing patterns.

  3. Wet lab to validate the transcripts.

  4. Case by case. Tedious, but you can align reads to a given transcript, extract the spliced reads, and study the alternative splicing patterns.
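
As a rough illustration of point 1, here is a minimal Python sketch that computes, per transcript, the fraction of mapped reads flagged as properly paired in a SAM file. It assumes the aligner sets the standard SAM flag bits (0x2 proper pair, 0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `proper_pair_fraction` is made up for this example, and using mapped reads per transcript as the denominator is one possible reading of [paired reads] / [total reads], not the only one.

```python
from collections import defaultdict


def proper_pair_fraction(sam_lines):
    """Return {transcript: fraction of mapped reads flagged as properly paired}.

    `sam_lines` is any iterable of SAM text lines (header lines are skipped).
    Assumption: the denominator is the number of primary mapped reads per
    transcript, which is only one way to define "total reads".
    """
    mapped = defaultdict(int)
    proper = defaultdict(int)
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        ref = fields[2]
        if flag & 0x4 or ref == "*":      # read is unmapped
            continue
        if flag & 0x900:                  # secondary or supplementary alignment
            continue
        mapped[ref] += 1
        if flag & 0x2:                    # aligner marked the pair as proper
            proper[ref] += 1
    return {ref: proper[ref] / mapped[ref] for ref in mapped}
```

You would pass it an open SAM file handle (or the text output of your aligner) and then sort the transcripts by the returned fraction; transcripts with a low value are the ones to inspect first.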

Finally, I would not say that removing transcripts with CD-HIT is a good idea, unless you do not mind losing some real isoforms that are highly similar to others.

7.3 years ago, Wrf (210) wrote:

CD-HIT will remove real isoforms. If you have a transcript with exons 1+2+3 and another transcript with exons 1+2, even though the second is a subsequence, it is still possibly real and biologically relevant.
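
To make the exons 1+2+3 versus 1+2 case concrete, here is a toy Python sketch. It is not CD-HIT's algorithm: `drop_contained` is a hypothetical exact-substring filter standing in for any redundancy step that collapses a shorter sequence into a longer one, and the exon sequences are invented.

```python
def drop_contained(transcripts):
    """Toy redundancy filter (NOT CD-HIT itself): drop every sequence that is
    an exact substring of a longer sequence that has already been kept."""
    kept = {}
    # Visit the longest sequences first so that shorter ones are compared to them.
    ordered = sorted(transcripts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        if any(seq in longer for longer in kept.values()):
            continue
        kept[name] = seq
    return kept


# Invented exon sequences, for illustration only.
exon1, exon2, exon3 = "ATGGCCAAGT", "GGTTCCAGAC", "TTGACCGTAA"
transcripts = {
    "isoform_123": exon1 + exon2 + exon3,
    "isoform_12": exon1 + exon2,    # a prefix of isoform_123
    "isoform_13": exon1 + exon3,    # exon 2 skipped, so not a substring
}
print(sorted(drop_contained(transcripts)))
```

In this toy setup the exons 1+2 isoform is discarded even though it could be a real transcript, while the exon-skipping isoform survives only because it is not an exact subsequence.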

If you want 'canonical' transcripts, then the Oases confidence value is probably the way to go, as it has the 'bulk' of the contigs in that transcript.

I often find that with translating or blastx, real sequences tend to be more common.
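
If you want to put a number on the blastx check, something like the following Python sketch could work. It assumes blastx was run with the 12-column tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 the e-value). The function name `split_by_blastx_support` and the default cutoff of 1e-5 are arbitrary choices for this example.

```python
def split_by_blastx_support(blast_lines, all_ids, max_evalue=1e-5):
    """Split transcript IDs into (supported, unsupported) sets.

    A transcript counts as supported if it has at least one hit with an
    e-value <= max_evalue. Assumes 12-column tabular BLAST output;
    the 1e-5 default is an arbitrary illustrative cutoff.
    """
    all_ids = set(all_ids)
    supported = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):   # blank or comment line
            continue
        cols = line.split("\t")
        if len(cols) < 12:                     # not a full tabular record
            continue
        if float(cols[10]) <= max_evalue:      # column 11 holds the e-value
            supported.add(cols[0])             # column 1 holds the query ID
    supported &= all_ids
    return supported, all_ids - supported
```

The unsupported set is not automatically wrong (non-coding or lineage-specific transcripts will land there too), but it is a reasonable place to start looking for misassemblies.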

You can look at the AMOS file from Oases (generated on request) for one k-mer to get a sense of how the reads map. Presumably, a good transcript would have a fairly even distribution of reads, while a bad one would stack a bunch of closely related repeats somewhere, or have a very thin band in the middle. Unfortunately I don't know of a good way to automate that.
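
One possible starting point for automating this, purely as a heuristic sketch: compute the per-base read depth along a transcript and summarise how uneven it is. The Python code below assumes you have already extracted read placements as 0-based, half-open `(start, end)` intervals (from the AMOS file or from any alignment). The function names and the choice of the coefficient of variation as the summary statistic are my own assumptions, not something Oases provides.

```python
from statistics import mean, pstdev


def per_base_depth(length, intervals):
    """Read depth at each position of a transcript of the given length.

    `intervals` holds 0-based, half-open (start, end) read placements.
    """
    diff = [0] * (length + 1)
    for start, end in intervals:
        start, end = max(0, start), min(length, end)
        if start < end:
            diff[start] += 1       # a read begins covering at `start`
            diff[end] -= 1         # and stops covering at `end`
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def coverage_summary(length, intervals):
    """Summarise coverage evenness; a high `cv` or a low `min` relative to
    `mean` hints at stacked repeats or a thin band (heuristic only)."""
    if length <= 0:
        raise ValueError("length must be positive")
    depth = per_base_depth(length, intervals)
    avg = mean(depth)
    cv = pstdev(depth) / avg if avg > 0 else float("inf")
    return {"mean": avg, "cv": cv, "min": min(depth), "max": max(depth)}
```

Any cutoff on `cv` or on `min` relative to `mean` would have to be chosen by looking at transcripts you already trust; expression level and read length will shift what counts as normal.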
