Question: How to define a full-length transcript for transcriptome assembly?
5.9 years ago, Shaojiang Cai wrote:

Hi, I am using an RNA-seq library to test my assembler. What I am wondering is: how can we say that a transcript is present?

  • In RNA-seq libraries, can the reads be from UTR regions?
  • If so, will the UTRs always be fully covered, or can they be fragmented?
  • Usually, can I say, "transcript A is expressed because its full coding regions are assembled?"


5.9 years ago, Charles Warden (Duarte, CA) wrote:

At least in my experience, full coverage of the coding region in a single assembled transcript is difficult to achieve. This is part of why I would always prefer a direct alignment over de novo assembly (when a reference is available). When working with assembled transcripts, I would favor using a partial contig as a proxy for expression of the relevant gene, rather than requiring a full coding region to be present in the assembly.
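To make the "partial contig as a proxy" idea concrete, here is a minimal sketch that measures what fraction of a known coding region is covered by aligned contigs. It assumes you already have contig-to-reference alignments as half-open (start, end) intervals in the same coordinates as the CDS; the function name and the example coordinates are hypothetical, not taken from any particular tool.

```python
def cds_fraction_covered(cds, alignments):
    """Return the fraction of a coding region covered by aligned contigs.

    cds        -- (start, end) of the coding region, half-open
    alignments -- list of (start, end) contig alignments, half-open,
                  in the same coordinate system as cds (an assumption)
    """
    cds_start, cds_end = cds
    if cds_end <= cds_start:
        raise ValueError("cds must have positive length")

    # Clip every alignment to the CDS, then walk them in order of start.
    clipped = sorted((max(s, cds_start), min(e, cds_end)) for s, e in alignments)

    covered = 0
    cur_end = cds_start  # right edge of what has been counted so far
    for s, e in clipped:
        if e <= s:
            continue  # alignment lies entirely outside the CDS
        s = max(s, cur_end)  # skip bases already counted (overlaps)
        if e > s:
            covered += e - s
            cur_end = e
    return covered / (cds_end - cds_start)


# Hypothetical example: a 900 bp CDS with two overlapping partial contigs
# and one alignment that falls entirely in the 3' UTR.
partial = cds_fraction_covered((100, 1000), [(50, 400), (350, 600), (1200, 1500)])
full = cds_fraction_covered((100, 1000), [(0, 2000)])
```

Any cutoff on this fraction for calling a gene "expressed" would be a judgment call; the point is only that a partial but well-supported contig can still be scored.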

Yes, you will have reads from UTRs. Just as with the coding regions, my guess is that long, high-quality (and not incorrectly stitched) contigs will not necessarily cover all of the real UTRs as a contiguous extension of the coding-region transcript.

If it helps, I've collected a list of pointers for a slightly different assembly question in this blog post.


Hi Charles,

Would you recommend Trinity for 454 sequences then? If yes, how can we define the start and end of the transcript? 


(reply written 5.8 years ago by MAPK)

You should ask the developers about 454 sequences. My guess is that you would at least need to change some parameters.

There is also an FAQ page.

I'm not sure I understand your second question. Based upon my experience with Illumina data, one problem I have had with Trinity is that it seemed to inappropriately stitch together unrelated sequences. Also, the RNA-Seq data that I see typically don't have complete or even coverage across known transcripts (when aligned to a reference instead of doing de novo assembly), which is why I think using coverage of a well-defined but partial sequence is better for differential expression purposes.

In general, the depth of coverage of reads aligned to the assembly and the uniformity of coverage across that assembly are quality control metrics for assessing the assembly. Unless the sequencing technology directly produces reads that span the whole transcript and you can be absolutely certain that the RNA didn't get fragmented prior to assembly, I can't think of a specific reason why analysis strategies would be fundamentally different (and, in that scenario, there wouldn't be a need for de novo assembly in the first place).
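As a rough illustration of those two QC metrics, here is a small sketch that summarizes per-base read depth along one contig. It assumes you have already extracted a list of depths (one integer per base) from your read-to-assembly alignments; the function name and the use of the coefficient of variation as a uniformity measure are my own choices for the example, not a standard from any specific tool.

```python
from statistics import mean, pstdev


def coverage_stats(depths):
    """Summarize per-base read depth along one contig.

    depths -- list of non-negative read depths, one entry per base
    """
    if not depths:
        raise ValueError("depths must not be empty")

    avg = mean(depths)
    # Breadth: fraction of bases with at least one read.
    breadth = sum(1 for d in depths if d > 0) / len(depths)
    # Coefficient of variation: lower means more uniform coverage.
    cv = pstdev(depths) / avg if avg > 0 else float("inf")
    return {"mean_depth": avg, "breadth": breadth, "cv": cv}


# Two hypothetical contigs with the same mean depth but different uniformity.
even = coverage_stats([10, 11, 9, 10, 10, 10])
patchy = coverage_stats([30, 30, 0, 0, 0, 0])
```

Both toy contigs have the same mean depth, but the second has low breadth and a much higher coefficient of variation, which is the kind of pattern that would make me suspicious of a contig.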

(reply written 5.8 years ago by Charles Warden)