Question: Rna-Seq: Novel Transcripts Found. What Next?
5.1 years ago by
Bergen, Norway
jobinv1.1k wrote:

If I were to use Cufflinks in de novo mode to find transcripts or genes in my data that did not align to known transcripts from UCSC or Ensembl, I wouldn't know what to do downstream of this. How would one go about confirming that these are indeed novel? What sort of validation steps would one take (computational or non-computational), what in-depth information can I go looking for, and what databases would have useful information for me? Thanks!

rna-seq • 6.4k views
ADD COMMENTlink modified 5.1 years ago by Devon Ryan81k • written 5.1 years ago by jobinv1.1k

No offense, but is this a homework or take-home test question? It reminds me of something I would write for a test.

If not, try giving some details regarding what you've tried and what sorts of things you're actually interested in. You might also mention what species you're using, since some of them have better annotations than others. Your question is extremely broad, so there will be no single best answer.

ADD REPLYlink written 5.1 years ago by Devon Ryan81k

Oh, sorry. The issue is precisely that it is a hypothetical question at this point. I am working with human cancer in a mouse xenograft model, and I'm wondering whether it's even worth attempting to look for novel transcripts, or if it would just be a waste of money to start a pipeline where I wouldn't know what to do with the results that I find. My apologies if this makes it too broad a question, it wasn't my intention.

If it is not an appropriate question for this forum, should I perhaps delete it?

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by jobinv1.1k

No worries and thanks for the clarification!

In your case, I wouldn't personally bother following up on novel transcripts that aren't differentially expressed (and even then I'd be very hesitant). The real question to me would be one of biological meaning and significance. It's far from implausible that there are novel transcripts in cancer that are biologically/clinically meaningful. However, in the context of a xenograft, it's difficult to distinguish these transcripts from those appearing due to a weird xenograft-specific effect.

Perhaps others will have a different opinion.

BTW, should you decide to follow up on this, I'll post an answer below (it's tough to format things in the comment section).

ADD REPLYlink written 5.1 years ago by Devon Ryan81k
5.1 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith16k wrote:

How to decide if a transcript predicted by Cufflinks is 'novel'

A 'novel' transcript could simply be defined as one that has not been observed before. If the transcript structure you observe is not currently represented in RefSeq, Ensembl, or UCSC there is a good chance that it might be novel. You can view transcripts from each of these sources in IGV or the UCSC browser, etc.

If the putative novel transcript is an alternative isoform of a known gene, examine its structure. Is there a particular feature of the transcript that is distinctive (e.g. a novel exon, an exon skipping event, or intron retention)? You can examine the complete corpus of mRNAs and ESTs from GenBank for your species. You can view these in the UCSC browser, or download them in FASTA format here: est.fa.gz and mrna.fa.gz (again using human as an example). If your putative novel transcript is not in a known gene region at all, does it share any similarity to known transcripts?
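To make "is this structure already represented?" concrete, here is a minimal sketch (not part of the original answer) that compares a candidate's intron chain against known isoforms of the same gene. The function names and coordinates are hypothetical, and it assumes all transcripts are on the same chromosome and strand with 1-based inclusive exon coordinates.

```python
def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exons.

    Coordinates are assumed to be 1-based and inclusive; the exons are
    sorted first, so input order does not matter.
    """
    ordered = sorted(exons)
    return tuple(
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    )


def is_structurally_novel(candidate_exons, known_transcripts):
    """True if no known transcript has exactly the candidate's intron chain."""
    known_chains = {intron_chain(exons) for exons in known_transcripts}
    return intron_chain(candidate_exons) not in known_chains


# Hypothetical annotated isoform with three exons.
known = [[(100, 200), (300, 400), (500, 600)]]

skipped = [(100, 200), (500, 600)]           # skips the middle exon
same = [(90, 200), (300, 400), (500, 650)]   # same introns, different ends

print(is_structurally_novel(skipped, known))  # True
print(is_structurally_novel(same, known))     # False
```

Comparing intron chains rather than exact exon boundaries ignores differences in transcript ends, which are often imprecise in assemblies. Single-exon transcripts all have an empty chain, so this sketch cannot tell them apart and they would need a separate overlap check.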

Validation of novel transcripts

This will commonly involve some combination of RT-PCR, qPCR, cloning and Sanger sequencing. Does the predicted transcript sequence contain an ORF? Try feeding it into ORF finder, for example. If not, does it have features of any known types of RNA gene? You could try folding it. Many RNA-folding tools exist already.
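For a quick first look before using a dedicated tool, the ORF check can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it scans the three forward-strand frames only, requires an ATG start and an in-frame stop codon, and the function name is hypothetical. It is not a substitute for ORF finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG...stop ORF on the forward strand, or "" if none.

    Only the three forward frames are scanned; the reverse complement
    would have to be passed in separately.
    """
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]         # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None                   # look for the next ORF
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))  # ATGAAATTTGGGTAA
```

A very short ORF like the one above is expected by chance in almost any sequence, so in practice one would also apply a minimum length and check for homology to known proteins.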

Functional validation will depend on what you find above...

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Malachi Griffith16k

Thanks! This is very helpful indeed.

ADD REPLYlink written 5.1 years ago by jobinv1.1k

This is very helpful. I have another newbie question: could you please give me some pointers about what software/pipeline I should use for detecting novel transcripts? Thank you so much.

ADD REPLYlink written 9 months ago by archie.w.lee40

My personal choice of pipeline for novel transcript detection:

TopHat/STAR (mapping) -> Cufflinks (assembly) -> Cuffmerge -> Cuffcompare (against the reference annotation) PubMed

That pipeline has since been updated: HISAT2 -> StringTie -> gffcompare PubMed
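Once gffcompare (or Cuffcompare) has been run against a reference annotation, each assembled transcript carries a `class_code` attribute in the annotated GTF. A minimal sketch for pulling out candidates, assuming the commonly used codes `u` (intergenic, no overlap with the reference) and `j` (shares at least one splice junction with a reference transcript, i.e. a potential novel isoform); the function name and the example records are hypothetical:

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

# Assumed class codes of interest; adjust to your own definition of "novel".
CANDIDATE_CODES = {"u", "j"}


def novel_candidates(gtf_lines, codes=CANDIDATE_CODES):
    """Return (transcript_id, class_code) for transcript records with a wanted code."""
    hits = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue                            # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue                            # only transcript-level records
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if attrs.get("class_code") in codes:
            hits.append((attrs.get("transcript_id"), attrs["class_code"]))
    return hits


# Hypothetical records in the style of an annotated GTF.
example = [
    'chr1\tStringTie\ttranscript\t100\t600\t.\t+\t.\ttranscript_id "STRG.1.1"; gene_id "STRG.1"; class_code "u";',
    'chr1\tStringTie\ttranscript\t900\t2000\t.\t+\t.\ttranscript_id "STRG.2.1"; gene_id "STRG.2"; class_code "j";',
    'chr1\tStringTie\ttranscript\t5000\t7000\t.\t-\t.\ttranscript_id "STRG.3.1"; gene_id "STRG.3"; class_code "=";',
]

print(novel_candidates(example))  # [('STRG.1.1', 'u'), ('STRG.2.1', 'j')]
```

In real use the lines would come from the annotated GTF written by the comparison step, e.g. `novel_candidates(open(path))`. Check the class code definitions in the documentation for the tool version you are running before relying on them.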

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by sangram_keshari30
5.1 years ago by
Freiburg, Germany
Devon Ryan81k wrote:

Firstly, see my comment above regarding my personal opinion of how useful this would be for your situation. But, should you disagree (and when you do so and get a Nature paper because of it, rest assured that I will eat a sufficient amount of crow):

  • For a non-automated step, I would blast any hits to see if someone picked up something similar (some projects have found a HUGE number of random transcripts). Having said that, even if it's been seen before, that doesn't mean anyone has followed up on it. So, even if something isn't novel (strictly speaking) that doesn't mean it's not interesting for follow-up.

  • Look to see how conserved this region is in related species. If a region is unconserved, the odds are good that it's just noise (i.e., it may be transcribed, but it probably does nothing). I would strongly encourage you to place a good bit of emphasis on this in ranking candidates for follow-up. If a region is conserved, people will MUCH more readily believe that what you found is meaningful.

  • Does the transcript look like it might encode a protein (look for an open reading frame, etc.)? If so, does it have homology to anything?

  • Does any of the ENCODE data suggest that this might be a gene (PolII binding, histone modifications, etc.)? If the ENCODE data suggests that there might be transcription there but the region isn't conserved, I would follow my recommendations above for non-conserved regions.

Those are some initial non-wet bench things to do to get you started. Among the wet-bench follow-ups:

  • Northern blot/qPCR/whatever to look at tissue distribution (in your case, I guess also to look in non-xenograft samples).

  • RACE or some other method to try to assess the full length of the transcript.

  • Generate an antibody against it to see if it actually makes a protein (there are other ways of doing this, of course).

There are a number of other things one could do, mostly dependent upon whether the transcript is coding or not.

ADD COMMENTlink written 5.1 years ago by Devon Ryan81k

Thanks a lot. I'll try to look into these and see what I manage to learn more about. I appreciate the answer, and will let you know when my Nature publication is out! ;)

ADD REPLYlink written 5.1 years ago by jobinv1.1k