Question

Help with identifying novel transcripts

9.2 years ago

martyflores ▴ 10

I am trying to identify novel transcripts across two developmental stages, which we'll call dev1 and dev2. I have RNA-seq data for both dev1 and dev2, on which I performed the following:

Align via TopHat
Assemble via Cufflinks
Merge the two sets of transcripts via Cuffcompare, using refFlat as the reference.
Determine differential expression between dev1 and dev2 via Cuffdiff, using the GTF produced by Cuffcompare.
Join my CuffDiff file with my CuffCompare transcript tracking file to be able to identify transcripts by their CufflinksID.
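
For reference, the join in the last step can be sketched roughly as below. This is only a minimal sketch, not the exact commands that were run. It assumes the Cuffcompare tracking file is tab-separated with no header (TCONS ID, XLOC ID, reference, class code, then one column per sample that is either "-" or something like "q1:gene_id|transcript_id|..."), and that the Cuffdiff isoform file has a header with test_id, value_1 and value_2 columns. The function names and the demo rows are made up for illustration.

```python
import csv

def load_tracking(lines):
    """Map each TCONS ID to its class code and per-sample Cufflinks transcript IDs."""
    tracking = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        tcons_id, _xloc_id, _ref, class_code = row[:4]
        cufflinks_ids = []
        for field in row[4:]:
            if field == "-":
                # "-" is assumed to mean: not assembled in this sample
                cufflinks_ids.append(None)
                continue
            # Assumed layout: "q1:gene_id|transcript_id|..."
            _, _, entry = field.partition(":")
            parts = entry.split("|")
            cufflinks_ids.append(parts[1] if len(parts) > 1 else parts[0])
        tracking[tcons_id] = {"class_code": class_code,
                              "cufflinks_ids": cufflinks_ids}
    return tracking

def join_diff_with_tracking(diff_lines, tracking):
    """Attach tracking info to every Cuffdiff row whose test_id is a known TCONS ID."""
    joined = []
    for row in csv.DictReader(diff_lines, delimiter="\t"):
        info = tracking.get(row["test_id"])
        if info is None:
            continue  # TCONS ID missing from the tracking file
        joined.append({
            "tcons_id": row["test_id"],
            "class_code": info["class_code"],
            "cufflinks_ids": info["cufflinks_ids"],
            "fpkm_1": float(row["value_1"]),
            "fpkm_2": float(row["value_2"]),
        })
    return joined

# Made-up demo rows, only to show the assumed column layout.
TRACKING_DEMO = [
    "TCONS_00000001\tXLOC_000001\t-\tu\t"
    "q1:CUFF.1|CUFF.1.1|100|12.5|10.0|15.0|20.1|1500\t-",
]
DIFF_DEMO = [
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2",
    "TCONS_00000001\tXLOC_000001\t-\tchr1:100-1600\tdev1\tdev2\tOK\t11.8\t0.4",
]

if __name__ == "__main__":
    for rec in join_diff_with_tracking(DIFF_DEMO, load_tracking(TRACKING_DEMO)):
        print(rec)
```

A None entry in cufflinks_ids marks a sample in which the transcript was not assembled, which makes the rows relevant to the questions below easy to pick out.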

This has led to a few questions:

A. Looking at the transcript-level differential expression testing from Cuffdiff, an FPKM is reported for a given transcript in both dev1 and dev2, even when that transcript does not appear among the assembled transcripts for dev2.

B. The FPKM values reported by Cufflinks and Cuffdiff are different. I've seen other people ask about this. But still, what's up with that?

C. Essentially the opposite of question A: looking at my Cuffcompare transcript tracking file, I'll have an identified transcript TCONS_xxx, but it won't have any values in my Cuffdiff file.

Any insight would be greatly appreciated, especially for question A, which I have a sneaking suspicion would shed light on the rest of the questions.

Thanks!

cuffdiff RNA-Seq cufflinks • 2.8k views


Ram · Answer 1 · 2015-03-19

The novelty of a transcript can be assessed as follows:

Align it back to the genome and to the available annotation (coding sequences). If your assembled transcript aligns well to the genome (in both length and identity) but not to the coding sequences, there is a reasonable chance that it is a novel transcript.
To get more evidence: once you are sure that it aligns very well to the genome and not to the CDS, check the read depth for that particular transcript by aligning your input reads back to the genome.
Then do a similarity search against different databases (e.g. BLAST against GenBank) and make sure that this transcript is not a bacterial contig, a plasmid or some other contaminating sequence. If you don't get any hit, translate your sequence into protein (perhaps the longest ORF) and run a protein search; if there is still no hit, try PSI-BLAST. If this really is a novel transcript, you should get some hit in a closely related species (anyone, do correct me if I am wrong).
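
The "translate the longest ORF" step above can be sketched in plain Python as follows. This is only a minimal sketch under some assumptions: the standard genetic code, ORFs that start at ATG, all six reading frames searched, and an ORF that runs off the end of the transcript without a stop codon still counted (assembled transcripts are often partial). The function names are made up for illustration; a dedicated ORF finder or a library such as Biopython would be the more usual choice.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def longest_orf(seq):
    """Return the longest peptide starting at M across all six reading frames."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            protein = translate(strand[frame:])
            # Each "*"-separated segment is an uninterrupted stretch of codons.
            # The last segment has no stop codon and is kept as a partial ORF.
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best

if __name__ == "__main__":
    print(longest_orf("CCATGGCTTGGTAAAT"))
```

The resulting peptide is what you would paste into a protein BLAST or PSI-BLAST search.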

For the rest of the questions I don't actually have a clear answer, sorry. :P