Question

Direct Alignment To Transcriptome- Circumventing Tophat, Tweak Cufflinks

1

10.8 years ago

Bharat Iyengar ▴ 330

I want to use RNA-seq data to quantify gene expression, with no intention of finding new transcripts. It would save time to align the reads directly to a transcriptome index built from RefSeq RNA, instead of aligning them to the genome and then looking up annotations. I align the reads with bowtie and then quantify them with cufflinks (cufflinks has no way of telling transcriptome alignments from genome alignments). However, I am not sure whether cufflinks can calculate FPKM correctly from such alignments.

How do I incorporate gene length information (would I have to make a GTF file for that)?
I did read the cufflinks methodology given in the supplementary information of the paper, but I am not exactly sure how it approximates the values. Moreover, the statistical model it uses treats the transcriptome as a subset of the genome, and the equations are written accordingly. I think this question must have been asked numerous times, but how exactly does cufflinks calculate FPKM?
Would it be better, in this case, to write my own script to calculate FPKM (counting one read pair as one fragment)?
Should fragment lengths be normalized?
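
On the length question: when reads are aligned directly to transcript sequences, the SAM header already records each reference's name and length in its `@SQ` lines (`SN` and `LN` tags), so a GTF may not be needed just to get lengths. A minimal sketch, assuming the aligner writes a standard SAM header; the function name and the example header below are made up for illustration:

```python
def transcript_lengths_from_sam_header(sam_lines):
    """Collect {reference name: length} from SAM @SQ header lines."""
    lengths = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        if line.startswith("@SQ"):
            # Each header field looks like TAG:VALUE, separated by tabs
            fields = dict(
                field.split(":", 1)
                for field in line.rstrip("\n").split("\t")[1:]
            )
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


# Hypothetical header for two transcripts, followed by one alignment record
example_header = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "@SQ\tSN:NM_000001.1\tLN:2100\n",
    "@SQ\tSN:NM_000002.1\tLN:850\n",
    "read1\t0\tNM_000001.1\t10\t255\t50M\t*\t0\t0\t*\t*\n",
]
lengths = transcript_lengths_from_sam_header(example_header)
```

With a real file, the same function could be fed an open file handle instead of a list, since it only iterates over lines.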

rna-seq cufflinks bowtie tophat • 8.9k views

updated 10.8 years ago by Rm 8.3k • written 10.8 years ago by Bharat Iyengar ▴ 330

0

I think the gain in speed will be minimal (if that is your only motivation). It is easier to run the pipeline using -G, which avoids assembly, and continue with the DE analyses.

Reply • 10.8 years ago by Pavel Senin ★ 1.9k

0

Won't searching through the entire genome using annotations (GTF) take more time than searching just a few annotated RNAs (~30000)?

Does the -T/--transcriptome-only option in tophat require GTF annotation?

Reply • 10.8 years ago by Bharat Iyengar ▴ 330

0

This will reduce the search space too, good find! (If by annotation you mean functions, I don't think TopHat cares.) In general, and again this is my opinion, the indexing/search problem is largely solved: once the index is built, search is constant time.

Reply • 10.8 years ago by Pavel Senin ★ 1.9k

score 3 · Answer 1 · 2013-07-05

3

10.8 years ago

Mikael Huss 4.8k

If the primary motivation is to save time, you could look at STAR, which in my experience is so fast that you wouldn't need to think too much about genome vs transcriptome alignment.

If the primary motivation is to focus on the known transcripts (and maybe gain sensitivity that way), I think eXpress is what you are looking for. It takes direct mappings to the transcriptome and applies a Cufflinks-like methodology to calculate FPKM (and counts). It's developed by the same group that made Cufflinks. RSEM is another option.

0

Thank you. STAR seems to have a great reputation. My motivation is not only speed but also avoiding the need to input many files. Do you suggest STAR | eXpress?

(Just an additional query:) Is FPKM an exact synonym of RPKM for paired-end sequencing, or is it a statistical estimate?

Reply • 10.8 years ago by Bharat Iyengar ▴ 330

0

I was actually thinking bowtie -> eXpress but I suppose STAR -> eXpress might be even better. Not sure what you mean by "inputting many files" - do you mean many reference sequences or many sample FASTQ files?

FPKM is an exact synonym for RPKM for paired end seq, at least according to my understanding.

Reply • 10.8 years ago by Mikael Huss 4.8k

0

Just read about eXpress. For now I think I'll stick to bowtie -> eXpress. I still have to see how STAR actually works, what its output modes are, etc. By "many files" I meant things like the genome index, GTF, transcriptome, selected sequences, etc. But bowtie -> eXpress would be a good combination: I can align to the transcriptome and pass the alignments to eXpress for quantification.

Regarding FPKM: I came across this term for the first time in the cufflinks paper, and I am not sure whether it was used before. So all these algorithms use expectation maximization to get a maximum-likelihood estimate of FPKM, whereas the RPKM calculation was pretty straightforward (though EM can be applied to that too).
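
For reference, the "straightforward" calculation can be sketched in a few lines. This is only a naive illustration, not what Cufflinks, eXpress, or RSEM do: it assumes every fragment (one read pair = one fragment) has already been assigned to exactly one transcript, so it ignores the multi-mapping ambiguity that the EM step resolves, and it uses the full transcript length rather than an effective length. The function name and the toy numbers are hypothetical:

```python
def naive_fpkm(fragment_counts, transcript_lengths):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragment_counts:    {transcript: uniquely assigned fragment count}
    transcript_lengths: {transcript: length in bases}
    """
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        return {name: 0.0 for name in fragment_counts}
    fpkm = {}
    for name, count in fragment_counts.items():
        length = transcript_lengths[name]
        # count / (length / 1e3) / (total / 1e6), folded into a single division
        fpkm[name] = count * 1e9 / (length * total_fragments)
    return fpkm


# Toy example with made-up counts and lengths
counts = {"tx_a": 500, "tx_b": 1500}
lengths = {"tx_a": 1000, "tx_b": 3000}
fpkm = naive_fpkm(counts, lengths)
```

In this toy example both transcripts end up with the same FPKM, because tx_b has three times the fragments but is also three times as long.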

Reply • 10.8 years ago by Bharat Iyengar ▴ 330

score 1 · Answer 2 · 2013-07-05

1

10.8 years ago

Rm 8.3k

Instead of using just the transcriptome for alignments, it would be great to use the entire genome and align the reads with STAR. It is a super fast aligner for RNA-seq and can additionally take a GTF file.

STAR + cufflinks