Question: Direct Alignment To Transcriptome- Circumventing Tophat, Tweak Cufflinks
1
6.0 years ago by
Bombay, India
Bharat Iyengar260 wrote:

I want to use RNA-seq data to quantify gene expression, with no intention of finding new transcripts. It would save time to align the reads directly to a transcriptome index built from RefSeq RNA, instead of aligning them to the genome and then looking up annotations. I align the reads with bowtie and then use cufflinks to quantify them (cufflinks has no way of distinguishing transcriptome alignments from genome alignments). However, I am not sure whether cufflinks can calculate FPKM correctly from such alignments.

  1. How do I incorporate gene length information (would I have to make a GTF file for that)?

  2. I read the cufflinks methodology given in the supplementary information of the paper, but I am not sure exactly how it approximates the values. Moreover, the statistical model it uses treats the transcriptome as a subset of the genome, and the equations are written accordingly. I think this question must have been asked numerous times, but how exactly does cufflinks calculate FPKM?

  3. Would it be better, in this case, to write my own script to calculate FPKM (counting one read pair as one fragment)?

  4. Should fragment lengths be normalized?
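For readers following along, the hand-written script in question 3 could look roughly like the Python sketch below. It is only an illustration under stated assumptions: transcript lengths are taken directly from the reference FASTA (so no GTF is needed), every fragment is assumed to map uniquely to one transcript, and the names `fasta_lengths`, `naive_fpkm`, `tx1` and `tx2` are made up for this example. It is not how cufflinks computes FPKM.

```python
def fasta_lengths(fasta_text):
    """Return {transcript_id: sequence length} from FASTA-formatted text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def naive_fpkm(fragment_counts, lengths):
    """FPKM = fragments * 1e9 / (transcript length in bases * total mapped fragments)."""
    total = sum(fragment_counts.values())
    if total == 0:
        return {t: 0.0 for t in lengths}
    return {
        t: fragment_counts.get(t, 0) * 1e9 / (lengths[t] * total)
        for t in lengths
    }


# Hypothetical toy input: two tiny "transcripts" and made-up fragment counts.
example_fasta = ">tx1 toy transcript\nACGT\nACGT\n>tx2\nAC\n"
example_lengths = fasta_lengths(example_fasta)
example_fpkm = naive_fpkm({"tx1": 30, "tx2": 10}, example_lengths)
print(example_lengths)
print(example_fpkm)
```

The sketch uses the raw transcript length. Tools such as cufflinks and eXpress model the fragment length distribution instead, which is one reason their values differ from a naive calculation like this.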

tophat cufflinks rna-seq bowtie • 7.4k views
modified 6.0 years ago by Rm (7.9k) • written 6.0 years ago by Bharat Iyengar (260)

I think the speed benefit will be minimal (if that is your only motivation). It is easier to run the pipeline with -G, which avoids assembly, and then continue with the DE analyses.

modified 6.0 years ago • written 6.0 years ago by Pavel Senin (1.9k)

Won't searching through the entire genome using annotations (GTF) take more time than searching just the annotated RNAs (~30,000)?

Does the -T/--transcriptome-only option in tophat require GTF annotation?

modified 6.0 years ago • written 6.0 years ago by Bharat Iyengar (260)

This will reduce the search space too. Good find! (If by annotation you mean functional annotation, I don't think TopHat cares.) In general, and again this is my opinion, the indexing/search problem is largely solved: once the index is built, lookup is constant time.

written 6.0 years ago by Pavel Senin (1.9k)
3
6.0 years ago by
Mikael Huss4.6k
Stockholm
Mikael Huss4.6k wrote:

If the primary motivation is to save time, you could look at STAR, which in my experience is so fast that you wouldn't need to think much about genome vs. transcriptome alignment.

If the primary motivation is to focus on the known transcripts (and maybe gain in sensitivity that way) I think eXpress is what you are looking for. It uses direct mappings to the transcriptome and uses a Cufflinks-like methodology to calculate FPKM (and counts). It's developed by the same group that made Cufflinks. RSEM is another option.

written 6.0 years ago by Mikael Huss (4.6k)

Thank you. STAR seems to have a great reputation. My motivation is not only speed but also to avoid inputting many files. Do you suggest STAR | eXpress?

(Just an additional query:) Is FPKM an exact synonym of RPKM for paired-end sequencing, or is it a statistical estimate?

modified 6.0 years ago • written 6.0 years ago by Bharat Iyengar (260)

I was actually thinking bowtie -> eXpress but I suppose STAR -> eXpress might be even better. Not sure what you mean by "inputting many files" - do you mean many reference sequences or many sample FASTQ files?

FPKM is an exact synonym for RPKM for paired end seq, at least according to my understanding.
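The practical difference is in what gets counted: for paired-end data, both mates of a pair count as one fragment. A minimal Python sketch of that counting step follows. The input format (one `(read_name, transcript_id)` tuple per aligned mate) and the function name `count_fragments` are assumptions for illustration, and the `/1` and `/2` suffix handling follows a common FASTQ naming convention that your files may not use.

```python
from collections import defaultdict


def count_fragments(alignments):
    """Count fragments per transcript, treating both mates of a pair as one fragment.

    alignments: iterable of (read_name, transcript_id), one entry per aligned mate.
    """
    fragments = defaultdict(set)
    for read_name, transcript_id in alignments:
        # Strip a trailing /1 or /2 mate suffix so both mates share one name.
        if read_name.endswith(("/1", "/2")):
            read_name = read_name[:-2]
        fragments[transcript_id].add(read_name)
    return {t: len(names) for t, names in fragments.items()}


# Hypothetical toy alignments: r1 and r3 have both mates aligned, r2 only one.
example_alignments = [
    ("r1/1", "tx1"), ("r1/2", "tx1"),
    ("r2/1", "tx1"),
    ("r3/1", "tx2"), ("r3/2", "tx2"),
]
example_counts = count_fragments(example_alignments)
print(example_counts)
```

Here tx1 has three aligned reads but only two fragments, so read-based and fragment-based counts diverge whenever one mate of a pair fails to align.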

written 6.0 years ago by Mikael Huss (4.6k)

I just read about eXpress. For now I think I'll stick to bowtie -> eXpress; I still have to see how STAR actually works, what its output modes are, etc. By "many files" I meant things like the genome index, GTF, transcriptome, selected sequences, etc. But bowtie -> eXpress would be a good combination: I can align to the transcriptome and pass the alignments to eXpress for quantification.

Regarding FPKM: I saw this term for the first time in the Cufflinks paper, and I am not sure whether it was used before. As I understand it, all these algorithms use expectation maximization to get a maximum-likelihood estimate of FPKM, whereas the RPKM calculation was pretty straightforward (though EM can be applied to that too).
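To make the expectation-maximization remark concrete, here is a toy Python sketch of EM for fragments that align to more than one transcript. It is not the Cufflinks or eXpress implementation. It assumes a fragment is equally likely to originate from any position on a transcript, and it ignores fragment-length distributions, bias correction and alignment quality. The names `em_abundances`, `theta_to_fpkm`, `compat`, `A` and `B` are hypothetical.

```python
def em_abundances(compat, lengths, n_iter=200):
    """Estimate the fraction of fragments originating from each transcript.

    compat:  list of tuples, each holding the transcript ids one fragment aligns to.
    lengths: {transcript_id: length in bases}.
    """
    ids = sorted(lengths)
    theta = {t: 1.0 / len(ids) for t in ids}  # start from a uniform guess
    n = len(compat)
    for _ in range(n_iter):
        expected = {t: 0.0 for t in ids}
        # E-step: split each fragment among its compatible transcripts in
        # proportion to theta[t] / length[t].
        for hits in compat:
            weights = {t: theta[t] / lengths[t] for t in hits}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: new fractions are the expected fragment counts over the total.
        theta = {t: expected[t] / n for t in ids}
    return theta


def theta_to_fpkm(theta, lengths):
    """Convert fragment fractions to FPKM: theta * 1e9 / transcript length."""
    return {t: theta[t] * 1e9 / lengths[t] for t in theta}


# Hypothetical toy data: 6 fragments unique to A, 2 unique to B, 4 ambiguous.
toy_compat = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
toy_lengths = {"A": 1000, "B": 1000}
toy_theta = em_abundances(toy_compat, toy_lengths)
toy_fpkm = theta_to_fpkm(toy_theta, toy_lengths)
print(toy_theta)
print(toy_fpkm)
```

In this toy case the ambiguous fragments end up split 3:1 in favour of A, mirroring the ratio of uniquely mapped fragments. A naive count would have to discard them or split them evenly.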

written 6.0 years ago by Bharat Iyengar (260)
1
6.0 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Instead of using just the transcriptome for alignment, it would be better to use the entire genome and align the reads with STAR. It is a very fast aligner for RNA-seq, and it can also take a GTF file.

STAR + cufflinks

written 6.0 years ago by Rm (7.9k)