Question: Get Normalized Read Count Sample Matrix Tophat/Cufflinks
6.6 years ago by
Irsan7.0k wrote:

I have sequenced RNA from 12 human samples: 6 tumor samples (3 from tumor group A and 3 from tumor group B) and 6 matched non-tumor samples. Using tophat, I aligned the reads to hg19, and with cufflinks I built transcript models for each sample. I would like to extract the FPKM values for each sample in matrix format so that I can do hierarchical clustering and principal component analysis on all 12 samples.

The problem is that cufflinks assembles different transcripts for each sample, so I cannot simply paste the cufflinks output files together to get the matrix. One idea I had was to take a reference transcript file and use bedtools/bedops to look for intersecting transcripts in all 12 samples. However, I hope I am overlooking some functionality in cufflinks/cuffcompare/cuffdiff that would get this done more easily.

6.6 years ago by
biopaw30 wrote:

The CummeRbund R package may be what you need. A colleague of mine uses it (I don't). It continues the workflow: it creates a database from the Tophat/Cuffdiff outputs and implements several plotting functions. With the tophat suite the workflow is more tightly controlled (hence CummeRbund), which may be great for the casual user.

That said, you may be better off using STAR (if you have a computer with at least 36G RAM; you need 16G for the human index), HTSeq for counting, and edgeR (there are other good R packages for RNA-seq as well). Your RNA-seq alignment would finish in a few minutes instead of a few hours. You trade the convenience of the more rigid workflow for a more flexible one in R (more work), but you can take advantage of the other packages in Bioconductor.

With edgeR, you could model the A vs B effect directly by creating a contrast A[T-NT] - B[T-NT], whereas in Cuffdiff it seems you can only model direct comparisons, T vs NT for example. You can then make any plot you like using the Bioconductor tools in R, including hierarchical clustering, PCA, MDS, etc.; ggplot is a nice plotting tool in R.
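
To make the A[T-NT] - B[T-NT] contrast concrete, here is a minimal sketch of the arithmetic only, for a single gene. It is not how edgeR fits the model (edgeR works on counts with a negative binomial GLM); the group labels, the function name `interaction_contrast`, and the log2 values are made up for illustration.

```python
from statistics import mean


def interaction_contrast(log_expr):
    """Return (A_T - A_NT) - (B_T - B_NT) on mean log2 expression.

    log_expr maps the four hypothetical group labels to lists of
    log2 expression values for one gene.
    """
    effect_a = mean(log_expr["A_T"]) - mean(log_expr["A_NT"])
    effect_b = mean(log_expr["B_T"]) - mean(log_expr["B_NT"])
    return effect_a - effect_b


# Made-up log2 values for one gene, three samples per group.
example = {
    "A_T": [6.0, 6.5, 7.0],
    "A_NT": [4.0, 4.5, 5.0],
    "B_T": [5.0, 5.5, 6.0],
    "B_NT": [4.5, 5.0, 5.5],
}
print(interaction_contrast(example))
```

A value far from zero suggests the tumor vs non-tumor change differs between group A and group B for that gene; a real analysis would also need a significance test.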



Thanks, I will try STAR + HTSeq + edgeR alongside tophat + cufflinks + cuffdiff.

Reply written 6.6 years ago by Irsan7.0k
6.6 years ago by
Ryan Thompson3.4k (TSRI, La Jolla, CA) wrote:

You probably want to use cuffmerge to combine all the individual sample assemblies, and then re-run each sample using the merged assembly as a reference.
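
A rough sketch of the two steps described above, written as Python helpers that only build the command lines (they do not run anything). The helper names, sample names, and paths are hypothetical; `-o`, `-g`, `-p`, and `--GTF` are standard cuffmerge/cufflinks options, but check them against the version you have installed.

```python
def cuffmerge_cmd(assembly_list, out_dir="merged_asm", ref_gtf=None):
    """Build the cuffmerge command that merges per-sample assemblies.

    assembly_list is a text file listing one per-sample transcripts.gtf
    path per line.
    """
    cmd = ["cuffmerge", "-o", out_dir]
    if ref_gtf:
        # Optional reference annotation to merge in alongside the assemblies.
        cmd += ["-g", ref_gtf]
    cmd.append(assembly_list)
    return cmd


def cufflinks_requant_cmd(bam, merged_gtf, out_dir, threads=4):
    """Build the cufflinks command that quantifies one sample against
    the merged annotation.

    --GTF limits estimation to the supplied transcripts, so every
    sample reports the same transcript IDs.
    """
    return ["cufflinks", "--GTF", merged_gtf,
            "-p", str(threads), "-o", out_dir, bam]


if __name__ == "__main__":
    # Hypothetical sample names and paths, for illustration only.
    samples = {
        "tumorA_1": "tumorA_1/accepted_hits.bam",
        "normal_1": "normal_1/accepted_hits.bam",
    }
    merged_gtf = "merged_asm/transcripts.gtf"  # output name may differ by version
    print(" ".join(cuffmerge_cmd("assemblies.txt")))
    for name, bam in samples.items():
        print(" ".join(cufflinks_requant_cmd(bam, merged_gtf, "requant_" + name)))
```

The printed commands can be pasted into a shell or passed to `subprocess.run` once the paths match your own layout.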


That sounds like something I am looking for indeed :-)

Reply written 6.6 years ago by Irsan7.0k

Thanks Ryan, I ran cufflinks with --GTF (not --GTF-guide) with the transcripts.gtf from cuffmerge and it worked like a charm :-)

Reply written 6.6 years ago by Irsan7.0k
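
Once every sample has been quantified against the same merged GTF, the per-sample tables share transcript IDs and can be joined into the matrix asked for in the question. Below is a minimal sketch using only the Python standard library. It assumes each sample directory contains a tab-separated `isoforms.fpkm_tracking` file with `tracking_id` and `FPKM` columns, as Cufflinks writes them; the function names and the paths in the usage comment are hypothetical.

```python
import csv


def read_fpkm(path):
    """Return {tracking_id: FPKM} from one Cufflinks *.fpkm_tracking file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["tracking_id"]: float(row["FPKM"]) for row in reader}


def build_fpkm_matrix(sample_files):
    """Join per-sample FPKM tables into (ids, sample_names, rows).

    sample_files maps a sample name to its fpkm_tracking path. Only IDs
    present in every sample are kept, so the matrix has no missing cells.
    """
    names = list(sample_files)
    tables = [read_fpkm(sample_files[name]) for name in names]
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    ids = sorted(shared)
    rows = [[table[tid] for table in tables] for tid in ids]
    return ids, names, rows


def write_matrix(path, ids, names, rows):
    """Write the matrix as a tab-separated file, one transcript per row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["tracking_id"] + names)
        for tid, row in zip(ids, rows):
            writer.writerow([tid] + row)


# Usage with hypothetical paths:
# files = {"tumorA_1": "requant_tumorA_1/isoforms.fpkm_tracking",
#          "normal_1": "requant_normal_1/isoforms.fpkm_tracking"}
# ids, names, rows = build_fpkm_matrix(files)
# write_matrix("fpkm_matrix.tsv", ids, names, rows)
```

The resulting file can be loaded in R or Python for clustering and PCA; a log transform such as log2(FPKM + 1) is commonly applied first.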