Question: Cufflinks / Cuffdiff Output - How Are Tests Different?
2.5 years ago, Stephen (Nashville, TN) wrote:

I've got two different cellular fractions and I'm looking for genes that are alternatively spliced, alternatively polyadenylated, differentially expressed, etc. I'm running Cufflinks/Cuffdiff in Galaxy and I'm trying to grok what the different tests are doing.

Cuffdiff outputs 11 files (four FPKM tracking files and seven results files). Omitting the four FPKM tracking files, here are the seven results files, each with a snippet from the Cuffdiff documentation:

  1. Differential expression testing for transcripts: FPKM of one group vs FPKM of the other.
  2. Differential expression testing for genes: This sums the FPKM for transcripts sharing the same gene_id.
  3. Differential expression testing for coding sequence (CDS): This sums the FPKM of transcripts sharing a common p_id, which is the id of the coding sequence that this transcript contains.
  4. Differential expression testing for primary transcripts: This sums FPKM of transcripts sharing a common tss_id (transcription start site).
  5. Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.
  6. Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.
  7. Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.

My questions are:

  1. How are tests for differential splicing (#5) different from tests for differential coding output (#6)?
  2. How are the tests for differential gene expression summing over gene ids (#2) different from the tests for gene expression summing over CDS ids (#3)?
  3. Tests #5-7 above are testing something fundamentally different from the tests for differential gene expression (tests #1-4). I'd like a good explanation of how these groups of tests differ. E.g. how does #3 (differential expression over CDS) differ from #6 (differential coding output)?

Thanks very much in advance.

Modified 2.1 years ago • written 2.5 years ago by Stephen
2.3 years ago, Daniele Merico wrote:

Hello, I think I got most of this figured out:

How are tests for differential splicing (#5) different from tests for differential coding output (#6)?

  • Differential splicing is tested at the primary transcript level: you look at each group of transcripts that share the same TSS (more precisely, that are processed from the same pre-mRNA, so you are grouping splicing isoforms) and test whether the mix of splicing isoforms differs between samples. The statistical test is based on the Jensen-Shannon divergence, which measures the difference between two distributions, so it is sensitive when one (or more) splicing isoform makes up a larger share of that primary transcript's output in one sample than in the other. It is not sensitive to differences in the total output of the primary transcript (you will have to use the differential expression tests for that).

  • Differential CDS output looks at the different coding sequences you produce after splicing, i.e. the different combinations of exons you can produce; it is a proxy for protein output, but of course it does not take into account anything downstream of mRNA processing. The test is at the gene level, not at the primary transcript level, so it also factors in alternative TSS usage and alternative promoter usage. Also, if one primary transcript is differentially spliced but does not account for the lion's share of the gene's transcription output, it will scarcely affect the CDS output difference. However, transcripts that do not differ in their coding exons but only in their UTRs are not factored in (as there is no difference in coding sequence). The statistical test is again based on the Jensen-Shannon divergence, so it is not sensitive to differences in total gene transcription (you will have to use the differential expression tests for that).

I think this also sheds light on the other questions.

In summary: the differential splicing and CDS output tests look at differences in the distribution over the possible isoforms (spliced transcripts or coding sequences), whereas the differential expression tests look at differences in total level.
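
To make the distinction above concrete, here is a minimal sketch of the square root of the Jensen-Shannon divergence computed on relative isoform abundances. It only computes the statistic; how Cuffdiff assigns significance to it is not covered here, and the FPKM values and the function names (`shannon_entropy`, `js_distance`) are made up for illustration.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_distance(fpkm_a, fpkm_b):
    """Square root of the Jensen-Shannon divergence between the relative
    isoform abundances implied by two FPKM vectors."""
    # Normalise each condition to proportions; this discards the total level.
    total_a, total_b = sum(fpkm_a), sum(fpkm_b)
    p = [x / total_a for x in fpkm_a]
    q = [x / total_b for x in fpkm_b]
    # Mixture distribution halfway between the two conditions.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    jsd = shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))

# Hypothetical FPKMs for three isoforms of one primary transcript.
cond1 = [10.0, 30.0, 60.0]
scaled = [20.0, 60.0, 120.0]    # same isoform mix, doubled total output
switched = [60.0, 30.0, 10.0]   # same total output, different isoform mix

print(js_distance(cond1, scaled))    # essentially zero: the mix is unchanged
print(js_distance(cond1, switched))  # clearly positive: the mix has changed
```

Doubling every isoform leaves the statistic at zero, which is why a change in total level has to be picked up by the differential expression tests instead.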

ADD COMMENTlink written 2.3 years ago by Daniele Merico40

Thanks Daniele. Great answer.

Reply written 2.3 years ago by Stephen

I'm wondering whether Cufflinks supplies the percent representation of transcripts or CDSs within a gene (equivalent to the PSI field from MISO, or the IsoPct field from RSEM)? I understand that splicing.diff and cds.diff supply a differential test based on the differences in relative abundance within the gene, but what about the actual values?

In the same vein, what does √JS(x,y) actually mean in terms of the change in that transcript's role in the mix? The manual gives 0.22115 as an example of a significant value for the test statistic, but doesn't explain whether in this case the tested transcript has increased or decreased its share of the total gene expression, and by how much.

Reply written 5 months ago by kreitzman.maayan
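
This thread does not settle whether Cuffdiff reports such a value directly, but a within-gene share can be derived from per-isoform FPKMs. The sketch below assumes each row carries a transcript id, a `gene_id` and one FPKM column per sample; the column names (`tracking_id`, `q1_FPKM`, `q2_FPKM`), the helper `isoform_fractions` and all values are assumptions for illustration, not a confirmed file layout. Since √JS is non-negative and symmetric in its two arguments, it cannot encode direction by itself; comparing the per-sample fractions shows which isoform gained or lost share.

```python
from collections import defaultdict

def isoform_fractions(rows, sample_col):
    """Return {transcript_id: fraction of its gene's summed FPKM} for one sample."""
    gene_total = defaultdict(float)
    for r in rows:
        gene_total[r["gene_id"]] += r[sample_col]
    fractions = {}
    for r in rows:
        total = gene_total[r["gene_id"]]
        # A gene with zero total expression has no defined isoform mix.
        fractions[r["tracking_id"]] = r[sample_col] / total if total > 0 else float("nan")
    return fractions

# Hypothetical rows; column names are assumed, values are made up.
rows = [
    {"tracking_id": "T1", "gene_id": "G1", "q1_FPKM": 10.0, "q2_FPKM": 40.0},
    {"tracking_id": "T2", "gene_id": "G1", "q1_FPKM": 30.0, "q2_FPKM": 40.0},
]

frac1 = isoform_fractions(rows, "q1_FPKM")
frac2 = isoform_fractions(rows, "q2_FPKM")

# Change in share between samples: positive means the isoform gained share.
delta = {t: frac2[t] - frac1[t] for t in frac1}
print(frac1, frac2, delta)
```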
2.3 years ago, Stephen (Nashville, TN) wrote:

To answer a part of my own question, I drew out a schematic of what tests 1-4 are doing. Each is grouping transcripts at a different level.

  1. No grouping: each transcript is treated separately and tested independently.
  2. All are grouped at the gene level.
  3. Transcripts B and C are grouped because they share a common protein coding sequence.
  4. Transcripts A and C are grouped because they share a common primary transcript.

Image: http://i43.tinypic.com/35am6j7.jpg
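
The four groupings can also be sketched in a few lines of code. The transcripts below mirror the schematic (B and C share a CDS, A and C share a primary transcript); the ids, the FPKM values and the helper `sum_fpkm_by` are hypothetical, and this only illustrates the summing step described in the documentation snippets, not Cuffdiff's implementation.

```python
from collections import defaultdict

# Hypothetical transcripts of one gene; FPKM values are made up.
transcripts = [
    {"id": "A", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P1", "fpkm": 5.0},
    {"id": "B", "gene_id": "G1", "tss_id": "TSS2", "p_id": "P2", "fpkm": 3.0},
    {"id": "C", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P2", "fpkm": 2.0},
]

def sum_fpkm_by(rows, key):
    """Sum FPKM over all transcripts that share the same value of `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r["fpkm"]
    return dict(totals)

by_transcript = sum_fpkm_by(transcripts, "id")       # test 1: no grouping
by_gene = sum_fpkm_by(transcripts, "gene_id")        # test 2: A + B + C
by_cds = sum_fpkm_by(transcripts, "p_id")            # test 3: B + C share P2
by_primary = sum_fpkm_by(transcripts, "tss_id")      # test 4: A + C share TSS1

print(by_transcript, by_gene, by_cds, by_primary)
```

Each differential expression test then compares these summed values between the two samples at its own grouping level.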


Written 2.3 years ago by Stephen
2.5 years ago, Flashton wrote:

Hi Stephen,

I'm afraid I can't help you with your question (other than to suggest there might be two streams of analysis, one for ORFs and another for CDSs).

However, I was hoping you could shed some light on why you used Cuffdiff for your analysis rather than DESeq, edgeR or baySeq. I'm about to embark on an RNA-seq analysis project and any input you might have on the relative merits of these programs would be greatly appreciated.

Many thanks,

Phil

Written 2.5 years ago by Flashton

Cuffdiff was just the first thing I tried - I was helping someone with an analysis where all the data was already in Galaxy, and Cuffdiff was easy to run. I'm looking at DESeq now as integrated into the ExpressionPlot suite (expressionplot.com), which has some nice functionality.

Reply written 2.5 years ago by Stephen
2.4 years ago, Josh wrote:

I can't help with your analysis, but I have been using ExpressionPlot on our local server for several months and really like it. Just for what it's worth.

Written 2.4 years ago by Josh