Cuffdiff (weird) output

Hi all,

Recently I got some results from the Cuffdiff pipeline (reads aligned with TopHat, transcripts assembled with Cufflinks, assemblies from all samples merged with Cuffcompare, abundances quantified with Cuffquant, and differential expression called with Cuffdiff). I ran two independent datasets in parallel:

  1. In vitro data with 4 treatments and 2 replicates each, 2 time points (8 samples)
  2. In vivo data with 2 treatments and 7 samples each (14 samples)

Within the outputs, I was most interested in differentially expressed novel isoforms (those labeled "j" in the merged transcript file). I used IGV to check the alignments and the assembly coordinates of those new isoforms, but it turns out that (1) the nearest_ref does not seem to be the nearest reference at all (http://goo.gl/rPxrJd), and (2) when I checked the .tracking file generated by Cuffmerge to see which samples had this isoform, only one sample, from a different time point, had reads for it. That puzzles me: how can those reads be reassigned downstream to an isoform that was present in only 1 of 8 samples and, worse, have it detected as DE?! Also, judging from the figure, shouldn't known isoform 1 be the nearest_ref for the novel isoform, since known isoform 2 has two extra exons?

The same thing happened with the second dataset: some novel isoforms were called DE even though they were present in only one of 14 samples. So am I missing something, or interpreting the data the wrong way? Should I filter the merged transcript file to exclude these weird assemblies? Or, in case Cufflinks/Cuffdiff really is flawed here, can anyone suggest another tool?
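
(To be concrete, the kind of filter I mean would be something along these lines. This is only a rough sketch assuming the usual Cuffcompare/Cuffmerge .tracking layout, and that the transcript_id values in the merged GTF are the same TCONS IDs as in the tracking file; the file names and threshold are placeholders.)

    import re

    # Rough sketch, not production code. Assumes the usual .tracking layout:
    # col 1 = transfrag ID (TCONS_*), col 4 = class code,
    # cols 5+ = one entry per input sample ("-" when the transfrag is absent).
    # "merged.tracking" and "merged.gtf" are placeholder file names.

    MIN_SAMPLES = 3  # require a novel ("j") transfrag to be assembled in >= 3 samples

    keep = set()
    with open("merged.tracking") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            transfrag, class_code = fields[0], fields[3]
            n_samples = sum(1 for entry in fields[4:] if entry != "-")
            # keep known transcripts, and novel ones with enough sample support
            if class_code != "j" or n_samples >= MIN_SAMPLES:
                keep.add(transfrag)

    tid_re = re.compile(r'transcript_id "([^"]+)"')
    with open("merged.gtf") as gtf, open("merged.filtered.gtf", "w") as out:
        for line in gtf:
            match = tid_re.search(line)
            if match and match.group(1) in keep:
                out.write(line)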

Many thanks!

de-novo novel-isoforms RNA-Seq cufflinks cuffdiff

I've been down the Tuxedo rabbit hole, and to be honest, it's not worth it. As for the "novel detection" claim it makes: most of the time when I go and check the alignments under a reported novel transcript, it's complete rubbish. On top of that, I don't really agree with the way Tuxedo assigns XLOC codes to loci and collapses all reads in that region to do gene-level quantification. Here's what I would suggest you do:

Gene level - Use TopHat (or your favourite aligner) to align the reads for each sample -> run the BAM files through htseq-count -> use the counts in DESeq2 for your gene-level analysis.
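
A minimal sketch of the middle step, gluing the per-sample htseq-count outputs into one counts table you can hand to DESeq2 (the sample names and file paths here are made up):

    import pandas as pd

    # Hypothetical sample -> htseq-count output mapping; adjust to your layout.
    samples = {
        "treated_1": "counts/treated_1.txt",
        "treated_2": "counts/treated_2.txt",
        "control_1": "counts/control_1.txt",
        "control_2": "counts/control_2.txt",
    }

    cols = []
    for name, path in samples.items():
        # htseq-count writes two tab-separated columns: gene_id <tab> count
        s = pd.read_csv(path, sep="\t", header=None, index_col=0).iloc[:, 0]
        s.name = name
        cols.append(s)

    counts = pd.concat(cols, axis=1)

    # Drop htseq-count's summary rows (__no_feature, __ambiguous, ...)
    counts = counts[~counts.index.str.startswith("__")]

    # Write a matrix that DESeq2 (in R) can read alongside a sample table
    counts.to_csv("gene_counts.tsv", sep="\t")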

Transcript level - Use kallisto to quantify your raw FASTQ files at the transcript level (pretty damn quick) -> run the output through sleuth to do differential transcript expression.
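
Something like this for the kallisto step - a sketch with placeholder sample names and paths; the -b 100 bootstraps are what sleuth uses downstream to model technical variance:

    import os
    import subprocess

    # Hypothetical paired-end samples; adjust names/paths to your data.
    samples = ["treated_1", "treated_2", "control_1", "control_2"]

    # Build the transcriptome index once (transcripts.fa is a placeholder name).
    subprocess.run(
        ["kallisto", "index", "-i", "transcripts.idx", "transcripts.fa"],
        check=True,
    )

    for s in samples:
        os.makedirs(f"kallisto/{s}", exist_ok=True)
        # -b 100 bootstraps so sleuth can estimate technical variance.
        subprocess.run(
            [
                "kallisto", "quant",
                "-i", "transcripts.idx",
                "-o", f"kallisto/{s}",
                "-b", "100",
                f"fastq/{s}_1.fastq.gz", f"fastq/{s}_2.fastq.gz",
            ],
            check=True,
        )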

"But I'm super interested in Novel stuff" - Detecting novel transcripts is incredibly difficult with short reads, I'd recommend actually looking at the Junctions.bed file from Tophat, and try to find novel junctions that tophat finds, that are supported by x number of reads.


Hi Andrew, thanks for your reply! I was wondering, though, whether it would be possible to use the assembled transcriptome as the reference GTF for RSEM, for example. Or is the assembly itself just not reliable at all...?
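
(Mechanically it seems straightforward - something like the sketch below, with placeholder file names, since rsem-prepare-reference accepts a GTF via --gtf - but whether the merged assembly is trustworthy enough to be that reference is exactly my question.)

    import subprocess

    # Build an RSEM reference from the merged assembly. "genome.fa", "merged.gtf"
    # and "merged_ref" are placeholder names; --bowtie builds the Bowtie indices
    # that rsem-calculate-expression's default aligner needs.
    subprocess.run(
        ["rsem-prepare-reference", "--gtf", "merged.gtf", "--bowtie",
         "genome.fa", "merged_ref"],
        check=True,
    )

    # Quantify one paired-end sample against that reference (placeholder paths).
    subprocess.run(
        ["rsem-calculate-expression", "--paired-end",
         "fastq/sample_1.fastq", "fastq/sample_2.fastq",
         "merged_ref", "sample"],
        check=True,
    )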


I've had the same thought before, but I honestly think the novel calls that Cufflinks makes aren't worth it - to be fair to Cufflinks, it's a very difficult problem, and there aren't many alternatives out there. Biologically you'll get more out of your experiment by reliably identifying statistically robust results against the known annotation. Remember that if you do novel discovery with Cufflinks and it hands you another X thousand potential transcripts, that's another X thousand tests in your differential expression analysis, which makes it harder for any feature to pass significance thresholds once you correct for multiple testing.
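
To put a toy number on that last point (simulated p-values only, nothing to do with any real dataset), here's what adding thousands of extra null tests does to a Benjamini-Hochberg cutoff:

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)

    # Toy setup: 200 genuinely DE features with p-values spread from 1e-6 to 1e-2,
    # mixed with 20,000 null (not DE) annotated features.
    true_p = np.logspace(-6, -2, 200)
    annotated = np.concatenate([true_p, rng.uniform(0, 1, 20_000)])

    # Now pretend an assembler handed us another 20,000 "novel" transcripts,
    # none of which are really DE, and test those as well.
    with_novel = np.concatenate([annotated, rng.uniform(0, 1, 20_000)])

    for label, pvals in [("annotation only", annotated),
                         ("annotation + novel", with_novel)]:
        reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
        # The 200 genuinely DE features sit at the front of each array.
        print(f"{label}: {reject[:200].sum()} of the 200 true positives "
              f"survive BH at FDR 0.05 ({len(pvals)} tests in total)")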
