Question

Cuffcompare output: thousands of novel transcripts

7.9 years ago

dec986 ▴ 380

Hello,

I am using cuffcompare to identify novel transcripts. However, I am suspicious that many of these "novel" transcripts may be junk/noise, because I was expecting something like 5-10 novel transcripts, but am getting thousands. Many transcripts even show up without chromosomes in the combined GTF!

My output:

# Cuffcompare v2.2.1 | Command line was:
#cuffcompare -s /illumina/runs/RNASeq/Gencode27/GRCh38.p10.genome.fa -r /illumina/runs/RNASeq/Gencode27/gencode.v27.annotation.gtf transcripts.gtf
#

#= Summary for dataset: transcripts.gtf :
#     Query mRNAs :  212216 in   66959 loci  (173199 multi-exon transcripts)
#            (20867 multi-transcript loci, ~3.2 transcripts per locus)
# Reference mRNAs :  198869 in   54870 loci  (174194 multi-exon)
# Super-loci w/ reference transcripts:    47911
#--------------------|   Sn   |  Sp   |  fSn |  fSp  
        Base level:      99.7    92.1     -       - 
        Exon level:     150.1   146.9   100.0   100.0
      Intron level:      99.4    98.7   100.0   100.0
Intron chain level:      96.0    96.6   100.0   100.0
  Transcript level:      96.1    90.0    87.7    82.2
       Locus level:     100.0    81.8   100.0    81.8

     Matching intron chains:  167245
              Matching loci:   54851

          Missed exons:      12/573839  (  0.0%)
           Novel exons:   13301/586450  (  2.3%)
        Missed introns:    1783/352804  (  0.5%)
         Novel introns:     155/355547  (  0.0%)
           Missed loci:       6/54870   (  0.0%)
            Novel loci:   12140/66959   ( 18.1%)

 Total union super-loci across all input datasets: 66949

I have also run cuffcompare on a public data set with the same options, but for some reason the per-dataset summary shown above is not reported there:

# Cuffcompare v2.2.1 | Command line was:
#cuffcompare -s /illumina/runs/RNASeq/Gencode27/GRCh38.p10.genome.fa -r /illumina/runs/RNASeq/Gencode27/gencode.v27.annotation.gtf SRR5335744/transcripts.gtf SRR5335745/transcripts.gtf SRR5335746/transcripts.gtf SRR5335747/transcripts.gtf SRR5335748/transcripts.gtf SRR5335749/transcripts.gtf SRR5335750/transcripts.gtf SRR5335751/transcripts.gtf SRR5335752/transcripts.gtf SRR5335753/transcripts.gtf SRR5335754/transcripts.gtf SRR5335755/transcripts.gtf SRR5335756/transcripts.gtf SRR5335757/transcripts.gtf SRR5335758/transcripts.gtf SRR5335759/transcripts.gtf SRR5335760/transcripts.gtf SRR5335761/transcripts.gtf SRR5335762/transcripts.gtf SRR5335763/transcripts.gtf SRR5335764/transcripts.gtf SRR5335765/transcripts.gtf SRR5335766/transcripts.gtf SRR5335767/transcripts.gtf SRR5335768/transcripts.gtf SRR5335769/transcripts.gtf
#

 Total union super-loci across all input datasets: 72350 
  (23997 multi-transcript, ~5.0 transcripts per locus)

Are these results typical for cuffcompare in RNA-Seq?

cuffcompare RNA-Seq • 3.1k views

updated 7.9 years ago by Kevin Blighe 89k • written 7.9 years ago by dec986 ▴ 380

score 3 · Accepted Answer · 2018-01-02

Compare the 'novel exons' and 'novel introns' figures: 13,301 novel exons but only 155 novel introns. From that gap one can infer that the majority of the novel transcripts are single-exon transcripts and therefore most likely non-coding RNAs. It follows, then, that many of these could indeed be just transcriptional 'noise'. You should look through your transcripts.gtf file(s) and check the read support (FPKM and coverage) for these transcripts. Some may be genuine.
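
To check that inference directly, here is a rough sketch that counts single-exon versus multi-exon transcripts among the novel ones in a cuffcompare combined GTF. It assumes the exon rows carry `transcript_id` and `class_code` attributes and that class code `u` (unknown, intergenic) is what you want to treat as "novel". The function name and the `novel_codes` parameter are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_novel_by_exon_number(gtf_lines, novel_codes=("u",)):
    """Count single- and multi-exon transcripts whose class_code is in novel_codes.

    gtf_lines: any iterable of GTF lines (an open file works).
    novel_codes: class codes treated as novel (assumption: "u" only).
    """
    exon_counts = defaultdict(int)
    class_codes = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        exon_counts[tid] += 1
        # Remember the class code from whichever exon row carries it.
        if "class_code" in attrs:
            class_codes[tid] = attrs["class_code"]

    result = {"single_exon": 0, "multi_exon": 0}
    for tid, n_exons in exon_counts.items():
        if class_codes.get(tid) in novel_codes:
            key = "single_exon" if n_exons == 1 else "multi_exon"
            result[key] += 1
    return result
```

You would call it with something like `count_novel_by_exon_number(open("cuffcmp.combined.gtf"))`, where the file name assumes the default cuffcompare output prefix. If the single-exon count dominates, that supports the reading above.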

Keep in mind, too, the following: transcriptional 'noise' may be genuine RNA that was in your sample and that was transcribed by the transcriptional machinery, but it may very well have no function other than occupying volume. The genome is 'fluid' and transcription is a constant process in which polymerases and transcription factors (and other proteins and RNAs) bind to various regions of DNA and initiate transcription to varying degrees. Different SNPs and other genetic factors can help to modulate these activities too.

You can most likely avoid these transcripts by going back to the Cufflinks step and requiring each assembled transcript to be supported by more reads. I think that the default minimum is 10 fragments (the --min-frags-per-transfrag option).
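
If re-running Cufflinks is not convenient, a cruder alternative is to filter the existing transcripts.gtf after the fact. Note that this is a post-hoc filter on the reported abundance, not the assembly-time read threshold described above. The sketch assumes the rows carry `transcript_id` and `FPKM` attributes, as Cufflinks output normally does. The function name and the `min_fpkm` cutoff of 1.0 are arbitrary placeholders, not recommended values.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def filter_by_fpkm(gtf_lines, min_fpkm=1.0):
    """Return the GTF lines whose transcript has FPKM >= min_fpkm.

    min_fpkm is a hypothetical cutoff; pick one that suits your data.
    Comment lines are kept; rows without a usable FPKM are dropped.
    """
    lines = list(gtf_lines)

    # First pass: record the FPKM reported for each transcript.
    fpkm = {}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs and "FPKM" in attrs:
            fpkm[attrs["transcript_id"]] = float(attrs["FPKM"])

    # Second pass: keep every row belonging to a transcript that passes.
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid in fpkm and fpkm[tid] >= min_fpkm:
            kept.append(line)
    return kept
```

Writing the returned lines to a new file and passing that file to cuffcompare should show how much of the 'novel' set survives a given cutoff.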

Finally, keep in mind that HISAT2 / StringTie have replaced TopHat2 / Cufflinks.

Kevin