Cufflinks vs. Cuffcompare - number of assembled transcripts
2
4
Entering edit mode
9.7 years ago
wstfljs ▴ 100

Hi all,

I'm having some difficulties in understanding how cufflinks and cuffcompare count the assembled transcripts. When I extract the transcript sequences into a FASTA file, in case of my data I get 15,127 sequences. The headers in FASTA file look like this, e.g:

>rna_NCLIV_068720-1 gene=NCLIV_068720
>rna_NCLIV_068730-1 gene=NCLIV_068730
>rna_NCLIV_068740-1 gene=NCLIV_068740
>rna_NCLIV_068750-1 gene=NCLIV_068750
>rna_NCLIV_068770-1 gene=NCLIV_068770
>CUFF.1.1 gene=CUFF.1
>rna_NCLIV_068790-1 gene=CUFF.1
>rna_NCLIV_068800-1 gene=NCLIV_068800
>rna_NCLIV_068810-1 gene=NCLIV_068810
>CUFF.2.1 gene=CUFF.2

After simple calculations I can say that these 15,127 transcripts are product of 8,087 genes.

Now when I run cuffcompare against the reference genome annotation, this is what I get:

#= Summary for dataset: ../cufflinks/cufflinks_out/transcripts.gtf :
#     Query mRNAs :   12904 in    7986 loci  (9484 multi-exon transcripts)
#            (3050 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs :    7265 in    7265 loci  (5738 multi-exon)
# Super-loci w/ reference transcripts:     6832
#--------------------|   Sn   |  Sp   |  fSn |  fSp  
        Base level:      99.7     52.2      -       -
        Exon level:      90.7     70.2     91.1     70.5
      Intron level:      99.9     86.4    100.0     86.5
Intron chain level:      90.4     54.7    100.0     65.8
  Transcript level:      70.4     39.6     70.4     39.6
       Locus level:      92.4     81.5     99.9     85.8

     Matching intron chains:    5188
              Matching loci:    6714

          Missed exons:      53/43090    (  0.1%)
           Novel exons:    4543/55662    (  8.2%)
        Missed introns:      45/35825    (  0.1%)
         Novel introns:    2455/41434    (  5.9%)
           Missed loci:       8/7265    (  0.1%)
            Novel loci:    1128/7986    ( 14.1%)

 Total union super-loci across all input datasets: 7986

The number of query mRNAs is estimated to be 12,904. How does this relate to my previous count of transcripts (15,127)?

The summary of cufflinks "class codes" also gives the same number - 12,904.

u 752
i 93
j 3737
x 135
o 218
c 2
p 307
= 7295
e 365

Any clue how these numbers (12,904 vs 15,127) should be interpreted?

cufflinks assembly RNA-Seq Assembly cuffcompare • 5.2k views
ADD COMMENT
1
Entering edit mode

Doesn't the GTF file you used to produce the fasta file contain the class codes? If so, just look and see if you have genes lacking codes and then have a look at those genes. I would guess that these are mRNAs with no supporting reads.

ADD REPLY
3
Entering edit mode
9.7 years ago
Manvendra Singh ★ 2.2k

As per my understanding

  • "u" is novel intergenic trancscript, (you may check that whether some of them could be expressing retro-viral elements).
  • "i" is intronic transcript (there are some intronic lncRNA and repetitive elements get expressed in cell-type specific manner)
  • "j" is potential novel isoform (alternately spliced, since has no splice site match, you might check the coding potential)
  • "x" is cis-antisense transcript (exonic overlap but opposite strand)
  • "o" is "other overlap" - that is an exonic overlap with the reference transcript that doesn't fall in any other, "more interesting" overlap categories - e.g. no splice sites match ('j' class), no containment ('c' code) etc. These 'o' codes could be assigned, for example, to assembled single-exon fragments that happen to overlap one of the terminal exons of a reference transcript (but not enough to make it "contained")
  • "p" is "polymerase run" - it's supposed to signal that the relative positioning of the transcripts to the reference transcript suggests a potential polymerase read-through downstream of the 3' UTRs of the reference. ( I would ignore it)
  • "=" is complete match.
  • "e" is single exon transfag with some basepairs of intron retention ( could be premRNA contaminant)

HTH

ADD COMMENT
0
Entering edit mode

I know what these classes mean. As I wrote, if you sum up the number of transcripts in each class, the result you get is 12,904. How does this compare to 15,127 that I get in the FASTA file?

ADD REPLY
0
Entering edit mode

Yah, In that case, class code "=" is your complete match. As Devon wrote that Probably, you need to have a look at the genes which are not represented by any class codes, and then you could visualize bam for same set of genes to check whether they have read coverage.

ADD REPLY
1
Entering edit mode
9.7 years ago
wstfljs ▴ 100

So after some thinking I decided to run cuffcompare in verbose mode (-V). This is what I get:

...
 transfrag rna_NCLIV_067460-1 discarded (made redundant by CUFF.7105.1)
 transfrag rna_NCLIV_067470-1 discarded (made redundant by CUFF.7106.1)
 transfrag CUFF.7109.2 discarded (made redundant by rna_NCLIV_067510-1)
 transfrag rna_NCLIV_067410-1 discarded (made redundant by CUFF.7110.1)
 transfrag rna_NCLIV_067420-1 discarded (made redundant by CUFF.7111.1)
 transfrag CUFF.7112.2 discarded (made redundant by CUFF.7112.3)
 transfrag rna_NCLIV_067540-1 discarded (made redundant by CUFF.7114.1)
 transfrag rna_NCLIV_067550-1 discarded (made redundant by CUFF.7130.1)
 transfrag rna_NCLIV_067740-1 discarded (made redundant by CUFF.7132.1)
...
 Kept 12904 transfrags out of 15127
 2223 redundant cufflinks transfrags discarded.

Seems like I have found where the "problem" lies. The question is, what does cuffcompare understand by "redundant"? I could not find any information on the Internet. I compared few transcripts and they differ e.g. by number of exons. Shouldn't they be treated as isoforms and not discarded? Below I put example of kept and removed (redundant) trancripts.

transfrag CUFF.7112.2 discarded (made redundant by CUFF.7112.3)
FR823393    Cufflinks    transcript    5675649    5679730    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074"; full_read_support "yes";
FR823393    Cufflinks    exon    5675649    5676999    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; exon_number "1"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074";
FR823393    Cufflinks    exon    5677332    5677511    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; exon_number "2"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074";
FR823393    Cufflinks    exon    5677887    5677979    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; exon_number "3"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074";
FR823393    Cufflinks    exon    5678386    5678494    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; exon_number "4"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074";
FR823393    Cufflinks    exon    5679005    5679159    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; exon_number "5"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074";
FR823393    Cufflinks    exon    5679632    5679730    1000    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.2"; exon_number "6"; FPKM "8.7729148666"; frac "0.469270"; conf_lo "7.971517"; conf_hi "9.574313"; cov "84.793074";
FR823393    Cufflinks    transcript    5675649    5680347    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557"; full_read_support "yes";
FR823393    Cufflinks    exon    5675649    5676999    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "1"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
FR823393    Cufflinks    exon    5677332    5677511    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "2"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
FR823393    Cufflinks    exon    5677887    5677979    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "3"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
FR823393    Cufflinks    exon    5678386    5678494    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "4"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
FR823393    Cufflinks    exon    5679005    5679159    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "5"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
FR823393    Cufflinks    exon    5679632    5679821    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "6"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
FR823393    Cufflinks    exon    5680273    5680347    321    +    .    gene_id "CUFF.7112"; transcript_id "CUFF.7112.3"; exon_number "7"; FPKM "2.8218974975"; frac "0.164859"; conf_lo "2.378087"; conf_hi "3.265708"; cov "27.274557";
ADD COMMENT
0
Entering edit mode

The -e and -d parameters in cuffcompare give the explanation of "redundant".

-e max. distance (range) allowed from free ends of terminal exons of reference
   transcripts when assessing exon accuracy (100)
-d max. distance (range) for grouping transcript start sites (100)
ADD REPLY

Login before adding your answer.

Traffic: 1529 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6