Question

What is the purpose of running Cufflinks without a reference annotation?

1


7.8 years ago

BioinfGuru ★ 1.7k

My task is to repeat the data analysis of RNA-seq data as presented in a journal article, using the TopHat/Cufflinks pipeline.

For simplicity, I'll just mention the 4 controls.

The authors run Cufflinks without a reference annotation on each control "to detect possible novel transcripts", then run cuffmerge on the results, and then say they run Cufflinks again using the merged transcripts.gtf as the reference annotation. It seems overcomplicated.

Cufflinks requires a BAM file as input, but cuffmerge does not output a BAM file, so the only way I can see they did it is by re-running Cufflinks on every sample a second time (a waste of time?), this time using the cuffmerge output as the reference annotation. This would mean re-running cuffmerge afterward as well.

Surely "to detect possible novel transcripts" doesn't require running Cufflinks on everything twice. I mean, isn't this the whole point of Cufflinks?

Thanks in advance. Kenneth

cufflinks reference annotation • 3.5k views


3


Hi, I don't really see what your question is here. You answered "What is the purpose of running Cufflinks without a reference annotation?" yourself with the line "to detect possible novel transcripts", so it's not clear to me what you are asking.

Also, a link to the original article would help in commenting on this.

Reply by Carlo Yague 8.7k, 7.8 years ago

0


7.8 years ago

BioinfGuru ★ 1.7k

Thank you all for the replies.

The paper: http://www.nature.com/nbt/journal/v32/n9/full/nbt.3001.html

The pipeline: https://s31.postimg.org/tkcichqkb/pipeline_5.png

Our group has pressed onward, so: we completed the first Cufflinks run for each sample, then cuffmerge, and have attempted the second Cufflinks run (using the transcripts.gtf file from cuffmerge as the reference annotation) with the command:

cufflinks -g [path]/transcripts.gtf -b [path]/genome.fa -u --library-type fr-unstranded [path]/accepted_hits.bam

Is there any disagreement with the command? Should -g be upper case -G? Should we remove the -b option?

The runs started fine (we have 4 computers available to take 2 runs each); however, they have all now failed with the following error:

Error: duplicate GFF ID 'CUFF.4.1' encountered! https://s32.postimg.org/7p43nd04l/Sup2.jpg

Also, one run that is still going has been stuck at the same point for over an hour: https://s31.postimg.org/yzbl2fu2z/Lee.jpg

Again, thank you in advance. Kenneth.


0


You should probably post this as a separate question.

Reply by Jason H ▴ 20, 7.8 years ago

0


7.8 years ago

BioinfGuru ★ 1.7k

Adding an annotation file during cuffmerge resolved the issue.
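For anyone landing here later: Cuffmerge takes a reference annotation through its -g option. The exact command used in this thread was not posted, so the following is only a sketch under assumed file names (genes.gtf, genome.fa, assemblies.txt and the merged_asm output directory are all placeholders). It prints the command rather than running it.

```shell
#!/bin/sh
# All file names below are hypothetical placeholders, not the poster's paths.
REF_GTF="genes.gtf"          # reference annotation passed to cuffmerge -g
GENOME="genome.fa"           # reference sequence passed to cuffmerge -s
MANIFEST="assemblies.txt"    # text file listing one per-sample GTF path per line

# Build the command as positional parameters so it can be inspected first.
set -- cuffmerge -g "$REF_GTF" -s "$GENOME" -o merged_asm "$MANIFEST"

# Dry run: show what would be executed.
echo "$@"

# To execute for real, replace the echo line with:  "$@"
```

Keeping the command as a dry run makes it easy to check the paths before committing to a long job.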


score 3 · Accepted Answer · 2016-07-07

The first Cufflinks run is to generate a new annotation for each sample to discover novel transcripts. The Cuffmerge run is to merge together all the annotations for each individual sample to create one merged annotation of better quality. The second Cufflinks run is to quantify the transcripts based on the merged annotation file.
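As a rough sketch of those three steps, something like the following could work. It is a dry run that only prints the commands. The sample names, directory layout, and the merged file name are assumptions for illustration (the asker reports the Cuffmerge output as transcripts.gtf, so check what your version writes). Note the use of -G in the last step, which tells Cufflinks to quantify only the supplied transcripts, whereas -g uses the annotation to guide a new assembly.

```shell
#!/bin/sh
# Dry-run sketch of the three-step pipeline: commands are printed, not executed.
# Sample names and all paths are hypothetical placeholders.
SAMPLES="ctrl1 ctrl2 ctrl3 ctrl4"
GENOME="genome.fa"
MANIFEST="assemblies.txt"
MERGED_GTF="merged_asm/merged.gtf"   # assumed name; may differ by Cuffmerge version

# Print a command; change the body to "$@" to actually run it.
run() { echo "$*"; }

# Step 1: assemble transcripts per sample with no reference annotation,
# so that novel transcripts can be reported.
: > "$MANIFEST"
for s in $SAMPLES; do
    run cufflinks -o "cufflinks_denovo/$s" "tophat/$s/accepted_hits.bam"
    # Cufflinks writes its assembly to <output dir>/transcripts.gtf
    echo "cufflinks_denovo/$s/transcripts.gtf" >> "$MANIFEST"
done

# Step 2: merge the per-sample assemblies into a single annotation.
run cuffmerge -o merged_asm -s "$GENOME" "$MANIFEST"

# Step 3: quantify each sample against the merged annotation.
# -G restricts Cufflinks to the supplied transcripts (no new assembly).
for s in $SAMPLES; do
    run cufflinks -G "$MERGED_GTF" -o "cufflinks_quant/$s" "tophat/$s/accepted_hits.bam"
done
```

The BAM files are read twice (steps 1 and 3), which is why every sample goes through Cufflinks two times; Cuffmerge itself only needs to run once.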

Yes, it is complicated, and the results will contain many false positives. More importantly, it's generally a waste of time unless you're working on a poorly annotated genome. For well-annotated genomes like the mouse, human, or Drosophila genomes, you shouldn't bother trying to discover novel transcripts. Just use the most recent annotation available.