Question

Cufflinks output fewer features than the annoation file.

0

Entering edit mode

6.5 years ago

--panda-- ▴ 30

I'm using data from e coli, which is a pretty much well studied organism and a nice annotation file(NC000913.3) can be downloaded from NCBI.

After I used cufflinks with the very basic command line(see below) and got the genes.fpkm_tracking.txt. I found 2576 features(or lines in cufflinks' output file )' fpkm, while the original annotation file consisted of ~4500 genes. In another dataset(different experiment) I used, cufflinks gave me 3502 features.

I found that part of the reason was that cufflinks grouped some genes together(assume they belonged to a single transcript ??!!). And I think another explanation is that the 0 count genes were not reported.

Do you think it's a reasonable analysis result?

Command line cufflinks -o <out_dir >="" -g="" <annotation.gff=""> <aligned.bam>

Example lines of cufflinks output: tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status

CUFF.8 - - CUFF.8 fixC,fixX - NC_000913.3:44179-45750 - - 10.5497 1.36756 19.7577 OK

gene44 - - gene44 yaaU - NC_000913.3:45806-47138 - - 1.98785 0.960989 3.00309 OK

CUFF.9 - - CUFF.9 kefC,kefF - NC_000913.3:47245-49631 - - 49.5173 34.9478 49.0769 OK

CUFF.10 - - CUFF.10 folA - NC_000913.3:49822-50302 - - 0 0 1.00003 OK

CUFF.11 - - CUFF.11 - - NC_000913.3:50567-57253 - - 116.716 112.037 122.431 OK

gene48 - - gene48 apaH - NC_000913.3:50379-51222 - - 0 0 0.379608 OK

gene49 - - gene49 apaG - NC_000913.3:51228-51606 - - 0 0 0.846585 OK

CUFF.12 - - CUFF.12 pdxA,rsmA,surA - NC_000913.3:51608-54702 - - 1039.98 872.984 933.547 OK

RNA-Seq quantification cufflinks • 1.7k views

ADD COMMENT • link 6.5 years ago by --panda-- ▴ 30

score 0 · Answer 1 · 2017-11-04

Hi panda,

There is an explanation for what you're seeing.

In a prokaryotic genome such as that of Escherichia coli, the genes can be in much greater proximity than in a eukaryote like Homo sapiens. In bacteria, I've seen coding genes as close as ~50bp in some species. Also, genes that are closely bundled together in this way can actually be transcribed together as operons.

All of this is very problematic for TopHat and Cufflinks because they specifically look for splicing events and search for evidence of exon-intron boundaries. There are command-line parameters that you can supply to Cufflinks that affect the identifiction of these, but if you look at the default values of these parameters, you'll see why they are instantly not applicable to bacteria. Take a look at --overlap-radius, --min-intron-length, and --max-intron-length.

Thus, yes, TopHat/Cufflinks will have merged much of your genes together into a single transcript. The way to minimise this is to always supply a reference FASTA genome and reference GTF/GFF transcriptome. If you don't, the problem becomes exacerbated even further.

Also consider that TopHat/Cufflinks have been retired and that HISAT2/StringTie are now the programs to use. If you have FASTQ sequence data, I would additionally consider just doing pseudo-alignment and read-count abundance over a reference E. coli transcriptome FASTA by using Kallisto, or use RNA-seq tools secifically designed for bacterial RNA-seq analyses, such as Rockhopper or Trinity.

Kevin