Question: Cufflinks output fewer features than the annoation file.
gravatar for --panda--
2.6 years ago by
--panda--30 wrote:

I'm using data from e coli, which is a pretty much well studied organism and a nice annotation file(NC000913.3) can be downloaded from NCBI.

After I used cufflinks with the very basic command line(see below) and got the genes.fpkm_tracking.txt. I found 2576 features(or lines in cufflinks' output file )' fpkm, while the original annotation file consisted of ~4500 genes. In another dataset(different experiment) I used, cufflinks gave me 3502 features.

I found that part of the reason was that cufflinks grouped some genes together(assume they belonged to a single transcript ??!!). And I think another explanation is that the 0 count genes were not reported.

Do you think it's a reasonable analysis result?

Command line cufflinks -o <out_dir &gt;="" -g="" <annotation.gff=""> <aligned.bam>

Example lines of cufflinks output: tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status

CUFF.8 - - CUFF.8 fixC,fixX - NC_000913.3:44179-45750 - - 10.5497 1.36756 19.7577 OK

gene44 - - gene44 yaaU - NC_000913.3:45806-47138 - - 1.98785 0.960989 3.00309 OK

CUFF.9 - - CUFF.9 kefC,kefF - NC_000913.3:47245-49631 - - 49.5173 34.9478 49.0769 OK

CUFF.10 - - CUFF.10 folA - NC_000913.3:49822-50302 - - 0 0 1.00003 OK

CUFF.11 - - CUFF.11 - - NC_000913.3:50567-57253 - - 116.716 112.037 122.431 OK

gene48 - - gene48 apaH - NC_000913.3:50379-51222 - - 0 0 0.379608 OK

gene49 - - gene49 apaG - NC_000913.3:51228-51606 - - 0 0 0.846585 OK

CUFF.12 - - CUFF.12 pdxA,rsmA,surA - NC_000913.3:51608-54702 - - 1039.98 872.984 933.547 OK

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by --panda--30
gravatar for Kevin Blighe
2.6 years ago by
Kevin Blighe59k
Kevin Blighe59k wrote:

Hi panda,

There is an explanation for what you're seeing.

In a prokaryotic genome such as that of Escherichia coli, the genes can be in much greater proximity than in a eukaryote like Homo sapiens. In bacteria, I've seen coding genes as close as ~50bp in some species. Also, genes that are closely bundled together in this way can actually be transcribed together as operons.

All of this is very problematic for TopHat and Cufflinks because they specifically look for splicing events and search for evidence of exon-intron boundaries. There are command-line parameters that you can supply to Cufflinks that affect the identifiction of these, but if you look at the default values of these parameters, you'll see why they are instantly not applicable to bacteria. Take a look at --overlap-radius, --min-intron-length, and --max-intron-length.

Thus, yes, TopHat/Cufflinks will have merged much of your genes together into a single transcript. The way to minimise this is to always supply a reference FASTA genome and reference GTF/GFF transcriptome. If you don't, the problem becomes exacerbated even further.

Also consider that TopHat/Cufflinks have been retired and that HISAT2/StringTie are now the programs to use. If you have FASTQ sequence data, I would additionally consider just doing pseudo-alignment and read-count abundance over a reference E. coli transcriptome FASTA by using Kallisto, or use RNA-seq tools secifically designed for bacterial RNA-seq analyses, such as Rockhopper or Trinity.


ADD COMMENTlink written 2.6 years ago by Kevin Blighe59k

Actually the aligner is not so much a problem because there is few splice reads reported by tophat(as one should expect for a bacteria data). I agree with you about the fact that some genes were bundled together due to their close locations and maybe relatively short length. I think the --overlap-radius –min-intron-length default values are the reason for that merge.

I've been used HISAT and Rockhopper recently. Did not know much about StringTie but I'll check their protocol for details.

Thanks a lot for the advice.

ADD REPLYlink written 2.6 years ago by --panda--30

Hi panda,

Yes, and you can slightly improve the analysis by reducing the values of the overlap radius and min intron length, but you'll still see some genes merged.

I have not actually used StringTie unfortunately, so, cannot comment further. These programs have obviously mostly been designed with the human genome/transcriptome in mind.

Rockhopper was quite useful to me. Once you get your significant sequences, you can BLASTx them in order to infer functionality.

ADD REPLYlink written 2.6 years ago by Kevin Blighe59k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1914 users visited in the last hour