Question: Tophat / Cufflinks Not Using Gene_Id Or Gene_Name From Gtf File
3
gravatar for Owen S.
6.8 years ago by
Owen S.350
Oakland CA
Owen S.350 wrote:

I am trying to get tophat / cufflinks to work with Ensembl annotations.

Although this looks like a long post, my problem is quite simple: I keep getting CUFF.1, CUFF.2, CUFF.3, etc, as my geneid, in the genes.fpkmtracking file:

$ head genes.fpkm_tracking 
tracking_id     class_code      nearest_ref_id  gene_id gene_short_name tss_id  locus   length  coverage        FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
CUFF.1  -       -       CUFF.1  -       -       chr1:568966-570302      -       -       1080.24 1019.13 1141.36 OK
CUFF.4  -       -       CUFF.4  -       -       chr1:979748-982084      -       -       19.74   10.1647 29.3153 OK
CUFF.2  -       -       CUFF.2  -       -       chr1:982297-984413      -       -       11.6648 5.19434 18.1353 OK
CUFF.3  -       -       CUFF.3  -       -       chr1:881909-892399      -       -       24.6448 15.3299 33.9596 OK

What I want is to have the Ensembl ENSG annotations, and possibly gene_name as well.

Yes, I realize the chromosome annotation (1st column) in the GTF/GFF file must match the FASTA header exactly.

I have tried the "native" Ensembl GTF file (chromosomes numbered 1, 2, 3..., X, Y, MT) and a bowtie2 genome index built from Ensembl's GRCh37.69 FASTA:

1       processed_transcript    exon    11869   12227   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
1       processed_transcript    exon    12613   12721   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002867822";
1       processed_transcript    exon    13221   14409   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";

I have tried the native GTF with "chr" added in, and with the FASTA file modified in the same way (and of course bt2 index rebuilt):

chr1    processed_transcript    exon    11869   12227   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
chr1    processed_transcript    exon    12613   12721   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002867822";
chr1    processed_transcript    exon    13221   14409   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";

I have also tried using cufflinks gffread utility, to make a GFF file from the above GTF file:

chr1    processed_transcript    gene    11869   14412   .       +       .       ID=ENSG00000223972;Name=DDX11L1;transcripts=ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
chr1    processed_transcript    misc_RNA        11869   14409   .       +       .       ID=ENST00000456328;Parent=ENSG00000223972;geneID=ENSG00000223972;gene_name=DDX11L1;Name=DDX11L1-002;type=pseudogene
chr1    processed_transcript    exon    11869   12227   .       +       .       Parent=ENST00000456328
chr1    processed_transcript    exon    12613   12721   .       +       .       Parent=ENST00000456328
chr1    processed_transcript    exon    13221   14409   .       +       .       Parent=ENST00000456328

For which the corresponding bowtie2 index has these entries:

$ bowtie2-inspect -n /hulk/genomes/Ensembl/Homo_sapiens.GRCh37.69.dna.chromosome 
chr10
chr11
chr12
chr13
...

I have even tried using the iGenomes files, with the same results!

The way I am running tophat and cufflinks is like this:

tophat -G Homo_sapiens.GRCh37.69.gtf --transcriptome-index=Homo_sapiens.GRCh37.69.genes -p 24 -o output Homo_sapiens.GRCh37.69.dna Sample1_1.fq.gz Sample1_2.fq.gz
cufflinks -p 6 -o output output/accepted_hits.bam

It is almost like I am running it without annotations, but the tophat / cufflinks programs never complain about being unable to match annotations with the FASTA file or anything:

[2012-11-09 17:11:40] Beginning TopHat run (v2.0.6)
-----------------------------------------------
[2012-11-09 17:11:40] Checking for Bowtie
                  Bowtie version:        2.0.2.0
[2012-11-09 17:11:40] Checking for Samtools
                Samtools version:        0.1.18.0
[2012-11-09 17:11:41] Checking for Bowtie index files
[2012-11-09 17:11:41] Checking for reference FASTA file
[2012-11-09 17:11:41] Generating SAM header for /hulk/genomes/Ensembl/Homo_sapiens.GRCh37.69.dna.chromosome
        format:          fastq
        quality scale:   phred33 (default)
[2012-11-09 17:11:44] Reading known junctions from GTF file
[2012-11-09 17:12:02] Preparing reads
         left reads: min. length=30, max. length=101, 999911 kept reads (89 discarded)
        right reads: min. length=30, max. length=101, 999341 kept reads (659 discarded)
[2012-11-09 17:12:43] Creating transcriptome data files..
...snip...
[2012-11-09 18:10:58] Reporting output tracks
-----------------------------------------------
[2012-11-09 18:16:25] Run complete: 01:04:44 elapsed

You are using Cufflinks v2.0.2, which is the most recent release.
[18:16:26] Inspecting reads and determining fragment length distribution.
> Processed 120637 loci.                       [*************************] 100%
> Map Properties:
>       Normalized Map Mass: 985482.42
>       Raw Map Mass: 985482.42
>       Fragment Length Distribution: Empirical (learned)
>                     Estimated Mean: 163.12
>                  Estimated Std Dev: 50.10
[18:17:15] Assembling transcripts and estimating abundances.
> Processed 120951 loci.                       [*************************] 100%

Yes, this topic has been discussed online ad nauseum. Here, and here, and Cufflinks - Assigning Transcripts Or Exons, and many other places. I feel like I have read every possible internet page about this topic, but still the problem vexes me. In the past I once had this working just fine. In fact, I found that tophat / cufflinks worked very nicely with Ensembl annotations. However, it seems like something has changed -- either the software or the annotations? Am I the only one experiencing this? Can someone help me see what I am overlooking? Any help would be much appreciated.

Here are the software versions I am using:

$ bowtie2 --version
/usr/local/bin/bowtie2-2.0.2/bowtie2-align version 2.0.2
64-bit
Built on igm1
Wed Oct 31 23:16:47 EDT 2012
$ tophat --version
TopHat v2.0.6
$ cufflinks 
cufflinks v2.0.2
linked against Boost version 104900
gtf ensembl tophat cufflinks gff • 12k views
ADD COMMENTlink written 6.8 years ago by Owen S.350
3

Are you just trying to get tag counts? Have you tried running cufflinks with the -G option?

ADD REPLYlink written 6.8 years ago by Damian Kao15k
1

Damian, Thank you. I am slightly embarrassed by the question now but I will leave it here for others to see. I see I was trying to adhere too closely to the code shown in the authors' Nature Protocols paper (p.571) which does not show using -G or -g with cufflinks. In the past, I must have been using this option, because I know in the past I was getting proper carry-over of the reference annotations.

Thanks again for taking the time to answer.

ADD REPLYlink written 6.8 years ago by Owen S.350

could you please add [solved] to the title for people scanning unanswered questions

ADD REPLYlink written 6.8 years ago by Ying W3.9k
1
gravatar for Nick Crawford
6.8 years ago by
Nick Crawford210
Philadelphia PA
Nick Crawford210 wrote:

It doesn't appear that you're setting either the -G/--GTF or the -g/--GTF-guide flags when you run cufflinks. See comments above as well.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Nick Crawford210

Gss, totally saved my day this post, has been trying to figure out this for days now, i thought the first step in the tophat is going to inherit all of the gene_id...

ADD REPLYlink written 6.2 years ago by mad.cichlids110

Hi, Sorry to bother you guys I have take the argument -g gene.gtf along the way of tophat, cufflinks, cuffmerge. Why the name of my gene_exp.diff file still using gene names rather than gene_id, and why the gene_id was replaced by some automatically generated numbers like :XLOC_OOOOO1,,,XLOC_OOOOO15466...... Do i need to specify it in cuffdiff too? but i did not see this -g argument in the cuff.diff. Any way that i can specify to take gene_id rather than gene names? By the way, i was using the ensemble gene annotation file. Thank you so much.

tophat -p 4 -G genes.gtf -o tophat_results024 genome paired_024-3_L1_I005.R1.clean.fastq,paired_024-3_L1_I005.R2.clean.fastq

cufflinks -p 4 -g genes.gtf -o cufflink_results025_mt/ tophat_results025_mt/accepted_hits.bam

cuffmerge -g genes.gtf -s genome.fa -p 4 assemblies_mt.txt

cuffdiff -o diff_out_mt -b genome.fa -p 4 -L Y24,Y25 -u merged_asm/merged.gtf ./tophat_results024_mt/accepted_hits.bam ./tophat_results025_mt/accepted_hits.bam

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by mad.cichlids110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1481 users visited in the last hour