Ensembl and UCSC GTFs to match
10.8 years ago
dfernan ▴ 770


I'd like to convert a UCSC GTF file that I downloaded from the Table Browser into an Ensembl-like GTF format.


I'd like to transform this GTF format:

head -n 20 ~/Downloads/GRCm38.ucsc.gtf
chr1    mm10_knownGene    exon    3205904    3207317    0.000000    -    .    gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
chr1    mm10_knownGene    exon    3213439    3215632    0.000000    -    .    gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
chr1    mm10_knownGene    stop_codon    3216022    3216024    0.000000    -    .    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    CDS    3216025    3216968    0.000000    -    2    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    exon    3214482    3216968    0.000000    -    .    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    CDS    3421702    3421901    0.000000    -    1    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";

Into the following Ensembl-like GTF format:

GL456350.1    protein_coding    exon    993    1059    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000980949";
GL456350.1    protein_coding    CDS    993    1059    .    -    0     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1    protein_coding    start_codon    1057    1059    .    -    0     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201";
GL456350.1    protein_coding    exon    784    904    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000967099";
GL456350.1    protein_coding    CDS    784    904    .    -    2     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1    protein_coding    exon    507    689    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "3"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00001013697";

The problem is that the UCSC file I downloaded from the Table Browser is missing the gene ID and gene name information: in the UCSC GTF, the gene_id and transcript_id values look identical to me.
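A quick way to confirm that observation across the whole file is to parse the attribute column and compare the two IDs. The following is only a minimal sketch: it assumes the file is tab-separated as GTF requires, and the helper names `parse_attributes` and `ids_identical` are made up for illustration. The sample records are copied from the `head` output above.

```python
import re

# Matches `key "value";` pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Return the attribute column as a dict of key -> value."""
    return dict(ATTR_RE.findall(attr_field))


def ids_identical(gtf_lines):
    """Return True if gene_id equals transcript_id on every record."""
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        attrs = parse_attributes(fields[8])
        if attrs.get("gene_id") != attrs.get("transcript_id"):
            return False
    return True


# Two records taken from the UCSC excerpt above (tabs assumed between columns).
sample = [
    'chr1\tmm10_knownGene\texon\t3205904\t3207317\t0.000000\t-\t.\t'
    'gene_id "uc007aet.1"; transcript_id "uc007aet.1"; ',
    'chr1\tmm10_knownGene\tCDS\t3216025\t3216968\t0.000000\t-\t2\t'
    'gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; ',
]
print(ids_identical(sample))
```

To run it on the real file, pass an open file handle instead of `sample`, e.g. `ids_identical(open(path))` with `path` pointing at the downloaded GTF.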

Would it be possible to download a UCSC GTF file (or something similar) and convert it into an Ensembl-like GTF with all the info, including the gene ID?

gtf ensembl ucsc • 5.4k views

There's no perfect (1:1) correspondence between UCSC gene_ids and Ensembl gene_ids. This is largely because UCSC gene_ids are a mess. You can find a given one on both strands of a single chromosome or even across multiple chromosomes. That might make sense for a gene_name (there are a good number of multi-copy genes), but I always found that totally nonsensical for what should be a unique ID. You should ask yourself whether you're not just better off sticking to the Ensembl annotation (it's much less of a mess).
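To see how much this affects a particular GTF, one could list every gene_id that occurs at more than one (chromosome, strand) combination. This is a rough sketch rather than a vetted tool: the function name `ambiguous_gene_ids` is hypothetical, the input is assumed to be tab-separated GTF, and the `toy` records are invented purely to exercise the code (they are not real UCSC entries).

```python
import re
from collections import defaultdict

# Pulls the gene_id value out of the GTF attribute column (column 9).
GENE_ID_RE = re.compile(r'gene_id "([^"]*)";')


def ambiguous_gene_ids(gtf_lines):
    """Map each gene_id seen at more than one (chromosome, strand) to those locations."""
    locations = defaultdict(set)
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        match = GENE_ID_RE.search(fields[8])
        if match:
            # Column 1 is the chromosome, column 7 is the strand.
            locations[match.group(1)].add((fields[0], fields[6]))
    return {gid: sorted(locs) for gid, locs in locations.items() if len(locs) > 1}


# Invented records for illustration only: geneA appears on both strands of chr1.
toy = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";',
    'chr1\ttoy\texon\t500\t600\t.\t-\t.\tgene_id "geneA"; transcript_id "tx2";',
    'chr2\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "geneB"; transcript_id "tx3";',
]
print(ambiguous_gene_ids(toy))
```

An empty result would mean every gene_id in the file sits on a single chromosome and strand; anything else is a candidate for the ambiguity described above.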


@dpryan79 thanks, that makes sense. I need to work with UCSC for consistency with other analyses :-(

