Ensembl and UCSC GTFs to match
10.5 years ago
dfernan ▴ 760

Hi,

I'd like to transform a UCSC GTF file that I downloaded from the Table Browser into an Ensembl-like GTF format.

I.e.,

I'd like to transform this GTF format:

head -n 20 ~/Downloads/GRCm38.ucsc.gtf
chr1    mm10_knownGene    exon    3205904    3207317    0.000000    -    .    gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
chr1    mm10_knownGene    exon    3213439    3215632    0.000000    -    .    gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
chr1    mm10_knownGene    stop_codon    3216022    3216024    0.000000    -    .    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    CDS    3216025    3216968    0.000000    -    2    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    exon    3214482    3216968    0.000000    -    .    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    CDS    3421702    3421901    0.000000    -    1    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";

Into the following Ensembl-like GTF format:

GL456350.1    protein_coding    exon    993    1059    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000980949";
GL456350.1    protein_coding    CDS    993    1059    .    -    0     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1    protein_coding    start_codon    1057    1059    .    -    0     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201";
GL456350.1    protein_coding    exon    784    904    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000967099";
GL456350.1    protein_coding    CDS    784    904    .    -    2     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1    protein_coding    exon    507    689    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "3"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00001013697";

The problem is that the UCSC file I downloaded from the Table Browser is missing the gene ID and gene name information (the transcript ID and gene ID look identical to me in the UCSC GTF file).

Would it be possible to download a UCSC GTF file (or something similar) and convert it into an Ensembl-like GTF with all the info, including the gene ID?

gtf ensembl ucsc • 5.4k views

There's no perfect (1:1) correspondence between UCSC gene_ids and Ensembl gene_ids. This is largely because UCSC gene_ids are a mess: a given one can appear on both strands of a single chromosome or even across multiple chromosomes. That might make sense for a gene_name (there are a good number of multi-copy genes), but I always found it totally nonsensical for what should be a unique ID. You should ask yourself whether you're not just better off sticking to the Ensembl annotation (it's much less of a mess).
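
To see how often this happens in a particular file, something along these lines could be used. It is only a minimal sketch: it assumes a tab-delimited GTF with a gene_id attribute in column 9, and the function names are made up for illustration. It collects the (chromosome, strand) pairs on which each gene_id occurs and reports the IDs that show up in more than one place.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in GTF column 9, e.g. gene_id "uc007aet.1";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def gene_locations(gtf_lines):
    """Map each gene_id to the set of (chromosome, strand) pairs it appears on."""
    locations = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            locations[match.group(1)].add((fields[0], fields[6]))
    return locations

def ambiguous_gene_ids(gtf_lines):
    """Return only the gene_ids seen on more than one chromosome/strand pair."""
    return {
        gene_id: locs
        for gene_id, locs in gene_locations(gtf_lines).items()
        if len(locs) > 1
    }
```

Passing an open file handle (for example `ambiguous_gene_ids(open("GRCm38.ucsc.gtf"))`) should work, since the functions only iterate over lines.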


@dpryan79 Thanks, that makes sense. I need to work with UCSC for consistency with other analyses :-(
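
For anyone in the same position, one possible workaround is to keep the UCSC coordinates and fill in the gene information from a separate cross-reference table. The sketch below assumes you have exported a two-column, tab-separated file of transcript ID and gene symbol (for instance the kgID and geneSymbol columns of UCSC's kgXref table, if that table is available for your assembly). The function names and the choice to reuse the symbol as gene_id are assumptions for illustration. This does not produce Ensembl IDs, and as noted above the symbols are not guaranteed to be unique.

```python
import re

# Matches the transcript_id attribute in GTF column 9.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_ID_RE = re.compile(r'gene_id "[^"]*"')

def load_mapping(tsv_lines):
    """Parse 'transcript_id<TAB>gene_symbol' lines into a dict."""
    mapping = {}
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        parts = line.split("\t")
        if len(parts) >= 2 and parts[1]:
            mapping[parts[0]] = parts[1]
    return mapping

def annotate_gtf_line(line, mapping):
    """Set gene_id to the mapped symbol and append a gene_name attribute.

    Lines without a mapped transcript_id are returned unchanged
    (minus the trailing newline).
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) < 9:
        return stripped
    match = TRANSCRIPT_ID_RE.search(fields[8])
    gene = mapping.get(match.group(1)) if match else None
    if gene is None:
        return stripped
    attrs = fields[8].rstrip()
    # A function replacement avoids backslash escapes being interpreted.
    attrs = GENE_ID_RE.sub(lambda _m: 'gene_id "%s"' % gene, attrs)
    attrs += ' gene_name "%s";' % gene
    return "\t".join(fields[:8] + [attrs])
```

Chromosome names (chr1 versus 1) and the source column would still differ from an Ensembl GTF, so those would need separate handling if a downstream tool depends on them.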
