Question: Ensembl And Ucsc Gtfs To Match
dfernan (United States) wrote, 5.5 years ago:

Hi,

I'd like to transform a UCSC GTF file that I downloaded from the Table Browser into an Ensembl-like GTF format.

I.e.,

I'd like to transform this GTF format:

head -n 20 ~/Downloads/GRCm38.ucsc.gtf
chr1    mm10_knownGene    exon    3205904    3207317    0.000000    -    .    gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
chr1    mm10_knownGene    exon    3213439    3215632    0.000000    -    .    gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
chr1    mm10_knownGene    stop_codon    3216022    3216024    0.000000    -    .    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    CDS    3216025    3216968    0.000000    -    2    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    exon    3214482    3216968    0.000000    -    .    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1    mm10_knownGene    CDS    3421702    3421901    0.000000    -    1    gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";

Into the following Ensembl-like GTF format:

GL456350.1    protein_coding    exon    993    1059    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000980949";
GL456350.1    protein_coding    CDS    993    1059    .    -    0     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1    protein_coding    start_codon    1057    1059    .    -    0     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201";
GL456350.1    protein_coding    exon    784    904    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000967099";
GL456350.1    protein_coding    CDS    784    904    .    -    2     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1    protein_coding    exon    507    689    .    -    .     gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "3"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00001013697";

The problem is that the UCSC file I downloaded from the Table Browser is missing the gene ID and gene name information (the transcript ID and gene ID look identical to me in the UCSC GTF file).

Would it be possible to download a UCSC GTF file (or something similar) and get it into an Ensembl-like GTF with all the info, including the gene ID?
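
For illustration, here is a minimal sketch of one way the gene_id column could be filled in. It assumes a separate two-column, tab-separated table that maps UCSC transcript IDs to gene identifiers is available (for instance exported from a cross-reference table in the Table Browser). That table, its layout, and the function names `load_mapping` and `rewrite_gene_id` are assumptions made for this example, not something provided in the thread. It also assumes the GTF is tab-separated, as real GTF files are.

```python
import re

# Patterns for the two attributes in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]*)"')


def load_mapping(lines):
    """Read 'transcript_id<TAB>gene_id' pairs (hypothetical layout) into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        tx_id, gene_id = line.split("\t")[:2]
        mapping[tx_id] = gene_id
    return mapping


def rewrite_gene_id(gtf_line, mapping):
    """Replace gene_id with the mapped value; return the line unchanged if unmapped."""
    stripped = gtf_line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) < 9:
        return stripped
    match = TRANSCRIPT_ID_RE.search(fields[8])
    if match is None or match.group(1) not in mapping:
        return stripped
    new_gene = mapping[match.group(1)]
    # A function is used as the replacement so the new ID is inserted literally.
    fields[8] = GENE_ID_RE.sub(
        lambda _m: 'gene_id "%s"' % new_gene, fields[8], count=1
    )
    return "\t".join(fields)
```

With a mapping such as `{"uc007aet.1": "GENE_A"}` (a placeholder value), each line of the file can be passed through `rewrite_gene_id`. Transcripts missing from the table keep their original gene_id, so they are easy to spot afterwards. This only fills in gene_id; the other Ensembl attributes (exon_number, gene_biotype, and so on) would need additional sources.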

gtf ensembl ucsc • 3.7k views

There's no perfect (1:1) correspondence between UCSC gene_ids and Ensembl gene_ids. This is largely because UCSC gene_ids are a mess. You can find a given one on both strands of a single chromosome or even across multiple chromosomes. That might make sense for a gene_name (there are a good number of multi-copy genes), but I always found that totally nonsensical for what should be a unique ID. You should ask yourself whether you're better off sticking to the Ensembl annotation (it's much less of a mess).
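
A quick way to see whether a particular file is affected is to collect, for each gene_id, the set of (chromosome, strand) pairs it appears on. The following is only a sketch under the assumption of a tab-separated GTF with a `gene_id "..."` attribute in the ninth column; the function name `ambiguous_gene_ids` is made up for this example.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')


def ambiguous_gene_ids(gtf_lines):
    """Return {gene_id: sorted [(chrom, strand), ...]} for IDs seen at more than one."""
    loci = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or blank lines
        match = GENE_ID_RE.search(fields[8])
        if match:
            # Column 1 is the chromosome, column 7 the strand.
            loci[match.group(1)].add((fields[0], fields[6]))
    return {gid: sorted(places) for gid, places in loci.items() if len(places) > 1}
```

Calling it on an open file handle, for example `ambiguous_gene_ids(open("GRCm38.ucsc.gtf"))`, returns an empty dict when every gene_id sits on a single chromosome and strand. Any entries it does return are IDs that cannot be treated as unique loci.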

written 5.5 years ago by Devon Ryan

@dpryan79 thanks, that makes sense. I need to work with UCSC for consistency with other analyses :-(

written 5.5 years ago by dfernan