9.6 years ago
dfernan
Hi,
I'd like to convert a UCSC GTF file that I downloaded from the Table Browser into an Ensembl-like GTF format.
I.e.,
I'd like to transform this GTF format:
head -n 20 ~/Downloads/GRCm38.ucsc.gtf
chr1 mm10_knownGene exon 3205904 3207317 0.000000 - . gene_id "uc007aet.1"; transcript_id "uc007aet.1";
chr1 mm10_knownGene exon 3213439 3215632 0.000000 - . gene_id "uc007aet.1"; transcript_id "uc007aet.1";
chr1 mm10_knownGene stop_codon 3216022 3216024 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene CDS 3216025 3216968 0.000000 - 2 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene exon 3214482 3216968 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene CDS 3421702 3421901 0.000000 - 1 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
Into the following ensembl-like GTF format:
GL456350.1 protein_coding exon 993 1059 . - . gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000980949";
GL456350.1 protein_coding CDS 993 1059 . - 0 gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1 protein_coding start_codon 1057 1059 . - 0 gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "1"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201";
GL456350.1 protein_coding exon 784 904 . - . gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00000967099";
GL456350.1 protein_coding CDS 784 904 . - 2 gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "2"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; protein_id "ENSMUSP00000136267";
GL456350.1 protein_coding exon 507 689 . - . gene_id "ENSMUSG00000094121"; transcript_id "ENSMUST00000177695"; exon_number "3"; gene_name "Ccl21c"; gene_biotype "protein_coding"; transcript_name "Ccl21c-201"; exon_id "ENSMUSE00001013697";
The problem is that the UCSC file I downloaded from the Table Browser is missing the gene ID and gene name info (the transcript_id and gene_id look identical to me in the UCSC GTF file).
Would it be possible to download a UCSC GTF file (or something similar) and turn it into an Ensembl-like GTF with all the info, including the gene ID?
There's no perfect (1:1) correspondence between UCSC gene_ids and Ensembl gene_ids. This is largely because UCSC gene_ids are a mess. You can find a given one on both strands of a single chromosome or even across multiple chromosomes. That might make sense for a gene_name (there are a good number of multi-copy genes), but I always found it totally nonsensical for what should be a unique ID. You should ask yourself whether you're not better off just sticking to the Ensembl annotation (it's much less of a mess).
@dpryan79 thanks, that makes sense. I need to work with UCSC for consistency with other analyses :-(
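If staying with UCSC is a hard requirement, one workaround is to keep the UCSC coordinates and only rewrite the attribute column using a transcript-to-gene lookup. Below is a minimal sketch of that idea, not a full converter. It assumes you already have a mapping from UCSC transcript IDs to a gene ID and gene name (for example, a cross-reference table exported separately; the `tx2gene` dict, the `rewrite_line` helper, and the `GENE_A`/`GeneA` values are hypothetical placeholders, not real annotations). It also assumes the GTF is tab-separated with nine columns, and it does not touch chromosome names, the source column, or the score column, which also differ between the two formats. Given the caveat above that the correspondence is not 1:1, transcripts without a mapping are left unchanged.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def rewrite_line(line, tx2gene):
    """Replace gene_id and add gene_name on one GTF line.

    tx2gene is a hypothetical dict: transcript_id -> (gene_id, gene_name).
    Lines that are not 9-column records, or whose transcript has no
    mapping, are returned unchanged (minus the trailing newline).
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if len(fields) != 9:
        return stripped  # comment or malformed line: leave as-is

    # dict preserves insertion order, so gene_id stays first
    attrs = dict(ATTR_RE.findall(fields[8]))
    tx = attrs.get("transcript_id")
    if tx not in tx2gene:
        return stripped  # no mapping available for this transcript

    gene_id, gene_name = tx2gene[tx]
    attrs["gene_id"] = gene_id
    attrs["gene_name"] = gene_name
    fields[8] = " ".join(f'{key} "{value}";' for key, value in attrs.items())
    return "\t".join(fields)


# Usage with a placeholder mapping (values are made up for illustration)
tx2gene = {"uc007aet.1": ("GENE_A", "GeneA")}
example = (
    "chr1\tmm10_knownGene\texon\t3205904\t3207317\t0.000000\t-\t.\t"
    'gene_id "uc007aet.1"; transcript_id "uc007aet.1";'
)
print(rewrite_line(example, tx2gene))
```

Ensembl-only attributes such as exon_number, gene_biotype, exon_id, and protein_id are not reconstructed here; they would need additional source tables, and some may have no UCSC equivalent.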