Question: Resources for converting between UCSC <-> Gencode <-> Ensembl chromosome names
2.7 years ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

We're currently using a variety of different versions of a variety of different organism reference genomes and are often running into the need to convert between chromosome coordinate naming systems (e.g., when someone wants data aligned against the hg19 reference from ensembl and for a gencode GTF file to be used). This is often as simple as a quick add/remove of "chr", but not always (e.g., who would know that JH806595.1 in gencode is HG1441_PATCH in ensembl?). So, does anyone know of a nice resource somewhere that provides the mappings?

At the end of the day, I just need a tab-separated file with the name mappings. I've already written a little Python script to perform all of the conversion (a trivial task), but making the mapping files is proving to be a PITA and I assume someone else has already done this.
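
For readers who want to see what such a conversion might look like, here is a minimal sketch. It is not the script mentioned above. It assumes a two-column, tab-separated mapping file (old name, then new name) and an annotation format whose first column is the chromosome name, as in GTF and BED. The function names `load_mapping` and `rename_chromosomes` and the `gene_id` value in the example are made up for illustration.

```python
import io


def load_mapping(handle):
    """Read a two-column, tab-separated file: old name, then new name."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip rows that have no target name.
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_chromosomes(lines, mapping, drop_unmapped=False):
    """Yield annotation lines with the first column renamed via `mapping`."""
    for line in lines:
        # Pass comment, track and browser lines through untouched.
        if line.startswith(("#", "track", "browser")):
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        if chrom in mapping:
            yield mapping[chrom] + sep + rest
        elif not drop_unmapped:
            yield line


# Toy input: the patch name pair comes from the question, the rest is invented.
mapping = load_mapping(io.StringIO("1\tchr1\nHG1441_PATCH\tJH806595.1\n"))
gtf = [
    "#!genome-build GRCh37\n",
    '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "example";\n',
    'HG1441_PATCH\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "example";\n',
]
converted = list(rename_chromosomes(gtf, mapping))
```

Whether unmapped contigs should be kept or dropped depends on the downstream tool, which is why the sketch leaves that as the `drop_unmapped` switch.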

Edit: BTW, if I have to make the mapping files myself I'll put them on github. It's absurd for that to ever need to be repeated by anyone.

Edit: For what it's worth, at least some of the hg19 gencode<->ensembl mappings are here.

ucsc gencode ensembl • 3.8k views
modified 19 months ago by CAnna0 • written 2.7 years ago by Devon Ryan73k

I wrote a tool that compares chromosomes using MD5; could that help? See http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html Mappings: https://github.com/lindenb/jvarkit/tree/master/src/main/resources/chromnames
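
As a rough illustration of the MD5 idea (a sketch only, not the linked tool's code): hash each sequence in two FASTA files and pair up the names whose digests agree. The function names are hypothetical. The sketch uppercases sequences so soft-masking does not matter, and it assumes no two contigs in the same file have identical sequence.

```python
import hashlib
import io


def sequence_md5s(handle):
    """Return {md5 hex digest of the uppercased sequence: record name}."""
    digests = {}
    name, md5 = None, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                digests[md5.hexdigest()] = name
            # The record name is the first word of the header line.
            words = line[1:].split()
            name = words[0] if words else ""
            md5 = hashlib.md5()
        elif name is not None and line:
            md5.update(line.upper().encode("ascii"))
    if name is not None:
        digests[md5.hexdigest()] = name
    return digests


def match_names(fasta_a, fasta_b):
    """Map names in fasta_a to names in fasta_b that have identical sequence."""
    a = sequence_md5s(fasta_a)
    b = sequence_md5s(fasta_b)
    return {a[d]: b[d] for d in a if d in b}


# Toy sequences, invented for the example.
ucsc = io.StringIO(">chr1 toy\nACGT\nacgt\n>chrM\nGGCC\n")
ensembl = io.StringIO(">1\nACGTACGT\n>MT\nTTTT\n")
pairs = match_names(ucsc, ensembl)
```

This needs the FASTA files for both naming schemes, so it helps build mapping tables but does not by itself convert GTF or BED files.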

written 2.7 years ago by Pierre Lindenbaum102k

It wouldn't be so useful in this case, since I need this for GTF/BED/etc. files. Though it's good to know about that tool!

written 2.7 years ago by Devon Ryan73k

2.7 years ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

Should someone ever need this sort of thing in the future, I've started a github repository with a few conversions here. Everyone is encouraged to add additional conversions or fix any errors they see in those already there. Just submit a pull request.

I'll likely add more of these over time as we actually need them (I still need to add some for the fruit fly genome).

written 2.7 years ago by Devon Ryan73k

Hey,

Where did you get all this information, for example for GRCh38_ensembl2UCSC.txt? It's very useful. Why do GRCm38_UCSC2ensembl.txt and GRCm38_ensembl2UCSC.txt both exist with different content? Is the mapping not bijective (1:1)?

modified 8 months ago • written 8 months ago by xd_d70

The original information comes from the genome assemblies deposited in NCBI. There are multiple names for each contig therein. The trick is simply to figure out who uses which column (sometimes they like to modify them further).
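
To make the "which column" point concrete, here is a minimal sketch of turning an NCBI assembly report into a name mapping. It is not the code behind the repository. It assumes the report is tab-separated, that its last `#` line is the column header, and that missing values are written as `na`. The column names used below (`Sequence-Name`, `GenBank-Accn`, `UCSC-style-name`) are the ones I recall from those reports and should be checked against the actual file. The toy report is cut down to three columns and its first row is invented.

```python
import io


def mapping_from_report(handle, from_col, to_col):
    """Build {from_col value: to_col value} from an assembly-report-like table."""
    header = None
    mapping = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # The last comment line before the data is taken as the header.
            header = line.lstrip("# ").split("\t")
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        src, dst = row.get(from_col, "na"), row.get(to_col, "na")
        # Skip contigs that lack a name in either scheme.
        if src != "na" and dst != "na":
            mapping[src] = dst
    return mapping


# Toy report: reduced columns, first row invented, second pair from the question.
report = (
    "# Assembly name: toy example\n"
    "# Sequence-Name\tGenBank-Accn\tUCSC-style-name\n"
    "ctgA\tXX000001.1\tchrUn_XX000001v1\n"
    "HG1441_PATCH\tJH806595.1\tna\n"
)
to_genbank = mapping_from_report(io.StringIO(report), "Sequence-Name", "GenBank-Accn")
to_ucsc = mapping_from_report(io.StringIO(report), "Sequence-Name", "UCSC-style-name")
```

Any further renaming a provider applies on top of these columns would still have to be handled separately.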

written 8 months ago by Devon Ryan73k

2.7 years ago by
Emily_Ensembl14k
EMBL-EBI
Emily_Ensembl14k wrote:

Ensembl=Gencode

written 2.7 years ago by Emily_Ensembl14k

That's unfortunately not the case.
 

written 2.7 years ago by Devon Ryan73k

The naming might be different but the data in the GTFs are the same. The Ensembl geneset is the Gencode geneset.

written 2.7 years ago by Emily_Ensembl14k

True, unfortunately some users complain if things aren't processed exactly as requested, even if just using Ensembl (my preferred solution!) produces the same results.

written 2.7 years ago by Devon Ryan73k

Hi Emily, I understand that, but why is there a difference in gene counts between Gencode and Ensembl? Could you please have a look at the question I recently posted, "Discrepancy in gene counts between GENCODE 23 and Ensembl 81/82"?

modified 2.1 years ago • written 2.1 years ago by Veerendra Gadekar0

19 months ago by
CAnna0
CAnna0 wrote:

Hi,

I went to your GitHub repository to access these conversion tables. Thank you, this is very useful.

I am very new to bioinformatics and I am currently trying to convert the Ensembl chromosome names in an annotation GTF file to UCSC chromosome names, in order to build a STAR index (the STAR manual specifies that the chromosome names in the FASTA file and the GTF file should be the same).

So is the only thing I have to do to replace the names in the GTF annotation file with their UCSC equivalents, and then I can run my indexing?
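
One way to sanity-check this before indexing, sketched under the assumption that only the names need to agree: collect the sequence names from the FASTA headers and from the first column of the GTF, then list any GTF names that the FASTA lacks. The function names are hypothetical and the toy inputs are invented.

```python
import io


def fasta_names(handle):
    """Collect record names (first word after '>') from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_names(handle):
    """Collect the distinct values of the first column of a GTF file."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.rstrip("\r\n").split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_handle, gtf_handle):
    """Return GTF sequence names that do not occur in the FASTA file."""
    return sorted(gtf_names(gtf_handle) - fasta_names(fasta_handle))


# Toy inputs: a UCSC-style FASTA and a GTF with one name still in Ensembl style.
fasta = io.StringIO(">chr1 toy\nACGT\n>chr2\nAC\n")
gtf = io.StringIO('#!comment\n1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "a";\nchr2\tsrc\tgene\t1\t2\t.\t+\t.\tgene_id "b";\n')
missing = missing_from_fasta(fasta, gtf)
```

An empty result means every name in the annotation has a matching sequence. It says nothing about whether the coordinates refer to the same assembly version.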

It's a trivial question, but I am really new to all of this. Thank you very much,

Camille

written 19 months ago by CAnna0

You could save yourself time and download the sequence/annotation/index bundles from the iGenomes site (though you would need to create your own STAR indexes).

written 19 months ago by genomax39k

Download the fasta file from Ensembl instead and save yourself the hassle. You can get the whole bundle from iGenomes, but that'll be a larger download.

written 19 months ago by Devon Ryan73k

OK, thank you for your advice!

written 19 months ago by CAnna0