How To Convert Chromosome Names Of Ensembl Annotation To The Ones Of UCSC RefSeq Database?
10.9 years ago
Ning-Yi Shao ▴ 390

I am working on a database covering many species, and I have found that chromosome names are hard to convert between Ensembl and RefSeq. UCSC provides tables named ucscToEnsembl, which are very useful:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from ucscToEnsembl;" hg19

But I manually checked hg19, mm9, mm10, rn4, rn5, dm3, and Zv9, and only hg19 and mm10 have this table. Are there ready-made tables that solve this problem? Thanks for any suggestions.
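For assemblies where the table exists, the output of the query above can be turned into a lookup table. The following is only a minimal sketch: it assumes the dump is tab-separated with the UCSC name in the first column and the Ensembl name in the second, and the function name `parse_ucsc_to_ensembl` and the example rows are illustrative, not real query output.

```python
def parse_ucsc_to_ensembl(dump):
    """Build a UCSC -> Ensembl name mapping from tab-separated rows.

    Assumption: column 1 is the UCSC name, column 2 the Ensembl name.
    """
    mapping = {}
    for line in dump.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip rows that do not have both names
        mapping[fields[0]] = fields[1]
    return mapping


# Illustrative rows only; a real dump would come from the mysql command.
example_dump = "chr1\t1\nchrX\tX\nchr1_gl000192_random\tGL000192\n"
ucsc_to_ensembl = parse_ucsc_to_ensembl(example_dump)

# Invert the mapping to go from Ensembl names back to UCSC names.
ensembl_to_ucsc = {ens: ucsc for ucsc, ens in ucsc_to_ensembl.items()}
```

Inverting the dictionary only works cleanly if every Ensembl name appears once in the table, which is worth checking on the real data.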

ensembl refseq chromosome • 14k views
10.9 years ago

Hi Ning-Yi Shao,

I have a little more information about this for you.

UCSC probably do this because the names that they use are not the officially accepted names.

In most cases, Ensembl does use the official accession.version in the seq_region table. When they don't (for example, for all chromosomes), they try to add the official name as a synonym, which can be found with:

mysql -uanonymous -hensembldb.ensembl.org -P3306 -Dhomo_sapiens_core_72_37 -e "select name, synonym from seq_region, seq_region_synonym, external_db where seq_region.seq_region_id = seq_region_synonym.seq_region_id and seq_region_synonym.external_db_id=external_db.external_db_id "

The reason that they sometimes don't use the official accession.version in the seq region table is because there is a more commonly used or human readable form. For example, "1" for chromosome one when the INSDC accession is CM000663.1. Users probably prefer to see "1" when they are on location view and not "CM000663.1".

I hope that helps.

10.3 years ago
Gregor Rot ▴ 540

An additional note on the ucscToEnsembl table:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from ucscToEnsembl;" hg19

All is fine, except for cases like this (GL*):

chr1_gl000192_random -> GL000192

These are named GL000192.1 in the Ensembl annotation (".1" is added) and the table points to GL000192 (without the .1). For mouse (mm10) everything looks fine.
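One way to work around the missing suffix is to patch the mapping after loading it. This is a rough sketch, assuming that every unversioned GL name in the hg19 table should get ".1" appended, as in the example above; the helper name `add_gl_version` is hypothetical, and the assumption should be checked against the actual Ensembl annotation before relying on it.

```python
import re

# Matches unversioned GL contig names such as "GL000192".
GL_PATTERN = re.compile(r"^GL\d+$")


def add_gl_version(mapping, version="1"):
    """Return a copy of the mapping with ".<version>" added to bare GL names.

    Assumption: all affected contigs use the same version suffix.
    """
    fixed = {}
    for ucsc_name, ensembl_name in mapping.items():
        if GL_PATTERN.match(ensembl_name):
            ensembl_name = f"{ensembl_name}.{version}"
        fixed[ucsc_name] = ensembl_name
    return fixed


table = {"chr1": "1", "chr1_gl000192_random": "GL000192"}
patched = add_gl_version(table)
```

Names that already carry a version, and ordinary chromosome names, are left unchanged.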

10.9 years ago

See these posts for the conversion:

Naming chromosomes: from NCBI NC_0000ABC to UCSC chrABC (it's for NCBI, but you should be able to retrieve the names ucsc<->ncbi<->ensembl; please share your findings!)

How to sort bed format file


Thank you, Pierre. The information is interesting. It shows how to convert NCBI chromosome names to Ensembl ones; is there a way to convert UCSC names to NCBI names?

4.2 years ago
max ▴ 50

Note that historically there are small differences in the way that NCBI, EBI and UCSC name chromosomes. What is "MT" for EBI is called "chrMT" for NCBI and "chrM" for UCSC. If you used a genome that is not from UCSC for your analysis, you may have to fix these small differences. To convert EBI or NCBI chromosome names to UCSC chromosome names in a wig or bed text file, you can use UCSC's small utility chromToUcsc. Download it and make it executable (it is a python2/3 script), then run it without arguments to get the usage message:

wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc
chmod a+x chromToUcsc

Here is an example call:

chromToUcsc -g hg19 --get && chromToUcsc -i test.wig -o test.ucsc.wig -a hg19.chromAlias.tsv -g hg19
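To illustrate the kind of renaming involved, here is a small sketch that rewrites the first column of BED-style lines using an alias dictionary. It is not how chromToUcsc is implemented and is not a replacement for it; the function name `rename_bed_chroms` and the alias dictionary are made up for the example, and wig declaration lines are not handled.

```python
def rename_bed_chroms(lines, alias):
    """Rename the chromosome column of tab-separated BED-style lines.

    Header lines (comments, "track", "browser") are passed through as-is.
    Raises KeyError for a chromosome name that has no alias, so that
    unmapped sequences are noticed rather than silently kept.
    """
    header_prefixes = ("#", "track", "browser")
    renamed = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith(header_prefixes):
            renamed.append(stripped)
            continue
        fields = stripped.split("\t")
        if fields[0] not in alias:
            raise KeyError(f"no alias for chromosome name {fields[0]!r}")
        fields[0] = alias[fields[0]]
        renamed.append("\t".join(fields))
    return renamed


# Illustrative alias table: EBI-style names to UCSC-style names.
ebi_to_ucsc = {"1": "chr1", "MT": "chrM"}
converted = rename_bed_chroms(
    ["track name=example", "1\t100\t200", "MT\t0\t10"], ebi_to_ucsc
)
```

Failing on unknown names is a deliberate choice here; dropping or keeping such lines may be preferable depending on the analysis.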
