Tool: Converting Genome Coordinates From One Genome Version To Another (UCSC liftOver, NCBI Remap, Ensembl API)
11.5 years ago

Some recent posts reminded me that it might be useful for us to review the options for converting between genome coordinate systems.

This comes up in several contexts. Probably the most common is that you have coordinates for a particular version of a reference genome and you want to determine the corresponding coordinates on a different version of the reference genome for that species. For example, you have a BED file with exon coordinates for human build GRCh37 (hg19) and wish to update to GRCh38. By the way, for a nice summary of genome versions and their release names, refer to the Assembly Releases and Versions FAQ.

Or perhaps you have coordinates of a gene and wish to determine the corresponding coordinates in another species. For example, you have coordinates of a gene in human GRCh38 and wish to determine corresponding coordinates in mouse mm10.

Finally, you may wish to convert coordinates between coordinate systems within a single assembly. For example, you have the coordinates of a series of exons and you want to determine the position of these exons with respect to the transcript, gene, contig, or entire chromosome.
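To make the within-assembly case concrete, here is a minimal Python sketch of mapping a genomic position onto transcript coordinates. It is only an illustration, not the Ensembl API: the function name `genomic_to_transcript` and the exon list are hypothetical, and exons are assumed to be given as 1-based, inclusive (start, end) pairs on a single chromosome.

```python
def genomic_to_transcript(pos, exons, strand="+"):
    """Map a 1-based genomic position to a 1-based transcript position.

    exons: list of (start, end) pairs, 1-based and inclusive (assumed).
    Returns None when the position falls outside every exon (e.g. in an intron).
    """
    # Walk the exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # number of transcript bases contributed by exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None


# Hypothetical two-exon transcript
exons = [(100, 199), (300, 349)]
print(genomic_to_transcript(300, exons, "+"))  # first base of the second exon
print(genomic_to_transcript(250, exons, "+"))  # intronic position
```

For real data, the Ensembl core API handles the same bookkeeping together with the actual gene models.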

There are now several well-known tools that can help you with these kinds of tasks:

1. UCSC liftOver. This tool is available through a simple web interface or it can be downloaded as a standalone executable. To use the executable you will also need to download the appropriate chain file. Each chain file describes conversions between a pair of genome assemblies. liftOver can also be used through Galaxy. There is a Python implementation of liftOver called pyliftover that converts point coordinates only.

2. NCBI Remap. This tool is conceptually similar to liftOver in that it manages conversions between a pair of genome assemblies, but it uses different methods to achieve these mappings. It is also available through a simple web interface or you can use the API for NCBI Remap.

3. The Ensembl API. The final example I described above (converting between coordinate systems within a single genome assembly) can be accomplished with the Ensembl core API. Many examples are provided within the installation, overview, tutorial and documentation sections of the Ensembl API project. In particular, refer to these sections of the tutorial: 'Coordinates', 'Coordinate systems', 'Transform', and 'Transfer'.

4. Assembly Converter. Ensembl also offers its own simple web interface for coordinate conversions, called the Assembly Converter.

5. Bioconductor rtracklayer package. For R users, Bioconductor has an implementation of UCSC liftOver in the rtracklayer package. To see documentation on how to use it, open an R session and run the following commands.

# biocLite() has been retired; current Bioconductor releases install packages through BiocManager
install.packages("BiocManager")
BiocManager::install("rtracklayer")
library(rtracklayer)
?liftOver

6. CrossMap. A standalone open source program for convenient conversion of genome coordinates (or annotation files) between different assemblies. It supports most commonly used file formats, including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, and VCF. CrossMap is designed to lift over genome coordinates between assemblies; it is not a program for aligning sequences to a reference genome, and it is not recommended for converting genome coordinates between species.

7. Flo. A liftover pipeline for different reference genome builds of the same species. It describes the process as follows: "align the new assembly with the old one, process the alignment data to define how a coordinate or coordinate range on the old assembly should be transformed to the new assembly, transform the coordinates."

8. Picard LiftoverVcf. Lifts over a VCF file from one reference build to another. This tool adjusts the coordinates of variants within a VCF file to match a new reference. The tool is based on UCSC liftOver and uses a UCSC chain file to guide its operation.
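Since several of the tools above rely on chain files, a toy Python sketch may help show what a chain encodes. In the UCSC chain format, each alignment data line is "size dt dq": the length of an ungapped block, followed by the gap to the next block in the old (target) and new (query) assembly. The sketch below is a simplification under stated assumptions: the function names `chain_blocks` and `lift` are hypothetical, the start offsets are taken from a made-up chain header, coordinates are 0-based, and only the plus strand is handled. It is not how liftOver itself is implemented.

```python
def chain_blocks(t_start, q_start, data_lines):
    """Turn chain data lines ("size dt dq", last line "size") into ungapped blocks.

    Returns a list of (old_start, new_start, size) tuples, 0-based.
    """
    blocks = []
    t, q = t_start, q_start
    for line in data_lines:
        fields = [int(x) for x in line.split()]
        size = fields[0]
        blocks.append((t, q, size))
        t += size
        q += size
        if len(fields) == 3:  # gap before the next block, in each assembly
            t += fields[1]
            q += fields[2]
    return blocks


def lift(pos, blocks):
    """Map a 0-based position on the old assembly to the new one.

    Returns None if the position falls in a gap or outside the chain.
    """
    for t, q, size in blocks:
        if t <= pos < t + size:
            return q + (pos - t)
    return None


# Made-up chain: starts at 1000 on the old assembly and 5000 on the new one
blocks = chain_blocks(1000, 5000, ["100 10 0", "50 0 5", "30"])
print(lift(1110, blocks))  # inside the second block
print(lift(1105, blocks))  # in a gap on the old assembly, so unmappable
```

Positions that fall in a gap have no counterpart in the new assembly, which is one reason real tools report some input features as unmapped.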

ensembl genome-coordinates liftover • 149k views
ADD COMMENT

CrossMap is a program for convenient conversion of genome coordinates between assemblies. It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF.

http://crossmap.sourceforge.net/

ADD REPLY

neat tool, should add this to my toolbelt

ADD REPLY

By default I use liftOver and haven't really ever considered using the other offerings, so thanks for the summary. I wonder, at least for common genomes like hg19 or mm9, whether anyone has tested whether any of the tools outperforms the others. I know UCSC uses "chains", but presumably the other methods differ.

ADD REPLY

Thanks for summarizing these. We need to start linking to this post when this question pops up again.

ADD REPLY

I'll just add that for R/Bioc users, the rtracklayer package has an implementation of liftOver that is native to R, so the UCSC liftOver tool is not needed directly. The Bioc version is said to be faster than the UCSC version, but I have not tested this myself.

ADD REPLY

Thanks. I have now added a brief intro to this in the original post.

ADD REPLY

Thanks for the informative post.

ADD REPLY

Thanks for the informative post. I want to convert a SNP file for maize from V2 to V3. How can I create the chain file to perform this conversion?

ADD REPLY

Did you find a way to create the chain file? Thanks!

ADD REPLY

No, but I contacted the maize database and they sent it to me.

ADD REPLY

We've recently posted the segment_liftover tool to bioRxiv: https://www.biorxiv.org/content/early/2018/03/01/274084. The name is rather descriptive of the aim & methodology (the paper has a bit more :-) ).

ADD REPLY

Nice summary, thanks. I recently encountered a problem when trying to convert coordinates between the GRCh38 and GRCh37 assemblies.

The problem is that the tools sometimes give very unexpected conversions. Below is an example using liftOver, which I have tested in both the UCSC web portal and the local standalone version.

The test SNP is rs138257042, which has the coordinates GRCh38 chr22:15528888 and GRCh37 chr22:16449075. I converted it from GRCh38 to GRCh37 through the web portal http://www.genome.ucsc.edu/cgi-bin/hgLiftOver with the following input:

chr22 15528888 15528889 rs138257042

and got the following unexpected output:

chr14 19378323 19378324 rs138257042

The converted coordinate is clearly incorrect: it even lands on a different chromosome, chr14.

In some cases the error is less severe but more difficult to identify. For example, SNP rs200923174 has the coordinate chr22:16287557 in the GRCh37 assembly. I lifted it over using the following BED-format input:

chr22 16287557 16287558 rs200923174

and the chain file downloaded from here: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

The converted coordinate then has the following BED output:

chr22 15690404 15690405 rs200923174

The GRCh38 coordinate of this SNP is 15690406 according to dbSNP (https://www.ncbi.nlm.nih.gov/snp/?term=rs200923174), which differs from the lifted-over coordinate by 2 bases.

Because I rely on the converted coordinate as a key to match genomic variants called from different assemblies, this is a big issue for me.

I have tried another tool, Remap, but the same issue exists. So, is there a tool that can convert to the precise coordinate for EVERY given genomic variant, or have I just misused tools? Many thanks for any suggestions.

ADD REPLY

I just misused tools?

I think you are confused by 0 vs 1-based coordinate systems: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

If you liftOver the following site from hg19 to hg38 :

SNP: rs200923174

  • hg19 - chr22:16287557-16287557

becomes

  • hg38 - chr22:15690406-15690406

Also, using your example interval of 16287557-16287558 still gives chr22:15690405-15690406 in hg38 with the UCSC online tool... Not sure where your discrepancy is coming from; maybe strand information?
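For reference, the 0-based vs 1-based distinction can be sketched in a few lines of Python. BED intervals are 0-based and half-open, whereas positions written as chr:pos (as in dbSNP or the chr:start-end form above) are 1-based and inclusive. The helper names below are hypothetical and just illustrate the arithmetic.

```python
def one_based_to_bed(chrom, pos):
    """Convert a single 1-based position to a 0-based, half-open BED interval."""
    return (chrom, pos - 1, pos)


def bed_to_one_based(chrom, start, end):
    """Convert a 0-based, half-open BED interval to a 1-based, inclusive range."""
    return (chrom, start + 1, end)


# The hg19 position of rs200923174 quoted above, written as a BED interval
print(one_based_to_bed("chr22", 16287557))
# The BED line "chr22 16287557 16287558" read back as a 1-based range
print(bed_to_one_based("chr22", 16287557, 16287558))
```

Read this way, a BED line whose start column holds the 1-based SNP position actually describes the base one position to the right.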

ADD REPLY

Thank you for asking this question. I tried to use the liftOver website to convert the site rs138257042. It also failed when I used 1-based coordinates (chr22:15528888-15528888) to convert from hg38 to hg19. The result is chr14:19378323-19378323.

However, the correct result is obtained when converting from hg19 to hg38. I have tried both the 0-based and 1-based coordinate systems. The problem appears to lie in the hg38-to-hg19 direction.

ADD REPLY

rtracklayer's description seems incomplete.

ADD REPLY

With the release of hg38, we need to revisit the contents of this post! :)

ADD REPLY

I have updated the example, but it should be noted that the majority of these tools generically support all major builds for multiple species (not just human). When a new build comes out, the "chain" files that explain how to convert to/from that build are usually released soon after.

ADD REPLY

Any updates? I'm trying to find a reliable tool for cross-species mapping. Unfortunately, I can't find a benchmark of the tools that are out there. Suggestions welcome.

ADD REPLY

Hey! Did you find anything? I have a still-unpublished assembly that I used to align, map and annotate my Seq data, and I would need to have the zebra finch reference assembly converted to its coordinates, so that I can use the zebra finch as my reference in my evo analysis. :/

ADD REPLY

Can I use this liftover to map coordinates between bacterial subspecies? My aim is to do an integrative analysis of public RNA-seq data available for a particular bacterial species, S. aureus, but each experiment was done in a different strain/subspecies.

What I plan to do is to align the reads to their respective reference genomes, and for further analysis, create an annotation file (GFF/GTF) - based on one of the selected subspecies (chosen "target" for lift over) and combine it with the mapped annotation of other subspecies ("source" for lift over).

Is this procedure right? Or are there any other alternatives? I do not wish to do all RNA-seq analysis separately and then simply compare the results of differential expressed gene lists.

ADD REPLY

Hello, thank you for the amazing work compiling all these different tools.

From what I understand, most of these tools work well between different assembly versions (say 37 vs 38), BUT if I want to compare data within the same assembly but across different releases, do you have any suggestions for the best approach? I need to compare data from mouse assembly 38, Ensembl release 73, with the latest release, Ensembl 90. I have transposon data with the coordinates of each hit and information about the genome region it hit (gene X/intergenic). Thank you!

ADD REPLY

Converting coordinates across assemblies is the easy bit. That is the basic function of most of these tools.

ADD REPLY

My problem here is that I want to make comparisons within the same assembly, but just different releases. And for that none of these tools will work. Or is there any way to do it?

Thank you !

ADD REPLY

What exactly is it that you want to compare? What difference are you hoping to show between releases?

Have you taken a look at Ensembl Biomarts archive?

ADD REPLY

I have transposons data organised in the following manner (txt)

chr  start  end   hit
1    111    130   geneXXX
2    1546   1867  intergenic
3    123    234   geneYYY

this is for the assembly 38, Ensembl release 73.

I now want to compare it with release 90 of Ensembl to confirm whether the transposon hits are still accurate in the most recent version (e.g. to know whether the transposon on chr 2 still hits an intergenic region or whether that region is now attributed to a gene, or whether the region on chr 3 where my transposon hit is still annotated as geneYYY, and so on).

Thanks
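The check described here amounts to an interval-overlap test against the newer annotation, which can be sketched in Python as below. This is only a rough illustration: the function name `annotate_hits`, the gene list, and the gene name are made up, and coordinates are assumed to be 1-based and inclusive on the same assembly in both tables.

```python
def annotate_hits(hits, genes):
    """Label each hit with the genes it overlaps, or "intergenic" if none.

    hits:  list of (chrom, start, end)
    genes: list of (chrom, start, end, name) from the annotation release of interest
    """
    labels = []
    for chrom, start, end in hits:
        names = [
            g_name
            for g_chrom, g_start, g_end, g_name in genes
            # Two closed intervals overlap when each starts before the other ends.
            if g_chrom == chrom and g_start <= end and start <= g_end
        ]
        labels.append(",".join(names) if names else "intergenic")
    return labels


# The three hits from the table above, against a made-up newer annotation
hits = [("1", 111, 130), ("2", 1546, 1867), ("3", 123, 234)]
new_genes = [("1", 100, 500, "geneXXX"), ("2", 1800, 2500, "geneNEW")]
print(annotate_hits(hits, new_genes))
```

Comparing these labels with the "hit" column from the older release would show which transposon hits changed status between annotation releases.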

ADD REPLY

Comparing annotations within the same assembly (which is what you are asking, I think) is a problem unrelated to converting genomic coordinates. I'd suggest asking a new question, taking care to describe specifically what you want to do.

ADD REPLY

Thanks for this summary list

ADD REPLY

Does anybody know whether any of these tools can be used to convert hg19 coordinates to those of a de novo assembled genome?

ADD REPLY

Does anyone know if there is a tool to make the chain files required for all of these programs without relying on UCSC tools? It looks like every one of them relies on UCSC dependencies that you need to pay for if you aren't an academic and want to make a chain file for your own genome.

ADD REPLY

Has anyone tried the VCF-liftover tool? It claims to be faster and more memory efficient than Picard.

ADD REPLY

I came across this thread while searching for a way to convert my Dante BAM file from hg19 to hg38. I got caught in Dante's transition from hg19 to hg38, so my results are hg19. I have instructions from 2019 for doing this via Sequencing.com with their EVE program, but that has changed and it no longer does the conversion. In short, it takes my FASTQ file and gives me an hg38 BAM file. The instructions are to select the following options in EVE: "Select Target Format" -> "BAM", "Select Reference Genome" -> "GRCh38.p12 (Dec 2017)", "Select Preprocessing" -> (use both, but know that cutadapt is supposed to remove sequencing primers and, if you are VERY unlucky, it might mess up a disease interpretation, but the primers cause issues with all downstream analysis in general), "Select Alignment / Mapping" -> "BWA", "Select Variant Calling" -> "GATK", "Select Annotation" -> "VEP", "Select Interpretation" -> "Both ClinVar Report & Annotation". The options are in drop-down menus. Is there something "simple" like this currently available that will do the conversion, with instructions that someone without a genomics background (me) can follow?

ADD REPLY

But FASTQ format is for raw unmapped reads, while BAM is for (trimmed, filtered, etc) reads mapped to reference genome. So it sounds like you are just mapping reads to a new reference, which is not what liftover does. Liftover takes files already mapped to one reference and converts the coordinates to a different reference.

ADD REPLY
10.7 years ago

This new tool seems interesting: CrossMap. It can convert many formats, such as SAM and Wiggle.

ADD COMMENT
3.4 years ago
bw. ▴ 260

I recently put up https://liftover.broadinstitute.org/ to make it a little easier to check liftover coords for specific variants or intervals. It takes fewer clicks and supports more input formats than UCSC liftover and Assembly Converter.

ADD COMMENT

Liftoff maps annotations in GFF or GTF format between assemblies.

ADD REPLY
3.1 years ago
olechnwin ▴ 60

Is Liftoff the new kid on the block?

Has anyone tried Liftoff? I've used flo before. It was rather difficult to install, but it worked. Now I need to do this again. I wonder whether it's worth trying something new (Liftoff was developed recently, in 2020) or sticking with flo since I know it works.

ADD COMMENT

I recently used Liftoff and the result was pretty good, but I am still running QC on the output. Not needing to make a chain file is really nice.

ADD REPLY
