Question

Reversing direction of chain files (E.g. from hg18-->hg19 to hg19-->hg18)

9.2 years ago

dylkot ▴ 10

I need to convert a chain file that allows liftover from one reference to another (reference X-->Y) into one that permits the reverse liftover (reference Y-->X). I thought that this would be as simple as swapping the second and third columns of the data lines and making the corresponding adjustment to the header lines, which is what the Kent utilities program chainSwap does. However, downloading the files hg18ToHg19.over.chain.gz and hg19ToHg18.over.chain.gz shows that this procedure would not convert the hg18ToHg19 file into the hg19ToHg18 one. The first few lines of each file look like this:

hg18 --> hg19

chain 21270150726 chr1 247249719 + 0 247199719 chr1 249250621 + 10000 249233096 2
616    0    137
166664    50000    50000
40302    50000    50000
153649    50000    50000
1098446    269    272
773    1    1

hg19 --> hg18

chain 21270138829 chr1 249250621 + 10000 249233096 chr1 247249719 + 0 247199719 2
619    137    0
166661    50000    50000
40302    50000    50000
153649    50000    50000
1098479    1    1
47    1    1

Am I wrong in concluding that these chain files correspond to different ungapped interval mappings between the 2 references? I would interpret the hg18-->hg19 file as corresponding to the following interval mappings:

[0, 616) --> [10000, 10616)
[616, 167280) --> [10753, 177417)
[217280, 257582) --> [227417, 267719)
[307582, 461231) --> [317719, 471368)
[511231, 1609677) --> [521368, 1619814)
[1609946, 1610719) --> [1620086, 1620859)

while the hg19-->hg18 file would correspond to the following interval mappings:

[10000, 10619) --> [0, 619)
[10756, 177417) --> [619, 167280)
[227417, 267719) --> [217280, 257582)
[317719, 471368) --> [307582, 461231)
[521368, 1619847) --> [511231, 1609710)
[1619848, 1619895) --> [1609711, 1609758)

where a square bracket [ denotes a boundary that is included in the interval and a parenthesis ) denotes a boundary that is excluded from it. I would have thought that these interval mappings should be the same. Am I missing something here? Is there a good general approach for switching the direction of a chain file?
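For anyone who wants to check the interval listings above, here is a minimal sketch of the conversion. It assumes the standard chain layout (header fields `chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`, data lines `size dt dq`) and that both strands are `+`, as in the excerpts. The function and variable names are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)

def chain_to_intervals(header: str, block_lines: List[str]) -> List[Tuple[Interval, Interval]]:
    """Turn one chain header plus its data lines into ungapped interval pairs.

    Assumption: both strands are "+", so coordinates only ever increase.
    """
    fields = header.split()
    t_pos = int(fields[5])   # tStart
    q_pos = int(fields[10])  # qStart
    pairs = []
    for line in block_lines:
        parts = [int(x) for x in line.split()]
        size = parts[0]
        pairs.append(((t_pos, t_pos + size), (q_pos, q_pos + size)))
        # The final data line of a chain carries only a size; treat its gaps as 0.
        dt, dq = (parts[1], parts[2]) if len(parts) == 3 else (0, 0)
        t_pos += size + dt
        q_pos += size + dq
    return pairs

# First lines of hg18ToHg19.over.chain as quoted in the question.
HG18_TO_HG19_HEADER = "chain 21270150726 chr1 247249719 + 0 247199719 chr1 249250621 + 10000 249233096 2"
HG18_TO_HG19_BLOCKS = [
    "616 0 137",
    "166664 50000 50000",
    "40302 50000 50000",
    "153649 50000 50000",
    "1098446 269 272",
    "773 1 1",
]

pairs = chain_to_intervals(HG18_TO_HG19_HEADER, HG18_TO_HG19_BLOCKS)
for (t0, t1), (q0, q1) in pairs:
    print(f"[{t0}, {t1}) --> [{q0}, {q1})")
```

Feeding the hg19ToHg18 header and data lines through the same function gives the second listing.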

Thanks!

chain-files alignment liftover • 3.9k views


Ram · Answer 1 · 2015-03-16

9.2 years ago

Brian Bushnell 20k

This is not possible in the general case because the difference is not simply different coordinates for a given feature. Rather, there are features that do not exist in one or the other, condensed repeats, etc., which would cause errors. It would only be reversible if the relationship were 1-to-1 and onto, and it is neither.

Of course, liftover is not lossless even in the best case, so it may not matter all that much.


Hi Brian,

Thanks for your response. Perhaps I'm misunderstanding, but isn't it the case that any feature that is present in one reference but not the other can be represented in a chain file by a gap? It still seems to me that the same interval pairings should exist in the reversed file.
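To make the comparison concrete, here is a rough sketch of the naive swap I had in mind, applied to the hg18ToHg19 excerpt. This is only my reading of "swap the columns and exchange the header fields"; it is not the chainSwap source, it assumes both strands are `+`, and the function name is made up.

```python
def swap_chain(header, block_lines):
    """Naive reversal: exchange target and query header fields and swap dt/dq.

    Assumption: both strands are "+"; a "-" query strand would need more work.
    """
    f = header.split()
    # f[2:7] = tName tSize tStrand tStart tEnd, f[7:12] = the matching q fields
    swapped_header = " ".join(f[:2] + f[7:12] + f[2:7] + f[12:])
    swapped_blocks = []
    for line in block_lines:
        parts = line.split()
        if len(parts) == 3:
            size, dt, dq = parts
            swapped_blocks.append(f"{size} {dq} {dt}")
        else:
            swapped_blocks.append(parts[0])  # last line of a chain: size only
    return swapped_header, swapped_blocks

HG18_TO_HG19_HEADER = "chain 21270150726 chr1 247249719 + 0 247199719 chr1 249250621 + 10000 249233096 2"
HG18_TO_HG19_BLOCKS = ["616 0 137", "166664 50000 50000", "40302 50000 50000",
                       "153649 50000 50000", "1098446 269 272", "773 1 1"]
# First data lines of the downloaded hg19ToHg18.over.chain, as quoted in the question.
HG19_TO_HG18_BLOCKS = ["619 137 0", "166661 50000 50000", "40302 50000 50000",
                       "153649 50000 50000", "1098479 1 1", "47 1 1"]

swapped_header, swapped_blocks = swap_chain(HG18_TO_HG19_HEADER, HG18_TO_HG19_BLOCKS)
print(swapped_header)
for mine, theirs in zip(swapped_blocks, HG19_TO_HG18_BLOCKS):
    marker = "same" if mine == theirs else "DIFFERENT"
    print(f"{mine:<22} {theirs:<22} {marker}")
```

The swapped excerpt keeps the same block sizes with the gap columns mirrored, while the downloaded file has different block sizes (619 vs 616, 166661 vs 166664, and so on) and a different score, so at least in these first lines the reverse file is not a mirror image of the forward one.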

written 9.2 years ago by dylkot ▴ 10

If you assume that the human genome can be represented in exactly 25 nonrepetitive sequences (1-22, X, Y, M), that implies that any change between one build and another is just a rearrangement, which can be accommodated by liftovers. In that case you could remap things without loss of information.

Unfortunately, real genomic sequences are redundant, and officially sanctioned sequences still contain errors. Generally, if you have two builds of the same genome, they were probably not done by all of the same people, with the same assumptions - so, they will differ, even if the source data was identical.

Long story short, no: missing portions in one assembly will not always correspond to a gap in the other. Imagine that some sequence in reference 1 is present in 4 copies in reference 2. Which copy should you map it to? There is no way to say. Liftover may give proper guidance, but 75% of the time it may not, because by default a typical mapper assigns such a sequence at random, which favors the genome with the most copies. And even if liftover gives a correct assignment, it does not take into account the new differences between the genomic sequences discovered since the liftover file was generated.
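As a toy illustration of the 4-copy case, with entirely hypothetical segment names rather than real coordinates, the forward mapping is one-to-many, so inverting it cannot recover a single answer:

```python
from fractions import Fraction

# Hypothetical toy data: one segment in reference 1 has four copies in reference 2.
copies_in_ref2 = {"seg": ["copy_a", "copy_b", "copy_c", "copy_d"]}

# Going from reference 2 back to reference 1, every copy points at the same segment.
back_to_ref1 = {copy: seg for seg, copies in copies_in_ref2.items() for copy in copies}

# Going from reference 1 to reference 2, a uniform random choice among the copies
# hits any one particular copy with probability 1/n.
candidates = copies_in_ref2["seg"]
p_correct = Fraction(1, len(candidates))
p_wrong = 1 - p_correct

print(back_to_ref1)
print(f"chance of picking the intended copy: {p_correct}, of missing it: {p_wrong}")
```

Under that uniform-choice assumption the miss rate is 3/4, which is where the 75% figure comes from.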

When possible, I highly recommend remapping raw reads to a new genome and calling variants, rather than using liftover. You will get different results; and where the results differ, those derived from the newer data are probably more correct.

written 9.2 years ago by Brian Bushnell 20k

Thanks for the clarification. So if I understand correctly, the issues relate to the specific implementation of alignment used to generate the chain files, errors in the officially sanctioned sequences, and incompatibilities due to lack of contiguity information for repeated segments.

I think I follow your example.

written 9.2 years ago by dylkot ▴ 10