Taking HapMap Data To The Latest Genome Build
11.2 years ago

I would like to use the LD data in HapMap in combination with the latest genome annotation data in Ensembl 59. Unfortunately, if I am not mistaken, HapMap release 27 is based on NCBI36 coordinates, while the latest Ensembl uses the latest genome build (GRCh37). As far as I understand, the latest Ensembl version with compatible coordinates would be 54, correct?

Edit:

Core question: how many SNPs (rs-ids) are re-annotated (deleted, renamed) between releases of dbSNP?

It seems that nobody has undertaken a full liftover of the HapMap bulk data to update all coordinates; at least I could not find any information about this. So I was thinking about trying to do this myself.

This question is somewhat related: http://biostar.stackexchange.com/questions/916/how-do-you-manage-moving-existing-projects-to-a-new-genome-build There, the LiftOver tool was presented as a solution.

So here are my questions:

• Has anybody already tried this, or would anybody else like to have this data, too?
• What would be the best approach to the bulk conversion? For example, running liftOver on the genomic coordinates, or would it be better to convert based on matching rs SNP ids?
• Is that a valid approach at all?
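For the coordinate-based option, the positions first have to be written in a format liftOver accepts. Below is a minimal sketch, assuming the HapMap positions are 1-based, that each LD file covers a single chromosome whose name is supplied by the caller, and that UCSC-style names such as `chr22` are wanted. The function name `to_bed_line` is hypothetical.

```python
def to_bed_line(chrom: str, pos_1based: int, name: str) -> str:
    """Format one SNP as a BED line (0-based start, exclusive end)."""
    # BED intervals are half-open, so a single 1-based position p
    # becomes the interval [p - 1, p).
    return f"{chrom}\t{pos_1based - 1}\t{pos_1based}\t{name}"


# Example using an rs id and position taken from the LD data in this thread.
print(to_bed_line("chr22", 14430353, "rs2334386"))
```

The resulting BED file could then be passed to liftOver together with an NCBI36-to-GRCh37 chain file (for example UCSC's hg18ToHg19 chain). Positions that fail to convert end up in liftOver's unmapped output and would need to be checked separately.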

Any suggestions welcome.

Edit: One of the main concerns that I should mention is that SNPs are renamed, deleted, or have their positions changed. So I increasingly get the impression that neither approach is enough on its own: just mapping coordinates (the coordinates could be fine, but the SNP could have disappeared or been renamed in dbSNP) or simply mapping the ids, even though the latter may be safer. I think Jorge's answer points in a good direction.

hapmap ensembl coordinates conversion genome • 6.4k views
11.2 years ago

The LD data look like this:

14430353 14441016 ASW rs2334386 rs9617528 1.0 0.0020 0.06 144
14430353 14564328 ASW rs2334386 rs7288972 1.0 0.0010 0.04 144
14441016 14564328 ASW rs9617528 rs7288972 1.0 0.0020 0.04 144
14805814 14809328 ASW rs7285246 rs12163493 1.0 0.0040 0.1 148
14805814 14870204 ASW rs7285246 rs8138488 0.088 0.0040 0.04 148


So, to re-map the data, I would sort the data on the SNP name and join it (with Unix join or in SQL) against the SNP table from UCSC, which gives the new position for each SNP.
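The join described above can be sketched in a few lines. This is only an illustration, assuming the LD lines are whitespace-separated with the two positions in columns 1 and 2 and the two rs ids in columns 4 and 5 (as in the sample above), and that the UCSC SNP table has already been reduced to a dictionary from rs id to new position. The names `remap_ld` and `new_pos` are hypothetical, and the positions in the example are made up.

```python
from typing import Dict, Iterable, List, Tuple


def remap_ld(ld_lines: Iterable[str],
             new_pos: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Replace both positions of each LD line using an rs id lookup.

    Returns (remapped_lines, unmapped_lines). A line is unmapped when
    either of its two rs ids is missing from new_pos.
    """
    remapped: List[str] = []
    unmapped: List[str] = []
    for line in ld_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        rs1, rs2 = fields[3], fields[4]
        if rs1 in new_pos and rs2 in new_pos:
            fields[0] = str(new_pos[rs1])
            fields[1] = str(new_pos[rs2])
            remapped.append(" ".join(fields))
        else:
            unmapped.append(line.strip())
    return remapped, unmapped


ld = [
    "14430353 14441016 ASW rs2334386 rs9617528 1.0 0.0020 0.06 144",
    "14430353 14564328 ASW rs2334386 rs7288972 1.0 0.0010 0.04 144",
]
# Made-up positions for illustration only, not real GRCh37 coordinates.
lookup = {"rs2334386": 1000, "rs9617528": 2000}
ok, missing = remap_ld(ld, lookup)
```

Keeping the unmapped lines in a separate list makes it easy to see how many pairs are lost because an rs id was renamed, merged, or deleted.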


The easiest way to be completely sure about that is to look at the merge data from dbSNP. For several reasons we are waiting to update to dbSNP 131, but I can tell you that we maintain a list of such merge information for HapMap using dbSNP 130, and the number of rs codes merged since dbSNP 126 (the build HapMap is based on) is ~1200. Add to that figure a small number of rs codes that have been removed (unfortunately we do not keep track of that information), and you will be able to make your own decisions.


Thanks, Pierre. That looks like a viable solution. But what if rs-ids were changed, or SNPs were renamed or deleted? I heard that this happens, rarely, but still.


Michael, Pierre: Is this SNP re-annotation (renaming, deletion, etc.) as common as protein sequence revisions? Is there a way to track the changes in SNP annotations?


@Khader I don't know how frequent this re-annotation of SNPs is; that was one of the reasons for asking. If it were only a few tens of SNPs I would not bother, but I am not sure. Possibly Jorge's answer contains the clue.

11.2 years ago

We did indeed come across this issue a few months ago, and at the time we considered all the options mentioned in your post. If you are looking to integrate LD data, Pierre's solution is really interesting, but if you want to dig deeper (as I guess from your comment on his answer) I can tell you what we did.

We work with several variation repositories (HapMap, 1000 Genomes, Perlegen, ...), and we have to make a huge normalization effort in order to merge all this data seamlessly. Initially we tried using liftOver, but we had trouble accounting for all the information that was left out, so we decided to study another option: using dbSNP as the template for all the rs codes we had. The main reason was that dbSNP maintains a record of the rs code changes across all previous versions, so we were able to get all the data we needed for each rs SNP (chromosome, position, validation status, gene presence, ...) from the latest dbSNP build's chromosome reports and merge information, confident that we weren't losing any data along the way.

The latest dbSNP build is currently 131, which is mapped to GRCh37, so this has worked perfectly for us, but using Pierre's solution (merging HapMap's data with UCSC's) and adding the SNP merge data from dbSNP should work too. Please take into account that HapMap data still uses dbSNP build 126 (even in the latest release a few days ago), so using dbSNP's rs code merge information is almost mandatory if you are planning to work only at the rs code level.
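Applying merge information amounts to following each old rs code to its current one before looking up coordinates. Here is a minimal sketch, assuming the dbSNP merge data has already been loaded into a dictionary from a retired rs code to the code it was merged into. The real dbSNP file layout is not reproduced here, and the name `resolve_rs` and the ids `rsA`, `rsB`, `rsC` are hypothetical placeholders.

```python
from typing import Dict, Set


def resolve_rs(rs_id: str, merged_into: Dict[str, str]) -> str:
    """Follow merge records until reaching an rs code that is not merged.

    An rs code may have been merged more than once across builds, so the
    chain is followed step by step. The `seen` set guards against an
    accidental cycle in the input data.
    """
    seen: Set[str] = set()
    current = rs_id
    while current in merged_into and current not in seen:
        seen.add(current)
        current = merged_into[current]
    return current


# Hypothetical merge history: rsA was merged into rsB, and later rsB into rsC.
merges = {"rsA": "rsB", "rsB": "rsC"}
current_id = resolve_rs("rsA", merges)
```

Codes that were deleted rather than merged will pass through unchanged and then fail the later coordinate lookup, so they still need to be reported separately.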

11.2 years ago

Just one minor point to add: We maintain a "SNP alias/Common name" field in order to incorporate a) names used in the literature (before or in deference to dbSNP IDs) or b) private, non-dbSNP variants.

I agree that dbSNP is the place to look for rs synonyms.


We do so in our systems too, as we use data from CEPH arrays and Perlegen, and we wanted to allow users to search by private codes. Also, as we work with 1000 Genomes data, when we include variants that haven't been submitted to dbSNP yet we have to use such a field to identify the variant somehow (we use a "chromosome_position" format).
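A fallback identifier of this kind could be built as follows. This is a sketch only: the function name `variant_key` is hypothetical, and whether the chromosome part carries a `chr` prefix is an assumption, since the post only names the "chromosome_position" format.

```python
from typing import Optional


def variant_key(chrom: str, pos: int, rs_id: Optional[str] = None) -> str:
    """Return the rs code if one exists, else a chromosome_position key."""
    if rs_id:
        return rs_id
    # No dbSNP code yet: identify the variant by its location instead.
    return f"{chrom}_{pos}"


# A variant without an rs code, using a position from the LD sample above.
key = variant_key("22", 14430353)
```

A position-based key is tied to one genome build, so it has to be regenerated whenever the coordinates are converted.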