Question

SNP to CHR:POS mapping: dealing with genome patches

4.1 years ago

petar.v.todorov ▴ 10

Hi!

I'm attempting to map some rsIDs to chromosome and position on hg19 via BioMart, as described in a prior post. I've run into several cases where a single SNP returns multiple entries for chromosome and position.

refsnp_id   chr_name      chrom_start   chrom_end
rs2950012   17            43667365      43667365
rs2950012   HG1146_PATCH  43777997      43777997
rs2950012   HSCHR17_1     44585673      44585673

As per another post, I noticed that the patches can be assigned to a particular chromosome. I'm curious which of the three is correct and should be retained in a situation like this: one of the patch/haplotype entries, or the one mapped to chr 17?

Thanks!

biomart SNP identifier genome hg19 • 1.1k views

updated 4.1 years ago by Ben_Ensembl ★ 2.4k • written 4.1 years ago by petar.v.todorov ▴ 10

score 0 · Answer 1 · 2020-03-23

Hi Petar,

What is 'correct' will depend on what you want to do with the data after your mapping process. The alternative sequences (patches and haplotypes) are alternative representations of the same region of the genome. Many people decide to work only with the primary assembly, because the redundancy introduced by these sequences complicates processes such as alignment. On the other hand, understanding the alternative sequences that have been found in this region of the genome may be important for your project.

Here are some brief definitions to help you make your decision:

Primary assembly: The underlying genome sequence, without alternative sequences included.

Haplotype (genome): Known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus). These were included as part of the genome assembly when it was first produced.

Patch: New sequences that have been added to the genome assembly since its release. There are two types: fix and novel patches.
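If you do decide to keep only the primary assembly, one way to do it is to filter the BioMart output by chromosome name. The following is only a minimal sketch: it assumes tab-separated output with the column names shown in the question, and the names PRIMARY_CHROMS and keep_primary are made up for illustration. It also assumes that primary-assembly chromosomes are named 1-22, X, Y and MT, without a "chr" prefix.

```python
import csv
import io

# Assumption: primary-assembly chromosomes are named 1-22, X, Y and MT,
# without a "chr" prefix. Anything else (patches, haplotypes) is dropped.
PRIMARY_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def keep_primary(rows):
    """Return only the rows whose chr_name is a primary-assembly chromosome."""
    return [row for row in rows if row["chr_name"] in PRIMARY_CHROMS]


# The three rows from the question, written as tab-separated text.
biomart_tsv = (
    "refsnp_id\tchr_name\tchrom_start\tchrom_end\n"
    "rs2950012\t17\t43667365\t43667365\n"
    "rs2950012\tHG1146_PATCH\t43777997\t43777997\n"
    "rs2950012\tHSCHR17_1\t44585673\t44585673\n"
)

rows = list(csv.DictReader(io.StringIO(biomart_tsv), delimiter="\t"))
primary_rows = keep_primary(rows)

for row in primary_rows:
    print(row["refsnp_id"], row["chr_name"], row["chrom_start"])
```

For the example in the question, this keeps only the chromosome 17 row. If the alternative sequences matter for your project, you would keep the other rows and handle them separately instead.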

Finally, I've noticed that you are using BioMart for this mapping process, which is great. However, depending on the size of your dataset, you could also consider using the Variant Effect Predictor (VEP): http://www.ensembl.org/info/docs/tools/vep/index.html

BioMart is only useful for small and medium-sized datasets (approx. 500 variants per query), while the VEP can process millions of variants per query.
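If you stay with BioMart for a larger list, you could split the rsIDs into batches of roughly that size before querying. This is a rough sketch only: the helper name batches and the placeholder IDs are made up, the batch size of 500 is taken from the approximate figure above, and the query step itself is left out.

```python
def batches(ids, size=500):
    """Yield consecutive slices of ids, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs for illustration only; these are not real variants of interest.
rsids = [f"rs{n}" for n in range(1, 1202)]

chunks = list(batches(rsids))
batch_sizes = [len(chunk) for chunk in chunks]
print(batch_sizes)
```

Each batch can then be submitted as its own query and the results concatenated afterwards.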

Best wishes

Ben Ensembl Helpdesk