How Did The HapMap Project Work Out Its Predictions For Human Variation
1
13.2 years ago
User 6659 ▴ 970

Hello

I'm not from a biology background and would appreciate a little help. How did the HapMap project work out its predictions for human variation, such as the fact that people on average carry 300-400 loss-of-function variants? I have read the paper but don't understand all of the biological methods, hence my question here. I am assuming that they predicted the number of loss-of-function variants (using in silico tools) for each individual and took the average. Then, since they assumed they had captured 95% of the common variants, could they have adjusted this average to estimate what it would be if they had found 100% of the common variants?

If this is correct (which I very much doubt), then it seems like they are underestimating the extent of variation, as I have read that uncommon variants far outnumber common ones. So, for example (and I'm making this figure up), 95% of the common variants could represent only 10% of the total variation, and individuals could differ vastly in the amount of variation they exhibit, making such a prediction quite arbitrary.
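To make the arithmetic I'm assuming concrete, here is a toy Python sketch (all numbers are invented, just to illustrate the adjustment I have in mind and why I suspect it would undercount):

# Toy numbers only -- illustrating the adjustment I am assuming, not real figures.
observed_avg_lof = 300            # hypothetical average loss-of-function variants found per person
common_variant_coverage = 0.95    # assumed fraction of common variants captured

# the adjustment I am guessing at: scale up to 100% coverage of common variants
adjusted_avg_lof = observed_avg_lof / common_variant_coverage
print(f"adjusted average: {adjusted_avg_lof:.0f}")   # ~316

# but if common variants were only, say, 10% of all variation, this adjustment
# would say nothing about the other 90% contributed by rare variants
common_fraction_of_total = 0.10
print(f"fraction of total variation covered: {common_variant_coverage * common_fraction_of_total:.1%}")  # 9.5%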

I expect I'm on totally the wrong track!

thanks

hapmap • 2.7k views
8
13.2 years ago

You first have to understand that the HapMap project started back in 2002-2003, when high-throughput genotyping technologies were only just emerging. When it was defined at the very beginning, the aim was to acquire the most variation with the least effort. Bear in mind that TaqMan had become somewhat established a few years earlier for single-SNP typing, then Sequenom led the way in medium-throughput genotyping, then SNPlex thought they were going to eat the whole high-throughput genotyping cake, but suddenly Illumina appeared, and a couple of years later came Affymetrix with their current gold standard for large-scale genotyping.

When it all started, the HapMap project only had a list of potential candidate sites for variation coming from the recent human genome assembly, and typing was very expensive, so the main goal was to discover not only variants but haplotypes: groups of variants that move as blocks, so that typing one of them is enough to describe the rest of the block (here you may want to read about linkage disequilibrium and tag SNPs). Since the number of samples and the number of variants they could aim for were limited, they definitely had to focus their effort on common variation. And they did it great, in my honest opinion: they evolved their goals along with the techniques, they genotyped more samples (10x what was first planned), they reached ~4M variants typed in some of the original populations (CEU, CHB, JPT and YRI), which is ~15% of the variation currently stored in dbSNP, and among many other things they did a pretty nice job with the data analysis, spotting LD blocks throughout the genome, finding selective sweeps, and so on.
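To make the tag SNP idea a bit more concrete, here is a minimal toy sketch in Python (my own illustration, not HapMap's actual pipeline; the greedy picker is only a crude stand-in for what tools like Tagger, as used in Haploview, do far more carefully):

# Minimal sketch of why tag SNPs work: if two SNPs are in strong linkage
# disequilibrium (high r^2), typing one of them is enough to predict the other,
# so a whole LD block can be "tagged" by a single representative SNP.
import numpy as np

def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts."""
    r = np.corrcoef(snp_a, snp_b)[0, 1]
    return r ** 2

def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tag SNPs greedily so that every SNP has r^2 >= threshold
    with at least one chosen tag (a toy version of real tag selection)."""
    n_snps = genotypes.shape[1]
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # pick the SNP that tags the most remaining SNPs
        best, best_covered = None, set()
        for i in untagged:
            covered = {j for j in untagged
                       if r_squared(genotypes[:, i], genotypes[:, j]) >= threshold}
            if len(covered) > len(best_covered):
                best, best_covered = i, covered
        tags.append(best)
        untagged -= best_covered
    return tags

# toy data: 6 individuals x 4 SNPs; SNPs 0-2 sit on the same haplotype block
genotypes = np.array([
    [0, 0, 0, 2],
    [1, 1, 1, 0],
    [2, 2, 2, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 2],
    [2, 2, 2, 0],
])
print(greedy_tag_snps(genotypes))  # e.g. [0, 3]: one tag per block is enough

In this toy data SNPs 0-2 always travel together, so one tag plus the independent fourth SNP recovers all the information, which is exactly the saving that made typing only tag SNPs so attractive back then.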

I know things are much easier now: several databases already hold plenty of data (I even remember when we had to pay yearly to access Celera's database, as they had a different source of variation that was sometimes needed to complete a genotyping design, until they "released it to the world" by bulk-loading it into dbSNP), the technologies are able to genotype quickly and relatively cheaply, and next-generation sequencing is getting close to replacing those genotyping techniques when regions need to be deeply characterized or when those regions are too large.

The spirit of the original HapMap was somehow resurrected a couple of years ago: it is called the 1000 Genomes Project. The goal of both projects is the same (finding human variation), but the first was aware of the technology's limitations and tried to deal with them very intelligently, while the second knows that ultra-high-throughput sequencing will soon deliver millions of variants that need to be characterized quickly and easily, so it also aims for rare variants by analyzing thousands of samples, trying to end up with solid statistics that can be used as a reference.

1

I started doing bioinformatics when HapMap was born, so I wanted to share some of my thoughts here (sorry for the huge answer). Regarding your exact question: HapMap does not intend to characterize more than what was typed, and those were common variants with known positions; because they have been studied so deeply, such an estimation can be made for them. Rare variants or individual mutations are far less predictable, which is why the estimate for functional locations can't include that remaining 5%. I hope I didn't bother you too much, and that I shed some light on the issue ;)

0

Thank you, Jorge Amigo, for that excellent summary. Some of the interesting papers are linked here: http://bit.ly/gciL9Y

0

Jorge, thank you for your great answer. I feel very bad now, but I asked the question wrong :(. I meant to say the 1000 Genomes Project in the title but wrote HapMap by mistake. I was referring to the recently published paper on human variation predictions. I thought it would be easier to open a new question rather than confuse this one. However, your answer on HapMap was really helpful in providing some context and history for the past 10 years.
