How Did The HapMap Project Work Out Its Predictions For Human Variation
1
13.2 years ago
User 6659 ▴ 970

Hello

I'm not from a biology background and would appreciate a little help. How did the HapMap project work out its predictions for human variation, such as the fact that people on average carry 300-400 loss-of-function variants? I have read the paper but don't understand all of the biological methods, hence my question here. I am assuming that they predicted the number of loss-of-function variants (using in silico tools) for each individual and took the average. Then, since they assumed they had captured 95% of the common variants, could they have adjusted this average to estimate what it would be if they had found 100% of the common variants?

If this is correct (which I very much doubt), then it seems like they are underestimating the extent of variation, as I have read that uncommon variants far outnumber common ones. So, for example (and I'm making this figure up), 95% of the common variants could represent only 10% of the total variation, and individuals could differ vastly in the amount of variation they exhibit, making such a prediction quite arbitrary.
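To make the arithmetic I'm assuming concrete, here is a toy Python sketch (all numbers are invented, just to illustrate the adjustment I have in mind and why I suspect it would undercount):

# Toy numbers only -- illustrating the adjustment I am assuming, not real figures.
observed_avg_lof = 300            # hypothetical average loss-of-function variants found per person
common_variant_coverage = 0.95    # assumed fraction of common variants captured

# the adjustment I am guessing at: scale up to 100% coverage of common variants
adjusted_avg_lof = observed_avg_lof / common_variant_coverage
print(f"adjusted average: {adjusted_avg_lof:.0f}")   # ~316

# but if common variants were only, say, 10% of all variation, this adjustment
# would say nothing about the other 90% contributed by rare variants
common_fraction_of_total = 0.10
print(f"fraction of total variation covered: {common_variant_coverage * common_fraction_of_total:.1%}")  # 9.5%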

I expect I'm on totally the wrong track!

thanks

hapmap • 2.7k views
8
13.2 years ago

You first have to understand that the HapMap project started back in 2002-2003, when high-throughput genotyping technologies were only just emerging. When it was defined at the very beginning, the aim was to acquire the most variation with the least effort. Bear in mind that TaqMan had become somewhat established a few years earlier for single-SNP typing, then Sequenom led the way in medium-throughput genotyping, then SNPlex thought they were going to eat the whole high-throughput genotyping cake, but suddenly Illumina appeared, and a couple of years later came Affymetrix with their current gold standard for large-scale genotyping.

When it all started, the HapMap project only had a list of potential candidate sites for variation coming from the recent human genome assembly, and typing was very expensive, so the main goal was to discover not only variants but haplotypes: groups of variants that move as blocks, so that typing one of them is enough to describe the rest of the block (here you may want to read about linkage disequilibrium and tag SNPs). Since the number of samples and the number of variants they could aim for were limited, they definitely had to focus their effort on common variation. And they did it great, in my honest opinion: they evolved their goals along with the techniques, they genotyped more samples (10x what was first planned), they reached ~4M variants typed in some of the original populations (CEU, CHB, JPT and YRI), which is ~15% of the variation currently stored in dbSNP, and among many other things they did a pretty nice job with the data analysis, spotting LD blocks throughout the genome, finding selective sweeps, and so on.
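To make the tag SNP idea a bit more concrete, here is a minimal toy sketch in Python (my own illustration, not HapMap's actual pipeline; the greedy picker is only a crude stand-in for what tools like Tagger, as used in Haploview, do far more carefully):

# Minimal sketch of why tag SNPs work: if two SNPs are in strong linkage
# disequilibrium (high r^2), typing one of them is enough to predict the other,
# so a whole LD block can be "tagged" by a single representative SNP.
import numpy as np

def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts."""
    r = np.corrcoef(snp_a, snp_b)[0, 1]
    return r ** 2

def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tag SNPs greedily so that every SNP has r^2 >= threshold
    with at least one chosen tag (a toy version of real tag selection)."""
    n_snps = genotypes.shape[1]
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # pick the SNP that tags the most remaining SNPs
        best, best_covered = None, set()
        for i in untagged:
            covered = {j for j in untagged
                       if r_squared(genotypes[:, i], genotypes[:, j]) >= threshold}
            if len(covered) > len(best_covered):
                best, best_covered = i, covered
        tags.append(best)
        untagged -= best_covered
    return tags

# toy data: 6 individuals x 4 SNPs; SNPs 0-2 sit on the same haplotype block
genotypes = np.array([
    [0, 0, 0, 2],
    [1, 1, 1, 0],
    [2, 2, 2, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 2],
    [2, 2, 2, 0],
])
print(greedy_tag_snps(genotypes))  # e.g. [0, 3]: one tag per block is enough

In this toy data SNPs 0-2 always travel together, so one tag plus the independent fourth SNP recovers all the information, which is exactly the saving that made typing only tag SNPs so attractive back then.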

I know things are much easier now: several databases already hold plenty of data (I even remember when we had to pay yearly to access Celera's database, as they had a different source of variation that was sometimes needed to complete a genotyping design, until they "released it to the world" by bulk-loading it into dbSNP), the technologies are able to genotype quickly and relatively cheaply, and next-generation sequencing is getting close to replacing those genotyping techniques when regions need to be deeply characterized or when those regions are too large.

The spirit of the original HapMap was somehow resurrected a couple of years ago: it is called the 1000 Genomes Project. The goal of both projects is the same (finding human variation), but the first was aware of the technology's limitations and tried to deal with them very intelligently, while the second knows that ultra-high-throughput sequencing will soon deliver millions of variants that need to be characterized quickly and easily, so it also aims for rare variants by analyzing thousands of samples, trying to end up with solid statistics that can be used as a reference.

1

I started doing bioinformatics when HapMap was born, so I wanted to share some of my thoughts here (sorry for the huge answer). Regarding your exact question: HapMap does not intend to characterize more than what was typed, and those were common variants with known positions; because they have been studied so deeply, such an estimation can be made for them. Rare variants or individual mutations are far less predictable, which is why the estimate for functional locations can't include that remaining 5%. I hope I didn't bother you too much, and that I shed some light on the issue ;)

0

Thank you, Jorge Amigo, for that excellent summary. Some of the interesting papers are linked here: http://bit.ly/gciL9Y

0

Jorge, thank you for your great answer. I feel very bad now, but I asked the question wrong :(. I meant to say the 1000 Genomes Project in the title but wrote HapMap by mistake. I was referring to the recently published paper on human variation predictions. I thought it would be easier to open a new question rather than confuse this one. However, your answer on HapMap was really helpful in providing some context and history for the past 10 years.
