Human Variation Predictions From 1000 Genomes
14.7 years ago
User 6659 ▴ 990

Hi

I recently asked this question but got the title wrong and sent the respondents in totally the wrong direction. I apologise, and feel it would be easiest to open another question and ask it properly.

In this paper, the 1000 Genomes Project works out predictions for human variation, such as the finding that people carry on average 300-400 loss-of-function variants. I have read the paper but don't understand all of the biological methods, hence my question here. I am assuming that they predicted the number of loss-of-function variants (using in silico tools) for each individual and took the average. Given that they assumed they had captured 95% of the common variants, could they have adjusted this average to what it would be if they had found 100% of the common variants?

If this is correct (which I very much doubt), then it seems they are underpredicting the extent of variation, as I have read that uncommon variants far outnumber common ones. So, for example (and I'm making this figure up), 95% of common variation could be 10% of the total variation, and individuals could differ vastly in the amount of variation they carry, making a prediction quite arbitrary.

thanks

genome variation prediction • 5.3k views

also, why don't you register to the forum? :-)


i am registered aren't i?

14.7 years ago
Dgmacarthur ▴ 310

Hey Daniel,

I helped to coordinate the analysis of loss-of-function (LOF) variants in the Project (although loads of other people were heavily involved in this work, especially Suganthi Balasubramanian at Yale).

Firstly, you're correct that the project under-sampled LOF variants at the low end of the frequency spectrum; and because LOF variants are highly enriched for low-frequency variants (because many of them are evolutionarily deleterious) we clearly missed a substantial fraction of them. We're currently in the process of performing the same analysis on other data-sets with better ascertainment, and it's clear that there are huge numbers of low-frequency LOF variants out there.

However, there are two other factors in play here. The first is that reduced ascertainment of low-frequency variants has a surprisingly small effect on the number of variants seen per individual. This is because the vast majority of the variants seen in any given individual are actually common. Thus, while poor capture of low-frequency variants will have a big effect on the total number of variants you find in a cohort as a whole, it has a disproportionately small effect on per-individual estimates. (This effect will be larger for LOF variants than other classes of variation, but it's still small, and in fact it's outweighed by the second factor in the opposite direction described below.)
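To make the cohort versus per-individual distinction concrete, here is a minimal sketch in Python. It is not the project's analysis: the site-frequency spectrum is invented, and the carrier probability assumes Hardy-Weinberg proportions in a diploid individual. The names `TOY_SPECTRUM`, `carrier_probability`, and `rare_share` are hypothetical.

```python
# Toy site-frequency spectrum: (allele frequency, number of variant sites).
# All numbers are made up for illustration; they are not 1000 Genomes data.
TOY_SPECTRUM = [
    (0.001, 5_000_000),  # rare sites are the most numerous
    (0.01, 2_000_000),
    (0.10, 1_500_000),
    (0.30, 1_500_000),   # common sites are fewer
]


def carrier_probability(freq):
    """Chance that a diploid individual carries at least one copy of the
    variant allele, assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - freq) ** 2


def rare_share(spectrum, rare_cutoff=0.05):
    """Return (share of cohort sites that are rare,
    share of one individual's expected variants that are rare)."""
    total_sites = sum(n for _, n in spectrum)
    rare_sites = sum(n for f, n in spectrum if f < rare_cutoff)
    per_ind_total = sum(n * carrier_probability(f) for f, n in spectrum)
    per_ind_rare = sum(
        n * carrier_probability(f) for f, n in spectrum if f < rare_cutoff
    )
    return rare_sites / total_sites, per_ind_rare / per_ind_total


cohort_share, individual_share = rare_share(TOY_SPECTRUM)
print(f"rare share of cohort sites:       {cohort_share:.2f}")
print(f"rare share of one person's sites: {individual_share:.3f}")
```

With these invented numbers, rare sites make up 70% of the sites in the cohort but under 5% of the variants expected in any one person, so missing many rare sites changes the cohort total far more than the per-individual count.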

The second factor to bear in mind is that LOF variants are highly, highly enriched for false positives of all sorts. The reason for this is pretty straightforward: because most LOF variants are deleterious, the level of true polymorphism at these sites is low; however, the level of error (from both sequencing and annotation artefacts) is randomly distributed across the genome. That means that we tend to observe much less variation at LOF sites than the genomic average (as expected), but the variation we do see is enriched for error.
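The enrichment argument can be written as a one-line calculation. The sketch below is an illustration under simplifying assumptions, not the follow-up study's method: each site is assumed to produce a true variant call at rate `true_rate` and a spurious call at rate `error_rate`, and the rates used are invented. The function name `false_positive_fraction` is hypothetical.

```python
def false_positive_fraction(true_rate, error_rate):
    """Expected share of observed variant calls that are errors, given the
    per-site rate of true variants and the per-site rate of spurious calls."""
    return error_rate / (true_rate + error_rate)


# Invented rates: the error rate is the same everywhere, but true
# polymorphism is much rarer at LOF sites because such variants are
# usually deleterious.
ERROR_RATE = 1e-5
GENOME_TRUE_RATE = 1e-3
LOF_TRUE_RATE = 1e-5

print("genome-wide:", false_positive_fraction(GENOME_TRUE_RATE, ERROR_RATE))
print("LOF sites:  ", false_positive_fraction(LOF_TRUE_RATE, ERROR_RATE))
```

The same error rate gives roughly a 1% false positive share genome-wide but a 50% share at LOF sites in this toy setting, purely because the true signal is so much smaller there.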

We've almost completed a follow-up study of all of the LOF variants identified by 1KG, looking at both sequencing and annotation error, and the false positive rate is indeed very high (more details to come once the article has been submitted). For that reason we didn't spend too much time trying to tweak our estimates of the number of LOF variants per individual in the 1KG pilot paper; we knew they were going to be wrong. I can tell you that we do now have a reasonably good estimate of the real number of LOF variants in a typical human genome, although you'll have to wait a little while longer to find out exactly what it is. :-)


Thanks for your reply. How did you predict the loss-of-function variants? Did you use in silico tools? In the human data I have, the number of variants in each category almost always differs from the 1000 Genomes data by the same factor.


Thanks, Dan, for checking out BioStar and sorry about the earlier mistake I had in your name. The middle initial threw me off...


now that is what I call a great answer: getting the researcher that performed the analysis to BioStar and posting his impressions! thank you Dr. MacArthur for having enlightened us with your experience.


unknown (google) - the LOF variants were called by aligning the variants called by the project against the Gencode gene annotation set. We basically regarded all variants predicted to result in a premature stop codon, or disrupt a splice site, or create a frameshift, as potential LOF.
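The filter described above can be sketched in a few lines of Python. This is an illustration only, not the project's pipeline: the consequence labels, the annotation tuples, and the names `LOF_CONSEQUENCES`, `potential_lof`, and `count_lof` are all hypothetical, and real annotation tools use their own vocabularies. The sketch reports two tallies, one per distinct variant and one per variant-transcript annotation, because a variant that hits several transcripts of a gene can be counted either way and the post does not say which convention was used.

```python
# Hypothetical labels for the three classes named above:
# premature stop codon, disrupted splice site, frameshift.
LOF_CONSEQUENCES = {"stop_gained", "splice_site_disrupted", "frameshift"}

# Made-up annotations: (variant_id, transcript_id, predicted consequence).
ANNOTATIONS = [
    ("var1", "tx1", "stop_gained"),
    ("var1", "tx2", "stop_gained"),   # same variant, second transcript
    ("var2", "tx1", "synonymous"),
    ("var3", "tx3", "frameshift"),
    ("var4", "tx4", "missense"),
    ("var5", "tx5", "splice_site_disrupted"),
]


def potential_lof(annotations):
    """Keep only annotations whose consequence is in the LOF set."""
    return [a for a in annotations if a[2] in LOF_CONSEQUENCES]


def count_lof(annotations):
    """Return (distinct LOF variants, LOF variant-transcript annotations)."""
    hits = potential_lof(annotations)
    distinct_variants = {variant for variant, _, _ in hits}
    return len(distinct_variants), len(hits)


per_variant, per_annotation = count_lof(ANNOTATIONS)
print(per_variant, per_annotation)  # 3 4
```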


thanks a lot - let's say a SNP has a non-synonymous effect on 4 transcripts of the same gene, do you count that as 1 or 4 non-synonymous variants?

14.7 years ago

just focusing on the abstract, there are a couple of sentences which state the basis of their study, and those should be properly understood before going through the whole paper:

1a - "we have catalogued the vast majority of common variation": they were in fact able to find almost all the previously known variation in their samples (dbSNP in other words, which was considered to be the best resource characterizing common variation when the project started).

1b - "over 95% of the currently accessible variants found in any individual are present in this data set": this figure quantifies what "vast majority" meant in the previous sentence, plus an estimate of what any one of us is expected to carry in our genome. so they are stating that each of us basically carries 95% common plus 5% particular variation.

2 - "each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders": these are very interesting figures, as they estimate averages to be used as references in forthcoming whole genome sequencing studies, and they are based not only on that 95% common variation, but on the 5% particular variation too. considering that these figures came out of pilot 3 (~700 samples, whole exome sequencing at relatively high coverage), the conclusions are at least very credible.

I think I understand why you are getting the wrong impression from these figures. when you say "95% of common variations could be 10% of the total variation", even if you made that figure up, you are probably thinking of variation that is very common within certain populations. when they use the term "common variation" they mean that a variant site is found at a significant allele frequency in the studied samples, i.e. often enough not to be considered particular to one individual. (it may be worth reviewing the definition of SNP here, a term they tend to omit throughout the paper in favour of "variant" or "variant site", as the latter does not imply any assumption about population frequency.) so I can almost assure you that you won't be making any mistake if you use those numbers (250-300 loss-of-function variants, 50-100 disease-associated variants) as general figures of reference, as all the variants covered by the project (common and non-common) were used to build them.

as a final note, let me just point out that when they started this project dbSNP was at build 129, which described ~11M SNPs and was considered the best repository of common variants. with this project they not only confirmed that, but were also able to find new common variants, which of course were loaded into dbSNP. dbSNP is currently at build 132, which describes ~30M SNPs, as it now includes non-common or "rare variants" (the terms "rare" and even "mutation" are usually avoided, since they are often used as misleading synonyms of "disease-associated variant", which may not always be the case) as well as all the variants found by the 1000 Genomes Project which, by frequency, also had to be considered common.


I like how you take each point and address it for the one who posed the question.


you're very welcome. I've been dealing with human variation since bioinformatics started dealing with it (~2002), and although that may not make me an expert in this field (some of my lab colleagues have been working on human variation for decades), I love sharing my thoughts with other bioinformaticians working on research lines similar to mine. I guess that if I can't have a face-to-face chat with the question posters, the least I can do is try to explain myself as concisely as possible, even if the answer ends up too long.


thanks again for a great answer. I'm still not really with you though. Are you saying that they are saying that the average human has 95% common variation and 5% rare variation? Because I read it as saying '95% of the known common variation', and we do not know how much of the total variation the known 'common' variation represents.


let me try to explain it another way: they state that ~95% of ALL THE VARIANT SITES THEY FOUND (~27M sites) were shared among individuals (of course with their particular haplotypes, i.e. combinations of alleles at those sites), and only ~5% of ALL THE VARIANT SITES THEY FOUND (~1M sites) were particular to single individuals. this means that each individual had ~1K variant sites that were completely particular, and the rest were common. does this sound better now?


I read it totally wrong. I took the term "currently accessible variants" to mean the variants currently known in dbSNP!


note that when the paper refers to dbSNP, unless otherwise specified, it means build 129. considering that their sequencing effort has almost tripled what that build contained, they are in fact building a reference variation resource from scratch, which has been progressively loaded into dbSNP builds up to the current build 132.

14.7 years ago

Rather than trying to figure this out on your own and possibly making a false assumption, I'd suggest that you contact Daniel McArthur, the guy who did this work. He's friendly and communicative and I'm sure he'd address your email. He presented this at ASHG (Am Soc Human Genetics) annual meeting in November 2010. The talk is available on-line but you'll need to find a meeting participant in order to get access to his talk.


Minor correction, I think you mean Daniel MacArthur.


Nevertheless, it would be interesting to see the answer to this question. The great advantage of forums like BioStar is that the discussion is public, so if someone asks a good question, anyone else interested can follow the discussion and see the answer.


Exactly. If I cannot help you, I would hope to be able to point you to someone who knows more or is better able to get the answers you need to accelerate your research.
