Question: Extract Individual Genotypes From 1000 Genomes When SNP Not In VCF
1
gravatar for David Quigley
7.8 years ago by
David Quigley11k
San Francisco
David Quigley (11k) wrote:

I would like individual-level genotypes for a SNP that appears in the 1000 Genomes browser and dbSNP. When I pull the VCF for the region using the Data Slicer, I get calls for SNPs around my mystery SNP but not for the SNP itself. Pulling down the VCF and searching with tabix gives the same result. It's not clear to me why this SNP doesn't have individual calls. Is the most reasonable way forward to pull down the region around the SNP from the source BAM files and call genotypes with mpileup? If so, is there a better way to do this than manually scripting it out? Thanks.
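
For reference, the "is the site in the slice at all" check described above can be sketched in a few lines of Python once the region has been saved as plain-text VCF. This is only a minimal sketch: the function name `genotypes_at`, the sample IDs, the rs IDs, and the records are all made up for illustration and are not taken from the 1000 Genomes data.

```python
def genotypes_at(vcf_lines, chrom, pos):
    """Return {sample: GT} for the record at chrom:pos, or None if absent.

    Assumes uncompressed, tab-delimited VCF text with a #CHROM header line.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        if fields[0] == chrom and int(fields[1]) == pos:
            gt_idx = fields[8].split(":").index("GT")
            return {s: call.split(":")[gt_idx]
                    for s, call in zip(samples, fields[9:])}
    return None  # no record at this position


# Hypothetical two-sample slice, for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002",
    "20\t1000\trs0000001\tA\tG\t100\tPASS\t.\tGT\t0|1\t1|1",
    "20\t1200\trs0000002\tC\tT\t100\tPASS\t.\tGT\t0|0\t0|1",
]

print(genotypes_at(EXAMPLE_VCF, "20", 1000))  # site present
print(genotypes_at(EXAMPLE_VCF, "20", 1100))  # site absent, prints None
```

A `None` result only says the site is missing from that particular file; it does not say why it is missing.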

1000genomes • 3.7k views
modified 7.8 years ago by Adam (1.0k) • written 7.8 years ago by David Quigley (11k)

The most likely explanation is that your SNP is in dbSNP but not in 1000 Genomes. dbSNP contains many false-positive SNPs, so maybe they removed it because they didn't find it in the 1000 Genomes data.

written 7.8 years ago by Giovanni M Dall'Olio (27k)

An equally likely cause is that 1000g missed it. SNP calling involves a whole series of filters, and we know that even common SNPs occasionally get filtered out.

written 7.8 years ago by lh3 (32k)

The unfiltered input call sets for phase1 can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/input_call_sets/

If your site is in this set and we filtered it out, we believe it is a false positive.

If your site isn't in this set, it might be real but rare, or it might be a false positive; you would have to assess the quality and source of your data to make that decision.

written 7.7 years ago by Laura (1.7k)
lh3 (32k) wrote:

My suggestion is to check the individual call sets first, which are available here. If you cannot find the SNP in any of them, it is likely to be a false one. One caveat is that most of these call sets do not provide accurate genotypes. When you can get genotype likelihoods in the GL or PL field, you can use beagle to impute genotypes.
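
To make the GL/PL point concrete: in the VCF format, GL holds log10-scaled genotype likelihoods and PL holds the same information phred-scaled and normalized so the best genotype is 0. For a biallelic diploid site the order is 0/0, 0/1, 1/1. The sketch below converts between the two and picks the per-sample maximum-likelihood genotype. It is not a substitute for beagle, which uses information across samples and neighboring sites; the function names and the example values are made up for illustration.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]  # biallelic diploid ordering used by GL and PL


def gl_to_pl(gl):
    """Convert log10 likelihoods (GL) to normalized phred-scaled values (PL)."""
    raw = [-10.0 * x for x in gl]
    best = min(raw)
    return [int(round(x - best)) for x in raw]


def gl_to_probabilities(gl):
    """Rescale likelihoods to sum to 1 (equivalent to assuming a flat prior)."""
    lik = [10.0 ** x for x in gl]
    total = sum(lik)
    return [x / total for x in lik]


def naive_genotype(gl):
    """Pick the genotype with the highest likelihood for one sample."""
    best_index = max(range(len(gl)), key=lambda i: gl[i])
    return GENOTYPES[best_index]


example_gl = [-0.03, -1.21, -5.0]  # hypothetical GL values for one sample
print(gl_to_pl(example_gl))
print(gl_to_probabilities(example_gl))
print(naive_genotype(example_gl))
```

With low-coverage data the likelihoods for a single sample are often close to each other, which is why a naive per-sample call like this tends to be inaccurate and imputation is preferred.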

written 7.8 years ago by lh3 (32k)

Thanks for the suggestion and link. Would that be the data in 20110512wgVQSRv2GLbeagle_genotypes?

written 7.8 years ago by David Quigley (11k)

First check whether the SNP is called in any of the call sets from 20110302_phase1_wg_snps. All the call sets are filtered, but if every call set has filtered out your SNP, it is likely to be a false one. Once you confirm the presence of the SNP, you can check 20110512. If it is not there, you will need to extract GLs from a 20110302 call set and run beagle yourself.
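
The order of checks above can be written down as a small decision helper. This is only a sketch of the reasoning, assuming the sites of each call set have already been loaded into Python sets of (chrom, pos) tuples; the call set names `center_A` and `center_B`, the positions, and both function names are hypothetical.

```python
def call_sets_with_site(call_sets, chrom, pos):
    """Return the sorted names of the call sets that report chrom:pos."""
    return sorted(name for name, sites in call_sets.items()
                  if (chrom, pos) in sites)


def next_step(found_in, in_genotype_release):
    """Suggest what to do next, following the order of checks described above."""
    if not found_in:
        return "likely false: no call set reports the site"
    if in_genotype_release:
        return "use the genotypes in the release"
    return "extract GLs from a call set and run beagle"


# Hypothetical call sets keyed by name, for illustration only.
call_sets = {
    "center_A": {("20", 1000), ("20", 1200)},
    "center_B": {("20", 1200)},
}

found = call_sets_with_site(call_sets, "20", 1000)
print(found, "->", next_step(found, in_genotype_release=False))
```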

modified 7.8 years ago • written 7.8 years ago by lh3 (32k)

Thanks again for the help. The SNP is not in those call sets. I now see that following the link back to 1000g from dbSNP puts the SNP on a track called "dbSNP submissions not present in 1000 Genomes". Presumably that means the SNP entered the 1000g browser from dbSNP, but there's no evidence for it in the 1000g data.

written 7.8 years ago by David Quigley (11k)
Adam (1.0k) wrote:

Another possibility is that this SNP was called in the early phases of the 1000G, but removed in later phases as calling methods improved. Some of those old SNPs might be in older versions of dbSNP, which could cause confusion.

written 7.8 years ago by Adam (1.0k)