Question: Counting SNPs from 1000 Genomes Data
Asked 7.4 years ago by Matt W:

I am looking at the 1000 Genomes data found here:

And I am trying to count the number of SNPs for CEU individuals that exist within this list:

Each time I try, my count is almost double what is expected. I used vcftools:

vcftools --vcf CEU.exon.2010_03.genotypes.vcf --keep keep.txt --out vcfoutput/CEU_targets --freq --recode

Here keep.txt contains the list of individuals from the pastebin link above.

I then counted the lines in the recoded file, because each line should represent one SNP. The file has 3489 lines excluding the header, but according to the paper I am referencing (Table 2), there should be only 826 SNPs shared between this data and the HapMap data. Why are my numbers so much higher?
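The line count described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_vcf_records` is made up here, and the file path in the usage comment assumes the `.recode.vcf` suffix that vcftools appends to the `--out` prefix. Note that it counts every data record, so any non-SNP records in the file would be included.

```python
def count_vcf_records(lines):
    """Count VCF data records (one per variant site).

    Header lines start with '#' ('##' meta lines and the '#CHROM' column
    line) and are skipped, as are blank lines.
    """
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# Usage with a file on disk (path assumed from the --out prefix above):
# with open("vcfoutput/CEU_targets.recode.vcf") as handle:
#     print(count_vcf_records(handle))
```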

Thanks in advance! -Matt

EDIT: I am counting the SNPs correctly, but I don't know how to restrict the count to my region of interest (ROI). The paper states "we restricted the analysis to the 470 kb of sequence that overlapped with the exon capture boundaries of the 1000 Genomes pilot project". I'm not sure how to do this, so if anyone has some insight, it would be greatly appreciated!
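One possible way to apply such a restriction, assuming the exon capture boundaries are available as a BED file of target intervals (an assumption; the thread does not say where they come from), is to keep only the VCF sites whose position falls inside an interval. The sketch below uses hypothetical helper names (`load_bed`, `count_sites_in_regions`). The vcftools manual also lists a `--bed` option that filters sites by a BED file, which may be the more direct route.

```python
from collections import defaultdict


def load_bed(lines):
    """Read BED intervals into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    Lines without numeric start/end columns (headers, track lines) are skipped.
    """
    regions = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        chrom, start, end = fields[0], fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            continue
        regions[chrom].append((int(start), int(end)))
    return regions


def count_sites_in_regions(vcf_lines, regions):
    """Count VCF records whose 1-based POS lies inside any BED interval."""
    count = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        # 1-based POS p is inside the 0-based interval [start, end)
        # exactly when start < p <= end.
        if any(start < pos <= end for start, end in regions.get(chrom, ())):
            count += 1
    return count
```

Chromosome names must match between the two files (for example "1" versus "chr1"), otherwise no site will be counted.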

Answered 7.4 years ago by Adam (United States):

You might want to add the --maf 0.000001 option to your command in order to remove SNPs that are not polymorphic in your sample of individuals.
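To illustrate what this filter removes, here is a rough Python sketch that counts only the sites at which the kept individuals carry more than one distinct allele. It is not how vcftools computes allele frequencies internally; the function name `count_polymorphic_sites` is hypothetical, and `keep_samples` stands in for the sample IDs listed in keep.txt.

```python
import re


def count_polymorphic_sites(vcf_lines, keep_samples):
    """Count sites where the kept samples show at least two distinct alleles."""
    keep = set(keep_samples)
    sample_columns = []
    count = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start after the nine fixed VCF columns.
            sample_columns = [
                i for i, name in enumerate(fields) if i >= 9 and name in keep
            ]
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        alleles = set()
        for i in sample_columns:
            parts = fields[i].split(":")
            if gt_index >= len(parts):
                continue
            # Genotypes look like 0/1 or 0|1; '.' marks a missing allele.
            alleles.update(a for a in re.split(r"[/|]", parts[gt_index]) if a != ".")
        if len(alleles) > 1:
            count += 1
    return count
```

A site where every kept individual is homozygous for the same allele, whether reference or alternate, has a minor allele frequency of zero in that subset and is not counted.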




Reply by Matt W, 7.4 years ago:

That cut it down to 2274, but that is still significantly higher than expected. Thanks for your help. Any other thoughts?