Question: Counting SNPs From 1000 Genomes Data
Matt W wrote (6.7 years ago):

I am looking at the 1000 Genomes pilot data found here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/exon/snps/

I am trying to count the number of SNPs among the CEU individuals in this list: http://pastebin.com/JUwNLh9E

Each time I try, my count is almost double what is expected. I used vcftools:

vcftools --vcf CEU.exon.2010_03.genotypes.vcf --keep keep.txt --out vcfoutput/CEU_targets --freq --recode

Where keep.txt has the list in pastebin.

I then counted the lines in the recoded file, since each non-header line should represent one SNP. It has 3489 lines excluding the header, but according to the paper I am referencing (Table 2) there should be only 826 between these data and the HapMap data. Why is my count so much higher?
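For reference, the line count described above can be sketched in Python. This is a minimal illustration, not part of the original workflow: the function name `count_vcf_records` and the two-record example are made up, and it only assumes the standard VCF convention that header lines start with `#`.

```python
import io

def count_vcf_records(lines):
    """Count data lines (one per site) in an iterable of VCF lines.

    Header lines start with '#'; blank lines are ignored.
    """
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))

# Tiny made-up VCF: two header lines followed by two site records.
example = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\trs1\tA\tG\t.\tPASS\t.\n"
    "1\t2000\trs2\tC\tT\t.\tPASS\t.\n"
)
n_records = count_vcf_records(example)
print(n_records)  # 2
```

The same function can be applied to an open file handle on the recoded VCF.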

Thanks in advance! -Matt

EDIT: I am counting the SNPs correctly, but I don't know how to restrict the count to my region of interest (ROI). The paper states "we restricted the analysis to the 470 kb of sequence that overlapped with the exon capture boundaries of the 1000 Genomes pilot project". I'm not sure how to do this, so if anyone has some insight, it would be greatly appreciated!
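One possible way to apply such a restriction, assuming the exon capture target coordinates can be obtained as a BED file (none is given in this thread), is to keep only sites that fall inside those intervals. vcftools offers a `--bed` option for this kind of filtering. The Python sketch below shows the underlying logic on made-up data; the helper names and the example intervals are hypothetical. It assumes BED intervals are 0-based and half-open while VCF positions are 1-based.

```python
from collections import defaultdict

def parse_bed(lines):
    """Return {chrom: [(start, end), ...]} from BED lines (0-based, half-open)."""
    targets = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip blank, header, or track lines
        targets[fields[0]].append((int(fields[1]), int(fields[2])))
    return targets

def in_targets(chrom, pos, targets):
    """True if a 1-based VCF position falls inside any target interval."""
    return any(start < pos <= end for start, end in targets.get(chrom, []))

def count_sites_in_targets(vcf_lines, targets):
    """Count VCF data lines whose CHROM/POS lies within the target intervals."""
    count = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        if in_targets(chrom, int(pos), targets):
            count += 1
    return count

# Made-up intervals and sites, for illustration only.
bed = ["chrom\tchromStart\tchromEnd", "1\t999\t1500", "2\t0\t100"]
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t.\tPASS\t.",  # first base of interval 1:999-1500
    "1\t1501\t.\tC\tT\t.\tPASS\t.",  # one base past the interval end
    "2\t100\t.\tG\tA\t.\tPASS\t.",   # last base of interval 2:0-100
]
targets = parse_bed(bed)
n_in_targets = count_sites_in_targets(vcf, targets)
print(n_in_targets)  # 2
```

A linear scan over intervals is enough for a few hundred kb of targets. Note that chromosome names must match between the two files (for example "1" versus "chr1"), or no site will be counted.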

Adam wrote (6.7 years ago):
You might want to add the --maf 0.000001 option to your command in order to remove SNPs that are not polymorphic in your sample of individuals.

Regards,

Adam
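To illustrate what the `--maf` suggestion does, here is a rough Python sketch of checking whether a site is still polymorphic once only the kept individuals are considered. It is an assumption-laden illustration rather than vcftools' actual implementation: it handles biallelic sites only, the sample IDs `NA1` to `NA3` and the function names are hypothetical, and the threshold simply mirrors the value suggested above.

```python
def minor_allele_freq(genotypes):
    """MAF at a biallelic site from GT strings such as '0/1' or '1|1'.

    Missing alleles ('.') are ignored; returns 0.0 if nothing was called.
    """
    alleles = []
    for gt in genotypes:
        gt_field = gt.split(":")[0]  # GT is the first FORMAT subfield
        alleles.extend(a for a in gt_field.replace("|", "/").split("/") if a != ".")
    if not alleles:
        return 0.0
    alt_freq = sum(1 for a in alleles if a != "0") / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)

def is_polymorphic(record_line, header_line, keep_ids, min_maf=0.000001):
    """True if the site's MAF among the kept samples is at least min_maf."""
    samples = header_line.rstrip("\n").split("\t")[9:]  # sample columns follow FORMAT
    fields = record_line.rstrip("\n").split("\t")
    kept = [gt for sample, gt in zip(samples, fields[9:]) if sample in keep_ids]
    return minor_allele_freq(kept) >= min_maf

# Made-up header and record: only NA3 carries the alternate allele.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA1\tNA2\tNA3"
record = "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/0\t0/1"

subset_result = is_polymorphic(record, header, {"NA1", "NA2"})
full_result = is_polymorphic(record, header, {"NA1", "NA2", "NA3"})
print(subset_result, full_result)  # False True
```

The example shows why a site present in the full VCF can drop out after `--keep`: the only carrier of the alternate allele may not be in the kept list.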


That cut it down to 2274, but this is still significantly higher than the expected 826. Thanks for your help. Any other thoughts?

Reply written 6.7 years ago by Matt W