Question: How Can I Count Snps In My Vcf Files?
gravatar for Matt W
6.7 years ago by
Matt W240
Matt W240 wrote:

I have had some trouble trying to figure this out for awhile now, so I was hoping someone could help. I'm trying to count the number of SNPs in my VCF files. I have 22 different VCF files, one for per chromosome, and I extracted different regions of interest from each using:

vcftools --vcf SNP_chr1_EXOME.vcf --chr 1 --from-bp 11531680 --to-bp 11631919 --out vcfoutput/chr1_1_targets --freq --recode

I did this multiple times and generated a handful of chr1_num_targets.recode.vcf files.

I then did:

vcf-concat *.recode.vcf > combined.vcf
vcf-stats combined.vcf > stats.txt
grep 'snp_count' stats.txt > snp_count.txt
awk '{sum+=$1} END {print sum}' snp_count.txt
$ 12781

This gave me 12781 SNPs after counting. There should only be 4817 SNPs as verified by previous work.

So I think I'm doing something completely wrong, and I was hoping someone could help. Is it because I concatenated the VCF files? I want to count the SNPs across all of the regions that were extracted, so I thought combining them would make life easier.

I was also hoping someone could help me understand how SNPs are stored within a VCF file, and maybe then I will be able to count them. Right now I'm relying on the built in vcf-stats which is simply a black box to me.

Thanks in advance!

vcf snp awk • 18k views
ADD COMMENTlink modified 3.6 years ago by ravi.uhdnis150 • written 6.7 years ago by Matt W240
gravatar for Matt Shirley
6.7 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

According to the VCF 4.1 spec, each line (excluding headers that start with #) in a VCF file contains one "snp" that you could thing of as being analogous to an RSID from a conventional genotype microarray. vcf-stats is reporting more SNPs than you expect because it is telling you how many SNPs there are for each individual in your file. What you probably want is to count the number of lines in your VCF:

grep -v # combined.vcf | wc -l

This will count the lines that are not header information, giving you a count of the number of position-specific SNPs in your sample file. To answer your question about how genotypes are stored in VCF, imaging you have information about genotype (GT), a quality score (GQ), and the read depth at the position of the call (DP). Your genotype field for that sample in your VCF would be in one of the rightmost columns under "sample name":

REF=A ALT=G # below, reference allele will be 0 and alt will be 1
0/1:40:30 # this would be a heterozygous unphased site with phred quality 40 and read depth 30
0/0:40:30 # homozygous reference
0|0:40:30 # phased alleles

EDIT: Actually, I think I spotted your mistake. You should be using vcf-merge instead of concat. Merge will take the files you have generated from each individual, and then merge these genotypes into one file, instead of "pasting" each file to the next one. A command would look like:

vcf-merge -d *.recode.vcf > merged.recode.vcf

Using the -d flag tells vcf-merge not to print and duplicate (same chromosome and position) SNPs to the output file.

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Matt Shirley8.9k

Thanks for the response. That makes a lot more sense. After excluding the headers, my combined file is only 268 lines, so something else must be wrong. What about vcf-concat? Will that combine SNPs? I'm not sure whether I should use concat or merge. Thanks again for your help!

ADD REPLYlink written 6.7 years ago by Matt W240

You should use merge, and I have updated my answer accordingly.

ADD REPLYlink written 6.7 years ago by Matt Shirley8.9k
gravatar for Pierre Lindenbaum
6.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

Using some classical unix tools :

rm -f all.vcf
egrep -v "^#" SNP_chr1_EXOME.vcf |
awk -F ' ' ($1=="1" && int($2)>=11531680 && int($2)<=11631919)'  >> all.vcf

(...same for the other regions... )

cut -d ' ' -f1,2,4,5 < all.vcf |
sort -t ' ' -k1,1 -k2,2 -k3,3 -k4,4 |
uniq |
wc -l

here two SNPs are the same if they have the same CHROM/POS/REF/ALT

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Pierre Lindenbaum118k

Matt, I would use your vcftools solution and then go for Pierre's cut/sort/uniq/wc pipeline. Having said that, I think a plain call to sort without parameters is sufficient above and since VCFs are tab-delimited, it should be alright to call cut without -d ' ' parameter too.

ADD REPLYlink written 6.7 years ago by Joachim2.8k

you might also use sort with '-u' (unique) and -i (case insensitive) if some lower/upper letters have been used for REF/ALT

ADD REPLYlink written 6.7 years ago by Pierre Lindenbaum118k

Thanks! All of these comments and suggestions have definitely helped. I think I worked it out. Thanks again.

ADD REPLYlink written 6.7 years ago by Matt W240
gravatar for ravi.uhdnis
3.6 years ago by
United States
ravi.uhdnis150 wrote:

Hi, can somebody add comments for knowing "Novel SNPs" in a .vcf file ?. Many Thanks.

ADD COMMENTlink written 3.6 years ago by ravi.uhdnis150

You should post this as a separate question seeing as it has nothing to do with counting SNPs, and also you posted this as an answer which it isn't. I would suggest you post a new question, that works a lot better because this post already has 3 answers so most people will probably ignore it 

ADD REPLYlink written 3.5 years ago by Lesley Sitter460

Alright, i agree with your comment but since my original question was almost similar as "Method to count total number of SNPs and Novel SNP in a VCF file" so i though that most of the peoples will give me link of this post only so i asked the second half of my question on this post. Anyway, i'll post the second half as a new question. Thanks.

ADD REPLYlink written 3.5 years ago by ravi.uhdnis150

That's not the same question, and even if the question was already posted it's always good to get a fresh answer.

ADD REPLYlink written 3.5 years ago by Pierre Lindenbaum118k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1225 users visited in the last hour