How Can I Count Snps In My Vcf Files?
3
6
Entering edit mode
9.2 years ago
Matt W ▴ 250

I have had some trouble trying to figure this out for awhile now, so I was hoping someone could help. I'm trying to count the number of SNPs in my VCF files. I have 22 different VCF files, one for per chromosome, and I extracted different regions of interest from each using:

vcftools --vcf SNP_chr1_EXOME.vcf --chr 1 --from-bp 11531680 --to-bp 11631919 --out vcfoutput/chr1_1_targets --freq --recode

I did this multiple times and generated a handful of chr1_num_targets.recode.vcf files.

I then did:

vcf-concat *.recode.vcf > combined.vcf
vcf-stats combined.vcf > stats.txt
grep 'snp_count' stats.txt > snp_count.txt
awk '{sum+=$1} END {print sum}' snp_count.txt
$ 12781

This gave me 12781 SNPs after counting. There should only be 4817 SNPs as verified by previous work.

So I think I'm doing something completely wrong, and I was hoping someone could help. Is it because I concatenated the VCF files? I want to count the SNPs across all of the regions that were extracted, so I thought combining them would make life easier.

I was also hoping someone could help me understand how SNPs are stored within a VCF file, and maybe then I will be able to count them. Right now I'm relying on the built in vcf-stats which is simply a black box to me.

Thanks in advance!

vcf snp awk • 26k views
ADD COMMENT
15
Entering edit mode
9.2 years ago

According to the VCF 4.1 spec, each line (excluding headers that start with #) in a VCF file contains one "snp" that you could thing of as being analogous to an RSID from a conventional genotype microarray. vcf-stats is reporting more SNPs than you expect because it is telling you how many SNPs there are for each individual in your file. What you probably want is to count the number of lines in your VCF:

grep -v # combined.vcf | wc -l

This will count the lines that are not header information, giving you a count of the number of position-specific SNPs in your sample file. To answer your question about how genotypes are stored in VCF, imaging you have information about genotype (GT), a quality score (GQ), and the read depth at the position of the call (DP). Your genotype field for that sample in your VCF would be in one of the rightmost columns under "sample name":

REF=A ALT=G # below, reference allele will be 0 and alt will be 1
GT:GQ:DP
0/1:40:30 # this would be a heterozygous unphased site with phred quality 40 and read depth 30
0/0:40:30 # homozygous reference
0|0:40:30 # phased alleles

EDIT: Actually, I think I spotted your mistake. You should be using vcf-merge instead of concat. Merge will take the files you have generated from each individual, and then merge these genotypes into one file, instead of "pasting" each file to the next one. A command would look like:

vcf-merge -d *.recode.vcf > merged.recode.vcf

Using the -d flag tells vcf-merge not to print and duplicate (same chromosome and position) SNPs to the output file.

ADD COMMENT
0
Entering edit mode

Thanks for the response. That makes a lot more sense. After excluding the headers, my combined file is only 268 lines, so something else must be wrong. What about vcf-concat? Will that combine SNPs? I'm not sure whether I should use concat or merge. Thanks again for your help!

ADD REPLY
0
Entering edit mode

You should use merge, and I have updated my answer accordingly.

ADD REPLY
8
Entering edit mode
9.2 years ago

Using some classical unix tools :

rm -f all.vcf
egrep -v "^#" SNP_chr1_EXOME.vcf |
awk -F ' ' ($1=="1" && int($2)>=11531680 && int($2)<=11631919)'  >> all.vcf

(...same for the other regions... )

cut -d ' ' -f1,2,4,5 < all.vcf |
sort -t ' ' -k1,1 -k2,2 -k3,3 -k4,4 |
uniq |
wc -l

here two SNPs are the same if they have the same CHROM/POS/REF/ALT

ADD COMMENT
0
Entering edit mode

Matt, I would use your vcftools solution and then go for Pierre's cut/sort/uniq/wc pipeline. Having said that, I think a plain call to sort without parameters is sufficient above and since VCFs are tab-delimited, it should be alright to call cut without -d ' ' parameter too.

ADD REPLY
0
Entering edit mode

you might also use sort with '-u' (unique) and -i (case insensitive) if some lower/upper letters have been used for REF/ALT

ADD REPLY
0
Entering edit mode

Thanks! All of these comments and suggestions have definitely helped. I think I worked it out. Thanks again.

ADD REPLY
0
Entering edit mode
6.0 years ago
ravi.uhdnis ▴ 180

Hi, can somebody add comments for knowing "Novel SNPs" in a .vcf file ?. Many Thanks.

ADD COMMENT
2
Entering edit mode

You should post this as a separate question seeing as it has nothing to do with counting SNPs, and also you posted this as an answer which it isn't. I would suggest you post a new question, that works a lot better because this post already has 3 answers so most people will probably ignore it 

ADD REPLY
0
Entering edit mode

Alright, i agree with your comment but since my original question was almost similar as "Method to count total number of SNPs and Novel SNP in a VCF file" so i though that most of the peoples will give me link of this post only so i asked the second half of my question on this post. Anyway, i'll post the second half as a new question. Thanks.

ADD REPLY
1
Entering edit mode

That's not the same question, and even if the question was already posted it's always good to get a fresh answer.

ADD REPLY

Login before adding your answer.

Traffic: 1259 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6