Question

How Can I Count Snps In My Vcf Files?

6

Entering edit mode

11.8 years ago

Matt W ▴ 250

I have had some trouble trying to figure this out for awhile now, so I was hoping someone could help. I'm trying to count the number of SNPs in my VCF files. I have 22 different VCF files, one for per chromosome, and I extracted different regions of interest from each using:

vcftools --vcf SNP_chr1_EXOME.vcf --chr 1 --from-bp 11531680 --to-bp 11631919 --out vcfoutput/chr1_1_targets --freq --recode

I did this multiple times and generated a handful of chr1_num_targets.recode.vcf files.

I then did:

vcf-concat *.recode.vcf > combined.vcf
vcf-stats combined.vcf > stats.txt
grep 'snp_count' stats.txt > snp_count.txt
awk '{sum+=$1} END {print sum}' snp_count.txt
$ 12781

This gave me 12781 SNPs after counting. There should only be 4817 SNPs as verified by previous work.

So I think I'm doing something completely wrong, and I was hoping someone could help. Is it because I concatenated the VCF files? I want to count the SNPs across all of the regions that were extracted, so I thought combining them would make life easier.

I was also hoping someone could help me understand how SNPs are stored within a VCF file, and maybe then I will be able to count them. Right now I'm relying on the built in vcf-stats which is simply a black box to me.

Thanks in advance!

vcf snp awk • 39k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 11.8 years ago by Matt W ▴ 250

0

Entering edit mode

Hi, can somebody add comments for knowing "Novel SNPs" in a .vcf file ?. Many Thanks.

ADD REPLY • link 8.6 years ago by ravi.uhdnis ▴ 220

2

Entering edit mode

You should post this as a separate question seeing as it has nothing to do with counting SNPs, and also you posted this as an answer which it isn't. I would suggest you post a new question, that works a lot better because this post already has 3 answers so most people will probably ignore it

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.6 years ago by Lesley Sitter ▴ 600

0

Entering edit mode

Alright, i agree with your comment but since my original question was almost similar as "Method to count total number of SNPs and Novel SNP in a VCF file" so i though that most of the peoples will give me link of this post only so i asked the second half of my question on this post. Anyway, i'll post the second half as a new question. Thanks.

ADD REPLY • link 8.6 years ago by ravi.uhdnis ▴ 220

1

Entering edit mode

That's not the same question, and even if the question was already posted it's always good to get a fresh answer.

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 161k

score 18 · Answer 1 · 2012-07-19

According to the VCF 4.1 spec, each line (excluding headers that start with #) in a VCF file contains one "snp" that you could thing of as being analogous to an RSID from a conventional genotype microarray. vcf-stats is reporting more SNPs than you expect because it is telling you how many SNPs there are for each individual in your file. What you probably want is to count the number of lines in your VCF:

grep -v # combined.vcf | wc -l

This will count the lines that are not header information, giving you a count of the number of position-specific SNPs in your sample file. To answer your question about how genotypes are stored in VCF, imaging you have information about genotype (GT), a quality score (GQ), and the read depth at the position of the call (DP). Your genotype field for that sample in your VCF would be in one of the rightmost columns under "sample name":

REF=A ALT=G # below, reference allele will be 0 and alt will be 1
GT:GQ:DP
0/1:40:30 # this would be a heterozygous unphased site with phred quality 40 and read depth 30
0/0:40:30 # homozygous reference
0|0:40:30 # phased alleles

EDIT: Actually, I think I spotted your mistake. You should be using vcf-merge instead of concat. Merge will take the files you have generated from each individual, and then merge these genotypes into one file, instead of "pasting" each file to the next one. A command would look like:

vcf-merge -d *.recode.vcf > merged.recode.vcf

Using the -d flag tells vcf-merge not to print and duplicate (same chromosome and position) SNPs to the output file.

score 8 · Answer 2 · 2012-07-19

8

Entering edit mode

11.8 years ago

Pierre Lindenbaum 161k

Using some classical unix tools :

rm -f all.vcf
egrep -v "^#" SNP_chr1_EXOME.vcf |
awk -F ' ' ($1=="1" && int($2)>=11531680 && int($2)<=11631919)'  >> all.vcf

(...same for the other regions... )

cut -d ' ' -f1,2,4,5 < all.vcf |
sort -t ' ' -k1,1 -k2,2 -k3,3 -k4,4 |
uniq |
wc -l

here two SNPs are the same if they have the same CHROM/POS/REF/ALT

ADD COMMENT • link 11.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Matt, I would use your vcftools solution and then go for Pierre's cut/sort/uniq/wc pipeline. Having said that, I think a plain call to sort without parameters is sufficient above and since VCFs are tab-delimited, it should be alright to call cut without -d ' ' parameter too.

ADD REPLY • link 11.8 years ago by Joachim ★ 2.9k

0

Entering edit mode

you might also use sort with '-u' (unique) and -i (case insensitive) if some lower/upper letters have been used for REF/ALT

ADD REPLY • link 11.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks! All of these comments and suggestions have definitely helped. I think I worked it out. Thanks again.

ADD REPLY • link 11.8 years ago by Matt W ▴ 250

score 0 · Answer 3 · 2022-03-08

import sys import os import datetime import logging import argparse python_lib = os.path.join(os.getenv("HOME"), "python_lib") if not python_lib in sys.path: sys.path.append(python_lib) from fs import FS

log_level = logging.DEBUG logging.basicConfig(filename=None, format='%(asctime)s [%(levelname)s]: %(message)s', datefmt='%Y-%m-%d %H:%M:%S', level=log_level)

def main(): fs = FS() args = get_arguments()

num_all_variants = 0
num_PASS_variants = 0
num_PASS_SNV_variants = 0
num_PASS_INS_variants = 0
num_PASS_DEL_variants = 0

for line in fs.read(args.input_vcf_file):
    if line.startswith("##"):
        continue
    elif line.startswith("#"):
        line = line[1:]
        header = line.split("\t")
    else:
        num_all_variants = num_all_variants + 1
        fields = line.split("\t")
        REF = fields[header.index("REF")]
        ALT = fields[header.index("ALT")]
        FILTER = fields[header.index("FILTER")]

        if FILTER == "PASS":
            num_PASS_variants = num_PASS_variants + 1
            if len(REF) == 1 and len(ALT) == 1:
                num_PASS_SNV_variants = num_PASS_SNV_variants + 1

            if len(REF) > 1 and len(ALT) == 1:
                num_PASS_DEL_variants = num_PASS_DEL_variants + 1

            if len(REF) == 1 and len(ALT) > 1:
                num_PASS_INS_variants = num_PASS_INS_variants + 1

output = [num_all_variants, num_PASS_variants, num_PASS_SNV_variants, num_PASS_INS_variants, num_PASS_DEL_variants]                 
header_out = "vcf_file\tall_variants\tPASS_variants\tPASS_SNV\tPASS_INS\tPASS_DEL"
#print header_out.replace(", ", "\t")
#print(header_out)
#print(os.path.basename(args.input_vcf_file) + "\t" + "\t".join(map(str,output)))

fs.write("VCF.xls", header_out+"\n")
fs.write("VCF.xls", os.path.basename(args.input_vcf_file)[:50] + "\t" + "\t".join(map(str,output)) + "\n")




#logging.info(out_file)
fs.close_without_print()
#print "Done\n\n\n"
os.system("easy VCF.xls")
sys.exit()

def get_arguments(): main_description = '''\ Count VCF

'''
epi_log='''

This program requires the python packages:

argparse
openpyxl (pypm install "openpyxl<=2.1.5")
logging ''' help_help = '''\ show this help message and exit\ ''' version_help = '''\ show the version of this program\ '''

input_vcf_file_help = '''\
input VCF file
'''

arg_parser = argparse.ArgumentParser(description=main_description, epilog=epi_log, formatter_class=argparse.RawTextHelpFormatter, add_help=False)
arg_parser.register('type','bool', lambda s: str(s).lower() in ['true', '1', 't', 'y', 'yes']) # add type keyword to registries
###############################
#    1. required arguments    #
###############################
required_group = arg_parser.add_argument_group("required arguments")


###############################
#    2. optional arguments    #
###############################
optional_group = arg_parser.add_argument_group("optional arguments")

###############################
#    3. positional arguments  #
###############################
arg_parser.add_argument('input_vcf_file', metavar="VCF_FILE", type=str, help=input_vcf_file_help)


optional_group.add_argument("-h", "--help", action="help", help=help_help)
optional_group.add_argument("-v", "--version", action="version", version="%(prog)s: version 1.0", help=version_help)
args = arg_parser.parse_args()



abs_paths = ["input_vcf_file"]
create_dirs = []
must_exist_paths = ["input_vcf_file"]
check_arguments(args, arg_parser, abs_paths, create_dirs, must_exist_paths)
return args

def check_arguments(args, arg_parser, abs_paths, create_dirs, must_exist_paths): ''' check_arguments: (1) change the paths into absolute paths (2) create diretories if needed (3) check if the paths exists in the file system ''' for abs_path in abs_paths: obj = eval("args." + abs_path) if isinstance(obj, list):

        # list of files
        exec("args.{abs_path} = [os.path.abspath(x) for x in args.{abs_path}]".format(abs_path=abs_path))  in globals(), locals()
    else:
        # single file
        exec("args.{abs_path} = os.path.abspath(args.{abs_path})".format(abs_path=abs_path))  in globals(), locals()

for create_dir in create_dirs:
    dir_path = eval("args." + create_dir)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
        logging.debug("%s is created!\n" % dir_path)

for must_exist_path in must_exist_paths:
    obj = eval("args." + must_exist_path)
    if isinstance(obj, list):
        # list of files
        for path in obj:
            if not os.path.exists(path):
                arg_parser.print_help()
                logging.error("No file exist as \"%s\" \n" % path )
                sys.exit()                  
    else:
        # single file
        path = obj
        if not os.path.exists(path):
            arg_parser.print_help()
            logging.error("\"%s\" does not exist!!!\n" % path )
            sys.exit()

if __name__ == '__main__': main()