Hi all,
I have a VCF file containing 50 samples, and I want to count the number of SNPs. My organism is a non-model species, so its reference has no assembled chromosomes.
How can I count the number of SNPs across all 50 samples in this VCF?
Best regards,
Mostafa
Hello mostafarafiepour,
you started with one question. In the meantime there are three :)
1. How to read a VCF file
This is a very basic question, so you need to start with some literature:
If you don't understand any of the explanations, don't hesitate to ask.
2. How to count the variants in a vcf (your original question)
3. Is the resulting number from (2) correct?
Well, that's quite hard to say without knowing anything about your genome. How large is it? Is there high diversity between individuals? As we only have the total number of distinct variants across all of your samples, it might be better to get a per-sample count. The output of bcftools stats (as suggested by cpad0112) might be useful, or have a look at this thread, especially the answers by Pierre and me.
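If bcftools is not at hand, a per-sample count can be roughly sketched with plain awk. This is only an illustration under some assumptions: the VCF is uncompressed and tab-delimited, GT is the first FORMAT subfield (as the VCF specification requires when GT is present), and "carries the SNP" means the genotype contains at least one non-reference allele. The function name per_sample_snp_counts is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each sample, count SNP records where the sample's
# genotype contains at least one non-reference allele.
# Assumes an uncompressed VCF with GT as the first FORMAT subfield.
per_sample_snp_counts() {
  awk -F'\t' '
    /^##/ { next }                                  # skip meta-information lines
    /^#CHROM/ {                                     # column header: sample names start at column 10
      for (i = 10; i <= NF; i++) name[i] = $i
      ncol = NF
      next
    }
    # keep only records whose REF and every ALT allele are single bases
    $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT](,[ACGT])*$/ {
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                           # f[1] is the GT subfield
        if (f[1] ~ /[1-9]/) n[i]++                  # any allele index other than 0 or "."
      }
    }
    END {
      for (i = 10; i <= ncol; i++) printf "%s\t%d\n", name[i], n[i]
    }
  ' "$1"
}

# Usage: per_sample_snp_counts input.vcf
```

Missing genotypes (./.) and homozygous-reference calls (0/0) are not counted, so the per-sample numbers will normally be well below the total number of records in the file.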
fin swimmer
Hello,
you get the total number by counting the lines in the VCF, excluding the header lines.
$ grep -v "^#" input.vcf|wc -l
fin swimmer
Hi swimmer,
Many thanks for your reply. I ran it and it gave me a number (20546654), but I do not know how correct this number is.
The code I used:
grep -v "^#" Final_VCF_50Sample.vcf|wc -l
Sorry swimmer,
Please describe the details in the photo for me.
"Please describe the details in the photo for me."
How is it related to your original question? Are you sure you're using the correct terms?
Each line of the VCF is a VARIANT. A variant can be a SNP, an INDEL, etc.
The intersection of the Variant and the Samples' names is a GENOTYPE.
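To make the SNP / INDEL distinction concrete, here is a rough awk sketch that sorts each record into one of three classes using only the REF and ALT columns. It assumes an uncompressed, tab-delimited VCF. The classification rules (and the function name count_variant_types) are a simplification chosen for this example and will not match bcftools' own type definitions in every corner case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally VCF records as SNP, INDEL, or OTHER from REF ($4) and ALT ($5).
# Rules used here (a simplification):
#   SNP   : REF and every ALT allele are a single base
#   INDEL : all alleles are plain base strings and at least one ALT differs in length from REF
#   OTHER : everything else (symbolic alleles such as <DEL>, ".", "*", same-length MNPs)
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                                   # skip all header lines
    {
      ref = toupper($4)
      n = split(toupper($5), alt, ",")              # multi-allelic ALT is comma-separated
      plain = (ref ~ /^[ACGT]+$/)
      all_single = (length(ref) == 1)
      any_len_diff = 0
      for (i = 1; i <= n; i++) {
        if (alt[i] !~ /^[ACGT]+$/) plain = 0
        if (length(alt[i]) != 1) all_single = 0
        if (length(alt[i]) != length(ref)) any_len_diff = 1
      }
      if (!plain)            type = "OTHER"
      else if (all_single)   type = "SNP"
      else if (any_len_diff) type = "INDEL"
      else                   type = "OTHER"
      count[type]++
    }
    END {
      printf "SNP\t%d\nINDEL\t%d\nOTHER\t%d\n", count["SNP"], count["INDEL"], count["OTHER"]
    }
  ' "$1"
}

# Usage: count_variant_types input.vcf
```

If the SNP number printed here is much smaller than the plain line count, the file contains a substantial share of INDELs or other records, and the line count alone overstates the number of SNPs.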
Here is a quick way to count biallelic SNPs in vcf.gz files (use "cat" instead of "zcat" for uncompressed vcf files):
zcat input.vcf.gz | awk '{if ($4~/^[ACGT]$/ && $5~/^[ACGT]$/){c++}} END {print c+0}'
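If all the variants in your VCF are SNPs, a very quick way is to index the file and then run index again with the -n flag, which prints the number of records. Note that bcftools index needs a bgzip-compressed file, so compress a plain VCF first.
bgzip data.vcf
bcftools index data.vcf.gz
bcftools index -n data.vcf.gz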
Try this:
bcftools query -f '%POS\n' file.vcf.gz | wc -l
Run
bcftools stats
on the VCF. It summarizes the VCF with most of the details you need.