Question: Count Of Variants
8.2 years ago, win wrote:

Hi there, is there a way to get counts of SNPs, indels, CNVs, etc. from a VCF file? Something like:

SNPs = ?

Insertions = ?

Deletions = ?

CNVs = ?

using simple Linux commands?

Thanks, a

8.2 years ago, matted (Boston, United States) wrote:

There are a couple of ways that variant type is annotated within a VCF file, so there are correspondingly a few ways to get close to what you want. Here's one choice that should work with most VCF files:

Use the vcftools tool vcf-annotate to fill in the variant type field:

zcat in.vcf.gz | vcftools_0.1.9/bin/vcf-annotate --fill-type > out.vcf

Then count up the variants by looking at the (newly-filled) TYPE field:

grep -oP "TYPE=\w+" out.vcf | sort | uniq -c

Or in one step that doesn't change the original VCF file:

zcat in.vcf.gz | vcftools_0.1.9/bin/vcf-annotate --fill-type | grep -oP "TYPE=\w+" | sort | uniq -c

On an example I had, this yielded:

3410 TYPE=del
4487 TYPE=ins
56744 TYPE=snp
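
If vcf-annotate is not at hand, a rough equivalent can be sketched with plain awk by comparing the lengths of REF and each ALT allele. This is only a minimal sketch under my own assumptions: `count_variant_types` is a made-up helper name, multi-allelic records are counted once per ALT allele, and the classification is by length only, so it will not necessarily agree with vcf-annotate on complex events.

```shell
# Classify each ALT allele by comparing its length with REF.
# count_variant_types is a hypothetical helper name, not part of any toolkit.
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      n = split($5, alt, ",")           # one record may list several ALT alleles
      for (i = 1; i <= n; i++) {
        a = alt[i]
        if (a == "." || a == "*")
          type = "other"
        else if (a ~ /^<.*>$/)
          type = "symbolic:" a          # e.g. <DEL>, <CNV>
        else if (index(a, "[") || index(a, "]"))
          type = "breakend"
        else if (length(a) == length($4))
          type = (length(a) == 1 ? "snp" : "mnp")
        else if (length(a) > length($4))
          type = "ins"
        else
          type = "del"
        cnt[type]++
      }
    }
    END { for (t in cnt) print t "\t" cnt[t] }
  ' "$@" | sort
}
```

Usage would be something like `zcat in.vcf.gz | count_variant_types`, or `count_variant_types in.vcf` for an uncompressed file.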

1000 Genomes VCF files will be annotated in a finer-grained way (e.g. choices including DUP, INV, CNV, TANDEM, see here), but I'm not sure how to get their range of annotations from your own raw read data. However, if these distinctions are critical to you, that may be a useful direction to explore.
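
For files that already carry such annotations, the same grep/sort/uniq idea can be pointed at whichever INFO key holds them. A minimal sketch, assuming the key is `SVTYPE` (the reserved structural-variant key in the VCF specification); check the `##INFO` header lines of your own file for the key it actually uses. `count_info_values` is a hypothetical helper name.

```shell
# count_info_values is a hypothetical helper: tally the values of one INFO key.
# Usage: count_info_values KEY [file.vcf ...]   (reads stdin if no file is given)
count_info_values() {
  key=$1; shift
  grep -v '^#' "$@" |     # drop header lines
    cut -f8 |             # keep the INFO column
    tr ';' '\n' |         # one INFO entry per line
    grep "^${key}=" |     # exact key match, so TYPE does not also hit SVTYPE
    sort | uniq -c
}
```

For example: `zcat in.vcf.gz | count_info_values SVTYPE`.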


This is so helpful! Thank you!

Reply written 2.0 years ago by kelseyca0
6.0 years ago, Jorge Amigo (Santiago de Compostela, Spain) wrote:

bcftools has a reporting tool that gives you this kind of information:

bcftools stats file.vcf > file.stats

This doesn't seem to differentiate insertions from deletions; it just reports indels.

Reply written 5.3 years ago by nchuang230
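
One possible way around this, as far as I can tell, is the indel length distribution that `bcftools stats` also writes: the `IDD` rows, where the length column is negative for deletions. The sketch below assumes the layout of recent bcftools releases (column 3 = length, column 4 = number of sites); check the `# IDD` comment line in your own report before trusting it. `count_ins_del` is a hypothetical helper name.

```shell
# count_ins_del is a hypothetical helper: split the indel total of a
# `bcftools stats` report into insertions and deletions using the IDD rows.
# Assumed layout: IDD <id> <length, negative for deletions> <count> ...
count_ins_del() {
  awk -F'\t' '
    $1 == "IDD" { if ($3 < 0) del += $4; else ins += $4 }
    END { printf "Insertions = %d\nDeletions = %d\n", ins, del }
  ' "$@"
}
```

For example: `bcftools stats file.vcf | count_ins_del`, or `count_ins_del file.stats` on a saved report.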
4 months ago, Stephane Plaisance (Leuven area, Belgium) wrote:

If you want fancier output, here is an extension of the top answer above with some gawk magic:

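cat my.vcf \
| vcf-annotate --fill-type \
| grep -v '^#' \
| gawk '
  BEGIN { FS = "\t"; OFS = "\t" }
  # capture the TYPE value from INFO (column 8); anchored so SVTYPE= is not matched
  match($8, /(^|;)TYPE=([^;,]+)/, arr) {
    cnt[arr[2]]++
    # assumption: event size = |len(ALT1) - len(REF)|, first ALT allele only
    split($5, alt, ",")
    d = length(alt[1]) - length($4)
    tot[arr[2]] += (d < 0 ? -d : d)
  }
  END {
    print "# differences between the query and reference include:"
    printf "%s: %i (%i bases)\n", "snp", cnt["snp"], cnt["snp"]
    printf "%s: %i (%i bases)\n", "del", cnt["del"], tot["del"]
    printf "%s: %i (%i bases)\n", "ins", cnt["ins"], tot["ins"]
  }'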

Note: this is simplistic and may not work well if your VCF has overlapping ALT calls (not tested!).
