Count Of Variants
3
5
Entering edit mode
8.9 years ago
win ▴ 890

hi there, is there a way to get count of SNP, indels, CNVs etc from a VCF file, so some thing like

SNPs = ?

Insertions = ?

Deletions = ?

CNVs = ?

using simple linux commands

thanks, a

vcf • 17k views
ADD COMMENT
15
Entering edit mode
8.9 years ago
matted 7.5k

There are a couple of ways that variant type is annotated within a VCF file, so there are correspondingly a few ways to get close to what you want. Here's one choice that should work with most VCF files:

Use the vcftools tool vcf-annotate to fill in the variant type field:

zcat in.vcf.gz | vcftools_0.1.9/bin/vcf-annotate --fill-type > out.vcf

Then count up the variants by looking at the (newly-filled) TYPE field:

grep -oP "TYPE=\w+" out.vcf | sort | uniq -c

Or in one step that doesn't change the original VCF file:

zcat in.vcf.gz | vcftools_0.1.9/bin/vcf-annotate --fill-type | grep -oP "TYPE=\w+" | sort | uniq -c

On an example I had, this yielded:

3410 TYPE=del
4487 TYPE=ins
56744 TYPE=snp

1000 Genomes VCF files will be annotated in a finer-grained way (e.g. choices including DUP, INV, CNV, TANDEM, see here), but I'm not sure how to get their range of annotations from your own raw read data. However, if these distinctions are critical to you, that may be a useful direction to explore.

ADD COMMENT
0
Entering edit mode

This is so helpful! Thank you!

ADD REPLY
0
Entering edit mode

Great solution, after 8,5 years. Worked like a charm; zcat in.vcf.gz | vcf-annotate --fill-type > out.vcf

ADD REPLY
10
Entering edit mode
6.6 years ago

bcftools has a reporting tool that gives you this kind of information:

bcftools stats file.vcf > file.stats
ADD COMMENT
0
Entering edit mode

this doesn't seem to differentiate insertions or deletions. just indels.

ADD REPLY
0
Entering edit mode
12 months ago

if you want a fancy output, here is a development based on the top answer above with some gawk magic

cat <my.vcf> \
| vcf-annotate --fill-type \
| grep -v '^#' \
| gawk '
  BEGIN{
  FS="\t"; OFS="\t"
  }
  {
  match($8, /TYPE=([^ ]+)/, arr); 
  type=arr[1]; 
  len=sqrt((length($5)-length($4))^2); 
  cnt[type]++; 
  tot[type]+=len
  }
  END{
  print"# differences between the query and reference include:"
  printf "%s: %i (%i bases)\n", "snp", cnt["snp"], cnt["snp"];
  printf "%s: %i (%i bases)\n", "del", cnt["del"], tot["del"];
  printf "%s: %i (%i bases)\n", "ins", cnt["ins"], tot["ins"]
  }'

rem: it is simplistic and may not work well if your VCF has overlapping alt calls (not tested!)

ADD COMMENT

Login before adding your answer.

Traffic: 2267 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6