Question

Retrieve statistical information from a vcf file using VCF tools

0

Entering edit mode

8.7 years ago

niharraul • 0

Hello Everyone,

I have a vcf file containing variants for a 7-8 k patients (exome sequencing) which is not tabix indexed and I am trying to retrieve the following information:

information on columns header
number of monomorphic variants, number of multi-allelic variants
distinguish number of SNPS and indels
Depth of sequencing , human genome build each patient is
exact position of monomorphic and multi-allelic variant
frequency distribution of variants

I am looking for all this information for all the chromosome together and on per chromosome basis. I am using VCFtools on linux environment. But being new to this field I have no clue on how I will be proceeding. All the help is appreciated, if detailed it will be better.

Regards

Nihar

snp • 5.1k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by niharraul • 0

Ram · Answer 1 · 2015-08-20

Well, for some of what you are doing you'll probably need more than just vcftools. And it does probably help to bgzip and tabix the file.

A lot of the raw stats you can get with vcf-stat, although I would highly recommend using bcftools stat instead as it is much quicker (uses the fast htslib c-library). That will give you a whole bunch of aggregated statistics on your vcf file. I'll go through each of your points:

What sort of information on column header? A VCF file has pretty standard columns with the only real variable being the columns for individual samples.
bcftools stats will give you this
bcftools stats will give you this
All of the samples in the vcf file should be the same human genome build, and this info should be in the header. Since vcf file rows are laid out by coordinates, and it is the genomic coordinates that tend to change between builds, a file where samples were not aligned and called to the same reference build would be useless and lead to all sorts of downstream analysis problems. If you could even create such a file in the first place. So you shouldn't need a special tool for this, just check the vcf file header. For the depth of sequencing what do you want to measure? Just depth per sample at each position that is present in the vcf file? Depending on the process used to create the VCF file this COULD be every position in the genome, but more likely only includes positions where at least one sample had a variant allele. If so there is a DP field in the genotypes string per sample. It isn't the total reads but is a count of the reads covering that position that passed filters (as defined by whatever program was used for genotype/variant calling). Depending on what you want, if you have access to the BAM files you may want to look at samtools, GATK, bedtools, etc for calculating coverage statistics.
For this I guess you would want to just write a small script, either in perl (so you can sue the VCF.pm library from vcftools) or python (using PyVCF) or whatever language you chose to go through all of the positions in the VCF file and output locations based on passing specified criteria.
Frequency distribution of what specifically? I am assuming you mean something more like variant type or subtype? Transitions, transversions, multi-allelic? specific substitution patterns? For some of this bcftools stats will give you raw counts, you can take the total output and send it to a file and then pass it to plot-vcfstats to get some visualizations. You could also write a script to do some secondary work on the output stats if it isn't produced/covered by the plots.

Ram · Answer 2 · 2015-08-20

Hi Dan,

Thanks a lot for your reply. Started with the first 3 point and will write again about the rest.

For column header: General information about each column for example: first column say contains chromosome number , second contains Position, third contains what? And so on...
Which exact command provides the count of monomorphic and alternative variants
I am trying the following command with the first command I try, the output is following error: bcftools plugin counts input.vcf.gz ..... displays an error : No functional bcftools plugin were found. The enviornment variable BCFTOOLS_PLUGIN is not set. How do I set the environment variable it will be great of you could please provide me how to set the environment variable.

Regards
Nihar