Question

comparing 3 VCF files for concordance and VENN plotting

0

Entering edit mode

6.3 years ago

drowl1 ▴ 30

Hello everyone!

I have 3 multi-sample VCF files that I want to compare for concordance such that I can use the statistics from the comparison to do a VENN diagram.I have used vcf-compare but the numbers I get from its output (vcf-compare a.vcf b.vcf c.vcf) do not agree with the pairwise comparison of my three vcf files using bcftools isec (ie,bcftools isec a.vcf b.vcf,bcftools isec a.vcf c.vcf,bcftools isec b.vcf c.vcf,).

Is there any other tool that can run this 3-vcf file comparison for me with understandable output statistics?

software error sequence next-gen sequencing • 4.4k views

ADD COMMENT • link updated 6.3 years ago by Kevin Blighe 87k • written 6.3 years ago by drowl1 ▴ 30

score 5 · Accepted Answer · 2017-12-22

Your experience with those tools used for comparing variants is why I gave up using them many years ago. They rarely agree and many bugs have been found. 'Under the hood', they may be doing some form of automatic filtering that is not reported, such as not including variants with low quality, etc.

I would just do this manually by(for each file):

1, Normalising the variants in each file (left-aligning indels and splitting multi-allelic calls), and then creating a unique key to identify each variant, setting it in the ID field for each resulting file (BCF):

#1st pipe, splits multi-allelic calls into separate variant calls
#2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
#3rd pipe, sets VCF ID field to CHROM:POS:REF:ALT

bcftools norm -m-any MyVariants.vcf | bcftools norm --check-ref w -f human_g1k_v37.fasta | bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%REF:%ALT' > MyVariants.bcf ;
bcftools index MyVariants.bcf ;

2, extracting the variant key into 3 separate lists using cut, awk, sed, or grep:

bcftools view MyVariants.bcf | cut -f3 | grep -v "^##" | grep -v "^ID"

3, With your 3 lists of unique variants, use something like Venny to check level of overlap and unique variants in each