Question: Best way to compare two samples in a VCF file
9
gravatar for Macspider
3.8 years ago by
Macspider3.1k
Vienna - BOKU
Macspider3.1k wrote:

It might be a very stupid question for many of you but, since it's my first variant calling, I didn't figure it out yet.

I have mpileup'ped two bam files from two samples, then I filtered the results with vcfutils.pl and called the genotypes with bcftools call. Now I have a VCF file containing what I want, but I managed to analyze differences only between the two samples and the reference (which is an assembly coming from a line that is different from both samples).

I have the variants between my two samples and the assembly, but what if I want to detect the differences between the two samples? I did it with awk / sed / cut and other command line tools but is there maybe a better and more straight-forward way to do that? Perhaps using bcftools? Up to know, I didn't find it.

Any suggestion appreaciated!

EDIT: I did it with bcftools gtcheck as well, that works fine. I am asking if there are other reliable tools to test!

EDIT 2: As this post has many views now, I guess it will be useful for everyone to know that I used bcftools isec and it worked brilliantly.

snp genotype variant comparison vcf • 8.7k views
ADD COMMENTlink modified 3.1 years ago • written 3.8 years ago by Macspider3.1k
7
gravatar for Len Trigg
3.8 years ago by
Len Trigg1.5k
New Zealand
Len Trigg1.5k wrote:

If you are looking for a good way of separating the variants into those that are shared between both samples and those that are unique to each sample, I would recommend using vcfeval from RTG Tools to do the comparison, as that will correctly deal with representation differences between the two samples (this may not be such an issue if both samples were called jointly or you are only concerned with simple SNPs). For example:

rtg vcfeval -t your_reference -b sample1.vcf.gz -c sample2.vcf.gz --squash-ploidy --sample sample1,sample2 -o sample-differences

If both of your samples are in a single VCF, just use the same VCF for both -b and -c. The --squash-ploidy option will ignore differences in zygosity in the genotypes and match based on shared alleles. In the output directory you will have variants unique to sample1 in sample-differences/fn.vcf.gz and those unique to sample2 in sample-differences/fp.vcf.gz, and you can get a summary of the composition of those variants using rtg vcfstats.

(disclaimer: I work for RTG, but vcfeval is free and awesome)

ADD COMMENTlink written 3.8 years ago by Len Trigg1.5k
6
gravatar for novice
3.8 years ago by
novice970
United States
novice970 wrote:

I was recently in a very similar situation. I wrote a bash script that finds the "similarity" between two VCF files. Someone might find it useful:

#!/usr/bin/env bash

sort -u <(grep -v '^#' $1 | cut -f1,2,4,5) > a
sort -u <(grep -v '^#' $2 | cut -f1,2,4,5) > b
comm -23 a b > a_only
comm -13 a b > b_only
comm -12 a b > ab

numer=`cat a_only b_only | wc -l`
denom=`cat ab a_only b_only | wc -l`
dist=`echo "$numer/$denom" | bc -l`
sim=`echo "(1-$dist)*100" | bc -l`

echo $sim

# CAREFUL
rm -f a_only b_only a b ab


If you put it in sim.sh then you can use it like so:

$ ./sim.sh sample_1.vcf sample_2.vcf

ADD COMMENTlink written 3.8 years ago by novice970
3
gravatar for Jorge Amigo
3.8 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

depending on what you're looking for, like simple counts for instance, awk/perl solutions could be enough. if you need a deeper description, bcftools stats would give you some interesting details.

if your 2 samples are in a single vcf file I would suggest to split them first

bcftools view -Oz -c1 -s sample1 joined.vcf.gz > sample1.vcf.gz
bcftools view -Oz -c1 -s sample2 joined.vcf.gz > sample2.vcf.gz

and then compare them

bcftools stats sample1.vcf.gz sample2.vcf.gz > joined.stats.txt
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Jorge Amigo12k
2
gravatar for colindaven
3.8 years ago by
colindaven2.3k
Hannover Medical School
colindaven2.3k wrote:

I'm a big fan of vt for this.

http://genome.sph.umich.edu/wiki/Vt#Peek

#summarizes the variants found in sample.vcf

 vt peek sample.vcf

Also bedtools is very flexible and can work with VCF files so I would use bedtools intersect for this task.

Others have mentioned database joins to be useful too: http://www.gettinggeneticsdone.com/2013/11/using-database-joins-to-compare-results.html

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by colindaven2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1070 users visited in the last hour