Dear all,
please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,
bogdan
Dear all,
please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,
bogdan
bcftools isec is very easy to use. With a sample command line like
bcftools isec -p isec_output -Oz 1.vcf.gz 2.vcf.gz
You can get four output files in gzipped VCFs
isec_output/0000.vcf.gz
would be variants unique to 1.vcf.gz
isec_output/0001.vcf.gz
would be variants unique to 2.vcf.gz
isec_output/0002.vcf.gz
would be variants shared by 1.vcf.gz
and 2.vcf.gz
as represented in 1.vcf.gz
isec_output/0002.vcf.gz
would be variants shared by 1.vcf.gz
and 2.vcf.gz
as represented in 2.vcf.gz
It's also faster than vcftools :)
Unfortunately, vcftools, bcftools, and bedops all completely ignore the fact that in many cases there are multiple equivalent ways to express the same variant, so your input vcfs may look different but actually be saying the same thing about the underlying haplotypes. You can alleviate some of this by applying normalization, but you should cut to the chase by using a haplotype-aware vcf comparison tools such as rtg vcfeval, hap.py, or vgraph. I would recommend vcfeval as it is the most straight forward for using in an intersection type use-case (much like bcftools isec, you get separate output VCFs containing the unique and shared variants, preserving the input representations and annotations).
I tried using rtg vcfeval and dear god is it a pain to install the first time! curl -O
on a HPC gives me no way of unzipping the file, no matter what tool I use, and once I download to local Mac+unzip+scp, the SDF file for the actual intersect operation is either to be created from scratch or the pre-formatted databases download at the rate of 1 gig/4 hours. If this is the initial investment that goes into using this tool, why would I use it?
Sorry to hear you had difficulty. I had no idea that curl has problems with bog standard github release download links (wget handles these fine). Apparently you need to give curl the -L option, e.g:
$ curl -L -O https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.7.1/rtg-tools-3.7.1-nojre.zip
$ unzip -q rtg-tools-3.7.1-nojre.zip
$ ./rtg-tools-3.7.1/rtg # answer y or n when prompted
Regarding the SDF creation, it's trivial difficult to make your own, and is no harder than creating a BWA reference index or faidx index, e.g:
$ ./rtg-tools-3.7.1/rtg format -o hs37d5.sdf /path/to/hs37d5.fa.gz
I installed rtg-tools
3.8.4 using bioconda. When I run vcfeval
the output contains 4 VCFs:
fn.vcf.gz
fp.vcf.gz
tp.vcf.gz
tp-baseline.gz
Judging from the slides and the preprint, the fn (false negative) file contains calls that are in the baseline but not in the calls file, the fp (false positive) file contains calls that are in the calls but not in the baseline files and the tp (true positive) contains variants which are represented in both. Is this correct?
What then is the tp-baseline
file?
And I see from the log that variants with alleles over 1000 bases are skipped. Is this cutoff limit configurable somehow?
BTW I'm using this to compare variants in a bacterial genome.
tp.vcf
contains the called variants which are present in both.
tp-baseline.vcf
contains the baseline variants which are present in both.
So the tp.vcf
is equivalent to the tp-baseline.vcf
, but the former is using the calls representation, and the latter is using the baseline representation. If you haven't already looked, there is more information in the user manual section for vcfeval
(which unfortunately you don't get with bioconda).
You can adjust the maximum allele length with --Xmax-length=N
. --X
flags usually require caution when using. In the case of allele length, synchronization points (since you've read the preprint) only occur where there is no variant in play, so if you have very long alleles as well as many short alleles within the same span (in either vcf), computational complexity can blow out. It's probably less of a concern with bacterial variants, since you'll presumably either have haploid variants or be using --squash-ploidy
.
You could use BEDOPS bedops
for this. There are a couple approaches, depending on what you mean by intersect.
There is the set operation on elements, which is probably what most mean by "intersect":
$ bedops --element-of 1 <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed
An element-of operation is not a symmetric result — in this case, the result contains elements of one.vcf
that overlap elements in two.vcf
. You would reverse the order of inputs to get the other result.
Alternatively, and perhaps less commonly, you could do an operation on genomic regions:
$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed
The by-region operation is symmetric; inputs can be entered in any order to get the same result.
If you want VCF elements "common" to or overlapping between both input files, you can do the following series of operations:
$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > common-regions.bed
$ bedops --everything <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > all-elements.bed
$ bedops --element-of 1 all-elements.bed common-regions.bed > common-elements.bed
Just important to reiterate what Len mentioned below - vcftools, bcftools, and bedops all will NOT get the right answer in many cases! Even with careful normalization (for instance, using vt or vcfallelicprimitives), there will still be equivalent / matching variants that are not appropriately handled. As of today, the only tools that do intersections correctly are vgraph, vcfeval, and hap.py.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you intent to compare the VCFs for similarities? i.e. SNPs common to both?
thanks gentlemen !