Dear all,
please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,
bogdan
Dear all,
please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,
bogdan
bcftools isec is very easy to use. With a sample command line like
bcftools isec -p isec_output -Oz 1.vcf.gz 2.vcf.gz
You can get four output files in gzipped VCFs
isec_output/0000.vcf.gz
would be variants unique to 1.vcf.gz
isec_output/0001.vcf.gz
would be variants unique to 2.vcf.gz
isec_output/0002.vcf.gz
would be variants shared by 1.vcf.gz
and 2.vcf.gz
as represented in 1.vcf.gz
isec_output/0002.vcf.gz
isec_output/0003.vcf.gz
would be variants shared by 1.vcf.gz
and 2.vcf.gz
as represented in 2.vcf.gz
It's also faster than vcftools :)
Like the description says,
3.
isec_output/0002.vcf.gz
would be variants shared by1.vcf.gz
and2.vcf.gz
as represented in1.vcf.gz
4.isec_output/0002.vcf.gz
would be variants shared by1.vcf.gz
and2.vcf.gz
as represented in2.vcf.gz
The matching variant entries in each file may (and in all probability do) have different INFO, QUAL, ID fields etc. and thus the information content of the output would be different. The criteria for a "match" are matching CHROM, POS and optionally REF and ALT or ID (based on the -c
). Commutative property does not apply to sets of complex data-types with partial attribute overlap.
EDIT: The fourth file should be isec_output/0003.vcf.gz
. I've made this change in the answer as well.
If the matches are based on CHROM and POS columns, then it would ignore other fields which are not same. Right For example:
File 1
ID QUAL CHROM POS
2 52 7 14231
File 2
ID QUAL CHROM POS
3 56 7 14231
In the above example, though ID and QUAL values are different, it will be counted as one variant. Right?. So why there is difference in the number of variants. Sorry, can you help me to understand with an example here.
Thank you
Of course, the number of records will match, that's what an intersect operation does.
I read the answer and comments more thoroughly and I see a possible source of confusion: both the third and fourth files are named isec_output/0002.vcf.gz
- the fourth should be isec_output/0003.vcf.gz
(3 not 2).
With that in mind, applying the operation to your example, 0002
would be:
ID QUAL CHROM POS
2 52 7 14231
with ID and QUAL from File 1 and 0003
would be:
ID QUAL CHROM POS
3 56 7 14231
with ID and QUAL from File 2.
Unfortunately, vcftools, bcftools, and bedops all completely ignore the fact that in many cases there are multiple equivalent ways to express the same variant, so your input vcfs may look different but actually be saying the same thing about the underlying haplotypes. You can alleviate some of this by applying normalization, but you should cut to the chase by using a haplotype-aware vcf comparison tools such as rtg vcfeval, hap.py, or vgraph. I would recommend vcfeval as it is the most straight forward for using in an intersection type use-case (much like bcftools isec, you get separate output VCFs containing the unique and shared variants, preserving the input representations and annotations).
I tried using rtg vcfeval and dear god is it a pain to install the first time! curl -O
on a HPC gives me no way of unzipping the file, no matter what tool I use, and once I download to local Mac+unzip+scp, the SDF file for the actual intersect operation is either to be created from scratch or the pre-formatted databases download at the rate of 1 gig/4 hours. If this is the initial investment that goes into using this tool, why would I use it?
Sorry to hear you had difficulty. I had no idea that curl has problems with bog standard github release download links (wget handles these fine). Apparently you need to give curl the -L option, e.g:
$ curl -L -O https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.7.1/rtg-tools-3.7.1-nojre.zip
$ unzip -q rtg-tools-3.7.1-nojre.zip
$ ./rtg-tools-3.7.1/rtg # answer y or n when prompted
Regarding the SDF creation, it's trivial difficult to make your own, and is no harder than creating a BWA reference index or faidx index, e.g:
$ ./rtg-tools-3.7.1/rtg format -o hs37d5.sdf /path/to/hs37d5.fa.gz
I installed rtg-tools
3.8.4 using bioconda. When I run vcfeval
the output contains 4 VCFs:
fn.vcf.gz
fp.vcf.gz
tp.vcf.gz
tp-baseline.gz
Judging from the slides and the preprint, the fn (false negative) file contains calls that are in the baseline but not in the calls file, the fp (false positive) file contains calls that are in the calls but not in the baseline files and the tp (true positive) contains variants which are represented in both. Is this correct?
What then is the tp-baseline
file?
And I see from the log that variants with alleles over 1000 bases are skipped. Is this cutoff limit configurable somehow?
BTW I'm using this to compare variants in a bacterial genome.
tp.vcf
contains the called variants which are present in both.
tp-baseline.vcf
contains the baseline variants which are present in both.
So the tp.vcf
is equivalent to the tp-baseline.vcf
, but the former is using the calls representation, and the latter is using the baseline representation. If you haven't already looked, there is more information in the user manual section for vcfeval
(which unfortunately you don't get with bioconda).
You can adjust the maximum allele length with --Xmax-length=N
. --X
flags usually require caution when using. In the case of allele length, synchronization points (since you've read the preprint) only occur where there is no variant in play, so if you have very long alleles as well as many short alleles within the same span (in either vcf), computational complexity can blow out. It's probably less of a concern with bacterial variants, since you'll presumably either have haploid variants or be using --squash-ploidy
.
You could use BEDOPS bedops
for this. There are a couple approaches, depending on what you mean by intersect.
There is the set operation on elements, which is probably what most mean by "intersect":
$ bedops --element-of 1 <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed
An element-of operation is not a symmetric result — in this case, the result contains elements of one.vcf
that overlap elements in two.vcf
. You would reverse the order of inputs to get the other result.
Alternatively, and perhaps less commonly, you could do an operation on genomic regions:
$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed
The by-region operation is symmetric; inputs can be entered in any order to get the same result.
If you want VCF elements "common" to or overlapping between both input files, you can do the following series of operations:
$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > common-regions.bed
$ bedops --everything <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > all-elements.bed
$ bedops --element-of 1 all-elements.bed common-regions.bed > common-elements.bed
Just important to reiterate what Len mentioned below - vcftools, bcftools, and bedops all will NOT get the right answer in many cases! Even with careful normalization (for instance, using vt or vcfallelicprimitives), there will still be equivalent / matching variants that are not appropriately handled. As of today, the only tools that do intersections correctly are vgraph, vcfeval, and hap.py.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you intent to compare the VCFs for similarities? i.e. SNPs common to both?
thanks gentlemen !