Question: intersect VCF files
4
gravatar for Bogdan
4.6 years ago by
Bogdan1.0k
Palo Alto, CA, USA
Bogdan1.0k wrote:

Dear all,

please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,

 

bogdan

myposts vcf • 9.0k views
ADD COMMENTlink modified 4.6 years ago by Len Trigg1.5k • written 4.6 years ago by Bogdan1.0k

Do you intent to compare the VCFs for similarities? i.e. SNPs common to both?

ADD REPLYlink written 4.6 years ago by John12k

thanks gentlemen !

ADD REPLYlink written 4.6 years ago by Bogdan1.0k
11
gravatar for Mitch Bekritsky
4.6 years ago by
Mitch Bekritsky1.2k
London, England
Mitch Bekritsky1.2k wrote:

bcftools isec is very easy to use. With a sample command line like

bcftools isec -p isec_output -Oz 1.vcf.gz 2.vcf.gz

You can get four output files in gzipped VCFs

  1. isec_output/0000.vcf.gz would be variants unique to 1.vcf.gz
  2. isec_output/0001.vcf.gz would be variants unique to 2.vcf.gz
  3. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 1.vcf.gz
  4. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 2.vcf.gz

It's also faster than vcftools :)

ADD COMMENTlink modified 2.0 years ago by RamRS30k • written 4.6 years ago by Mitch Bekritsky1.2k
1

thanks Mitch !

ADD REPLYlink written 4.6 years ago by Bogdan1.0k
1

I also would like to add that in order to obtain a vcf file containing information for common SNPs for samples from file 1.vcf.gz and 2.vcf.gz the following command should be used:

bcftools merge --merge all 0002.vcf.gz 0003.vcf.gz > merged.vcf
ADD REPLYlink written 12 months ago by rimgubaev180

For this final step (creating a VCF with the variants in both VCFs with the samples from both VCFs), is there any difference between using bcftools merge and GATK CombineVariants?

ADD REPLYlink modified 8 months ago • written 8 months ago by Leandro Lima960

Unfortunately, I have not used GATK for this purpuses.

ADD REPLYlink written 7 months ago by rimgubaev180
9
gravatar for Len Trigg
4.6 years ago by
Len Trigg1.5k
New Zealand
Len Trigg1.5k wrote:

Unfortunately, vcftools, bcftools, and bedops all completely ignore the fact that in many cases there are multiple equivalent ways to express the same variant, so your input vcfs may look different but actually be saying the same thing about the underlying haplotypes. You can alleviate some of this by applying normalization, but you should cut to the chase by using a haplotype-aware vcf comparison tools such as rtg vcfeval, hap.py, or vgraph. I would recommend vcfeval as it is the most straight forward for using in an intersection type use-case (much like bcftools isec, you get separate output VCFs containing the unique and shared variants, preserving the input representations and annotations).

ADD COMMENTlink modified 2.0 years ago by RamRS30k • written 4.6 years ago by Len Trigg1.5k
1

You should mention that you work for RTG when recommending the software - however its open source and really awesome, so thank you! :)

ADD REPLYlink written 4.6 years ago by John12k

I tried using rtg vcfeval and dear god is it a pain to install the first time! curl -O on a HPC gives me no way of unzipping the file, no matter what tool I use, and once I download to local Mac+unzip+scp, the SDF file for the actual intersect operation is either to be created from scratch or the pre-formatted databases download at the rate of 1 gig/4 hours. If this is the initial investment that goes into using this tool, why would I use it?

ADD REPLYlink written 3.7 years ago by RamRS30k
1

Sorry to hear you had difficulty. I had no idea that curl has problems with bog standard github release download links (wget handles these fine). Apparently you need to give curl the -L option, e.g:

$ curl -L -O https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.7.1/rtg-tools-3.7.1-nojre.zip
$ unzip -q rtg-tools-3.7.1-nojre.zip
$ ./rtg-tools-3.7.1/rtg  # answer y or n when prompted

Regarding the SDF creation, it's trivial difficult to make your own, and is no harder than creating a BWA reference index or faidx index, e.g:

$ ./rtg-tools-3.7.1/rtg format -o hs37d5.sdf /path/to/hs37d5.fa.gz
ADD REPLYlink written 3.7 years ago by Len Trigg1.5k

Thank you - I will try rtg the next time I have to intersect VCF files. In the meantime, I'll try and prep the SDF file.

ADD REPLYlink written 3.7 years ago by RamRS30k

I installed rtg-tools 3.8.4 using bioconda. When I run vcfeval the output contains 4 VCFs:

fn.vcf.gz
fp.vcf.gz
tp.vcf.gz
tp-baseline.gz

Judging from the slides and the preprint, the fn (false negative) file contains calls that are in the baseline but not in the calls file, the fp (false positive) file contains calls that are in the calls but not in the baseline files and the tp (true positive) contains variants which are represented in both. Is this correct?

What then is the tp-baseline file?

And I see from the log that variants with alleles over 1000 bases are skipped. Is this cutoff limit configurable somehow?

BTW I'm using this to compare variants in a bacterial genome.

ADD REPLYlink modified 2.0 years ago by RamRS30k • written 2.8 years ago by Peter vH130

tp.vcf contains the called variants which are present in both.
tp-baseline.vcf contains the baseline variants which are present in both.

So the tp.vcf is equivalent to the tp-baseline.vcf, but the former is using the calls representation, and the latter is using the baseline representation. If you haven't already looked, there is more information in the user manual section for vcfeval (which unfortunately you don't get with bioconda).

You can adjust the maximum allele length with --Xmax-length=N. --X flags usually require caution when using. In the case of allele length, synchronization points (since you've read the preprint) only occur where there is no variant in play, so if you have very long alleles as well as many short alleles within the same span (in either vcf), computational complexity can blow out. It's probably less of a concern with bacterial variants, since you'll presumably either have haploid variants or be using --squash-ploidy.

ADD REPLYlink modified 2.0 years ago by RamRS30k • written 2.8 years ago by Len Trigg1.5k
4
gravatar for venu
4.6 years ago by
venu6.7k
Germany
venu6.7k wrote:

Quickly recommended vcftools

ADD COMMENTlink modified 2.0 years ago by RamRS30k • written 4.6 years ago by venu6.7k
1

Just important to reiterate what Len mentioned below - vcftools, bcftools, and bedops all will NOT get the right answer in many cases! Even with careful normalization (for instance, using vt or vcfallelicprimitives), there will still be equivalent / matching variants that are not appropriately handled. As of today, the only tools that do intersections correctly are vgraph, vcfeval, and hap.py.

ADD REPLYlink written 4.4 years ago by bofallon10
4
gravatar for Alex Reynolds
4.6 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

You could use BEDOPS bedops for this. There are a couple approaches, depending on what you mean by intersect.

There is the set operation on elements, which is probably what most mean by "intersect":

$ bedops --element-of 1 <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed

An element-of operation is not a symmetric result — in this case, the result contains elements of one.vcf that overlap elements in two.vcf. You would reverse the order of inputs to get the other result.

Alternatively, and perhaps less commonly, you could do an operation on genomic regions:

$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed

The by-region operation is symmetric; inputs can be entered in any order to get the same result.

If you want VCF elements "common" to or overlapping between both input files, you can do the following series of operations:

$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > common-regions.bed
$ bedops --everything <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > all-elements.bed
$ bedops --element-of 1 all-elements.bed common-regions.bed > common-elements.bed
ADD COMMENTlink modified 4.5 years ago • written 4.6 years ago by Alex Reynolds30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 782 users visited in the last hour