Question: How to detect overlapping indels in a VCF file?
6.9 years ago, mikyatope wrote:

Hi all,

I have a VCF file of indels from a whole-genome analysis, and I want to detect indels that overlap one another. I tried BEDTools intersect, but it requires two input files and I have only one. Before I try passing the same file to BEDTools intersect twice, does anyone know a better approach or a better tool for this?


indel vcf • 2.8k views

If you want to remove them, Galaxy has a tool called Delete Overlapping Indels.

Reply written 6.9 years ago by Sukhdeep Singh

Interesting, but my file is 10 GB. Does Galaxy support uploads of that size?

Reply written 6.9 years ago by mikyatope

BEDOPS tools are designed to handle arbitrarily sized inputs and may be a useful alternative to uploading a 10 GB file. Please see my comment on Irsan's answer.

Reply written 6.9 years ago by Alex Reynolds

Is the data phased? If so, you can use something like vcfgeno2haplo -w 1000, which reports on stderr when indels are "impossible" (e.g. overlapping on the same haplotype).

Alternatively, you could call with a method that doesn't generate overlapping indels (a haplotype detection method) and ensure that the input is left-aligned and homogenized.

Reply written 6.7 years ago by Erik Garrison
6.9 years ago, Irsan wrote:

Try bedops. It has a merge option that collapses overlapping elements in one or more input files. Make sure you have sorted the VCF file with sort-bed first.
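
To make "collapses overlapping elements" concrete, here is a rough Python sketch of what a merge over sorted regions does. It is only an illustration of the idea, not how bedops is implemented: the function name merge_sorted and the example regions are made up, regions are assumed to be 0-based half-open (chrom, start, end) tuples already sorted by chromosome and start, and only strictly overlapping regions are joined (check the bedops documentation for how it treats adjoining ones).

```python
from typing import Iterable, Iterator, Optional, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), 0-based, half-open


def merge_sorted(regions: Iterable[Region]) -> Iterator[Region]:
    """Collapse overlapping regions; input must be sorted by chrom, then start."""
    current: Optional[Region] = None
    for chrom, start, end in regions:
        if current is not None and chrom == current[0] and start < current[2]:
            # Overlaps the region being built: extend it if needed.
            current = (chrom, current[1], max(current[2], end))
        else:
            # No overlap (or a new chromosome): emit the finished region.
            if current is not None:
                yield current
            current = (chrom, start, end)
    if current is not None:
        yield current


# Hypothetical example regions, already sorted.
example = [
    ("chr1", 99, 103),
    ("chr1", 101, 102),
    ("chr1", 199, 201),
    ("chr2", 99, 101),
]
merged = list(merge_sorted(example))
print(merged)
# [('chr1', 99, 103), ('chr1', 199, 201), ('chr2', 99, 101)]
```

Note that a merge tells you where overlaps occurred (fewer regions come out than went in) but not which original records were involved.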

Modified 6.9 years ago • written 6.9 years ago by Irsan

Sorry, but it seems that bedops only accepts BED as input, and I have VCF files.

Reply written 6.9 years ago by mikyatope

The BEDOPS suite includes a vcf2bed conversion script, if this helps. The bedops tool operates on file streams in linear time with a low, constant memory footprint, so it should scale well to your 10 GB input file (see the Bioinformatics paper and its supplementary figures for a performance analysis). For sorting, however, you would want to use the "Big Bed Merge Sort" (bbms) tool instead of sort-bed, unless you have more than 10 GB of system memory. (When BEDOPS v2 comes out in a month or so, sort-bed will include the bbms functionality and be able to sort arbitrarily large BED inputs.)

Please see the BEDOPS documentation for conversion (vcf2bed), for sorting (bbms), and for the bedops tool itself.

The --element-of operator is probably the most useful for reporting overlapping BED elements, while the --merge operator will concatenate overlapping regions. If your analysis needs it, you can combine operators with standard UNIX pipes; BEDOPS apps can usually read standard input from upstream processing, e.g.:

vcf2bed < foo.vcf | bbms - | bedops --element-of - bar.bed > answer.bed
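
For the original single-file case, the overlap check itself can also be sketched without external tools. The following Python is only an illustrative sketch, not part of BEDOPS: the names parse_indels and find_overlaps are made up, each indel is assumed to span its REF allele as a 0-based half-open interval, the input is assumed to be sorted by chromosome and position, and symbolic ALT alleles such as <DEL> are not handled specially.

```python
from typing import Iterable, Iterator, List, Tuple

# (chrom, start, end, label); start is 0-based, end is exclusive
Interval = Tuple[str, int, int, str]


def parse_indels(lines: Iterable[str]) -> Iterator[Interval]:
    """Yield one interval per VCF record whose ALT length differs from REF."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        if any(len(alt) != len(ref) for alt in alts.split(",")):
            start = pos - 1  # VCF POS is 1-based
            # Assumption: the indel occupies the bases of its REF allele.
            yield (chrom, start, start + len(ref), f"{chrom}:{pos}:{ref}>{alts}")


def find_overlaps(intervals: Iterable[Interval]) -> Iterator[Tuple[Interval, Interval]]:
    """Yield overlapping pairs; input must be sorted by chrom, then start."""
    active: List[Interval] = []
    for cur in intervals:
        # Keep only earlier intervals that still reach past the current start.
        active = [iv for iv in active if iv[0] == cur[0] and iv[2] > cur[1]]
        for iv in active:
            yield iv, cur
        active.append(cur)


# Hypothetical records for illustration only.
demo_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tACGT\tA",  # deletion, spans [99, 103)
    "chr1\t102\t.\tG\tGTT",   # insertion, spans [101, 102)
    "chr1\t150\t.\tA\tG",     # SNP, ignored
    "chr1\t200\t.\tCA\tC",    # deletion, spans [199, 201)
    "chr2\t100\t.\tAT\tA",    # different chromosome
]
pairs = list(find_overlaps(parse_indels(demo_lines)))
for first, second in pairs:
    print(f"{first[3]} overlaps {second[3]}")
# chr1:100:ACGT>A overlaps chr1:102:G>GTT
```

Because both functions are generators, a large file can be streamed with something like find_overlaps(parse_indels(open("foo.vcf"))), and memory use then depends only on how many indels overlap at any one position.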
Reply modified 6.2 years ago • written 6.9 years ago by Alex Reynolds

Thanks for the explanation! I'll certainly give it a try.

Reply written 6.8 years ago by mikyatope