Question: Fastest method to intersect VCF files
10 months ago, nailu90 wrote:

Hi all,

I'm trying to find the fastest possible way to intersect two VCF files. As far as I know, vcftools diff and bcftools isec are the common tools for this task.

However, intersecting large VCF files still takes a long time: 80 minutes for a small vcf.gz (1 MB) against a large vcf.gz (28 GB) using the following command: bcftools isec <file1.vcf.gz> <file2.vcf.gz> -p <outdir> -w1

Maybe these programs can be sped up, e.g. with threading parameters, or there is a workaround for this problem. I cannot imagine that there are no data science methods for faster joins/intersections.

I would be happy to hear your suggestions. Best regards

written 10 months ago by nailu90

bedtools intersect with the -sorted option (provided your VCFs are sorted) could be worth a try, but 28 GB is still a very large file, so be patient. Alternatively, you could write the content of the small file to BED format and then use tabix to retrieve those regions from the large one with the -R option.

Extract Sub-Set Of Regions From Vcf File
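A minimal sketch of the VCF-to-BED step mentioned above, in case it helps. It assumes tab-separated VCF data lines and uses the fact that VCF POS is 1-based while BED intervals are 0-based and half-open. The function name `vcf_to_bed_lines` and the example records are made up for illustration.

```python
def vcf_to_bed_lines(vcf_lines):
    """Yield one BED line (chrom, start, end) per VCF data line."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        start = pos - 1         # VCF POS is 1-based, BED start is 0-based
        end = start + len(ref)  # BED end is exclusive; interval spans the REF allele
        yield f"{chrom}\t{start}\t{end}"


# Hypothetical example records, not taken from the thread.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t123\t.\tA\tT\t.\t.\t.\n",
    "chr2\t500\t.\tAT\tA\t.\t.\t.\n",
]

for bed_line in vcf_to_bed_lines(example):
    print(bed_line)
```

The resulting lines could be written to a file and passed to tabix as the regions file. Note that this only selects positions; it does not check REF or ALT.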

written 10 months ago by ATpoint (40k)

Hi ATpoint, thanks for your answer!

I was also thinking about using tabix with BED in order to extract the intersecting variants (and later on also use the remaining non-intersecting variants in the 'small' VCF for further analysis).

However, at least to my understanding, a BED file contains only the chromosome and the start/end position of a given variant (e.g. 'chr1 123 124'). The problem is that the large VCF file (30 GB) is a reference VCF containing all known SNVs in the human genome, so a single position can carry several records, e.g.: chr1 123 A T, chr1 123 A C, chr1 123 A G (CHROM, POS, REF, ALT)

This means that I have to match not only on the chromosome and start/end but also on REF and ALT, which cannot be done with a BED file, right? Is there any specific BED format made for VCFs?
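One way to picture the allele-aware match described here is a plain hash join: load the (CHROM, POS, REF, ALT) keys of the small file into a set, then stream the large file once and keep the lines whose key is in that set. The sketch below is only an illustration under assumptions: both files use the same chromosome naming and allele representation, multi-allelic ALT fields are comma-separated, and the names `variant_keys` and `intersect` are hypothetical.

```python
def variant_keys(line):
    """Return one (CHROM, POS, REF, ALT) tuple per ALT allele of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
    return [(chrom, pos, ref, alt) for alt in alts.split(",")]


def is_data_line(line):
    """True for non-empty lines that are not VCF header lines."""
    return bool(line.strip()) and not line.startswith("#")


def intersect(small_lines, large_lines):
    """Yield lines of the large VCF sharing a full variant key with the small VCF."""
    wanted = set()
    for line in small_lines:
        if is_data_line(line):
            wanted.update(variant_keys(line))
    for line in large_lines:  # single streaming pass, nothing else kept in memory
        if is_data_line(line) and any(key in wanted for key in variant_keys(line)):
            yield line


# Hypothetical records mirroring the chr1 123 example above.
small = ["chr1\t123\t.\tA\tT\t.\t.\t.\n"]
large = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t123\t.\tA\tT\t.\t.\t.\n",
    "chr1\t123\t.\tA\tC\t.\t.\t.\n",
    "chr1\t123\t.\tA\tG\t.\t.\t.\n",
]

for hit in intersect(small, large):
    print(hit, end="")
```

For compressed input, `gzip.open(path, "rt")` from the standard library yields text lines that can be passed in directly. The large file is still read in full once, so this shows the matching logic and makes no claim about speed.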

modified 10 months ago • written 10 months ago by nailu90

I would try to reduce the search time by splitting the large VCF file into smaller VCF files by chromosome, and then searching the records of the 1 MB VCF file against those per-chromosome files. That should save time compared with searching the whole 28 GB VCF.
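A rough sketch of the per-chromosome idea, assuming tab-separated VCF lines. The helper `split_by_chrom` is hypothetical, and for brevity it keeps the groups in memory; for a 28 GB file each group would have to be written to its own file instead.

```python
from collections import defaultdict


def split_by_chrom(vcf_lines):
    """Group VCF data lines by their CHROM column (header lines are skipped)."""
    by_chrom = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        by_chrom[chrom].append(line)
    return dict(by_chrom)


# Hypothetical records for illustration.
small_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t123\t.\tA\tT\t.\t.\t.\n",
    "chr7\t900\t.\tG\tC\t.\t.\t.\n",
    "chr1\t456\t.\tC\tG\t.\t.\t.\n",
]

groups = split_by_chrom(small_vcf)
# Only the chromosomes present in the small file need to be searched at all.
print(sorted(groups))
```

With the small file grouped this way, each group only has to be compared against the matching per-chromosome piece of the large file, and chromosomes absent from the small file can be skipped.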

written 10 months ago by Prakki Rama (2.4k)