Hi all,
I'm trying to find a method that intersects two VCF files as fast as possible. As far as I know, vcftools diff
and bcftools isec
are the common tools for this task.
However, intersecting large VCF files still takes a long time: about 80 minutes for a small vcf.gz (1 MB) against a large vcf.gz (28 GB) using the following command:
bcftools isec <file1.vcf.gz> <file2.vcf.gz> -p <outdir> -w1
Maybe there are ways to speed these programs up, e.g. threading parameters, or a workaround for the problem in general. I cannot imagine that there are no data-science methods for faster joins/intersections.
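For what it's worth, the kind of streaming hash join meant here can be sketched with plain awk: load the small file's CHROM:POS:REF:ALT keys into memory, then stream the large file once. The file names and the tiny inline records below are made-up stand-ins for the real data.

```shell
# Miniature stand-ins for the real files (CHROM POS ID REF ALT).
printf 'chr1\t123\t.\tA\tT\nchr2\t456\t.\tG\tC\n' > small.vcf
printf 'chr1\t123\t.\tA\tT\nchr1\t123\t.\tA\tC\nchr2\t456\t.\tG\tC\nchr3\t789\t.\tT\tA\n' > large.vcf

# Pass 1 (NR==FNR): hash the small file's CHROM:POS:REF:ALT keys.
# Pass 2: stream the large file, printing records whose key is in the hash.
awk 'NR==FNR { if (!/^#/) keys[$1":"$2":"$4":"$5]; next }
     !/^#/ && ($1":"$2":"$4":"$5) in keys' small.vcf large.vcf > shared.vcf

# For a real gzipped large file, stream it in instead, e.g.:
#   zcat large.vcf.gz | awk '...' small.vcf -
```

This reads the large file exactly once and only ever holds the small file's keys in memory, so it scales with the 1 MB file, not the 28 GB one.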
I would be happy to hear your suggestions.
Best regards
bedtools intersect
with the -sorted
option (given your VCFs are sorted) could be worth a try, but 28 GB is still a very large file, so be patient. Alternatively, you could write the content of the small file to BED format and then use tabix
with the -R
option to retrieve those regions from the large one. See also: Extract Sub-Set Of Regions From Vcf File
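A minimal sketch of that route, assuming the large file is bgzipped and tabix-indexed; small.vcf and the inline records are placeholders:

```shell
# Placeholder query file (CHROM POS ID REF ALT).
printf 'chr1\t123\t.\tA\tT\nchr2\t456\t.\tG\tC\n' > small.vcf

# VCF positions are 1-based; BED intervals are 0-based, half-open,
# so a SNV at POS becomes the interval [POS-1, POS).
awk 'BEGIN{OFS="\t"} !/^#/ {print $1, $2-1, $2}' small.vcf > regions.bed

# Then (not run here; needs a tabix-indexed large.vcf.gz):
#   tabix -R regions.bed large.vcf.gz > candidates.vcf
```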
Hi ATpoint, thanks for your answer!
I was also thinking about using
tabix
with a BED file to extract the intersecting variants (and later also use the remaining non-intersecting variants in the 'small' VCF for further analysis). However, at least to my understanding, BED files only contain the chromosome and the start/end position of a variant (e.g. chr1 123 124). The problem is that the large VCF file (30 GB) is a reference VCF containing all known SNVs in the human genome, e.g.:
chr1 123 A T
chr1 123 A C
chr1 123 A G
(CHROM, POS, REF, ALT)
This means that I would have to match not only on chromosome and start/end but also on REF and ALT, which cannot be done with a BED file, right? Is there any specific BED-like format made from VCFs?
I would try to reduce search time by splitting the large VCF into smaller per-chromosome VCFs and searching the records of the 1 MB file against those chunks. That way, each lookup only scans one chromosome's worth of data instead of the whole 28 GB VCF.
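The splitting step could be sketched like this (file names and records are made up); each per-chromosome chunk can then be searched independently, or in parallel:

```shell
# Placeholder query file (CHROM POS ID REF ALT).
printf 'chr1\t123\t.\tA\tT\nchr2\t456\t.\tG\tC\nchr1\t999\t.\tC\tG\n' > small.vcf

# Write each record to a per-chromosome file named after column 1,
# e.g. query_chr1.vcf, query_chr2.vcf, ...
awk '!/^#/ { print > ("query_" $1 ".vcf") }' small.vcf

# Each query_chrN.vcf could then be intersected with the matching
# chromosome slice of the large file (e.g. via tabix), in parallel.
```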