I'm looking for the fastest way to intersect two VCF files. As far as I know, vcftools diff and bcftools isec are the common tools for this task.
However, intersecting large VCF files still takes a long time: 80 minutes for a small vcf.gz (1 MB) against a large vcf.gz (28 GB) with the following command:
bcftools isec <file1.vcf.gz> <file2.vcf.gz> -p <outdir> -w1
Maybe the programs themselves can be sped up, e.g. with threading parameters, or there is a workaround for this problem. I can't imagine that there are no data science methods for faster joins/intersections.
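To make the "join" idea concrete, here is a minimal sketch of the hash-based approach I have in mind, using only the Python standard library. It assumes that two records match when CHROM, POS, REF and ALT are exactly equal, which is simpler than what bcftools isec does (no normalization or special handling of multiallelic sites). The function names (variant_key, load_keys, intersect) are made up for illustration. It loads the small file into a set and streams the large file once; since it still has to decompress the whole large file, I don't know whether it would actually be faster.

```python
import gzip


def variant_key(line):
    """Return (CHROM, POS, REF, ALT) for one VCF data line."""
    chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
    return chrom, pos, ref, alt


def load_keys(small_vcf_gz):
    """Read every variant key of the small file into an in-memory set."""
    keys = set()
    with gzip.open(small_vcf_gz, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                keys.add(variant_key(line))
    return keys


def intersect(small_vcf_gz, large_vcf_gz, out_vcf):
    """Stream the large file once; write records whose key is in the small file.

    Assumption: exact match on CHROM, POS, REF, ALT counts as "shared".
    Returns the number of records written.
    """
    keys = load_keys(small_vcf_gz)
    kept = 0
    with gzip.open(large_vcf_gz, "rt") as src, open(out_vcf, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # copy the header of the large file
            elif variant_key(line) in keys:
                dst.write(line)
                kept += 1
    return kept
```

Usage would be something like intersect("small.vcf.gz", "large.vcf.gz", "shared.vcf") with placeholder file names. Note that this writes the matching records as they appear in the large file, uncompressed.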
I would be happy to hear your suggestions. Best regards