Fastest method to intersect VCF files
4.4 years ago
nailu90 • 0

Hi all,

I'm trying to find a method that intersects two VCF files as fast as possible. As far as I know, vcftools diff and bcftools isec are common tools for this issue.

However, when I try to intersect large VCF files, it still takes a long time: 80 minutes to intersect a small vcf.gz (1 MB) with a large vcf.gz (28 GB) using the following command: bcftools isec <file1.vcf.gz> <file2.vcf.gz> -p <outdir> -w1

Maybe the programs can be sped up, e.g. with threading parameters, or there is a workaround for this problem. I cannot imagine that there aren't any data science methods for faster joins/intersections.

I would be happy to hear your suggestions. Best regards

bcftools isec vcf tabix multithreading • 2.1k views

bedtools intersect with the -sorted option (provided your VCFs are sorted) could be worth a try, but 28 GB is still a very large file, so be patient. Alternatively, you could write the content of the small file to BED format and then use tabix to retrieve those regions from the large one with the -R option.
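The BED conversion step could look roughly like the sketch below. It assumes tab-separated VCF data lines and that spanning the REF allele is the desired region. The function names `vcf_to_bed_lines` and `open_text` are made up for illustration and are not part of any tool.

```python
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def vcf_to_bed_lines(vcf_lines):
    """Yield one BED line (0-based start, half-open end) per VCF data line."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        start = int(pos) - 1    # VCF POS is 1-based, BED start is 0-based
        end = start + len(ref)  # cover the whole REF allele
        yield f"{chrom}\t{start}\t{end}"
```

Writing these lines to a file such as small.bed (a placeholder name) and passing it to tabix's regions-file option (`-R` in current htslib releases) should then pull only the overlapping records from the indexed large file. Note that this matches on position only, not on REF/ALT.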

Extract Sub-Set Of Regions From Vcf File


Hi ATpoint, thanks for your answer!

I was also thinking about using tabix with BED in order to extract the intersecting variants (and later on also use the remaining non-intersecting variants in the 'small' VCF for further analysis).

However, at least to my understanding, BED files contain only the chromosome and the start/end position (e.g. 'chr1 123 124') of a given variant. The problem is that the large VCF file (30 GB) is a reference VCF containing all known SNVs in the human genome, so a single position can carry several records, e.g.: chr1 123 A T, chr1 123 A C, chr1 123 A G (CHROM, POS, REF, ALT)

This means that I have to match not only on the chromosome and start/end but also on REF and ALT, which cannot be done using a BED file, right? Is there any specific BED format made for VCFs?
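One way to match on all four fields is a plain hash join: load the small file's (CHROM, POS, REF, ALT) keys into a set and stream the reference records past it. The following is only a sketch under assumptions: both files use the same chromosome naming and allele representation (e.g. both normalized), and the function names `variant_keys` and `split_small_by_reference` are hypothetical.

```python
def variant_keys(lines):
    """Yield a (CHROM, POS, REF, ALT) tuple for each ALT allele of each record."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic records list ALTs comma-separated
            yield (chrom, pos, ref, alt)


def split_small_by_reference(small_lines, reference_lines):
    """Return (shared, private) key sets for the small file.

    shared:  keys of the small file that also occur in the reference
    private: keys of the small file that do not
    """
    small = set(variant_keys(small_lines))  # the small file fits in memory
    shared = set()
    for key in variant_keys(reference_lines):  # the reference is only streamed
        if key in small:
            shared.add(key)
    return shared, small - shared
```

The reference lines can come from any iterator, for instance the output of a region query restricted to the positions of the small file, so that only a fraction of the large file has to be decompressed. The private set holds the non-intersecting variants of the small VCF for further analysis.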


I would try to reduce the search time by splitting the large VCF file into smaller per-chromosome VCF files and searching the records of the 1 MB VCF file against those. That should save time compared with searching the whole 28 GB VCF.
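A minimal sketch of the grouping side of this idea: bucket the small file's records by chromosome, so each bucket only has to be compared against the matching per-chromosome file. The function name `group_by_chromosome` is hypothetical, and the lines are assumed to be tab-separated VCF records.

```python
from collections import defaultdict


def group_by_chromosome(vcf_lines):
    """Return {chromosome: [data lines]} for the given VCF lines."""
    groups = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom = line.split("\t", 1)[0]  # CHROM is the first column
        groups[chrom].append(line)
    return dict(groups)
```

Chromosomes that never appear in the small file can then be skipped entirely. If the large file is bgzip-compressed and tabix-indexed, a whole chromosome can also be fetched as a region (e.g. tabix large.vcf.gz chr1, with large.vcf.gz as a placeholder name) without physically splitting the file.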
