So I am dealing with a merged vcf file containing samples sequenced by both a panel as well as exome sequencing. I have tried both vcftools and bedtools for obtaining a subset of SNPs based on the regions described in a bed file. Both worked as far as I can tell but vcftools only included half the number of SNPs that bedtools did. The number of SNPs in the bedtools subset was much closer to what I was expecting. Here are the methods I ran for both bedtools and vcftools:
vcftools --gzvcf merged.vcf.gz --bed panel.bed --out subset --recode --keep-INFO-all
bedtools intersect -a merged.vcf.gz -b panel.bed | bgzip > subset.vcf.gz
What I am hoping someone can tell me is the difference between the intersecting strategies performed by bedtools and vcftools and if there is an alternative for getting vcf subsets from a bed file that would be better than these options. Thank you!
Hello Alex,
Bedops looks like a great tool thank you for the suggestion. Why do you opt for the
--element-of
arg you shared over--intersect
? Does this have something to do with converting to bed and sorting first?--intersect
means something different inbedops
than it does in other tools. With--element-of
, this is a set membership test: does this interval in set A overlap another interval in set B by so-and-so bases. The--intersect
operation is different in that it calculates a new set of genomic intervals where there are overlaps, i.e. where intervals actually intersect.The following links: https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#element-of-e-element-of and https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#intersect-i-intersect show a graphical depiction of the differences.