I have vcfs from exome sequencing and I wanted to filter out from it variants from homopolymer regions with bed file covering the homopolymer regions by bedtools subtract, but to my surprise, amount of variants (rows in the file) increased after bedtools subtract. The command I used was:
bedtools subtract -header -a test.vcf -b ../../../external/homopolymer_5bp_chr01tochrY.bed > test_subtracted.vcf cat test.vcf |wc 100000 1099205 43965799 cat test_subtracted.vcf |wc 100680 1106685 44503053
Amount of rows in header is the same:
cat test_subtracted.vcf | grep "#" |wc 144 789 11114 cat test.vcf | grep "#" |wc 144 789 11114
bedtools intersect with original file (test.vcf, 100 000 variants) and file with "filtered-out" homopolymers shows 100684 rows). I am really confused - shouldn't intersect be maximally 100 000 variants?
bedtools intersect -header -a test.vcf -b test_subtracted.vcf | wc 100684 1106729 44505289
Do you have similar experience or at least some advice about how to continue? I read the bedtools manual properly but at this moment I am very confused.
Thank you very much!!