Question: Amount of variants increased after bedtools subtract
0
gravatar for Martina
16 months ago by
Martina30
Martina30 wrote:

Hello all,

I have vcfs from exome sequencing and I wanted to filter out from it variants from homopolymer regions with bed file covering the homopolymer regions by bedtools subtract, but to my surprise, amount of variants (rows in the file) increased after bedtools subtract. The command I used was:

bedtools subtract -header -a test.vcf -b ../../../external/homopolymer_5bp_chr01tochrY.bed > test_subtracted.vcf

cat test.vcf |wc
 100000 1099205 43965799
cat test_subtracted.vcf |wc
 100680 1106685 44503053

Amount of rows in header is the same:

cat test_subtracted.vcf | grep "#" |wc
    144     789   11114
cat test.vcf | grep "#" |wc
    144     789   11114

bedtools intersect with original file (test.vcf, 100 000 variants) and file with "filtered-out" homopolymers shows 100684 rows). I am really confused - shouldn't intersect be maximally 100 000 variants?

bedtools intersect -header -a test.vcf -b test_subtracted.vcf | wc
 100684 1106729 44505289

Do you have similar experience or at least some advice about how to continue? I read the bedtools manual properly but at this moment I am very confused.

Thank you very much!!
Martina

next-gen • 548 views
ADD COMMENTlink modified 16 months ago by finswimmer14k • written 16 months ago by Martina30

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLYlink written 16 months ago by _r_am31k

Ok, thank you, next time I will focuse on it

ADD REPLYlink written 16 months ago by Martina30

You can try running the diff command on the original and subtracted VCF file to see what the changes were. It may help figure out what is going on.

ADD REPLYlink written 16 months ago by colin.kern930
0
gravatar for finswimmer
16 months ago by
finswimmer14k
Germany
finswimmer14k wrote:

The problem is, if you have a variant that span multiple regions in your bed file, this variant is output multiple times.

See this example:

input.vcf

##fileformat=VCFv4.2
##contig=<ID=chr4,length=190214555>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA06986
chr4    3074876 .   CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA    CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA 23.2353 .   .   GT  0/1

input.bed

chr4    3074870 3074878
chr4    3074880 3074885

output

$ bedtools subtract -header -a input.vcf -b input.bed 
##fileformat=VCFv4.2
##contig=<ID=chr4,length=190214555>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA06986
chr4    3074876 .   CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA    CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA 23.2353 .   .   GT  0/1
chr4    3074876 .   CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA    CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA 23.2353 .   .   GT  0/1

If you want to remove any variant, that overlap any entry in the bed file to use bcftools:

$ bcftools view -T ^input.bed input.vcf > output.vcf

(The ^ before input.bed says to not include regions in input.bed in the output.)

ADD COMMENTlink modified 16 months ago • written 16 months ago by finswimmer14k

Hello Fin, thank you very much for your answer. Now I see that by default, bedtools filters out only variants with 100% overlap with bed file, so if I have a long variant covering 2 regions in bed file and I don't have -A option (which filters out variants with at least 1bp overlap), bedtools "touch" the variant 2x - at first while comparison with first bed region (and reports it in output because overlap is not 100%) and at second while comparison with second bed region (and reports because overlap is not 100%). I checked the file and there is a lot of long variants.

So I tried both possibilities (bedtools subtract -A and bcftools view -T) and the output is a bit different:

$ bedtools subtract -A -a test.vcf -b ../../../external/homopolymer_5bp_chr01tochrY.bed|wc
  94026 1034286 41177623

$ bcftools view -T ^../../../external/homopolymer_5bp_chr01tochrY.bed test.vcf|wc
  97258 1069023 43390082

so I think I will finally keep bedtools subtract -A output, because it was able to filter out more variants.

Thank you really a lot, you pushed me to right way. Next time I promise, I will read better :)

Have a nice day, Martina

ADD REPLYlink modified 16 months ago • written 16 months ago by Martina30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1217 users visited in the last hour