Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:
Here is an example:
cat test.vcf
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,1 0/0:29,4
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,3 0/0:29,4
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,5 0/0:29,4
Greater than 3
cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,5 0/0:29,4
Less than 3
cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,1 0/0:29,4
Greater than or equal to 3
cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumour
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,3 0/0:29,4
chr1 14397 . CTGT C . PASS SVTYPE=DEL GT:AD 0/0:9,5 0/0:29,4
In these examples, I use AWK's split
function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if
statement. The first part of the AWK command will ensure that the VCF header is output.
Kevin