Access array in bcftools
3
0
Entering edit mode
6.3 years ago

Hi- I'm filtering a VCF file with bcftools view according the a FORMAT field with two items. How can I access the n-th item in the m-th sample?

For example, given this vcf file:

##fileformat=VCFv4.0
...
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

The AD tag has two elements which in the first sample are "9,1". I want to filter records where the second item in the first sample is greater than something (say 3). I thought this should work and it should exclude the record above because 1 < 3, but it doesn't:

bcftools view -i 'AD[0][1] > 3' recs.vcf # Access AD in first sample, second element

Is this possible at all with bcftools?

Thanks

bcftools view expression • 2.7k views
ADD COMMENT
1
Entering edit mode
5.7 years ago

Better late then never... This works with bcftools 1.9 (but not with 1.5, I haven't tried other versions).

Access the AD tag of the first sample (0) and get the second item (1). Require the returned value to be > 10:

bcftools-1.9/bcftools view --include 'FORMAT/AD[0:1] > 10'

This is documented at https://samtools.github.io/bcftools/bcftools.html#expressions

ADD COMMENT
1
Entering edit mode
6.3 years ago

Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:

Here is an example:

cat test.vcf 
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Greater than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Less than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

Greater than or equal to 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

In these examples, I use AWK's split function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if statement. The first part of the AWK command will ensure that the VCF header is output.

Kevin

ADD COMMENT
2
Entering edit mode

Hi Kevin, thanks for replying (+1).

I would avoid tools like bcftools for filtering and just use AWK

Actually, I would argue the opposite. I try to avoid parsing VCF records as raw strings as they are quite complex and it's easy to overlook some details. Since bcftools seems well tried and tested I'd use that instead.

ADD REPLY
1
Entering edit mode
6.3 years ago

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

 java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 &&  G.getAD()[1]>3;' input.vcf
ADD COMMENT

Login before adding your answer.

Traffic: 2950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6