Question: Access array in bcftools
0
gravatar for dariober
22 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Hi- I'm filtering a VCF file with bcftools view according the a FORMAT field with two items. How can I access the n-th item in the m-th sample?

For example, given this vcf file:

##fileformat=VCFv4.0
...
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

The AD tag has two elements which in the first sample are "9,1". I want to filter records where the second item in the first sample is greater than something (say 3). I thought this should work and it should exclude the record above because 1 < 3, but it doesn't:

bcftools view -i 'AD[0][1] > 3' recs.vcf # Access AD in first sample, second element

Is this possible at all with bcftools?

Thanks

view expression bcftools • 806 views
ADD COMMENTlink modified 15 months ago • written 22 months ago by dariober10k
1
gravatar for dariober
15 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Better late then never... This works with bcftools 1.9 (but not with 1.5, I haven't tried other versions).

Access the AD tag of the first sample (0) and get the second item (1). Require the returned value to be > 10:

bcftools-1.9/bcftools view --include 'FORMAT/AD[0:1] > 10'

This is documented at https://samtools.github.io/bcftools/bcftools.html#expressions

ADD COMMENTlink written 15 months ago by dariober10k
1
gravatar for Kevin Blighe
22 months ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:

Here is an example:

cat test.vcf 
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Greater than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Less than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

Greater than or equal to 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

In these examples, I use AWK's split function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if statement. The first part of the AWK command will ensure that the VCF header is output.

Kevin

ADD COMMENTlink modified 22 months ago • written 22 months ago by Kevin Blighe51k
2

Hi Kevin, thanks for replying (+1).

I would avoid tools like bcftools for filtering and just use AWK

Actually, I would argue the opposite. I try to avoid parsing VCF records as raw strings as they are quite complex and it's easy to overlook some details. Since bcftools seems well tried and tested I'd use that instead.

ADD REPLYlink modified 22 months ago • written 22 months ago by dariober10k
1
gravatar for Pierre Lindenbaum
22 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

 java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 &&  G.getAD()[1]>3;' input.vcf
ADD COMMENTlink written 22 months ago by Pierre Lindenbaum124k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1835 users visited in the last hour