Question: Access array in bcftools
0
gravatar for dariober
17 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Hi- I'm filtering a VCF file with bcftools view according the a FORMAT field with two items. How can I access the n-th item in the m-th sample?

For example, given this vcf file:

##fileformat=VCFv4.0
...
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

The AD tag has two elements which in the first sample are "9,1". I want to filter records where the second item in the first sample is greater than something (say 3). I thought this should work and it should exclude the record above because 1 < 3, but it doesn't:

bcftools view -i 'AD[0][1] > 3' recs.vcf # Access AD in first sample, second element

Is this possible at all with bcftools?

Thanks

view expression bcftools • 638 views
ADD COMMENTlink modified 9 months ago • written 17 months ago by dariober10k
1
gravatar for dariober
9 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Better late then never... This works with bcftools 1.9 (but not with 1.5, I haven't tried other versions).

Access the AD tag of the first sample (0) and get the second item (1). Require the returned value to be > 10:

bcftools-1.9/bcftools view --include 'FORMAT/AD[0:1] > 10'

This is documented at https://samtools.github.io/bcftools/bcftools.html#expressions

ADD COMMENTlink written 9 months ago by dariober10k
1
gravatar for Kevin Blighe
17 months ago by
Kevin Blighe43k
Republic of Ireland
Kevin Blighe43k wrote:

Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:

Here is an example:

cat test.vcf 
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Greater than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Less than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

Greater than or equal to 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

In these examples, I use AWK's split function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if statement. The first part of the AWK command will ensure that the VCF header is output.

Kevin

ADD COMMENTlink modified 17 months ago • written 17 months ago by Kevin Blighe43k
2

Hi Kevin, thanks for replying (+1).

I would avoid tools like bcftools for filtering and just use AWK

Actually, I would argue the opposite. I try to avoid parsing VCF records as raw strings as they are quite complex and it's easy to overlook some details. Since bcftools seems well tried and tested I'd use that instead.

ADD REPLYlink modified 17 months ago • written 17 months ago by dariober10k
1
gravatar for Pierre Lindenbaum
17 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

 java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 &&  G.getAD()[1]>3;' input.vcf
ADD COMMENTlink written 17 months ago by Pierre Lindenbaum120k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2255 users visited in the last hour