Question: Access array in bcftools
0
gravatar for dariober
3.1 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

Hi- I'm filtering a VCF file with bcftools view according the a FORMAT field with two items. How can I access the n-th item in the m-th sample?

For example, given this vcf file:

##fileformat=VCFv4.0
...
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

The AD tag has two elements which in the first sample are "9,1". I want to filter records where the second item in the first sample is greater than something (say 3). I thought this should work and it should exclude the record above because 1 < 3, but it doesn't:

bcftools view -i 'AD[0][1] > 3' recs.vcf # Access AD in first sample, second element

Is this possible at all with bcftools?

Thanks

view expression bcftools • 1.3k views
ADD COMMENTlink modified 2.5 years ago • written 3.1 years ago by dariober11k
1
gravatar for dariober
2.5 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

Better late then never... This works with bcftools 1.9 (but not with 1.5, I haven't tried other versions).

Access the AD tag of the first sample (0) and get the second item (1). Require the returned value to be > 10:

bcftools-1.9/bcftools view --include 'FORMAT/AD[0:1] > 10'

This is documented at https://samtools.github.io/bcftools/bcftools.html#expressions

ADD COMMENTlink written 2.5 years ago by dariober11k
1
gravatar for Kevin Blighe
3.1 years ago by
Kevin Blighe71k
Republic of Ireland
Kevin Blighe71k wrote:

Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:

Here is an example:

cat test.vcf 
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Greater than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Less than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

Greater than or equal to 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

In these examples, I use AWK's split function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if statement. The first part of the AWK command will ensure that the VCF header is output.

Kevin

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Kevin Blighe71k
2

Hi Kevin, thanks for replying (+1).

I would avoid tools like bcftools for filtering and just use AWK

Actually, I would argue the opposite. I try to avoid parsing VCF records as raw strings as they are quite complex and it's easy to overlook some details. Since bcftools seems well tried and tested I'd use that instead.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by dariober11k
1
gravatar for Pierre Lindenbaum
3.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

 java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 &&  G.getAD()[1]>3;' input.vcf
ADD COMMENTlink written 3.1 years ago by Pierre Lindenbaum134k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 987 users visited in the last hour
_