How to filter INFO field of VCF by 'OR' using bcftools
8 months ago
adixon3 • 0

I have a VCF from dbVar (https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/nstd102.GRCh38.variant_call.vcf.gz) and I want to pull out all the variants that have any of the following values in the CLNSIG field: Pathogenic, Likely_pathogenic.

If I use the following command, bcftools delivers only variants with "Pathogenic" in the CLNSIG field:

bcftools view -i 'INFO/CLNSIG ~ "Pathogenic"|"Likely_pathogenic"' nstd102.GRCh38.variant_call.vcf.gz

How do I properly use the 'OR' operator to find the union of variants that match my list of CLNSIG values?

filtering bcftools vcf • 1.1k views
8 months ago

Try:

    bcftools view -i 'INFO/CLNSIG ~ "Pathogenic\|Likely_pathogenic"' nstd102.GRCh38.variant_call.vcf.gz
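
If that comes back empty, it may help to check how the CLNSIG values are actually spelled in the file before tuning the regex. A minimal sketch, assuming bcftools is on the PATH and the file has been downloaded locally; the `clnsig_counts` helper name is only illustrative:

```shell
# Hypothetical helper: tabulate the distinct CLNSIG values in a VCF.
# Assumes bcftools is installed; pass the path to the VCF as $1.
clnsig_counts() {
    # %INFO/CLNSIG prints the raw field value, one line per record
    # (records without the tag print ".").
    bcftools query -f '%INFO/CLNSIG\n' "$1" | sort | uniq -c | sort -rn
}

# Usage (assumed local copy of the dbVar file from the question):
# clnsig_counts nstd102.GRCh38.variant_call.vcf.gz
```

The exact strings in that table are what the filter expression has to match, including case and underscores.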

Unfortunately this produces a VCF with a header and no variants :( I've noticed that this also happens with the following command

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20231012
##reference=GCF_000001405.40
##ALT=<ID=<CNV>,Description="Copy number variable region">
##ALT=<ID=<DEL>,Description="Deletion relative to the reference">
##ALT=<ID=<DUP>,Description="Region of elevated copy number relative to the reference">
##ALT=<ID=<INS>,Description="Insertion of sequence relative to the reference">
##ALT=<ID=<INV>,Description="Inversion of reference sequence">
##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in dbVar">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=DESC,Number=1,Type=String,Description="Any additional information about this call (free text, enclose in double quotes)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=String,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Second (To) Chromosome in a translocation pair">
##INFO=<ID=REGIONID,Number=.,Type=String,Description="The parent variant region accession(s)">
##INFO=<ID=EXPERIMENT,Number=1,Type=Integer,Description="The experiment_id (from EXPERIMENTS tab) of the experiment that was used to generate this call">
##INFO=<ID=EVENT,Number=.,Type=String,Description="The parent variant region accession of a mutation event">
##INFO=<ID=LINKS,Number=.,Type=String,Description="Link(s) to external database(s) - see LINKS tab of dbVar submission template for examples">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
##INFO=<ID=CLNACC,Number=.,Type=String,Description="Accessions and version numbers assigned by ClinVar">
##INFO=<ID=clinical_source,Number=1,Type=String,Description="Source of clinical significance">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates that the record is a somatic mutation. NOT for clinical assertions, i.e. cancer. See also ORIGIN.">
##INFO=<ID=ORIGIN,Number=1,Type=String,Description="Origin of allele, if known; should be one of (biparental, de novo, germline, inherited, maternal, not applicable, not provided, not-reported, paternal, tested-inconclusive, uniparental, unknown, see ClinVar for details). See also SOMATIC">
##INFO=<ID=PHENO,Number=.,Type=String,Description="Phenotype(s) thought to associated with this call. NOT for clinical assertions (submit to ClinVar). (free text, enclose in double quotes)">
##INFO=<ID=SAMPLE,Number=1,Type=String,Description="sample_id from dbVar submission; every call must have SAMPLE or SAMPLESET, but NOT BOTH">
##INFO=<ID=SAMPLESET,Number=1,Type=Integer,Description="sampleset_id from dbVar submission; every call must have SAMPLESET or SAMPLE but NOT BOTH">
##INFO=<ID=VALIDATED,Number=0,Type=Flag,Description="Validated by follow-up experiment">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Variation sequence">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Global Allele count">
##INFO=<ID=AF,Number=.,Type=Float,Description="Global Allele frequency">
##INFO=<ID=AN,Number=.,Type=String,Description="Global Allele name">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=MT>
##contig=<ID=NT_113793.3>
##contig=<ID=NT_113796.3>
##contig=<ID=NT_187361.1>
##contig=<ID=NT_187513.1>
##contig=<ID=NT_187562.1>
##contig=<ID=NT_187576.1>
##contig=<ID=NT_187593.1>
##contig=<ID=NT_187594.1>
##contig=<ID=NT_187600.1>
##contig=<ID=NT_187603.1>
##contig=<ID=NT_187606.1>
##contig=<ID=NT_187613.1>
##contig=<ID=NT_187614.1>
##contig=<ID=NT_187620.1>
##contig=<ID=NT_187633.1>
##contig=<ID=NT_187648.1>
##contig=<ID=NT_187653.1>
##contig=<ID=NT_187660.1>
##contig=<ID=NT_187661.1>
##contig=<ID=NT_187681.1>
##contig=<ID=NT_187682.1>
##contig=<ID=NT_187693.1>
##contig=<ID=NW_003571049.1>
##contig=<ID=NW_003571054.1>
##contig=<ID=NW_003571055.2>
##contig=<ID=NW_003571056.2>
##contig=<ID=NW_003571057.2>
##contig=<ID=NW_003571058.2>
##contig=<ID=NW_003571060.1>
##contig=<ID=NW_003571061.2>
##contig=<ID=NW_009646195.1>
##contig=<ID=NW_009646198.1>
##contig=<ID=NW_009646206.1>
##contig=<ID=NW_009646209.1>
##contig=<ID=NW_011332698.1>
##contig=<ID=NW_011332701.1>
##contig=<ID=NW_012132918.1>
##contig=<ID=NW_015148966.1>
##contig=<ID=NW_015495298.1>
##contig=<ID=NW_018654714.1>
##contig=<ID=X>
##contig=<ID=Y>
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -i 'INFO/CLNSIG ~ "Pathogenic\|Likely_pathogenic"' nstd102.GRCh38.variant_call.vcf.gz; Date=Mon Jan  6 09:00:06 2025
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
##bcftools_viewVersion=1.13+htslib-1.13

Your version of bcftools is just too old: https://github.com/samtools/bcftools/releases/tag/1.13


Unfortunately, bcftools 1.21 produces the same results. The only thing that actually filters is a single filter term, using either of the two styles:

bcftools view -i 'INFO/CLNSIG ~ "Pathogenic"' nstd102.GRCh38.variant_call.vcf.gz

or

bcftools view -i 'CLNSIG="Pathogenic"' nstd102.GRCh38.variant_call.vcf.gz

As soon as I add an OR operator (e.g. "type|type_2"), it just outputs a header-only VCF.


Unfortunately, bcftools 1.21 produces the same results.

That's strange. That syntax works on my machine:

wget -O - "https://github.com/lindenb/jvarkit/raw/refs/heads/master/src/test/resources/gnomad.genomes.r2.0.1.sites.1.vcf.gz" | bcftools view -i 'INFO/VQSR_culprit  ~ "MQ\|FS" ' 
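
If the regex alternation still returns nothing on your side, another thing to try is spelling the OR out at the expression level, since bcftools filtering expressions accept `||` between two complete comparisons. This is a sketch only, assuming the values in the file are spelled exactly `Pathogenic` and `Likely_pathogenic` (not checked against nstd102); `EXPR` and `count_matches` are illustrative names:

```shell
# Two complete comparisons joined by bcftools' logical OR (||),
# so no regex alternation is involved.
# Assumption: these spellings match what is in the file's CLNSIG field.
EXPR='INFO/CLNSIG ~ "Pathogenic" || INFO/CLNSIG ~ "Likely_pathogenic"'

# Hypothetical helper: count records passing EXPR in the VCF given as $1.
# -H suppresses the header, so wc -l counts variant lines only.
count_matches() {
    bcftools view -H -i "$EXPR" "$1" | wc -l
}

# Usage (assumed local copy of the dbVar file from the question):
# count_matches nstd102.GRCh38.variant_call.vcf.gz
```

Note that `~` is case-sensitive, so `"Pathogenic"` on its own would not pick up a value written as `Likely_pathogenic`.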