Filtering VCF by specific tag in INFO field
13 months ago
avelarbio46 ▴ 30

Hello everyone! I'm trying to filter the dbSNP VCF by using the INFO field.

These are the INFO fields I want:

INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance; 0 - Uncertain significance; 1 - not provided; 2 - Benign; 3 - Likely benign; 4 - Likely pathogenic; 5 - Pathogenic; 6 - Drug response; 8 - Confers sensitivity; 9 - Risk factor; 10 - Association; 11 - Protective; 12 - Conflicting interpretations of pathogenicity; 13 - Affects; 14 - Association not found; 15 - Benign/Likely benign; 16 - Pathogenic/Likely pathogenic; 17 - Conflicting data from submitters; 18 - Pathogenic, low penetrance; 19 Likely pathogenic, low penetrance; 20 - Established risk allele; 21 - Likely risk allele; 22 - Uncertain risk allele; 255 - other">

INFO=<ID=PM,Number=0,Type=Flag,Description="Variant has associated publication">

I want to grab all SNPs that have any CLNSIG value, and those that have the PM tag.

When I try:

bcftools filter -i 'INFO/CLNSIG>0' GCF_000001405.40.gz -Oz -o  vcf_known_sites_newer/GCF_000001405.40_clinvar.gz

Or

bcftools filter -i 'INFO/CLNSIG>1' GCF_000001405.40.gz -Oz -o  GCF_000001405.40_clinvar.gz

Or

bcftools filter -i 'INFO/CLNSIG=1'  -i 'INFO/CLNSIG=2' etc up to 22 GCF_000001405.40.gz -Oz -o  GCF_000001405.40_clin.gz

I'm getting empty VCFs. I also tried view instead of filter. When using bcftools filter with -i 'INFO/PM=0', -i 'INFO/PM=1', or -i 'INFO/PM', I also get an empty VCF. I have no idea what I'm missing.

VCF from dbSNP (it's huge): dbsnp VCF

bcftools vcf • 768 views
13 months ago
cfos4698 ★ 1.1k

Without an example of the VCF, it's hard to know for sure. However, I'm guessing it's because INFO/CLNSIG is a string ("Type=String") rather than a float/integer, so you can't use numeric comparisons such as > to subset the VCF.

Try: bcftools view -i 'INFO/CLNSIG != "0"' GCF_000001405.40.gz

Edit: now that you have provided the VCF, I downloaded the first 700,000 or so records, then subset the file to those with CLNSIG info. I can see that the CLNSIG tag (present on 3293 variants in this sample) has the form 'CLNSIG=.,NUMBER', e.g. 'CLNSIG=.,1'. If there are multiple ALT alleles, the tag can be more complex. For example, if the ALT value is 'A,G,T', the CLNSIG tag can be 'CLNSIG=.,.,0,0'. Therefore, subsetting with bcftools will be complex.
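To see which shapes of CLNSIG value actually occur before writing a filter expression, something like the following could be used. This is only a sketch: clnsig_values is a made-up helper name, it uses plain awk rather than bcftools, and it assumes the INFO column is the eighth tab-separated column with tags separated by ';'.

```shell
# Tabulate the distinct CLNSIG values found in the INFO column (column 8)
# of a VCF stream read from stdin. Header lines (starting with '#') are skipped.
# clnsig_values is a hypothetical helper name, not part of bcftools.
clnsig_values() {
  awk -F'\t' '
    /^#/ { next }
    {
      n = split($8, fields, ";")
      for (i = 1; i <= n; i++)
        if (fields[i] ~ /^CLNSIG=/)
          print substr(fields[i], 8)   # drop the 7-character "CLNSIG=" prefix
    }
  ' | sort | uniq -c | sort -k1,1nr
}

# Possible usage on the first chunk of the real file:
#   bcftools view -H GCF_000001405.40.gz | head -n 700000 | clnsig_values
```

Each output line is a count followed by one raw value (for example '.,1'), which makes it easier to decide how multiallelic sites should be handled.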

You might need to play around to make sure you're properly filtering multiallelic sites in the way that you want. Consider looking at array subscripts (https://samtools.github.io/bcftools/bcftools.html#expressions).

An example command could be: bcftools view -e 'INFO/CLNSIG[0-] == "0" & INFO/CLNSIG[0-] == "."' input.vcf.gz
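If the bcftools expression turns out to be awkward for multiallelic sites, a plain awk pass over the INFO column is one possible fallback. The sketch below is not a bcftools feature: filter_clnsig_or_pm is a made-up name, it assumes the two criteria are combined with OR, and it counts every CLNSIG entry other than '.' as a value (including 0, "Uncertain significance").

```shell
# Keep VCF header lines plus records whose INFO column (column 8) either
# carries the PM flag or has a CLNSIG tag with at least one entry other than ".".
# Assumptions: criteria are combined with OR; "0" counts as a value.
# filter_clnsig_or_pm is a hypothetical helper name.
filter_clnsig_or_pm() {
  awk -F'\t' '
    /^#/ { print; next }
    {
      keep = 0
      n = split($8, fields, ";")
      for (i = 1; i <= n; i++) {
        if (fields[i] == "PM") keep = 1
        if (fields[i] ~ /^CLNSIG=/) {
          # split the comma-separated entries after "CLNSIG="
          m = split(substr(fields[i], 8), vals, ",")
          for (j = 1; j <= m; j++)
            if (vals[j] != ".") keep = 1
        }
      }
      if (keep) print
    }
  '
}

# Possible usage:
#   bcftools view GCF_000001405.40.gz | filter_clnsig_or_pm | bgzip > GCF_000001405.40_clin.vcf.gz
```

To drop "Uncertain significance" entries as well, the inner test could be changed to also skip entries equal to "0".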
