Question

Filtering of tricky overlapping sites in VCF

0

Entering edit mode

13 months ago

cfos4698 ★ 1.1k

I'm working on a project where I'm working with variants ranging from low frequency (AF~=0.03) up to high frequency (AF=1) after variant calling with lofreq. There are some cases where positions in the vcf are duplicated and this is leading to issues with my downstream processing of the data. Consider the following:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1
NC_045512.2 26299   .   C   CT  800 PASS    DP=3085;AF=0.066775;SB=192;DP4=1321,1539,162,44;INDEL;HRUN=5;TYPE=INDEL GT  1
NC_045512.2 26299   .   C   CTT 77  PASS    DP=3085;AF=0.007131;SB=73;DP4=1321,1539,22,0;INDEL;HRUN=5;TYPE=INDEL    GT  1
NC_045512.2 26299   .   C   T   49314   PASS    DP=3090;AF=0.889968;SB=12;DP4=175,150,1324,1426;TYPE=SNP    GT  1

In these cases, I would like to keep the variant at the position that has the highest allele frequency. So, after filtering, I'd hope to only have the following:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1
NC_045512.2 26299   .   C   T   49314   PASS    DP=3090;AF=0.889968;SB=12;DP4=175,150,1324,1426;TYPE=SNP    GT  1

Is there a simple way to filter out variants like these with duplicated positions using (e.g.) bcftools? I've looked through the options for bcftools norm and bcftools filter, but the options to remove duplicates don't seem to have the flexibility I need for these cases (unless I'm mistaken). Otherwise I can write function using (e.g.) cyvcf2 in python, but I'd prefer to stick with bcftools. Thanks!

vcf bcftools • 781 views

ADD COMMENT • link 13 months ago by cfos4698 ★ 1.1k

0

Entering edit mode

13 months ago

cfos4698 ★ 1.1k

While I've accepted Pierre's excellent answer, in case this is ever of use for anyone else, I'm also including here a way to do it using the cyvcf2 library in python:

from cyvcf2 import VCF, Writer

input_vcf = '/path/to/sample.vcf.gz'
output_vcf = '/path/to/sample.modified.vcf.gz'

vcf = VCF(input_vcf, strict_gt=True)

vcf.add_format_to_header({
    'ID': 'AF',
    'Description': 'Allele Frequency',
    'Type': 'Float',
    'Number': '1'
})

w = Writer(output_vcf,vcf,mode="wz")

variants = {}
# iterate through VCF generator
for v in vcf:
    # see if variant has a duplicated position
    if str(v.POS) in variants.keys():
        # test if the current variant has a greater AF than previous variant
        if variants[str(v.POS)].INFO['AF'] < v.INFO['AF']:
            del variants[str(v.POS)]
        else:
            continue
    # curate variant
    v.set_format('AF',np.array([v.INFO['AF']]))
    # add to variants dict
    variants[str(v.POS)] = v

# write variants to new file
for v in variants.items():
    w.write_record(v[1])

w.close()

Could come in handy if you're already modifying the VCF in other ways using python.

ADD COMMENT • link 13 months ago by cfos4698 ★ 1.1k

score 3 · Accepted Answer · 2023-03-07

I wrote https://jvarkit.readthedocs.io/en/latest/Biostar9556602/

$ grep -v "##" jeter.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   905593  .   A   T   3887.15 PASS    AF=0.1
1   905594  rs867360694 A   T   3887.15 PASS    AF=0.1
1   905594  rs867360694 A   TT  3887.15 PASS    AF=0.2
1   905594  rs867360694 A   TTT 3887.15 PASS    AF=0.3
1   905595  .   A   T   3887.15 PASS    AF=0.1
1   905595  .   A   G   3887.15 PASS    AF=0.01


$ java -jar dist/jvarkit.jar biostar9556602 --sorter HIGHEST_AF  jeter.vcf --filter ZZZZZZZZZZZZ | grep -v "##"
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   905593  .   A   T   3887.15 PASS    AF=0.1
1   905594  rs867360694 A   TTT 3887.15 PASS    AF=0.3
1   905594  rs867360694 A   TT  3887.15 ZZZZZZZZZZZZ    AF=0.2
1   905594  rs867360694 A   T   3887.15 ZZZZZZZZZZZZ    AF=0.1
1   905595  .   A   T   3887.15 PASS    AF=0.1
1   905595  .   A   G   3887.15 ZZZZZZZZZZZZ    AF=0.01

$ java -jar dist/jvarkit.jar biostar9556602 --sorter LOWEST_AF  jeter.vcf  | grep -v "##"#CHROM POS ID  REF ALT QUAL    FILTER  INFO
1   905593  .   A   T   3887.15 PASS    AF=0.1
1   905594  rs867360694 A   T   3887.15 PASS    AF=0.1
1   905595  .   A   G   3887.15 PASS    AF=0.01