Removing both duplicates from a VCF
4.3 years ago

Hi,

I'm aware of ways to merge duplicates in a VCF, like this one. In my VCF, however, there appear to be duplicate variants with conflicting calls. These are not multiallelic sites; the same variant is simply listed twice, with different genotype calls. Example:

chr7 chr7_12345_AT_A_b38 AT A 0/0 0/0 0/1 0/1...

chr7 some_rsID_identifier AT A 0/0 0/0 0/0 0/0...

Because of the conflicting calls, I don't trust either of these entries, and I want to remove them both. I don't know how frequently this occurs, so I just want to remove any variants that have duplicates in the VCF. Unfortunately, the BCFtools --rm-dup flag just keeps the first record.

I can obviously write a script to just remove both entries. Is there a tool with this functionality? Is there a flag in BCFtools that I'm missing?

Thanks!

4.3 years ago

plink2 is not a fully-general VCF processor, but if you don't have extra per-genotype fields or super-long indels, the following should work:

plink2 --vcf <input path> \
       --set-all-var-ids @_#_\$r_\$a \
       --new-id-max-allele-len 7500 \
       --rm-dup exclude-all \
       --export vcf bgz \
       --out <output path prefix; .vcf.gz automatically appended>

See https://www.cog-genomics.org/plink/2.0/filter#rm_dup for details, and https://www.cog-genomics.org/plink/2.0/data#recover_var_ids for a way to recover the original variant IDs (plink2 --rm-dup only checks identical-ID variant groups, so --set-all-var-ids is needed to invoke the position-and-allele-based deduplication you want).

You can replace 'exclude-all' with 'exclude-mismatch' to keep a single instance of each variant where the records are identical. (In addition to identical genotypes, plink2 requires identical INFO field values, etc. here.)
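
If plink2's restrictions are a problem for a given file, the "remove both entries" script mentioned in the question can be sketched as a two-pass awk filter. This is only a minimal sketch, not a tested replacement for the plink2 command: the function name drop_all_dups is made up for illustration, and it assumes an uncompressed, tab-delimited VCF with the standard column order (CHROM, POS, ID, REF, ALT, ...) in which duplicates have exactly matching CHROM, POS, REF and ALT strings.

```shell
# Hypothetical helper (name is illustrative): print a plain-text VCF with
# every record whose (CHROM, POS, REF, ALT) key occurs more than once
# removed, including the first occurrence.
# Assumes standard VCF columns: $1=CHROM, $2=POS, $4=REF, $5=ALT.
drop_all_dups() {
    # The same file is passed twice: pass 1 counts keys, pass 2 filters.
    awk -F'\t' '
        /^#/      { if (NR != FNR) print; next }            # headers: print on pass 2 only
        NR == FNR { seen[$1 FS $2 FS $4 FS $5]++; next }    # pass 1: count each key
        seen[$1 FS $2 FS $4 FS $5] == 1                     # pass 2: keep unique keys
    ' "$1" "$1"
}
```

Usage would look like `drop_all_dups input.vcf > output.vcf`. The variant ID column is deliberately ignored, so the two records in the question's example (one with a position-based ID, one with an rsID) would count as duplicates of each other. Unlike 'exclude-mismatch', this drops duplicated records even when their genotypes agree, and it will not catch duplicates whose alleles are written differently (for example, indels that have not been left-aligned).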

This worked out perfectly. I don't think this option is in PLINK 1.9, because I wasn't aware of it.
