Remove duplicate SNPs by allele frequency in PLINK
3.0 years ago
rem • 0

Summary:

I have run into an issue where about 2,000 SNPs occur twice in my plink binary files. That is, each pair shares the same chromosome and position but has different alternate alleles. Using plink, I would like to keep only the variant with the highest alternate allele frequency for each identical chromosome:position pair. I can think of a few ways of doing this, but they are all a bit hacky. What approach do you suggest for this problem? Is there any approach that does not require scripting outside of plink?

What I've tried:

I already have a text file of these duplicates that was automatically generated when plink2 failed to concatenate the files. I'm aware I can remove duplicates automatically using the --rm-dup flag. However, the closest mode this flag offers to what I want is 'force-first', which would only work if the SNPs were already ordered by alternate allele frequency. I'm not sure how to order them that way, or whether it's even possible.

I also thought to calculate the allele frequencies and generate a list of SNPs to remove using --exclude, but I would only want to exclude SNPs with a particular alternate allele and I couldn't find how to do this either. Finally, I definitely think I could implement this by making the variant IDs unique first using --set-missing-var-ids which would resolve the issue of the previous approach, but this strikes me as quite a hacky approach to what seems like a simple problem.

Any suggestions will be much appreciated!


Can you join these variants (with e.g. "bcftools norm -m +any") instead? plink2 can handle multiallelic variants.

Otherwise, you're stuck with information-losing hacks.
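
If joining is not an option, one such information-losing hack is the one outlined in the question: compute allele frequencies, then build an exclusion list outside of plink. The Python sketch below is only an illustration under several assumptions that are not stated in the thread: variant IDs have already been made unique, the input is a tab-delimited frequency table with `#CHROM`, `POS`, `ID` and `ALT_FREQS` columns (something like the output of `plink2 --freq cols=+pos`, which I have not verified against your files), and every row is biallelic so `ALT_FREQS` holds a single number. The function name `ids_to_exclude` and the example IDs are made up for the illustration.

```python
import csv
import io
from collections import defaultdict


def ids_to_exclude(afreq_text):
    """Return IDs of duplicate-position variants that do NOT have the
    highest ALT frequency at their chromosome:position.

    Assumes unique variant IDs and one ALT frequency per row (biallelic).
    """
    reader = csv.DictReader(io.StringIO(afreq_text), delimiter="\t")

    # Group (frequency, ID) pairs by chromosome:position.
    by_site = defaultdict(list)
    for row in reader:
        site = (row["#CHROM"], row["POS"])
        by_site[site].append((float(row["ALT_FREQS"]), row["ID"]))

    drop = []
    for variants in by_site.values():
        if len(variants) < 2:
            continue  # position occurs once, nothing to remove
        # max() returns the first maximal item, so ties keep the
        # variant that appears first in the file.
        best_id = max(variants, key=lambda v: v[0])[1]
        drop.extend(vid for _, vid in variants if vid != best_id)
    return drop


# Hypothetical example table; IDs follow a chrom:pos:ref:alt pattern.
example = (
    "#CHROM\tPOS\tID\tREF\tALT\tALT_FREQS\tOBS_CT\n"
    "1\t1000\t1:1000:A:G\tA\tG\t0.12\t200\n"
    "1\t1000\t1:1000:A:T\tA\tT\t0.03\t200\n"
    "1\t2000\t1:2000:C:T\tC\tT\t0.40\t200\n"
)

print("\n".join(ids_to_exclude(example)))
```

The printed IDs can be saved to a text file, one per line, and passed to plink2 with --exclude. Note that this discards the lower-frequency allele entirely, which is exactly the information loss mentioned above.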
