Question: How do I remove duplicated by position SNPs using PLink?
0
gravatar for bha
3.2 years ago by
bha60
bha60 wrote:

I am working with PLINK to analyse SNP chip data.

Does anyone know how to remove duplicated SNPs (duplicated by position)?

snp plink • 15k views
ADD COMMENTlink modified 3.2 years ago by Kevin Blighe69k • written 3.2 years ago by bha60

The --list-duplicate-vars suppress-first parameter allows you to identify them, with --exclude in a following command then allowing you to remove them. They are identified by having the same position and ref/var calls.

ADD REPLYlink modified 16 months ago • written 3.2 years ago by Kevin Blighe69k

Why would a VCF have variants with the same chr, bp AND alleles? Spliiting multi-allelic variants can produce position-based "duplicates", but position AND alleles duplicated? That is just dirty data!

ADD REPLYlink written 3.2 years ago by _r_am32k

It happens quite frequently, unfortunately. Sometimes, even the same SNP rs ID can refer to 2 different locations in the genome. It's due to different genome builds.

ADD REPLYlink written 3.2 years ago by Kevin Blighe69k
1
gravatar for bha
3.2 years ago by
bha60
bha60 wrote:
Options in effect:
  --bfile HighDensity
  --exclude supress-first plink.dupvar
  --out HighDensity.DuplicatesRemoved

Error: Invalid --exclude parameter sequence.

it doesn't accept exclude, any other option?

ADD COMMENTlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by bha60

"suppress-first" is misspelled in your command line.

ADD REPLYlink written 3.2 years ago by chrchang5237.6k

I tried with correct spelling, and --extract as well, but same error message

Options in effect:
  --bfile HighDensity
  --extract suppress-first plink.dupvar
  --out HighDensity.DuplicatesRemoved

Error: Invalid --extract parameter sequence.
ADD REPLYlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by bha60
3

Oh, sorry, "suppress-first" is a --list-duplicate-vars modifier, not an --exclude modifier. You also want to include "ids-only" in your --list-duplicate-vars step when your goal is just to generate a file for --exclude's use.

So, the following should work:

plink --bfile HighDensity --list-duplicate-vars ids-only suppress-first
plink --bfile HighDensity -exclude plink.dupvar --out HighDensity.DuplicatesRemoved
ADD REPLYlink written 3.2 years ago by chrchang5237.6k
1

Here, you might be have a bug. When there is duplicated SNP with same ID, --exclude will remove all the SNPs simultaneously. If you want to --exclude, be sure to change the ID name to different one advanced.

ADD REPLYlink written 2.5 years ago by Shicheng Guo8.5k

Hi bha, I would highly recommend that you follow the guidance of chrchang523 for this particular problem.

ADD REPLYlink written 3.2 years ago by Kevin Blighe69k

Hi Kevin and chrchang523, many thanks for prompt help. Well, my first problem is sorted with above commands, thanks for that. My ultimate goal is to merge two plink files (HighDensity and LowDensity), after removing the duplicates in HighDensity, when I tried to merge two files I still got this warnings:

Warning: Variants 'oar3_OAR1_169947942' and 'OAR1_183082360.1' have the same position.
Warning: Variants 'oar3_OAR1_224944169' and 'OAR1_242468071.1' have the same position.
Warning: Variants 'oar3_OAR2_34247858' and 'OAR2_35595985.1' have the same position.
Warning: Variants 's46929.1' and 'oar3_OAR2_74519162' have the same position.
Warning: Variants 's73098.1' and 'oar3_OAR3_20274083' have the same position.
Warning: Variants 'oar3_OAR3_34411557' and 'OAR3_36830729.1' have the same position.

It seems some variants (42) still don't have right position, any idea how i can fix that?

ADD REPLYlink modified 3.2 years ago by GenoMax94k • written 3.2 years ago by bha60

What if you try without ids-only?

plink --bfile HighDensity --list-duplicate-vars suppress-first
plink --bfile HighDensity -exclude plink.dupvar --out HighDensity.DuplicatesRemoved

Also, if you try the merge command without removing duplicates, from my experience, will automatically identify the duplicates between the datasets you're merging and will output them to a file, which can then be used with --exclude

ADD REPLYlink written 3.2 years ago by Kevin Blighe69k

It doesn't work without ids-only, same error as with ids-only. Well, I tried to merge without removing duplicates, but same warning messages. I am using PLINK v1.90b3v 64-bit (15 Jul 2015). Do you think it will make difference if i try some other version? what version you were using for merging? I use this for merge:

plink --bfile LowDensity --bmerge HighDensity --make-bed --out mergewithduplicates
ADD REPLYlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by bha60
1

Hey, I go through the process of duplicate removal and then merging of datasets here: Produce PCA for 1000 Genomes Phase III in VCF format

In summary, how I normally do it is:

plink --noweb --bfile MyData --list-duplicate-vars ;
cut -f4 plink.dupvar | cut -f1 -d" " > Duplicates.list ;
plink --noweb --bfile MyData --exclude Duplicates.list --make-bed --out MyDataLessDuplicates ;

For merging, I use:

plink --merge-list ForMerge.list --out MergedData ;

ForMerge.list just contains the PLINK datasets that I want to merge (BIM files).


Nota bene:

I think that the key that my colleague and I may have missed is that the output of --list-duplicate-vars, i.e., plink.dupvar, is not compatible with --exclude. --exclude just expects a single-column file of variant IDs that you want to exclude (or variant IDs separated by spaces). So, something like:

oar3_OAR1_169947942

OAR1_183082360.1

oar3_OAR1_224944169

OAR1_242468071.1

oar3_OAR2_34247858

OAR2_35595985.1.

s46929.1

oar3_OAR2_74519162

s73098.1

oar3_OAR3_20274083

oar3_OAR3_34411557

With --list-duplicate-vars I would still add suppress-first after it, just so that 1 copy of the duplicate will be retained (otherwise, both ar removed).


I also use PLINK 1.9, and virtually the same release as you:

/Programs/plink1.90/plink
PLINK v1.90b3.38 64-bit (7 Jun 2016)       https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang   GNU General Public License v3
ADD REPLYlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by Kevin Blighe69k
1

--list-duplicate-vars's ids-only modifier bridges the gap with --exclude.

The --merge-equal-pos flag causes variants with identical positions to be merged, keeping the IDs in the first file. However, this is conservative: if there's a single pair of same-position variants which have more than 2 distinct alleles between them, the merge will error out. plink 1.x is not designed to handle triallelic SNPs.

ADD REPLYlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by chrchang5237.6k
1

Why would multiple single position variant entries exist for biallelic variants? To me, that just sounds like a dirty VCF file, not a legitimate use case.

ADD REPLYlink written 3.2 years ago by _r_am32k

Yes, this is why I always normalise VCFs with:

bcftools norm -m-any My.VCF | bcftools norm --check-ref w -f MyRefGenome.fasta | bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT' > Normalised.bcf ;

bcftools index Normalised.bcf ;

That code also assigns a unique identifier to each variant in the ID field of the VCF, as chr:pos:pos:ref:alt.

...and then read these into PLINK.

ADD REPLYlink modified 2.4 years ago • written 3.2 years ago by Kevin Blighe69k

But then you end up losing the rs# annotation no?

ADD REPLYlink written 3.2 years ago by _r_am32k

Also, why do you use -x ID -I +<string>? Would just -I <string> suffice?

https://samtools.github.io/bcftools/bcftools-man.html#annotate

ADD REPLYlink written 3.2 years ago by _r_am32k

Yes, but the rs ID system of naming variants is problematic because it's not curated sufficiently. A single rs ID can relate to multiple variants at the same position. Also, a single rs ID can relate to variants at more than 1 position in the genome. The fundamental issue being that rs IDs are not unique at all...

It's too risky working with rs IDs on clinical data, where mix-ups / mess-ups just aren't allowed...

In the code above, you have to use -x to first delete the ID field. Without -x, it will not update it to what you specify with -I.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Kevin Blighe69k
1

Yeah, we don't use rsIDs for anything significant, all annotation is done by position. The manual says By default all existing IDs are replaced. If the format string is preceded by "+", only missing IDs will be set., so I don't see why -x would be needed unless the documentation is wrong.

ADD REPLYlink written 3.2 years ago by _r_am32k
0
gravatar for bha
3.2 years ago by
bha60
bha60 wrote:
plink --bfile HighDensity --list-duplicate-vars --out test
plink --bfile HighDensity --exclude test.dupvar --out test1

Sorry, I couldn't get where --exclude suppress-first comes in above command, can you please elaborate?

ADD COMMENTlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by bha60
1

Please do not add answers unless you intend answering your own question. This post belongs as a comment on Kevin's answer. Also, please use the 101010 button on the formatting bar to format your code. See for instance Kevin's comment below and you'll see the difference.

ADD REPLYlink written 3.2 years ago by _r_am32k

Hi bha,

It can be executed like this::

plink --bfile HighDensity --list-duplicate-vars
plink --bfile HighDensity --exclude suppress-first plink.dupvar --out HighDensity.DuplicatesRemoved
ADD REPLYlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by Kevin Blighe69k
Options in effect: --bfile HighDensity --exclude supress-first plink.dupvar --out HighDensity.DuplicatesRemoved
Error: Invalid --exclude parameter sequence.

it doesn't accept exclude, any other option? I also tried with --extract, but doesn't work

ADD REPLYlink modified 3.2 years ago by _r_am32k • written 3.2 years ago by bha60
1

Can you paste the contents of plink.dupvar and also mention your PLINK version?

If you can, also, output a few lines from your map file?

ADD REPLYlink written 3.2 years ago by Kevin Blighe69k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1717 users visited in the last hour
_