Question: How to remove duplicate SNPs from PLINK ped/vcf files?
Tania (120) wrote:

Hi everyone,

I get this error when I try to split a VCF that I generated from PLINK format:

[E::bcf_hdr_add_sample] Duplicated sample name '103_Sp676'

These are the steps I use to remove duplicates; however, I still get the same error:

data.bed: 
        plink --file data --maf 0.05 --make-bed --out data

data.DuplicatesRemoved.bed: 
        plink --bfile data --list-duplicate-vars ids-only suppress-first
        plink --bfile data --exclude plink.dupvar --make-bed --out data.DuplicatesRemoved

data.DuplicatesRemoved.ped: 
        plink --bfile data.DuplicatesRemoved --recode --tab --out data.DuplicatesRemoved

data.DuplicatesRemoved.vcf: 
        plink --file data.DuplicatesRemoved --recode vcf --out data.DuplicatesRemoved

Any help on how to fix this?

Thanks

Tags: snp, plink
modified 10 months ago by Kevin Blighe (41k) • written 10 months ago by Tania (120)

I also added the plink tag to your post. That way, it may be picked up by the person who is much more experienced in plink than anyone else here on Biostars.

Reply written 10 months ago by Kevin Blighe (41k)
Kevin Blighe (41k) wrote:

Your issue is a duplicate sample name, not duplicate variants. Your sample that's duplicated is 103_Sp676.

Please try to understand why this sample is duplicated, and then manage the issue appropriately.
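One way to see which samples are affected is to count how often each sample ID occurs in the .fam (or .ped) file. The sketch below is only an illustration: it assumes the VCF sample name is built as FID_IID from the first two columns, and it runs on a small made-up demo file rather than the real data.

```shell
# Hypothetical demo file standing in for data.DuplicatesRemoved.fam
fam=$(mktemp)
cat > "$fam" <<'EOF'
103 Sp676 0 0 1 -9
103 Sp676 0 0 1 -9
104 F201 0 0 2 -9
EOF

# Count each FID_IID pair and report the ones that occur more than once
dups=$(awk '{ n[$1 "_" $2]++ }
            END { for (id in n) if (n[id] > 1) print id, n[id] }' "$fam")
echo "$dups"
```

On the real data, point the awk command at the actual .fam file instead of the demo file.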

Kevin

written 10 months ago by Kevin Blighe (41k)

Thanks Kevin. I have many samples duplicated like this. How can I work out the reason for the duplication? Is it something to check in the data generation itself, or something computational I should look for? Sorry if this seems naive, but I am completely new here :)

Reply written 10 months ago by Tania (120)

Oh hey Tania. I answered without realising it was you! I would have been nicer :)

What is the source of the data?

Reply written 10 months ago by Kevin Blighe (41k)

No worries :) Thanks a lot for helping me :)

This is SNP-array data that was handed to me a few days ago, for some patients. I have to find out the reason for the phenotype they have, so I am trying to get a VCF and then go from there. Each of these codes (Spxxx, Fxxx) is a patient, so I really don't know why they are duplicated, especially since the data in the ped is sometimes slightly different between the duplicates. For example, at some position it is a G in the ped, but at the same position in the duplicate it is a zero. So they are not even identical, which means I cannot simply remove one of them manually.
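To get a feel for how much two duplicated rows actually differ, the allele columns of the .ped file (column 7 onward) can be compared position by position. The following is a rough sketch on a made-up two-row demo file. It assumes the two rows of interest have already been extracted, and that 0 is the missing-allele code, which is the PLINK default.

```shell
# Hypothetical demo .ped with the same sample listed twice
ped=$(mktemp)
cat > "$ped" <<'EOF'
103 Sp676 0 0 1 -9 G G A C T T
103 Sp676 0 0 1 -9 0 0 A C T T
EOF

# Row 1: remember the alleles. Row 2: compare them column by column.
# A difference involving 0 means "missing in one copy"; anything else is a real mismatch.
summary=$(awk 'NR == 1 { for (i = 7; i <= NF; i++) a[i] = $i; next }
               NR == 2 { for (i = 7; i <= NF; i++) if ($i != a[i]) {
                             if ($i == "0" || a[i] == "0") miss++; else diff++ } }
               END { printf "missing_in_one=%d true_mismatch=%d\n", miss, diff }' "$ped")
echo "$summary"
```

If almost all differences are missing calls rather than true mismatches, that would be consistent with the rows being technical replicates of the same sample, though it does not prove it.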

Reply written 10 months ago by Tania (120)

Hmm... maybe replicates of the same sample?

For recoding as VCF, you may want to try --recode vcf-fid or --recode vcf-iid, as mentioned here: https://www.cog-genomics.org/plink/1.9/data#recode

That will most likely produce the same issue, in which case you could update the sample IDs: https://www.cog-genomics.org/plink/1.9/data#update_indiv

For that to work smoothly, you should know the exact order of the samples in the PED file.
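As an illustration of an order-based rename, the sketch below rewrites a .fam file so that every repeated FID/IID pair gets a suffix on its second and later occurrences. Note that this edits the .fam directly instead of using --update-ids, so it is a swapped-in alternative and not the method from the links above. The demo file and the _rep suffix are made up for the example, and the original .fam should be backed up before trying anything like this on real data.

```shell
# Hypothetical demo file standing in for data.DuplicatesRemoved.fam
fam=$(mktemp)
fixed=$(mktemp)
cat > "$fam" <<'EOF'
103 Sp676 0 0 1 -9
103 Sp676 0 0 1 -9
104 F201 0 0 2 -9
EOF

# Keep the row order unchanged; append _rep2, _rep3, ... to the IID of repeated pairs
awk '{ key = $1 "_" $2
       seen[key]++
       if (seen[key] > 1) $2 = $2 "_rep" seen[key]
       print }' "$fam" > "$fixed"
cat "$fixed"

# On real data, the fixed file would replace the .fam (after a backup), and then e.g.:
#   plink --bfile data.DuplicatesRemoved --recode vcf --out data.DuplicatesRemoved
```

Because the rows stay in the same order, the renamed .fam still lines up with the genotypes in the .bed file.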

One wonders how they created the duplicate samples in the first place.

Reply written 10 months ago by Kevin Blighe (41k)

Thanks so much, Kevin. I will follow the links you mentioned and see how it goes.

Reply written 10 months ago by Tania (120)