Question: Keep one SNP for duplicate SNPs
janhuang.cn150 wrote:

I converted a VCF file to PLINK bed files, and some SNP IDs appear more than once. I would like to remove the duplicates but keep one record per ID. For example, if rs1234 appears 5 times, I want to keep only one record (maybe the first one).

Right now I use --write-snplist to get the SNP list of the bed file, then use R to count how often each SNP ID occurs and to generate a list of the duplicated IDs. With that list, I use --extract to get a bed file containing only the duplicated SNPs, and --exclude to get a bed file without any of them.

But how can I keep one record for each duplicated SNP? Also, is there a way to do the above steps in plink, without switching to R to generate the duplicate SNP list?
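
The counting step described above (find every ID that occurs more than once and write those IDs out) could be sketched in Python roughly as follows. This is only an illustration, assuming the --write-snplist output holds one variant ID per line; the function name duplicate_ids and the sample IDs are made up for the example.

```python
from collections import Counter


def duplicate_ids(snplist_lines):
    """Return IDs that occur more than once, in order of first appearance."""
    # Strip newlines and skip blank lines.
    ids = [line.strip() for line in snplist_lines if line.strip()]
    counts = Counter(ids)

    reported = set()
    duplicates = []
    for snp_id in ids:
        if counts[snp_id] > 1 and snp_id not in reported:
            reported.add(snp_id)
            duplicates.append(snp_id)
    return duplicates


# Hypothetical snplist content, for illustration only.
example_lines = ["rs1\n", "rs2\n", "rs1\n", "rs3\n", "rs2\n", "rs1\n"]
example_duplicates = duplicate_ids(example_lines)
```

The returned IDs can be written one per line to a file and passed to --extract or --exclude, as in the workflow above.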

snp duplicate • 1.8k views
ADD COMMENTlink modified 21 months ago by Biostar ♦♦ 20 • written 2.2 years ago by janhuang.cn150
prasundutta87360 wrote:

What do you mean by duplicate SNPs? Have they been reported multiple times, i.e. on the same chromosome/contig, at the same coordinates, and with the same REF and ALT alleles as well?

This post may be helpful:

How to filter out duplicate records in a vcf with bcftools?

ADD COMMENTlink written 2.2 years ago by prasundutta87360

Thank you.

I meant that the same SNP ID was reported multiple times in a VCF file (1000G), on the same chromosome.

One example is rs35404796 at chr22:18496882, which was reported three times; the REF allele is always G, but the ALT alleles differ ("GAC", "GACACAC", "GACACACAC").

In another case, rs7410429 was reported twice with the same REF and ALT, but at different positions: chr22:18003597 and chr22:18004254.

ADD REPLYlink written 2.2 years ago by janhuang.cn150

Your first example is not of SNPs; they are insertions, and they are different.
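
To make that distinction concrete, here is a rough sketch that classifies a record by comparing allele lengths, applied to the three ALT alleles quoted above. It assumes one ALT allele per record (no comma-separated or symbolic alleles); variant_type is a name invented for this example.

```python
def variant_type(ref, alt):
    """Classify a single REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same length but longer than one base, e.g. an MNP


# The three rs35404796 records from the post above: REF is always G.
alts = ["GAC", "GACACAC", "GACACACAC"]
types = [variant_type("G", alt) for alt in alts]
```

All three records come out as insertions, and each inserts a different sequence, so they are distinct variants that happen to share one ID.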

I'm not sure why you would want to do what you want to do, but I would write a program to iterate through the VCF file line by line, maintain a hashset of RSIDs, and only retain lines whose RSID has not been seen previously.
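
A minimal sketch of that idea in Python, under a few assumptions: the VCF is tab-delimited with the ID in the third column, records with a missing ID (".") are all kept, and a semicolon-joined ID is treated as one opaque string. The name keep_first_by_id and the sample lines are made up for illustration.

```python
def keep_first_by_id(vcf_lines):
    """Yield header lines and the first record seen for each ID."""
    seen = set()
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header and meta lines pass through unchanged
            continue
        fields = line.split("\t", 3)
        if len(fields) < 3:
            yield line  # not a data record (e.g. blank line); leave it alone
            continue
        snp_id = fields[2]
        if snp_id == ".":
            yield line  # missing ID: cannot be matched by ID, so keep it
            continue
        if snp_id in seen:
            continue  # a record with this ID was already emitted
        seen.add(snp_id)
        yield line


# Hypothetical records modelled on the rs10656307 case in this thread.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "22\t28698027\trs10656307\tA\tAAAT\n",
    "22\t28698027\trs10656307\tA\tAAATAAT\n",
    "22\t28698100\t.\tC\tT\n",
]
kept = list(keep_first_by_id(example))
```

Because the function works on any iterable of lines, it can be fed an open file handle (or a gzip.open handle in text mode) and the result written straight to an output file, so memory use is limited to the set of IDs.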

ADD REPLYlink written 2.2 years ago by Brian Bushnell16k

I was calculating LD using --r2, but it returns Error: Duplicate ID 'rs10656307'. This one also appears to be an insertion: the two rs10656307 records have the same chr:pos (chr22:28698027) and the same REF (A), but different ALT alleles (AAAT and AAATAAT). That is why I want to exclude duplicate records.

ADD REPLYlink written 2.2 years ago by janhuang.cn150

Oh, interesting; that's unfortunate. Well, I still recommend writing a quick program to remove the duplicate RSIDs, as I mentioned above. But if there are only a handful you could easily remove all copies of them via grep instead.

ADD REPLYlink written 2.2 years ago by Brian Bushnell16k

There seem to be more than a handful, and it is a large dataset. Iterating through the VCF line by line sounds very slow, but I will see if I can do that. Thanks.

ADD REPLYlink written 2.2 years ago by janhuang.cn150

Any tool that accomplishes the task would have to iterate through the VCF line by line, though :)

ADD REPLYlink written 2.2 years ago by Brian Bushnell16k

Have you solved the duplicate problem?

You gave an example in which rs7410429 was reported twice in the 1000 Genomes VCF file at different positions: chr22:18003597 and chr22:18004254.

I ran into the same situation. rs13406140 (on chromosome 2) occurs twice in the 1000 Genomes VCF, and the two coordinates for the same RSID are surprisingly far apart:

2 90430223 rs13406140 G A 100 PASS
2 91651998 rs13406140 A G 100 PASS
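
To find every ID affected in this way, one could group records by ID and report those seen at more than one position. The sketch below is only an illustration: it assumes the records are already parsed into (chromosome, position, ID) tuples, ids_at_multiple_positions is a hypothetical helper name, and the data are the records quoted in this thread.

```python
from collections import defaultdict


def ids_at_multiple_positions(records):
    """Map each ID found at more than one (chrom, pos) to its sorted positions."""
    positions = defaultdict(set)
    for chrom, pos, snp_id in records:
        positions[snp_id].add((chrom, pos))
    return {
        snp_id: sorted(locations)
        for snp_id, locations in positions.items()
        if len(locations) > 1
    }


records = [
    ("2", 90430223, "rs13406140"),
    ("2", 91651998, "rs13406140"),
    # Same position three times: repeated ID, but only one location.
    ("22", 18496882, "rs35404796"),
    ("22", 18496882, "rs35404796"),
    ("22", 18496882, "rs35404796"),
]
conflicts = ids_at_multiple_positions(records)
```

Here only rs13406140 is reported, because rs35404796 is repeated at a single position with different ALT alleles.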

I looked through the official 1000 Genomes Q&A and found what looks like a relevant entry, "Why are there duplicate calls in the phase 3 call set": http://www.internationalgenome.org/category/variants/

I'm still unsure about this: how can a single RSID map to two different positions? Can anyone help?

ADD REPLYlink written 14 months ago by keryruo10