How to remove duplicate SNP rows in a VCF using bcftools norm
4.2 years ago
evelyn ▴ 230

Hello,

I am trying to remove duplicate SNP rows from a multi-sample VCF file. The SNPs are at different positions, but many of the rows appear more than once. I tried using

bcftools norm -d in.vcf -o out.vcf

but it does not work. Is there any other way to remove duplicates from a VCF file without changing the file format? Thank you!

SNP • 7.3k views
3.2 years ago

That's pretty strange to me as well! Neither bcftools norm nor bcftools concat removed the duplicates from my VCF file.

So I used a different approach:

grep "^#" myfile.vcf > header                    ## copy the header lines (those starting
                                                 ## with "#") into a file named "header"
grep -v "^#" myfile.vcf | sort | uniq >> header  ## take the remaining record lines, sort
                                                 ## them so identical rows are adjacent,
                                                 ## drop the repeats with uniq, and append
                                                 ## the result to "header", which then
                                                 ## holds the complete deduplicated VCF.
                                                 ## Note: plain sort orders rows as text,
                                                 ## not by chromosome and position.

I checked that the resulting file still works with bcftools. It does!
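If the original record order matters, the same idea can be done without sorting. The following is only a sketch under stated assumptions, not part of the answer above: the file names `in.vcf` and `dedup.vcf` are placeholders, the second variant in the sample data is invented for illustration, and only rows that are identical character for character are treated as duplicates.

```shell
#!/bin/sh
# Build a tiny example VCF (hypothetical data) containing one exact duplicate row.
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '##contig=<ID=chr1,length=249250621>' \
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' \
  > in.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n' >> in.vcf
printf 'chr1\t977330\trs2799066\tT\tC\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf
printf 'chr1\t977330\trs2799066\tT\tC\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf
printf 'chr1\t977570\t.\tG\tA\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf

# Print every header line unchanged; print a record line only the first
# time that exact line is seen. The input order is preserved.
awk '/^#/ { print; next } !seen[$0]++' in.vcf > dedup.vcf

# Count the remaining records (prints 2 for the sample above).
grep -vc '^#' dedup.vcf
```

Because nothing is sorted, a file that was coordinate-sorted before stays coordinate-sorted afterwards. Rows that describe the same variant but differ in any field (QUAL, INFO, a genotype) are kept, since they are not identical lines.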

4.2 years ago

Shouldn't it be:

bcftools norm -D in.vcf -o out.vcf

(uppercase D)?


@finswimmer, thank you! I have tried -D as well, but the output is identical to the input file; no duplicates are removed.
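A quick way to check what kind of duplicates a file actually contains is to compare rows on the fields that define a variant (CHROM, POS, REF, ALT) and, separately, on the whole line. This is a rough sketch only: the file name `in.vcf` and the output names are placeholders, and the sample records are invented so that two rows share a site but differ in QUAL.

```shell
#!/bin/sh
# Hypothetical sample: two rows at the same site with different QUAL values,
# plus one unrelated row.
printf '##fileformat=VCFv4.2\n' > in.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n' >> in.vcf
printf 'chr1\t977330\trs2799066\tT\tC\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf
printf 'chr1\t977330\trs2799066\tT\tC\t200\tPASS\t.\tGT\t0/1\n' >> in.vcf
printf 'chr1\t977570\t.\tG\tA\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf

# Sites (CHROM, POS, REF, ALT) that occur on more than one row.
grep -v '^#' in.vcf | cut -f1,2,4,5 | sort | uniq -d > dup_sites.txt

# Rows that are repeated character for character.
grep -v '^#' in.vcf | sort | uniq -d > dup_rows.txt

wc -l < dup_sites.txt   # number of duplicated sites
wc -l < dup_rows.txt    # number of exactly repeated rows
```

If the first count is non-zero and the second is zero, the repeated rows differ somewhere outside the site fields, so a line-based `uniq` would leave them in place.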


Hm, do you have an example of your vcf file?

This works for me. Here is the input file (in.vcf), followed by the command and its output:

##fileformat=VCFv4.2
##contig=<ID=chr1,length=249250621>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1
chr1    977330  rs2799066   T   C   225 PASS    .   GT  0/1
chr1    977330  rs2799066   T   C   225 PASS    .   GT  0/1
$ bcftools norm -D in.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=249250621>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_normVersion=1.10.1+htslib-1.10.2
##bcftools_normCommand=norm -D 1.vcf; Date=Fri Feb  7 21:22:54 2020
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1
chr1    977330  rs2799066   T   C   225 PASS    .   GT  0/1
Lines   total/split/realigned/skipped:  2/0/0/0
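
For comparison with the command in the question: according to the bcftools manual, lowercase `-d` (`--rm-dup`) expects a value such as `snps`, `indels`, `both` or `all`, so in `bcftools norm -d in.vcf -o out.vcf` the file name would be read as that value instead of as the input. A hedged sketch of the lowercase form, assuming bcftools is installed and that the installed version accepts `-d all` (the accepted values have changed between releases, so check `bcftools norm --help`); the file names are placeholders and the data is the example above:

```shell
#!/bin/sh
# Sketch only: assumes bcftools is on PATH and supports "-d all".
if ! command -v bcftools >/dev/null 2>&1; then
    echo "bcftools not found; nothing to do" >&2
else
    # Recreate the two-record example from this thread (tab-separated).
    printf '%s\n' \
      '##fileformat=VCFv4.2' \
      '##contig=<ID=chr1,length=249250621>' \
      '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' \
      > in.vcf
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n' >> in.vcf
    printf 'chr1\t977330\trs2799066\tT\tC\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf
    printf 'chr1\t977330\trs2799066\tT\tC\t225\tPASS\t.\tGT\t0/1\n' >> in.vcf

    # -d takes a value; the input file is given as a separate argument.
    bcftools norm -d all -o out.vcf in.vcf

    # Count the records left in the output.
    grep -vc '^#' out.vcf
fi
```

As far as I understand the manual, `-d all` treats any records at the same position as duplicates and keeps only the first, which is broader than removing identical rows, so it may drop distinct variants that share a position.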

Thank you! I will check again.
