Question

How can I remove duplicated variants from a VCF file?

5

7.0 years ago

kk.mahsa ▴ 140

How can I remove duplicated variants from a VCF file? I googled and searched the Biostars history but did not find a way to do it.

SNP vcf • 20k views

ADD COMMENT • link updated 11 months ago by Ram 44k • written 7.0 years ago by kk.mahsa ▴ 140

6

Take the 1000 Genomes phase 3 data as an example:

bcftools norm -d both --threads=32 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O z  -o chr1.vcf.gz

ADD REPLY • link 5.9 years ago by Shicheng Guo ★ 9.5k

1

Terrible, I still have duplicates after bcftools norm:

Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Error: Duplicate ID '.'.

ADD REPLY • link updated 11 months ago by Ram 44k • written 5.9 years ago by Shicheng Guo ★ 9.5k
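
The "Duplicate ID '.'" message above looks like PLINK output rather than bcftools output (an assumption; the post does not show the command that produced it). If so, it may be about records whose ID column is the missing value "." rather than about duplicated positions. A minimal sketch that fills in missing IDs with a CHROM:POS:REF:ALT string using awk only; the demo file and its records are hypothetical:

```shell
# Hypothetical demo VCF; point "$vcf" at your own uncompressed file instead.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t1000\t.\tA\tG\t.\t.\t.\n'
  printf '1\t2000\trs123\tC\tT\t.\t.\t.\n'
  printf '1\t3000\t.\tG\tGA\t.\t.\t.\n'
} > "$vcf"

# Header lines pass through unchanged; a record whose ID is "." gets
# CHROM:POS:REF:ALT as its ID, existing IDs are left alone.
with_ids=$(awk 'BEGIN {FS = OFS = "\t"}
  /^#/ {print; next}
  $3 == "." {$3 = $1 ":" $2 ":" $4 ":" $5}
  {print}' "$vcf")
printf '%s\n' "$with_ids"
```

Records that are exact duplicates would still end up with identical IDs, so this only helps after the duplicated sites themselves have been removed.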

3

Could you explain what you mean by duplicated variants? Do you see two lines in your VCF file that are exactly the same?

ADD REPLY • link 7.0 years ago by smho ▴ 40

0

Yes, that is what I mean. I want to keep one of the duplicate variants and remove the rest. Specifically, I want to remove variants that share the same scaffold ID and position, keeping only one of them.

ADD REPLY • link 7.0 years ago by kk.mahsa ▴ 140
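
Before removing anything, it can help to see which sites are actually duplicated. A minimal sketch that counts records per scaffold ID and position with awk and reports the sites seen more than once; the demo file and its records are hypothetical:

```shell
# Hypothetical demo VCF; point "$demo" at your own uncompressed file instead.
demo=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'scaffold1\t100\t.\tA\tG\t.\t.\t.\n'
  printf 'scaffold1\t100\t.\tA\tG\t.\t.\t.\n'
  printf 'scaffold1\t100\t.\tA\tT\t.\t.\t.\n'
  printf 'scaffold2\t50\t.\tC\tT\t.\t.\t.\n'
} > "$demo"

# Skip header lines, count records per CHROM/POS, and print
# CHROM, POS and the count for every site that occurs more than once.
dup_sites=$(awk -F '\t' '
  !/^#/ {n[$1 FS $2]++}
  END {for (k in n) if (n[k] > 1) print k FS n[k]}' "$demo")
printf '%s\n' "$dup_sites"
```

In the demo, scaffold1 position 100 is reported with a count of 3. Note that one of those three records has a different ALT allele, which is why the choice of key (position only, or position plus alleles) matters for what counts as a duplicate.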

0

3.9 years ago

Kevin Blighe 88k

More options (just adding to keep threads linked based on common information): A: Remove duplicate SNPs only based on SNP ID in bcftools

Kevin

ADD COMMENT • link 3.9 years ago by Kevin Blighe 88k

0

2.6 years ago

summerday1112 ▴ 10

Do you know how to remove duplicated variants from a VCF file (ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz)? I am struggling with it.

ADD COMMENT • link 2.6 years ago by summerday1112 ▴ 10

score 4 · Accepted Answer · 2017-07-26

4

7.0 years ago

Pierre Lindenbaum 163k

"... in fact I intend to remove variants that are the same in scaffold ID and position and keep one of them."

I strongly suggest you also use the REF information...

Sort on CHROM/POS/REF. Using awk, build a KEY=CHROM\tPOS\tREF and print the line only if the key differs from the previous one:

LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4  input.vcf |\
awk -F '\t' '/^#/ {print;prev="";next;} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}'

edit: added 'next; ' for VCF header.

ADD COMMENT • link 7.0 years ago by Pierre Lindenbaum 163k

0

Thanks Pierre for your answer. I ran your command and got a VCF file as output, but when I ran bcftools stats on it I got this error:

Failed to open output.vcf: unknown file type

Why can't bcftools recognize the output as a VCF file? I need the output as a VCF file for downstream analysis.

ADD REPLY • link 7.0 years ago by kk.mahsa ▴ 140

3

Ah yes, sorry. It's because sort messed up the VCF header, so ##fileformat= is no longer the first line.

Please try:

( grep  '^#' input.vcf ; grep -v "^#" input.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4 | awk -F '\t' 'BEGIN{ prev="";} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}' )  > out.vcf

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 163k
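
As an alternative to the sort-based command above (not the method from the answer), the same CHROM/POS/REF key can be tracked in an awk array, which keeps the header and the original record order and needs no sorting. A minimal sketch; the demo file and its records are hypothetical, and the array holds one key per distinct site in memory:

```shell
# Hypothetical demo VCF; point "$vcf" at your own uncompressed file instead.
vcf=$(mktemp)
out=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'scaffold1\t100\t.\tA\tG\t.\t.\t.\n'
  printf 'scaffold1\t100\t.\tA\tG\t.\t.\t.\n'
  printf 'scaffold1\t100\t.\tAT\tA\t.\t.\t.\n'
  printf 'scaffold2\t50\t.\tC\tT\t.\t.\t.\n'
} > "$vcf"

# Header lines are printed as they are. A record is printed only the first
# time its CHROM/POS/REF key is seen; later records with that key are dropped.
awk -F '\t' '/^#/ {print; next} !seen[$1 FS $2 FS $4]++' "$vcf" > "$out"
cat "$out"
```

Because the key does not include ALT, two records at the same site with the same REF but different ALT alleles are treated as duplicates, just as in the sort-based version; adding $5 to the key would keep them apart.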

0

Your answer was really helpful, thank you so much Pierre. It worked.

ADD REPLY • link 7.0 years ago by kk.mahsa ▴ 140

Ram · Accepted Answer · 2017-07-26

1

7.0 years ago

cpad0112 21k

Use vcfuniq or bcftools norm (with the -d option) to remove duplicates.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 7.0 years ago by cpad0112 21k

1

bcftools norm left-align and normalize indels

Yes. It left-aligns the alleles and then, if the start coordinates are the same, removes one of them, right?

ADD REPLY • link 6.3 years ago by Shicheng Guo ★ 9.5k

0

Thank you cpad0112, I used bcftools norm and it worked.

ADD REPLY • link 7.0 years ago by kk.mahsa ▴ 140

0

bcftools norm is new to me, thanks!

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 163k