Question: How can I remove duplicated variants from a VCF file?
kk.mahsa wrote (18 months ago):

How can I remove duplicated variants from a VCF file? I googled and searched the Biostars history, but I did not find any way to do it.

snp vcf • 3.7k views

Could you please explain what you mean by duplicated variants? Do you see two lines in your VCF file that are exactly the same?

written 18 months ago by smho40

Yes, that is what I mean. I want to keep one of the duplicate variants and remove the rest. In fact, I intend to remove variants that have the same scaffold ID and position, keeping one of them.

written 18 months ago by kk.mahsa
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (18 months ago):

"In fact, I intend to remove variants that have the same scaffold ID and position, keeping one of them."

I strongly suggest you also use the REF information...

Sort on CHROM/POS/REF. Using awk, build a key KEY=CHROM\tPOS\tREF and print a line only if its key differs from the key of the previous line.

LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4  input.vcf |\
awk -F '\t' '/^#/ {print;prev="";next;} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}'

edit: added 'next; ' for VCF header.
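As a point of comparison, here is a minimal sketch of the same CHROM/POS/REF key idea that skips the sort step and remembers every key seen so far in an awk array. It is an alternative, not the command from this answer: the function name dedup_vcf is made up for illustration, the input is assumed to be an uncompressed VCF, and all keys are held in memory, which may matter for very large files.

```shell
# Sketch: drop repeated CHROM/POS/REF records without sorting.
# dedup_vcf is a hypothetical helper name; input is an uncompressed VCF.
dedup_vcf() {
  awk -F '\t' '
    /^#/ { print; next }        # pass header lines through in their original order
    !seen[$1 FS $2 FS $4]++     # print a record only the first time its key appears
  ' "$1"
}

# Example usage:
# dedup_vcf input.vcf > out.vcf
```

Because nothing is reordered, the header lines and the records stay in their original order and the first record for each key is the one that is kept.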


Thanks, Pierre, for your answer. I ran your command and got a VCF file as output, but when I ran bcftools stats on it I got this error:

Failed to open output.vcf: unknown file type

Why can't bcftools recognize the output as a VCF file? I need the output as a VCF file for downstream analysis.

modified 18 months ago • written 18 months ago by kk.mahsa

Ah yes, sorry. It's because sort scrambled the VCF header, so ##fileformat= is no longer the first line.

please try:

(
  grep '^#' input.vcf
  grep -v '^#' input.vcf |
    LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4 |
    awk -F '\t' 'BEGIN { prev = "" } { key = sprintf("%s\t%s\t%s", $1, $2, $4); if (key == prev) next; print; prev = key }'
) > out.vcf
modified 18 months ago • written 18 months ago by Pierre Lindenbaum
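To confirm that the header problem described in this reply is gone, it can help to check that the first line of the result really is the ##fileformat= line. The sketch below is only an illustration: the function name first_line_is_fileformat is made up, and it tests nothing beyond that first line.

```shell
# Hypothetical helper: succeeds only if the first line of the file
# starts with the ##fileformat= header.
first_line_is_fileformat() {
  head -n 1 "$1" | grep -q '^##fileformat='
}

# Example usage:
# first_line_is_fileformat out.vcf && echo "header OK" || echo "header out of place"
```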

Your answer was really helpful, thank you so much, Pierre. It worked.

written 18 months ago by kk.mahsa
cpad0112 (India) wrote (18 months ago):

Use vcfuniq or bcftools norm (with the -d option) to remove duplicates.


bcftools norm: "left-align and normalize indels"

Yes. So it left-aligns the alleles and then, if the start coordinates are the same, removes one of them, right?

written 10 months ago by Shicheng Guo

Thank you, cpad0112. I used bcftools norm and it worked.

written 18 months ago by kk.mahsa

bcftools norm is new to me, thanks!

written 18 months ago by Pierre Lindenbaum
Shicheng Guo wrote (6 months ago):

Take the 1000 Genomes phase 3 data as an example:

bcftools norm -d both --threads=32 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O z  -o chr1.vcf.gz

Terrible, I still have duplicates after bcftools norm:

Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.

Error: Duplicate ID '.'.

modified 6 months ago • written 6 months ago by Shicheng Guo
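A note on the error above: its wording resembles PLINK output rather than bcftools output, and it complains about the ID column (several records sharing the ID '.'), not about repeated CHROM/POS entries. If that reading is right, one possible workaround is to give unnamed records an ID built from their own fields. The sketch below assumes an uncompressed VCF and uses a made-up function name, fill_missing_ids; bcftools annotate also offers a --set-id option for the same purpose.

```shell
# Sketch: give records whose ID column is "." a CHROM:POS:REF:ALT identifier.
# fill_missing_ids is a hypothetical helper name; input is an uncompressed VCF.
fill_missing_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }                          # header lines are left untouched
    $3 == "." { $3 = $1 ":" $2 ":" $4 ":" $5 }    # build an ID only where none exists
    { print }
  ' "$1"
}

# Example usage:
# fill_missing_ids chr1.vcf > chr1.with_ids.vcf
```

Records that are identical in CHROM, POS, REF and ALT would still end up with the same ID, so this only helps once true duplicates have been removed.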