Question: How can I remove duplicated variants from a VCF file?
kk.mahsa (100) wrote 23 months ago:

How can I remove duplicated variants from a VCF file? I googled and searched the Biostars history, but I did not find any way to do it.

snp vcf • 4.7k views
modified 10 months ago by Shicheng Guo (7.5k) • written 23 months ago by kk.mahsa (100)

Could you explain what you mean by duplicated variants? Do you observe two lines in your VCF file that are exactly the same?

written 23 months ago by smho (40)

Yes, that is what I mean. I want to keep one of the duplicate variants and remove the rest. In fact, I intend to remove variants that have the same scaffold ID and POS and keep one of them.

written 23 months ago by kk.mahsa (100)
Pierre Lindenbaum (121k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 23 months ago:

"In fact, I intend to remove variants that have the same scaffold ID and POS and keep one of them."

I strongly suggest you also use the REF information...

Sort on CHROM/POS/REF. Using awk, build a key CHROM\tPOS\tREF and print the line only if the key was not already seen (after sorting, records with identical keys are adjacent).

LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4  input.vcf |\
awk -F '\t' '/^#/ {print;prev="";next;} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}'

edit: added 'next; ' for VCF header.

modified 23 months ago • written 23 months ago by Pierre Lindenbaum (121k)

Thanks, Pierre, for your answer. I ran your command and got a VCF file as output, but when I ran bcftools stats on it I got this error:

Failed to open output.vcf: unknown file type

Why can't bcftools recognize the output as a VCF file? I need the output file to be a valid VCF for downstream analysis.

modified 23 months ago • written 23 months ago by kk.mahsa (100)

Ah yes, sorry. It's because sort scrambled the VCF header, so ##fileformat= is no longer the first line.

Please try:

( grep  '^#' input.vcf ; grep -v "^#" input.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4 | awk -F '\t' 'BEGIN{ prev="";} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}' )  > out.vcf
modified 23 months ago • written 23 months ago by Pierre Lindenbaum (121k)
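
A related sketch, not part of the answer above: when the file does not need to be re-sorted, awk alone can keep the first record for each key and pass the header through untouched, so ##fileformat= stays on the first line. The function name dedup_vcf is hypothetical, the input is assumed to be an uncompressed, tab-delimited VCF, and ALT is added to the key here as an assumption so that different alleles at the same site are kept. All distinct keys are held in memory.

```shell
# Hypothetical helper: drop repeated records from an uncompressed VCF.
# Header lines (starting with '#') are printed as-is and keep their position.
# A record counts as a duplicate when CHROM, POS, REF and ALT all match
# an earlier record; only the first occurrence is printed.
dedup_vcf() {
    awk -F '\t' '
        /^#/ { print; next }              # header line: print unchanged
        !seen[$1 FS $2 FS $4 FS $5]++     # first time this key appears: print
    ' "$1"
}
```

Usage would look like dedup_vcf input.vcf > out.vcf. To match the CHROM/POS/REF key used above, drop $5 from the key.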

Your answer was really helpful. Thank you so much, Pierre, it worked.

written 23 months ago by kk.mahsa (100)
cpad0112 (11k), India, wrote 23 months ago:

Use vcfuniq or bcftools norm (with the -d option) to remove duplicates.

modified 6 weeks ago by RamRS (22k) • written 23 months ago by cpad0112 (11k)

"bcftools norm: left-align and normalize indels"

Yes. So it left-aligns the alleles and then, if the start coordinates are the same, removes one of them, right?

written 15 months ago by Shicheng Guo (7.5k)

Thank you, cpad0112. I used bcftools norm and it worked.

written 23 months ago by kk.mahsa (100)

bcftools norm is new to me, thanks!

written 23 months ago by Pierre Lindenbaum (121k)
Shicheng Guo (7.5k) wrote 10 months ago:

Taking the 1000 Genomes phase 3 data as an example:

bcftools norm -d both --threads=32 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O z  -o chr1.vcf.gz
modified 10 months ago • written 10 months ago by Shicheng Guo (7.5k)

Terrible, there are still duplicates after bcftools norm:

Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.

Error: Duplicate ID '.'.

modified 10 months ago • written 10 months ago by Shicheng Guo (7.5k)
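
One way to check whether repeated records actually remain is to list every CHROM/POS/REF/ALT combination that occurs more than once. This is only a sketch: the function name list_dup_records is hypothetical, and the input is assumed to be an uncompressed, tab-delimited VCF (a .vcf.gz file would need to be decompressed first). Note that the error quoted above complains about a duplicate ID of '.', which may concern the ID column rather than repeated positions.

```shell
# Hypothetical helper: print each CHROM/POS/REF/ALT key that appears on more
# than one record of an uncompressed VCF. No output means no such repeats.
list_dup_records() {
    grep -v '^#' "$1" |      # skip header lines
        cut -f 1,2,4,5 |     # keep CHROM, POS, REF, ALT
        LC_ALL=C sort |      # bring identical keys next to each other
        uniq -d              # print one copy of each repeated key
}
```

Usage would look like list_dup_records chr1.vcf, run on the decompressed output of the command above.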