I have a VCF like this one:
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20200612
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=NC_045512.2>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4
NC_045512.2 71 . C T 494 PASS AF=0.041451;SB=3;DP=3177;DP4=2604,474,77,14 GT 1 . 1 .
NC_045512.2 71 . C T 494 PASS AF=0.041451;SB=3;DP=3177;DP4=2604,474,77,14 GT . 1 . .
As you can see, the same variant appears in two different rows, with genotypes for different samples (first row: Sample1 and Sample3; second row: Sample2).
I want to collapse these rows into one while keeping the genotypes from both. Like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4
NC_045512.2 71 . C T 494 PASS AF=0.041451;SB=3;DP=3177;DP4=2604,474,77,14 GT 1 1 1 .
I tried removing duplicates with bcftools norm:
bcftools norm -d none -o merg_3nodup.vcf merg_3.vcf
But it does not work: it just deletes the second row, so the genotype from Sample2 is lost.
I also tried collapsing with bcftools isec. It is intended for use with multiple VCF files and only lets you run it on a single VCF if you add the --targets option:
bcftools isec -c none --targets "NC_045512.2" merg_3.vcf.gz -o merg_3collapse.vcf
But the output is exactly the same as the input file.
Does anyone have a clue on how to proceed?
Split your VCF per sample and then merge the 4 VCFs?
Thanks for your answer, Pierre.
Unfortunately, this VCF was originally created by merging other VCF files (thousands of them, in fact), so splitting it again is not a desirable option.
Any other suggestions?
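One possible workaround that avoids splitting is a small script that merges the duplicated rows directly. The following is only a minimal sketch, not a bcftools feature, and it relies on several assumptions: the file is tab-separated and uncompressed, FORMAT contains only GT (as in the example above), rows count as the same variant when CHROM, POS, REF and ALT match, and QUAL, FILTER and INFO are simply taken from the first row rather than recomputed. The function names collapse_vcf_lines and collapse_vcf_file are made up for illustration.

```python
def is_missing(sample_field):
    """True if the genotype (first ':'-separated subfield) is '.', './.' or similar."""
    genotype = sample_field.split(":")[0]
    return set(genotype) <= set("./|")


def collapse_vcf_lines(lines):
    """Merge records sharing CHROM, POS, REF and ALT.

    For each sample column the first non-missing genotype is kept.
    Fixed columns (QUAL, FILTER, INFO, ...) come from the first record seen.
    """
    header = []
    merged = {}  # key -> list of fields; dicts keep insertion order
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key not in merged:
            merged[key] = fields
            continue
        kept = merged[key]
        # Sample columns start at index 9 (after FORMAT).
        for i in range(9, len(fields)):
            if is_missing(kept[i]) and not is_missing(fields[i]):
                kept[i] = fields[i]
    return header + ["\t".join(fields) for fields in merged.values()]


def collapse_vcf_file(in_path, out_path):
    """Read an uncompressed VCF, collapse duplicated records, write the result."""
    with open(in_path) as src:
        rows = collapse_vcf_lines(src)
    with open(out_path, "w") as dst:
        dst.write("\n".join(rows) + "\n")
```

With the file names from the question, this would be called as collapse_vcf_file("merg_3.vcf", "merg_3collapse.vcf"). If two rows carry different non-missing genotypes for the same sample, the sketch silently keeps the first one, so it is worth checking for such conflicts before trusting the output.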