Question: Merge VCFs with overlapping samples
Wan Shi Tong60 wrote:

I have VCFs with some overlapping samples. Is there a tool that can merge them like this?

###VCF1:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3  
SNP1...  
SNP2...  
SNP3...

###VCF2:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample2 Sample3 Sample4  
SNP2...  
SNP3...  
SNP4...

I WANT THIS...

###VCF1+VCF2:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4  
SNP1... (missing for Sample4)  
SNP2...  
SNP3...  
SNP4... (missing for Sample1)

I DO NOT WANT THIS...

###VCF1+VCF2:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample2_2 Sample3_2 Sample4  
SNP1... (missing for Sample2_2, Sample3_2, and Sample4)  
SNP2...  
SNP3...  
SNP4... (missing for Sample1, Sample2, and Sample3)

In this example of what I do not want, Sample2 and Sample3 would only have SNP1, SNP2, and SNP3 and Sample2_2 and Sample3_2 would have SNP2, SNP3, SNP4.


Is there a tool that can merge VCFs and keep only one copy of each sample?

snp tools vcf • 479 views
modified 6 months ago by omg what am I doing • written 6 months ago by Wan Shi Tong

On face value, all that you require is bcftools merge. Pay close attention to the -m parameter, too. Missing genotypes will be represented as ./.

written 6 months ago by Kevin Blighe

bcftools merge expects sample names to be unique across the input VCFs. We could use --force-samples, but then the duplicated samples are renamed (prefixed with the index of their input file), which the OP doesn't want.
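Before choosing a strategy, it can help to see which sample names actually collide. A minimal sketch, assuming bcftools is on PATH; the function name shared_samples and the file names are hypothetical:

```shell
# Sketch: print the sample names that occur in both VCFs.
# `bcftools query -l` lists one sample name per line.
shared_samples() {
    tmp=$(mktemp -d)
    bcftools query -l "$1" | sort > "$tmp/a"
    bcftools query -l "$2" | sort > "$tmp/b"
    comm -12 "$tmp/a" "$tmp/b"   # keep only lines common to both lists
    rm -r "$tmp"
}
```

For example, `shared_samples VCF1.vcf.gz VCF2.vcf.gz` would list Sample2 and Sample3 for the files in the question.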

written 6 months ago by zx8754

Yeah, that is exactly my problem. vcf-merge and bcftools merge do not combine samples that share a name. Unfortunately, they create a new entry for each repeated sample.

written 6 months ago by Wan Shi Tong

It would be easier to split these back into individual per-sample VCFs, run bcftools concat --allow-overlaps --remove-duplicates to combine the files belonging to the same sample into a single VCF, and then merge everything with bcftools merge. This will work, as I have done it before for this type of situation.
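The split, concat, merge steps could be sketched roughly as follows. This is not the commenter's exact command set: the function name merge_overlapping, the scratch-directory layout, and the file names are hypothetical, and the sketch assumes bgzip-compressed inputs, bcftools on PATH, and sample names without whitespace.

```shell
# Sketch only: merge_overlapping OUT IN1 IN2 ...
merge_overlapping() {
    out=$1; shift
    work=$(mktemp -d)   # scratch directory, one subdirectory per sample
    i=0

    # 1. Split every input into indexed single-sample VCFs.
    for vcf in "$@"; do
        i=$((i + 1))
        for sample in $(bcftools query -l "$vcf"); do
            mkdir -p "$work/$sample"
            bcftools view -s "$sample" -Oz -o "$work/$sample/part$i.vcf.gz" "$vcf"
            bcftools index "$work/$sample/part$i.vcf.gz"
        done
    done

    # 2. Per sample: concatenate its pieces and drop duplicated records.
    set --
    for dir in "$work"/*/; do
        sample=$(basename "$dir")
        bcftools concat --allow-overlaps --remove-duplicates \
            -Oz -o "$work/$sample.vcf.gz" "$dir"part*.vcf.gz
        bcftools index "$work/$sample.vcf.gz"
        set -- "$@" "$work/$sample.vcf.gz"
    done

    # 3. Merge the per-sample files into one multi-sample VCF.
    bcftools merge -Oz -o "$out" "$@"
}
```

A call such as `merge_overlapping merged.vcf.gz VCF1.vcf.gz VCF2.vcf.gz` would then produce one column per distinct sample. The scratch directory is left in place so the intermediate files can be inspected.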

modified 6 months ago • written 6 months ago by Kevin Blighe
6 months ago by
Penn State College of Medicine
omg what am I doing60 wrote:

You need bcftools concat. I used the command below and got the result you described.

bcftools concat -a filtered_indels_annotated.vcf.gz filtered_snps_annotated.vcf.gz -Ov -o filtered_BC_merged.vcf

Some useful info on the -a option here: https://samtools.github.io/bcftools/bcftools.html#concat

modified 6 months ago • written 6 months ago by omg what am I doing

For concat to work we need all samples to overlap exactly.

All source files must have the same sample columns appearing in the same order.
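That requirement can be checked up front by comparing the sample lists. A minimal sketch, assuming bcftools is on PATH; the function name same_samples is hypothetical:

```shell
# Sketch: succeed (exit status 0) only when both files list the
# same samples in the same order, as bcftools concat requires.
same_samples() {
    [ "$(bcftools query -l "$1")" = "$(bcftools query -l "$2")" ]
}
```

It could guard a concat call, e.g. `same_samples a.vcf.gz b.vcf.gz && bcftools concat -a a.vcf.gz b.vcf.gz -Oz -o out.vcf.gz`, where the file names are placeholders.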

written 6 months ago by zx8754