bcftools merge of multiple vcf produces duplicate records. How to solve the issue.
1
5
Entering edit mode
6.1 years ago
kirannbishwa01 ★ 1.6k

I have multiple single-sample VCF files, which I want to merge into a single multi-sample VCF file. When using bcftools merge I am getting duplicate records.

$ bcftools merge ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v -o RBphased_variants.SixSamples.Final.vcf
 # duplicate records at the same lines from the file "RBphased_variants.SixSamples.Final.vcf"
2   14691373    .   A   .   1153.31 PASS    BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;DP=68;AN=8  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    0/0:4:4:0:0:0/0:.:.:0/0:.:. 0/0:7:7:18:0:0/0:.:.:0/0:.:.    0/0:4:4:12:0:0/0:.:.:0/0:.:.    0/0:2:2:3:0:0/0:.:.:0/0:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2   14691373    .   A   AAG 1153.31 PASS    BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;set=InDels;DP=676;AF=0.042;AN=4;AC=0    GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM    ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:.   0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.
2   14691374    .   A   .   1320.25 PASS    BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;DP=44;AN=2 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    0/0:4:4:0:0:0/0:.:.:0/0:.:. ./.:7:7:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2   14691374    .   A   G   1320.25 PASS    BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;set=HignConfSNPs;DP=710;AF=0.115;MLEAC=3;MLEAF=0.115;AN=4;AC=0 GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM    ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:.   0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.

I raised this issue in bcftools thinking if it was a bug https://github.com/samtools/bcftools/issues/754 . But, is there any other solution to the problem.

VCF vcf merge bcftools duplicate records variants • 21k views
ADD COMMENT
13
Entering edit mode
6.1 years ago

Hey kirannbishwa01,

I would not call these duplicate records as they are different calls but at the same position.

Just adding --merge all to your command should help to solve it

So:

bcftools merge --merge all ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v > RBphased_variants.SixSamples.Final.vcf

Kevin

ADD COMMENT
0
Entering edit mode

Thank you Kevin for the answer. But, this didn't work - it just deleted all the records that were duplicates (by position).

ADD REPLY
0
Entering edit mode

Try without the piped bcftools view --min-ac 1.

ADD REPLY
1
Entering edit mode

Hi Kevin,

I actually tried it without bcftools view --min-ac 1 and it seemed to work. Thanks,

ADD REPLY
0
Entering edit mode

Hi Kevin, Everything worked well with this tool. But, there came a little problem. Several lines in my VCFs are this way:

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0;ClippingRankSum=0;DP=204;ExcessHet=0.6367;FS=0;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels   GT:AD:DP:GQ:PG:PL:PW    ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,34:0/0  0/0:3,0:3:9:0/0:0,9,82:0/0  ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   0/0:3,0:3:0:0/0:0,0,4:0/0   0/0:2,0:2:6:0/0:0,6,80:0/0  0/0:1,0:1:3:0/0:0,3,36:0/0  1/1:0,3:3:9:1/1:96,9,0:1/1  0/1:2,1:3:23:0/1:23,0,54:0/1    ./.:1,0:1:.:./.:0,0,0:./.   ./. ./. ./. ./.

See those samples at the end. Instead of ./. ./. ./. ./. they should be. ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. ...... i.e filled with null values for equal number of fields in FORMAT field. The reason I need this is because I am using other tools to manipulate the data downstream and this data structure is throwing a problem. I have altogther 16 samples in this VCF.

Thanks,

ADD REPLY
0
Entering edit mode

Let me guess: these VCFs have not been produced in the same way?

ADD REPLY
0
Entering edit mode

Kevin, that's not the case. See my code:

merge variants

bcftools merge --merge all ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz MA622_phased.vcf.gz MA625_phased.vcf.gz MA629_phased.vcf.gz Ncm8_phased.vcf.gz Sp3_phased.vcf.gz Sp21_phased.vcf.gz Sp76_phased.vcf.gz Sp154_phased.vcf.gz Sp164_phased.vcf.gz SpNor33_phased.vcf.gz -O v -o RBphased_variants.AllSamples.vcf

ValidateVariants

java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V RBphased_variants.AllSamples.Final.vcf

That line I put there is from the file "RBphased_variants.AllSamples.Final.vcf"

ADD REPLY
0
Entering edit mode

And, here is another problem. Several FORMAT level tags are missing. See the example below.

In the single sample VCF's.

sample "MA605" - "PI" format field exits

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=17;ExcessHet=0.6367;FS=0.000;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    ./.:0,0:0:.:0,0,0:./.:.:.:./.:.:.

sample "ms01e" - no record at position 2 180505

sample "Sp3" - "PI" field present

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=17;ExcessHet=0.6367;FS=0.000;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    1/1:0,3:3:9:96,9,0:1/1:.:.:1/1:.:.

But, the merge VCFs has "PI" tag missing at all.

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0;ClippingRankSum=0;DP=204;ExcessHet=0.6367;FS=0;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels   GT:AD:DP:GQ:PG:PL:PW    ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,34:0/0  0/0:3,0:3:9:0/0:0,9,82:0/0  ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   0/0:3,0:3:0:0/0:0,0,4:0/0   0/0:2,0:2:6:0/0:0,6,80:0/0  0/0:1,0:1:3:0/0:0,3,36:0/0  1/1:0,3:3:9:1/1:96,9,0:1/1  0/1:2,1:3:23:0/1:23,0,54:0/1    ./.:1,0:1:.:./.:0,0,0:./.   ./. ./. ./. ./.
ADD REPLY
0
Entering edit mode

I think the reason the "PI" are missing is because all the samples at that record don't have meaningful "PI" value. But, regardless of that shouldn't it be possible to have it in "merged VCFs".

ADD REPLY
0
Entering edit mode

I see that you're using some GATK tools, a program suite for which hardly anything is compatible, but that's an issue for the Broad Institute to work on.

For the particular example that you mention just now (chr2:180505), the data that you say is missing is neither in the input VCF files going into the BCFtools command. So, it's an issue even prior to that... perhaps to do with phASER.

For the PL example that you mention, the correct PL values are indeed there for that particular sample - just look further along the line.

The VCFs coming out of phASER, generally, do not look like they are in good shape. I don't know the exact specifics of your experiment to comment much further. It looks messy because there are a lot of reference calls in the data and a lot of missing FORMAT values.

Sorry that I cannot help further for now.

ADD REPLY
0
Entering edit mode

Kevin, This merging has been a big pain for me. GATK CombineVariants is throwing an error which I can't fix. And, bcftools now seem to be giving up too.

I think I will need to the author of "phaser" to look into this problem.

Thanks,

ADD REPLY
0
Entering edit mode

Yes, that is a better bet. The author will undoubtedly have encountered these issues before. Had I known that you were eventually using GATK, I would not have suggested BCFtools. SAMtools / BCFtools don't interact very well with GATK functions because the GATK likes to add numerous different and specific (to the GATK) encodings.

You can email the author here: https://stephanecastel.wordpress.com/

Best of luck.

ADD REPLY
0
Entering edit mode

Yeah, GATK doesn't seem to be working well when data are piped in, if it came out of the GATK workflow.

I have been in touch with Stephane Castel over the issue several times. But, it seems he is busy and not able to look over the problem. Or, prolly he has moved on.

I know a little bit of python but it just kills most of my time; and I may be reinventing the wheel, when a fast and workable solution is already out there.

Thanks much though.

ADD REPLY
0
Entering edit mode

Yes, that happens: they get their publication and then move on.

What if you tried to merge the VCFs prior to phASER, and then ran phASER on the merged VCF? I presume that phASER supports that.

ADD REPLY
1
Entering edit mode

My vcf's are merged and validated before piping into "phaser". The issue is "phaser" exclusively outputs single sample VCFs, which needs remerging - and when I try to merge I run into problems. I have not been able to find a tools that can fix the VCF format issue (but has correct information). Let's see what I can work out.

ADD REPLY
0
Entering edit mode

Did you ever get a solution to this? I find that each situation is different and requires a different combinations of commands.

ADD REPLY
1
Entering edit mode

You may take ideas from the following, where I process / merge calls:

ADD REPLY
0
Entering edit mode

I have to add that it's not "PL" tag and values but "pi" tags and values. I think the character differences caused the issue.

ADD REPLY

Login before adding your answer.

Traffic: 2418 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6