Question: bcftools merge of multiple vcf produces duplicate records. How to solve the issue.
5
gravatar for kirannbishwa01
2.4 years ago by
kirannbishwa011.2k
United States
kirannbishwa011.2k wrote:

I have multiple single-sample VCF files, which I want to merge into a single multi-sample VCF file. When using bcftools merge I am getting duplicate records.

$ bcftools merge ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v -o RBphased_variants.SixSamples.Final.vcf
 # duplicate records at the same lines from the file "RBphased_variants.SixSamples.Final.vcf"
2   14691373    .   A   .   1153.31 PASS    BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;DP=68;AN=8  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    0/0:4:4:0:0:0/0:.:.:0/0:.:. 0/0:7:7:18:0:0/0:.:.:0/0:.:.    0/0:4:4:12:0:0/0:.:.:0/0:.:.    0/0:2:2:3:0:0/0:.:.:0/0:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2   14691373    .   A   AAG 1153.31 PASS    BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;set=InDels;DP=676;AF=0.042;AN=4;AC=0    GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM    ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:.   0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.
2   14691374    .   A   .   1320.25 PASS    BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;DP=44;AN=2 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    0/0:4:4:0:0:0/0:.:.:0/0:.:. ./.:7:7:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2   14691374    .   A   G   1320.25 PASS    BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;set=HignConfSNPs;DP=710;AF=0.115;MLEAC=3;MLEAF=0.115;AN=4;AC=0 GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM    ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:.   0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.

I raised this issue in bcftools thinking if it was a bug https://github.com/samtools/bcftools/issues/754 . But, is there any other solution to the problem.

ADD COMMENTlink modified 2.4 years ago by Kevin Blighe63k • written 2.4 years ago by kirannbishwa011.2k
9
gravatar for Kevin Blighe
2.4 years ago by
Kevin Blighe63k
Kevin Blighe63k wrote:

Hey kirannbishwa01,

I would not call these duplicate records as they are different calls but at the same position.

Just adding --merge all to your command should help to solve it

So:

bcftools merge --merge all ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v > RBphased_variants.SixSamples.Final.vcf

Kevin

ADD COMMENTlink modified 20 months ago • written 2.4 years ago by Kevin Blighe63k

Thank you Kevin for the answer. But, this didn't work - it just deleted all the records that were duplicates (by position).

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by kirannbishwa011.2k

Try without the piped bcftools view --min-ac 1.

ADD REPLYlink modified 20 months ago • written 2.4 years ago by Kevin Blighe63k
1

Hi Kevin,

I actually tried it without bcftools view --min-ac 1 and it seemed to work. Thanks,

ADD REPLYlink written 2.4 years ago by kirannbishwa011.2k

Hi Kevin, Everything worked well with this tool. But, there came a little problem. Several lines in my VCFs are this way:

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0;ClippingRankSum=0;DP=204;ExcessHet=0.6367;FS=0;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels   GT:AD:DP:GQ:PG:PL:PW    ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,34:0/0  0/0:3,0:3:9:0/0:0,9,82:0/0  ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   0/0:3,0:3:0:0/0:0,0,4:0/0   0/0:2,0:2:6:0/0:0,6,80:0/0  0/0:1,0:1:3:0/0:0,3,36:0/0  1/1:0,3:3:9:1/1:96,9,0:1/1  0/1:2,1:3:23:0/1:23,0,54:0/1    ./.:1,0:1:.:./.:0,0,0:./.   ./. ./. ./. ./.

See those samples at the end. Instead of ./. ./. ./. ./. they should be. ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. ...... i.e filled with null values for equal number of fields in FORMAT field. The reason I need this is because I am using other tools to manipulate the data downstream and this data structure is throwing a problem. I have altogther 16 samples in this VCF.

Thanks,

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by kirannbishwa011.2k

Let me guess: these VCFs have not been produced in the same way?

ADD REPLYlink written 2.4 years ago by Kevin Blighe63k

Kevin, that's not the case. See my code:

merge variants

bcftools merge --merge all ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz MA622_phased.vcf.gz MA625_phased.vcf.gz MA629_phased.vcf.gz Ncm8_phased.vcf.gz Sp3_phased.vcf.gz Sp21_phased.vcf.gz Sp76_phased.vcf.gz Sp154_phased.vcf.gz Sp164_phased.vcf.gz SpNor33_phased.vcf.gz -O v -o RBphased_variants.AllSamples.vcf

ValidateVariants

java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V RBphased_variants.AllSamples.Final.vcf

That line I put there is from the file "RBphased_variants.AllSamples.Final.vcf"

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by kirannbishwa011.2k

And, here is another problem. Several FORMAT level tags are missing. See the example below.

In the single sample VCF's.

sample "MA605" - "PI" format field exits

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=17;ExcessHet=0.6367;FS=0.000;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    ./.:0,0:0:.:0,0,0:./.:.:.:./.:.:.

sample "ms01e" - no record at position 2 180505

sample "Sp3" - "PI" field present

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=17;ExcessHet=0.6367;FS=0.000;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    1/1:0,3:3:9:96,9,0:1/1:.:.:1/1:.:.

But, the merge VCFs has "PI" tag missing at all.

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0;ClippingRankSum=0;DP=204;ExcessHet=0.6367;FS=0;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels   GT:AD:DP:GQ:PG:PL:PW    ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,34:0/0  0/0:3,0:3:9:0/0:0,9,82:0/0  ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   0/0:3,0:3:0:0/0:0,0,4:0/0   0/0:2,0:2:6:0/0:0,6,80:0/0  0/0:1,0:1:3:0/0:0,3,36:0/0  1/1:0,3:3:9:1/1:96,9,0:1/1  0/1:2,1:3:23:0/1:23,0,54:0/1    ./.:1,0:1:.:./.:0,0,0:./.   ./. ./. ./. ./.
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by kirannbishwa011.2k

I think the reason the "PI" are missing is because all the samples at that record don't have meaningful "PI" value. But, regardless of that shouldn't it be possible to have it in "merged VCFs".

ADD REPLYlink written 2.4 years ago by kirannbishwa011.2k

I see that you're using some GATK tools, a program suite for which hardly anything is compatible, but that's an issue for the Broad Institute to work on.

For the particular example that you mention just now (chr2:180505), the data that you say is missing is neither in the input VCF files going into the BCFtools command. So, it's an issue even prior to that... perhaps to do with phASER.

For the PL example that you mention, the correct PL values are indeed there for that particular sample - just look further along the line.

The VCFs coming out of phASER, generally, do not look like they are in good shape. I don't know the exact specifics of your experiment to comment much further. It looks messy because there are a lot of reference calls in the data and a lot of missing FORMAT values.

Sorry that I cannot help further for now.

ADD REPLYlink written 2.4 years ago by Kevin Blighe63k

Kevin, This merging has been a big pain for me. GATK CombineVariants is throwing an error which I can't fix. And, bcftools now seem to be giving up too.

I think I will need to the author of "phaser" to look into this problem.

Thanks,

ADD REPLYlink written 2.4 years ago by kirannbishwa011.2k

Yes, that is a better bet. The author will undoubtedly have encountered these issues before. Had I known that you were eventually using GATK, I would not have suggested BCFtools. SAMtools / BCFtools don't interact very well with GATK functions because the GATK likes to add numerous different and specific (to the GATK) encodings.

You can email the author here: https://stephanecastel.wordpress.com/

Best of luck.

ADD REPLYlink written 2.4 years ago by Kevin Blighe63k

Yeah, GATK doesn't seem to be working well when data are piped in, if it came out of the GATK workflow.

I have been in touch with Stephane Castel over the issue several times. But, it seems he is busy and not able to look over the problem. Or, prolly he has moved on.

I know a little bit of python but it just kills most of my time; and I may be reinventing the wheel, when a fast and workable solution is already out there.

Thanks much though.

ADD REPLYlink written 2.4 years ago by kirannbishwa011.2k

Yes, that happens: they get their publication and then move on.

What if you tried to merge the VCFs prior to phASER, and then ran phASER on the merged VCF? I presume that phASER supports that.

ADD REPLYlink written 2.4 years ago by Kevin Blighe63k
1

My vcf's are merged and validated before piping into "phaser". The issue is "phaser" exclusively outputs single sample VCFs, which needs remerging - and when I try to merge I run into problems. I have not been able to find a tools that can fix the VCF format issue (but has correct information). Let's see what I can work out.

ADD REPLYlink written 2.4 years ago by kirannbishwa011.2k

Did you ever get a solution to this? I find that each situation is different and requires a different combinations of commands.

ADD REPLYlink modified 20 months ago • written 20 months ago by Kevin Blighe63k
1

You may take ideas from the following, where I process / merge calls:

ADD REPLYlink written 20 months ago by Kevin Blighe63k

I have to add that it's not "PL" tag and values but "pi" tags and values. I think the character differences caused the issue.

ADD REPLYlink written 2.4 years ago by kirannbishwa011.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1547 users visited in the last hour