Question: bcftools merge of multiple vcf produces duplicate records. How to solve the issue.
0
gravatar for kirannbishwa01
3 months ago by
United States
kirannbishwa01790 wrote:

I have multiple single-sample VCF files, which I want to merge into a single multi-sample VCF file. When using bcftools merge I am getting duplicate records.

$ bcftools merge ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v -o RBphased_variants.SixSamples.Final.vcf
 # duplicate records at the same lines from the file "RBphased_variants.SixSamples.Final.vcf"
2   14691373    .   A   .   1153.31 PASS    BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;DP=68;AN=8  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    0/0:4:4:0:0:0/0:.:.:0/0:.:. 0/0:7:7:18:0:0/0:.:.:0/0:.:.    0/0:4:4:12:0:0/0:.:.:0/0:.:.    0/0:2:2:3:0:0/0:.:.:0/0:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2   14691373    .   A   AAG 1153.31 PASS    BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;set=InDels;DP=676;AF=0.042;AN=4;AC=0    GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM    ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:.   0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.
2   14691374    .   A   .   1320.25 PASS    BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;DP=44;AN=2 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    0/0:4:4:0:0:0/0:.:.:0/0:.:. ./.:7:7:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2   14691374    .   A   G   1320.25 PASS    BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;set=HignConfSNPs;DP=710;AF=0.115;MLEAC=3;MLEAF=0.115;AN=4;AC=0 GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM    ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:.   0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.

I raised this issue in bcftools thinking if it was a bug https://github.com/samtools/bcftools/issues/754 . But, is there any other solution to the problem.

ADD COMMENTlink modified 3 months ago by Kevin Blighe21k • written 3 months ago by kirannbishwa01790
3
gravatar for Kevin Blighe
3 months ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

Hey kirannbishwa01,

I would not call these duplicate records as they are different calls but at the same position. In fact, in the example that you've shown, there are only reference calls and no alternate alleles (maybe you have not pasted all of the data?).

Just adding --merge all to your command should help to solve it, but then pipe that into bcftools view --min-ac 1

So:

bcftools merge --merge all ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v | bcftools view --min-ac 1 > RBphased_variants.SixSamples.Final.vcf

Kevin

ADD COMMENTlink modified 3 months ago • written 3 months ago by Kevin Blighe21k

Thank you Kevin for the answer. But, this didn't work - it just deleted all the records that were duplicates (by position).

ADD REPLYlink modified 3 months ago • written 3 months ago by kirannbishwa01790

Try without the piped bcftools view --min-ac 1.

The examples that you've shown are again just reference calls, though, i.e., there are no reported alternate genotypes.

ADD REPLYlink written 3 months ago by Kevin Blighe21k
1

Hi Kevin,

I actually tried it without bcftools view --min-ac 1 and it seemed to work. Thanks,

ADD REPLYlink written 3 months ago by kirannbishwa01790

Hi Kevin, Everything worked well with this tool. But, there came a little problem. Several lines in my VCFs are this way:

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0;ClippingRankSum=0;DP=204;ExcessHet=0.6367;FS=0;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels   GT:AD:DP:GQ:PG:PL:PW    ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,34:0/0  0/0:3,0:3:9:0/0:0,9,82:0/0  ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   0/0:3,0:3:0:0/0:0,0,4:0/0   0/0:2,0:2:6:0/0:0,6,80:0/0  0/0:1,0:1:3:0/0:0,3,36:0/0  1/1:0,3:3:9:1/1:96,9,0:1/1  0/1:2,1:3:23:0/1:23,0,54:0/1    ./.:1,0:1:.:./.:0,0,0:./.   ./. ./. ./. ./.

See those samples at the end. Instead of ./. ./. ./. ./. they should be. ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. ...... i.e filled with null values for equal number of fields in FORMAT field. The reason I need this is because I am using other tools to manipulate the data downstream and this data structure is throwing a problem. I have altogther 16 samples in this VCF.

Thanks,

ADD REPLYlink modified 3 months ago • written 3 months ago by kirannbishwa01790

Let me guess: these VCFs have not been produced in the same way?

ADD REPLYlink written 3 months ago by Kevin Blighe21k

Kevin, that's not the case. See my code:

merge variants

bcftools merge --merge all ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz MA622_phased.vcf.gz MA625_phased.vcf.gz MA629_phased.vcf.gz Ncm8_phased.vcf.gz Sp3_phased.vcf.gz Sp21_phased.vcf.gz Sp76_phased.vcf.gz Sp154_phased.vcf.gz Sp164_phased.vcf.gz SpNor33_phased.vcf.gz -O v -o RBphased_variants.AllSamples.vcf

ValidateVariants

java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V RBphased_variants.AllSamples.Final.vcf

That line I put there is from the file "RBphased_variants.AllSamples.Final.vcf"

ADD REPLYlink modified 3 months ago • written 3 months ago by kirannbishwa01790

And, here is another problem. Several FORMAT level tags are missing. See the example below.

In the single sample VCF's.

sample "MA605" - "PI" format field exits

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=17;ExcessHet=0.6367;FS=0.000;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    ./.:0,0:0:.:0,0,0:./.:.:.:./.:.:.

sample "ms01e" - no record at position 2 180505

sample "Sp3" - "PI" field present

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=17;ExcessHet=0.6367;FS=0.000;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels  GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM    1/1:0,3:3:9:96,9,0:1/1:.:.:1/1:.:.

But, the merge VCFs has "PI" tag missing at all.

2   180505  .   AT  A   73.06   PASS    AC=3;AF=0.214;AN=14;BaseQRankSum=0;ClippingRankSum=0;DP=204;ExcessHet=0.6367;FS=0;MQ=36.63;MQRankSum=0.967;QD=24.35;ReadPosRankSum=0.967;SOR=0.693;set=InDels   GT:AD:DP:GQ:PG:PL:PW    ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,34:0/0  0/0:3,0:3:9:0/0:0,9,82:0/0  ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   ./.:0,0:0:.:./.:0,0,0:./.   0/0:3,0:3:0:0/0:0,0,4:0/0   0/0:2,0:2:6:0/0:0,6,80:0/0  0/0:1,0:1:3:0/0:0,3,36:0/0  1/1:0,3:3:9:1/1:96,9,0:1/1  0/1:2,1:3:23:0/1:23,0,54:0/1    ./.:1,0:1:.:./.:0,0,0:./.   ./. ./. ./. ./.
ADD REPLYlink modified 3 months ago • written 3 months ago by kirannbishwa01790

I think the reason the "PI" are missing is because all the samples at that record don't have meaningful "PI" value. But, regardless of that shouldn't it be possible to have it in "merged VCFs".

ADD REPLYlink written 3 months ago by kirannbishwa01790

I see that you're using some GATK tools, a program suite for which hardly anything is compatible, but that's an issue for the Broad Institute to work on.

For the particular example that you mention just now (chr2:180505), the data that you say is missing is neither in the input VCF files going into the BCFtools command. So, it's an issue even prior to that... perhaps to do with phASER.

For the PL example that you mention, the correct PL values are indeed there for that particular sample - just look further along the line.

The VCFs coming out of phASER, generally, do not look like they are in good shape. I don't know the exact specifics of your experiment to comment much further. It looks messy because there are a lot of reference calls in the data and a lot of missing FORMAT values.

Sorry that I cannot help further for now.

ADD REPLYlink written 3 months ago by Kevin Blighe21k

Kevin, This merging has been a big pain for me. GATK CombineVariants is throwing an error which I can't fix. And, bcftools now seem to be giving up too.

I think I will need to the author of "phaser" to look into this problem.

Thanks,

ADD REPLYlink written 3 months ago by kirannbishwa01790

Yes, that is a better bet. The author will undoubtedly have encountered these issues before. Had I known that you were eventually using GATK, I would not have suggested BCFtools. SAMtools / BCFtools don't interact very well with GATK functions because the GATK likes to add numerous different and specific (to the GATK) encodings.

You can email the author here: https://stephanecastel.wordpress.com/

Best of luck.

ADD REPLYlink written 3 months ago by Kevin Blighe21k

Yeah, GATK doesn't seem to be working well when data are piped in, if it came out of the GATK workflow.

I have been in touch with Stephane Castel over the issue several times. But, it seems he is busy and not able to look over the problem. Or, prolly he has moved on.

I know a little bit of python but it just kills most of my time; and I may be reinventing the wheel, when a fast and workable solution is already out there.

Thanks much though.

ADD REPLYlink written 3 months ago by kirannbishwa01790

Yes, that happens: they get their publication and then move on.

What if you tried to merge the VCFs prior to phASER, and then ran phASER on the merged VCF? I presume that phASER supports that.

ADD REPLYlink written 3 months ago by Kevin Blighe21k
1

My vcf's are merged and validated before piping into "phaser". The issue is "phaser" exclusively outputs single sample VCFs, which needs remerging - and when I try to merge I run into problems. I have not been able to find a tools that can fix the VCF format issue (but has correct information). Let's see what I can work out.

ADD REPLYlink written 3 months ago by kirannbishwa01790

I have to add that it's not "PL" tag and values but "pi" tags and values. I think the character differences caused the issue.

ADD REPLYlink written 3 months ago by kirannbishwa01790
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1041 users visited in the last hour