Question: bcftools-merge for vcf files
0
gravatar for evelyn
14 months ago by
evelyn110
evelyn110 wrote:

Hello,

I have three vcf files created with bwa and bcftools. I have merged these three files using bcftools-merge command as following:

bcftools merge file1.vcf.gz fle2.vcf.gz file3.vcf.gz > out.vcf

And a snippet for file out.vcf looks like this:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT      file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
    1   687 .   G   A   42.4147 .   VDB=0.0298006;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   689 .   G   A   42.4147 .   VDB=0.0298006;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   701 .   T   A   42.4147 .   VDB=0.0221621;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   704 .   T   G   42.4147 .   VDB=0.0190094;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   708 .   C   T,A 20.4535 .   SGB=-0.379885;MQ0F=0;MQ=50;DP=2;DP4=0,0,0,2;AN=4;AC=2,2 GT:PL   1/1:50,3,0,.,.,.    ./.:.   2/2:40,.,.,3,.,0

Here, for example, for position 708, there are two ALT bases (T, A) which means at this position for reference base C, file1.sorted.bam has an ALT base, T; file2.sorted.bam has the same base as a reference that's why it has ./.:. and file3.sorted.bam has an ALT base, A.

So, using bcftools I want to make a file which looks something like this where the individual file columns have their assigned bases (both missing as reference base and an ATL base) so that I can use this vcf for further analysis:

#CHROM  ID      REF ALT     file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
1       687     G   A           G   A   G
1       689     G   A           G   A   G
1       701     T   A           T   A   T
1       704     T   G           T   G   T
1       708     C   T,A         T   C   A

I have looked into other posts where GATK is used for such jobs but it also requires to use GATK to call the variants. For this analysis, I have to use bcftools for calling variants but I can use other software to merge the vcf files to get the required result.

Thank you so much for your help!

snp bcftools • 1.8k views
ADD COMMENTlink modified 14 months ago by zx87549.7k • written 14 months ago by evelyn110

So, using bcftools I want to make a vcf which looks something like this where the individual file columns have their assigned bases (both missing as reference base and an ATL base) so that I can use this vcf for further analysis:

That's not a VCF anymore, and I would strongly recommend to stick to commonly used file formats. That said, if I would have to solve your issue I would use python and the cyvcf2 module.

ADD REPLYlink written 14 months ago by WouterDeCoster44k

That was my mistake to say I want a vcf file. You are right it won't be a vcf anymore. I have edited my question. Thank you for the suggestion. I have never used python or the cyvcf2 module. I will try them. Thank you!

ADD REPLYlink written 14 months ago by evelyn110
2
gravatar for finswimmer
14 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

This will give you not the exact output as requested, but with another python/awk/... script you can do it quite easily:

bcftools +missing2ref out.vcf.vcf | bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n'
1   687 .   G   A   G/G A/A G/G
1   689 .   G   A   G/G A/A G/G
1   701 .   T   A   T/T A/A T/T
1   704 .   T   G   T/T G/G T/T
1   708 .   C   T,A T/T C/C A/A

How would heterozygous position will look like in your desired output?

ADD COMMENTlink written 14 months ago by finswimmer13k

Hello @finswimmer,

Thank you for your suggestion! I just found another post which says ./.:. means a missing observation and 0/0 means the same as the reference base. So I think I need the output something like this by replacing ./.:. with - and a 0/0 replaced by reference base :

 #CHROM  ID      REF ALT     file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
    1       687     G   A           -   A   -
    1       689     G   A           -   A   -
    1       701     T   A           -   A   -
    1       704     T   G           -   G   -
    1       708     C   T,A         T   -   A

Thank you for pointing out heterozygous positions. I have pasted an example from the vcf file here:

   1    6440    .   G   T   21.279  . VDB=0.0264014;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0.25;MQ=15;ICB=1;HOB=0.5;DP=7;DP4=0,2,5,0;AN=4;AC=3 GT:PL   1/1:48,5,0  0/1:35,0,19 ./.:.

Here, for file2, the call is 0/1 and I was thinking to replace heterozygous calls with an N. I did not add this in my question earlier. But is there a way to address this along with homozygous positions?

Thank you for your help!

ADD REPLYlink modified 14 months ago • written 14 months ago by evelyn110
1

just found another post which says ./.:. means a missing observation and 0/0 means the same as the reference base.

That is correct. So if you don't want to replace ./. with the reference, then you can omit the bcftools +missing2ref.

Try this to get your desired output:

input.vcf looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
1   687 .   G   A   42.4147 .   .   GT  ./. 1/1 ./.
1   689 .   G   A   42.4147 .   .   GT  ./. 1/1 ./.
1   701 .   T   A   42.4147 .   .   GT  ./. 1/1 ./.
1   704 .   T   G   42.4147 .   .   GT  ./. 1/1 ./.
1   708 .   C   T,A 20.4535 .   .   GT  1/1 ./. 2/2
1   6440    .   G   T   21.279  .   .   GT  1/1 0/1 ./.

The run:

$ bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n' merge.vcf | awk -v FS="\t" -v OFS="\t" '{for(i=6;i<=NF;i++) {split($i, gt, "/"); if(gt[1]==".") $i="-"; else if(gt[1]==gt[2]) $i=gt[1]; else $i="N";} print }'
1   687 .   G   A   -   A   -
1   689 .   G   A   -   A   -
1   701 .   T   A   -   A   -
1   704 .   T   G   -   G   -
1   708 .   C   T,A T   -   A
1   6440    .   G   T   T   N   -

The bcftools query give as the chrom, pos, id, ref, alt and the translated genotypes.

The awk script iterates over the genotypes and split them by /. If value 1 is equal to . it replaces the column with -. If value 1 is equal to value 2 it replaces the column with value 1. In all other cases it replaces the column with N.

ADD REPLYlink written 14 months ago by finswimmer13k

Thank you so much for your help! It works great. I understood the code as well. However, I am trying to keep the vcf file names to use this file for further analysis. I used -H option from bcftools query and it gives this header:

# [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  N   N   N

It replaced the vcf file names with N. Thank you for your time!

ADD REPLYlink modified 14 months ago • written 14 months ago by evelyn110

See my answer here.

ADD REPLYlink modified 14 months ago • written 14 months ago by finswimmer13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1297 users visited in the last hour