Using bcftools to change ploidy level in vcf files
Entering edit mode
7.2 years ago
flolil27 • 0


I am facing problem with bcftools. I have a VCF file containing variants from a haploid fungus. I would like to use the --haploid2diploid option to convert haploid genotypes to homozygous diploid genotypes.

I have tried using bcftools convert --haploid2diploid -h hapfile.vcf > dipfile.vcf

but the dipfile.vcf is still a haploid vcf.

Does anyone know how I could do that with bcftools? Is there another way to convert my haploid vcf file?


snp bcftools • 8.4k views
Entering edit mode

Hi Flolil,

I am facing the same problem, did you find a solution?

Thanks, Elisabet

Entering edit mode
6.7 years ago

using vcffilterjdk: . It fix the haploid, genotype , AC and AN. I'm too lazty to fix things like AF (shouldn't change ?) , etc...

    $  wget -q -O - "" |\
       java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->!G.isCalled()?GenotypeBuilder.createMissing(G.getSampleName(),2):G).map(G->G.isCalled() && G.getPloidy()==1?new GenotypeBuilder(G).alleles(Arrays.asList(G.getAllele(0),G.getAllele(0))).make():G).collect(Collectors.toList())).attribute("AC",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 && !G.getAllele(0).isReference()?2:G.getAlleles().size()).sum()).attribute("AN",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 ?2:G.getAlleles().size()).sum()).make();'

gi|15829254|ref|NC_002695.1|    270484  .   G   C   2747    .   AC=2;AF=1.00;AN=2;BaseQRankSum=1.612;DP=116;Dels=0.00;FS=0.000;HaplotypeScore=3.4380;MLEAC=1;MLEAF=1.00;MQ=59.89;MQ0=0;MQRankSum=1.521;QD=23.68;ReadPosRankSum=-1.009;SOR=0.268 GT:AD:DP:GQ:PL  1/1:1,115:116:99:2777,0
gi|15829254|ref|NC_002695.1|    270502  .   G   A   2924    .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.948;DP=106;Dels=0.00;FS=0.000;HaplotypeScore=2.6493;MLEAC=1;MLEAF=1.00;MQ=59.26;MQ0=0;MQRankSum=1.667;QD=27.58;ReadPosRankSum=1.667;SOR=0.281  GT:AD:DP:GQ:PL  1/1:1,105:106:99:2954,0
gi|15829254|ref|NC_002695.1|    270545  .   A   C   2121    .   AC=2;AF=1.00;AN=2;BaseQRankSum=2.084;DP=109;Dels=0.01;FS=5.100;HaplotypeScore=23.5544;MLEAC=1;MLEAF=1.00;MQ=49.74;MQ0=3;MQRankSum=-1.031;QD=19.46;ReadPosRankSum=-0.760;SOR=0.425   GT:AD:DP:GQ:PL  1/1:5,103:108:99:2151,0
gi|15829254|ref|NC_002695.1|    270562  .   G   C   2366    .   AC=2;AF=1.00;AN=2;DP=104;Dels=0.00;FS=0.000;HaplotypeScore=5.2793;MLEAC=1;MLEAF=1.00;MQ=48.85;MQ0=3;QD=22.75;SOR=0.818  GT:AD:DP:GQ:PL  1/1:1,103:104:99:2396,0
gi|15829254|ref|NC_002695.1|    270577  .   G   A   559 .   AC=2;AF=1.00;AN=2;BaseQRankSum=4.123;DP=63;Dels=0.00;FS=67.351;HaplotypeScore=61.7544;MLEAC=1;MLEAF=1.00;MQ=41.28;MQ0=3;MQRankSum=4.843;QD=8.87;ReadPosRankSum=-2.054;SOR=7.167 GT:AD:DP:GQ:PL  1/1:28,35:63:99:589,0
gi|15829254|ref|NC_002695.1|    270584  .   G   A   1197    .   AC=2;AF=1.00;AN=2;DP=57;Dels=0.00;FS=0.000;HaplotypeScore=32.4027;MLEAC=1;MLEAF=1.00;MQ=38.79;MQ0=3;QD=21.00;SOR=0.931  GT:AD:DP:GQ:PL  1/1:0,57:57:99:1227,0
gi|15829254|ref|NC_002695.1|    270596  .   T   G   327 .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.330;DP=51;Dels=0.02;FS=0.000;HaplotypeScore=28.8469;MLEAC=1;MLEAF=1.00;MQ=35.57;MQ0=3;MQRankSum=-0.578;QD=6.41;ReadPosRankSum=1.238;SOR=0.613  GT:AD:DP:GQ:PL  1/1:1,48:50:99:357,0
gi|15829254|ref|NC_002695.1|    270598  .   C   G   809 .   AC=2;AF=1.00;AN=2;BaseQRankSum=2.513;DP=52;Dels=0.02;FS=0.000;HaplotypeScore=29.6561;MLEAC=1;MLEAF=1.00;MQ=36.19;MQ0=3;MQRankSum=-0.852;QD=15.56;ReadPosRankSum=0.256;SOR=0.303 GT:AD:DP:GQ:PL  1/1:3,48:51:99:839,0
gi|15829254|ref|NC_002695.1|    270627  .   A   G   2705    .   AC=2;AF=1.00;AN=2;DP=103;Dels=0.01;FS=0.000;HaplotypeScore=5.0818;MLEAC=1;MLEAF=1.00;MQ=50.91;MQ0=2;QD=26.26;SOR=0.957  GT:AD:DP:GQ:PL  1/1:0,102:102:99:2735,0
gi|15829254|ref|NC_002695.1|    270646  .   C   A   2337    .   AC=2;AF=1.00;AN=2;DP=81;Dels=0.00;FS=0.000;HaplotypeScore=7.7758;MLEAC=1;MLEAF=1.00;MQ=56.60;MQ0=0;QD=28.85;SOR=1.416   GT:AD:DP:GQ:PL  1/1:0,81:81:99:2367,0
Entering edit mode
2.4 years ago
taprs • 0

A (terrible) workaround with sed gets the thing done quick and dirty. Works for me for PopGenome library VCF input.

bcftools view file.vcf.gz | sed -r -e 's@\t([a-Z0-9.]):@\t\1/\1:@g' | bcftools view -Oz -o file_hap2dip.vcf.gz
Entering edit mode

Hey, I'm trying to find how to convert my diploid vcf file to a triploid one, and came across your sed code, but I have to admit I don't get what it does (and cannot find where the ploidy info is in a vcf file). Would you mind explaining? Thanks!

Entering edit mode

Hey! Sorry it's been a while.

Ploidy level in VCF file can be deduced for each sample by number of positions in the respective GT field, e.g. GT value of 1/1 in the example above indicates diploid homozygous genotype with alternative allele in both positions. If you are confused, take a look at the VCF format specification.

The sed code I provided will only work as-is for haploid-to-diploid transformation. It is a search-and-replace command (denoted here by s@...@...@g pattern). In essence, it does the following:

  • find haploid GT values (the \t([a-Z0-9.]): search pattern works because GT field is normally the first in the tab-delimited sample cell and is normally followed by other fields separated by colons;
  • duplicate the found character in-place and add the necessary / separator (\1 stands for the matched pattern in brackets).

If you are confused, check out any sed manual; I personally like this one. Note that my command is not optimal (e.g. now I see that the search pattern should rather be \t[0-9.]: because GT fields contain no numbers...)

With diploid-to-triploid conversion, the solution would depend on how you are planning to treat heterozygous positions.

Entering edit mode
16 months ago
Jimmy ▴ 30

Python solution to diploidizing a haploid VCF (fairly faster than the Bash solution also posted on this question).

hap_to_dip = {'0':'0/0', '1':'1/1'}

with open("haploid.vcf","r") as file, open("diplodized.vcf","w") as s_file:
        for line in file:
            if line[0] == '#':
                rowstr = line.split('\t')[9:]
                for i in range(len(rowstr)):
                    rowstr[i] = hap_to_dip[rowstr[i][0]] + rowstr[i][1:]
                row = '\t'.join(str(line).split('\t')[:9]) + '\t' + '\t'.join(rowstr)

Here's a cleaner but slightly slower version

import csv

hap_to_dip = {'0': '0/0', '1': '1/1'}
with open("haploid.vcf","r") as file, open("diplodized.vcf","w") as s_file:
    reader = csv.reader(file, delimiter="\t")
    writer = csv.writer(s_file, delimiter="\t")
    for row in reader:
        if row[0].startswith("#"):
            row[9:] = [hap_to_dip.get(field[0], field) + field[1:] for field in row[9:]]
Entering edit mode
5 weeks ago
michael ▴ 10

The bcftools plugin fixploidy can convert between ploidy quite easily

bcftools +fixploidy haploid.vcf -- -f 2 > diploid.vcf

here are the options

$ bcftools +fixploidy --help

About: Fix ploidy
Usage: bcftools +fixploidy [General Options] -- [Plugin Options]
   run "bcftools plugin" for a list of common options

Plugin options:
   -d, --default-ploidy <int>  default ploidy for regions unlisted in -p [2]
   -f, --force-ploidy <int>    ignore -p, set the same ploidy for all genotypes
   -p, --ploidy <file>         space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY
   -s, --sex <file>            list of samples, "NAME SEX"
   -t, --tags <list>           VCF tags to fix [GT]

   # Default ploidy, if -p not given. Unlisted regions have ploidy 2
   X 1 60000 M 1
   X 2699521 154931043 M 1
   Y 1 59373566 M 1
   Y 1 59373566 F 0
   MT 1 16569 M 1
   MT 1 16569 F 1

   # Example of -s file, sex of unlisted samples is "F"
   sampleName1 M

   bcftools +fixploidy in.vcf -- -s samples.txt

Login before adding your answer.

Traffic: 2684 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6