Question

Using bcftools to change ploidy level in vcf files

0

Entering edit mode

7.0 years ago

flolil27 • 0

hello

I am facing problem with bcftools. I have a VCF file containing variants from a haploid fungus. I would like to use the --haploid2diploid option to convert haploid genotypes to homozygous diploid genotypes.

I have tried using bcftools convert --haploid2diploid -h hapfile.vcf > dipfile.vcf

but the dipfile.vcf is still a haploid vcf.

Does anyone know how I could do that with bcftools? Is there another way to convert my haploid vcf file?

Thanks

snp bcftools • 7.9k views

ADD COMMENT • link updated 13 months ago by Jimmy ▴ 30 • written 7.0 years ago by flolil27 • 0

0

Entering edit mode

Hi Flolil,

I am facing the same problem, did you find a solution?

Thanks, Elisabet

ADD REPLY • link 6.5 years ago by elisabet.sjokvist • 0

score 0 · Answer 1 · 2017-11-07

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html . It fix the haploid, genotype , AC and AN. I'm too lazty to fix things like AF (shouldn't change ?) , etc...

    $  wget -q -O - "https://raw.githubusercontent.com/CostaLab/practical_SS2015/598ea0dddf2ef073a55ae21bc6d39ac2172eb617/data_analysis/organisms/escherichia_coli/O157H7_Sakai/IonTorrentPGM_mem/SRX185723/SRR566635/SRR566635-snps.vcf" |\
       java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->!G.isCalled()?GenotypeBuilder.createMissing(G.getSampleName(),2):G).map(G->G.isCalled() && G.getPloidy()==1?new GenotypeBuilder(G).alleles(Arrays.asList(G.getAllele(0),G.getAllele(0))).make():G).collect(Collectors.toList())).attribute("AC",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 && !G.getAllele(0).isReference()?2:G.getAlleles().size()).sum()).attribute("AN",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 ?2:G.getAlleles().size()).sum()).make();'
    (...)

gi|15829254|ref|NC_002695.1|    270484  .   G   C   2747    .   AC=2;AF=1.00;AN=2;BaseQRankSum=1.612;DP=116;Dels=0.00;FS=0.000;HaplotypeScore=3.4380;MLEAC=1;MLEAF=1.00;MQ=59.89;MQ0=0;MQRankSum=1.521;QD=23.68;ReadPosRankSum=-1.009;SOR=0.268 GT:AD:DP:GQ:PL  1/1:1,115:116:99:2777,0
gi|15829254|ref|NC_002695.1|    270502  .   G   A   2924    .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.948;DP=106;Dels=0.00;FS=0.000;HaplotypeScore=2.6493;MLEAC=1;MLEAF=1.00;MQ=59.26;MQ0=0;MQRankSum=1.667;QD=27.58;ReadPosRankSum=1.667;SOR=0.281  GT:AD:DP:GQ:PL  1/1:1,105:106:99:2954,0
gi|15829254|ref|NC_002695.1|    270545  .   A   C   2121    .   AC=2;AF=1.00;AN=2;BaseQRankSum=2.084;DP=109;Dels=0.01;FS=5.100;HaplotypeScore=23.5544;MLEAC=1;MLEAF=1.00;MQ=49.74;MQ0=3;MQRankSum=-1.031;QD=19.46;ReadPosRankSum=-0.760;SOR=0.425   GT:AD:DP:GQ:PL  1/1:5,103:108:99:2151,0
gi|15829254|ref|NC_002695.1|    270562  .   G   C   2366    .   AC=2;AF=1.00;AN=2;DP=104;Dels=0.00;FS=0.000;HaplotypeScore=5.2793;MLEAC=1;MLEAF=1.00;MQ=48.85;MQ0=3;QD=22.75;SOR=0.818  GT:AD:DP:GQ:PL  1/1:1,103:104:99:2396,0
gi|15829254|ref|NC_002695.1|    270577  .   G   A   559 .   AC=2;AF=1.00;AN=2;BaseQRankSum=4.123;DP=63;Dels=0.00;FS=67.351;HaplotypeScore=61.7544;MLEAC=1;MLEAF=1.00;MQ=41.28;MQ0=3;MQRankSum=4.843;QD=8.87;ReadPosRankSum=-2.054;SOR=7.167 GT:AD:DP:GQ:PL  1/1:28,35:63:99:589,0
gi|15829254|ref|NC_002695.1|    270584  .   G   A   1197    .   AC=2;AF=1.00;AN=2;DP=57;Dels=0.00;FS=0.000;HaplotypeScore=32.4027;MLEAC=1;MLEAF=1.00;MQ=38.79;MQ0=3;QD=21.00;SOR=0.931  GT:AD:DP:GQ:PL  1/1:0,57:57:99:1227,0
gi|15829254|ref|NC_002695.1|    270596  .   T   G   327 .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.330;DP=51;Dels=0.02;FS=0.000;HaplotypeScore=28.8469;MLEAC=1;MLEAF=1.00;MQ=35.57;MQ0=3;MQRankSum=-0.578;QD=6.41;ReadPosRankSum=1.238;SOR=0.613  GT:AD:DP:GQ:PL  1/1:1,48:50:99:357,0
gi|15829254|ref|NC_002695.1|    270598  .   C   G   809 .   AC=2;AF=1.00;AN=2;BaseQRankSum=2.513;DP=52;Dels=0.02;FS=0.000;HaplotypeScore=29.6561;MLEAC=1;MLEAF=1.00;MQ=36.19;MQ0=3;MQRankSum=-0.852;QD=15.56;ReadPosRankSum=0.256;SOR=0.303 GT:AD:DP:GQ:PL  1/1:3,48:51:99:839,0
gi|15829254|ref|NC_002695.1|    270627  .   A   G   2705    .   AC=2;AF=1.00;AN=2;DP=103;Dels=0.01;FS=0.000;HaplotypeScore=5.0818;MLEAC=1;MLEAF=1.00;MQ=50.91;MQ0=2;QD=26.26;SOR=0.957  GT:AD:DP:GQ:PL  1/1:0,102:102:99:2735,0
gi|15829254|ref|NC_002695.1|    270646  .   C   A   2337    .   AC=2;AF=1.00;AN=2;DP=81;Dels=0.00;FS=0.000;HaplotypeScore=7.7758;MLEAC=1;MLEAF=1.00;MQ=56.60;MQ0=0;QD=28.85;SOR=1.416   GT:AD:DP:GQ:PL  1/1:0,81:81:99:2367,0

score 0 · Answer 2 · 2022-02-08

0

Entering edit mode

2.2 years ago

taprs • 0

A (terrible) workaround with sed gets the thing done quick and dirty. Works for me for PopGenome library VCF input.

bcftools view file.vcf.gz | sed -r -e 's@\t([a-Z0-9.]):@\t\1/\1:@g' | bcftools view -Oz -o file_hap2dip.vcf.gz

ADD COMMENT • link 2.2 years ago by taprs • 0

0

Entering edit mode

Hey, I'm trying to find how to convert my diploid vcf file to a triploid one, and came across your sed code, but I have to admit I don't get what it does (and cannot find where the ploidy info is in a vcf file). Would you mind explaining? Thanks!

ADD REPLY • link 20 months ago by alnam ▴ 10

0

Entering edit mode

Hey! Sorry it's been a while.

Ploidy level in VCF file can be deduced for each sample by number of positions in the respective GT field, e.g. GT value of 1/1 in the example above indicates diploid homozygous genotype with alternative allele in both positions. If you are confused, take a look at the VCF format specification.

The sed code I provided will only work as-is for haploid-to-diploid transformation. It is a search-and-replace command (denoted here by s@...@...@g pattern). In essence, it does the following:

find haploid GT values (the \t([a-Z0-9.]): search pattern works because GT field is normally the first in the tab-delimited sample cell and is normally followed by other fields separated by colons;
duplicate the found character in-place and add the necessary / separator (\1 stands for the matched pattern in brackets).

If you are confused, check out any sed manual; I personally like this one. Note that my command is not optimal (e.g. now I see that the search pattern should rather be \t[0-9.]: because GT fields contain no numbers...)

With diploid-to-triploid conversion, the solution would depend on how you are planning to treat heterozygous positions.

ADD REPLY • link 13 months ago by taprs • 0

score 0 · Answer 3 · 2023-03-13

Python solution to diploidizing a haploid VCF (fairly faster than the Bash solution also posted on this question).

hap_to_dip = {'0':'0/0', '1':'1/1'}

with open("haploid.vcf","r") as file, open("diplodized.vcf","w") as s_file:
        for line in file:
            if line[0] == '#':
                s_file.write(line)
            else:
                rowstr = line.split('\t')[9:]
                for i in range(len(rowstr)):
                    rowstr[i] = hap_to_dip[rowstr[i][0]] + rowstr[i][1:]
                row = '\t'.join(str(line).split('\t')[:9]) + '\t' + '\t'.join(rowstr)
                s_file.write(row)

Here's a cleaner but slightly slower version

import csv

hap_to_dip = {'0': '0/0', '1': '1/1'}
with open("haploid.vcf","r") as file, open("diplodized.vcf","w") as s_file:
    reader = csv.reader(file, delimiter="\t")
    writer = csv.writer(s_file, delimiter="\t")
    for row in reader:
        if row[0].startswith("#"):
            s_file.write(str('\t'.join(row)+'\n'))
        else:
            row[9:] = [hap_to_dip.get(field[0], field) + field[1:] for field in row[9:]]
            writer.writerow(row)