Using bcftools to change ploidy level in vcf files
3
0
Entering edit mode
7.0 years ago
flolil27 • 0

hello

I am facing problem with bcftools. I have a VCF file containing variants from a haploid fungus. I would like to use the --haploid2diploid option to convert haploid genotypes to homozygous diploid genotypes.

I have tried using bcftools convert --haploid2diploid -h hapfile.vcf > dipfile.vcf

but the dipfile.vcf is still a haploid vcf.

Does anyone know how I could do that with bcftools? Is there another way to convert my haploid vcf file?

Thanks

snp bcftools • 7.9k views
ADD COMMENT
0
Entering edit mode

Hi Flolil,

I am facing the same problem, did you find a solution?

Thanks, Elisabet

ADD REPLY
0
Entering edit mode
6.4 years ago

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html . It fix the haploid, genotype , AC and AN. I'm too lazty to fix things like AF (shouldn't change ?) , etc...

    $  wget -q -O - "https://raw.githubusercontent.com/CostaLab/practical_SS2015/598ea0dddf2ef073a55ae21bc6d39ac2172eb617/data_analysis/organisms/escherichia_coli/O157H7_Sakai/IonTorrentPGM_mem/SRX185723/SRR566635/SRR566635-snps.vcf" |\
       java -jar dist/vcffilterjdk.jar -e 'return new VariantContextBuilder(variant).genotypes(variant.getGenotypes().stream().map(G->!G.isCalled()?GenotypeBuilder.createMissing(G.getSampleName(),2):G).map(G->G.isCalled() && G.getPloidy()==1?new GenotypeBuilder(G).alleles(Arrays.asList(G.getAllele(0),G.getAllele(0))).make():G).collect(Collectors.toList())).attribute("AC",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 && !G.getAllele(0).isReference()?2:G.getAlleles().size()).sum()).attribute("AN",variant.getGenotypes().stream().mapToInt(G->G.isCalled() && G.getPloidy()==1 ?2:G.getAlleles().size()).sum()).make();'
    (...)

gi|15829254|ref|NC_002695.1|    270484  .   G   C   2747    .   AC=2;AF=1.00;AN=2;BaseQRankSum=1.612;DP=116;Dels=0.00;FS=0.000;HaplotypeScore=3.4380;MLEAC=1;MLEAF=1.00;MQ=59.89;MQ0=0;MQRankSum=1.521;QD=23.68;ReadPosRankSum=-1.009;SOR=0.268 GT:AD:DP:GQ:PL  1/1:1,115:116:99:2777,0
gi|15829254|ref|NC_002695.1|    270502  .   G   A   2924    .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.948;DP=106;Dels=0.00;FS=0.000;HaplotypeScore=2.6493;MLEAC=1;MLEAF=1.00;MQ=59.26;MQ0=0;MQRankSum=1.667;QD=27.58;ReadPosRankSum=1.667;SOR=0.281  GT:AD:DP:GQ:PL  1/1:1,105:106:99:2954,0
gi|15829254|ref|NC_002695.1|    270545  .   A   C   2121    .   AC=2;AF=1.00;AN=2;BaseQRankSum=2.084;DP=109;Dels=0.01;FS=5.100;HaplotypeScore=23.5544;MLEAC=1;MLEAF=1.00;MQ=49.74;MQ0=3;MQRankSum=-1.031;QD=19.46;ReadPosRankSum=-0.760;SOR=0.425   GT:AD:DP:GQ:PL  1/1:5,103:108:99:2151,0
gi|15829254|ref|NC_002695.1|    270562  .   G   C   2366    .   AC=2;AF=1.00;AN=2;DP=104;Dels=0.00;FS=0.000;HaplotypeScore=5.2793;MLEAC=1;MLEAF=1.00;MQ=48.85;MQ0=3;QD=22.75;SOR=0.818  GT:AD:DP:GQ:PL  1/1:1,103:104:99:2396,0
gi|15829254|ref|NC_002695.1|    270577  .   G   A   559 .   AC=2;AF=1.00;AN=2;BaseQRankSum=4.123;DP=63;Dels=0.00;FS=67.351;HaplotypeScore=61.7544;MLEAC=1;MLEAF=1.00;MQ=41.28;MQ0=3;MQRankSum=4.843;QD=8.87;ReadPosRankSum=-2.054;SOR=7.167 GT:AD:DP:GQ:PL  1/1:28,35:63:99:589,0
gi|15829254|ref|NC_002695.1|    270584  .   G   A   1197    .   AC=2;AF=1.00;AN=2;DP=57;Dels=0.00;FS=0.000;HaplotypeScore=32.4027;MLEAC=1;MLEAF=1.00;MQ=38.79;MQ0=3;QD=21.00;SOR=0.931  GT:AD:DP:GQ:PL  1/1:0,57:57:99:1227,0
gi|15829254|ref|NC_002695.1|    270596  .   T   G   327 .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.330;DP=51;Dels=0.02;FS=0.000;HaplotypeScore=28.8469;MLEAC=1;MLEAF=1.00;MQ=35.57;MQ0=3;MQRankSum=-0.578;QD=6.41;ReadPosRankSum=1.238;SOR=0.613  GT:AD:DP:GQ:PL  1/1:1,48:50:99:357,0
gi|15829254|ref|NC_002695.1|    270598  .   C   G   809 .   AC=2;AF=1.00;AN=2;BaseQRankSum=2.513;DP=52;Dels=0.02;FS=0.000;HaplotypeScore=29.6561;MLEAC=1;MLEAF=1.00;MQ=36.19;MQ0=3;MQRankSum=-0.852;QD=15.56;ReadPosRankSum=0.256;SOR=0.303 GT:AD:DP:GQ:PL  1/1:3,48:51:99:839,0
gi|15829254|ref|NC_002695.1|    270627  .   A   G   2705    .   AC=2;AF=1.00;AN=2;DP=103;Dels=0.01;FS=0.000;HaplotypeScore=5.0818;MLEAC=1;MLEAF=1.00;MQ=50.91;MQ0=2;QD=26.26;SOR=0.957  GT:AD:DP:GQ:PL  1/1:0,102:102:99:2735,0
gi|15829254|ref|NC_002695.1|    270646  .   C   A   2337    .   AC=2;AF=1.00;AN=2;DP=81;Dels=0.00;FS=0.000;HaplotypeScore=7.7758;MLEAC=1;MLEAF=1.00;MQ=56.60;MQ0=0;QD=28.85;SOR=1.416   GT:AD:DP:GQ:PL  1/1:0,81:81:99:2367,0
ADD COMMENT
0
Entering edit mode
2.2 years ago
taprs • 0

A (terrible) workaround with sed gets the thing done quick and dirty. Works for me for PopGenome library VCF input.

bcftools view file.vcf.gz | sed -r -e 's@\t([a-Z0-9.]):@\t\1/\1:@g' | bcftools view -Oz -o file_hap2dip.vcf.gz
ADD COMMENT
0
Entering edit mode

Hey, I'm trying to find how to convert my diploid vcf file to a triploid one, and came across your sed code, but I have to admit I don't get what it does (and cannot find where the ploidy info is in a vcf file). Would you mind explaining? Thanks!

ADD REPLY
0
Entering edit mode

Hey! Sorry it's been a while.

Ploidy level in VCF file can be deduced for each sample by number of positions in the respective GT field, e.g. GT value of 1/1 in the example above indicates diploid homozygous genotype with alternative allele in both positions. If you are confused, take a look at the VCF format specification.

The sed code I provided will only work as-is for haploid-to-diploid transformation. It is a search-and-replace command (denoted here by s@...@...@g pattern). In essence, it does the following:

  • find haploid GT values (the \t([a-Z0-9.]): search pattern works because GT field is normally the first in the tab-delimited sample cell and is normally followed by other fields separated by colons;
  • duplicate the found character in-place and add the necessary / separator (\1 stands for the matched pattern in brackets).

If you are confused, check out any sed manual; I personally like this one. Note that my command is not optimal (e.g. now I see that the search pattern should rather be \t[0-9.]: because GT fields contain no numbers...)

With diploid-to-triploid conversion, the solution would depend on how you are planning to treat heterozygous positions.

ADD REPLY
0
Entering edit mode
13 months ago
Jimmy ▴ 30

Python solution to diploidizing a haploid VCF (fairly faster than the Bash solution also posted on this question).

hap_to_dip = {'0':'0/0', '1':'1/1'}

with open("haploid.vcf","r") as file, open("diplodized.vcf","w") as s_file:
        for line in file:
            if line[0] == '#':
                s_file.write(line)
            else:
                rowstr = line.split('\t')[9:]
                for i in range(len(rowstr)):
                    rowstr[i] = hap_to_dip[rowstr[i][0]] + rowstr[i][1:]
                row = '\t'.join(str(line).split('\t')[:9]) + '\t' + '\t'.join(rowstr)
                s_file.write(row)

Here's a cleaner but slightly slower version

import csv

hap_to_dip = {'0': '0/0', '1': '1/1'}
with open("haploid.vcf","r") as file, open("diplodized.vcf","w") as s_file:
    reader = csv.reader(file, delimiter="\t")
    writer = csv.writer(s_file, delimiter="\t")
    for row in reader:
        if row[0].startswith("#"):
            s_file.write(str('\t'.join(row)+'\n'))
        else:
            row[9:] = [hap_to_dip.get(field[0], field) + field[1:] for field in row[9:]]
            writer.writerow(row)
ADD COMMENT

Login before adding your answer.

Traffic: 2813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6