Question: How to "haploidize" diploid SNPs data in a vcf file
0
gravatar for BioinfoNovice
16 months ago by
France
BioinfoNovice90 wrote:

Dear all,

I have a vcf file containing SNPs data of multiple individuals where most are haploid while some are diploid. What I want to do is to "haploidize" the diploid individual, meaning that I want to to randomly take one allel out for the two for all loci of those individuals.

What kind of tools can I use to do that ? Or what is the scripting manner to adopt ?

For example,

scaffold1       25042   .       G       A       13300.6 PASS    AC=5;AF=0.179;AN=28;BaseQRankSum=-5.920e-01;ClippingRankSum=0.373;DP=1268;ExcessHet=0.7918;FS=5.925;MQ=57.37;MQRankSum=-1.031e+00;QD=32.39;ReadPosRankSum=0.943;SOR=1.255       GT:AD:DP:GQ:PL  0/0:41,0:41:90:0,90,1528        0/1:13,16:29:99:616,0,498       0:56,0:56:99:0,1800     0:66,0:66:99:0,1800     0:82,0:82:99:0,1800     0/0:19,0:19:33:0,33,495

I have the 1st, 2nd and 6th individuals are diploid while the others are haploid...what I want to do is to randomly take one of the two alleles for the diploid individuals... The resulting file could be another vcf file or under the form of tab separated file.

Any suggestion ?

Thank you very much in advance.

snp ploidy vcf • 1.1k views
ADD COMMENTlink modified 16 months ago by Len Trigg1.1k • written 16 months ago by BioinfoNovice90

if you were to work with the same problem but on the plink ped files, here's my approach after recoding alleles as 0/1/2:

import random
import itertools
infile = open("recode12.ped","r")
outfile = open("new_genotypes.ped","w")

check_11 = ('1','1')
check_22 = ('2','2')
def random_gen():
        choice_m = [check_11,check_22]
        res = random.choice(choice_m)
        return res

def parse_ped(ped_infile):
        check_12 = ('1','2')
        check_21 = ('2','1')

        for line in ped_infile:
                row = line.split()
                head = row[0:6]
                genotype_old = row[6:]
                genotype_f = row[6:]
                it = iter(genotype_f)
                genotype_t = zip(it, it)
                for n,i in enumerate(genotype_t):
                        if i == check_12 or i == check_21:
                                genotype_t[n]=random_gen()
                print genotype_t[0:10]
                genotype_t = list(itertools.chain(*genotype_t))
                print genotype_t[0:10]

                new_line = head+genotype_t
                outfile.write(" ".join(new_line)+"\n")
        outfile.close()

parse_ped(infile)
ADD REPLYlink written 16 months ago by stolarek.ir550
1
gravatar for Pierre Lindenbaum
16 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum108k wrote:

My solution using java/javascript nashorn. The script removes the INFO/FILTER/QUAL data and you might get some non-variant (variant with no ALT allele...).

ADD COMMENTlink written 16 months ago by Pierre Lindenbaum108k

Thank you very much for sharing ! I don't know much about java but I will try to use it.

ADD REPLYlink written 16 months ago by BioinfoNovice90

Hi,

I tried using the script. However, I get the error:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  BEN-SA3 BEN-ZOO1    BEN-CI5 BEN-SI5 BEN-CI2 BEN-NE2 BEN-SI1 BEN-NE1 BEN-CI7 BEN-NE3 BEN-ZOO2    BEN-NOR2    BEN-NOR1    BEN-CI6 BEN-CI8 BEN-CI3 BEN-SI3 BEN-ZOO6    BEN-ZOO3    BEN-SI2
biostar236160.js:59 ReferenceError: "Genotype" is not defined

I am not able to understand what is happening. Can you please help fix this error.

Thanks a lot

ADD REPLYlink modified 4 months ago by Pierre Lindenbaum108k • written 4 months ago by anubhabkhan0

strangely the line

var Genotype = Java.type("htsjdk.variant.variantcontext.Genotype");

was missing. I've updated the code.

ADD REPLYlink written 4 months ago by Pierre Lindenbaum108k
1
gravatar for Len Trigg
16 months ago by
Len Trigg1.1k
New Zealand
Len Trigg1.1k wrote:

Here is another simple approach using RTG Tools -- while it's still using javascript, it's got a more AWK'ish flavour than Pierre's solution, so it depends what you're most comfortable with:

$ cat haploidify.js
function record() {
  for (i = 0; i < SAMPLES.length; i++) {
    if (has(SAMPLES[i].GT)) {
      alleles = SAMPLES[i].GT.split(/\D/);
      SAMPLES[i].GT = alleles[Math.floor(Math.random() * alleles.length)];
    }
  }
}
$ rtg vcffilter -i input.vcf -o output.vcf --javascript haploidify.js
ADD COMMENTlink written 16 months ago by Len Trigg1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1565 users visited in the last hour