Question: Convert the genotype format from numeric to letters
1
gravatar for waqaskhokhar999
25 days ago by
waqaskhokhar99940 wrote:

I want to use IVAS for sQTL analysis and it accepts only allelic encoding of genotypes, so that they should be two letters of A,C,G,T

The format of my vcf file is like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    0|0:5:40    0|0:9:40    0|0:6:38    ./.:.:.

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    1|1:5:40    1|1:9:40    0|0:8:38    ./.:.:.

I want to convert the genotype format from numeric to letters

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    CC  CC  CA  NA

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    TC  TC  TT  NA
snp vcf • 162 views
ADD COMMENTlink modified 25 days ago by Leon90 • written 25 days ago by waqaskhokhar99940
0
gravatar for Leon
25 days ago by
Leon90
Leon90 wrote:
#!/usr/bin/env python2
#!/coding:utf-8
import sys
import os
import gzip


class Letter(object):
    standard_nt = ['A', 'C', 'G', 'T']

    def __init__(self, vcf):
        self.vcf = vcf
        self.get_head()

    def _open(self):
        if os.path.splitext(self.vcf)[-1] == '.gz':
            input = gzip.open(self.vcf)
        else:
            input = open(self.vcf)
        return input

    def get_head(self):
        input = self._open()
        self.vcfhead = []
        for h in input:
            if h.startswith('#'):
                self.vcfhead.append(h)
            else:
                break
        input.close()

    def parse_vcf(self):
        input = self._open()
        for i in input:
            if i.startswith('#'):
                continue
            else:
                line = i.strip().split()
                line[7]='.'
                line[8]='GT'
                ref = line[3]
                alt = line[4]
                if len(ref) > 1 or len(alt) > 1:
                    print 'only accept BIALLELIC SNP\nremove this site\n'
                    continue
                elif ref not in self.standard_nt or alt not in self.standard_nt:
                    print 'only accept valid nucleotide\n'
                    continue
                else:
                    part1 = '\t'.join(line[0:9])
                    sample_gt = []
                    for gt in line[9:]:
                        gt = gt.split(':')[0]
                        if gt == '0|0' or gt == '0/0':
                            letter = ref*2
                        elif gt == '0|1'or gt == '0/1':
                            letter = ref+alt
                        elif gt == '1|1'or gt == '1/1':
                            letter = alt*2
                        else:
                            letter = 'NA'
                        sample_gt.append(letter)
                    sample_gt = '\t'.join(sample_gt)
                    all = part1+'\t'+sample_gt
                    yield all
        input.close()

    def write_out(self):
        with gzip.open('letter.'+self.vcf, 'w') as f:
            f.write(''.join(self.vcfhead))
            for line in self.parse_vcf():
                f.write(line+'\n')

if __name__ == '__main__':
    vcf = sys.argv[1]
    Letter(vcf).write_out()

save the script as letter.py,run `python letter.py yours.vcf(.gz format is ok too).The file with 'letter.' prefix is the result file.

ADD COMMENTlink written 25 days ago by Leon90

Many thanks, the genotype format of my vcf file is like this 0|0, 1|1, 2|2, ./. so I replaced

if gt == '0|0' or gt == '0/0':
                        letter = ref*2
                    elif gt == '0|1'or gt == '0/1':
                        letter = ref+alt
                    elif gt == '1|1'or gt == '1/1':
                        letter = alt*2

by

if gt == '0|0' or gt == '0/0':
                        letter = ref*2
                    elif gt == '1|1'or gt == '1/1':
                        letter = ref+alt
                    elif gt == '2|2'or gt == '2/2':
                        letter = alt*2

It needs parenthesis in print statement so I added parenthesis

(print 'only accept BIALLELIC SNP\nremove this site\n')

But I am getting this error:

Traceback (most recent call last):
  File "letter.py", line 76, in <module>
    Letter(vcf).write_out()
  File "letter.py", line 70, in write_out
    f.write(''.join(self.vcfhead))
  File "/home/waqas/miniconda3/lib/python3.6/gzip.py", line 260, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
ADD REPLYlink written 24 days ago by waqaskhokhar99940

Actually, it is running in py2 not py3. By the way, it is a little weird. Is standard format of your vcf ? If so, I hope my code will resolve your problem without change. But if you just want to use 0 represent refHOM ,1 represent HET, 2 represent altHOM, i think you will meet some problems in data format when use the script.

ADD REPLYlink written 24 days ago by Leon90

Many thanks for your response, I have downloaded vcf file from here. |In vcf file the genotype fromats are like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    0|0:5:40    0|0:9:40    0|0:6:38    ./.:.:.

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    1|1:5:40    1|1:9:40    0|0:8:38    ./.:.:.

I was thinking if your code tries to find 1|0 or 0|1 and tries to replace it with het as I tried to grep 0|1 and 1|0 from vcf file and did not get any count.

ADD REPLYlink written 23 days ago by waqaskhokhar99940

[vcf format]https://faculty.washington.edu/browning/beagle/intro-to-vcf.html.It is useful.

ADD REPLYlink written 23 days ago by Leon90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1099 users visited in the last hour