Convert genotype to nucleotide genotype?
8.4 years ago
SheelS ▴ 40

Hi there, I have a CSV file that looks like the one below, and I would like to convert the genotypes (A or B) to nucleotide genotypes (A, T, G, C).

Marker, alleleA, alleleB,  id1, id1, id2, id2,    # id is people
rs1,          G,       T,    A,   A,   A,   B,
rs2,          A,       G,    B,   A,   A,   A, 
rs3,          C,       T,    0,   0,   A,   B,    # 0 is missing

After converting:

Marker, alleleA, alleleB,  id1, id1, id2, id2,    
rs1,          G,       T,    G,   G,   G,   T, # for rs1, the 'A' is replaced by G, and 'B' by T
rs2,          A,       G,    G,   A,   A,   A, 
rs3,          C,       T,    0,   0,   C,   T,

Does anyone know a smart way to do this?

Thanks in advance for any possible solution or advice!

R genotype python • 2.5k views
8.4 years ago

An almost functional Python script (so as not to completely spoil the fun for you). Poorly indented.

import csv, sys

# csv.reader needs an open file object, not a file name
with open(sys.argv[1]) as infile, open('converted.csv', 'w') as output:
    for line in csv.reader(infile, delimiter=','):
        genotypes = []
        pos = line[0]
        alleleA = line[1]
        alleleB = line[2]
        for samplefield in line[3:]:
            if samplefield == 'A':
                genotypes.append(alleleA)
            elif samplefield == 'B':
                genotypes.append(alleleB)
            else:
                sys.exit('Unexpected input!')
        output.write("{}\t{}\t{}\t{}\n".format(pos, alleleA, alleleB, '\t'.join(genotypes)))

Let me know if you need some more help getting it working.


Thanks, but sorry, it looks like it does not work. I get IndexError: list index out of range at alleleA = line[1].


Oh crap, I made a mistake; will edit. The delimiter should be ','.


Sorry, but the issue is still there. Any ideas?


This should do the trick now. My bad, I wrote some bad pseudocode in the almost functional script above. This script definitely isn't the most efficient if you have millions of lines, but it should do the job. Warning: Python 2.7 syntax. I could write it a few lines shorter, but readability counts and explicit is better than implicit, ya know ;-)

import csv, sys

with open(sys.argv[1]) as input, open('converted.csv', 'w') as output:
    data = csv.reader(input, delimiter=',')
    for line in data:
        if line[0].startswith('Marker'): #Catch the headerline
            output.write("{}\n".format(','.join(line)))
        else:
            genotypes = [] 
            pos = line[0]
            alleleA = line[1]
            alleleB = line[2]
            for samplefield in line[3:]:
                if samplefield == 'A':
                    genotypes.append(alleleA)
                elif samplefield == 'B':
                    genotypes.append(alleleB)
                elif samplefield == '0':
                    genotypes.append('0')
                else:
                    sys.exit('!! Unexpected input: {}'.format(','.join(line)))
            output.write("{},{},{},{}\n".format(pos, alleleA, alleleB, ','.join(genotypes)))

I wonder where the trailing comma at the end of each line in your example comes from. Correct me if I'm wrong, but is that common for a CSV file? To easily remove it:

sed -i 's/,[[:space:]]*$//' input.csv
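
If editing the input file is not an option, the trailing comma can also be handled during parsing. The following is only a minimal Python 3 sketch, assuming the real file looks like the example in the question (padding spaces after each comma, a trailing comma, possibly blank lines) but without the inline # comments. The function name convert_rows and the lookup dictionary are made up for illustration; they are not part of the scripts above.

```python
import csv

def convert_rows(lines):
    """Yield converted rows; `lines` is any iterable of CSV text lines."""
    # skipinitialspace drops the padding spaces that follow each comma
    for row in csv.reader(lines, skipinitialspace=True):
        row = [field.strip() for field in row]
        # A trailing comma produces an empty last field; drop it
        while row and row[-1] == '':
            row.pop()
        if not row:                # blank line: nothing to convert
            continue
        if row[0] == 'Marker':     # header line is passed through
            yield row
            continue
        marker, allele_a, allele_b = row[:3]
        # 'A'/'B' are genotype codes, '0' marks a missing call
        lookup = {'A': allele_a, 'B': allele_b, '0': '0'}
        yield [marker, allele_a, allele_b] + [lookup[g] for g in row[3:]]

# Assumed sample, copied from the question without the inline comments
sample = [
    "Marker, alleleA, alleleB,  id1, id1, id2, id2,",
    "rs1,          G,       T,    A,   A,   A,   B,",
    "rs2,          A,       G,    B,   A,   A,   A, ",
    "",
    "rs3,          C,       T,    0,   0,   A,   B,",
]
for converted in convert_rows(sample):
    print(','.join(converted))
```

An unexpected genotype code raises a KeyError here instead of calling sys.exit, and a row with fewer than three fields raises a ValueError, so malformed lines still stop the run rather than being silently skipped.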

Sorry, but it's still the same issue: list out of range. Would you mind if I take your code and ask about it somewhere else? No offense; that way we can find out how to solve it! :)


Well, the code works here. But you can do whatever you want with my code :)
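
When a script runs on one machine and raises "list out of range" on another, the input file is a likely suspect. One possibility (a guess, since the actual file was never posted) is that some line splits into fewer than three fields, for example a blank line or a line using a different delimiter, so that line[1] or line[2] does not exist. The sketch below lists such lines; the helper name find_short_rows and the sample data are hypothetical.

```python
import csv

def find_short_rows(lines, min_fields=3):
    """Return (line_number, row) pairs for rows with fewer than min_fields fields."""
    problems = []
    reader = csv.reader(lines, delimiter=',')
    for row in reader:
        if len(row) < min_fields:
            # reader.line_num is the number of source lines read so far
            problems.append((reader.line_num, row))
    return problems

# Made-up sample: line 3 is blank, line 4 uses ';' instead of ','
sample = [
    "Marker,alleleA,alleleB,id1,id1",
    "rs1,G,T,A,B",
    "",
    "rs2;A;G;B;A",
]
for line_number, row in find_short_rows(sample):
    print(line_number, row)
```

To check a real file, pass an open file object instead of the list, e.g. find_short_rows(open('input.csv', newline='')).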
