Question

Simple python script to convert ambiguity codes

1

Entering edit mode

7.5 years ago

aberry814 ▴ 80

I'm very new to python. I know the basic formula, but I don't really know any commands. I'm learning, but I have to get this done faster than I can learn. I have a tab-delimited file that looks like this:

A R R R A

R W R W R

R M S W K

etc.

and want to generate a new file where I replace certain letters under certain conditions, i.e.:

if all cells in a row are A or R: replace R with G in that row only

if all cells in a row are R and W: replace R with G and W with T in that row only

(+20 other conditions)

so that it looks like:

A G G G A

G T G T G

R M S W K

Nothing should happen if a row has 3 or more unique characters.

The concept would be to read through the first row, see if it matches any conditions, and if it does, write the new output to a file. Iterate through all rows.

file = open(file.txt)
num_columns=numcolumns() # I can manually put in the number of columns for each run, though I'm sure there's a simple command.

for line in file:

    if line.contains('A') and line.contains('R') and (count('A')+count('R')==num_columns()): 

        write.to.newfile.replace('R') with ('G') #I obviously have no idea how to code this

        elif: 

        new_condition #etc. etc. etc.

        else:

        write line to file as is

Thanks for any help and sorry for the noob question!

ambiguity codes SNP • 2.6k views

ADD COMMENT • link updated 7.5 years ago by WouterDeCoster 47k • written 7.5 years ago by aberry814 ▴ 80

0

Entering edit mode

7.5 years ago

aberry814 ▴ 80

Thanks so much for the detailed response! I'll give it a go. Here is what I'm trying to accomplish:

I have a diploid, unphased SNP file, where the heterozygotes are notated with ambiguity codes (eg if locus n is heterozygous for A and G, it is labeled R). I'm building phylogenies with these data, and the samples are so closely related that most of the differences are held within heterozygotes (half of the samples are AA at locus n, and the other half are AG). In an A -->R (AA to AG) comparison, a single A is assumed to be non-polymorphic, so the informative site is the single A to G transition. Most phylogenetic software does not recognize ambiguity codes as heterozygotes, since that is not the intended purpose of ambiguity codes. This will also cut down on the computational time of building the phylogeny since I'm essentially halving the number of sites by removing uninformative "haplotypes" (used loosely, since unphased.)

tl; dr. I'm trying to replace ambiguity codes with the phylogenetically informative SNP information.

ADD COMMENT • link 7.5 years ago by aberry814 ▴ 80

0

Entering edit mode

Please use ADD COMMENT to answer to earlier replies, as such this thread remains logically structured and easy to follow.

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

score 4 · Accepted Answer · 2016-11-11

I would probably do it like this. Seems like an odd thing to do, though..care to elaborate about the background?

#!/usr/bin/env python
import sys
import csv
from collections import Counter

# script is called with python scriptname name_of_input_file
# it will create a file named name_of_input_file.out
filename = sys.argv[1]
# with is a safe way to open files for i/o
with open(filename) as fi, open(filename + '.out', 'w') as fo:
    # for processing tab-separated, we use csv.reader with delimiter='\t'
    for row in csv.reader(fi, delimiter='\t'):
        # Counter objects count elements in a collection and store them in a dictionary-like structure
        # e.g. {'A': 2, 'R': 3}
        symbolCounter = Counter(row)
        # this way we can easily determine the number of different characters by checking the length of the Counter object
        if len(symbolCounter) == 2:
            # assuming all your conditions only take place when there are exactly two different characters
            # build a string from the current row
            rowString = ''.join(row)
            # extract the keys from the Counter object, this will give you the different characters present
            # sorting will allow to resolve ambiguities ('AR' == 'RA')
            keys = ''.join(sorted(symbolCounter))
            # check for your conditions
            if keys == 'AR':
                # perform the replace-operations on the string built from the row
                rowString = rowString.replace('R', 'G')
            elif keys == 'RW':
                # chaining replace-operations is also possible
                rowString = rowString.replace('W', 'T').replace('R', 'G')
            # add your other conditions here
            # turn the row string back into a list and assign to the original row
            row = list(rowString)
            pass
        # write the row (transformed, if only contains two different characters; untransformed otherwise)
        fo.write('\t'.join(row) + '\n')