Question: Simple python script to convert ambiguity codes
1
gravatar for aberry814
6 months ago by
aberry81420
United States
aberry81420 wrote:

I'm very new to python. I know the basic formula, but I don't really know any commands. I'm learning, but I have to get this done faster than I can learn. I have a tab-delimited file that looks like this:

A R R R A

R W R W R

R M S W K

etc.

and want to generate a new file where I replace certain letters under certain conditions, i.e.:

if all cells in a row are A or R: replace R with G in that row only

if all cells in a row are R and W: replace R with G and W with T in that row only

(+20 other conditions)

so that it looks like:

A G G G A

G T G T G

R M S W K

Nothing should happen if a row has 3 or more unique characters.

The concept would be to read through the first row, see if it matches any conditions, and if it does, write the new output to a file. Iterate through all rows.

file = open(file.txt)
num_columns=numcolumns() # I can manually put in the number of columns for each run, though I'm sure there's a simple command.

for line in file:

    if line.contains('A') and line.contains('R') and (count('A')+count('R')==num_columns()): 

        write.to.newfile.replace('R') with ('G') #I obviously have no idea how to code this

        elif: 

        new_condition #etc. etc. etc.

        else:

        write line to file as is

Thanks for any help and sorry for the noob question!

ambiguity codes snp • 389 views
ADD COMMENTlink modified 6 months ago by WouterDeCoster17k • written 6 months ago by aberry81420
4
gravatar for cschu181
6 months ago by
cschu181510
cschu181510 wrote:

I would probably do it like this. Seems like an odd thing to do, though..care to elaborate about the background?

#!/usr/bin/env python
import sys
import csv
from collections import Counter

# script is called with python scriptname name_of_input_file
# it will create a file named name_of_input_file.out
filename = sys.argv[1]
# with is a safe way to open files for i/o
with open(filename) as fi, open(filename + '.out', 'w') as fo:
    # for processing tab-separated, we use csv.reader with delimiter='\t'
    for row in csv.reader(fi, delimiter='\t'):
        # Counter objects count elements in a collection and store them in a dictionary-like structure
        # e.g. {'A': 2, 'R': 3}
        symbolCounter = Counter(row)
        # this way we can easily determine the number of different characters by checking the length of the Counter object
        if len(symbolCounter) == 2:
            # assuming all your conditions only take place when there are exactly two different characters
            # build a string from the current row
            rowString = ''.join(row)
            # extract the keys from the Counter object, this will give you the different characters present
            # sorting will allow to resolve ambiguities ('AR' == 'RA')
            keys = ''.join(sorted(symbolCounter))
            # check for your conditions
            if keys == 'AR':
                # perform the replace-operations on the string built from the row
                rowString = rowString.replace('R', 'G')
            elif keys == 'RW':
                # chaining replace-operations is also possible
                rowString = rowString.replace('W', 'T').replace('R', 'G')
            # add your other conditions here
            # turn the row string back into a list and assign to the original row
            row = list(rowString)
            pass
        # write the row (transformed, if only contains two different characters; untransformed otherwise)
        fo.write('\t'.join(row) + '\n')
ADD COMMENTlink written 6 months ago by cschu181510
0
gravatar for aberry814
6 months ago by
aberry81420
United States
aberry81420 wrote:

Thanks so much for the detailed response! I'll give it a go. Here is what I'm trying to accomplish:

I have a diploid, unphased SNP file, where the heterozygotes are notated with ambiguity codes (eg if locus n is heterozygous for A and G, it is labeled R). I'm building phylogenies with these data, and the samples are so closely related that most of the differences are held within heterozygotes (half of the samples are AA at locus n, and the other half are AG). In an A -->R (AA to AG) comparison, a single A is assumed to be non-polymorphic, so the informative site is the single A to G transition. Most phylogenetic software does not recognize ambiguity codes as heterozygotes, since that is not the intended purpose of ambiguity codes. This will also cut down on the computational time of building the phylogeny since I'm essentially halving the number of sites by removing uninformative "haplotypes" (used loosely, since unphased.)

tl; dr. I'm trying to replace ambiguity codes with the phylogenetically informative SNP information.

ADD COMMENTlink written 6 months ago by aberry81420

Please use ADD COMMENT to answer to earlier replies, as such this thread remains logically structured and easy to follow.

ADD REPLYlink written 6 months ago by WouterDeCoster17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 980 users visited in the last hour