Question: Simple python script to convert ambiguity codes
1
gravatar for aberry814
12 months ago by
aberry81430
United States
aberry81430 wrote:

I'm very new to python. I know the basic formula, but I don't really know any commands. I'm learning, but I have to get this done faster than I can learn. I have a tab-delimited file that looks like this:

A R R R A

R W R W R

R M S W K

etc.

and want to generate a new file where I replace certain letters under certain conditions, i.e.:

if all cells in a row are A or R: replace R with G in that row only

if all cells in a row are R and W: replace R with G and W with T in that row only

(+20 other conditions)

so that it looks like:

A G G G A

G T G T G

R M S W K

Nothing should happen if a row has 3 or more unique characters.

The concept would be to read through the first row, see if it matches any conditions, and if it does, write the new output to a file. Iterate through all rows.

file = open(file.txt)
num_columns=numcolumns() # I can manually put in the number of columns for each run, though I'm sure there's a simple command.

for line in file:

    if line.contains('A') and line.contains('R') and (count('A')+count('R')==num_columns()): 

        write.to.newfile.replace('R') with ('G') #I obviously have no idea how to code this

        elif: 

        new_condition #etc. etc. etc.

        else:

        write line to file as is

Thanks for any help and sorry for the noob question!

ambiguity codes snp • 553 views
ADD COMMENTlink modified 12 months ago by WouterDeCoster23k • written 12 months ago by aberry81430
4
gravatar for cschu181
12 months ago by
cschu181640
cschu181640 wrote:

I would probably do it like this. Seems like an odd thing to do, though..care to elaborate about the background?

#!/usr/bin/env python
import sys
import csv
from collections import Counter

# script is called with python scriptname name_of_input_file
# it will create a file named name_of_input_file.out
filename = sys.argv[1]
# with is a safe way to open files for i/o
with open(filename) as fi, open(filename + '.out', 'w') as fo:
    # for processing tab-separated, we use csv.reader with delimiter='\t'
    for row in csv.reader(fi, delimiter='\t'):
        # Counter objects count elements in a collection and store them in a dictionary-like structure
        # e.g. {'A': 2, 'R': 3}
        symbolCounter = Counter(row)
        # this way we can easily determine the number of different characters by checking the length of the Counter object
        if len(symbolCounter) == 2:
            # assuming all your conditions only take place when there are exactly two different characters
            # build a string from the current row
            rowString = ''.join(row)
            # extract the keys from the Counter object, this will give you the different characters present
            # sorting will allow to resolve ambiguities ('AR' == 'RA')
            keys = ''.join(sorted(symbolCounter))
            # check for your conditions
            if keys == 'AR':
                # perform the replace-operations on the string built from the row
                rowString = rowString.replace('R', 'G')
            elif keys == 'RW':
                # chaining replace-operations is also possible
                rowString = rowString.replace('W', 'T').replace('R', 'G')
            # add your other conditions here
            # turn the row string back into a list and assign to the original row
            row = list(rowString)
            pass
        # write the row (transformed, if only contains two different characters; untransformed otherwise)
        fo.write('\t'.join(row) + '\n')
ADD COMMENTlink written 12 months ago by cschu181640
0
gravatar for aberry814
12 months ago by
aberry81430
United States
aberry81430 wrote:

Thanks so much for the detailed response! I'll give it a go. Here is what I'm trying to accomplish:

I have a diploid, unphased SNP file, where the heterozygotes are notated with ambiguity codes (eg if locus n is heterozygous for A and G, it is labeled R). I'm building phylogenies with these data, and the samples are so closely related that most of the differences are held within heterozygotes (half of the samples are AA at locus n, and the other half are AG). In an A -->R (AA to AG) comparison, a single A is assumed to be non-polymorphic, so the informative site is the single A to G transition. Most phylogenetic software does not recognize ambiguity codes as heterozygotes, since that is not the intended purpose of ambiguity codes. This will also cut down on the computational time of building the phylogeny since I'm essentially halving the number of sites by removing uninformative "haplotypes" (used loosely, since unphased.)

tl; dr. I'm trying to replace ambiguity codes with the phylogenetically informative SNP information.

ADD COMMENTlink written 12 months ago by aberry81430

Please use ADD COMMENT to answer to earlier replies, as such this thread remains logically structured and easy to follow.

ADD REPLYlink written 12 months ago by WouterDeCoster23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 572 users visited in the last hour