Question

How to count occurrences of numbers in text files using Python

0

7.8 years ago

Kachibunny • 0

I have been working on a project that requires me to count the number of 1's and 0's that describe the effect each amino acid has on the stability of a peptide. There are about 300 different peptide sequences in the file. I want my code to recognize the start of a peptide sequence in my text file, count its length, then count the number of 1's and 0's recorded for its amino acids. So far I have been trying to get my code to recognize the start of a sequence using its index numbering. Here's what I have:

count=0
input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt') 
Output_file01= open ('MHC I 17 NOV2016OUT.txt','w') 
for line in input_file01:
    templist=line.split()
    for i in templist[0]:
        if i=='1':
            count+1
            if count=='#':
                break 
                Output_file01.write(templist[0] + '\t\t' + templist[1] + '\t\t' + templist[2]+'\n') 



Here is an example of the content in the file. I want my code to count the length of each peptide sequence, count the number of 1's and 0's, and find their ratio within each peptide sequence.

    #   1 - Amino acid number
    #   2 - One letter code
    #   3 - ANCHOR probability value
    #   4 - ANCHOR output
    #   
    1   A         0.3129    0
    2   P         0.4044    0
    3   K         0.5258    1
    4   R         0.6358    1
    5   P         0.7277    1
    6   P         0.7895    1
    7   S         0.8710    1
    8   A         0.9358    1
    9   F         0.9680    1

python excel counting writelines • 6.4k views

ADD COMMENT • link 7.8 years ago by Kachibunny • 0

0

Why does the post have a writelines (whatever that means) and an excel tag?
Please avoid white space in file/folder names. Also, use a Linux-based OS whenever possible.

You will also need to mention why you're doing this, and how your input file differentiates between peptides.

ADD REPLY • link 7.8 years ago by Ram 44k

0

The programming language I am using is Python, and I am fairly new to it. The tags were meant as a faster way to reach out for assistance. My files were initially Excel files that I converted to text, as I have little experience manipulating text files in Python.
I have a Windows-based laptop, so I'm not really sure about the Linux OS suggestion, but thank you.

I am in a bioinformatics master's program and, as I said above, fairly new to Python, coming from a microbiology background. For a research project I was given this task, which uses the ANCHOR software to study the stability/disorderliness of each amino acid and assigns it a 1 or 0 based on the software's analysis. Hence I decided to write a script to count the 1's and 0's in each peptide run through ANCHOR and get their ratio. My initial code, which I have been working on over the weekend, keeps giving me this error: "IndexError: list index out of range"

ADD REPLY • link 7.8 years ago by Kachibunny • 0

1

You are not getting faster or more accurate help by using incorrect tags, so there is no added value in adding the excel tag when the question is not related to Excel (regardless of where your input data came from).

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Can you tell us more about your input file? Is it tab separated?

ADD REPLY • link 7.8 years ago by Joe 21k

0

Yes it is tab separated.

ADD REPLY • link 7.8 years ago by Kachibunny • 0

0

Do you have all the files separately, or do they have to be all in one? Frankly, the # lines are completely stumping me. Splitting the file up by entry is the difficult part (for me anyway); the column parsing that follows is easy.
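One possible way to split a combined file by entry is sketched below. It rests on an assumption the thread does not confirm: that a new peptide begins whenever the residue number in the first column is 1. The function name `split_peptides` is hypothetical, and the second peptide in the sample uses made-up scores purely for illustration.

```python
def split_peptides(lines):
    """Group ANCHOR data rows into one list per peptide.

    Assumption (not confirmed by the thread): a new peptide starts
    whenever the residue number in column 1 is 1.
    """
    peptides = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, '#' comments, and any row whose first
        # column is not a residue number or that lacks four columns.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        if fields[0] == "1" or not peptides:
            peptides.append([])
        peptides[-1].append(fields)
    return peptides


# First peptide: rows from the question. Second peptide: made-up values.
sample = """\
#   1 - Amino acid number
1   A   0.3129   0
2   P   0.4044   0
3   K   0.5258   1
#   1 - Amino acid number
1   S   0.9000   1
2   L   0.2000   0
""".splitlines()

for number, peptide in enumerate(split_peptides(sample), start=1):
    flags = [row[3] for row in peptide]
    # Prints: peptide number, length, count of 1s, count of 0s
    print(number, len(peptide), flags.count("1"), flags.count("0"))
```

If the real file marks each peptide differently (for example with a header line), the restart test on `fields[0]` would need to be replaced with a check for that marker.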

ADD REPLY • link 7.8 years ago by Joe 21k

1

Perhaps OP should post a longer example of what he has and what he wants to obtain.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

I really appreciate the help you've offered thus far. Yes, I want the files all in one. I am just realizing the difficulty level of this task; however, I have modified my code as shown below.

input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt') 
Output_file01= open ('MHCI17NOV2016OUT.txt','w') 
for line in input_file01:
    if line[0]=='#':
        pass
    elif line[0] !='#':

        newlines= line[0] +'\t\t'+line[2]+'\t\t'+line[3]+'\t\t'+line[4]+'\t\t'+ line[6]+'\n'
        print newlines
        Output_file01.write(newlines)

input_file01.close()
Output_file01.close()


Here is a sample of the only thing that gets written to my output file.

        >       g       b                       
        1       S                        
        2       L                        
        3       N                        
        4       M                        
        5       I                        
        6       S                        
        7       K                        
        8       K                        
        9       Y

ADD REPLY • link 7.8 years ago by Kachibunny • 0

1

This is my partial solution. It handles your given file but doesn't attempt to handle multiple files strung together. I must be having a bad day because I can't make that work. If it were me, I'd save myself a considerable amount of headache and split the files apart first, then just loop over them with the script. I also haven't added the ratio you alluded to, as I don't know whether you want 1s over 0s or vice versa, but that should be pretty easy to add. Beware of division-by-zero errors if one of your protein files has all 1s or all 0s; you probably want a try/except for this.

# Usage: python AAcount.py inputfile.txt

import sys

# Collect the data rows, skipping '#' comment lines and blank lines
entries = []
with open(sys.argv[1], 'r') as ifh:
    for line in ifh:
        line = line.strip()
        if line and not line.startswith('#'):
            entries.append(line.split())

# Column 4 (index 3) holds the ANCHOR output, either '1' or '0'
count = 0
for item in entries:
    if item[3] == '1':
        count += 1

num1 = count
num0 = len(entries) - count
print("Length | 1s | 0s")
print(str(len(entries)) + "\t" + str(num1) + "\t" + str(num0))
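The ratio and the division-by-zero guard mentioned above could be added along these lines. This is only a sketch: the function name `anchor_summary` is hypothetical, and the choice of 1s over 0s is an assumption, since the thread never settles which way round the ratio should go.

```python
def anchor_summary(flags):
    """Return (length, ones, zeros, ratio) for a list of '0'/'1' strings.

    Assumption: the ratio is ones / zeros. Swap the operands if the
    opposite ratio is wanted.
    """
    ones = flags.count("1")
    zeros = flags.count("0")
    try:
        ratio = ones / zeros
    except ZeroDivisionError:
        ratio = None  # no 0s in this peptide, so the ratio is undefined
    return len(flags), ones, zeros, ratio


# Column 4 of the nine example rows shown in the question
flags = ["0", "0", "1", "1", "1", "1", "1", "1", "1"]
print(anchor_summary(flags))  # (9, 7, 2, 3.5)
```

Returning `None` rather than raising keeps a loop over many peptides running when one of them contains only 1s; whether that is the right behaviour depends on how the ratios are used afterwards.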

ADD REPLY • link 7.8 years ago by Joe 21k

0

Needs more list comprehensions! :-D

entries = [line.split() for line in ifh if line.strip() and not line.lstrip().startswith('#')]

count = sum(1 for item in entries if item[3] == '1')

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

I know :P they just aren't verbose enough for me so I can't think like that haha!

ADD REPLY • link 7.8 years ago by Joe 21k

0

Thanks so very much, guys. I am not familiar with the sys module or sys.argv; however, I did some light research and tried implementing your code this way. I got the "IndexError: list index out of range" error message in relation to the fourth line of the code.
From my research, apparently I am supposed to run the script as python script.py input_file01.txt in some context, although to be honest this would be my first time using sys.argv as stated.

import sys 
input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt') 
Output_file01= open ('MHCI17NOV2016OUT.txt','w') 
with open(sys.argv[1], 'r') as input_file01, open(sys.argv[2], 'w') as Output_file01:
        entries = []
        for line in input_file01:
                line = line.lstrip()
                if not line.startswith('#'):
                        row = line.split()

                        entries.append(row)

count = 0
for item in entries:
        if item[3] == '1':
                count += 1

num1 = count
num0 = int(len(entries)) - int(count)
print("Length | 1s | 0s")
print(str(len(entries)) + "\t" + str(num1) + '\t' + str(num0))

ADD REPLY • link 7.8 years ago by Kachibunny • 0

0

In this case you would run the script as python script.py inputfile outputfile. The IndexError means that the interpreter couldn't find the item sys.argv[2], since there was no item at that index. sys.argv is a list containing the arguments with which the script was executed (with sys.argv[0] being the script name itself).
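A small check on the length of the argument list can turn that IndexError into a readable usage message. The sketch below uses a hypothetical helper, `get_paths`, and passes it a hand-written list so it can be tried without a command line; in a real script you would pass `sys.argv` instead.

```python
import sys


def get_paths(argv):
    """Return (input_path, output_path), or exit with a usage message."""
    # argv[0] is the script name, so two arguments means len(argv) == 3
    if len(argv) != 3:
        sys.exit("usage: python script.py inputfile outputfile")
    return argv[1], argv[2]


# Stand-in for sys.argv when run as: python script.py input.txt output.txt
in_path, out_path = get_paths(["script.py", "input.txt", "output.txt"])
print(in_path, out_path)
```

`sys.exit` with a string prints that string to standard error and stops the script with a non-zero exit status, which is usually what you want when arguments are missing.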

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

1

Exactly as mentioned above.

Using sys.argv is just a quick and dirty way of taking arguments directly from the command line. I hate hardcoding file names etc. in scripts. It means you can do away with the input_file01 line.

Thus, if you invoked python script.py input.txt output.txt at the command line, you would get:

sys.argv[0] = script.py (the name of your executed file)

sys.argv[1] = input.txt

sys.argv[2] = output.txt

And so on...

If you want to open a file to write to as well, I would just change the with open(sys.argv[1], 'r') line to:

with open(sys.argv[1], 'r') as ifh, open(sys.argv[2], 'w') as ofh:
    # do stuff
    ofh.write(stuff)
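If the quick and dirty route ever becomes limiting, the standard library's argparse module is a common alternative; it generates usage and help text for you. This is only a sketch and not something the script above requires. The names `parse_args`, `infile`, and `outfile` are illustrative choices, and the argument list is passed in by hand here so the example runs anywhere.

```python
import argparse


def parse_args(argv=None):
    """Parse an input path and an output path from the command line."""
    parser = argparse.ArgumentParser(description="Count ANCHOR 1s and 0s")
    parser.add_argument("infile", help="ANCHOR output text file")
    parser.add_argument("outfile", help="file to write the summary to")
    # With argv=None, argparse reads the real command-line arguments
    return parser.parse_args(argv)


# Stand-in for: python script.py input.txt output.txt
args = parse_args(["input.txt", "output.txt"])
print(args.infile, args.outfile)
```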

ADD REPLY • link 7.8 years ago by Joe 21k

0

It's a bit unclear what you are asking for, but I think the break statement before Output_file01.write() is not your best friend. It exits the loop right there, so the write on the next line is never reached.

Also, you are looking at templist[0], which is (if I'm not mistaken) just your amino acid number and not actually the 0 or 1 at the end.
Since templist[0] is just one position, it's rather pointless to have a for loop over it.
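To make the column positions concrete, here is what split() gives for one of the example rows from the question. The unpacked variable names are just illustrative labels for the four columns.

```python
# One data row copied from the example in the question
line = "3   K         0.5258    1"
templist = line.split()
print(templist)  # ['3', 'K', '0.5258', '1']

# Illustrative names for the four columns
number, residue, probability, flag = templist
print(flag == "1")  # True: the ANCHOR output is templist[3], not templist[0]

# Looping over templist[0] walks through the characters of the residue
# number, so a number like '12' would yield '1' and then '2'.
print(list("12"))  # ['1', '2']
```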

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

From the initial post above, the file contains unwanted annotation describing the numbers and amino acids. templist[0] is the first field of each line, which indexes the amino acids in each peptide. I am attempting to use this indexing, which is unique to the actual data, to write into my output file only the information I need: the amino acid number in the peptide (templist[0]), the amino acid code (templist[1]), the float values (templist[2]), and the ANCHOR score (templist[3]). Eventually I want to count the 1's and 0's in templist[3], hence the use of the break. Hopefully this explains how I am attempting to solve this problem.

ADD REPLY • link 7.8 years ago by Kachibunny • 0