Question: How to count occurrences of numbers in text files using Python
3.4 years ago, Kachibunny wrote:

I have been working on a project that requires me to count the number of 1's and 0's that describe the effect each amino acid has on the stability of a peptide. There are about 300 different peptide sequences in the file. I want my code to recognize the start of each peptide sequence in my text file, count its length, and then count the number of 1's and 0's recorded for its amino acids. So far I have been trying to get my code to recognize the start of a sequence using its index numbering. Here's what I have:

count=0
input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt') 
Output_file01= open ('MHC I 17 NOV2016OUT.txt','w') 
for line in input_file01:
    templist=line.split()
    for i in templist[0]:
        if i=='1':
            count+1
            if count=='#':
                break 
                Output_file01.write(templist[0] + '\t\t' + templist[1] + '\t\t' + templist[2]+'\n') 



Here is an example of the content in the file. I want my code to count the length of each peptide sequence, count the number of 1's and 0's, and find their ratio within each peptide sequence.

    #   1 - Amino acid number
    #   2 - One letter code
    #   3 - ANCHOR probability value
    #   4 - ANCHOR output
    #   
    1   A         0.3129    0
    2   P         0.4044    0
    3   K         0.5258    1
    4   R         0.6358    1
    5   P         0.7277    1
    6   P         0.7895    1
    7   S         0.8710    1
    8   A         0.9358    1
    9   F         0.9680    1
writelines excel python counting • 2.6k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Kachibunny0
  1. Why does the post have a writelines (whatever that means) and an excel tag?
  2. Please avoid white spaces in file/folder names. Also, use Linux based OS whenever possible.

You will also need to mention why you're doing this, and how your input file differentiates between peptides.

ADD REPLYlink written 3.4 years ago by RamRS27k
  1. The programming language I am using is Python, and I am fairly new to it. The tags were a faster way to reach out for assistance. My files were initially Excel files converted to text, as I have little experience manipulating text files in Python.
  2. I have a Windows-based laptop, so I'm not really sure about the Linux OS suggestion, but thank you.

I am in the bioinformatics master's program and, as I said above, fairly new to Python, coming from a microbiology background. For a research project I was given this task, which uses the ANCHOR software to study the stability/disorder of each amino acid and assigns a 1 or 0 to it based on the software's analysis. Hence I decided to write a script to count the 1's and 0's, and get their ratio, in each peptide run through ANCHOR. My initial code, which I have been working on over the weekend, keeps giving me this error: "IndexError: list index out of range"

ADD REPLYlink written 3.4 years ago by Kachibunny0

You are not getting faster or more accurate help by using incorrect tags, so there is no added value in adding the excel tag when this is not a question related to Excel (regardless of where your input data came from).

ADD REPLYlink written 3.4 years ago by WouterDeCoster43k

Can you tell us more about your input file? Is it tab separated?

ADD REPLYlink written 3.4 years ago by Joe16k

Yes it is tab separated.

ADD REPLYlink written 3.4 years ago by Kachibunny0

Do you have all the files separately, or do they have to be all in one? Frankly, the # lines are completely stumping me. Splitting the file up by entry is the difficult part (for me anyway); the column parsing after that is easy.

ADD REPLYlink written 3.4 years ago by Joe16k

Perhaps OP should post a longer example of what he has and what he wants to obtain.

ADD REPLYlink written 3.4 years ago by WouterDeCoster43k

I really appreciate the help you've offered thus far. Yes, I want the files all in one. I am just realizing the difficulty level of this task; however, I have modified my code below:

input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt') 
Output_file01= open ('MHCI17NOV2016OUT.txt','w') 
for line in input_file01:
    if line[0]=='#':
        pass
    elif line[0] !='#':

        newlines= line[0] +'\t\t'+line[2]+'\t\t'+line[3]+'\t\t'+line[4]+'\t\t'+ line[6]+'\n'
        print newlines
        Output_file01.write(newlines)

input_file01.close()
Output_file01.close()


Here is a sample of what gets written to my output file:

        >       g       b                       
        1       S                        
        2       L                        
        3       N                        
        4       M                        
        5       I                        
        6       S                        
        7       K                        
        8       K                        
        9       Y
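
A possible reason only single characters show up in this output: indexing a line, as in line[2], returns one character of the string rather than one column. The following is a minimal sketch, assuming Python 3 and whitespace-separated columns as in the sample above; the helper name parse_row is hypothetical and not part of the code posted in this thread.

```python
def parse_row(line):
    """Split one data line into (number, residue, probability, anchor_flag).

    Returns None for blank lines, '#' annotation lines, and lines with
    fewer than four columns.
    """
    if line.lstrip().startswith('#'):
        return None
    fields = line.split()  # splits on any run of tabs or spaces
    if len(fields) < 4:
        return None
    return int(fields[0]), fields[1], float(fields[2]), int(fields[3])


line = "3\tK\t0.5258\t1\n"
print(line[2])          # one character of the string
print(parse_row(line))  # whole columns
```

Splitting first and then indexing the resulting list is what the later replies in this thread do with line.split().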
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Kachibunny0

This is my partial solution. It handles your given file but doesn't attempt to handle multiple files strung together. I must be having a bad day because I can't make that work. If it were me, I'd save myself a considerable amount of headache and split the files apart first, then just loop over them with the script. I also haven't added the ratio you alluded to, as I don't know whether you want 1s over 0s or vice versa, but that should be pretty easy to add. Beware of division-by-zero errors if one of your protein files has only 1s or only 0s; you probably want a try/except for this.

# $ python AAcount.py inputfile.txt

import sys

with open(sys.argv[1], 'r') as ifh:
        entries = []
        for line in ifh:
                line = line.lstrip()
                # skip blank lines and '#' annotation lines
                if line and not line.startswith('#'):
                        row = line.split()
                        entries.append(row)

# count the rows whose fourth column (the ANCHOR output) is '1'
count = 0
for item in entries:
        if item[3] == '1':
                count += 1

num1 = count
num0 = len(entries) - count
print("Length | 1s | 0s")
print(str(len(entries)) + "\t" + str(num1) + '\t' + str(num0))
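
The ratio mentioned above could be added along these lines. This is only a sketch: the direction of the ratio (1s over 0s) is an assumption, and the function name ratio_of_ones_to_zeros is hypothetical.

```python
def ratio_of_ones_to_zeros(num1, num0):
    """Return num1 / num0, or None when there are no 0s to divide by."""
    try:
        return num1 / float(num0)
    except ZeroDivisionError:
        # a peptide made up entirely of 1s has no defined ratio
        return None


# the nine-residue sample in the question has seven 1s and two 0s
print(ratio_of_ones_to_zeros(7, 2))
```

Returning None for the all-1s case keeps the script running; how such peptides should be reported is a choice left to the asker.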
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Joe16k

Needs more list comprehensions! :-D

entries = [line.split() for line in ifh if line.strip() and not line.lstrip().startswith('#')]

count = sum(1 for item in entries if item[3] == '1')
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by WouterDeCoster43k

I know :P they just aren't verbose enough for me so I can't think like that haha!

ADD REPLYlink written 3.4 years ago by Joe16k

Thanks so very much, guys. I am not familiar with the sys module or sys.argv, but I did some light research and tried implementing your code as shown below. I got the IndexError: list index out of range error message for the fourth line of the code.
From my research, apparently I am supposed to run the script as python script.py input_file01.txt, although to be honest this is the first time I have used sys.argv.

import sys 
input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt') 
Output_file01= open ('MHCI17NOV2016OUT.txt','w') 
with open(sys.argv[1], 'r') as input_file01, open(sys.argv[2], 'w') as Output_file01:
        entries = []
        for line in input_file01:
                line = line.lstrip()
                if not line.startswith('#'):
                        row = line.split()

                        entries.append(row)

count = 0
for item in entries:
        if item[3] == '1':
                count += 1

num1 = count
num0 = int(len(entries)) - int(count)
print("Length | 1s | 0s")
print(str(len(entries)) + "\t" + str(num1) + '\t' + str(num0))
ADD REPLYlink written 3.4 years ago by Kachibunny0

In this case you would run the script as python script.py inputfile outputfile. The IndexError means that the interpreter couldn't find the item sys.argv[2], since there was no item at that index. sys.argv is a list containing the arguments with which the script was executed (with sys.argv[0] being the script name itself).
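
A guard like the following can turn the bare IndexError into a readable usage message. It is a sketch only; the helper name get_paths is hypothetical, and the list passed in at the bottom stands in for what sys.argv would hold after running python script.py input.txt output.txt.

```python
def get_paths(argv):
    """Return (input_path, output_path) from an argv-style list.

    argv[0] is the script name, so exactly three items are expected.
    """
    if len(argv) != 3:
        raise SystemExit("usage: python script.py inputfile outputfile")
    return argv[1], argv[2]


# in a real script this would be: get_paths(sys.argv), after `import sys`
input_path, output_path = get_paths(["script.py", "input.txt", "output.txt"])
print(input_path, output_path)
```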

ADD REPLYlink written 3.4 years ago by WouterDeCoster43k

Exactly as mentioned above.

Using sys.argv[] is just a quick and dirty way of taking arguments directly from the command line. I hate hardcoding file names etc in scripts. It means you can do away with the input_file01 line.

Thus, if you invoked the following at the command line: python script.py input.txt output.txt:

sys.argv[0] = script.py (the name of your executed file)

sys.argv[1] = input.txt

sys.argv[2] = output.txt

And so on...

If you want to open a file to write to as well, I would just change the with open(sys.argv[1], 'r') line to:

with open(sys.argv[1], 'r') as ifh, open(sys.argv[2], 'w') as ofh:
    # do stuff
    ofh.write(stuff)
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Joe16k

It's a bit unclear what you are asking for, but I think the break statement before Output_file01.write() is not your best friend. That will just break the loop there.

Also, you are looking at templist[0], which is (if I'm not mistaken) just your amino acid number and not actually the 0 or 1 at the end.
Since templist[0] is just one position, it's rather pointless to have a for loop over it.
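
To illustrate the point about which column to look at, here is a minimal sketch that counts the values in the fourth column instead of looping over templist[0]. It assumes Python 3 and whitespace-separated data lines like those in the question; the function name count_anchor_flags is hypothetical.

```python
from collections import Counter


def count_anchor_flags(lines):
    """Count the values in the fourth column, skipping '#' and short lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith('#'):
            continue
        counts[fields[3]] += 1
    return counts


# three data rows taken from the sample in the question, plus one annotation line
sample = [
    "#   4 - ANCHOR output",
    "1\tA\t0.3129\t0",
    "2\tP\t0.4044\t0",
    "3\tK\t0.5258\t1",
]
counts = count_anchor_flags(sample)
print(counts['1'], counts['0'])
```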

ADD REPLYlink written 3.4 years ago by WouterDeCoster43k

As the initial post above shows, the file contains unwanted annotation lines describing the numbers and amino acids. templist[0] is the first column, which indexes the amino acids in each peptide. I am attempting to use this indexing, which is unique to the data lines, to write only the information I need into my output file: the amino acid number (templist[0]), the amino acid code (templist[1]), the float values (templist[2]), and the ANCHOR score (templist[3]). Eventually I want to count the 1's and 0's in templist[3], hence the use of the break. Hopefully this explains how I am attempting to solve this problem.
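
One way to use the index numbering described here is to start a new peptide whenever the amino acid number goes back to 1. The sketch below assumes Python 3 and that every peptide in the combined file restarts its numbering at 1, which the thread does not confirm. The function names split_peptides and summarize are hypothetical, and the example rows are made up for illustration.

```python
def split_peptides(lines):
    """Group data rows into peptides; a residue number of 1 starts a new one."""
    peptides = []
    for line in lines:
        fields = line.split()
        # skip blank, incomplete, and '#' annotation lines
        if len(fields) < 4 or fields[0].startswith('#'):
            continue
        if fields[0] == '1' or not peptides:
            peptides.append([])
        peptides[-1].append(fields)
    return peptides


def summarize(peptide):
    """Return (length, number of 1s, number of 0s) for one peptide."""
    ones = sum(1 for row in peptide if row[3] == '1')
    return len(peptide), ones, len(peptide) - ones


# made-up rows: a two-residue peptide followed by a three-residue peptide
lines = [
    "# header",
    "1 A 0.31 0",
    "2 P 0.40 1",
    "1 S 0.90 1",
    "2 L 0.95 1",
    "3 N 0.97 0",
]
for peptide in split_peptides(lines):
    print(summarize(peptide))
```

If the peptides are instead separated by a header line, that line would be the better thing to detect.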

ADD REPLYlink written 3.4 years ago by Kachibunny0