7.7 years ago
Kachibunny
I have been working on a project that requires me to count the number of 1's and 0's, which describe the effect each amino acid would have on the stability of a peptide. There are about 300 different peptide sequences in the file. I want my code to recognize the start of a peptide sequence in my text file, count its length, then count the number of 1's and 0's recorded for its amino acids. So far I have been trying to get my code to recognize the start of a sequence using its index numbering. Here's what I have:
count=0
input_file01=open (r'C:/Users/12345/Documents/Dr XXX Research/MHC I 17 NOV2016.txt')
Output_file01= open ('MHC I 17 NOV2016OUT.txt','w')
for line in input_file01:
    templist=line.split()
    for i in templist[0]:
        if i=='1':
            count+1
        if count=='#':
            break
    Output_file01.write(templist[0] + '\t\t' + templist[1] + '\t\t' + templist[2]+'\n')
Here is an example of the content of the file. I want my code to count the length of each peptide sequence, count the number of 1's and 0's, and find their ratio within each peptide sequence.
# 1 - Amino acid number
# 2 - One letter code
# 3 - ANCHOR probability value
# 4 - ANCHOR output
#
1 A 0.3129 0
2 P 0.4044 0
3 K 0.5258 1
4 R 0.6358 1
5 P 0.7277 1
6 P 0.7895 1
7 S 0.8710 1
8 A 0.9358 1
9 F 0.9680 1
writelines (whatever that means) and an excel tag? You will also need to mention why you're doing this, and how your input file differentiates between peptides.
I am in a bioinformatics master's program and, as I said above, fairly new to Python, coming from a microbiology background. For a research project I was given this task, which uses the ANCHOR software to study the stability/disorderliness of an amino acid; ANCHOR assigns a 1 or 0 to it based on its analysis. Hence I decided to write a script to count the 1's and 0's, and get their ratio, in each peptide run through ANCHOR. My initial code, which I have been working on over the weekend, keeps giving me this error: "IndexError: list index out of range"
You will not get faster or more accurate help by using incorrect tags, so there is no added value in adding excel when this question is not related to Excel (regardless of where your input data comes from).
Can you tell us more about your input file? Is it tab separated?
Yes it is tab separated.
Do you have all the files separately or do they have to be all in one? Frankly the # lines are completely stumping me. Splitting the file up by entry is the difficult part (for me anyway); the column parsing after that is easy.
Perhaps OP should post a longer example of what he has and what he wants to obtain.
I really appreciate the help you've offered thus far. Yes, I want the files all in one. I am just realizing the difficulty level of this task; however, I have modified my code below
This is my partial solution. It handles your given file but doesn't attempt to handle multiple files strung together. I must be having a bad day because I can't make it work. If it were me, I'd save myself a considerable amount of headache and split the files apart first, then just loop over them with the script. I also haven't added in the ratio you alluded to, as I don't know whether you want 1's over 0's or vice versa, but that should be pretty easy to add. Beware of division-by-zero errors here if one of your protein files has only 1's or only 0's; you probably want a try/except for this.
Needs more list comprehensions! :-D
I know :P they just aren't verbose enough for me so I can't think like that haha!
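To illustrate the division-by-zero caution in the partial-solution comment above, here is a small sketch of the ratio step with a try/except. It assumes the ratio wanted is 1's over 0's (the thread never settles the direction), and one_zero_ratio is a hypothetical helper name:

```python
def one_zero_ratio(scores):
    """Return (ones, zeros, ratio) for a list of '0'/'1' strings.

    Assumption: ratio means ones divided by zeros. It is None when the
    peptide has no zeros, instead of raising ZeroDivisionError.
    """
    ones = scores.count("1")
    zeros = scores.count("0")
    try:
        ratio = ones / zeros
    except ZeroDivisionError:
        ratio = None
    return ones, zeros, ratio


# ANCHOR output column of the nine sample rows in the question.
sample_scores = ["0", "0", "1", "1", "1", "1", "1", "1", "1"]
print(one_zero_ratio(sample_scores))  # (7, 2, 3.5)
```

Returning None rather than 0 or infinity is a design choice; whatever downstream code writes the output file would need to handle that case.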
Thanks guys, so very much. I am not familiar with the sys module or sys.argv; however, I did some light research and tried implementing your code this way. I got the "IndexError: list index out of range" error message in relation to the fourth line of the code.
From my research, apparently I am supposed to run something like python script.py input_file01.txt in some context, although to be honest this would be my first time using sys.argv as stated.
In this case you would use the script as python script.py inputfile outputfile. IndexError means that the interpreter couldn't find the item sys.argv[2], since there was no item at that index. sys.argv is a list containing the arguments with which the script was executed (with sys.argv[0] being the script name itself).
Exactly as mentioned above.
Using sys.argv[] is just a quick and dirty way of taking arguments directly from the command line. I hate hardcoding file names etc. in scripts. It means you can do away with the input_file01 line.
Thus if you invoked the following at the command line:
python script.py input.txt output.txt
then:
sys.argv[0] = script.py (the name of your executed file)
sys.argv[1] = input.txt
sys.argv[2] = output.txt
And so on...
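As a rough sketch of the point above, checking the length of the argument list first turns the bare IndexError into a readable usage message. get_paths is a hypothetical helper, and the file names are just the ones used in the example invocation:

```python
def get_paths(argv):
    """Return (input_path, output_path) from an argv-style list.

    Raises ValueError with a usage hint when the argument count is wrong,
    rather than letting argv[2] fail with IndexError.
    """
    if len(argv) != 3:
        prog = argv[0] if argv else "script.py"
        raise ValueError(f"usage: python {prog} inputfile outputfile")
    return argv[1], argv[2]


# Equivalent to running: python script.py input.txt output.txt
example_argv = ["script.py", "input.txt", "output.txt"]
input_path, output_path = get_paths(example_argv)
print(input_path, output_path)
```

In a real script you would pass sys.argv (after import sys) instead of example_argv.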
If you want to open a file to write to as well, I would just change the `with open(sys.argv[1], 'r')` line to:
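The replacement line itself has not survived in this copy of the thread, so the following is only a guess at the general idea: a single with statement can open the input and the output file together. copy_data_rows is a hypothetical helper, and skipping '#' and blank lines is an assumption based on the sample file in the question:

```python
def copy_data_rows(in_path, out_path):
    """Copy data rows from in_path to out_path and return how many were written.

    Assumption: lines that are blank or start with '#' are annotation
    and should be left out of the output.
    """
    written = 0
    # One 'with' statement manages both files; both are closed on exit.
    with open(in_path, "r") as infile, open(out_path, "w") as outfile:
        for line in infile:
            if not line.strip() or line.startswith("#"):
                continue
            outfile.write(line)
            written += 1
    return written
```

With command-line arguments this would be called as copy_data_rows(sys.argv[1], sys.argv[2]).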
It's a bit unclear what you are asking for, but I think the break statement before Output_file01.write() is not your best friend. That will just break the loop there.
Also, you are looking at templist[0], which is (if I'm not mistaken) just your amino acid number and not actually the 0 or 1 at the end.
Since templist[0] is just one position, it's rather pointless to have a for loop over it.
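To make that concrete, here is a minimal sketch that reads the 0 or 1 from the last column (templist[3]) instead of looping over templist[0]. It also skips lines with fewer than four fields, such as the lone "#" line, which is a likely source of the IndexError when templist[1] is accessed. count_anchor_flags is a made-up name, and the function counts over whatever lines it is given without separating peptides:

```python
def count_anchor_flags(lines):
    """Count the '0' and '1' values in the fourth column of the data rows."""
    counts = {"0": 0, "1": 0}
    for line in lines:
        templist = line.split()
        # Annotation lines start with '#'; short lines lack a fourth column.
        if len(templist) < 4 or templist[0].startswith("#"):
            continue
        flag = templist[3]
        if flag in counts:
            counts[flag] += 1
    return counts


# Header lines and the first rows of the sample from the question.
sample_lines = [
    "# 4 - ANCHOR output",
    "#",
    "1\tA\t0.3129\t0",
    "2\tP\t0.4044\t0",
    "3\tK\t0.5258\t1",
]
print(count_anchor_flags(sample_lines))
```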
As shown in the initial post above, the file contains unwanted annotation describing the numbers and amino acids. templist[0] is the first column, which indexes the amino acids in each peptide. I am attempting to use this indexing, which is unique to the actual data, to write into my output file only the information I need: the amino acid number in the peptide (templist[0]), the amino acid code (templist[1]), the float numbers (templist[2]) and the ANCHOR score (templist[3]), and eventually to count the 1's and 0's in templist[3]; hence the use of the break. Hopefully this explains how I am attempting to solve this problem.