Comparing lists generated by Counter() and .most_common() for AA seqs
7.4 years ago
st.ph.n ★ 2.6k

I'm trying to use Counter() and most_common() to count the occurrences of amino acid sequences in two lists. Let's call them upper and lower:

from collections import Counter

counterup = Counter(upperseqs)
counterlow = Counter(lowerseqs)
countermc_up = counterup.most_common(500)   # top 500 upper sequences
countermc_low = counterlow.most_common()    # every lower sequence

print len(countermc_up)
print len(countermc_low)

for k, v in countermc_up:
    for x, y in countermc_low:
        if x == k:
            print >> fh1, k, '\t', v, '\t', y
        elif x != k:
            print >> fh1, k, '\t', v, '\t', "0.00"
        else:
            print "No Matches found!! Try again!"

So I want the top 500 sequences from my "upper" list, and for each of those I want to compare its count with the count of the same sequence in the second, "lower" list, if it is present there. There are approx. 36K items with counts in the second list.

When I run the code without the elif and else branches, I get what I want: all of the matches that are contained in the second list are printed to a file handle (fh1) that I opened previously, in a tab-delimited format: sequence, count for upper, count for lower.

CARYLGYNSNWYPFDYW       589778  427779
CARDYRGYSGYNDAFNIW      294911  29343
CARKIGYSSGSEDYW         187806  90299
CARHLGYNNSWYPFDYW       82820   88700
CARHLGYNSAWYPFDYW       55642   45723
CARHLGYNDSWYPFDYW       44338   30974
CAKDFRGYTGYNDAFDIW      34638   9703
CARHLGYNSDWYPFDYW       23476   15692
CARHLGYNSVWYPFDYW       16223   12220
CARHLGYNSNWYPFDYW       15673   17198
......
CARYLNSWPY              89      0.00

However, there is one sequence in the upper top-500 list that is not in the lower list, and I need to find out which one. I will also use this for other second lists of varying size, where I know that fewer of the items in the first list will be found. What I want the code to do is write "0.00" in the third column if that sequence does not exist in the second list.
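
To make the target output concrete, here is a minimal sketch of the behaviour described above. It assumes the lower counts can be looked up directly in the Counter rather than by looping over most_common(); a Counter returns 0 for a key it has never seen. The function name compare_counts and the toy sequences are made up for illustration and are not real data.

```python
from collections import Counter

def compare_counts(upperseqs, lowerseqs, top_n=500):
    """Return (sequence, upper_count, lower_count) rows for the top_n upper sequences."""
    counterup = Counter(upperseqs)
    counterlow = Counter(lowerseqs)
    rows = []
    # most_common() yields items from most to least frequent, so that order is kept.
    for seq, up_count in counterup.most_common(top_n):
        # A Counter returns 0 for a missing key instead of raising KeyError.
        low_count = counterlow[seq]
        rows.append((seq, up_count, low_count if low_count else "0.00"))
    return rows

# Toy data (hypothetical), only to show the shape of the output.
upper = ["CARA", "CARA", "CARA", "CARB", "CARB", "CARC"]
lower = ["CARA", "CARB", "CARB"]
for seq, up_count, low_count in compare_counts(upper, lower):
    print("%s\t%s\t%s" % (seq, up_count, low_count))
```

Each upper sequence appears exactly once, and a sequence missing from the lower list gets "0.00" in the third column.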

When I run it with the elif and else branches, the first row is perfect:

Ex: CARYLGYNSNWYPFDYW 589778 427779

However, the code keeps using the first sequence until it has gone through all items in the second list. So I get:

CARYLGYNSNWYPFDYW       589778  0.00
CARYLGYNSNWYPFDYW       589778  0.00
CARYLGYNSNWYPFDYW       589778  0.00
CARYLGYNSNWYPFDYW       589778  0.00
CARYLGYNSNWYPFDYW       589778  0.00
CARYLGYNSNWYPFDYW       589778  0.00 

for thousands of rows. I've sifted through this file and found that it does print the correct count at the point where the item is found in the second list. Once it has found its match, I need it to move on to the next item in list one and look for that in list two, since I know the item won't appear again. I also need to keep the sorted order of the lists that were created by Counter().most_common().
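
The repeated rows can be reproduced on a tiny example. The sketch below first mimics the nested loop, which emits one row per (upper, lower) pair, and then shows one possible minimal change: break out of the inner loop on a match and use the for/else clause to write "0.00" only when no match was found. The function names and the toy tuples are hypothetical, and rows are collected in a list instead of being written to fh1.

```python
def nested_rows(up_items, low_items):
    """Mimic the nested loop: one row per (upper, lower) pair."""
    rows = []
    for k, v in up_items:
        for x, y in low_items:
            rows.append((k, v, y if x == k else "0.00"))
    return rows

def matched_rows(up_items, low_items):
    """One row per upper item: the lower count if present, else "0.00"."""
    rows = []
    for k, v in up_items:
        for x, y in low_items:
            if x == k:
                rows.append((k, v, y))
                break  # match found, stop scanning the lower list
        else:
            # Runs only when the inner loop finished without hitting break.
            rows.append((k, v, "0.00"))
    return rows

# Toy (sequence, count) pairs in most_common() style; not real data.
up_items = [("CARA", 3), ("CARB", 2), ("CARC", 1)]
low_items = [("CARB", 2), ("CARA", 1)]

print(len(nested_rows(up_items, low_items)))   # 3 upper x 2 lower pairs
print(matched_rows(up_items, low_items))
```

With 500 upper items and roughly 36K lower items, the first version writes on the order of 500 x 36K rows, which matches the thousands of "0.00" lines per sequence. The second version keeps the order of the upper list, but it still scans the lower list for every upper item.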

All help is appreciated.

python amino acid counts Counter() compare lists • 2.2k views