Question

Creating frequency table from CD-HIT output in Python

0

Entering edit mode

18 months ago

Paula ▴ 60

Hi!

I apologize in advance for creating an open question without attempting to solve the coding problem. I am quite confused with this coding problem.

I am trying to create a frequency table based on a nested list in Python. The table must include the name of the cluster and the frequency in which the samples (labeled as SOL_X_) appear in the sublists The input list looks like this:

Cluster 13

0 3489aa, >SOL_1_10_length_33340_cov_18.252877_23... *

1 61aa, >SOL_1_40_length_4804_cov_17.346178_1... at 95.08%

Cluster 14

0 3456aa, >SOL_1_10_length_66994_cov_17.869433_28... *

The desired result is a table that contains the clusters in the first row and the frequency in which sample is found in a cluster in the other rows

desired table

Thank you

python cd-hit • 522 views

ADD COMMENT • link updated 18 months ago by Will • 0 • written 18 months ago by Paula ▴ 60

0

Entering edit mode

This seems like a non-trivial problem, although that is on a relative scale as I am not a programmer. Still, it seems to me that you are betting on a generosity of others to give you a complete solution without any work on your part. I predict this response will be the only one you get unless you show some effort.

ADD REPLY • link 18 months ago by Mensur Dlakic ★ 27k

score 0 · Answer 1 · 2022-10-17

I would suggest using a dictionary with the keys as "Cluster X" and then values as the strings that follow.

example_dict = {'Cluster 13': ['0 3489aa, >SOL_1_10_length_33340_cov_18.252877_23... ', '1 61aa, >SOL_1_40_length_4804_cov_17.346178_1... at 95.08%'], 'Cluster 14': ['0 3456aa, >SOL_1_10_length_66994_cov_17.869433_28... ']}

You can populate this dictionary by using for loops, conditionals ("if/else"), and the readline() method if the input is a file.

After the dictionary is filled with all necessary data, you can go through each key,value pair and count how many times SOL_X appears in the list. This info can then be added to a pandas dataframe -> excel file.

I hope this helps a little bit. This might not be an optimal solution, but it will do the trick.

Happy Coding!