Create a nested dictionary from a text file
9 weeks ago
Paula ▴ 60

Hi! I am trying to write code that reads a text file and converts its contents to a nested dictionary. The outer dictionary is keyed by cluster name (Cluster 0, Cluster 1). Within each cluster, the dictionary named "samples" is keyed by sample name (SOL_1_3, SOL_1_50, SOL_1_40) and holds a calculated cov value for each sample. For example, in Cluster 0 the cov value for sample SOL_1_50 is 7, which is the sum of the cov values of that sample's two entries (cov_3.5 + cov_3.5).

>Cluster 0
0   948aa, >SOL_1_50_cov_3.5_N_171282... at 100.00%
1   815aa, >SOL_1_50_cov_3.5_N_190968... at 100.00%
2   13323aa, >SOL_1_40_cov_79.5_N_6768... *
3   395aa, >SOL_1_3_cov_5.5_N_257377... at 90.38%

>Cluster 1
0   1759aa, >SOL_1_50_cov_5.5_N_75037... at 100.00%
1   1055aa, >SOL_1_50_cov_4.5_N_129969... at 99.91%


The desired output is the following:

{'Cluster 0': {'samples': {'SOL_1_50': 7, 'SOL_1_40': 79.5, 'SOL_1_3': 5.5, 'SOL_1_10': 0}}, 'Cluster 1': {'samples': {'SOL_1_50': 10, 'SOL_1_40': 0, 'SOL_1_3': 0, 'SOL_1_10': 0}}}


Here is my script:

f_in = 'real_short_test_cluster.txt'
f_out = 'output.txt'

if __name__ == '__main__':
    with open(f_in, 'r') as f:
        lines = f.readlines()
        f.close()

    dct_cluster_sol = dict()
    current_cluster = ''
    nested_dic = {'SOL 1_3','SOL_1_40','SOL_1_10','SOL_1_50'}
    #all_keys = []
    #coverage_count = 0
    for line in lines:
        if "Cluster" in line.strip():
            current_cluster = line.strip().split('>')[1]
            dct_cluster_sol[current_cluster] = dict()
            print('perro')
            print(dct_cluster_sol)
        elif ">SOL_" in line.strip():
            id = line.strip().split('\t')[1].split('>')[1].split('_cov')[0]
            coverage = line.strip().split('\t')[1].split('>')[1].split('_')[4]
            print(coverage,round(float(coverage) + 2.0,6))
            dct_cluster_sol[current_cluster]['samples'] = nested_dic
            print(dct_cluster_sol)
            for i in dct_cluster_sol:
                print(i)
                for j in dct_cluster_sol[i]:
                    for k in dct_cluster_sol[i][j]:
                        print(k)
                        if k == id:
                            print(k)
                            covi = 0.0
                            covi = covi + float(coverage)
                            dct_cluster_sol[i][j][k] = float(covi)


And this is the error I obtain:

Traceback (most recent call last):
  File "biostars.py", line 39, in <module>
    dct_cluster_sol[i][j][k] = float(covi)
TypeError: 'set' object does not support item assignment


Thank you!

dictionary python • 470 views

It looks like you are trying to parse CD-HIT output. Maybe you prefer to do it on your own, but there are already scripts to do that. I recommend ParseCDHIT.py in this collection of tools:

https://github.com/jrjhealey/bioinfo-tools

Searching GitHub for parse cdhit will produce many other results, but I linked the one I know to work.


Hi Mensur! Yes, that's exactly what I am trying to do. Do you know where I can find an example of the output format for the script? Thank you so much!


Not sure what you are asking here. Is it about the parsing script I recommended? If so, it is easy enough for you to run it and find out, as it has minimal outside dependencies. The output is not exactly what you want, but it should be relatively easy to tailor the original script.

A few lines from the output:

Parsing Cluster 23145, with IDs:
['29841', '29842', '29843', '29844', '29845', '29846', '29847', '29848', '29849', '29850', '29851', '29854']
Parsing Cluster 23146, with IDs:
['30021']
Parsing Cluster 23147, with IDs:
['394']
Parsing Cluster 23148, with IDs:
['1189']


It also creates many fasta files containing sequences from each cluster.


Not sure where exactly your error is, but what you are trying to output is essentially JSON, so you can probably use json.dumps() instead and save yourself a headache.
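
On the error itself, the traceback suggests a likely cause: nested_dic is written with braces but without key:value pairs, which makes it a set, and a set has no values to assign to. A minimal sketch, assuming the four sample names from the question and a starting value of 0.0 for each, builds a real dict and then serializes it with json.dumps():

```python
import json

# Sample names taken from the question. Writing them as {'a', 'b'} would
# create a set, which does not support item assignment.
sample_names = ['SOL_1_3', 'SOL_1_40', 'SOL_1_10', 'SOL_1_50']

# dict.fromkeys builds a dict in which every sample starts at 0.0.
samples = dict.fromkeys(sample_names, 0.0)

# Accumulate the two SOL_1_50 entries of Cluster 0 from the question.
samples['SOL_1_50'] += 3.5
samples['SOL_1_50'] += 3.5

result = {'Cluster 0': {'samples': samples}}

# json.dumps turns the nested dict into a JSON string.
text = json.dumps(result, indent=2)
print(text)
```

Note that a set would also fail at the json.dumps() step, since sets are not JSON serializable.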

For any tool you plan to publish or any script that will not be a one-off, consider using Pydantic schemas or dataclasses in conjunction with Pydantic for the validation of complex structures, the serialization of values and for writing clean output.
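
As a rough illustration of the dataclass route, here is a sketch that uses only the standard library. The class name ClusterCoverage and its add() method are hypothetical, and the check for negative coverage is a stand-in for the kind of validation Pydantic would provide, not Pydantic itself:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class ClusterCoverage:
    """Hypothetical container for the summed coverage per sample in one cluster."""
    name: str
    samples: Dict[str, float] = field(default_factory=dict)

    def add(self, sample: str, coverage: float) -> None:
        # Simple hand-written validation; a schema library would do this declaratively.
        if coverage < 0:
            raise ValueError(f"negative coverage for {sample}: {coverage}")
        self.samples[sample] = self.samples.get(sample, 0.0) + coverage


cluster = ClusterCoverage('Cluster 1')
cluster.add('SOL_1_50', 5.5)
cluster.add('SOL_1_50', 4.5)

# asdict converts the dataclass to plain dicts, which json.dumps can serialize.
print(json.dumps(asdict(cluster)))
```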


Is there a particular reason you need it in a nested dictionary? I think you are probably making your life unnecessarily hard by trying to do arithmetic over multiple entries and then concoct a dict format for it.

It's also not clear where 'SOL_1_10': 0 is coming from, as that sample isn't represented in the clusters anywhere. Is all missing data to be treated as 0 coverage?
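
If the answer is yes, the parsing can be sketched as below. This is only a sketch under stated assumptions: every record looks like ">SAMPLE_cov_VALUE_N_..." as in the question's example, the full list of sample names is supplied by the caller, and any sample absent from a cluster counts as 0. The names RECORD and parse_clusters are hypothetical.

```python
import re

# Assumed record layout, based on the example lines in the question:
# ">SOL_1_50_cov_3.5_N_171282..." -> sample "SOL_1_50", coverage 3.5
RECORD = re.compile(r'>(?P<sample>\S+?)_cov_(?P<cov>\d+(?:\.\d+)?)_N_')


def parse_clusters(lines, all_samples=()):
    """Sum coverage per sample within each cluster of a CD-HIT style listing."""
    result = {}
    samples = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('>Cluster'):
            # A fresh dict per cluster, so clusters never share counts.
            samples = dict.fromkeys(all_samples, 0.0)
            result[line[1:]] = {'samples': samples}
            continue
        match = RECORD.search(line)
        if match and samples is not None:
            name = match.group('sample')
            samples[name] = samples.get(name, 0.0) + float(match.group('cov'))
    return result


example = """\
>Cluster 0
0   948aa, >SOL_1_50_cov_3.5_N_171282... at 100.00%
1   815aa, >SOL_1_50_cov_3.5_N_190968... at 100.00%
2   13323aa, >SOL_1_40_cov_79.5_N_6768... *
3   395aa, >SOL_1_3_cov_5.5_N_257377... at 90.38%

>Cluster 1
0   1759aa, >SOL_1_50_cov_5.5_N_75037... at 100.00%
1   1055aa, >SOL_1_50_cov_4.5_N_129969... at 99.91%
""".splitlines()

all_samples = ['SOL_1_3', 'SOL_1_40', 'SOL_1_10', 'SOL_1_50']
result = parse_clusters(example, all_samples)
print(result)
```

Using a regular expression instead of chained split() calls means the sketch does not depend on whether the columns are separated by tabs or spaces. To read from a file instead, pass the open file object as lines.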