Question: (Closed) Dictionary to Matrix
flogin (FioCruz/Brazil) wrote, 4 months ago:

I have received a file with several results from DnaSP. The content of the file looks like this (only the first lines are shown; the original file has a very large number of lines):

     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 0   (Total number of mutations: 0)
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,0000000000
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 0
     Number of significant pairwise comparisons by Fisher's exact test: 0
     Number of significant pairwise comparisons by chi-square test: 0
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00000
     Input Data File: C:\...\GSTE7_EXON.AB.07.fas
     Selected region: 1-672   Number of sites: 672
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
       Variable (polymorphic) sites: 11   (Total number of mutations: 11)
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00662
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Number of pairwise comparisons: 55
     Number of significant pairwise comparisons by Fisher's exact test: 51
     Number of significant pairwise comparisons by chi-square test: 51
     Value of ZnS (Kelly 1997): 0,4058
     Value of Za (Rozas et al. 2001): 0,5058
     Value of ZZ (Rozas et al. 2001): 0,1001
      r^2 values:  Y = 0,4200 - 0,0668X   (55 points)
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
     Nucleotide diversity, Pi: 0,00662
     Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
     Selected region: 1-672   Number of sites: 672
        Nucleotide diversity, Pi: 0,00662
     Tajima's D: 2,27081     Statistical significance: *, P < 0.05
     Coding region: Tajima's D: 2,27081     *, P < 0.05
Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
 Selected region: 1-672   Number of sites: 672
   Variable (polymorphic) sites: 20   (Total number of mutations: 20)
 Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
 Selected region: 1-672   Number of sites: 672
 Nucleotide diversity, Pi: 0,00642
 Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
 Selected region: 1-672   Number of sites: 672
 Number of pairwise comparisons: 190
 Number of significant pairwise comparisons by Fisher's exact test: 1
 Number of significant pairwise comparisons by chi-square test: 145
 Value of ZnS (Kelly 1997): 0,6608
 Value of Za (Rozas et al. 2001): 0,7791
 Value of ZZ (Rozas et al. 2001): 0,1183
  r^2 values:  Y = 0,6736 - 0,0530X   (190 points)
 Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
 Selected region: 1-672   Number of sites: 672
 Nucleotide diversity, Pi: 0,00642
 Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
 Selected region: 1-672   Number of sites: 672
    Nucleotide diversity, Pi: 0,00642
 Tajima's D: -1,83426     Statistical significance: *, P < 0.05
 Coding region: Tajima's D: -1,83426     *, P < 0.05
 NonSynonymous sites: Tajima's D(NonSyn): -1,74110     *, P < 0.05

As we can see, the file has several redundant lines ("Input Data File", for example), and the specific information always comes after a ":". To analyze those results, I want to make a table with several pieces of information: the name of the file, the number of sites, and the results of each evolutionary test (Tajima's, Fu and Li's, and Linkage Disequilibrium).

I have a little experience with Python, and I think that building a Python dictionary and converting it to a data frame could be a good way to solve my problem. So I wrote this script:

# -*- coding: utf-8 -*-
import pandas as pd
# opening txt file
file = open("GST71_OK.txt","r")
#creating output file
output = open('output.csv','w+')
# creating dictionary keys order
keys_order = ["Input Data File","Number of sites","Variable (polymorphic) sites", "Nucleotide diversity, Pi",
              "Value of ZnS", "Value of Za","Value of ZZ","r^2 values","Number of pairwise comparisons",
             "Fisher's exact test","chi-square test","Tajima's D","Tajima's D(Syn)","Tajima's D(NonSyn)",
              "Tajima's D(Sil)"]

# creating dictionary
dictio = dict()
# searching patterns by line
for line in file:
    for key in keys_order: # patterns are present in keys_order
        key, values = line.strip().split(":") # values are present after ':'
        dictio.setdefault(key, set()).update(values)
# converting dictionary to dataframe
dictio_df = pd.DataFrame.from_dict(dictio, orient='index', 
                           columns=['Input','Total Sites', 'Polymorphic Sites', 'Pi', 'ZnS', 'Za','ZZ', 'r^2','Pairwise Comparisons',
                                    'Fischer','Chi^2','Tajimas D','Syn','NonSyn','Sil'])
#writing output
with open("output.csv", "wt") as out:
    for line in dictio_df:
        print(line,file=out)
output.close()

Briefly: the script opens the results file, creates an output file, uses a list of keys (the pieces of information that I want) to search for each key in the input file and store the matches in a dictionary, converts the dictionary to a data frame, and saves that data frame to the output file.
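The pipeline described above can be sketched with the standard library alone. This is only a rough illustration under stated assumptions: the sample text is a hypothetical excerpt with one "label: value" pair per line, `build_rows` and `to_csv` are made-up helper names, and rows are grouped by the value of "Input Data File". Lines that carry two labels (such as "Selected region ... Number of sites ...") would need extra splitting that is not shown here.

```python
import csv
import io

# Hypothetical excerpt in the same "label: value" shape as the DnaSP output.
SAMPLE = (
    "Input Data File: GSTE7_FN_CONVERTIDO.txt\n"
    "Nucleotide diversity, Pi: 0,00662\n"
    "Input Data File: GSTE7_FN_CONVERTIDO.txt\n"
    "Value of ZnS (Kelly 1997): 0,4058\n"
)

def build_rows(lines):
    """Collect one dict per input file, merging repeated blocks for the same file."""
    rows = {}
    current = None
    for line in lines:
        # partition() cuts at the first colon only, so it never raises on extra colons.
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "label: value" line
        key, value = key.strip(), value.strip()
        if key == "Input Data File":
            current = rows.setdefault(value, {"Input Data File": value})
        elif current is not None:
            current[key] = value
    return list(rows.values())

def to_csv(rows):
    """Return CSV text; the header is the union of keys in first-seen order."""
    header = []
    for row in rows:
        for key in row:
            if key not in header:
                header.append(key)
    buffer = io.StringIO()
    # Semicolon delimiter, because the values use a decimal comma.
    writer = csv.DictWriter(buffer, fieldnames=header, delimiter=";", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = build_rows(SAMPLE.splitlines())
print(to_csv(rows))
```

Keeping one dictionary per input file means each file ends up as one table row, and missing statistics are simply left empty.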

But I'm stuck on this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-ae4db977bd1a> in <module>()
     16 for line in file:
     17     for key in keys_order: # patterns are present in keys_order
---> 18         key, values = line.strip().split(":") # values are present after ':'
     19         dictio.setdefault(key, set()).update(values)
     20 # converting dictionary to dataframe

ValueError: too many values to unpack (expected 2)

Can anyone help me?

Thanks

dictionary pandas python table • 204 views

Hello flogin!

We believe that this post does not fit the main topic of this site.

I was able to solve it.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

Reply written 4 months ago by flogin
WouterDeCoster (Belgium) wrote, 4 months ago:

The error says that there are lines containing more than one ":", so split(":") returns more than two parts, which raises an error when you try to unpack the result into key, values.

One of those lines would be:

Selected region: 1-672 Number of sites: 672
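A minimal sketch of the failure and one possible way around it. The helper name `parse_fields` is made up for illustration, and it assumes that fields on the same line are separated by three or more spaces, as in the sample pasted in the question; lines with nested labels (for example "Coding region: Tajima's D: ...") would still need special handling.

```python
import re

line = "Selected region: 1-672   Number of sites: 672"

# A plain split on ":" yields three parts here, so unpacking into two names fails.
parts = line.strip().split(":")
print(len(parts))

def parse_fields(text):
    """Split a line into fields on runs of 3+ spaces, then cut each field at its first colon."""
    fields = {}
    for chunk in re.split(r"\s{3,}", text.strip()):
        key, sep, value = chunk.partition(":")
        if sep:  # skip chunks that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields

print(parse_fields(line))
```

Using `partition(":")` (or `split(":", 1)`) guarantees at most one cut per field, so the unpacking can no longer raise "too many values to unpack".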


OK WouterDeCoster, I now keep only one ":" per line (only in the lines with the information that I need).

The following error is reported:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-6-286eb07c422e> in <module>()
     22 # converting dictionary to dataframe
     23 dictio_df = pd.DataFrame.from_dict(dictio, orient='index', 
---> 24                            columns=['Input','Total Sites', 'Polymorphic Sites', 'Pi', 'ZnS', 'Za','ZZ', 'r^2','Pairwise Comparisons','Fischer','Chi^2','Tajimas D','Syn','NonSyn','Sil'])
     25 #writing output
     26 with open("output.csv", "wt") as out:

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in from_dict(cls, data, orient, dtype, columns)
    983             raise ValueError('only recognize index or columns for orient')
    984 
--> 985         return cls(data, index=index, columns=columns, dtype=dtype)
    986 
    987     def to_dict(self, orient='dict', into=dict):

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    385                     if is_named_tuple(data[0]) and columns is None:
    386                         columns = data[0]._fields
--> 387                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    388                     columns = _ensure_index(columns)
    389 

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
   7454         data = lmap(tuple, data)
   7455         return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 7456                                dtype=dtype)
   7457 
   7458 

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _list_to_arrays(data, columns, coerce_float, dtype)
   7511         content = list(lib.to_object_array(data).T)
   7512     return _convert_object_array(content, columns, dtype=dtype,
-> 7513                                  coerce_float=coerce_float)
   7514 
   7515 

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _convert_object_array(content, columns, coerce_float, dtype)
   7569             raise AssertionError('{col:d} columns passed, passed data had '
   7570                                  '{con} columns'.format(col=len(columns),
-> 7571                                                         con=len(content)))
   7572 
   7573     # provide soft conversion of object dtypes

AssertionError: 15 columns passed, passed data had 44 columns

If I print the dictionary content:

Input Data File : {'f', 'X', 'l', 'p', 's', 'S', 'n', 'C', 'i', 'B', 'a', '\\', 'A', 'g', '.', '8', 'G', 'x', 'e', 'R', 'T', 'r', '4', '7', 'o', '9', '1', '5', 'c', 'd', '_', 'I', '0', '3', '6', 'N', 'O', 'E', 'v', '2', 'F', 'V', 'D', 't'}
Number of sites : {' ', '2', '3', '6', '7'}
Variable (polymorphic) sites : {'f', 'l', 'b', 's', 'n', 'i', 'a', ')', 'm', '(', 'T', 'r', 't', '4', '7', 'o', '9', '1', '5', '_', '3', '0', '6', ' ', 'u', '2', 'e'}
Nucleotide diversity, Pi : {' ', '9', '1', '5', '3', '8', '2', '4', '0', ',', '6', '7'}
Number of pairwise comparisons : {'9', ' ', '3', '5', '1', '8', '2', '4', '0', '6', '7'}
Number of significant pairwise comparisons by Fisher's exact test : {' ', '9', '1', '5', '3', '8', '2', '4', '0', '6', '7'}
Number of significant pairwise comparisons by chi-square test : {' ', '9', '1', '5', '3', '8', '2', '4', '0', '6', '7'}
Value of ZnS (Kelly 1997) : {' ', '9', '1', '5', '3', '8', '2', '4', '0', ',', '6', '7'}
Value of Za (Rozas et al. 2001) : {' ', '9', '1', '5', '3', '8', '2', '4', '0', ',', '6', '7'}
Value of ZZ (Rozas et al. 2001) : {' ', '9', '1', '3', '5', '8', '-', '2', '4', '0', ',', '6', '7'}
r^2 values : {'Y', 'X', 'p', 's', 'i', 'n', ')', '+', '(', '8', '4', ',', '=', '7', 'o', '9', '1', '5', '-', '0', '3', '6', ' ', '2', 't'}
Tajima's D : {'f', 'l', 's', 'S', 'i', 'n', 'a', 'g', '.', '8', 't', '4', ',', '7', '9', '1', '5', 'c', '_', '0', '-', '3', '<', '6', ' ', '*', 'P', '2', 'e'}
Coding region_ Tajima's D : {' ', '*', '.', '6', '9', '1', 'P', '5', '3', '8', '-', '2', '4', '0', ',', '<', '7'}
NonSynonymous sites_ Tajima's D(NonSyn) : {' ', '*', '.', '9', '6', '1', 'P', '5', '3', '0', '4', '-', ',', '<', '7'}

Apparently, I have a serious error in my code.

Reply written 4 months ago by flogin

If I count in your dictionary I see 14 lines, so that doesn't match with the number of columns you specified, but it also doesn't match with the error message you received...
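For what it's worth, the 44 in the message does line up with the printed dictionary: the set under "Input Data File" holds 44 distinct characters, and with orient='index' each key becomes a row whose values are spread across columns, so the widest value determines the column count. A minimal sketch of that count in plain Python, using toy sets rather than the real data (it only mimics the width calculation and does not call pandas; `row_widths` is a made-up helper):

```python
# Toy stand-in for the printed dictionary: each value is a set of characters.
dictio = {
    "Number of sites": set(" 672"),
    "Nucleotide diversity, Pi": set(" 0,00662"),
    "Input Data File": set(" GSTE7_FN_CONVERTIDO.txt"),
}

def row_widths(d):
    """Return how many cells each key would occupy if laid out as a row."""
    return {key: len(values) for key, values in d.items()}

widths = row_widths(dictio)
widest = max(widths.values())  # the table needs this many columns
print(widths)
print("columns needed:", widest)
```

So the mismatch is between the 15 column names supplied and the length of the longest value, not the number of keys.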

Reply written 4 months ago by WouterDeCoster

Yeah, but the values for each key are completely scrambled. For the key "Input Data File" the values should be the file names (GSTE7_EXON.AB.07.fas;GSTE7_FN_CONVERTIDO.txt); for the key "Number of sites" the values should be the numbers of sites (672;672)...
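A likely cause of the scrambled values, sketched under the assumption that the script still calls `set().update(values)` with a string: `set.update()` iterates over its argument, so a string is added one character at a time, and a set keeps neither order nor duplicates. Appending whole strings to a list avoids both problems. The key/value pairs below are a hypothetical excerpt.

```python
from collections import defaultdict

value = " 672"

# update() treats the string as an iterable of characters.
as_chars = set()
as_chars.update(value)
print(as_chars)

# Appending to a list keeps each value whole, in order, duplicates included.
pairs = [
    ("Number of sites", " 672"),
    ("Number of sites", " 672"),
    ("Nucleotide diversity, Pi", " 0,00662"),
]
collected = defaultdict(list)
for key, val in pairs:
    collected[key].append(val.strip())

print(dict(collected))
```

If unique values per key are really wanted, `set.add(value)` stores the whole string as one element, unlike `update(value)`.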

Reply written 4 months ago by flogin