I received a file with several results from DnaSP. Its content looks like this (only the first lines are shown; the original file has a huge number of lines):
Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Variable (polymorphic) sites: 0 (Total number of mutations: 0)
Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,0000000000
Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Number of pairwise comparisons: 0
Number of significant pairwise comparisons by Fisher's exact test: 0
Number of significant pairwise comparisons by chi-square test: 0
Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00000
Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Variable (polymorphic) sites: 11 (Total number of mutations: 11)
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00662
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Number of pairwise comparisons: 55
Number of significant pairwise comparisons by Fisher's exact test: 51
Number of significant pairwise comparisons by chi-square test: 51
Value of ZnS (Kelly 1997): 0,4058
Value of Za (Rozas et al. 2001): 0,5058
Value of ZZ (Rozas et al. 2001): 0,1001
r^2 values: Y = 0,4200 - 0,0668X (55 points)
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00662
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00662
Tajima's D: 2,27081 Statistical significance: *, P < 0.05
Coding region: Tajima's D: 2,27081 *, P < 0.05
Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
Selected region: 1-672 Number of sites: 672
Variable (polymorphic) sites: 20 (Total number of mutations: 20)
Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00642
Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
Selected region: 1-672 Number of sites: 672
Number of pairwise comparisons: 190
Number of significant pairwise comparisons by Fisher's exact test: 1
Number of significant pairwise comparisons by chi-square test: 145
Value of ZnS (Kelly 1997): 0,6608
Value of Za (Rozas et al. 2001): 0,7791
Value of ZZ (Rozas et al. 2001): 0,1183
r^2 values: Y = 0,6736 - 0,0530X (190 points)
Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00642
Input Data File: C:\...\GSTE7_allpop_02_convertido.fas.algn.fas
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00642
Tajima's D: -1,83426 Statistical significance: *, P < 0.05
Coding region: Tajima's D: -1,83426 *, P < 0.05
NonSynonymous sites: Tajima's D(NonSyn): -1,74110 *, P < 0.05
As we can see, there are several redundant lines (Input Data File, for example), and the specific information always comes after a ":". To analyze these results, I want to build a table with several pieces of information: the name of the file, the number of sites, and the results of each evolutionary test (Tajima's D, Fu and Li's, and linkage disequilibrium).
I have a little experience with Python, and I think that collecting the values in a Python dictionary and converting it to a data frame would be a good way to solve my problem. So I wrote this script:
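To make the goal more concrete, here is a minimal sketch of the kind of per-line extraction I have in mind. It assumes one anchored regular expression per field, because some lines carry more than one label (for example "Selected region: ... Number of sites: ..."), and it converts the decimal commas to numbers. The column names, the patterns, and the helper names to_number and parse_line are my own assumptions based on the sample above, not anything provided by DnaSP.

```python
import re

# Hypothetical column name -> pattern, written from the sample lines above.
PATTERNS = {
    "Input": re.compile(r"^Input Data File:\s*(.+)$"),
    "Total Sites": re.compile(r"Number of sites:\s*(\d+)"),
    "Polymorphic Sites": re.compile(r"^Variable \(polymorphic\) sites:\s*(\d+)"),
    "Pi": re.compile(r"^Nucleotide diversity, Pi:\s*(-?[\d,]+)"),
    "ZnS": re.compile(r"^Value of ZnS[^:]*:\s*(-?[\d,]+)"),
    "Tajimas D": re.compile(r"^Tajima's D:\s*(-?[\d,]+)"),
    "NonSyn": re.compile(r"Tajima's D\(NonSyn\):\s*(-?[\d,]+)"),
}


def to_number(text):
    """Turn '672' into an int and '0,00662' (decimal comma) into a float.

    Text that is not numeric (a file path, for instance) is returned unchanged.
    """
    candidate = text.replace(",", ".")
    for cast in (int, float):
        try:
            return cast(candidate)
        except ValueError:
            pass
    return text


def parse_line(line):
    """Return {column: value} for every pattern that matches this line."""
    found = {}
    stripped = line.strip()
    for column, pattern in PATTERNS.items():
        match = pattern.search(stripped)
        if match:
            found[column] = to_number(match.group(1))
    return found


print(parse_line("Selected region: 1-672 Number of sites: 672"))
print(parse_line("Tajima's D: 2,27081 Statistical significance: *, P < 0.05"))
```

Anchoring "Tajima's D" at the start of the line keeps the "Coding region: Tajima's D: ..." line from being counted a second time.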
# -*- coding: utf-8 -*-
import pandas as pd
# opening txt file
file = open("GST71_OK.txt","r")
#creating output file
output = open('output.csv','w+')
# creating dictionary keys order
keys_order = ["Input Data File","Number of sites","Variable (polymorphic) sites", "Nucleotide diversity, Pi",
"Value of ZnS", "Value of Za","Value of ZZ","r^2 values","Number of pairwise comparisons",
"Fisher's exact test","chi-square test","Tajima's D","Tajima's D(Syn)","Tajima's D(NonSyn)",
"Tajima's D(Sil)"]
# creating dictionary
dictio = dict()
# searching patterns by line
for line in file:
    for key in keys_order: # patterns are present in keys_order
        key, values = line.strip().split(":") # values are present after ':'
        dictio.setdefault(key, set()).update(values)
# converting dictionary to dataframe
dictio_df = pd.DataFrame.from_dict(dictio, orient='index',
columns=['Input','Total Sites', 'Polymorphic Sites', 'Pi', 'ZnS', 'Za','ZZ', 'r^2','Pairwise Comparisons',
'Fischer','Chi^2','Tajimas D','Syn','NonSyn','Sil'])
#writing output
with open("output.csv", "wt") as out:
    for line in dictio_df:
        print(line,file=out)
output.close()
Briefly: the script opens the results file, creates an output file, uses a list of keys (the pieces of information that I want) to search for each key in the input file and store it in a dictionary, converts the dictionary to a data frame, and saves that data frame to the output file.
But I'm stuck on this error:
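The steps described above could be sketched roughly as follows, using only the standard library (csv instead of pandas) so that it runs anywhere. This is an illustration under my own assumptions, not a working version of my script: it treats every "Input Data File" line as the start of a new block, merges all blocks that share the same file name into one row, and covers only three fields. SAMPLE, FIELDS, and build_rows are hypothetical names.

```python
import csv
import io
import re

# A few lines copied from the file shown above (paths left as they appear).
SAMPLE = r"""Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Variable (polymorphic) sites: 0 (Total number of mutations: 0)
Input Data File: C:\...\GSTE7_EXON.AB.07.fas
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,0000000000
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Variable (polymorphic) sites: 11 (Total number of mutations: 11)
Input Data File: C:\...\GSTE7_FN_CONVERTIDO.txt
Selected region: 1-672 Number of sites: 672
Nucleotide diversity, Pi: 0,00662
"""

# Hypothetical column name -> pattern for the value that follows the label.
FIELDS = {
    "Total Sites": re.compile(r"Number of sites:\s*(\d+)"),
    "Polymorphic Sites": re.compile(r"^Variable \(polymorphic\) sites:\s*(\d+)"),
    "Pi": re.compile(r"^Nucleotide diversity, Pi:\s*(-?[\d,]+)"),
}


def build_rows(lines):
    """Merge all blocks that belong to the same input file into one row."""
    rows = {}        # file name -> row dict (dicts keep insertion order)
    current = None   # row of the block currently being read
    for raw in lines:
        line = raw.strip()
        if line.startswith("Input Data File:"):
            # Split only at the first ':' so the 'C:' in the path survives.
            name = line.split(":", 1)[1].strip()
            current = rows.setdefault(name, {"Input": name})
            continue
        if current is None:
            continue  # ignore anything before the first block
        for column, pattern in FIELDS.items():
            match = pattern.search(line)
            if match:
                # Store the value with a decimal point instead of a comma.
                current[column] = match.group(1).replace(",", ".")
    return list(rows.values())


rows = build_rows(io.StringIO(SAMPLE))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Input", *FIELDS], restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

To write a real file instead of the in-memory buffer, the buffer could be replaced with open("output.csv", "w", newline=""), and build_rows could be given the opened results file directly.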
ValueError Traceback (most recent call last)
<ipython-input-11-ae4db977bd1a> in <module>()
16 for line in file:
17 for key in keys_order: # patterns are present in keys_order
---> 18 key, values = line.strip().split(":") # values are present after ':'
19 dictio.setdefault(key, set()).update(values)
20 # converting dictionary to dataframe
ValueError: too many values to unpack (expected 2)
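A small check that reproduces the message outside the script, assuming the trigger is a line that contains more than one ":" (the drive letter in "C:\..." adds a second one). In that case split(":") returns three pieces, which cannot be unpacked into two names. Splitting only at the first colon, or using str.partition, always gives a fixed number of pieces:

```python
line = r"Input Data File: C:\...\GSTE7_EXON.AB.07.fas"

# Two colons in the line -> three pieces, so "key, values = ..." fails.
parts = line.strip().split(":")
print(len(parts))

try:
    key, values = line.strip().split(":")
except ValueError as error:
    message = str(error)
    print(message)

# maxsplit=1 splits only at the first ':' and yields exactly two pieces.
key, value = line.strip().split(":", 1)
key, value = key.strip(), value.strip()
print(key)
print(value)

# partition always returns three items, even when there is no ':' at all.
head, sep, tail = "a line without a colon".partition(":")
print(repr(sep))
```

Lines that contain no ":" at all would raise the opposite error ("not enough values to unpack") with split, which is why partition can be the safer choice here.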
Can anyone help me?
Thanks