Question: Mutual Information from Nucleotide Distribution in Python
12 weeks ago, nameuser30 wrote:

This question has been removed from this site -- please see stackoverflow if interested.

#### Previous content restored by Ram from Google Cache

Hi there,

I'm currently trying to write a program that calculates the mutation rate from text files of nucleotide distributions. I'm hoping to move a mutual information calculation that I currently do in Excel over to Python, and I'm stuck at this step in the calculation.

An example of an input file is as follows

``````
A,T,G,C
84 , 59 , 35 , 125032
74 , 40 , 6 , 125082
125107 , 44 , 24 , 36
3 , 44 , 4 , 125161
125122 , 23 , 28 , 37
5 , 23 , 4 , 125180
125149 , 8 , 18 , 37
125124 , 32 , 14 , 38
9 , 25 , 8 , 125170
``````
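For reference, here is a minimal sketch of reading a file in this format with pandas. It assumes the spaces around the commas shown in the sample, so it splits on a regular expression, which requires the slower `engine="python"` parser. The inline `sample` string only stands in for the real file and holds the first three data rows from above.

```python
import io

import pandas as pd

# First rows of the sample input, inlined so the sketch runs on its own.
sample = """A,T,G,C
84 , 59 , 35 , 125032
74 , 40 , 6 , 125082
125107 , 44 , 24 , 36
"""

# Split on a comma with optional surrounding whitespace.
df = pd.read_csv(io.StringIO(sample), sep=r"\s*,\s*", engine="python")
print(df)
```

With a real file, the `io.StringIO(sample)` argument would be replaced by the path to the file.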

The program:

``````
import sys

import pandas as pd

filename = sys.argv[1]
number_of_bins = int(sys.argv[2])  # command-line arguments arrive as strings

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None = no limit; -1 is deprecated

cols = ['A', 'T', 'G', 'C']
# The input has spaces around the commas, so split on a regex.
df = pd.read_csv(filename, sep=r'\s*,\s*', engine='python')

# Per-position mutation rate from the raw counts.
df['max'] = df[cols].max(axis=1)
df['sum'] = df[cols].sum(axis=1)
df['mutation_rate'] = 1 - df['max'] / df['sum']

# Convert the counts to per-position frequencies.
df[cols] = df[cols].div(df['sum'], axis=0)
df['max2'] = df[cols].max(axis=1)
df['sum2'] = df[cols].sum(axis=1)
df['marginal_distribution'] = 1 - df['max2'] / df['sum2']

# Divide each frequency by the number of bins (8 for the output shown).
for base in cols:
    df[base + '/numberOfBins'] = df[base].div(number_of_bins)

print(df)
``````

With the output

``````
    A   T   G   C
0   0.000671    0.000471    0.00028 0.998578
1   0.000591    0.000319    0.000048    0.999042
2   0.999169    0.000351    0.000192    0.000288
3   0.000024    0.000351    0.000032    0.999593
4   0.999297    0.000184    0.000224    0.000296
5   0.00004     0.000184    0.000032    0.999744
6   0.999497    0.000064    0.000144    0.000295
7   0.999329    0.000256    0.000112    0.000303
8   0.000072    0.0002      0.000064    0.999665

max    sum mutation_rate
125032  125210  0.001422
125082  125202  0.000958
125107  125211  0.000831
125161  125212  0.000407
125122  125210  0.000703
125180  125212  0.000256
125149  125212  0.000503
125124  125208  0.000671
125170  125212  0.000335

max2    sum2
0.998578    1
0.999042    1
0.999169    1
0.999593    1
0.999297    1
0.999744    1
0.999497    1
0.999329    1
0.999665    1

marginal_distribution
0.001422
0.000958
0.000831
0.000407
0.000703
0.000256
0.000503
0.000671
0.000335

A/numberOfBins  T/numberOfBins  G/numberOfBins  C/numberOfBins
0.000084    0.000059    0.000035    0.124822
0.000074    0.00004     0.000006    0.12488
0.124896    0.000044    0.000024    0.000036
0.000003    0.000044    0.000004    0.124949
0.124912    0.000023    0.000028    0.000037
0.000005    0.000023    0.000004    0.124968
0.124937    0.000008    0.000018    0.000037
0.124916    0.000032    0.000014    0.000038
0.000009    0.000025    0.000008    0.124958
``````

I am trying to calculate the Shannon entropy / mutual information. Thank you SO much.
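As a rough sketch of those two quantities (not a definitive answer to the question): the Shannon entropy of one position is H = -Σ p·log2(p) over the four base frequencies, and it can be computed directly from the counts in the input file. Mutual information, by contrast, needs the joint counts of two variables (for example, how often each pair of bases co-occurs at two positions), which the input file does not contain. The `mutual_information` helper below therefore takes a hypothetical joint count table as its argument. Both function names are made up for this illustration.

```python
import numpy as np
import pandas as pd


def shannon_entropy(counts: pd.DataFrame) -> pd.Series:
    """Row-wise Shannon entropy in bits; each row holds one position's counts."""
    p = counts.div(counts.sum(axis=1), axis=0).to_numpy(dtype=float)
    # Treat 0 * log2(0) as 0 by only taking the log where p > 0.
    logp = np.zeros_like(p)
    np.log2(p, out=logp, where=p > 0)
    return pd.Series(-(p * logp).sum(axis=1), index=counts.index)


def mutual_information(joint_counts) -> float:
    """Mutual information in bits from a 2-D table of joint counts.

    The joint table is an assumption of this sketch; it cannot be
    derived from per-position counts alone.
    """
    pxy = np.asarray(joint_counts, dtype=float)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of the row variable
    py = pxy.sum(axis=0, keepdims=True)  # marginal of the column variable
    nz = pxy > 0  # skip empty cells, which contribute nothing
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())


# Two rows taken from the sample input above.
counts = pd.DataFrame(
    [[84, 59, 35, 125032], [125107, 44, 24, 36]], columns=list("ATGC")
)
entropy_bits = shannon_entropy(counts)
print(entropy_bits)
```

A position with equal counts for all four bases gives the maximum of 2 bits, while the nearly invariant positions in the sample come out close to 0.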

modified 8 weeks ago by _r_am32k • written 12 weeks ago by nameuser30

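``````
# Example row: one data line from the input file above, split on commas.
row = "84 , 59 , 35 , 125032".split(",")
row = list(map(int, row))  # int() ignores the surrounding spaces
print(1 - max(row) / sum(row))  # mutation rate, about 0.001422 for this row
``````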

Edit: the text (especially the code) of the question appears to have changed since the initial posting, so this comment no longer makes sense.

Hello nameuser,

Do not redact content after you've received feedback on a post. This is inconsiderate and such behavior can lead to suspension of your user account.

Please point to the StackOverflow post that you are referring to. In the meantime, I'll be restoring the content of this post from Google Cache.