NetworkX used with Biopython
3.2 years ago
anasjamshed ▴ 120

I have developed a script to calculate the distance matrix and phylogenetic tree for 63 different sequences.

Script:

#Importing Libraries
import pandas as pd
#import seaborn as sns
import numpy as np
#import csv
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix

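# Calculate the distance matrix
# `aln` must hold the alignment (a MultipleSeqAlignment, e.g. loaded with
# AlignIO.read); loading it is not shown in this script
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)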
#print(dm)
# Visualize neighbor joined tree
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
matplotlib.rc('font', size=7)
fig = plt.figure(figsize=(6, 6), dpi=400)
axes = fig.add_subplot(1, 1, 1)  # axes for Phylo.draw to plot on

#Drawing of Tree
Phylo.draw(tree, axes=axes, do_show=False)
#Save Figure
plt.savefig('phy.jpg')
#Creation of Array on the basis of DM
a= np.array(dm)
print(a)

#print(a)

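# `genus` is assumed to be a DataFrame with a "genus" column holding the names
# (how it is loaded is not shown in this script)
genus1 = genus["genus"]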
#species = ['Taxa1', 'Taxa2', 'Taxa3','Taxa4', 'Taxa5', 'Taxa6','Taxa7', 'Taxa8', 'Taxa9','Taxa10', 'Taxa11', 'Taxa12','Taxa13', 'Taxa14', 'Taxa15','Taxa16', 'Taxa17', 'Taxa18','Taxa19', 'Taxa20', 'Taxa21', 'Taxa22', 'Taxa23', 'Taxa24','Taxa25', 'Taxa26', 'Taxa27','Taxa28', 'Taxa29', 'Taxa30','Taxa31', 'Taxa32', 'Taxa33','Taxa34', 'Taxa35', 'Taxa36','Taxa37', 'Taxa38', 'Taxa39','Taxa40', 'Taxa41', 'Taxa42','Taxa43', 'Taxa44', 'Taxa45','Taxa46', 'Taxa47', 'Taxa48','Taxa49', 'Taxa50', 'Taxa51','Taxa52', 'Taxa53', 'Taxa54','Taxa55', 'Taxa56', 'Taxa57','Taxa58', 'Taxa59', 'Taxa60','Taxa61', 'Taxa62','Taxa63']
print(genus1)
df = pd.DataFrame(a,columns=genus1,  index=genus1)

print(df)

#Pairwise Euclidean distances between the rows of the matrix
#(the result is not assigned to a variable, so it is discarded)

pd.DataFrame(distance_matrix(df.values, dm), index=df.index, columns=df.index)

#Saving data into csv file
df.to_csv("C:\\Users\\USER\\Desktop\\phylodata.csv")

Result:

I need to do the following:

1) Using NetworkX:

a. Import the CSV file (the matrix file that the distance matrix code generates) into pandas

b. Put labels on the graph (genus names)

2) Compare the structures of the networks using some metrics and put those in a table, in these three ways:

a. Through degree distribution (every node has a degree, and distributing that across all the nodes gives a probability distribution)

b. Through closeness centrality (which can be calculated in a single line from the network), to see how many hops it takes to get from one node to another

c. Through direct comparison between the different nodes

Can anyone help me, please?

Biopython Graphs Python • 1.8k views

a = nx.closeness_centrality(G, distance='weight')

gives me:

{'Steroidobacter': 2.1825689426338295, 'Tahibacter': 2.28218966846569, 'Dyella': 2.251978088861838, 'Lysobacter': 2.2846557579499844, 'Stenotrophomonas': 2.277974449746037, 'Pseudoxanthomonas': 2.2512929723151807, 'Xanthomonas': 2.273425499231951, 'Burkholderia-Caballeronia-Paraburkholderia': 2.283245911755632, 'Massilia': 2.2681992337164756, 'Duganella': 2.267156862745098, 'uncultured': 1.7487888455630392, 'Pseudomonas': 2.2165643252957916, 'Providencia': 2.233625113190462, 'Pantoea': 2.2695905535960743, 'Erwinia': 2.266809618624598, 'Serratia': 2.253006545897397, 'Acinetobacter': 1.6685456595264938, 'Sphingomonas': 1.7922015015742307, 'Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium': 1.783132530120482, 'Ochrobactrum': 1.7750059966418807, 'Methylobacterium': 1.7930700266537438, 'Dyadobacter': 1.7247407062113975, 'Chitinophaga': 1.7151465986788736, 'Taibaiella': 1.721931355439209, 'Flavobacterium': 1.6982214572576018, 'Mucilaginibacter': 1.7452830188679243, 'Sphingobacterium': 1.7283662267896762, 'Pedobacter': 1.7191311418283188, 'RB41': 1.5810276679841893, 'uncultured forest soil bacterium': 1.6954977660671322, 'Streptomyces': 1.8836706121929487, 'Paenarthrobacter': 1.8954918032786883, 'Microbacterium': 1.9079541059688019, 'Terrabacter': 1.8908905072186013, 'Rhodococcus': 1.8824726532688882, 'Nocardioides': 1.6525234479678426, 'Kribbella': 1.8355450824755055, '21551': 1.5739657556099118}


Is this output correct?
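
One way to sanity-check numbers like these is to reproduce the calculation by hand on a tiny graph. For a connected graph, NetworkX computes closeness as (n - 1) divided by the sum of shortest-path distances from the node, so with distances below 1 the values can legitimately exceed 1. The sketch below uses made-up node names and weights, not the real data:

```python
import networkx as nx

# Toy weighted graph; the node names and distances are invented for illustration.
G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 0.2), ("B", "C", 0.4), ("A", "C", 0.5)])

# Closeness using the "weight" attribute as the edge distance.
cc = nx.closeness_centrality(G, distance="weight")

# Manual check for node "A": (n - 1) / sum of shortest-path distances from "A".
lengths = nx.single_source_dijkstra_path_length(G, "A", weight="weight")
manual = (len(G) - 1) / sum(lengths.values())

print(cc["A"], manual)
```

If the two numbers agree on a small case, the same call on the full graph is probably doing what is intended. Separately, the dictionary above has 38 entries although the post mentions 63 sequences; one possible cause is that repeated genus names were merged into a single node during relabeling, which would be worth checking.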

3.2 years ago

To build a NetworkX graph from pandas, you don't have to save the data to CSV first. NetworkX can read pandas DataFrames directly.
There are two types of DataFrames you can use: an edge list and an adjacency list.
An edge list contains at least two columns (source, target) holding the nodes that belong to each edge. It can have more columns, usually to denote edge attributes (weight, distance, etc.). To load such a DataFrame into NetworkX, you can use: networkx.convert_matrix.from_pandas_edgelist(df, source, target, edge_attr).

• df: DataFrame containing the data
• source: name of the column containing one of the nodes
• target: name of the column containing the other node
• edge_attr: name of the column (or a list of column names) containing data that should be attached to each edge (e.g. weight, distance)

For example:

import pandas as pd
import networkx as nx

df = pd.DataFrame([["a","b",1],["a","c",2]])
df.columns = ("n1", "n2", "distance")
G = nx.convert_matrix.from_pandas_edgelist(df,"n1","n2","distance")


That's your graph. You can go to the NetworkX site for help on how to plot things and get graph/edge stats. There are a lot of ways to do it, and it will be a good learning experience.
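
If the distances are already in a square, labeled DataFrame (like the df built in the question), one possible way to get an edge list is to list each unordered pair once. This is only a minimal sketch; the genus names and distances are made up, and the column names n1, n2, and distance simply follow the example above:

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Toy symmetric distance matrix; names and values are invented for illustration.
names = ["Dyella", "Massilia", "Erwinia"]
dist = pd.DataFrame(
    [[0.0, 0.2, 0.5],
     [0.2, 0.0, 0.4],
     [0.5, 0.4, 0.0]],
    index=names,
    columns=names,
)

# One row per unordered pair of labels, skipping the diagonal.
rows = [(a, b, dist.loc[a, b]) for a, b in combinations(dist.index, 2)]
edges = pd.DataFrame(rows, columns=["n1", "n2", "distance"])

# Same loading call as in the example above.
G = nx.from_pandas_edgelist(edges, "n1", "n2", "distance")
print(G.edges(data=True))
```

With this route the node labels come straight from the DataFrame index, so no relabeling step is needed, and the distance ends up under the "distance" edge attribute.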

I am assuming that your distance matrix has weights (the distances), so a plain, unweighted "adjacency" list won't work.
Looking at the Biopython docs, it looks like the object returned by get_distance is not in a format that loads directly into a pandas DataFrame, so we can first build a NumPy array from dm and then fix the node names:

# 1. Make a NumPy array from the get_distance object
np_matrix = np.array(dm)
# 2. Read the graph from the NumPy array
# (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it)
G = nx.from_numpy_array(np_matrix)
# This graph does not have the correct node labels. Let us fix that.

# 3.1 Generate a dictionary that maps indices (current names in the G graph), to the correct labels
# Here I am assuming you already have the get_distance object as dm
labels = {ix: name for ix, name in enumerate(dm.names)}

# 3.2 Fix labels in graph
G_with_labels = nx.relabel_nodes(G, labels)
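
As an alternative sketch, if the square matrix is already in a labeled DataFrame (for example, read back from the saved CSV), nx.from_pandas_adjacency builds the same kind of weighted graph with the labels already in place. The CSV content below is made up, and the "genus" header for the index column is an assumption about how the file was written:

```python
import io

import networkx as nx
import pandas as pd

# Stand-in for the saved CSV; names and distances are invented for illustration.
csv_text = """genus,Dyella,Massilia,Erwinia
Dyella,0.0,0.2,0.5
Massilia,0.2,0.0,0.4
Erwinia,0.5,0.4,0.0
"""

# Row labels come from the first column, column labels from the header.
df = pd.read_csv(io.StringIO(csv_text), index_col="genus")

# Each nonzero cell becomes an edge whose "weight" attribute is the cell value.
G = nx.from_pandas_adjacency(df)
print(G.edges(data=True))
```

Note that zero entries are treated as "no edge", so two sequences with distance 0 would not be connected, and row and column labels must match for the call to succeed.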


From this point, go ahead and work with your graph. Just one thing: from here on, the distance is stored under the "weight" edge attribute of the graph.
Good luck!
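
For the metrics table, here is a rough sketch of one possible approach. A graph built from a full distance matrix connects almost every pair of nodes, so every node has nearly the same degree; to get an informative degree distribution, the sketch keeps only edges below a cutoff. The cutoff value, node names, and distances are all hypothetical, and thresholding itself is my assumption rather than something the question asks for:

```python
from collections import Counter

import networkx as nx
import numpy as np
import pandas as pd

# Toy distance matrix; names and values are invented for illustration.
names = ["g1", "g2", "g3", "g4"]
mat = np.array([
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
])
G = nx.relabel_nodes(nx.from_numpy_array(mat), dict(enumerate(names)))

# Keep only "close" pairs; 0.55 is an arbitrary, hypothetical cutoff.
cutoff = 0.55
H = nx.Graph()
H.add_nodes_from(G.nodes)
H.add_edges_from(
    (u, v, d) for u, v, d in G.edges(data=True) if d["weight"] <= cutoff
)

# Per-node table: degree and distance-weighted closeness centrality.
table = pd.DataFrame({
    "degree": dict(H.degree()),
    "closeness": nx.closeness_centrality(H, distance="weight"),
})

# Degree distribution: fraction of nodes having each degree.
counts = Counter(deg for _, deg in H.degree())
degree_dist = {k: c / H.number_of_nodes() for k, c in sorted(counts.items())}

print(table)
print(degree_dist)
```

The table can then be compared row by row between networks, or saved with table.to_csv.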


I tried:

df = pd.read_csv('C://Users//USER//Desktop/phylodata.csv', index_col='genus')
#nx.draw(G)

# 3.1 Generate a dictionary that maps indices (current names in the G graph), to the correct labels
# Here I am assuming you already have the get_distance object as dm
labels = genus1

# 3.2 Fix labels in graph
G_with_labels = nx.relabel_nodes(G, labels)
fig = plt.figure(figsize=(35, 35))
nx.draw(G_with_labels)


and got the result:

But how can I label the nodes with genus names?
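
A minimal sketch of one way to show the names, assuming the graph was built from the matrix as in the answer: nx.relabel_nodes expects a mapping from old node to new name (so a dict is the safest thing to pass), and nx.draw only prints node names when with_labels=True is given. The names, distances, and output file name below are made up:

```python
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# Toy data; names and distances are invented for illustration.
names = ["Dyella", "Massilia", "Erwinia"]
mat = np.array([
    [0.0, 0.2, 0.5],
    [0.2, 0.0, 0.4],
    [0.5, 0.4, 0.0],
])
G = nx.from_numpy_array(mat)

# Map integer node ids (0, 1, 2, ...) to names, then relabel.
labels = dict(enumerate(names))
G_with_labels = nx.relabel_nodes(G, labels)

fig, ax = plt.subplots(figsize=(6, 6))
pos = nx.spring_layout(G_with_labels, seed=0)  # fixed seed for a repeatable layout
# with_labels=True draws each node's name on top of it.
nx.draw(G_with_labels, pos=pos, ax=ax, with_labels=True,
        node_size=600, font_size=8)
fig.savefig("labeled_graph.png")  # hypothetical output file name
```

With the real data, dict(enumerate(genus1)) should give the same kind of mapping, provided genus1 lists the names in the same order as the matrix rows.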