Question

Upset plot from orthofinder genecount.tsv file using python

18 months ago

Shakunthala Natarajan ▴ 50

I am trying to plot the Orthogroup gene count table as an UpSet plot using Python. I have 12543 orthogroups and six different species in my gene count table file. I referred to a previous post on Stack Overflow that did the same thing with a smaller data set (upset plot from genecount.tsv).

But I am not getting the correct UpSet plot: it just displays the total number of orthogroups as such. The code I wrote is below:

import pandas as pd
dic={'group':[],'sp1':[],'sp2':[],'sp3':[],'sp4':[],'sp5':[],'sp6':[],'total':[]}
with open ("/home/ubuntu/Orthogroups.GeneCount.tsv","r") as f:
    f.readline()
    line=f.readline()
    while line:
        parts=line.strip().split("\t")
        dic['group'].append(parts[0])
        dic['sp1'].append(parts[1])
        dic['sp2'].append(parts[2])
        dic['sp3'].append(parts[3])
        dic['sp4'].append(parts[4])
        dic['sp5'].append(parts[5])
        dic['sp6'].append(parts[6])
        dic['total'].append(parts[7])
        line=f.readline()
df=pd.DataFrame(data=dic).set_index("group")
group_dict={}

for index,row  in df.iterrows():
    for sp,count in row.items():
        if sp != "total" and count != 0:
            group_dict.setdefault(index, []).append(sp)
group_dict       
import pyupset as pyu
from upsetplot import UpSet
from upsetplot import from_memberships
x=from_memberships(group_dict.values()).sort_values(ascending= False)
UpSet(x, subset_size='count',show_counts=True).plot()

The upset plot I get is as follows:

[image not included]

Can someone please help? Thank you!

python orthofinder upset • 1.5k views

ADD COMMENT • link 18 months ago by Shakunthala Natarajan ▴ 50

I suspect the problem is the dataframe dtypes you get from reading in your data file vs. what the example gets by directly inputting Python integers. Can you run df.dtypes and post the result for just the first few columns? If you aren't doing this in a Jupyter notebook, just run everything up to df=pd.DataFrame(data=dic).set_index("group") in your code and then add print(df.dtypes).

ADD REPLY • link 18 months ago by Wayne ★ 2.0k

score 2 · Accepted Answer · 2022-10-21

If my suspicion is correct, your kludgy way of reading in the data to make a dictionary matching your example caused the headache you are having. If you want to keep your code, you have to cast the Pandas columns to a numeric data type. One option is to insert the following two lines after you make your dataframe df:

cols = df.columns[df.dtypes.eq('object')]  # based on https://stackoverflow.com/a/36814203/8508004
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce') # based on https://stackoverflow.com/a/36814203/8508004

Specifically, your code with those two lines added would be:

import pandas as pd
dic={'group':[],'sp1':[],'sp2':[],'sp3':[],'sp4':[],'sp5':[],'sp6':[],'total':[]}
with open ("/home/ubuntu/Orthogroups.GeneCount.tsv","r") as f:
    f.readline()
    line=f.readline()
    while line:
        parts=line.strip().split("\t")
        dic['group'].append(parts[0])
        dic['sp1'].append(parts[1])
        dic['sp2'].append(parts[2])
        dic['sp3'].append(parts[3])
        dic['sp4'].append(parts[4])
        dic['sp5'].append(parts[5])
        dic['sp6'].append(parts[6])
        dic['total'].append(parts[7])
        line=f.readline()
df=pd.DataFrame(data=dic).set_index("group")
cols = df.columns[df.dtypes.eq('object')]  # based on https://stackoverflow.com/a/36814203/8508004
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce') # based on https://stackoverflow.com/a/36814203/8508004
group_dict={}

for index,row  in df.iterrows():
    for sp,count in row.items():
        if sp != "total" and count != 0:
            group_dict.setdefault(index, []).append(sp)
group_dict       
#import pyupset as pyu
from upsetplot import UpSet
from upsetplot import from_memberships
x=from_memberships(group_dict.values()).sort_values(ascending= False)
UpSet(x, subset_size='count',show_counts=True).plot()

(I commented out the line import pyupset as pyu because it seemed unnecessary. I didn't want to install another package that may confuse things in the namespace.)
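To see what the two casting lines do in isolation, here is a minimal sketch on a made-up two-row table (the values and the forced dtype=object are my assumptions, chosen so the columns start out as strings the way they do after split()):

```python
import pandas as pd

# Hypothetical toy data: counts held as strings, as they are after str.split().
# dtype=object is forced so the columns are 'object' whatever the pandas version.
df = pd.DataFrame(
    {"sp1": ["1912", "804"], "sp2": ["0", "0"]},
    index=pd.Index(["group1", "group2"], name="group"),
    dtype=object,
)
dtypes_before = df.dtypes.astype(str).to_dict()

# The two casting lines: pick the object columns, convert them to numbers
cols = df.columns[df.dtypes.eq("object")]
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")
dtypes_after = df.dtypes.astype(str).to_dict()

print(dtypes_before)
print(dtypes_after)
```

With errors='coerce', anything that cannot be parsed as a number becomes NaN instead of raising an error, so it is worth checking for NaN afterwards if your file might not be clean.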

Perhaps a better route

I think the example in the post you referenced was just making a dictionary in order to make the dataframe. I suspect you could have just let Pandas read in the tsv file data and assign data types. It usually does pretty well if your data is clean/consistent.

To illustrate:

I'm going to use the following code, in a Jupyter notebook, to make a mock .tsv text file that matches the referenced example but has six species like yours:

s = '''group_name\tsp1\tsp2\tsp3\tsp4\tsp5\tsp6\ttotal
group1\t1912\t0\t1\t0\t0\t0\t1913
group2\t804\t0\t0\t0\t0\t0\t804
group3\t780\t0\t0\t0\t0\t0\t780'''
%store s >"Orthogroups.GeneCount.tsv"

I made up the mock header line because I know there is one based on your posted code; however, I don't know what it is. That line is just my guess.
If I use Pandas to read that in, I just need two lines to get to the dataframe form of the data:

import pandas as pd
df=pd.read_csv("Orthogroups.GeneCount.tsv",sep="\t").set_index("group_name")

Basically, Pandas is what you want to use if you have a structured text data table to read into Python. The name is derived from "panel data". If you look at the dataframe that makes, it should be close to the one the example data makes. Note I had to adapt the set_index() step to match the 'group_name' column header in my mock file.

So then your code becomes the following, which is just those two lines followed by the parts that build group_dict and make the plot:

import pandas as pd
df=pd.read_csv("Orthogroups.GeneCount.tsv",sep="\t").set_index("group_name")
group_dict={}

for index,row  in df.iterrows():
    for sp,count in row.items():
        if sp != "total" and count != 0:
            group_dict.setdefault(index, []).append(sp)
group_dict       
from upsetplot import UpSet
from upsetplot import from_memberships
x=from_memberships(group_dict.values()).sort_values(ascending= False)
UpSet(x, subset_size='count',show_counts=True).plot();

Much more succinct and Pythonic.
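If you only need the intersection sizes, the per-row loop can also be sketched with Pandas alone. This is not the code from the referenced post, just one possible alternative; the three-species table below is made up for brevity:

```python
import io
import pandas as pd

# Hypothetical three-species table in the same layout as the mock file
tsv = (
    "group_name\tsp1\tsp2\tsp3\ttotal\n"
    "group1\t1912\t0\t1\t1913\n"
    "group2\t804\t0\t0\t804\n"
    "group3\t780\t0\t0\t780\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t").set_index("group_name")

# True where a species has at least one gene in the orthogroup
presence = df.drop(columns="total") > 0

# Number of orthogroups for each observed presence/absence combination
combo_counts = presence.groupby(list(presence.columns)).size()
print(combo_counts)
```

The result is a Series indexed by one boolean level per species. As far as I know that is the kind of input upsetplot works with, but check its documentation before relying on it.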

Alternate option to fix your code

Now you may be asking: if Pandas is so great, why didn't my dictionary work? Investigating that leads to an alternate way you could have updated your code.
If I restructure your code to handle something with just the 3 species like the referenced example, I get for dic:

{'group': ['group1', 'group2', 'group3'],
 'sp1': ['1912', '804', '780'],
 'sp2': ['0', '0', '0'],
 'sp3': ['1', '0', '0'],
 'total': ['1913', '804', '780']}

The values are being read in as strings because you are reading them in from a text file. Note the quotes around the numbers in the print out of dic above. But in the example dictionary they are not strings:

d= {"groups":["group1", "group2", "group3"],"sp1":[1912,804, 780], "sp2":[0,0,0], "sp3": [1,0,0], "total":[1913,804,780]}

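No quotes around the numbers. Because a string such as '0' never equals the integer 0, the count != 0 test in your loop is true for every species in every row, which would explain a plot that puts all the orthogroups into a single all-species intersection.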
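The effect of that type difference on the count != 0 test can be checked directly in plain Python. The helper name present_species and the single row of values are mine, for illustration only:

```python
# Values as they come out of line.strip().split("\t"): all strings
row_from_file = {"sp1": "1912", "sp2": "0", "sp3": "1"}
# The same values as Python integers, as in the referenced example
row_as_ints = {"sp1": 1912, "sp2": 0, "sp3": 1}


def present_species(row):
    """Return the species whose count passes the `count != 0` test."""
    return [sp for sp, count in row.items() if count != 0]


print(present_species(row_from_file))  # ['sp1', 'sp2', 'sp3']
print(present_species(row_as_ints))    # ['sp1', 'sp3']
```

With string values, sp2 is reported as present even though its count is zero.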

So you could have avoided that difference by casting each of the values to integers when you were reading them in from the data file by changing the seven append() lines:

import pandas as pd
dic={'group':[],'sp1':[],'sp2':[],'sp3':[],'sp4':[],'sp5':[],'sp6':[],'total':[]}
with open ("/home/ubuntu/Orthogroups.GeneCount.tsv","r") as f:
    f.readline()
    line=f.readline()
    while line:
        parts=line.strip().split("\t")
        dic['group'].append(parts[0])
        dic['sp1'].append(int(parts[1]))
        dic['sp2'].append(int(parts[2]))
        dic['sp3'].append(int(parts[3]))
        dic['sp4'].append(int(parts[4]))
        dic['sp5'].append(int(parts[5]))
        dic['sp6'].append(int(parts[6]))
        dic['total'].append(int(parts[7]))
        line=f.readline()
df=pd.DataFrame(data=dic).set_index("group")
group_dict={}

for index,row  in df.iterrows():
    for sp,count in row.items():
        if sp != "total" and count != 0:
            group_dict.setdefault(index, []).append(sp)
group_dict       
#import pyupset as pyu
from upsetplot import UpSet
from upsetplot import from_memberships
x=from_memberships(group_dict.values()).sort_values(ascending= False)
UpSet(x, subset_size='count',show_counts=True).plot()

When Pandas reads data from a text file, it tries to infer the datatype of each column from the data rows. (A header line, if there is one, is kept out of that inference. This is why clean, well-structured data matters, and why it can help to tell pd.read_csv() through its optional parameters how to handle any complications. If part of a header ends up among the data when Pandas reads the file, it can interfere with the datatype assignment even if you later remove it cleanly from the dataframe.) When Pandas builds a dataframe from a Python dictionary, it does not parse the values that way: they are already Python objects that were given a type, more or less, when the user read them in from somewhere.
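Here is a small sketch of that header problem. The "dirty" text is a made-up damaged file in which the header line is repeated among the data rows; nothing suggests your file looks like this, it only illustrates the point:

```python
import io
import pandas as pd
from pandas.api.types import is_numeric_dtype

clean = "group_name\tsp1\tsp2\ngroup1\t1912\t0\ngroup2\t804\t0\n"
# Hypothetical damaged file: the header line appears again in the middle
dirty = (
    "group_name\tsp1\tsp2\n"
    "group1\t1912\t0\n"
    "group_name\tsp1\tsp2\n"
    "group2\t804\t0\n"
)

clean_df = pd.read_csv(io.StringIO(clean), sep="\t", index_col="group_name")
dirty_df = pd.read_csv(io.StringIO(dirty), sep="\t", index_col="group_name")

# Clean data: every count column is inferred as numeric
clean_is_numeric = all(is_numeric_dtype(t) for t in clean_df.dtypes)
# Stray header text in a column stops it from being inferred as numeric
dirty_is_numeric = any(is_numeric_dtype(t) for t in dirty_df.dtypes)

# Removing the stray row does not change the column types by itself...
repaired = dirty_df.drop(index="group_name")
numeric_after_drop = any(is_numeric_dtype(t) for t in repaired.dtypes)
# ...so an explicit conversion is still needed
repaired = repaired.apply(pd.to_numeric)
numeric_after_cast = all(is_numeric_dtype(t) for t in repaired.dtypes)

print(clean_is_numeric, dirty_is_numeric, numeric_after_drop, numeric_after_cast)
```

So even after the stray row is dropped, the columns stay text until they are converted, which is the same situation your dictionary-based code ended up in.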