group viral family pandas dataframe by number of rows
3.8 years ago
flogin ▴ 280

Hey guys, I have a table with about 1600 rows, like this:

Element-ID,Protein-ID,Protein-Product,Virus-ID,Super-Kingdom,Order,Family,Genus,Species,Sense,Start,End,EVE-length
ctg_1003:180657-180956,YP_184770.1,hypothetical protein,39640,Viruses,Unclass,Polydnaviridae,Bracovirus,Unclass,neg,180657,180956,299
ctg_1003:283549-284079,YP_184879.1,hypothetical protein,39640,Viruses,Unclass,Polydnaviridae,Bracovirus,Unclass,pos,283549,284079,530
ctg_1007:58711-59043,YP_184770.1,hypothetical protein,39640,Viruses,Unclass,Polydnaviridae,Bracovirus,Unclass,neg,58711,59043,332
ctg_100:908810-909199,YP_184882.1,hypothetical protein,39640,Viruses,Unclass,Polydnaviridae,Bracovirus,Unclass,neg,908810,909199,389
ctg_1011:242875-243240,YP_001426207.1,hypothetical protein,399781,Viruses,Algavirales,Phycodnaviridae,Chlorovirus,Paramecium bursaria Chlorella virus A1,pos,242875,243240,365
  

Basically it's a CSV file with taxonomy information from endogenous viruses. So I read this CSV file into a pandas dataframe (I'm working with Python 3.7).

df = pd.read_csv('eve_tax.csv', sep=',',header = 0)

And create a swarm plot with seaborn:

beeswarn_plot =sns.swarmplot(x='EVE-length',y='Super-Kingdom', hue='Family', data=df)

So, my code, at the moment, is:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style

sns.set(style="ticks")
style.use('seaborn-poster')

df = pd.read_csv('eve_tax.csv', sep=',',header = 0)
beeswarn_plot =sns.swarmplot(x='EVE-length',y='Super-Kingdom', hue='Family', data=df)
sns.despine(fig=None, top=True, right=True, left=True, bottom=False, offset=None, trim=False)
beeswarn_plot.set_ylabel('')
beeswarn_plot.set_yticks([])
beeswarn_plot.set_xlabel('EVE length (bp)')

beeswarn_plot.legend(loc='upper center', bbox_to_anchor=(0.7, 1.05), ncol=4, fancybox=True, prop={'size': 11},title='EVEs Family')
plt.savefig('eves_tax_beeswarn_plot.pdf',dpi=300)

Which plots a 'beeswarm' plot like this:

[Image: beeswarm plot]

So, what do I want?

For some families I have a low number of elements, such as Orthomyxoviridae (n = 5) and Poxviridae (n = 5). I want to put all families with 5 or fewer elements into a category 'Others', and plot with seaborn all families with more than 5 elements plus the category 'Others'.

Can anyone help? Thanks !

pandas python dataframe • 669 views
3.8 years ago
flogin ▴ 280

I resolved this with a quick fix:

family_count = df['Family'].value_counts()
family_low = family_count[family_count <= 5]
family_low_list = family_low.index
for i in family_low_list:
    df['Family'] = df['Family'].replace({i:'Others'})

So I replace all families that occur 5 times or fewer in the dataframe with 'Others', but I know there must be a way to do this without quick fixes...
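One possible loop-free variant, as a minimal sketch: map each row to the size of its family with value_counts(), then keep the family name only where that size is above the cutoff. The small dataframe below is made-up stand-in data (not the real eve_tax.csv), and the names family_size and min_count are just illustrative.

```python
import pandas as pd

# Made-up stand-in for the 'Family' column of eve_tax.csv (illustration only)
df = pd.DataFrame({
    "Family": ["Polydnaviridae"] * 7
    + ["Phycodnaviridae"] * 6
    + ["Poxviridae"] * 5
    + ["Orthomyxoviridae"] * 2
})

min_count = 5  # families with this many rows or fewer go to 'Others'

# For every row, look up how many rows its family has in total
family_size = df["Family"].map(df["Family"].value_counts())

# Keep the family name where the family is large enough, else use 'Others'
df["Family"] = df["Family"].where(family_size > min_count, "Others")

print(df["Family"].value_counts())
```

This should give the same grouping as the loop above for rows that have a family name. One difference to be aware of: rows with a missing Family value would also end up in 'Others' here, while the loop leaves them untouched.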
