Group rows in table based on column
1
0
Entering edit mode
21 months ago
pramirez ▴ 10

Hello,

I need to group the rows in a table based on a set of conditions. Here is the table: enter image description here

And I need to group it based on the following conditions:

  1. I want to identify the rows that have the same taxonomy.
  2. If the rows have the same taxonomy but difference of the value for column11 is more than 10^5 from the other columns, I want to drop the row. For example, in the table above, I would drop row 5 (tomato) because the value for column11 is more than five orders of magnitude different than the value from column11 for the other tomatoes.
  3. I want to group the remaining rows by taxonomy and keep the mean of column11 as the final column 11 value per row. Here is the desired result:

enter image description here

Here is the code I have so far:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', decimal='.')


l= list(df.iloc[:,0])
q = len(l) - 1
r=[False]
tax = df.taxonomy
m = 0
for k in range(0,q):
    r.append(l[k]==l[k+1])
    print('k')
    print(l[k])
    print('k+1')
    print(l[k+1])
    print(r)
    dif = df.column11[k] - df.column11[k+1]
    if ((r == 'True') AND (dif > 100000)):
         df.drop(k)

The code finds the rows that have the same taxonomy, but does not execute the "if" loop. I'd be grateful if I could have some help.

Thanks!

df python dataframe pandas • 496 views
ADD COMMENT
1
Entering edit mode
21 months ago
BlastedBadger ▴ 160

First, your code has syntax errors. Can you change the "if" line with:

if (r[-1] and dif > 100000):

and check again?

Similarly, print('k') will print the string k, not the value of k. Use print(k) !

Then, it's seems to me that there are conceptual errors as well. You basically want to remove outlier values from groups. Maybe search methods that are designed to spot outliers, like removing values that are distant from the mean by more than 2 standard deviations? This method I suggest might not be appropriate if you have few individuals per group though.

Other conceptual error: the check l[k]==l[k+1] will be False the first time you see "Tomato", but you also want to check that first value against others, right?

Instead of doing it manually, I encourage you to do a python tutorial and a pandas tutorial. In pandas, you can group rows that way:

df.groupby('taxonomy')

which means you could compute mean and standard deviations of the whole group, broadcasted back on each row using:

groupmeans = df.groupby('taxonomy')['column11'].transform('mean').rename('mean11')
groupstds = df.groupby('taxonomy')['column11'].transform('std').rename('std11')

and finally make your checks

indivstats = pd.concat((df, groupmeans, groupstds), axis=1)

keepme = (indivstats.column11 - indivstats.mean11).abs() / indivstats.std11 > 2

To terminate, your questions on this Q&A site need to be formatted more appropriately if you want help: please do not post images of dataframe! Post a "reproducible" example, that is the code required to generate the dataframe. Example:

df = pd.DataFrame([
    ['mouse', 3.02e-90, 'Animal', 'Muridae'],
    ['mouse', 3.02e-89, 'Animal', 'Muridae'],
    ['tomato', 3.02e-90, 'Plant', 'Solanaceae'],
    # etc
    ],
    columns=['taxonomy', 'column11', 'Kingdom', 'Family'])
ADD COMMENT

Login before adding your answer.

Traffic: 1878 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6