How to loop over and select values from a Python dataframe
3 months ago
raalsuwaidi ▴ 90

Hi all,

I have a Python (pandas) dataframe based on a distance matrix of the 1000 Genomes samples. The rows and the columns are the sample IDs in the same order, and each value is the distance between the two samples.

A sample of the dataframe is shown below:

>            10174     10187     10205     10215     10227     10231     10249     10347     10411     10490
>    0    0.000000  0.069211  0.067786  0.068593  0.068817  0.068341  0.067894  0.069827  0.068312  0.067571
>    1    0.069211  0.000000  0.069832  0.070054  0.070337  0.068410  0.069597  0.071458  0.069664  0.069361
>    2    0.067786  0.069832  0.000000  0.069795  0.070234  0.069094  0.068961  0.070510  0.069114  0.069188
>    3    0.068593  0.070054  0.069795  0.000000  0.069213  0.069364  0.068045  0.069976  0.068899  0.068610
>    4    0.068817  0.070337  0.070234  0.069213  0.000000  0.069265  0.066743  0.069880  0.068370  0.068147


The actual file has over 2000 samples, with all the sample IDs in the header as column names.

My question is how to select a group of values for each sample based on conditions. For example, for sample 10174, which samples have values above 0.068 and below 0.07? In the example above, looking down the 10174 column, these are 10187, 10215, and 10227. Sample 10174 itself is not selected, because its value is 0.

As I need to do this for each of the 2000 samples, I cannot do it manually.

I assume this will have to be a loop over the column values, but I am not sure how to write that for a dataframe so that it returns the names of the samples whose values fulfil the condition.

Thank you so much in advance for the help.

dataframe python • 397 views
3 months ago
Kveta ▴ 10

Assuming your dataframe looks like this (I changed the numbers a bit and added sample numbers to the index):

>>> df
       10174  10187  10205  10215
10174    0.0   0.60   0.55    0.7
10187    0.6   0.00   0.60    0.6
10205    0.7   0.40   0.00    0.5
10215    0.4   0.65   0.70    0.0


You can get the values satisfying your conditions by using:

df[(df[10174] > 0.4) & (df[10174] < 0.8)][10174]
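
For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the small example frame and does the same single-column selection with `.loc`. It assumes the sample IDs are integers; if they are read from a file they may well be strings, in which case the column would be `"10174"`. The names `mask`, `selected` and `matching_samples` are only illustrative.

```python
import pandas as pd

# Rebuild the small example frame shown above.
# Assumption: sample IDs are integers (IDs read from a file may be strings).
ids = [10174, 10187, 10205, 10215]
df = pd.DataFrame(
    [[0.0, 0.60, 0.55, 0.7],
     [0.6, 0.00, 0.60, 0.6],
     [0.7, 0.40, 0.00, 0.5],
     [0.4, 0.65, 0.70, 0.0]],
    index=ids,
    columns=ids,
)

# Boolean mask over the rows of column 10174 (strict inequalities on both sides).
mask = (df[10174] > 0.4) & (df[10174] < 0.8)

# .loc picks the matching rows and the single column in one step.
selected = df.loc[mask, 10174]

# The row labels of the selection are the sample IDs that satisfy the condition.
matching_samples = selected.index.tolist()
print(matching_samples)  # [10187, 10205]
```

Selecting with `df.loc[mask, 10174]` gives the same values as the chained form above while doing the row and column lookup in a single indexing step.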


This works for a single column (single sample). To calculate these for the whole dataframe, I'd use an apply function, something like this:

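def filter_by_value(col):
    # keep the entries of this column that lie strictly between the two thresholds
    res = col[(col > 0.4) & (col < 0.8)]
    # the row labels of the remaining entries are the matching sample IDs
    return list(res.index)

# axis=0 passes each column to the function as a Series
df.apply(filter_by_value, axis=0)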


Which returns:

10174           [10187, 10205]
10187           [10174, 10215]
10205    [10174, 10187, 10215]
10215    [10174, 10187, 10205]
dtype: object
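
As a rough check against the original question, the same idea can be sketched on the first five samples of the posted matrix with the thresholds 0.068 and 0.07. The posted frame has rows labelled 0 to 4, so this sketch first copies the column names onto the rows, relying on the statement that rows and columns are in the same order. The helper `samples_in_range` and the dict `hits` are hypothetical names, integer IDs are an assumption, and a dict comprehension is used here instead of `apply` simply to keep the shape of the result explicit.

```python
import pandas as pd

# First five samples of the matrix from the question (integer IDs are an assumption).
ids = [10174, 10187, 10205, 10215, 10227]
values = [
    [0.000000, 0.069211, 0.067786, 0.068593, 0.068817],
    [0.069211, 0.000000, 0.069832, 0.070054, 0.070337],
    [0.067786, 0.069832, 0.000000, 0.069795, 0.070234],
    [0.068593, 0.070054, 0.069795, 0.000000, 0.069213],
    [0.068817, 0.070337, 0.070234, 0.069213, 0.000000],
]
df = pd.DataFrame(values, columns=ids)  # rows are labelled 0..4, as in the question

# Rows and columns are stated to be in the same order,
# so the column names can be reused as row labels.
df.index = df.columns


def samples_in_range(col, lower, upper):
    """Return the row labels whose value in `col` lies strictly between the bounds."""
    return col[(col > lower) & (col < upper)].index.tolist()


# One list of matching sample IDs per column (sample).
hits = {sample: samples_in_range(df[sample], 0.068, 0.07) for sample in df.columns}

print(hits[10174])  # [10187, 10215, 10227]
```

The diagonal value of 0 falls outside the range, so each sample is excluded from its own list without any special handling.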


Thank you so much. I am playing around with it, and it should work. Can you please tell me how to get just the text values for all the columns fitting the condition, without having to put them in a Python list?


Can you please specify what exactly you mean by that? You want the names of the columns (samples) or the whole columns?


I want the text of the column names. For example, given the results

10174           [10187, 10205]
10187           [10174, 10215]
10205    [10174, 10187, 10215]
10215    [10174, 10187, 10205]

I want 10174, 10187, 10205, 10215 as the output. These are the samples that the condition applies to, as a string rather than a Python list.

I know I could use the to_string function, but in the above case I would have to use it on each list, and I don't know how to loop through them.
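
One possible way to get plain strings, sketched on the small example frame from the answer: build the per-sample lists, join each list with `", ".join(...)`, and separately join the union of all matching IDs into one string. This is only a sketch; the names `hits`, `per_sample` and `combined` are made up for illustration, and integer sample IDs are an assumption.

```python
import pandas as pd

# Small example frame from the answer (integer IDs are an assumption).
ids = [10174, 10187, 10205, 10215]
df = pd.DataFrame(
    [[0.0, 0.60, 0.55, 0.7],
     [0.6, 0.00, 0.60, 0.6],
     [0.7, 0.40, 0.00, 0.5],
     [0.4, 0.65, 0.70, 0.0]],
    index=ids,
    columns=ids,
)


def filter_by_value(col):
    # row labels of the entries strictly between the two thresholds
    res = col[(col > 0.4) & (col < 0.8)]
    return list(res.index)


# One list of matching sample IDs per column.
hits = {sample: filter_by_value(df[sample]) for sample in df.columns}

# One comma-separated string per sample instead of a list.
per_sample = {
    sample: ", ".join(str(name) for name in names)
    for sample, names in hits.items()
}

# A single string with every sample that appears in any of the lists.
all_names = sorted({name for names in hits.values() for name in names})
combined = ", ".join(str(name) for name in all_names)

print(per_sample[10174])  # 10187, 10205
print(combined)           # 10174, 10187, 10205, 10215
```

The set comprehension removes duplicates across the lists, and `sorted` is used only to make the order of the combined string predictable.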