Question: python tricks to parse a dataframe
clementpch0 wrote:

Hi everyone,

I want to create a dataframe from another one using Python code. The dataframe that I want to modify looks like this:

[image: input dataframe to modify]

From this dataframe I want to produce the following (I take just the first 3 rows of the dataframe to build the example):

[image: output wanted]

As you can see, I want to separate the GO terms per gene in order to build separate GO term lists for CC, MF and BP.

This is the function that I started to build, but I am blocked because the GO column mixes float and string values, which does not allow me to split it.

import pandas as pd

sep1 = "\t"
sep2 = ","

sepIn = ";"

# Read input files (both path variables are defined elsewhere)
inputFileToParse = pd.read_csv(inputFileToParsePath, sep=sep1)
pivotFile = pd.read_csv(pivotFileForParsingPath, sep=sep2, index_col=0)

# Erasing unwanted column
del pivotFile["level_0"]

# get columns from the input dataframe
inputFileColumnsNames = inputFileToParse.columns

# function
def getListDescGoPerProcess(inputFileToParse, pivotFile, sepIn):
    # separate all rows holding multiple values (na=False counts NaN rows as "no match")
    if inputFileToParse['GO'].str.contains(sepIn, na=False).any():
        print("some genes have multiple GO terms, they will be separated for the parsing")
        # this is where I am blocked: row['GO'] is a float (NaN) for genes without GO terms
        pd.concat([pd.Series(row['CMiso_genes'], row['GO'].split(sepIn))
                   for _, row in inputFileToParse.iterrows()]).reset_index()

    # Create the 3 output dataframes to save
    BPtable = pd.DataFrame(columns=inputFileColumnsNames)
    CCtable = pd.DataFrame(columns=inputFileColumnsNames)
    MFtable = pd.DataFrame(columns=inputFileColumnsNames)
    # (the function is unfinished from here)

I have not finished the function yet because I am blocked on the line where I concatenate the result of splitting the second column. As I said before, the split function cannot handle float values.

Do you have any suggestions for building this dataframe?

Thanks in advance

annotation code python • 180 views
modified 4 weeks ago • written 4 weeks ago by clementpch0

Can you please post the input data as text using the code environment. It's too complicated for people to create by hand from an image.

written 4 weeks ago by Joe18k

Hi, thanks for your response.

Here is the dataframe as text:

                 CMiso_genes                                GO
 0      CMiso1.1chr04g0000001                               NaN
 1      CMiso1.1chr04g0000011  GO:0003676;GO:0046983;GO:0015074
 2      CMiso1.1chr04g0000021                               NaN
 3      CMiso1.1chr04g0000031                               NaN
 4      CMiso1.1chr04g0000041  GO:0016301;GO:0016773;GO:0016310
 ...                      ...                               ...
 28278  CMiso1.1chr00g0282781                               NaN
 28279  CMiso1.1chr00g0282791                               NaN
 28280  CMiso1.1chr00g0282801                               NaN
 28281  CMiso1.1chr00g0282811                               NaN
 28282  CMiso1.1chr00g0282821                               NaN
modified 4 weeks ago • written 4 weeks ago by clementpch0

joseph4tran10 wrote:

Hi clementpch

You can use the pandas explode() function to turn your semicolon-separated list into rows. The trick is to first convert the semicolon-separated string into a Python list with str.split(), and then call explode() to get one row per list element. Here is a link to a post where you can find examples: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows
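The error in the question most likely comes from calling Python's own split() on each row: missing values are stored as NaN, which is a float and has no split method. The vectorised .str.split() leaves those values alone instead. A minimal sketch on a made-up two-value Series (the values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Two illustrative values: one missing annotation, one semicolon-separated string
go = pd.Series([np.nan, "GO:0003676;GO:0046983"])

# Row-wise split fails on the missing value, because NaN is a float
try:
    [value.split(";") for value in go]
    rowwise_error = None
except AttributeError as err:
    rowwise_error = err

# The .str accessor keeps NaN as NaN and splits only the real strings
split_go = go.str.split(";")

print(rowwise_error)
print(split_go)
```

This is why no type check or loop is needed before explode(): the NaN rows simply pass through unchanged.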

Here is an example with pandas 1.1.2:

#!/usr/bin/env python

import pandas as pd
import numpy as np

#        CMiso_genes                                          GO
# 0      CMiso1.1chr04g0000001                               NaN
# 1      CMiso1.1chr04g0000011  GO:0003676;GO:0046983;GO:0015074
# 2      CMiso1.1chr04g0000021                               NaN
# 3      CMiso1.1chr04g0000031                               NaN
# 4      CMiso1.1chr04g0000041  GO:0016301;GO:0016773;GO:0016310

df = pd.DataFrame({'CMiso_genes': ['CMiso1.1chr04g0000001', 'CMiso1.1chr04g0000011', 'CMiso1.1chr04g0000021', 'CMiso1.1chr04g0000031', 'CMiso1.1chr04g0000041'], 'GO': [np.nan, 'GO:0003676;GO:0046983;GO:0015074', np.nan, np.nan, 'GO:0016301;GO:0016773;GO:0016310']})
print(df)

print(df['GO'].str.split(';'))

# if you want to preserve the original index
print(df.assign(GO=df['GO'].str.split(';')).explode('GO', ignore_index=False))

# if you want to ignore index
print(df.assign(GO=df['GO'].str.split(';')).explode('GO', ignore_index=True))

And here is the output:

python explode.py 

# your data  
            CMiso_genes                                GO
0  CMiso1.1chr04g0000001                               NaN
1  CMiso1.1chr04g0000011  GO:0003676;GO:0046983;GO:0015074
2  CMiso1.1chr04g0000021                               NaN
3  CMiso1.1chr04g0000031                               NaN
4  CMiso1.1chr04g0000041  GO:0016301;GO:0016773;GO:0016310
0                                     NaN
1    [GO:0003676, GO:0046983, GO:0015074]
2                                     NaN
3                                     NaN
4    [GO:0016301, GO:0016773, GO:0016310]
Name: GO, dtype: object

# explode preserving the original index  
             CMiso_genes          GO
0  CMiso1.1chr04g0000001         NaN
1  CMiso1.1chr04g0000011  GO:0003676
1  CMiso1.1chr04g0000011  GO:0046983
1  CMiso1.1chr04g0000011  GO:0015074
2  CMiso1.1chr04g0000021         NaN
3  CMiso1.1chr04g0000031         NaN
4  CMiso1.1chr04g0000041  GO:0016301
4  CMiso1.1chr04g0000041  GO:0016773
4  CMiso1.1chr04g0000041  GO:0016310

# explode ignoring the original index  
             CMiso_genes          GO
0  CMiso1.1chr04g0000001         NaN
1  CMiso1.1chr04g0000011  GO:0003676
2  CMiso1.1chr04g0000011  GO:0046983
3  CMiso1.1chr04g0000011  GO:0015074
4  CMiso1.1chr04g0000021         NaN
5  CMiso1.1chr04g0000031         NaN
6  CMiso1.1chr04g0000041  GO:0016301
7  CMiso1.1chr04g0000041  GO:0016773
8  CMiso1.1chr04g0000041  GO:0016310
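
To go from the exploded frame to the three tables asked for in the question (BP, CC, MF), one option is to merge it with a table that maps each GO ID to its namespace. The sketch below assumes such a table exists: the go_namespace frame, its namespace column and the BP/MF labels in it are hypothetical placeholders filled in by hand, not the layout of the pivot file from the question (which is not shown), and should be checked against the real ontology.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CMiso_genes": ["CMiso1.1chr04g0000001", "CMiso1.1chr04g0000011", "CMiso1.1chr04g0000041"],
    "GO": [np.nan, "GO:0003676;GO:0046983;GO:0015074", "GO:0016301;GO:0016773;GO:0016310"],
})

# Hypothetical lookup table (assumption): GO ID -> namespace
go_namespace = pd.DataFrame({
    "GO": ["GO:0003676", "GO:0046983", "GO:0015074", "GO:0016301", "GO:0016773", "GO:0016310"],
    "namespace": ["MF", "MF", "BP", "MF", "MF", "BP"],
})

# One row per (gene, GO term); genes without annotation are dropped first
long_df = (
    df.dropna(subset=["GO"])
      .assign(GO=lambda d: d["GO"].str.split(";"))
      .explode("GO", ignore_index=True)
)

# Attach the namespace, then build one table per namespace
annotated = long_df.merge(go_namespace, on="GO", how="left")
tables = {
    ns: annotated.loc[annotated["namespace"] == ns, ["CMiso_genes", "GO"]].reset_index(drop=True)
    for ns in ["BP", "CC", "MF"]
}

for ns, table in tables.items():
    print(ns)
    print(table)
```

With how="left", GO IDs missing from the lookup table keep a NaN namespace and end up in none of the three tables, so it is worth checking annotated["namespace"].isna().sum() on real data.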

Hope it is helpful

Jos

modified 4 weeks ago • written 4 weeks ago by joseph4tran10

clementpch0 wrote:

Hi, thanks Jos, it was helpful. I managed to write my function and it does what I want.

written 4 weeks ago by clementpch0