python tricks to parse a dataframe
3.6 years ago
clementpch • 0

Hi everyone,

I want to create a dataframe from another one using Python. The dataframe that I want to modify looks like this:

input dataframe to modify

From this dataframe I want to produce the following (I use only the first 3 rows of the dataframe to build the example):

output wanted

As you can see, I want to separate the GO terms of each gene in order to build separate GO term lists for CC, MF and BP.

This is the function that I started to write, but I am blocked because the GO column mixes float and string values, which prevents me from splitting it.

sep1="\t"
sep2=","

sepIn=";"

# Read input files
inputFileToParse=pd.read_csv(inputFileToParsePath, sep=sep1)
pivotFile=pd.read_csv(pivotFileForParsingPath,sep=sep2, index_col=0)

# Erasing unwanted column
del pivotFile["level_0"]

# get columns from the input dataframe
inputFileColumnsNames=inputFileToParse.columns

# function
def getListDescGoPerProcess(inputFileToParse, pivotFile, sepIn):
    # separate all rows that hold multiple values
    if inputFileToParse['GO'].str.contains(sepIn, na=False).any():
        print("some genes have multiple GO terms, they will be separated for the parsing")
        # this is the step that fails: rows without GO terms hold a float NaN, which has no split()
        pd.concat([pd.Series(row['CMiso_genes'], row['GO'].split(sepIn))
                   for _, row in inputFileToParse.iterrows()]).reset_index()

    # Create the 3 output dataframes to save
    BPtable = pd.DataFrame(columns=inputFileColumnsNames)
    CCtable = pd.DataFrame(columns=inputFileColumnsNames)
    MFtable = pd.DataFrame(columns=inputFileColumnsNames)
    # (function not finished yet)

I have not finished the function yet because I am blocked on the line where I concatenate the result of splitting the second column. As I said before, split() cannot handle the float values.
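
To make the problem concrete, here is a minimal sketch that reproduces it on a made-up two-row frame (the gene IDs are taken from the data, the frame itself is only for illustration). It assumes, as seems to be the case here, that genes without annotation are read as NaN, which is a float and therefore has no split() method:

```python
import numpy as np
import pandas as pd

# Made-up two-row frame mimicking the real table:
# a gene without annotation holds NaN instead of a string.
df = pd.DataFrame({
    "CMiso_genes": ["CMiso1.1chr04g0000001", "CMiso1.1chr04g0000011"],
    "GO": [np.nan, "GO:0003676;GO:0046983;GO:0015074"],
})

# The column mixes two Python types: float (NaN) and str.
value_types = [type(value).__name__ for value in df["GO"]]
print(value_types)

# Splitting row by row breaks as soon as a NaN is reached.
error_name = None
try:
    [row["GO"].split(";") for _, row in df.iterrows()]
except AttributeError as err:
    error_name = type(err).__name__
    print(error_name, err)
```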

Do you have any suggestions on how to build this dataframe?

Thanks in advance

python code annotation • 6.1k views

Can you please post the input data as text, using the code formatting? It is too complicated for people to recreate it by hand from an image.


Hi, thanks for your response.

Here is the dataframe as text:

                 CMiso_genes                                GO
 0      CMiso1.1chr04g0000001                               NaN
 1      CMiso1.1chr04g0000011  GO:0003676;GO:0046983;GO:0015074
 2      CMiso1.1chr04g0000021                               NaN
 3      CMiso1.1chr04g0000031                               NaN
 4      CMiso1.1chr04g0000041  GO:0016301;GO:0016773;GO:0016310
 ...                      ...                               ...
 28278  CMiso1.1chr00g0282781                               NaN
 28279  CMiso1.1chr00g0282791                               NaN
 28280  CMiso1.1chr00g0282801                               NaN
 28281  CMiso1.1chr00g0282811                               NaN
 28282  CMiso1.1chr00g0282821                               NaN
3.6 years ago
joseph4tran ▴ 10

Hi clementpch

You can use the pandas explode() function to turn your semicolon-separated list into rows. The trick is to first convert the semicolon-separated string into a Python list with str.split(';'), which leaves NaN values untouched, and then to call explode() to get one row per list element. Here is a link to a post where you can find more examples: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows

Here is an example with pandas 1.1.2:

#!/usr/bin/env python

import pandas as pd
import numpy as np

#        CMiso_genes                                          GO
# 0      CMiso1.1chr04g0000001                               NaN
# 1      CMiso1.1chr04g0000011  GO:0003676;GO:0046983;GO:0015074
# 2      CMiso1.1chr04g0000021                               NaN
# 3      CMiso1.1chr04g0000031                               NaN
# 4      CMiso1.1chr04g0000041  GO:0016301;GO:0016773;GO:0016310

df = pd.DataFrame({'CMiso_genes': ['CMiso1.1chr04g0000001', 'CMiso1.1chr04g0000011', 'CMiso1.1chr04g0000021', 'CMiso1.1chr04g0000031', 'CMiso1.1chr04g0000041'], 'GO': [np.nan, 'GO:0003676;GO:0046983;GO:0015074', np.nan, np.nan, 'GO:0016301;GO:0016773;GO:0016310']})
print(df)

print(df['GO'].str.split(';'))

# if you want to preserve the original index
print(df.assign(GO=df['GO'].str.split(';')).explode('GO', ignore_index=False))

# if you want to ignore index
print(df.assign(GO=df['GO'].str.split(';')).explode('GO', ignore_index=True))

And here the output:

python explode.py 

# your data  
            CMiso_genes                                GO
0  CMiso1.1chr04g0000001                               NaN
1  CMiso1.1chr04g0000011  GO:0003676;GO:0046983;GO:0015074
2  CMiso1.1chr04g0000021                               NaN
3  CMiso1.1chr04g0000031                               NaN
4  CMiso1.1chr04g0000041  GO:0016301;GO:0016773;GO:0016310
0                                     NaN
1    [GO:0003676, GO:0046983, GO:0015074]
2                                     NaN
3                                     NaN
4    [GO:0016301, GO:0016773, GO:0016310]
Name: GO, dtype: object

# explode preserving the original index  
             CMiso_genes          GO
0  CMiso1.1chr04g0000001         NaN
1  CMiso1.1chr04g0000011  GO:0003676
1  CMiso1.1chr04g0000011  GO:0046983
1  CMiso1.1chr04g0000011  GO:0015074
2  CMiso1.1chr04g0000021         NaN
3  CMiso1.1chr04g0000031         NaN
4  CMiso1.1chr04g0000041  GO:0016301
4  CMiso1.1chr04g0000041  GO:0016773
4  CMiso1.1chr04g0000041  GO:0016310

# explode ignoring the original index  
             CMiso_genes          GO
0  CMiso1.1chr04g0000001         NaN
1  CMiso1.1chr04g0000011  GO:0003676
2  CMiso1.1chr04g0000011  GO:0046983
3  CMiso1.1chr04g0000011  GO:0015074
4  CMiso1.1chr04g0000021         NaN
5  CMiso1.1chr04g0000031         NaN
6  CMiso1.1chr04g0000041  GO:0016301
7  CMiso1.1chr04g0000041  GO:0016773
8  CMiso1.1chr04g0000041  GO:0016310
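
As a possible next step towards the separate BP, CC and MF tables, here is a minimal sketch. I do not know what your pivot file looks like, so the lookup table go_namespace, its column name namespace and the labels in it are hypothetical placeholders that you would replace with your own GO-to-namespace mapping. Genes without GO terms are dropped before exploding, which may or may not be what you want:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CMiso_genes": ["CMiso1.1chr04g0000001",
                    "CMiso1.1chr04g0000011",
                    "CMiso1.1chr04g0000041"],
    "GO": [np.nan,
           "GO:0003676;GO:0046983;GO:0015074",
           "GO:0016301;GO:0016773;GO:0016310"],
})

# HYPOTHETICAL lookup table: the namespace labels are placeholders for
# illustration only, replace them with a real GO-to-namespace mapping.
go_namespace = pd.DataFrame({
    "GO": ["GO:0003676", "GO:0046983", "GO:0015074",
           "GO:0016301", "GO:0016773", "GO:0016310"],
    "namespace": ["MF", "MF", "BP", "MF", "MF", "BP"],
})

# Drop genes without annotation, then get one row per (gene, GO term).
exploded = (
    df.dropna(subset=["GO"])
      .assign(GO=lambda d: d["GO"].str.split(";"))
      .explode("GO", ignore_index=True)
)

# Attach the namespace to every GO term; unknown terms would get NaN.
annotated = exploded.merge(go_namespace, on="GO", how="left")

# One table per namespace; a namespace without any term gives an empty table.
tables = {
    ns: annotated.loc[annotated["namespace"] == ns, ["CMiso_genes", "GO"]]
                 .reset_index(drop=True)
    for ns in ("BP", "CC", "MF")
}

for ns, table in tables.items():
    print(ns)
    print(table)
```

Each entry of tables can then be written out with to_csv() if you need one file per namespace.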

Hope it is helpful

Jos

3.6 years ago
clementpch • 0

Hi, thanks Jos, it was helpful. I succeeded in writing my function and it does what I want.
