Question: Parsing df with pandas and some conditions
0
gravatar for benjamin
18 months ago by
benjamin0
benjamin0 wrote:

Hi all, I need some help with python and pandas.

I actually have a dataframe with in the column seq1_id al the seq_id of sequences of the species 1 and the column 2 for the sequences of the sp2.

I actually passed a filter on those sequences and got two dataframes (one with all sequences of sp 1 passed through the filter) and (one with all sequences of sp2 passed through the filter).

Then I have 3 dataframes.

Because in a pairs, one seq can passe the filter while the other does not, it is important to keep only paired genes wich are keeping on the two previous filtering, so what I need to do is acutally to parse my first df such this one:

Seq1.id      Seq2.id
seq1_01     seq5_02
seq2_01     Seq6_02
seq3_01     Seq7_02
seq4_01     Seq8_02

and check row by row if (ex the first row) seq1_01 is present in the df2 and if seq8_02 is also present in the df3, then keep this row in the df1 and add it in a new df4.

Here is an exemple with output wanted:

first df: 

 Seq1.id     Seq2.id
seq1_01     seq5_02
seq2_01     Seq6_02
seq3_01     Seq7_02
seq4_01     Seq8_02

df2 (sp1) (seq3_01 is absent)

    Seq_1.id   
    seq1_01     
    seq2_01        
    seq4_01 


df3 (sp2) (Seq8_02 is absent)

   Seq_2.id
   seq5_02
   Seq6_02
   Seq7_02

Then because Seq8_02 and seq3_01 are not present, the df4 (output) would be:

    Seq1.id   Seq2.id
    seq1_01     seq5_02
    seq2_01    Seq6_02

Hi tried:

HGT_candidats_0035=candidates_0035
HGT_candidats_0042=candidates_0042

#convert gene names into a list
gene_name_0035=[]
for i in HGT_candidats_0035["gene"]:
    gene_name_0035.append(i)

gene_name_0042=[]
for i in HGT_candidats_0042["gene"]:
    gene_name_0042.append(i)

#Keep only paired sequences
seq1_id=[]
for i in dN_dS["seq1_id"]:
    seq1_id.append(i)

seq2_id=[]
for i in dN_dS["seq2_id"]:
    seq2_id.append(i)

newdf = pd.DataFrame(columns=("seq1_id","seq2_id"))

for a, b in zip(seq1_id,seq2_id):
    if a in gene_name_0035 and b in gene_name_0042:
        newdf=newdf.append({"seq1_id":a,"seq2_id":b}, ignore_index=True)

But I think it is too long

Here is you code: with my data

candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')

df4 = pd.DataFrame(columns=dN_dS.columns)
print(df4)
for index, row in dN_dS.iterrows():
    if row['seq1_id'] in candidates_0042['gene'] and row['seq2_id'] in candidates_0035['gene']:
        df4 = df4.append(row, ignore_index=True)

df4.to_csv("new_df",sep='\t')

and here the empty output of df4:

Unnamed: 0  Unnamed: 0.1    seq1_id seq2_id dN  dS  Dist_third_pos  Dist_brute  Length_seq_1    Length_seq_2    GC_content_seq1 GC_content_seq2 GC  Mean_length

Here are the data: https://drive.google.com/file/d/1FR9MUk4x0NoM-r3F4oe6dt5HgDMaUlKv/view?usp=sharing https://drive.google.com/file/d/1MWRJwqRAA2B7eAXG1hcnIAqeQyjtx7pT/view?usp=sharing https://drive.google.com/file/d/10ZP-Awx_qevKoT-AfMjDpd8KKaUcsEog/view?usp=sharing

pandas python • 737 views
ADD COMMENTlink modified 18 months ago by cpad011212k • written 18 months ago by benjamin0

This is a short story about loop over the first dataframe then look in the 2 others.

Try something first and i'll help :)

ADD REPLYlink written 18 months ago by Bastien Hervé4.5k

I posted what I tried but I think it is too long and maybe I do some mistake as well

ADD REPLYlink modified 18 months ago • written 18 months ago by benjamin0

This is how you create your first df not how you try to resolve your issue.

You need to loop over your newdf now

Care at the french typos you made in your post (passe, filtre...). That could bother more than one person here.

ADD REPLYlink written 18 months ago by Bastien Hervé4.5k

Well if I do the for a, b in zip(seq1_id,seq2_id): part I ask for each row if we find this same seq_id in the df2 and df3, then add to a new df the row in the first df no? Yep sorry I corrected it

ADD REPLYlink written 18 months ago by benjamin0
1

Ok, I just got that dN_dS is your first df, candidates_0035 is your df2 and candidates_0042 is your df3

I didn't try but something like this (or very close) should work :

df4 = pd.DataFrame(columns=dN_dS.columns)
for index, row in dN_dS.iterrows():
    if row['Seq1.id'] in candidates_0035['Seq_1.id'] and row['Seq2.id'] in candidates_0042['Seq_2.id']:
        df4 = df4.append(row, ignore_index=True)
ADD REPLYlink written 18 months ago by Bastien Hervé4.5k

Thank you but I updated my first post with the code I used, The df4 outpfile is actually empty (only columns names). I also put my data if you want to see them.

ADD REPLYlink modified 18 months ago • written 18 months ago by benjamin0

Finnally I succed by passing the df2 and 3 into a list.

ADD REPLYlink written 18 months ago by benjamin0

Here you go (with your files) :

candidates_0035=pd.read_csv("candidates_genes_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')

df4 = pd.DataFrame(columns=dN_dS.columns)

for index, row in dN_dS.iterrows():
    if row['seq1_id'] in candidates_0035['gene'].tolist() and row['seq2_id'] in candidates_0042['gene'].tolist():
    df4 = df4.append(row, ignore_index=True)

df4.to_csv("new_df",sep='\t')
ADD REPLYlink written 18 months ago by Bastien Hervé4.5k
0
gravatar for cpad0112
18 months ago by
cpad011212k
India
cpad011212k wrote:
import pandas as pd
from pandasql import sqldf as psql

print(df1)
print(df2)
print(df3)

    Seq1.id  Seq2.id
0  seq1.01  seq5.02
1  seq2.01  Seq6.02
2  seq3.01  Seq7.02
3  seq4.01  Seq8.02
  Seq.1.id
0  seq1.01
1  seq2.01
2  seq4.01
  Seq.2.id
0  seq5.02
1  Seq6.02
2  Seq7.02

psql('select [Seq1.id],[Seq2.id] from df1 inner join df2 on df1.[Seq1.id]=df2.[Seq.1.id] inner join df3 on df1.[Seq2.id]=df3.[Seq.2.id]')

     Seq1.id    Seq2.id
0   seq1.01     seq5.02
1   seq2.01     Seq6.02

Following output needs a little bit of home work (with same output):

df1.merge(df2, left_on='Seq1.id', right_on='Seq.1.id', sort=True, how='inner').merge(df3,left_on='Seq2.id', right_on='Seq.2.id',sort=True, how="inner")

    Seq1.id     Seq2.id     Seq.1.id    Seq.2.id
0   seq2.01     Seq6.02     seq2.01     Seq6.02
1   seq1.01     seq5.02     seq1.01     seq5.02
ADD COMMENTlink modified 18 months ago • written 18 months ago by cpad011212k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1960 users visited in the last hour