Question: compare 2 dataframes with pandas
Chvatil wrote, 2.5 years ago:

This is the first time I have used pandas and I do not really know how to approach my problem.

I have two dataframes:

import pandas 

blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")

Here is an example of their contents:

 >>> cluster
       cluster_name            seq_names
0                 1           g1.t1_0035
1                 1      g1.t1_0035_0042
2            119365           g1.t1_0042
3             90273      g1.t1_0042_0035
4             71567          g10.t1_0035
5             37976     g10.t1_0035_0042
6             22560          g10.t1_0042
7             90280     g10.t1_0042_0035
8             82698         g100.t1_0035
9             47392    g100.t1_0035_0042
10            28484         g100.t1_0042
11            22580    g100.t1_0042_0035
12            19474        g1000.t1_0035
13             5770   g1000.t1_0035_0042
14            29708        g1000.t1_0042
15            99776   g1000.t1_0042_0035
16             6283       g10000.t1_0035
17            39828  g10000.t1_0035_0042
18            25383       g10000.t1_0042
19           106614  g10000.t1_0042_0035
20             6285       g10001.t1_0035
21            13866  g10001.t1_0035_0042
22           121157       g10001.t1_0042
23           106615  g10001.t1_0042_0035
24             6286       g10002.t1_0035
25              113  g10002.t1_0035_0042
26            25397       g10002.t1_0042
27           106616  g10002.t1_0042_0035
28             4643       g10003.t1_0035
29            13868  g10003.t1_0035_0042
...             ...                  ...

[78793 rows x 2 columns]

and

>>> blast
                      qseqid               sseqid  pident  length  mismatch  \
0            g1.t1_0035_0042      g1.t1_0035_0042   100.0     286         0   
1            g1.t1_0035_0042           g1.t1_0035   100.0     257         0   
2            g1.t1_0035_0042        g9307.t1_0035    26.9     134        65   
3            g2.t1_0035_0042      g2.t1_0035_0042   100.0     445         0   
4            g2.t1_0035_0042           g2.t1_0035    95.8     451         3   
5            g2.t1_0035_0042  g24520.t1_0042_0035    61.1     429       137   
6            g2.t1_0035_0042        g9924.t1_0042    61.1     429       137   
7            g2.t1_0035_0042        g1838.t1_0035    86.2      29         4   
8            g3.t1_0035_0042      g3.t1_0035_0042   100.0     719         0   
9            g3.t1_0035_0042           g3.t1_0035    84.7     753        62   
10           g4.t1_0035_0042      g4.t1_0035_0042   100.0     242         0   
11           g4.t1_0035_0042           g3.t1_0035    98.8     161         2   
12           g5.t1_0035_0042      g5.t1_0035_0042   100.0     291         0   
13           g5.t1_0035_0042           g3.t1_0035    93.1     291         0   
14           g6.t1_0035_0042      g6.t1_0035_0042   100.0     152         0   
15           g6.t1_0035_0042           g4.t1_0035   100.0     152         0   
16           g7.t1_0035_0042      g7.t1_0035_0042   100.0     216         0   
17           g7.t1_0035_0042           g5.t1_0035    98.1     160         3   
18           g7.t1_0035_0042       g11143.t1_0042    46.5     230        99   
19           g7.t1_0035_0042  g27537.t1_0042_0035    40.8     233       111   
20        g3778.t1_0035_0042   g3778.t1_0035_0042   100.0      86         0   
21        g3778.t1_0035_0042        g6174.t1_0035    98.0      51         1   
22        g3778.t1_0035_0042  g20037.t1_0035_0042   100.0      50         0   
23        g3778.t1_0035_0042       g37190.t1_0035   100.0      50         0   
24        g3778.t1_0035_0042  g15112.t1_0042_0035    66.0      53        18   
25        g3778.t1_0035_0042        g6061.t1_0042    66.0      53        18   
26       g18109.t1_0035_0042  g18109.t1_0035_0042   100.0      86         0   
27       g18109.t1_0035_0042       g33071.t1_0035   100.0      81         0   
28       g18109.t1_0035_0042       g32810.t1_0035    96.4      83         3   
29       g18109.t1_0035_0042  g17982.t1_0035_0042    98.6      72         1   
...                      ...                  ...     ...     ...       ...

If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and each cluster contains several sequence IDs. What I need to do first is split all my clusters (in R it would be: liste=split(x = data$V2, f = data$V1) ).
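For reference, the pandas equivalent of R's split(x = data$V2, f = data$V1) is a groupby; a minimal sketch with toy data standing in for the cluster dataframe:

```python
import pandas as pd

# Toy stand-in for the cluster dataframe described above
cluster = pd.DataFrame({
    "cluster_name": [1, 1, 2, 2, 2],
    "seq_names": ["seq1", "seq2", "seq5", "seq6", "seq7"],
})

# Equivalent of R's split(data$V2, data$V1): one list of sequence IDs per cluster
liste = cluster.groupby("cluster_name")["seq_names"].apply(list)
print(liste.loc[1])  # ['seq1', 'seq2']
print(liste.loc[2])  # ['seq5', 'seq6', 'seq7']
```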

Then I need to create a function which displays the most similar pair of sequences within each cluster. Here is an example:

Let's say I have two clusters (dataframe cluster):

cluster 1: 
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7

...

In the blast dataframe, the third column holds the similarity between all sequences (all against all), so something like:

seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq6 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100

and what I need to get is:

cluster 1 (best paired sequences):
seq1 vs seq2

cluster 2 (best paired sequences):
seq5 vs seq6

...

So as you can see, I do not want to take into account the sequences paired with themselves.

If someone could give me some help, it would be fantastic.

Thank you all.

data.frame pandas python • 9.5k views
modified 2.5 years ago by Bastien Hervé • written 2.5 years ago by Chvatil
Your import is incorrect: you lost the first data row to the header.

written 2.5 years ago by cpad0112
First things first, you need to set column names and indexes for both the blast and cluster dataframes, because in your example the first data row is being used as the header.
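A minimal sketch of the fix this comment describes, with inline toy data (the real files are tab-separated on disk):

```python
import io
import pandas as pd

# A file whose first line is data, not a header
raw = io.StringIO("1\tg1.t1_0035\n1\tg1.t1_0035_0042\n")

# header=None keeps the first row as data; names= supplies the column names
cluster = pd.read_table(raw, header=None, names=["cluster_name", "seq_names"])
print(cluster.columns.tolist())  # ['cluster_name', 'seq_names']
print(len(cluster))              # 2
```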

written 2.5 years ago by Bastien Hervé

OK, I changed it, thank you for the comment.

written 2.5 years ago by Chvatil

I think that if you join your dataframes (linking seq_names from cluster with qseqid from blast) you will be good. You will have a huge dataframe with all the information you need; then filter it at your convenience.
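A minimal sketch of such a key-based merge, with toy data standing in for the two files; merging on the key columns (rather than an index-based join) is what keeps the rows aligned:

```python
import pandas as pd

# Toy stand-ins for the cluster and blast dataframes
cluster = pd.DataFrame({"cluster_name": [1, 1],
                        "seq_names": ["g1.t1_0035", "g1.t1_0035_0042"]})
blast = pd.DataFrame({"qseqid": ["g1.t1_0035_0042", "g1.t1_0035_0042"],
                      "sseqid": ["g1.t1_0035_0042", "g1.t1_0035"],
                      "pident": [100.0, 100.0]})

# Merge on the key columns: seq_names (cluster) against qseqid (blast)
df = cluster.merge(blast, left_on="seq_names", right_on="qseqid")
print(df)
```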

written 2.5 years ago by Bastien Hervé

OK thanks, here is the result. Do you have an idea for a function to get only the best-paired sequences within each cluster?

python3 pandas_cluster.py 
           cluster_name            seq_names               qseqid  \
    0                 1           g1.t1_0035      g1.t1_0035_0042   
    1                 1      g1.t1_0035_0042      g1.t1_0035_0042   
    2            119365           g1.t1_0042      g1.t1_0035_0042   
    3             90273      g1.t1_0042_0035      g2.t1_0035_0042   
    4             71567          g10.t1_0035      g2.t1_0035_0042   
    5             37976     g10.t1_0035_0042      g2.t1_0035_0042   
    6             22560          g10.t1_0042      g2.t1_0035_0042   
    7             90280     g10.t1_0042_0035      g2.t1_0035_0042   
    8             82698         g100.t1_0035      g3.t1_0035_0042   
    9             47392    g100.t1_0035_0042      g3.t1_0035_0042   
    10            28484         g100.t1_0042      g4.t1_0035_0042   
    11            22580    g100.t1_0042_0035      g4.t1_0035_0042   
    12            19474        g1000.t1_0035      g5.t1_0035_0042   
    13             5770   g1000.t1_0035_0042      g5.t1_0035_0042   
    14            29708        g1000.t1_0042      g6.t1_0035_0042   
    15            99776   g1000.t1_0042_0035      g6.t1_0035_0042   
    16             6283       g10000.t1_0035      g7.t1_0035_0042   
    17            39828  g10000.t1_0035_0042      g7.t1_0035_0042   
    18            25383       g10000.t1_0042      g7.t1_0035_0042   
    19           106614  g10000.t1_0042_0035      g7.t1_0035_0042   
    20             6285       g10001.t1_0035   g3778.t1_0035_0042   
    21            13866  g10001.t1_0035_0042   g3778.t1_0035_0042   
    22           121157       g10001.t1_0042   g3778.t1_0035_0042   
    23           106615  g10001.t1_0042_0035   g3778.t1_0035_0042   
    24             6286       g10002.t1_0035   g3778.t1_0035_0042   
    25              113  g10002.t1_0035_0042   g3778.t1_0035_0042   
    26            25397       g10002.t1_0042  g18109.t1_0035_0042   
    27           106616  g10002.t1_0042_0035  g18109.t1_0035_0042   
    28             4643       g10003.t1_0035  g18109.t1_0035_0042   
    29            13868  g10003.t1_0035_0042  g18109.t1_0035_0042   
    ...             ...                  ...                  ...

                        sseqid  pident  
    0          g1.t1_0035_0042   100.0  
    1               g1.t1_0035   100.0  
    2            g9307.t1_0035    26.9  
    3          g2.t1_0035_0042   100.0  
    4               g2.t1_0035    95.8  
    5      g24520.t1_0042_0035    61.1  
    6            g9924.t1_0042    61.1  
    7            g1838.t1_0035    86.2  
    8          g3.t1_0035_0042   100.0  
    9               g3.t1_0035    84.7  
    10         g4.t1_0035_0042   100.0  
    11              g3.t1_0035    98.8  
    12         g5.t1_0035_0042   100.0  
    13              g3.t1_0035    93.1  
    14         g6.t1_0035_0042   100.0  
    15              g4.t1_0035   100.0  
    16         g7.t1_0035_0042   100.0  
    17              g5.t1_0035    98.1  
    18          g11143.t1_0042    46.5  
    19     g27537.t1_0042_0035    40.8  
    20      g3778.t1_0035_0042   100.0  
    21           g6174.t1_0035    98.0  
    22     g20037.t1_0035_0042   100.0  
    23          g37190.t1_0035   100.0  
    24     g15112.t1_0042_0035    66.0  
    25           g6061.t1_0042    66.0  
    26     g18109.t1_0035_0042   100.0  
    27          g33071.t1_0035   100.0  
    28          g32810.t1_0035    96.4  
    29     g17982.t1_0035_0042    98.6  
    ...                    ...     ...
modified 2.5 years ago • written 2.5 years ago by Chvatil

Group by cluster and print the max value in each group.

written 2.5 years ago by cpad0112

I did:

grouped = df.groupby("cluster_name")

and now I do not know how to print the max values of the grouped dataframe, but only for the pident column.

print(grouped.max())

Do you have an idea?

modified 2.5 years ago • written 2.5 years ago by Chvatil

something like this: df.groupby(['cluster_name'], sort=False)['pident'].max()

modified 2.5 years ago • written 2.5 years ago by cpad0112

OK, thanks, but it actually gives me:

cluster_name
1         100.0
119365     26.9
90273     100.0
71567      95.8
37976      61.1
22560      83.6
90280      86.2
82698     100.0
47392      84.7
28484     100.0
22580      98.8
19474     100.0
5770       99.0
29708     100.0
99776     100.0
6283      100.0
39828      98.1
25383      95.3
106614     40.8
6285      100.0
13866      98.0
121157    100.0
106615    100.0
6286       66.0
113       100.0
25397     100.0
106616    100.0
4643      100.0
13868     100.0
25398     100.0
          ...

But I actually do not know which pairs have the best similarity score (it does not display their seq IDs), and I have to remove, or at least not take into account, the sequences paired with themselves. Here they always score 100, but I want the pairs of different sequences that display the best similarity score.

modified 2.5 years ago • written 2.5 years ago by Chvatil

See if following works:

idx=df.groupby(["cluster_name"])['pident'].transform(max)==df['pident']
df[idx]

if you don't have a tie within a cluster, then you can try this as well:

df.sort_values('pident', ascending=False).drop_duplicates(['cluster_name'])
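Both one-liners above, demonstrated on toy data (column names assumed from the merged dataframe shown earlier in the thread):

```python
import pandas as pd

# Toy merged dataframe: cluster, query, subject, percent identity
df = pd.DataFrame({
    "cluster_name": [1, 1, 2, 2],
    "qseqid": ["seq1", "seq1", "seq5", "seq5"],
    "sseqid": ["seq2", "seq3", "seq6", "seq7"],
    "pident": [90.0, 56.0, 89.0, 60.0],
})

# Keep every row that ties for the per-cluster maximum pident
idx = df.groupby("cluster_name")["pident"].transform("max") == df["pident"]
print(df[idx])

# Or, if ties within a cluster don't matter, keep one top row per cluster
best = df.sort_values("pident", ascending=False).drop_duplicates("cluster_name")
print(best)
```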
modified 2.5 years ago • written 2.5 years ago by cpad0112

I personally am not a fan of the groupby method; it's powerful and cleaner, but I don't like using it. You can try this instead, after merging your dataframes into df:

for i in df.cluster_name.unique():
    temp = df[(df.cluster_name == i)]
    best = temp[(temp.pident == temp['pident'].max())]
    print("Cluster : "+str(i))
    for index, row in best.iterrows():
        print(row['seq_names']+" vs "+row['sseqid'])
    print("\n")
written 2.5 years ago by Bastien Hervé

In fact I think something is going wrong when I merge my two dataframes. When typing:

blast=pd.read_table("matches.m8",header=None)
blast.columns = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen","qstart", "qend", "sstart", "send", "evalue", "bitscore"]
cluster=pd.read_table("seq.fnodes2",header=None)
cluster.columns = ["cluster_name", "seq_names"]
df=cluster.join(blast, lsuffix='seq_names', rsuffix='qseqid')
df=df.drop(columns=["length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
print(df)

I'm actually getting, for the first cluster for example:

   cluster_name        seq_names           qseqid           sseqid  pident
0             1       g1.t1_0035  g1.t1_0035_0042  g1.t1_0035_0042   100.0
1             1  g1.t1_0035_0042  g1.t1_0035_0042       g1.t1_0035   100.0

But as you can see, cluster 1 is formed by these 2 sequences:

1    g1.t1_0035
1    g1.t1_0035_0042

And the blast file is :

g1.t1_0035_0042    g1.t1_0035_0042    100.0
g1.t1_0035_0042    g1.t1_0035    100.0

So, what I want is to display the best score between two different sequences; here only g1.t1_0035_0042 vs g1.t1_0035.

The problem is that the merged dataframe should not look like this:

   cluster_name        seq_names           qseqid           sseqid  pident
0             1       g1.t1_0035  g1.t1_0035_0042  g1.t1_0035_0042   100.0
1             1  g1.t1_0035_0042  g1.t1_0035_0042       g1.t1_0035   100.0

but like this:

   cluster_name        seq_names           qseqid           sseqid  pident
0             1       g1.t1_0035  g1.t1_0035_0042       g1.t1_0035   100.0
1             1  g1.t1_0035_0042  g1.t1_0035_0042  g1.t1_0035_0042   100.0

and the same here:

for this cluster:

119365    g1.t1_0042

This is the only one that exists, so this sequence is alone in its cluster, but your code displays:

Cluster : 119365
g1.t1_0035_0042 vs g9307.t1_0035

but I should actually get nothing, since this sequence is alone.

And indeed something is wrong with my merging again, because it gives me this:

   cluster_name   seq_names           qseqid         sseqid  pident
2        119365  g1.t1_0042  g1.t1_0035_0042  g9307.t1_0035    26.9

Which is wrong...

modified 2.5 years ago • written 2.5 years ago by Chvatil
Bastien Hervé (Karolinska Institutet, Sweden) wrote, 2.5 years ago:

Here is how I do it:

import pandas as pd

blast=pd.read_table("blast.csv", sep="\t", header=None, names=["Ref", "Query", "identity"])
cluster=pd.read_table("cluster.csv", sep="\t", header=None, names=["Cluster", "Ref"])

huge_df = pd.merge(cluster, blast, on='Ref', how='outer')

### Remove NAs: e.g. a seq without a cluster, or a cluster without sequences, etc.
huge_df = huge_df.dropna()

for index, line in huge_df.iterrows():
    if line["Ref"].split("_")[1] == line["Query"].split("_")[1]:
        huge_df.drop(index, inplace=True)

for i in huge_df.Cluster.unique():
    c = huge_df[(huge_df.Cluster == i)]
    best = c[(c.identity == c['identity'].max())]
    print("Cluster : "+str(i))
    for index, row in best.iterrows():
        print(row['Ref']+" vs "+row['Query'])
    print("\n")
modified 2.5 years ago • written 2.5 years ago by Bastien Hervé

Hi, it is a good idea, but it actually does not avoid sequences paired with the same ID.

For example I get:

Cluster : 25720.0
g10322.t1_0042 vs g10322.t1_0042  (they are the same species)
g25505.t1_0042_0035 vs g25505.t1_0042_0035

Cluster : 14187.0
g10321.t1_0035_0042 vs g10321.t1_0035_0042 (they are the same species)
g10321.t1_0035_0042 vs g17887.t1_0035
g1675.t1_0035 vs g1675.t1_0035
g17887.t1_0035 vs g10321.t1_0035_0042
g17887.t1_0035 vs g17887.t1_0035

And so on. I would need to display only the pair with the best identity within each cluster, and this pair has to be made of 2 different species.

So two sequences are "the same species" when they have the same ID, and (new thing) also when they begin with the same pattern after the first "_". For example: g1.t1_0035_0042 vs g1.t1_0035, or g23602.t1_0035_0042 vs g44900.t1_0035. The first number after the first "_" corresponds to the species number, so when the two numbers are equal, we are comparing the same species. Do you think it is possible to also remove those pairs with the same number, i.e. the number after the first "_"?
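The rule just described fits in one line of Python; a tiny sketch (species_of is a hypothetical helper name, not from the thread):

```python
# Species number = the token right after the first "_"
def species_of(seq_id):
    return seq_id.split("_")[1]

# Same species: both IDs carry 0035 right after the first "_"
assert species_of("g1.t1_0035_0042") == species_of("g1.t1_0035") == "0035"
# Different species: 0035 vs 0042
assert species_of("g2.t1_0035_0042") != species_of("g9924.t1_0042")
```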

I actually have some code that does it, but only for sequences with the same ID, and I do not really know how to add this part (also remove pairs where the first number is equal):

import pandas as pd

blast=pd.read_table("matches.m8",header=None)
blast.columns = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen","qstart", "qend", "sstart", "send", "evalue", "bitscore"]

cluster=pd.read_table("seq.fnodes2",header=None)
cluster.columns = ["cluster_name", "seq_names"]


data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
data=data.drop(columns=["length"  ,"mismatch" , "gapopen" ,"qstart"  ,"qend","sstart", "send","evalue" , "bitscore"])
data = data[data['qseqid']!=data["sseqid"]]
data2=data.groupby('cluster_name').max()


for index, row in data2.iterrows():
        print(row['qseqid']+" vs "+row['sseqid'])
print("\n")
written 2.5 years ago by Chvatil

Do you want to filter out g1.t1_0035_0042 vs g2.t2_0035_0087, for example? Only based on the 0035 pattern.

modified 2.5 years ago • written 2.5 years ago by Bastien Hervé

And what about something like this? g2.t1_0035_0042 vs g9924.t1_0042. Because here 0042 isn't the first number.

written 2.5 years ago by Bastien Hervé

Yes, only the first number matters, and by the way, 0087 does not normally exist :)

written 2.5 years ago by Chvatil

Yes I just added it for testing. I updated my answer above :)

written 2.5 years ago by Bastien Hervé

I used this code:

import pandas as pd

blast=pd.read_table("matches.m8",header=None)
blast.columns = ["Ref", "Query", "identity", "length", "mismatch", "gapopen","qstart", "qend", "sstart", "send", "evalue", "bitscore"]

cluster=pd.read_table("seq.fnodes2",header=None)
cluster.columns = ["Cluster", "Ref"]

huge_df = pd.merge(cluster, blast, on='Ref', how='outer')

huge_df =huge_df.drop(columns=["length"  ,"mismatch" , "gapopen" ,"qstart"  ,"qend","sstart", "send","evalue" , "bitscore"])

###Remove Na, if a seq doesn't have a cluster or a cluster without sequences etc...
huge_df = huge_df.dropna()

for index, line in huge_df.iterrows():
    if line["Ref"].split("_")[1] == line["Query"].split("_")[1]:
        huge_df.drop(index, inplace=True)

for i in huge_df.Cluster.unique():
    c = huge_df[(huge_df.Cluster == i)]
    best = c[(c.identity == c['identity'].max())]
    print("Cluster : "+str(i))
    for index, row in best.iterrows():
        print(row['Ref']+" vs "+row['Query'])
    print("\n")

But it actually does nothing; I mean, I run the script and it seems to be stuck. Do you know if it will take a long time?

modified 2.5 years ago • written 2.5 years ago by Chvatil

OK, by adding some prints I know it is actually running, but it is very, very slow at the part where we remove sequences by their first ID number. How could I output the new dataframe huge_df after this step, so I don't have to redo it each time?

written 2.5 years ago by Chvatil

I don't really understand what you are trying to achieve here. If you want to output your df after the drop step:

for index, line in huge_df.iterrows():
    if line["Ref"].split("_")[1] == line["Query"].split("_")[1]:
        huge_df.drop(index, inplace=True)
print(huge_df)
###Add sys.exit to kill the job here (you need to import sys)
sys.exit()
for i in huge_df.Cluster.unique():
    c = huge_df[(huge_df.Cluster == i)]
    best = c[(c.identity == c['identity'].max())]
    print("Cluster : "+str(i))
    for index, row in best.iterrows():
        print(row['Ref']+" vs "+row['Query'])
    print("\n")

If you want to know how much time the job will need to finish the drop process: print the total length of your df before the first loop, add a counter inside the first loop (before the if statement), and print that counter. Then look at how many iterations you get in 1 minute and how many are left, and do simple math to estimate how long you'll have to wait (if you have to count it in days, you'll have to find another way to do it).
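As an aside, the slow part is dropping rows one at a time inside iterrows; the same same-species filter can be done as a single vectorized boolean mask, which is usually far faster. A minimal sketch with toy data and the column names used above:

```python
import pandas as pd

# Toy stand-in for the merged dataframe
huge_df = pd.DataFrame({
    "Ref":   ["g1.t1_0035_0042", "g2.t1_0035_0042", "g7.t1_0035_0042"],
    "Query": ["g1.t1_0035",      "g9924.t1_0042",   "g11143.t1_0042"],
    "identity": [100.0, 61.1, 46.5],
})

# Species number = token right after the first "_"; compare them column-wise
same_species = (huge_df["Ref"].str.split("_").str[1]
                == huge_df["Query"].str.split("_").str[1])

# One vectorized filter instead of a per-row drop
huge_df = huge_df[~same_species]
print(len(huge_df))  # 2
```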

written 2.5 years ago by Bastien Hervé

OK, it will take around 1 hour I guess. I mean: get the dataframe into an output file, from which I can go directly to step 2:

for i in huge_df.Cluster.unique():
    c = huge_df[(huge_df.Cluster == i)]
    best = c[(c.identity == c['identity'].max())]
    print("Cluster : "+str(i))
    for index, row in best.iterrows():
        print(row['Ref']+" vs "+row['Query'])
    print("\n")

Will something like print(huge_df, file=my_file.out) do it? Or maybe huge_df.to_csv(output_features_file2, sep='\t')? Because I do not actually want to kill the job, but just get an output of that process which took 1 hour, in case something was wrong.

modified 2.5 years ago • written 2.5 years ago by Chvatil

If you want to save a df to csv, it's something like this in python:

df.to_csv(file_name, sep='\t')

Do you want to write:

print(row['Ref']+" vs "+row['Query'])

or the best dataframe to a file?

modified 2.5 years ago • written 2.5 years ago by Bastien Hervé

It works now, thank you for your help, it was really useful.

written 2.5 years ago by Chvatil
Powered by Biostar version 2.3.0