Question

Closed:Pandas and dataframe (filtering)

6.0 years ago

Chvatil ▴ 130

I have a problem: I need to filter the following dataframe:

        cluster_name  qseqid           sseqid           pident_x  qstart  qend  sstar  send
    2   1             seq1_0035_0035   seq13_0042_0035  0.73      42      133   46     189
    3   1             seq1_0035_0035   seq13_0042_0035  0.73      146     283   287    389
    4   1             seq1_0035_0035   seq13_0042_0035  0.73      301     478   402    503
    5   1             seq13_0042_0035  seq1_0035_0035   0.73      46      189   42     133
    6   1             seq13_0042_0035  seq1_0035_0035   0.73      287     389   146    283
    7   1             seq13_0042_0035  seq1_0035_0035   0.73      402     503   301    478
    8   2             seq4_0042_0035   seq2_0035_0035   0.71      256     789   125    678
    9   2             seq4_0042_0035   seq2_0035_0035   0.71      802     1056  706    985
    10  2             seq4_0042_0035   seq7_0035_0042   0.83      123     745   156    723
    12  4             seq11_0035_0035  seq14_0042_0035  0.89      145     647   236    921
    13  4             seq11_0035_0035  seq17_0042_0042  0.97      148     623   241    1002
    14  5             seq17_0035_0042  seq17_0042_0042  0.94      188     643   179    746

Within each cluster, I want to keep only the rows with the maximum pident_x. The complication is that some hits appear twice, with the query and subject swapped: rows 2, 3, 4 and rows 5, 6, 7 describe the same pair of sequences in reversed order. In that case I need to keep only one orientation, for example only rows 2, 3 and 4.

The expected output would then be:

        cluster_name  qseqid           sseqid           pident_x  qstart  qend  sstar  send
    2   1             seq1_0035_0035   seq13_0042_0035  0.73      42      133   46     189
    3   1             seq1_0035_0035   seq13_0042_0035  0.73      146     283   287    389
    4   1             seq1_0035_0035   seq13_0042_0035  0.73      301     478   402    503
    10  2             seq4_0042_0035   seq7_0035_0042   0.83      123     745   156    723
    13  4             seq11_0035_0035  seq17_0042_0042  0.97      148     623   241    1002
    14  5             seq17_0035_0042  seq17_0042_0042  0.94      188     643   179    746

To explain:

For cluster 1: seq1_0035_0035 vs seq13_0042_0035 also appears in reversed form (seq13_0042_0035 vs seq1_0035_0035), but I keep only the first orientation.

For cluster 2: seq4_0042_0035 vs seq7_0035_0042 (0.83) has a better pident score than seq4_0042_0035 vs seq2_0035_0035 (0.71).

For cluster 4: seq11_0035_0035 vs seq17_0042_0042 (0.97) has a better pident score than seq11_0035_0035 vs seq14_0042_0035 (0.89).

For cluster 5: there is only one pair, seq17_0035_0042 vs seq17_0042_0042 (0.94), so I keep it.

I don't really know how to do this. Does anyone have an idea?
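One possible way to express these rules in pandas is sketched below. It is only a rough sketch under a few assumptions: the function name `filter_best_pairs` and the helper column `pair` are made up for illustration, "first orientation" is taken to mean the orientation that appears first in row order, and if two different pairs tie for the maximum pident_x in a cluster, both are kept.

```python
import pandas as pd

# Rebuild the example dataframe from the question (index values as shown).
cols = ["cluster_name", "qseqid", "sseqid", "pident_x",
        "qstart", "qend", "sstar", "send"]
rows = {
    2: (1, "seq1_0035_0035", "seq13_0042_0035", 0.73, 42, 133, 46, 189),
    3: (1, "seq1_0035_0035", "seq13_0042_0035", 0.73, 146, 283, 287, 389),
    4: (1, "seq1_0035_0035", "seq13_0042_0035", 0.73, 301, 478, 402, 503),
    5: (1, "seq13_0042_0035", "seq1_0035_0035", 0.73, 46, 189, 42, 133),
    6: (1, "seq13_0042_0035", "seq1_0035_0035", 0.73, 287, 389, 146, 283),
    7: (1, "seq13_0042_0035", "seq1_0035_0035", 0.73, 402, 503, 301, 478),
    8: (2, "seq4_0042_0035", "seq2_0035_0035", 0.71, 256, 789, 125, 678),
    9: (2, "seq4_0042_0035", "seq2_0035_0035", 0.71, 802, 1056, 706, 985),
    10: (2, "seq4_0042_0035", "seq7_0035_0042", 0.83, 123, 745, 156, 723),
    12: (4, "seq11_0035_0035", "seq14_0042_0035", 0.89, 145, 647, 236, 921),
    13: (4, "seq11_0035_0035", "seq17_0042_0042", 0.97, 148, 623, 241, 1002),
    14: (5, "seq17_0035_0042", "seq17_0042_0042", 0.94, 188, 643, 179, 746),
}
df = pd.DataFrame.from_dict(rows, orient="index", columns=cols)


def filter_best_pairs(df):
    """Hypothetical helper: keep the best-scoring pair per cluster,
    in one orientation only."""
    # Step 1: keep only rows whose pident_x equals the maximum of their cluster.
    cluster_max = df.groupby("cluster_name")["pident_x"].transform("max")
    best = df[df["pident_x"] == cluster_max].copy()

    # Step 2: build an order-independent key, so that (A, B) and (B, A)
    # fall into the same group. "pair" is a helper column, not original data.
    best["pair"] = [
        "|".join(sorted(p)) for p in zip(best["qseqid"], best["sseqid"])
    ]

    # Step 3: within each (cluster, pair) group, find the qseqid of the first
    # row and keep only rows with that same qseqid, i.e. the first orientation.
    first_q = best.groupby(["cluster_name", "pair"])["qseqid"].transform("first")
    return best[best["qseqid"] == first_q].drop(columns="pair")


result = filter_best_pairs(df)
print(result)
```

On the example data this keeps rows 2, 3, 4, 10, 13 and 14, which matches the expected output above.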

pandas filtering python • 145 views

updated 6.0 years ago by shoujun.gu ▴ 380 • written 6.0 years ago by Chvatil ▴ 130