Closed:Pandas and dataframe (filtering)
6.0 years ago
Chvatil ▴ 130

I have a problem. I need to filter the following dataframe:

    cluster_name  qseqid           sseqid           pident_x  qstart  qend  sstar  send
2   1             seq1_0035_0035   seq13_0042_0035  0.73      42      133   46     189
3   1             seq1_0035_0035   seq13_0042_0035  0.73      146     283   287    389
4   1             seq1_0035_0035   seq13_0042_0035  0.73      301     478   402    503
5   1             seq13_0042_0035  seq1_0035_0035   0.73      46      189   42     133
6   1             seq13_0042_0035  seq1_0035_0035   0.73      287     389   146    283
7   1             seq13_0042_0035  seq1_0035_0035   0.73      402     503   301    478
8   2             seq4_0042_0035   seq2_0035_0035   0.71      256     789   125    678
9   2             seq4_0042_0035   seq2_0035_0035   0.71      802     1056  706    985
10  2             seq4_0042_0035   seq7_0035_0042   0.83      123     745   156    723
12  4             seq11_0035_0035  seq14_0042_0035  0.89      145     647   236    921
13  4             seq11_0035_0035  seq17_0042_0042  0.97      148     623   241    1002
14  5             seq17_0035_0042  seq17_0042_0042  0.94      188     643   179    746

Within each cluster, I want to keep only the rows with the maximum pident_x. The issue is that, as you can see, the same pair can appear twice with query and subject reversed (rows 2, 3, 4 and rows 5, 6, 7 are the same hits, but reversed). In that case I need to keep only one orientation, for example only rows 2, 3 and 4.
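One way to spot the reversed hits, sketched here on three rows taken from the table above, is to build a key that does not depend on orientation. The column name `pair_key` and the variable `is_reversed_copy` are hypothetical names introduced for illustration, not part of the original dataframe.

```python
import pandas as pd

# Three hits from the table: rows 2 and 5 are the same pair, reversed
pairs = pd.DataFrame(
    {
        "qseqid": ["seq1_0035_0035", "seq13_0042_0035", "seq4_0042_0035"],
        "sseqid": ["seq13_0042_0035", "seq1_0035_0035", "seq2_0035_0035"],
    },
    index=[2, 5, 8],
)

# Hypothetical helper column: both IDs sorted, so A vs B and B vs A give the same key
pairs["pair_key"] = [
    "|".join(sorted(p)) for p in zip(pairs["qseqid"], pairs["sseqid"])
]

# True for any row whose pair was already seen earlier, in either orientation
is_reversed_copy = pairs.duplicated("pair_key")
print(pairs.assign(is_reversed_copy=is_reversed_copy))
```

Note that `duplicated` on this key alone would also flag the second and third fragments of a pair in its first orientation, so on the full table the key probably needs to be combined with a check on the orientation itself.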

The expected output would then be:

    cluster_name  qseqid           sseqid           pident_x  qstart  qend  sstar  send
2   1             seq1_0035_0035   seq13_0042_0035  0.73      42      133   46     189
3   1             seq1_0035_0035   seq13_0042_0035  0.73      146     283   287    389
4   1             seq1_0035_0035   seq13_0042_0035  0.73      301     478   402    503
10  2             seq4_0042_0035   seq7_0035_0042   0.83      123     745   156    723
13  4             seq11_0035_0035  seq17_0042_0042  0.97      148     623   241    1002
14  5             seq17_0035_0042  seq17_0042_0042  0.94      188     643   179    746

To explain:

For cluster 1: seq1_0035_0035 vs seq13_0042_0035 also appears reversed, as seq13_0042_0035 vs seq1_0035_0035, but I only keep the first orientation.

For cluster 2: seq4_0042_0035 vs seq7_0035_0042 (0.83) has a better pident score than seq4_0042_0035 vs seq2_0035_0035 (0.71).

For cluster 4: seq11_0035_0035 vs seq17_0042_0042 (0.97) has a better pident score than seq11_0035_0035 vs seq14_0042_0035 (0.89).

For cluster 5: there is only one sequence pair, seq17_0035_0042 vs seq17_0042_0042 (0.94), so I keep it.

I do not really know how to do this. Does anyone have an idea?
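A minimal sketch of one possible approach, assuming the table can be read as whitespace-separated text. The names `idx`, `best`, `pair_key`, `first_q` and `result` are hypothetical and only used for illustration. The idea is to keep the rows that reach the per-cluster maximum of pident_x first, then, for each unordered pair inside a cluster, keep only the rows written in the orientation that appears first.

```python
import io
import pandas as pd

# The dataframe from the question; "idx" is a made-up name for the unnamed index
data = """idx cluster_name qseqid sseqid pident_x qstart qend sstar send
2 1 seq1_0035_0035 seq13_0042_0035 0.73 42 133 46 189
3 1 seq1_0035_0035 seq13_0042_0035 0.73 146 283 287 389
4 1 seq1_0035_0035 seq13_0042_0035 0.73 301 478 402 503
5 1 seq13_0042_0035 seq1_0035_0035 0.73 46 189 42 133
6 1 seq13_0042_0035 seq1_0035_0035 0.73 287 389 146 283
7 1 seq13_0042_0035 seq1_0035_0035 0.73 402 503 301 478
8 2 seq4_0042_0035 seq2_0035_0035 0.71 256 789 125 678
9 2 seq4_0042_0035 seq2_0035_0035 0.71 802 1056 706 985
10 2 seq4_0042_0035 seq7_0035_0042 0.83 123 745 156 723
12 4 seq11_0035_0035 seq14_0042_0035 0.89 145 647 236 921
13 4 seq11_0035_0035 seq17_0042_0042 0.97 148 623 241 1002
14 5 seq17_0035_0042 seq17_0042_0042 0.94 188 643 179 746
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+", index_col="idx")

# Step 1: keep only the rows that reach the maximum pident_x of their cluster
cluster_max = df.groupby("cluster_name")["pident_x"].transform("max")
best = df[df["pident_x"] == cluster_max]

# Step 2: orientation-independent key, so A vs B and B vs A fall in the same group
best = best.assign(
    pair_key=["|".join(sorted(p)) for p in zip(best["qseqid"], best["sseqid"])]
)

# Step 3: within each (cluster, pair), keep rows whose qseqid matches the
# qseqid of the first row seen, i.e. the first orientation only
first_q = best.groupby(["cluster_name", "pair_key"])["qseqid"].transform("first")
result = best[best["qseqid"] == first_q].drop(columns="pair_key")

print(result)
```

On the data above this keeps rows 2, 3, 4, 10, 13 and 14. If two different pairs tie for the maximum pident_x in a cluster, this sketch keeps both, which may or may not be what is wanted.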

pandas filtering python • 145 views