Question: Fastest way to subset dataframe in python
0
gravatar for Bastien Hervé
13 months ago by
Bastien Hervé3.6k
Limoges, CBRS, France
Bastien Hervé3.6k wrote:

I have a blast result like this (3 million results) (sorry I don't know how to display it better than this...) :

(blast header : ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])


M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 18 0 0 208 225 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 8 0 0 209 216 10 3 0.037 16.4

M02945:227:000000000-BHPRH:1:1101:17816:1743 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:17816:1743 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 18 0 0 255 272 2 19 5e-08 36.2

M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 8 0 0 256 263 10 3 0.045 16.4

M02945:227:000000000-BHPRH:1:1101:15392:1766 vector_sequence 100.00 11 0 0 157 167 1 11 7e-04 22.3

M02945:227:000000000-BHPRH:1:1101:15392:1766 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

...


My aim is to subset blast result by "qseqid" to process some calculation in python.

I put the whole blast result in a dataframe, named "df_blast" :

f_blast= open("blast.csv", 'rt')
try:
    df_blast = pd.read_csv(f_blast, sep='\t', header=None, index_col=None, names=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
finally:
    f_blast.close()

Over this dataframe I have a for loop that give me the incoming "qseqid", something like this

for index in ["M02945:227:000000000-BHPRH:1:1101:14031:1695", "M02945:227:000000000-BHPRH:1:1101:17816:1743", "M02945:227:000000000-BHPRH:1:1101:15392:1766"] :

First I tried in my for loop :

df_blast_line = df_blast.loc[[index]]

I got on my first step :


M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3


This solution works, but it is too much time consuming, around 5h30 (using good indexation and filtering, 4h30, still too much)

I also try to drop rows after subsetting :

df_blast.drop([index], inplace=True)

But python is taking too much time to delete these rows...

Second, I tried just before my loop to subset my dataframe into a dictionnary :

dic_blast={}
for index, line in df_blast.iterrows():
    if line['qseqid'] not in dic_blast :
        dic_blast[line['qseqid']] = pd.DataFrame([line], columns=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
    else:
        dic_blast[line['qseqid']] = dic_blast[line['qseqid']].append([line], ignore_index=True)

And then in the "for loop" :

df_blast_line = dic_blast_sequences[index]

This solution is way better, around 1h15 to build the dictionnary and few minutes for to "for loop". Winning 3h.

Does someone know a better way to go faster ? I'm very interested I have multiple blast results to process...

Thanks !

blast dataframe python • 2.9k views
ADD COMMENTlink modified 13 months ago by shoujun.gu370 • written 13 months ago by Bastien Hervé3.6k
1

To me, it's a bit unclear what your desired output is. A for loop is for sure not the way to go for a pandas DataFrame. Perhaps groupby() is what you are looking for?

Note that you can do pd.read_csv() directly from a file, there is no need to first open a handle. df_blast = pd.read_csv("blast.csv",...)

ADD REPLYlink written 13 months ago by WouterDeCoster37k

I have a blast result in tsv. I have a list of id in a separate file (corresponding to qseqid in the blast file). Here, are the first 5 lines of the blast result (same as above) :

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

I want to know by example for each "qseqid", if the order smu_sequence-vector_sequence-adaptator_sequence on my query is respected.

In this case I have 9-156 for smu_sequence, 157-167 for vector_sequence and 223-240 for adaptator_sequence. So 9 <156 < 157 < 167 < 223 < 240, qseqid = M02945:227:000000000-BHPRH:1:1101:14031:1695 pass the test, I keep it. Then I go for another qseqid, same tests etc...

I'm not sure I could do all these tests in a groupby but i'm just not familiar with.

Thanks for the tips !

ADD REPLYlink written 13 months ago by Bastien Hervé3.6k
2
gravatar for shoujun.gu
13 months ago by
shoujun.gu370
Rockville/MD
shoujun.gu370 wrote:

I think what you need is some code like this:

   import  pandas as pd

    df=pd.read_csv('your blast file')
    labels=df['qseqid'].unique()

    for label in labels:
        df_temp=df.loc[df['qseqid']==label,:]
        # do whatever you want to do
ADD COMMENTlink written 13 months ago by shoujun.gu370
1

I would suggest setting the qseqid column as index for the DataFrame and just slice like that, for further simplification:

df=pd.read_csv('your blast file', index_col='qseqid')

for label in df.index.unique():
    df_temp=df.loc[label]
ADD REPLYlink written 13 months ago by WouterDeCoster37k

With the qseqid column as index, it takes 40 minutes to compute.

I found an other solution that takes 30 minutes

df_blast = pd.read_csv('blast.csv' index_col='qseqid')
df_blast.sort_index(inplace=True)
previous_index=""

for index, line in df_blast.iterrows():
    if previous_index == "" :
        previous_index = index
        df_temp = pd.DataFrame([line], index=[index])
    else:
        if previous_index == index:
            df_temp = df_temp.append([line])
        else:
            #Do stuff here
            previous_index = index
            df_temp = pd.DataFrame([line], index=[index])

Thanks to both of you !

ADD REPLYlink modified 13 months ago • written 13 months ago by Bastien Hervé3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1228 users visited in the last hour