Fastest way to subset dataframe in python
1
0
Entering edit mode
3.7 years ago

I have a blast result like this (3 million results) (sorry I don't know how to display it better than this...) :

(blast header : ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 18 0 0 208 225 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 8 0 0 209 216 10 3 0.037 16.4

M02945:227:000000000-BHPRH:1:1101:17816:1743 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:17816:1743 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 18 0 0 255 272 2 19 5e-08 36.2

M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 8 0 0 256 263 10 3 0.045 16.4

M02945:227:000000000-BHPRH:1:1101:15392:1766 vector_sequence 100.00 11 0 0 157 167 1 11 7e-04 22.3

M02945:227:000000000-BHPRH:1:1101:15392:1766 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

...

My aim is to subset blast result by "qseqid" to process some calculation in python.

I put the whole blast result in a dataframe, named "df_blast" :

f_blast= open("blast.csv", 'rt')
try:
df_blast = pd.read_csv(f_blast, sep='\t', header=None, index_col=None, names=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
finally:
f_blast.close()


Over this dataframe I have a for loop that give me the incoming "qseqid", something like this

for index in ["M02945:227:000000000-BHPRH:1:1101:14031:1695", "M02945:227:000000000-BHPRH:1:1101:17816:1743", "M02945:227:000000000-BHPRH:1:1101:15392:1766"] :


First I tried in my for loop :

df_blast_line = df_blast.loc[[index]]


I got on my first step :

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

This solution works, but it is too much time consuming, around 5h30 (using good indexation and filtering, 4h30, still too much)

I also try to drop rows after subsetting :

df_blast.drop([index], inplace=True)


But python is taking too much time to delete these rows...

Second, I tried just before my loop to subset my dataframe into a dictionnary :

dic_blast={}
for index, line in df_blast.iterrows():
if line['qseqid'] not in dic_blast :
dic_blast[line['qseqid']] = pd.DataFrame([line], columns=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
else:
dic_blast[line['qseqid']] = dic_blast[line['qseqid']].append([line], ignore_index=True)


And then in the "for loop" :

df_blast_line = dic_blast_sequences[index]


This solution is way better, around 1h15 to build the dictionnary and few minutes for to "for loop". Winning 3h.

Does someone know a better way to go faster ? I'm very interested I have multiple blast results to process...

Thanks !

python blast dataframe • 7.8k views
1
Entering edit mode

To me, it's a bit unclear what your desired output is. A for loop is for sure not the way to go for a pandas DataFrame. Perhaps groupby() is what you are looking for?

Note that you can do pd.read_csv() directly from a file, there is no need to first open a handle. df_blast = pd.read_csv("blast.csv",...)

0
Entering edit mode

I have a blast result in tsv. I have a list of id in a separate file (corresponding to qseqid in the blast file). Here, are the first 5 lines of the blast result (same as above) :

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

I want to know by example for each "qseqid", if the order smu_sequence-vector_sequence-adaptator_sequence on my query is respected.

In this case I have 9-156 for smu_sequence, 157-167 for vector_sequence and 223-240 for adaptator_sequence. So 9 <156 < 157 < 167 < 223 < 240, qseqid = M02945:227:000000000-BHPRH:1:1101:14031:1695 pass the test, I keep it. Then I go for another qseqid, same tests etc...

I'm not sure I could do all these tests in a groupby but i'm just not familiar with.

Thanks for the tips !

2
Entering edit mode
3.7 years ago
shoujun.gu ▴ 370

I think what you need is some code like this:

   import  pandas as pd

labels=df['qseqid'].unique()

for label in labels:
df_temp=df.loc[df['qseqid']==label,:]
# do whatever you want to do

1
Entering edit mode

I would suggest setting the qseqid column as index for the DataFrame and just slice like that, for further simplification:

df=pd.read_csv('your blast file', index_col='qseqid')

for label in df.index.unique():
df_temp=df.loc[label]

0
Entering edit mode

With the qseqid column as index, it takes 40 minutes to compute.

I found an other solution that takes 30 minutes

df_blast = pd.read_csv('blast.csv' index_col='qseqid')
df_blast.sort_index(inplace=True)
previous_index=""

for index, line in df_blast.iterrows():
if previous_index == "" :
previous_index = index
df_temp = pd.DataFrame([line], index=[index])
else:
if previous_index == index:
df_temp = df_temp.append([line])
else:
#Do stuff here
previous_index = index
df_temp = pd.DataFrame([line], index=[index])


Thanks to both of you !