Question: (Closed) How can I utilize vectorization on my Pandas script for efficiency?
Volka120 wrote:

This is a continuation of my previous post, where I wanted a faster and more efficient alternative to a standard Python loop that performs some summing and multiplication on the elements of each row.

Basically, what I have are two file inputs. One is a list of all combinations for a group of SNPs, for example below for 3 SNPs:

``````
AA   CC   TT
AT   CC   TT
TT   CC   TT
AA   CG   TT
AT   CG   TT
TT   CG   TT
AA   GG   TT
AT   GG   TT
TT   GG   TT
AA   CC   TA
AT   CC   TA
TT   CC   TA
AA   CG   TA
AT   CG   TA
TT   CG   TA
AA   GG   TA
AT   GG   TA
TT   GG   TA
AA   CC   AA
AT   CC   AA
TT   CC   AA
AA   CG   AA
AT   CG   AA
TT   CG   AA
AA   GG   AA
AT   GG   AA
TT   GG   AA
``````

And the second is a table, containing some information for each SNP, notably their log(OR) for a disease and the frequency of the risk allele:

``````
SNP1             A       T       1.25    0.223143551314     0.97273
SNP2             C       G       1.07    0.0676586484738    0.3
SNP3             T       A       1.08    0.0769610411361    0.1136
``````

Below is my main code, in which I am looking to calculate a 'score' and a 'frequency' for each 'profile'. The score is the sum of the log(OR)s for each risk allele present in the profile, while the frequency is the product of the genotype frequencies, assuming Hardy-Weinberg equilibrium:

``````
import pandas as pd

# Assumes `combinations` (one genotype string per SNP column) and `numbers`
# (one row per SNP) are already loaded as DataFrames. The column positions
# used below (2 = risk allele, 4 = log(OR), 5 = risk-allele frequency) are
# assumptions inferred from the table above.

def score_freq(line):
    score = 0
    freq = 1
    for j in range(len(line)):
        risk = numbers.values[j][2]
        log_or = float(numbers.values[j][4])
        p = float(numbers.values[j][5])
        n_risk = line.iloc[j].count(risk)  # risk alleles in this genotype
        if n_risk == 0:    # homozygous for ref
            freq *= (1 - p) * (1 - p)
        elif n_risk == 1:  # heterozygous
            score += log_or
            freq *= 2 * (1 - p) * p
        else:              # homozygous for risk
            score += 2 * log_or
            freq *= p * p

        if freq < 1e-05:   # threshold to stop loop in interest of efficiency
            break

    return pd.Series([score, freq])

combinations[['score', 'freq']] = combinations.apply(score_freq, axis=1)
#combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?

print(combinations)
``````
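
To make the arithmetic concrete, here is a minimal sketch for a single profile. It works directly with the number of risk alleles per SNP (0, 1 or 2), so it does not depend on which allele column is the risk allele. The helper names `hwe_genotype_freq` and `profile_score_freq` are made up for illustration; the log(OR) and frequency values are copied from the table above.

```python
import math

# (log(OR), risk-allele frequency) per SNP, copied from the table in the post.
snps = [
    (0.223143551314, 0.97273),   # SNP1
    (0.0676586484738, 0.3),      # SNP2
    (0.0769610411361, 0.1136),   # SNP3
]

def hwe_genotype_freq(p, n_risk):
    """Hardy-Weinberg frequency of a genotype carrying n_risk risk alleles."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2][n_risk]

def profile_score_freq(risk_counts):
    """Score and frequency for one profile, given risk-allele counts per SNP."""
    # Each risk allele contributes one log(OR) to the score.
    score = sum(n * log_or for n, (log_or, _) in zip(risk_counts, snps))
    # Genotype frequencies are multiplied across SNPs (assumes independence).
    freq = math.prod(hwe_genotype_freq(p, n) for n, (_, p) in zip(risk_counts, snps))
    return score, freq

# A profile that is heterozygous at every SNP, e.g. "AT CG TA".
score, freq = profile_score_freq([1, 1, 1])
print(score, freq)
```

Summing `freq` over all 27 possible profiles should give 1, which is a quick sanity check on the frequency calculation.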

I was referring to this site, where they go over the fastest ways to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to apply the vectorization method to a Pandas Series. Other than that, please suggest any other way I can make my script more efficient. Thanks!
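
One possible way to vectorize this, sketched under assumptions rather than taken from the original script: count the risk alleles per genotype column by column, then let NumPy do the summing and multiplying across all rows at once. The column names (`risk`, `log_or`, `risk_freq`), the choice of the third table column as the risk allele, and the function name `score_freq_vectorized` are all hypothetical. Unlike the loop version, this sketch has no early exit at `freq < 1e-05`; that threshold could be applied afterwards as a filter on the result.

```python
import numpy as np
import pandas as pd

# Toy inputs copied from the post; the column names are illustrative assumptions,
# as is treating the third column as the risk allele.
numbers = pd.DataFrame(
    [["SNP1", "A", "T", 1.25, 0.223143551314, 0.97273],
     ["SNP2", "C", "G", 1.07, 0.0676586484738, 0.3],
     ["SNP3", "T", "A", 1.08, 0.0769610411361, 0.1136]],
    columns=["snp", "ref", "risk", "or", "log_or", "risk_freq"],
)
combinations = pd.DataFrame(
    [["AA", "CC", "TT"], ["AT", "CG", "TA"], ["TT", "GG", "AA"]]
)

def score_freq_vectorized(combinations, numbers):
    risk = numbers["risk"].tolist()
    log_or = numbers["log_or"].to_numpy(dtype=float)
    p = numbers["risk_freq"].to_numpy(dtype=float)

    # Risk-allele count (0, 1 or 2) for every genotype: shape (n_profiles, n_snps).
    # This loops over the few SNP columns, not over the many rows.
    n_risk = np.column_stack([
        combinations.iloc[:, j].str.count(risk[j]).to_numpy()
        for j in range(len(risk))
    ])

    # Score: each risk allele adds one log(OR).
    score = n_risk @ log_or

    # Hardy-Weinberg genotype frequencies, indexed by risk-allele count:
    # row 0 = (1-p)^2, row 1 = 2p(1-p), row 2 = p^2; shape (3, n_snps).
    hwe = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    geno_freq = hwe[n_risk, np.arange(len(p))]
    freq = geno_freq.prod(axis=1)

    return pd.DataFrame({"score": score, "freq": freq}, index=combinations.index)

result = score_freq_vectorized(combinations, numbers)
print(result)
```

The result can be attached to the original table with `combinations[['score', 'freq']] = result`, and rare profiles can then be dropped with a boolean mask such as `result['freq'] >= 1e-05`.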

vectorization pandas python • 722 views
written 2.1 years ago by Volka120

Hello Volka!

We believe that this post does not fit the main topic of this site.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

In fact, it is most likely an ill-posed question or an XY problem. Until the OP accepts that, repeated posts without further details will not help.

The problem occurs when people get stuck on what they believe is the solution and are unable to step back and explain the issue in full.