Question: (Closed) How can I utilize vectorization on my Pandas script for efficiency?
gravatar for Volka
4 months ago by
Volka40 wrote:

this is a continuation from my previous post, where I wanted a faster and more efficient alternative to a standard Python loop, which performs some summing and multiplication on elements of each row.

Basically, what I have are two file inputs. One is a list of all combinations for a group of SNPs, for example below for 3 SNPs:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

And the second is a table, containing some information for each SNP, notably their log(OR) for a disease and the frequency of the risk allele:

SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136

Below is my main code, in which I am looking to calculate a 'score' and a 'frequency' for each 'profile. The score is the sum of log(ORs) for each risk allele present in the profile, while the frequency is the frequencies multiplied together, assuming Hardy Weinberg equilibrium:

import pandas as pd

numbers = pd.read_csv(table2, sep="\t", header=None)

combinations = pd.read_csv(table1, sep=" ", header=None)

def score_freq(line):
    for j in range(len(line)):
        if line[j][1] != numbers.values[j][1]:   # homozygous for ref
        elif line[j][0] != numbers.values[j][1] and line[j][1] == numbers.values[j][1]: # heterozygous
        elif line[j][0] == numbers.values[j][1]:   # homozygous for risk

        if freq < 1e-05:   # threshold to stop loop in interest of efficiency 

    return pd.Series([score, freq])

combinations[['score', 'freq']] = combinations.apply(lambda row: score_freq(row), axis=1)
#combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?


I was referring to this site, where they go over the fastest way to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to perform the vectorization method over the Pandas series. Other than that, do suggest any way in which I can improve my script to make it more efficient, thanks!

vectorization pandas python • 160 views
ADD COMMENTlink written 4 months ago by Volka40

Hello Volka!

We believe that this post does not fit the main topic of this site.

This is a pure Python question. Please ask at StackExchange.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


ADD REPLYlink written 4 months ago by ATpoint14k

In fact it is most likely an ill-posed or XY-problem. Until OP accepts that, repeated posts without further details will not help.

The problem occurs when people get stuck on what they believe is the solution and are unable step back and explain the issue in full.

ADD REPLYlink modified 4 months ago • written 4 months ago by Michael Dondrup45k
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1282 users visited in the last hour