Question: Calculating Averages For Windows
1
gravatar for Rubal7
5.1 years ago by
Rubal740
Rubal740 wrote:

Hello all,

I have columns of population genetic data, here's an example:

line    chromosome    fst_score
1    1    0.3
2    1    0.7
3    1    0.3
4    1    0.15
5    1    0.4
6    2    0.6
7    2    0.94
8    2    0.17
9    2    0.19

I want to calculate the average of the values in column 3 (the Fst score) but not for the whole column, I need to bin windows of certain sizes. To start with I'd just like to know how to calculate the average for 10 rows of data column 3. I know that to calculate the average for the whole column I can do something like this:

for line in fileObj:
    lineList = line.strip().split()
    if lineList[1] in ['1100', '1200', '1300']:
        list_to_average = [float(s) for s in lineList[2:]]
        average = sum(list_to_average)/len(list_to_average)

but I am not sure how to initiate a reliable counter that would do this for every 10 rows and output this along with the value in the first column (so that I know which lines the average comes from).

This is for chromosome data so in the real file the values in column 2 are chromosome positions and I will use these to define the number of rows that need to be average together as 1mb. But this is trickier as I will need to count when the distance between rows has reached 1mb as I iterate through the file. For now I would just like to solve the first challenge of calculating an average for every 10 lines in a file.

Please let me know if I can format the question better or if there is already an answer on this forum (I have looked)

Any help is appreciated. I'm still getting to grips with programming!

Thanks

Rubal7

python windows fst snp • 1.2k views
ADD COMMENTlink written 5.1 years ago by Rubal740

Are you committed to a Python solution? This would be very easy in R.

ADD REPLYlink written 5.1 years ago by Neilfws46k

I'm open to R solutions too, especially if it is simpler. But I am even less familiar with R syntax

ADD REPLYlink written 5.1 years ago by Rubal740
3
gravatar for Istvan Albert
5.1 years ago by
Istvan Albert ♦♦ 70k
University Park, USA
Istvan Albert ♦♦ 70k wrote:

In these types of problems it is often advantageous to transform your data into a numpy data structure after which you can use many of the numpy functions to compute the quantities of interest. This has multiple benefits as far as simplicity and speed goes.

For a moving window average you could then use the numpy.convolve function as explained here:

http://argandgahandapandpa.wordpress.com/2011/02/24/python-numpy-moving-average-for-data/

ADD COMMENTlink written 5.1 years ago by Istvan Albert ♦♦ 70k

thanks, i'll investigate numpy

ADD REPLYlink written 5.1 years ago by Rubal740
3
gravatar for Adrian
5.1 years ago by
Adrian640
Cambridge, MA
Adrian640 wrote:

You could also just do this directly with a for loop.

counter = 0
sum = 0

for line in fileObj:
    linedata = line.strip().split()
    counter += 1
    id = linedata[1]
    sum += float(linedata[2])

    if counter == 10:
        average = sum / counter
        print "Average of last 10 values for %s: %f" % (id, average)
        # reset counter   
        sum = 0
        counter = 0
ADD COMMENTlink written 5.1 years ago by Adrian640

thanks that's great

ADD REPLYlink written 5.1 years ago by Rubal740

I understand the logic to this answer but when i run the script it runs but produces no output. Any ideas why? Here is how I formatted it (I checked and the file exists):

inputfile = open("/Users/..filename", "r")

counter = 0 sum = 0

for line in inputfile: linedata = line.strip().split() counter += 1 id = linedata[1] sum += float(linedata[2]) print inputfile

if counter == 10:
    average = sum / counter
    print "Average of last 10 value
ADD REPLYlink written 5.1 years ago by Rubal740
0
gravatar for Rubal7
5.1 years ago by
Rubal740
Rubal740 wrote:

Here is my attempted solution based on Adrian's answer. Unfortunately even though I understand the logic and think it should be working, the script runs and finishes without printing anything. I checked the inputfile exists and has more than 10 rows of data. Can anyone identify the problem in the script? Otherwise I guess it must be a problem with the formatting of the inputfile.

Thanks in advance!

inputfile = open("inputfilepath", "r")

counter = 0
sum = 0

for line in inputfile:
    linedata = line.strip().split()
    counter += 1
    id = linedata[1]
    sum += float(linedata[2])
    print inputfile

    if counter == 10:
        average = sum / counter
        print "Average of last 10 values for %s: %f" % (id, average)
        # reset counter   
        sum = 0
        counter = 0
ADD COMMENTlink written 5.1 years ago by Rubal740

I did a fast check and the script works fine for me. The only thing that should be implemented is to skip the first line of the file if it contains a header. This can be done in different ways. If you not do this, you will get an error from float(linedata[2]).

ADD REPLYlink written 5.1 years ago by Robert Ernst60

thanks , must be my input file then, ill make a new one and check

ADD REPLYlink written 5.1 years ago by Rubal740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1005 users visited in the last hour