Calculating Averages For Windows
3
1
Entering edit mode
12.1 years ago
Rubal7 ▴ 40

Hello all,

I have columns of population genetic data, here's an example:

line    chromosome    fst_score
1    1    0.3
2    1    0.7
3    1    0.3
4    1    0.15
5    1    0.4
6    2    0.6
7    2    0.94
8    2    0.17
9    2    0.19

I want to calculate the average of the values in column 3 (the Fst score) but not for the whole column, I need to bin windows of certain sizes. To start with I'd just like to know how to calculate the average for 10 rows of data column 3. I know that to calculate the average for the whole column I can do something like this:

for line in fileObj:
    lineList = line.strip().split()
    if lineList[1] in ['1100', '1200', '1300']:
        list_to_average = [float(s) for s in lineList[2:]]
        average = sum(list_to_average)/len(list_to_average)

but I am not sure how to initiate a reliable counter that would do this for every 10 rows and output this along with the value in the first column (so that I know which lines the average comes from).

This is for chromosome data so in the real file the values in column 2 are chromosome positions and I will use these to define the number of rows that need to be average together as 1mb. But this is trickier as I will need to count when the distance between rows has reached 1mb as I iterate through the file. For now I would just like to solve the first challenge of calculating an average for every 10 lines in a file.

Please let me know if I can format the question better or if there is already an answer on this forum (I have looked)

Any help is appreciated. I'm still getting to grips with programming!

Thanks

Rubal7

snp fst python windows • 3.4k views
ADD COMMENT
0
Entering edit mode

Are you committed to a Python solution? This would be very easy in R.

ADD REPLY
0
Entering edit mode

I'm open to R solutions too, especially if it is simpler. But I am even less familiar with R syntax

ADD REPLY
3
Entering edit mode
12.1 years ago

In these types of problems it is often advantageous to transform your data into a numpy data structure after which you can use many of the numpy functions to compute the quantities of interest. This has multiple benefits as far as simplicity and speed goes.

For a moving window average you could then use the numpy.convolve function as explained here:

http://argandgahandapandpa.wordpress.com/2011/02/24/python-numpy-moving-average-for-data/

ADD COMMENT
0
Entering edit mode

thanks, i'll investigate numpy

ADD REPLY
3
Entering edit mode
12.1 years ago
Adrian ▴ 700

You could also just do this directly with a for loop.

counter = 0
sum = 0

for line in fileObj:
    linedata = line.strip().split()
    counter += 1
    id = linedata[1]
    sum += float(linedata[2])

    if counter == 10:
        average = sum / counter
        print "Average of last 10 values for %s: %f" % (id, average)
        # reset counter   
        sum = 0
        counter = 0
ADD COMMENT
0
Entering edit mode

thanks that's great

ADD REPLY
0
Entering edit mode

I understand the logic to this answer but when i run the script it runs but produces no output. Any ideas why? Here is how I formatted it (I checked and the file exists):

inputfile = open("/Users/..filename", "r")

counter = 0 sum = 0

for line in inputfile: linedata = line.strip().split() counter += 1 id = linedata[1] sum += float(linedata[2]) print inputfile

if counter == 10:
    average = sum / counter
    print "Average of last 10 value
ADD REPLY
0
Entering edit mode
12.1 years ago
Rubal7 ▴ 40

Here is my attempted solution based on Adrian's answer. Unfortunately even though I understand the logic and think it should be working, the script runs and finishes without printing anything. I checked the inputfile exists and has more than 10 rows of data. Can anyone identify the problem in the script? Otherwise I guess it must be a problem with the formatting of the inputfile.

Thanks in advance!

inputfile = open("inputfilepath", "r")

counter = 0
sum = 0

for line in inputfile:
    linedata = line.strip().split()
    counter += 1
    id = linedata[1]
    sum += float(linedata[2])
    print inputfile

    if counter == 10:
        average = sum / counter
        print "Average of last 10 values for %s: %f" % (id, average)
        # reset counter   
        sum = 0
        counter = 0
ADD COMMENT
0
Entering edit mode

I did a fast check and the script works fine for me. The only thing that should be implemented is to skip the first line of the file if it contains a header. This can be done in different ways. If you not do this, you will get an error from float(linedata[2]).

ADD REPLY
0
Entering edit mode

thanks , must be my input file then, ill make a new one and check

ADD REPLY

Login before adding your answer.

Traffic: 1963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6