Question: Calculating Averages For Windows
1
Rubal740 wrote:

Hello all,

I have columns of population genetic data, here's an example:

``````line    chromosome    fst_score
1    1    0.3
2    1    0.7
3    1    0.3
4    1    0.15
5    1    0.4
6    2    0.6
7    2    0.94
8    2    0.17
9    2    0.19
``````

I want to calculate the average of the values in column 3 (the Fst score) but not for the whole column, I need to bin windows of certain sizes. To start with I'd just like to know how to calculate the average for 10 rows of data column 3. I know that to calculate the average for the whole column I can do something like this:

``````for line in fileObj:
lineList = line.strip().split()
if lineList in ['1100', '1200', '1300']:
list_to_average = [float(s) for s in lineList[2:]]
average = sum(list_to_average)/len(list_to_average)
``````

but I am not sure how to initiate a reliable counter that would do this for every 10 rows and output this along with the value in the first column (so that I know which lines the average comes from).

This is for chromosome data so in the real file the values in column 2 are chromosome positions and I will use these to define the number of rows that need to be average together as 1mb. But this is trickier as I will need to count when the distance between rows has reached 1mb as I iterate through the file. For now I would just like to solve the first challenge of calculating an average for every 10 lines in a file.

Please let me know if I can format the question better or if there is already an answer on this forum (I have looked)

Any help is appreciated. I'm still getting to grips with programming!

Thanks

Rubal7

python windows fst snp • 2.1k views
written 8.3 years ago by Rubal740

Are you committed to a Python solution? This would be very easy in R.

I'm open to R solutions too, especially if it is simpler. But I am even less familiar with R syntax

3
Istvan Albert ♦♦ 84k wrote:

In these types of problems it is often advantageous to transform your data into a numpy data structure after which you can use many of the numpy functions to compute the quantities of interest. This has multiple benefits as far as simplicity and speed goes.

For a moving window average you could then use the `numpy.convolve` function as explained here:

http://argandgahandapandpa.wordpress.com/2011/02/24/python-numpy-moving-average-for-data/

thanks, i'll investigate numpy

3

You could also just do this directly with a for loop.

``````counter = 0
sum = 0

for line in fileObj:
linedata = line.strip().split()
counter += 1
id = linedata
sum += float(linedata)

if counter == 10:
average = sum / counter
print "Average of last 10 values for %s: %f" % (id, average)
# reset counter
sum = 0
counter = 0
``````

thanks that's great

I understand the logic to this answer but when i run the script it runs but produces no output. Any ideas why? Here is how I formatted it (I checked and the file exists):

inputfile = open("/Users/..filename", "r")

counter = 0 sum = 0

for line in inputfile: linedata = line.strip().split() counter += 1 id = linedata sum += float(linedata) print inputfile

``````if counter == 10:
average = sum / counter
print "Average of last 10 value
``````
0
Rubal740 wrote:

Here is my attempted solution based on Adrian's answer. Unfortunately even though I understand the logic and think it should be working, the script runs and finishes without printing anything. I checked the inputfile exists and has more than 10 rows of data. Can anyone identify the problem in the script? Otherwise I guess it must be a problem with the formatting of the inputfile.

``````inputfile = open("inputfilepath", "r")

counter = 0
sum = 0

for line in inputfile:
linedata = line.strip().split()
counter += 1
id = linedata
sum += float(linedata)
print inputfile

if counter == 10:
average = sum / counter
print "Average of last 10 values for %s: %f" % (id, average)
# reset counter
sum = 0
counter = 0
``````

I did a fast check and the script works fine for me. The only thing that should be implemented is to skip the first line of the file if it contains a header. This can be done in different ways. If you not do this, you will get an error from float(linedata).