I have two tab delimited files. Each has two columns, one for chromosome and one for variant co-ordinate. I want to identify the number of positions in file2 that are within a specified window range of the positions in file1, and then check in the next window, and the next and so forth.
So if this is the first row from file 1:
And this is from file 2:
1 10177 1 10352 1 10616 1 11008 1 11012 1 13110 1 13116 1 13118 1 13273 1 13550 2 10700 2 12321 2 34234
If the window is of size 1000, the following co-ordinates from file 2 fall within this window:
1 10177 1 10352 1 10616
The second window should then be 2000 in size and therefore these co-ordinates fall within this window:
1 10177 1 10352 1 10616 1 11008 1 11012
What I want to do is find the mean number of variants in each window. So for each variant in file 1, calculate the number of variants in a 1kb window, and then calculate the mean. I then want to do this for the larger 2kb window and so on and so forth to a specified window size (e.g. 100kb).
I have tried writing a python function to do this and then looping through the windows. The issue is that file1 has 11609 variants, and file2 has 13758644 so this is proving extremely time consuming.
I was wondering if anyone knows of any better way of doing this?
Below are my python functions. It is a bit long:
#Function counts number of variants in a window of specified size around a specified #genomic position. nsPos is the variant position from file1, df is the dataframe of file2. def vCount(nsPos, df, windowSize): #If the window minimum is below 0, set to 0 if(nsPos - (windowSize/2) < 0): low = 0 else: low = nsPos - (windowSize/2) high = high = nsPos + (windowSize/2) #calculate the length (i.e. the number) of variants in the subsetted data. diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)]) return(diversity) #Function to apply vCount function across the entire positional column of df. #NSdf is the dataframe of file1, variantsDF is dataframe from file2. def windowAvg(NSdf, variantsDF, window): x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window) return(x)
I then run this second function through a nested loop to produce a plot:
def loopNplot(NSdf, variantsDF, window): #store window number windows = list(range(1,101)) #store diversity (the mean number of variants) diversity =  #Loop through windows appending to the list. for i in range(window, 101000, window): #Create a list to hold counts for each chromosome (which we then take the mean of) tempList =  #Loop through chromosomes for j in NSdf['CHROM']: #Subset file1 and file2 dfs for only the relevant chromosome. subDF = vtable.loc[(vtable['CHROM'] == j)].copy() #Subset all variants subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy() #run function x = windowAvg(subDF, subVar, i) #convert series to list x = x.tolist() #extend templist to include x tempList.extend(x) #Append mean of tempList - counts of diversity - to list. diversity.append(sum(tempList)/len(tempList)) #Copy diversity divCopy = list(diversity) #Add a new first index of 0 diversity =  + diversity #Remove last index diversity = diversity[:-1] #x - diversity to give number of variants in each window div = list(map(operator.sub, divCopy, diversity)) plt.scatter(windows, div) plt.show()