counting number of variants within specified windows
3
0
Entering edit mode
5.2 years ago
spiral01 ▴ 110

I have two tab delimited files. Each has two columns, one for chromosome and one for variant co-ordinate. I want to identify the number of positions in file2 that are within a specified window range of the positions in file1, and then check in the next window, and the next and so forth.

So if this is the first row from file 1:

1 10500


And this is from file 2:

1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
2 10700
2 12321
2 34234


If the window is of size 1000, the following co-ordinates from file 2 fall within this window:

1 10177
1 10352
1 10616


The second window should then be 2000 in size and therefore these co-ordinates fall within this window:

1 10177
1 10352
1 10616
1 11008
1 11012


What I want to do is find the mean number of variants in each window. So for each variant in file 1, calculate the number of variants in a 1kb window, and then calculate the mean. I then want to do this for the larger 2kb window and so on and so forth to a specified window size (e.g. 100kb).

I have tried writing a python function to do this and then looping through the windows. The issue is that file1 has 11609 variants, and file2 has 13758644 so this is proving extremely time consuming.

I was wondering if anyone knows of any better way of doing this?

Below are my python functions. It is a bit long:

    #Function counts number of variants in a window of specified size around a specified
#genomic position. nsPos is the variant position from file1, df is the dataframe of file2.
def vCount(nsPos, df, windowSize):
#If the window minimum is below 0, set to 0
if(nsPos - (windowSize/2) < 0):
low = 0
else:
low = nsPos - (windowSize/2)

high = high = nsPos + (windowSize/2)
#calculate the length (i.e. the number) of variants in the subsetted data.
diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])

return(diversity)

#Function to apply vCount function across the entire positional column of df.
#NSdf is the dataframe of file1, variantsDF is dataframe from file2.

def windowAvg(NSdf, variantsDF, window):
x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
return(x)


I then run this second function through a nested loop to produce a plot:

def loopNplot(NSdf, variantsDF, window):
#store window number
windows = list(range(1,101))
#store diversity (the mean number of variants)
diversity = []

#Loop through windows appending to the list.
for i in range(window, 101000, window):

#Create a list to hold counts for each chromosome (which we then take the mean of)
tempList = []

#Loop through chromosomes
for j in NSdf['CHROM']:
#Subset file1 and file2 dfs for only the relevant chromosome.
subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
#Subset all variants
subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()

#run function

x = windowAvg(subDF, subVar, i)
#convert series to list
x = x.tolist()
#extend templist to include x
tempList.extend(x)

#Append mean of tempList - counts of diversity - to list.
diversity.append(sum(tempList)/len(tempList))

#Copy diversity
divCopy = list(diversity)
#Add a new first index of 0
diversity = [0] + diversity
#Remove last index
diversity = diversity[:-1]
#x - diversity to give number of variants in each window
div = list(map(operator.sub, divCopy, diversity))

plt.scatter(windows, div)

plt.show()

SNP genome • 1.9k views
1
Entering edit mode

@Pierre's VarintsInWindow tool.

0
Entering edit mode

thanks, but it only works with vcf :-)

0
Entering edit mode

your final aim is to produce a plot isn't it ? what is this plot ? (x and y axis ?)

0
Entering edit mode

Yes, sorry I didn't make this very clear.I want to plot window number on the x axis and div (the mean number of variants) on the y axis.

0
Entering edit mode

spiral01 : It is important that you validate/acknowledge (by up-voting/accepting) answers below. That is the only way we know that your original problem has been solved. You can (and should) test all answers suggested and accept all those that provide the expected result.

2
Entering edit mode
5.2 years ago
sacha ★ 2.4k

I will suggest you to use bedtools with bash or python (pybedtools) . Edit your data to have 3 columns : chr - start - end. In your case start = end. Then use :

0
Entering edit mode

Here is a solution for one window of 1000.

 #file1.bed
1       10500   10500
2       34234   34234

#file2.bed
1       10177   10177
1       10352   10352
1       10616   10616
1       11008   11008
1       11012   11012
1       13110   13110
1       13116   13116
1       13118   13118
1       13273   13273
1       13550   13550
2       10700   10700
2       12321   12321
2       34234   34234

# script.sh
# make a window range of 1000 base from file1
bedtools slop -i file1.bed -g hg19.genom -b 1000 > file1.window.bed

# Intersect the window range with file2 and count how many  variant intersect :
bedtools intersect -a file1.window.bed -b file2.bed -c > intersect.txt


You will get the following intersect.txt. And as you mention above, the first line returns your 5 variants

 1       9500    11500   5
2       33234   35234   1


You can then create a for loop to range with different window. You can also parallelize. For instance using parallel command line or a simple for loop:

   for window in {1000...10000}
do ./script.sh file1.bed file2.bed window > out_$window.bed & done  ADD REPLY 1 Entering edit mode I had success with all the methods here but accepted this one as I understand it best. Thank you for all the help. ADD REPLY 3 Entering edit mode 5.2 years ago using c++ and an associative map chrom->position->count. first we load the whole file2 in this map then we load file1 and we loop over each size window output is a table window-size -> mean-count-of-variant not tested use with care... #include <fstream> #include <map> #include <string> #include <cstdlib> #include <cstdio> #include <vector> #include <cassert> #include <iostream> #include <stdint.h> using namespace std; static const int WINBEGIN=1000; static const int WINEND=2000; struct WinInfo { double sum; int count; }; int main(int argc,char** argv) { if(argc!=3) return EXIT_FAILURE; map<int,WinInfo> winInfos; for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize) { WinInfo w; w.sum=0; w.count=0; winInfos[windowsize]=w; } map<string,map<int,int > > contig2pos2count; ifstream in1(argv[2],ios::in); assert(in1.is_open()); string line; while(getline(in1,line)) { string::size_type tab1=line.find('\t'); assert(tab1!=string::npos); line[tab1]=0; int pos=atoi(&(line.data()[tab1+1])); contig2pos2count[line.substr(0,tab1)][pos]++; } in1.close(); in1.open(argv[1],ios::in); assert(in1.is_open()); while(getline(in1,line)) { string::size_type tab1=line.find('\t'); if(tab1==string::npos) return EXIT_FAILURE; string contig = line.substr(0,tab1); map<string,map<int,int > >::iterator r= contig2pos2count.find(contig); if(r!=contig2pos2count.end()) { line[tab1]=0; int pos = atoi(&(line.data()[tab1+1])); assert(pos>=0); for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize) { int winsize = windowsize/2; int start = max(0,pos-winsize); int end = pos+winsize; int count=0; while(start<end) { map<int,int> ::iterator r2= r->second.find(start); if(r2!=r->second.end()) count+=r2->second; start++; }// start winInfos[windowsize].sum+=count; winInfos[windowsize].count++; } } } in1.close(); for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize) { if(winInfos[windowsize].count<=0) continue; cout << windowsize<<"\t"<<(winInfos[windowsize].sum/winInfos[windowsize].count) << endl; } return EXIT_SUCCESS; }  compile: $ g++ -O3 biostar291480.cpp


execute:

\$ ./a.out file1.txt file2.txt

1000    3
(...)
1995    5
1996    5
1997    5
1998    5
1999    5
2000    5

2
Entering edit mode
5.2 years ago

in python 3.6:

>>> import pandas as pd
>>> import pandasql as psql
>>> file1.columns=("V1","V2")
>>> file2.columns=("V1","V2")
>>>  psql.sqldf("SELECT * FROM file1 f1   join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 ")


ouput:

    V1  V2  V1  V2
0   1   10500   1   10177
1   1   10500   1   10352
2   1   10500   1   10616


in R: ouput:

> library(sqldf)
> sqldf( "SELECT * FROM file1 f1  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 ")
V1    V2 V1    V2
1  1 10500  1 10177
2  1 10500  1 10352
3  1 10500  1 10616


input: I modified file1 a little bit for testing:

> file1
V1    V2
1  1 10500
2  2 11500

> file2
V1    V2
1   1 10177
2   1 10352
3   1 10616
4   1 11008
5   1 11012
6   1 13110
7   1 13116
8   1 13118
9   1 13273
10  1 13550
11  2 10700
12  2 12321
13  2 34234

0
Entering edit mode

For numbers in R:

output:

> library(sqldf)

# for 1000 range (+500 to -500)
> sqldf( "SELECT  f1.V1 as chr, count(f1.V2) as total FROM file1 f1 inner  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 group by f1.V1")
chr total
1   1     3

# for 1000 range (+500 to -500)
> sqldf( "SELECT  f1.V1 as chr, count(f1.V2) as total FROM file1 f1 inner  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 1000 and f2.V2 + 1000 group by f1.V1")
chr total
1   1     5
2   2     2


input is same as above.