Question: counting number of variants within specified windows
0
3.1 years ago by
spiral01100
spiral01100 wrote:

I have two tab delimited files. Each has two columns, one for chromosome and one for variant co-ordinate. I want to identify the number of positions in file2 that are within a specified window range of the positions in file1, and then check in the next window, and the next and so forth.

So if this is the first row from file 1:

``````1 10500
``````

And this is from file 2:

``````1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
2 10700
2 12321
2 34234
``````

If the window is of size 1000, the following co-ordinates from file 2 fall within this window:

``````1 10177
1 10352
1 10616
``````

The second window should then be 2000 in size and therefore these co-ordinates fall within this window:

``````1 10177
1 10352
1 10616
1 11008
1 11012
``````

What I want to do is find the mean number of variants in each window. So for each variant in file 1, calculate the number of variants in a 1kb window, and then calculate the mean. I then want to do this for the larger 2kb window and so on and so forth to a specified window size (e.g. 100kb).

I have tried writing a python function to do this and then looping through the windows. The issue is that file1 has 11609 variants, and file2 has 13758644 so this is proving extremely time consuming.

I was wondering if anyone knows of any better way of doing this?

Below are my python functions. It is a bit long:

``````    #Function counts number of variants in a window of specified size around a specified
#genomic position. nsPos is the variant position from file1, df is the dataframe of file2.
def vCount(nsPos, df, windowSize):
#If the window minimum is below 0, set to 0
if(nsPos - (windowSize/2) < 0):
low = 0
else:
low = nsPos - (windowSize/2)

high = high = nsPos + (windowSize/2)
#calculate the length (i.e. the number) of variants in the subsetted data.
diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])

return(diversity)

#Function to apply vCount function across the entire positional column of df.
#NSdf is the dataframe of file1, variantsDF is dataframe from file2.

def windowAvg(NSdf, variantsDF, window):
x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
return(x)
``````

I then run this second function through a nested loop to produce a plot:

``````def loopNplot(NSdf, variantsDF, window):
#store window number
windows = list(range(1,101))
#store diversity (the mean number of variants)
diversity = []

#Loop through windows appending to the list.
for i in range(window, 101000, window):

#Create a list to hold counts for each chromosome (which we then take the mean of)
tempList = []

#Loop through chromosomes
for j in NSdf['CHROM']:
#Subset file1 and file2 dfs for only the relevant chromosome.
subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
#Subset all variants
subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()

#run function

x = windowAvg(subDF, subVar, i)
#convert series to list
x = x.tolist()
#extend templist to include x
tempList.extend(x)

#Append mean of tempList - counts of diversity - to list.
diversity.append(sum(tempList)/len(tempList))

#Copy diversity
divCopy = list(diversity)
#Add a new first index of 0
diversity = [0] + diversity
#Remove last index
diversity = diversity[:-1]
#x - diversity to give number of variants in each window
div = list(map(operator.sub, divCopy, diversity))

plt.scatter(windows, div)

plt.show()
``````
snp genome • 1.1k views
modified 3.1 years ago by cpad011214k • written 3.1 years ago by spiral01100
1

@Pierre's VarintsInWindow tool.

thanks, but it only works with vcf :-)

your final aim is to produce a plot isn't it ? what is this plot ? (x and y axis ?)

Yes, sorry I didn't make this very clear.I want to plot window number on the x axis and div (the mean number of variants) on the y axis.

spiral01 : It is important that you validate/acknowledge (by up-voting/accepting) answers below. That is the only way we know that your original problem has been solved. You can (and should) test all answers suggested and accept all those that provide the expected result.

2
3.1 years ago by
sacha2.0k
France
sacha2.0k wrote:

I will suggest you to use bedtools with bash or python (pybedtools) . Edit your data to have 3 columns : chr - start - end. In your case start = end. Then use :

Here is a solution for one window of 1000.

`````` #file1.bed
1       10500   10500
2       34234   34234

#file2.bed
1       10177   10177
1       10352   10352
1       10616   10616
1       11008   11008
1       11012   11012
1       13110   13110
1       13116   13116
1       13118   13118
1       13273   13273
1       13550   13550
2       10700   10700
2       12321   12321
2       34234   34234

# script.sh
# make a window range of 1000 base from file1
bedtools slop -i file1.bed -g hg19.genom -b 1000 > file1.window.bed

# Intersect the window range with file2 and count how many  variant intersect :
bedtools intersect -a file1.window.bed -b file2.bed -c > intersect.txt
``````

You will get the following intersect.txt. And as you mention above, the first line returns your 5 variants

`````` 1       9500    11500   5
2       33234   35234   1
``````

You can then create a for loop to range with different window. You can also parallelize. For instance using parallel command line or a simple for loop:

``````   for window in {1000...10000}
do ./script.sh file1.bed file2.bed window > out_\$window.bed &
done
``````
1

I had success with all the methods here but accepted this one as I understand it best. Thank you for all the help.

3
3.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum133k wrote:

using c++ and an associative `map chrom->position->count`.

first we load the whole file2 in this map

then we load file1 and we loop over each size window

output is a table `window-size -> mean-count-of-variant`

not tested use with care...

``````#include <fstream>
#include <map>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <vector>
#include <cassert>
#include <iostream>
#include <stdint.h>
using namespace std;
static const int WINBEGIN=1000;
static const int WINEND=2000;

struct WinInfo
{
double sum;
int count;
};

int main(int argc,char** argv)
{
if(argc!=3) return EXIT_FAILURE;

map<int,WinInfo> winInfos;
for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize)
{
WinInfo w;
w.sum=0;
w.count=0;
winInfos[windowsize]=w;
}

map<string,map<int,int > >  contig2pos2count;
ifstream in1(argv[2],ios::in);
assert(in1.is_open());
string line;
while(getline(in1,line))
{
string::size_type tab1=line.find('\t');
assert(tab1!=string::npos);
line[tab1]=0;
int pos=atoi(&(line.data()[tab1+1]));
contig2pos2count[line.substr(0,tab1)][pos]++;
}
in1.close();

in1.open(argv[1],ios::in);
assert(in1.is_open());
while(getline(in1,line))
{
string::size_type tab1=line.find('\t');
if(tab1==string::npos) return EXIT_FAILURE;
string contig = line.substr(0,tab1);
map<string,map<int,int > >::iterator r= contig2pos2count.find(contig);
if(r!=contig2pos2count.end()) {
line[tab1]=0;
int pos = atoi(&(line.data()[tab1+1]));
assert(pos>=0);
for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize)
{
int winsize = windowsize/2;
int start = max(0,pos-winsize);
int end = pos+winsize;
int count=0;
while(start<end) {
map<int,int> ::iterator r2= r->second.find(start);
if(r2!=r->second.end()) count+=r2->second;
start++;
}// start
winInfos[windowsize].sum+=count;
winInfos[windowsize].count++;
}
}
}
in1.close();
for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize)
{
if(winInfos[windowsize].count<=0) continue;
cout << windowsize<<"\t"<<(winInfos[windowsize].sum/winInfos[windowsize].count) << endl;
}
return EXIT_SUCCESS;
}
``````

compile:

``````\$ g++ -O3 biostar291480.cpp
``````

execute:

``````\$ ./a.out file1.txt file2.txt

1000    3
(...)
1995    5
1996    5
1997    5
1998    5
1999    5
2000    5
``````
2
3.1 years ago by

in python 3.6:

``````>>> import pandas as pd
>>> import pandasql as psql
>>> file1.columns=("V1","V2")
>>> file2.columns=("V1","V2")
>>>  psql.sqldf("SELECT * FROM file1 f1   join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 ")
``````

ouput:

``````    V1  V2  V1  V2
0   1   10500   1   10177
1   1   10500   1   10352
2   1   10500   1   10616
``````

in R: ouput:

``````> library(sqldf)
> sqldf( "SELECT * FROM file1 f1  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 ")
V1    V2 V1    V2
1  1 10500  1 10177
2  1 10500  1 10352
3  1 10500  1 10616
``````

input: I modified file1 a little bit for testing:

``````> file1
V1    V2
1  1 10500
2  2 11500

> file2
V1    V2
1   1 10177
2   1 10352
3   1 10616
4   1 11008
5   1 11012
6   1 13110
7   1 13116
8   1 13118
9   1 13273
10  1 13550
11  2 10700
12  2 12321
13  2 34234
``````

For numbers in R:

output:

``````> library(sqldf)

# for 1000 range (+500 to -500)
> sqldf( "SELECT  f1.V1 as chr, count(f1.V2) as total FROM file1 f1 inner  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 group by f1.V1")
chr total
1   1     3

# for 1000 range (+500 to -500)
> sqldf( "SELECT  f1.V1 as chr, count(f1.V2) as total FROM file1 f1 inner  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 1000 and f2.V2 + 1000 group by f1.V1")
chr total
1   1     5
2   2     2
``````

input is same as above.