Question: counting number of variants within specified windows
0
gravatar for spiral01
2.7 years ago by
spiral01100
spiral01100 wrote:

I have two tab delimited files. Each has two columns, one for chromosome and one for variant co-ordinate. I want to identify the number of positions in file2 that are within a specified window range of the positions in file1, and then check in the next window, and the next and so forth.

So if this is the first row from file 1:

1 10500

And this is from file 2:

1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
2 10700
2 12321
2 34234

If the window is of size 1000, the following co-ordinates from file 2 fall within this window:

1 10177
1 10352
1 10616

The second window should then be 2000 in size and therefore these co-ordinates fall within this window:

1 10177
1 10352
1 10616
1 11008
1 11012

What I want to do is find the mean number of variants in each window. So for each variant in file 1, calculate the number of variants in a 1kb window, and then calculate the mean. I then want to do this for the larger 2kb window and so on and so forth to a specified window size (e.g. 100kb).

I have tried writing a python function to do this and then looping through the windows. The issue is that file1 has 11609 variants, and file2 has 13758644 so this is proving extremely time consuming.

I was wondering if anyone knows of any better way of doing this?

Below are my python functions. It is a bit long:

    #Function counts number of variants in a window of specified size around a specified
    #genomic position. nsPos is the variant position from file1, df is the dataframe of file2.
    def vCount(nsPos, df, windowSize):
        #If the window minimum is below 0, set to 0
        if(nsPos - (windowSize/2) < 0): 
            low = 0
        else:
            low = nsPos - (windowSize/2)

        high = high = nsPos + (windowSize/2)
        #calculate the length (i.e. the number) of variants in the subsetted data.
        diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])

        return(diversity)

#Function to apply vCount function across the entire positional column of df. 
#NSdf is the dataframe of file1, variantsDF is dataframe from file2.

def windowAvg(NSdf, variantsDF, window):
    x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
    return(x)

I then run this second function through a nested loop to produce a plot:

def loopNplot(NSdf, variantsDF, window):
    #store window number
    windows = list(range(1,101))
    #store diversity (the mean number of variants)
    diversity = []

    #Loop through windows appending to the list.
    for i in range(window, 101000, window):

        #Create a list to hold counts for each chromosome (which we then take the mean of)
        tempList = []

        #Loop through chromosomes
        for j in NSdf['CHROM']:
            #Subset file1 and file2 dfs for only the relevant chromosome.
            subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
            #Subset all variants
            subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()

#run function

            x = windowAvg(subDF, subVar, i)
            #convert series to list
            x = x.tolist()
            #extend templist to include x
            tempList.extend(x)

        #Append mean of tempList - counts of diversity - to list.
        diversity.append(sum(tempList)/len(tempList))


    #Copy diversity
    divCopy = list(diversity)
    #Add a new first index of 0
    diversity = [0] + diversity
    #Remove last index
    diversity = diversity[:-1]
    #x - diversity to give number of variants in each window
    div = list(map(operator.sub, divCopy, diversity))

    plt.scatter(windows, div)

    plt.show()
snp genome • 1.1k views
ADD COMMENTlink modified 2.7 years ago by cpad011214k • written 2.7 years ago by spiral01100
1

@Pierre's VarintsInWindow tool.

ADD REPLYlink written 2.7 years ago by genomax89k

thanks, but it only works with vcf :-)

ADD REPLYlink written 2.7 years ago by Pierre Lindenbaum130k

your final aim is to produce a plot isn't it ? what is this plot ? (x and y axis ?)

ADD REPLYlink written 2.7 years ago by Pierre Lindenbaum130k

Yes, sorry I didn't make this very clear.I want to plot window number on the x axis and div (the mean number of variants) on the y axis.

ADD REPLYlink written 2.7 years ago by spiral01100

spiral01 : It is important that you validate/acknowledge (by up-voting/accepting) answers below. That is the only way we know that your original problem has been solved. You can (and should) test all answers suggested and accept all those that provide the expected result.

ADD REPLYlink written 2.7 years ago by genomax89k
2
gravatar for sacha
2.7 years ago by
sacha1.9k
France
sacha1.9k wrote:

I will suggest you to use bedtools with bash or python (pybedtools) . Edit your data to have 3 columns : chr - start - end. In your case start = end. Then use :

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by sacha1.9k

Here is a solution for one window of 1000.

 #file1.bed
 1       10500   10500
 2       34234   34234

#file2.bed 
1       10177   10177
1       10352   10352
1       10616   10616
1       11008   11008
1       11012   11012
1       13110   13110
1       13116   13116
1       13118   13118
1       13273   13273
1       13550   13550
2       10700   10700
2       12321   12321
2       34234   34234

  # script.sh 
  # make a window range of 1000 base from file1 
  bedtools slop -i file1.bed -g hg19.genom -b 1000 > file1.window.bed

  # Intersect the window range with file2 and count how many  variant intersect : 
  bedtools intersect -a file1.window.bed -b file2.bed -c > intersect.txt

You will get the following intersect.txt. And as you mention above, the first line returns your 5 variants

 1       9500    11500   5
 2       33234   35234   1

You can then create a for loop to range with different window. You can also parallelize. For instance using parallel command line or a simple for loop:

   for window in {1000...10000}
   do ./script.sh file1.bed file2.bed window > out_$window.bed &
   done
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by sacha1.9k
1

I had success with all the methods here but accepted this one as I understand it best. Thank you for all the help.

ADD REPLYlink written 2.7 years ago by spiral01100
3
gravatar for Pierre Lindenbaum
2.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

using c++ and an associative map chrom->position->count.

first we load the whole file2 in this map

then we load file1 and we loop over each size window

output is a table window-size -> mean-count-of-variant

not tested use with care...

#include <fstream>
#include <map>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <vector>
#include <cassert>
#include <iostream>
#include <stdint.h>
using namespace std;
static const int WINBEGIN=1000;
static const int WINEND=2000;


struct WinInfo
  {
  double sum;
  int count;
  };

int main(int argc,char** argv)
    {
    if(argc!=3) return EXIT_FAILURE;

    map<int,WinInfo> winInfos;
  for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize)
    {
    WinInfo w;
    w.sum=0;
    w.count=0;
    winInfos[windowsize]=w;
    }


    map<string,map<int,int > >  contig2pos2count;
    ifstream in1(argv[2],ios::in);
    assert(in1.is_open());
    string line;
    while(getline(in1,line))
        {
        string::size_type tab1=line.find('\t');
        assert(tab1!=string::npos);
        line[tab1]=0;
        int pos=atoi(&(line.data()[tab1+1]));
        contig2pos2count[line.substr(0,tab1)][pos]++;
        }
    in1.close();


    in1.open(argv[1],ios::in);
    assert(in1.is_open());
    while(getline(in1,line))
        {
        string::size_type tab1=line.find('\t');
        if(tab1==string::npos) return EXIT_FAILURE;
        string contig = line.substr(0,tab1);
        map<string,map<int,int > >::iterator r= contig2pos2count.find(contig);
        if(r!=contig2pos2count.end()) {
          line[tab1]=0;
          int pos = atoi(&(line.data()[tab1+1]));
          assert(pos>=0);
          for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize)
            {
            int winsize = windowsize/2; 
          int start = max(0,pos-winsize);
          int end = pos+winsize;
        int count=0;
          while(start<end) {
            map<int,int> ::iterator r2= r->second.find(start);
            if(r2!=r->second.end()) count+=r2->second;
            start++;
            }// start
          winInfos[windowsize].sum+=count;
          winInfos[windowsize].count++;
          }
          }
        }
    in1.close();
    for(int windowsize=WINBEGIN; windowsize<=WINEND;++windowsize)
            {
            if(winInfos[windowsize].count<=0) continue;
            cout << windowsize<<"\t"<<(winInfos[windowsize].sum/winInfos[windowsize].count) << endl;
            }
    return EXIT_SUCCESS;
    }

compile:

$ g++ -O3 biostar291480.cpp

execute:

$ ./a.out file1.txt file2.txt

1000    3
(...)
1995    5
1996    5
1997    5
1998    5
1999    5
2000    5
ADD COMMENTlink written 2.7 years ago by Pierre Lindenbaum130k
2
gravatar for cpad0112
2.7 years ago by
cpad011214k
India
cpad011214k wrote:

in python 3.6:

>>> import pandas as pd
>>> import pandasql as psql
>>> file1=pd.read_csv("file1.txt", header=None, sep="\t")
>>> file2=pd.read_csv("file2.txt", header=None,sep="\t")
>>> file1.columns=("V1","V2")
>>> file2.columns=("V1","V2")
>>>  psql.sqldf("SELECT * FROM file1 f1   join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 ")

ouput:

    V1  V2  V1  V2
0   1   10500   1   10177
1   1   10500   1   10352
2   1   10500   1   10616

in R: ouput:

> library(sqldf)
> sqldf( "SELECT * FROM file1 f1  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 ")
  V1    V2 V1    V2
1  1 10500  1 10177
2  1 10500  1 10352
3  1 10500  1 10616

input: I modified file1 a little bit for testing:

> file1
  V1    V2
1  1 10500
2  2 11500

> file2
   V1    V2
1   1 10177
2   1 10352
3   1 10616
4   1 11008
5   1 11012
6   1 13110
7   1 13116
8   1 13118
9   1 13273
10  1 13550
11  2 10700
12  2 12321
13  2 34234
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by cpad011214k

For numbers in R:

output:

> library(sqldf)

# for 1000 range (+500 to -500)
> sqldf( "SELECT  f1.V1 as chr, count(f1.V2) as total FROM file1 f1 inner  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 500 and f2.V2 + 500 group by f1.V1")
  chr total
1   1     3   

# for 1000 range (+500 to -500)
> sqldf( "SELECT  f1.V1 as chr, count(f1.V2) as total FROM file1 f1 inner  join file2 f2 ON f1.V1 = f2.V1 and f1.V2 between f2.V2 - 1000 and f2.V2 + 1000 group by f1.V1")
      chr total
    1   1     5
    2   2     2

input is same as above.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by cpad011214k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1953 users visited in the last hour