Question: how to find chromosome position ranges common to all files
0
gravatar for sebastianzeki0
5.3 years ago by
United Kingdom
sebastianzeki0180 wrote:

I have a load of bed files which have been filtered according to certain criteria. The bins are the same size in all files but they contain sometimes the same bins after filtering and sometimes different bins. I would like to extract all the common bins. For example below. I have tried bedtools intersect but this will only allow comparison of one sample with all the other samples whereas I want to see what is common to all the samples 

File 1:

chrom                       chrStart                    chrEnd

chr1                                 1                         1001

chr1                               11001                   12001

chr1                               12001                   13001

File 2:

chrom                   chrStart                  chrEnd

chr1                                 1                         1001

chr1                               9001                   10001

chr1                               12001                   13001

 

The output should be:

File 1:

chrom                   chrStart                  chrEnd

chr1                                 1                         1001

chr1                               12001                   13001

 

sequencing genome • 1.4k views
ADD COMMENTlink modified 5.3 years ago by ashwini100 • written 5.3 years ago by sebastianzeki0180

I don't understand very well.. do you want to extract the "common lines" between two files? If so, you can use this bash command:

comm -12 <( sort file1 ) <( sort file2)
ADD REPLYlink written 5.3 years ago by iraun3.7k
1
gravatar for Alex Reynolds
5.3 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Modifying the example from 4.7. Working with many input files at once with bedops and bedmap to a general solution for N input files (where your N is 204):

$ bedops --everything file1.bed file2.bed ... fileN.bed  \
    | bedmap --count --exact --echo --delim '\t' - \
    | awk -v nFiles=N '$1==nFiles' \ 
    | cut -f2- - \
    | awk '!seen[$0] { print } { ++seen[$0] }' \
    > answer.bed

To break this down:

The bedops --everything command generates a multiset union, which is piped to a bedmap operation that uses --count and --exact to count how many times there is an exact match between one element in the unioned set and all other elements in all input files ("sets"). The --echo operation prints the count and the element to standard output.

We use awk and cut to filter this result to any input element that shows up N times (if it maps N times, we know it is found in common to all N input sets.

The final awk statement filters duplicate elements, leaving us a single copy of elements that exist in common to all N input sets.

As the linked BEDOPS example page describes, this technique avoids calculating cxN-1 comparisons, which can be required with other approaches. 

In addition to being efficient, this is a useful general approach, in that you can easily relax the constraints. This solution applies the most stringent requirements: all BED elements in the output set must match exactly, and they must be found in all sets.

However, your experiment might require modifying the mapping step to specify some fractional overlap between elements — instead of --exact matches, you could use bedmap --fraction-* operations — or you could modify the first awk filter step, if you only need some M or greater number of common matches, where 0 < M < N input sets. It's easy to tweak the pipeline to these requirements.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Alex Reynolds29k
0
gravatar for sebastianzeki0
5.3 years ago by
United Kingdom
sebastianzeki0180 wrote:

Would that allow me find the commonalities between 204 files? 

ADD COMMENTlink written 5.3 years ago by sebastianzeki0180
0
gravatar for ashwini
5.3 years ago by
ashwini100
India
ashwini100 wrote:
$ cat file1.bed file2.bed .. file204.bed | sort --parallel=30 | uniq -c | awk '{if ($1==204)print}' > common.bed

I am not sure whether this is the efficient way of doing but you can give a try.

ADD COMMENTlink written 5.3 years ago by ashwini100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 848 users visited in the last hour