Question

Finding An Genomic Intervals Shared For A Given Individual As Called By Two Or More Algorithms

1

Entering edit mode

12.4 years ago

Ryan D ★ 3.4k

Sorry for the rather inelegant title. An example may illustrate this best.

Given a dataset where genomic intervals have been identified by one or more algorithms such as this:

ind    chr    start    end    alg
A    chr1    100    1000    alg1
A    chr1    150    1200    alg2
A    chr1    5000    6000    alg1
A    chr1    7000    8000    alg2

I'd like to get back ONLY intervals that are shared for an individual by two or more algorithms. For the above, this would be:

ind    chr    start    end    alg
A    chr1    150    1000    alg1/alg2

Galaxy offers a cluster tool, but it operates on the entire dataset without regard to verifying each individual is the same before clustering. Is there a tool that can make sure a column (or columns) match a condition before carrying out this kind of clustering function for shared intervals?

Thanks,

Rx

galaxy clustering • 4.2k views

ADD COMMENT • link updated 10.3 years ago by Biostar 20 • written 12.4 years ago by Ryan D ★ 3.4k

score 7 · Answer 1 · 2011-12-12

7

Entering edit mode

12.4 years ago

Rob Syme ▴ 540

It sounds like you're looking for bedtools, which allows you to find the intersection of regions (among other things):

$ cat algo1.bed
chr1    100 1000
chr1    5000    6000
$ cat algo2.bed
chr1    150 1200
chr1    7000    8000
$ intersectBed -a algo1.bed -b algo2.bed
chr1    150 1000

ADD COMMENT • link 12.4 years ago by Rob Syme ▴ 540

1

Entering edit mode

One possible solution is the awk command that aaronQuinlan has suggested below. This would split your original file into bed files grouped by the first and fifth column. Is this what you are after?

ADD REPLY • link 12.4 years ago by Rob Syme ▴ 540

0

Entering edit mode

This looks useful. But most of the examples on the bedtools page seem to be good for intersecting bed files without pre-specifying that another column (like ID) should match. I have around 7k samples and wish to make sure that merges are only done in cases where samples match. Am I missing something? Could you possibly give an example of how to carry out the function above only when a fourth column matches?

ADD REPLY • link 12.4 years ago by Ryan D ★ 3.4k

0

Entering edit mode

I think I can imagine a solution to this that uses multiintersectBed and awk as below. It will make a lot of temporary files, but those can be cat-ed together at the end.

Thanks.

ADD REPLY • link 12.4 years ago by Ryan D ★ 3.4k

Ram · Answer 2 · 2011-12-12

You may also check multiIntersectBed, which is part of the bedtools package and is discussed towards the bottom (second answer from me) of this Biostar thread. I plan to release this on the Galaxy Tool Shed in the next month.

EDIT: More explicit example.

Extend your example two to individuals, A and B.

$ cat all.txt 
A   chr1    100 1000    alg1
A   chr1    150 1200    alg2
A   chr1    5000    6000    alg1
A   chr1    7000    8000    alg2 
B   chr1    100 1000    alg1
B   chr1    150 1200    alg2
B   chr1    5000    6000    alg1
B   chr1    7000    8000    alg2

Use awk to split the file into distinct ind/alg files.

awk '{outfile=$1"_"$5".bed"; print $2"\t"$3"\t"$4 >> outfile}' all.txt

List the resulting files as a sanity check. Note the use of ">>" to create and append to files named (outfile=) based on the ind ($1) and the alg. ($5):

ls -1 *_*.bed
A_alg1.bed
A_alg2.bed
B_alg1.bed
B_alg2.bed

Use multiIntersectBed to find intervals that are common to multiple files. The fourth column is the count of files in which the interval is present. The fifth is a list of the file labels (-names argument) in which the intervals were found. The rest of the columns are T/F indicators of whether the interval was found in each file.

multiIntersectBed -i A_alg1.bed A_alg2.bed B_alg1.bed B_alg2.bed -names A1 A2 B1 B2
chr1    100 150 2   A1,B1   1   0   1   0
chr1    150 1000    4   A1,A2,B1,B2 1   1   1   1
chr1    1000    1200    2   A2,B2   0   1   0   1
chr1    5000    6000    2   A1,B1   1   0   1   0
chr1    7000    8000    2   A2,B2   0   1   0   1

You can then use awk or a simple Perl script to limit the results to those intervals involving two or more algs. for the same individual.

Or, more simply, you could just do a separate command for each individual and just look for output where column 4 is >= 2:

multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 
chr1    100 150 1   A1  1   0
chr1    150 1000    2   A1,A2   1   1
chr1    1000    1200    1   A2  0   1
chr1    5000    6000    1   A1  1   0
chr1    7000    8000    1   A2  0   1

multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 | awk '$4>1'
chr1    150 1000    2   A1,A2   1   1

I hope this helps.