Finding An Genomic Intervals Shared For A Given Individual As Called By Two Or More Algorithms
2
1
Entering edit mode
12.4 years ago
Ryan D ★ 3.4k

Sorry for the rather inelegant title. An example may illustrate this best.

Given a dataset where genomic intervals have been identified by one or more algorithms such as this:

ind    chr    start    end    alg
A    chr1    100    1000    alg1
A    chr1    150    1200    alg2
A    chr1    5000    6000    alg1
A    chr1    7000    8000    alg2

I'd like to get back ONLY intervals that are shared for an individual by two or more algorithms. For the above, this would be:

ind    chr    start    end    alg
A    chr1    150    1000    alg1/alg2

Galaxy offers a cluster tool, but it operates on the entire dataset without regard to verifying each individual is the same before clustering. Is there a tool that can make sure a column (or columns) match a condition before carrying out this kind of clustering function for shared intervals?

Thanks,

Rx

galaxy clustering • 4.2k views
ADD COMMENT
7
Entering edit mode
12.4 years ago
Rob Syme ▴ 540

It sounds like you're looking for bedtools, which allows you to find the intersection of regions (among other things):

$ cat algo1.bed
chr1    100 1000
chr1    5000    6000
$ cat algo2.bed
chr1    150 1200
chr1    7000    8000
$ intersectBed -a algo1.bed -b algo2.bed
chr1    150 1000
ADD COMMENT
1
Entering edit mode

One possible solution is the awk command that aaronQuinlan has suggested below. This would split your original file into bed files grouped by the first and fifth column. Is this what you are after?

ADD REPLY
0
Entering edit mode

This looks useful. But most of the examples on the bedtools page seem to be good for intersecting bed files without pre-specifying that another column (like ID) should match. I have around 7k samples and wish to make sure that merges are only done in cases where samples match. Am I missing something? Could you possibly give an example of how to carry out the function above only when a fourth column matches?

ADD REPLY
0
Entering edit mode

I think I can imagine a solution to this that uses multiintersectBed and awk as below. It will make a lot of temporary files, but those can be cat-ed together at the end.

Thanks.

ADD REPLY
5
Entering edit mode
12.4 years ago

You may also check multiIntersectBed, which is part of the bedtools package and is discussed towards the bottom (second answer from me) of this Biostar thread. I plan to release this on the Galaxy Tool Shed in the next month.

EDIT: More explicit example.

Extend your example two to individuals, A and B.

$ cat all.txt 
A   chr1    100 1000    alg1
A   chr1    150 1200    alg2
A   chr1    5000    6000    alg1
A   chr1    7000    8000    alg2 
B   chr1    100 1000    alg1
B   chr1    150 1200    alg2
B   chr1    5000    6000    alg1
B   chr1    7000    8000    alg2

Use awk to split the file into distinct ind/alg files.

awk '{outfile=$1"_"$5".bed"; print $2"\t"$3"\t"$4 >> outfile}' all.txt

List the resulting files as a sanity check. Note the use of ">>" to create and append to files named (outfile=) based on the ind ($1) and the alg. ($5):

ls -1 *_*.bed
A_alg1.bed
A_alg2.bed
B_alg1.bed
B_alg2.bed

Use multiIntersectBed to find intervals that are common to multiple files. The fourth column is the count of files in which the interval is present. The fifth is a list of the file labels (-names argument) in which the intervals were found. The rest of the columns are T/F indicators of whether the interval was found in each file.

multiIntersectBed -i A_alg1.bed A_alg2.bed B_alg1.bed B_alg2.bed -names A1 A2 B1 B2
chr1    100 150 2   A1,B1   1   0   1   0
chr1    150 1000    4   A1,A2,B1,B2 1   1   1   1
chr1    1000    1200    2   A2,B2   0   1   0   1
chr1    5000    6000    2   A1,B1   1   0   1   0
chr1    7000    8000    2   A2,B2   0   1   0   1

You can then use awk or a simple Perl script to limit the results to those intervals involving two or more algs. for the same individual.

Or, more simply, you could just do a separate command for each individual and just look for output where column 4 is >= 2:

multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 
chr1    100 150 1   A1  1   0
chr1    150 1000    2   A1,A2   1   1
chr1    1000    1200    1   A2  0   1
chr1    5000    6000    1   A1  1   0
chr1    7000    8000    1   A2  0   1

multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 | awk '$4>1'
chr1    150 1000    2   A1,A2   1   1

I hope this helps.

ADD COMMENT
0
Entering edit mode

Very nice. Thanks, Aaron, for the extended treatment of the problem.

ADD REPLY
0
Entering edit mode

No problem...one note, each atomic BED file must be sorted by chrom then start for mBI to behave.

ADD REPLY
0
Entering edit mode

...and obviously, if you just sort your master file this way ahead of time, you will be fine.

ADD REPLY
0
Entering edit mode

Hello, this works great to compare multiple bed files. Though I was wondering if there is an option in multiIntersectBed to output the number of bp overlap between multiple bed files (similar to bedtools intersect -wao option). Please let me know, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2369 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6