Question: Finding An Genomic Intervals Shared For A Given Individual As Called By Two Or More Algorithms
1
gravatar for Ryan D
7.3 years ago by
Ryan D3.3k
USA
Ryan D3.3k wrote:

Sorry for the rather inelegant title. An example may illustrate this best.

Given a dataset where genomic intervals have been identified by one or more algorithms such as this:

ind    chr    start    end    alg
A    chr1    100    1000    alg1
A    chr1    150    1200    alg2
A    chr1    5000    6000    alg1
A    chr1    7000    8000    alg2

I'd like to get back ONLY intervals that are shared for an individual by two or more algorithms. For the above, this would be:

ind    chr    start    end    alg
A    chr1    150    1000    alg1/alg2

Galaxy offers a cluster tool, but it operates on the entire dataset without regard to verifying each individual is the same before clustering. Is there a tool that can make sure a column (or columns) match a condition before carrying out this kind of clustering function for shared intervals?

Thanks,

Rx

galaxy clustering • 2.8k views
ADD COMMENTlink modified 5.2 years ago by Biostar ♦♦ 20 • written 7.3 years ago by Ryan D3.3k
7
gravatar for Rob Syme
7.3 years ago by
Rob Syme530
Perth, Western Australia
Rob Syme530 wrote:

It sounds like you're looking for bedtools, which allows you to find the intersection of regions (among other things):

$ cat algo1.bed
chr1    100 1000
chr1    5000    6000
$ cat algo2.bed
chr1    150 1200
chr1    7000    8000
$ intersectBed -a algo1.bed -b algo2.bed
chr1    150 1000
ADD COMMENTlink written 7.3 years ago by Rob Syme530
1

One possible solution is the awk command that aaronQuinlan has suggested below. This would split your original file into bed files grouped by the first and fifth column. Is this what you are after?

ADD REPLYlink written 7.3 years ago by Rob Syme530

This looks useful. But most of the examples on the bedtools page seem to be good for intersecting bed files without pre-specifying that another column (like ID) should match. I have around 7k samples and wish to make sure that merges are only done in cases where samples match. Am I missing something? Could you possibly give an example of how to carry out the function above only when a fourth column matches?

ADD REPLYlink written 7.3 years ago by Ryan D3.3k

I think I can imagine a solution to this that uses multiintersectBed and awk as below. It will make a lot of temporary files, but those can be cat-ed together at the end.

Thanks.

ADD REPLYlink written 7.3 years ago by Ryan D3.3k
5
gravatar for Aaronquinlan
7.3 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

You may also check multiIntersectBed, which is part of the bedtools package and is discussed towards the bottom (second answer from me) of this Biostar thread. I plan to release this on the Galaxy Tool Shed in the next month.

EDIT: More explicit example.

Extend your example two to individuals, A and B.

$ cat all.txt 
A   chr1    100 1000    alg1
A   chr1    150 1200    alg2
A   chr1    5000    6000    alg1
A   chr1    7000    8000    alg2 
B   chr1    100 1000    alg1
B   chr1    150 1200    alg2
B   chr1    5000    6000    alg1
B   chr1    7000    8000    alg2

Use awk to split the file into distinct ind/alg files.

awk '{outfile=$1"_"$5".bed"; print $2"\t"$3"\t"$4 >> outfile}' all.txt

List the resulting files as a sanity check. Note the use of ">>" to create and append to files named (outfile=) based on the ind ($1) and the alg. ($5):

ls -1 *_*.bed
A_alg1.bed
A_alg2.bed
B_alg1.bed
B_alg2.bed

Use multiIntersectBed to find intervals that are common to multiple files. The fourth column is the count of files in which the interval is present. The fifth is a list of the file labels (-names argument) in which the intervals were found. The rest of the columns are T/F indicators of whether the interval was found in each file.

multiIntersectBed -i A_alg1.bed A_alg2.bed B_alg1.bed B_alg2.bed -names A1 A2 B1 B2
chr1    100 150 2   A1,B1   1   0   1   0
chr1    150 1000    4   A1,A2,B1,B2 1   1   1   1
chr1    1000    1200    2   A2,B2   0   1   0   1
chr1    5000    6000    2   A1,B1   1   0   1   0
chr1    7000    8000    2   A2,B2   0   1   0   1

You can then use awk or a simple Perl script to limit the results to those intervals involving two or more algs. for the same individual.

Or, more simply, you could just do a separate command for each individual and just look for output where column 4 is >= 2:

multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 
chr1    100 150 1   A1  1   0
chr1    150 1000    2   A1,A2   1   1
chr1    1000    1200    1   A2  0   1
chr1    5000    6000    1   A1  1   0
chr1    7000    8000    1   A2  0   1

multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 | awk '$4>1'
chr1    150 1000    2   A1,A2   1   1

I hope this helps.

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Aaronquinlan10k

Very nice. Thanks, Aaron, for the extended treatment of the problem.

ADD REPLYlink written 7.3 years ago by Ryan D3.3k

No problem...one note, each atomic BED file must be sorted by chrom then start for mBI to behave.

ADD REPLYlink written 7.3 years ago by Aaronquinlan10k

...and obviously, if you just sort your master file this way ahead of time, you will be fine.

ADD REPLYlink written 7.3 years ago by Aaronquinlan10k

Hello, this works great to compare multiple bed files. Though I was wondering if there is an option in multiIntersectBed to output the number of bp overlap between multiple bed files (similar to bedtools intersect -wao option). Please let me know, thank you!

ADD REPLYlink written 14 months ago by varsha61970
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1913 users visited in the last hour