Finding ChIP-seq overlaps with BED files
5.9 years ago
morovatunc ▴ 470

Hello,

I have written here before about finding overlaps, and I have reached a point where I am very confused. I have tried several methods for finding overlaps, but none of them seems logical to me. I have tried bedtools multiinter, bedops and bedmap. Please help me find a way to get these overlaps.

My data consists of 20 files (13 tumour, 7 normal). All of them are BED files. What I want to know:

1) Overlapping peaks of both datasets.

2) Overlaps shared by n samples, from unique (n = 1) up to n = 13 for tumour or n = 7 for normal.

3) bedtools multiinter does this pretty well. However, I realised that it creates false positive overlaps (2bp region of overlap which makes no sense).

4) With bedtools intersectBed, I have to make combinations of all of the samples, which produces an enormous number of combinations and confuses me a lot.

Can somebody who has done this before help me out? It should not be that hard.

Thank you very much

Tunc

ChIP-Seq bedtools bedmap bedops

"2bp region of overlap which makes no sense" --> why does it make no sense? 1bp overlap is still an overlap if you do not set a minimum number of bp


Tool:    bedtools multiinter (aka multiIntersectBed)
Version: v2.24.0
Summary: Identifies common intervals among multiple
         BED/GFF/VCF files.

Usage:   bedtools multiinter [OPTIONS] -i FILE1 FILE2 .. FILEn
         Requires that each interval file is sorted by chrom/start.

Options:
    -cluster        Invoke Ryan Layers's clustering algorithm.

    -header         Print a header line.
                    (chrom/start/end + names of each file).

    -names          A list of names (one/file) to describe each file in -i.
                    These names will be printed in the header line.

    -g              Use genome file to calculate empty regions.
                    - STRING.

    -empty          Report empty regions (i.e., start/end intervals w/o
                    values in all files).
                    - Requires the '-g FILE' parameter.

    -filler TEXT    Use TEXT when representing intervals having no value.
                    - Default is '0', but you can use 'N/A' or any text.

    -examples       Show detailed usage examples.

Error: missing file names (-i) to combine.

This is the help output of multiinter. Can you please tell me how to specify that? Thank you.
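The help text above lists no option for a minimum interval width, so one possible workaround is to filter the multiinter output afterwards. The sketch below assumes the default output layout (chrom, start, end, number of files covering the interval, list of files, then one column per file). The function name `filter_multiinter`, the 50 bp threshold and the sample rows are illustrative assumptions, not values from this thread.

```shell
# Keep only multiinter intervals that are at least MIN_LEN bp wide and
# are covered by at least MIN_FILES input files.
# Assumed columns: 1=chrom 2=start 3=end 4=number of files covering the interval
filter_multiinter() {
    awk -v min_len="$1" -v min_files="$2" \
        'BEGIN { FS = OFS = "\t" } ($3 - $2) >= min_len && $4 >= min_files'
}

# Made-up rows standing in for the output of
#   bedtools multiinter -i a.bed b.bed c.bed
sample=$(printf 'chr1\t100\t102\t2\t1,2\t1\t1\t0\nchr1\t200\t400\t3\t1,2,3\t1\t1\t1\nchr1\t500\t900\t1\t3\t0\t0\t1\n')

# Intervals of at least 50 bp that are shared by at least 2 files
printf '%s\n' "$sample" | filter_multiinter 50 2
```

Note that multiinter reports the shared sub-intervals, so the width tested here is the width of the shared segment, not of the original peaks.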

5.9 years ago
1. Overlapping peaks of both datasets.

First, if not sorted, make sure that your peak, tumour and normal BED files are sorted, e.g.:

$ sort-bed tumour01.unknown_sort_state.bed > tumour01.bed

Repeat sorting for the remaining peak, tumour and normal BED files, as needed. You only have to sort once, at the beginning.

Take the multiset union of your tumour BED files with bedops, and pipe that unioned set to a second bedops command, to find peaks that overlap at least one element from the tumour sets:

$ bedops --everything tumour01.bed tumour02.bed ... tumour13.bed | bedops --element-of 1 peaks.bed - > peaks_overlapping_tumour_sets.bed

Or at least one element from the normal sets:

$ bedops --everything normal01.bed normal02.bed ... normal07.bed | bedops --element-of 1 peaks.bed - > peaks_overlapping_normal_sets.bed

Or at least one element from either category:

$ bedops --everything tumour01.bed tumour02.bed ... tumour13.bed normal01.bed normal02.bed ... normal07.bed | bedops --element-of 1 peaks.bed - > peaks_overlapping_tumour_and_normal_sets.bed


If you're trying to do something else, please clarify the kind of set operation or association that you want to do.

For example, do you need to know which tumour or normal element's subset overlaps with a particular peak? The bedmap tool can help you here, but you need to preprocess your tumor and normal element subsets, first. Feel free to follow up.

2. Overlaps shared by n samples, from unique (n = 1) up to n = 13 for tumour or n = 7 for normal.

You can use a generalization of this approach for finding elements common to all N subsets. For example, for N=13, where A.bed through N.bed are your 13 tumour element sets:

$ N=13
$ bedops --everything A.bed B.bed C.bed ... N.bed \
    | bedmap --count --echo --delim '\t' - \
    | uniq \
    | awk -v N="${N}" '$1==N' \
    | cut -f2- \
    > common_to_all_N_tumour_subsets.bed


You can modify this approach for N-1 (12) subsets, N-2 (11) subsets, and so on, by modifying the awk test:

$ N=13
$ bedops --everything A.bed B.bed C.bed ... N.bed \
    | bedmap --count --echo --delim '\t' - \
    | uniq \
    | awk -v N="${N}" '$1==(N-1)' \
    | cut -f2- \
    > common_to_N_minus_1_tumour_subsets.bed


You would repeat this for N=7 for your seven normal set files.
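Rather than rerunning the pipeline for every value of the test, the counted output could be produced once and then split or tallied by its first column. The sketch below assumes input shaped like the output of `bedmap --count --echo --delim '\t' -` (count in column 1, BED fields after it). The function names `tally_by_count` and `elements_with_count` and the sample rows are made up for illustration. Keep in mind that the count is the number of overlapping elements in the pooled set, which matches the number of files only if each file contributes at most one element to a given overlap.

```shell
# Print "k<TAB>number of elements whose overlap count is k", sorted by k.
# As in the pipeline above, `uniq` drops adjacent identical lines first.
tally_by_count() {
    uniq | awk 'BEGIN { FS = OFS = "\t" } { n[$1]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Print the BED fields of the elements whose overlap count equals K.
elements_with_count() {
    uniq | awk -v k="$1" 'BEGIN { FS = "\t" } $1 == k' | cut -f2-
}

# Made-up rows standing in for: bedops --everything ... | bedmap --count --echo --delim '\t' -
counted=$(printf '3\tchr1\t100\t200\n3\tchr1\t120\t210\n3\tchr1\t150\t260\n2\tchr2\t50\t90\n2\tchr2\t50\t90\n1\tchr3\t10\t20\n')

printf '%s\n' "$counted" | tally_by_count
printf '%s\n' "$counted" | elements_with_count 2
```

Looping `elements_with_count` over k = 1 to N would give one file per level of sharing.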

Once you have files common_to_*.bed that you need, you can use bedops or bedmap with each of them to do overlap or association tests with peaks, e.g.:

$ bedmap --echo --echo-map peaks.bed common_to_all_N_tumour_subsets.bed > common_tumour_elements_that_overlap_each_peak.bed

Dear Alex,

Thank you for your detailed answer. I followed the protocol at http://bedops.readthedocs.org/en/latest/content/usage-examples/multiple-inputs.html#multiple-inputs and it gave me peaks within groups. I guess it will give me the same results. However, I did not understand the part where we compare both groups. Should I merge all peak files into one BED file and run the line below?

$ bedmap --count --echo --delim '\t' all_bed_files.bed

Also, you used bedops --element-of 1 for finding overlaps, but I used bedmap. Would there be a significant difference?

Thank you very much for your patience in helping me.


Can you explain what you mean by "compare both groups"? Do you want to compare the peak-overlaps-with-tumour set against the peak-overlaps-with-normal set?

To answer your second question, bedops --element-of 1 just reports an overlap. It won't tell you the associated element that overlaps. To report that association (or "map") you would use bedmap.


Alex, exactly like you said. I want to compare normal vs tumour. However, I would achieve this by putting all of them into the same BED file and then running bedmap --count --echo. Would you prefer another way?

Perhaps you want the following:

$ bedmap --echo --count --fraction-both 0.5 peaks.bed tumours.bed > peaks_with_counts_of_overlapping_tumours.bed
$ bedmap --echo --count --fraction-both 0.5 peaks.bed normals.bed > peaks_with_counts_of_overlapping_normals.bed

You might also count the number of overlaps in common:

$ bedmap --echo --count --fraction-both 0.5 peaks.bed <(bedops --everything tumours.bed normals.bed) > peaks_with_counts_of_overlapping_tumours_and_normals.bed


From these three count numbers, you can build a two-set Venn or Euler diagram of overlap events: The number of overlaps unique to tumours, the number of overlaps unique to normals, and the number of overlaps common to both tumours and normals.

This first pass is a fairly naive approach. You may want to think about normalization with this approach, since a 13-tissue set will likely have more elements than a 7-tissue set, and, by chance, the number of overlap events you get with tumours could be overrepresented by virtue of simply having more elements to start with. You might use bedops to count how many elements are common within the 13 tumour sets, and separately with the 7 normal sets, to determine how to normalize counts of both tumour and normal together.

In any case, please note the use of --fraction-both 0.5 with bedmap, which requires the overlap to cover at least half of the peak element and at least half of the tumour or normal element. This avoids counting an event as "common" where a tumour element only overlaps on one side of the peak and a normal element only overlaps on the other side. Requiring 50% or more coverage means that the elements counted as common also overlap one another.

If this isn't clear, draw out three generic intervals on a line and enumerate the different ways overlap events can occur between the three intervals.
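The Venn-style tallies described above could also be derived per peak from the first two count files. This is a sketch of one possible bookkeeping, not necessarily the calculation intended in the answer. It assumes both files come from `bedmap --echo --count` runs against the same peaks.bed (so line i refers to the same peak in both), that bedmap's default `|` delimiter separates the echoed peak from the count, and that no peak field contains `|`. The function name `venn_counts` and the sample rows are hypothetical.

```shell
# Classify each peak by whether its tumour count and normal count are non-zero.
# $1 = peaks with tumour counts, $2 = peaks with normal counts (same peak order).
venn_counts() {
    awk -F'|' '
        NR == FNR { tumour[FNR] = $NF; next }   # first file: remember the tumour count per line
        {
            t = tumour[FNR] + 0                 # tumour count for this peak
            n = $NF + 0                         # normal count for this peak
            if (t > 0 && n > 0) both++
            else if (t > 0)     tumour_only++
            else if (n > 0)     normal_only++
            else                neither++
        }
        END {
            printf "tumour_only\t%d\nnormal_only\t%d\nboth\t%d\nneither\t%d\n",
                   tumour_only, normal_only, both, neither
        }' "$1" "$2"
}

# Made-up rows standing in for the two bedmap outputs
workdir=$(mktemp -d)
printf 'chr1\t100\t200|3\nchr1\t500\t600|0\nchr2\t10\t90|2\nchr2\t300\t400|0\n' > "$workdir/tumour_counts.txt"
printf 'chr1\t100\t200|1\nchr1\t500\t600|4\nchr2\t10\t90|0\nchr2\t300\t400|0\n' > "$workdir/normal_counts.txt"

venn_counts "$workdir/tumour_counts.txt" "$workdir/normal_counts.txt"
```

The three non-empty classes correspond to the two outer regions and the intersection of a two-set diagram. The normalization caveat about 13 versus 7 input sets still applies to these raw tallies.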


Alex,

Thank you for your answer. It solved my problem, and this is actually what I want. But I have one last question. When I do this:

$ bedmap --count --echo --echo-map-id-uniq --mean --fraction-both .95 --delim "\t" bedops_merge_normalall.bed > answer1.bed.txt

I will explain it by example:

Say we have 5 regions that overlap one another. bedmap reports the overlaps for each of them in turn, which creates duplicates, and these duplicates may mess up the calculations. So my question is: how can I get rid of these duplicates? What I did was take only the unique values. Since they are numbers with 4 decimal places, I think taking only the unique ones won't cause a big problem?

region -> overlapping regions -> average
A -> B,C,D,E -> 15
B -> A,C,D,E -> 15
C -> A,B,D,E -> 15
D -> A,B,C,E -> 15
E -> A,B,C,D -> 15

5.9 years ago

I honestly read that thread 20 times. As I mentioned in my 3rd point, the multiinter approach causes problems such as false positives. And as I mentioned in my 4th point, since I have so many files, I asked about alternative methods. I did not start this thread without reading the existing threads. I am aware that intersectBed, bedops and bedmap are possible ways to solve this.

2.8 years ago
morovatunc ▴ 470

http://homer.ucsd.edu/homer/ngs/mergePeaks.html

However, heads up: the author does not seem to respond to questions about problems with the software.


"based on my experience"