Bedops element or merge operation_which one is better to use in predicting regulated genes?
1
0
Entering edit mode
9.7 years ago
xiaoyonf ▴ 60

Hi, I have two bed files from ChIP-seq of two different transcription factors. I want to predict the genes with a binding site of both of the two transcription factors +/-20kb of TSS. I can perform the following:

  1. Find the binding sites within one of bed files which overlaps with the sites in another bed file, by using --element-of and then use the output to predict genes;
  2. Or I can use merge operation to produce a bed files including the merged binding seq of two transcription factors, and then use this new bed file to predict genes.

Which one should I do?

Thanks for help!

Xiaoyong

ChIP-Seq • 3.0k views
ADD COMMENT
5
Entering edit mode
9.7 years ago

It's not 100% clear to me what you are doing, but let's start with two sorted BED files:

  • ChipA.bed
  • ChipB.bed

The operation that matches the wording of your question #1 has the following structure:

$ bedops --element-of -1 ChipA.bed ChipB.bed > AnswerQ1A.bed

The file AnswerQ1A.bed is made up of elements from ChipA.bed that overlap elements in ChipB.bed by one or more bases. This preserves columnar data in ChipA.bed.

Likewise, the following command gives different output:

$ bedops --element-of -1 ChipB.bed ChipA.bed > AnswerQ1B.bed

The file AnswerQ1B.bed is made up of elements from ChipB.bed that overlap elements in ChipA.bed by one or more bases. This, too, preserves columnar data in ChipB.bed.

What you might be interested is the set of overlapping elements common to both input sets:

$ bedops --element-of -1 ChipA.bed ChipB.bed > ChipAOverlappingB.bed
$ bedops --element-of -1 ChipB.bed ChipA.bed > ChipBOverlappingA.bed 
$ bedops --everything ChipAOverlappingB.bed ChipBOverlappingA.bed > ChipAIfAndOnlyIfOverlappingB.bed

Basically, use --element-of if you need more than the input sets' genomic regions, i.e. to preserve columnar data (IDs, scores, etc.). You might want to further filter your element sets by other criteria contained in these extra columns.

The following operation that matches your question #2 gives a different answer:

$ bedops --merge ChipA.bed ChipB.bed > AnswerQ2.bed

The file AnswerQ2.bed is made up of just genomic regions that are calculated from the genomic regions in both ChipA.bed and ChipB.bed. This second answer does not preserve columnar data. The output will not tell you where the region originally came from (though simple set operations back on ChipA.bed and ChipB.bed answer that question).

You might want this instead, if you are looking to discover genes in a set of genomic coordinates, and if the columnar data in ChipA.bed and ChipB.bed is not important or relevant. This will depend on your experiment.

For instance, if you want to look at the merged regions only within +/- 20 kbase of a set of TSSs, then (assuming you have a sorted BED file containing TSSs) you could do the following:

$ bedops --merge --range 20000 TSSs.bed \
    | bedops --element-of -1 AnswerQ2.bed - \
    > mergedChipABRegionsThatOverlap20kbWindowsAroundTSSs.bed

This last file is a constrained set of merged regions from AnswerQ2.bed - you might only be interested in focusing downstream searches in this subset of regions.

ADD COMMENT
0
Entering edit mode

Hi Alex,

Thank you so much for your detailed answer! Basically, I understand your explanation. However, I have another question not clearly described in my first post.

Say, I want to make a Venn diagram based on three ChIP.bed files. Each area within the diagram with distinctive or overlapping area needs a number to fill in to describing how all the regions in all 3 bed files distribute. How can I achieve this? One problem I have is: I did the operation exactly the same as you explained in your first part of answer. The number of overlapping regions based on Chip A or on Chip B files are different (50 difference). How can I assign the overlapping number between ChipA and ChipB in the Venn diagram?

Thanks,
Xiaoyong

ADD REPLY
2
Entering edit mode

Given two sets A and B, you can form three disjoint sets:

  1. A - (A ⋂ B) : "UniqueA"
  2. A ⋂ B : "UniqueAB"
  3. B - (A ⋂ B) : "UniqueB"

To get the first set via operations with BEDOPS, assuming a one-base overlap criterion for set inclusion or exclusion:

$ bedops --element-of -1 A B | bedops --not-element-of -1 A - > UniqueA

To get the third set:

$ bedops --element-of -1 B A | bedops --not-element-of -1 B - > UniqueB

To get the second set:

$ bedops --everything A B | bedops --not-element-of -1 - UniqueA UniqueB > UniqueAB

Then do wc -l on each result to get element counts.

You can use diff to verify that the multiset union of these three disjoint sets reconstructs the entire set of elements:

$ bedops --everything A B > AllElements
$ bedops --everything UniqueA UniqueB UniqueAB > AllUniqueElements
$ diff -s AllElements AllUniqueElements
ADD REPLY
0
Entering edit mode

This is very helpful! However, if the number of elements belonging to A in "UniqueAB" is different to the number of elements belonging to B in "UniqueAB", how can I assign the number to overlapping region in the Venn Diagram of A x B?

Thanks!

ADD REPLY
0
Entering edit mode

Unless I misunderstand your question, that would be the line count of the second set UniqueAB from my three example subsets.

ADD REPLY

Login before adding your answer.

Traffic: 2998 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6