Question: bedtools fisher interpretation of "in -a" "not in -b"
0
13 months ago by
jomo018470
jomo018470 wrote:

I am using bedtools fisher as follows:

bedtools fisher -a \$fA -b \$fB -g \$fG > \$fOut

Where fA consists of wide peaks and fB narrower peaks. The results are:

`````` Number of query intervals: 741
Number of db intervals: 18079
Number of overlaps: 1481
Number of possible intervals (estimated): 146318
phyper(1481 - 1, 741, 146318 - 741, 18079, lower.tail=F)
Contingency Table Of Counts
_________________________________________
|  in -b       | not in -b    |
in -a | 1481         | 0            |
not in -a | 16598        | 128239       |
_________________________________________
p-values for fisher's exact test
left    right   two-tail    ratio
1   0   0   inf
``````

The cell `in -a in -b (1481)` is fine because several of fB peaks may fall within one fA peak. However, there are about 150 fA peaks which do not have any overlap with fB (as found using `bedtools intersect` (counting unique ID only) and also verified in a browser).

So what is the meaning of cell `in -a not in -b` showing zero ?

fisher intersection bedtools • 623 views
written 13 months ago by jomo018470

Just to make absolutely sure, you should sort your BED files prior to running this command. BEDTools assumes that they are sorted:

Also be aware of the default parameters regarding overlap:

• -f Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
• -r Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
• -s Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
• -S Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand.
• -split Treat “split” BAM (i.e., having an “N” CIGAR operation) or BED12 entries as distinct BED intervals.

Thank you Kevin. The files are all sorted as required. I have posted the command so you can see I am just using default values. The problem is that any constraints on -f, -r, -s and -S will further reduce overlaps and increase non-overlapping peaks. But I already have 150 non-overlapping peaks (appearing in a but not in b) which are not reported in the top-right cell of the table report.

Can you paste some of these example regions (from the #150 non-overlappers) from the input BED files? Also which genome are you using as `\$fG` ?

Genome is hg38.

Here is an example of non-overlappers:

From File a:

``````chr1    28641883    28645188    1_MACS_peak_686_lociStitched    521 .
chr1    28881968    28903124    6_MACS_peak_699_lociStitched    66  .
**chr1  30718699    30738774    6_MACS_peak_715_lociStitched    621**   .
chr1    30759728    30784018    8_MACS_peak_726_lociStitched    46  .
chr1    31927138    31958218    8_MACS_peak_759_lociStitched    120 .
``````

From file b:

``````chr1    30659797    30659940    U651    1000    .
chr1    30678239    30678635    U652    730 .
chr1    30707503    30707899    U653    889 .
chr1    30745741    30746137    U654    1000    .
chr1    30768889    30769285    U655    1000    .
chr1    30778226    30778622    U656    1000    .
``````

And another example: From file a:

``````chr3    14230153    14236363    1_MACS_peak_19060_lociStitched  617 .
chr3    14402570    14415604    3_MACS_peak_19067_lociStitched  63  .
***chr3 33899725    33915857    5_MACS_peak_19183_lociStitched  237 .***
chr3    36990979    36994791    1_MACS_peak_19199_lociStitched  566 .
chr3    37985706    37999065    3_MACS_peak_19208_lociStitched  474 .
``````

From file b:

``````chr3    33798287    33798707    U11726  1000    .
chr3    33799701    33800097    U11727  582 .
chr3    33889576    33889935    U11728  1000    .
chr3    33936202    33936615    U11729  1000    .
chr3    33962503    33962899    U11730  770 .
chr3    33962702    33963098    U11731  770 .
``````

Weird... and there's nothing odd about your genome file such that these particular intervals could be entirely overlooked?

The genome file is just a list of CHR sizes. Here is the section containing CHR1 through CHR3 which is relevant for the examples above:

``````chr1    248956422
chr10   133797422
chr11   135086622
chr11_KI270721v1_random 100316
chr12   133275309
chr13   114364328
chr14   107043718
chr14_GL000009v2_random 201709
chr14_GL000225v1_random 211173
chr14_KI270722v1_random 194050
chr14_GL000194v1_random 191469
chr14_KI270723v1_random 38115
chr14_KI270724v1_random 39555
chr14_KI270725v1_random 172810
chr14_KI270726v1_random 43739
chr15   101991189
chr15_KI270727v1_random 448248
chr16   90338345
chr16_KI270728v1_random 1872759
chr17   83257441
chr17_GL000205v2_random 185591
chr17_KI270729v1_random 280839
chr17_KI270730v1_random 112551
chr18   80373285
chr19   58617616
chr1_KI270706v1_random  175055
chr1_KI270707v1_random  32032
chr1_KI270708v1_random  127682
chr1_KI270709v1_random  66860
chr1_KI270710v1_random  40176
chr1_KI270711v1_random  42210
chr1_KI270712v1_random  176043
chr1_KI270713v1_random  40745
chr1_KI270714v1_random  41717
chr2    242193529
chr20   64444167
chr21   46709983
chr22   50818468
chr22_KI270731v1_random 150754
chr22_KI270732v1_random 41543
chr22_KI270733v1_random 179772
chr22_KI270734v1_random 165050
chr22_KI270735v1_random 42811
chr22_KI270736v1_random 181920
chr22_KI270737v1_random 103838
chr22_KI270738v1_random 99375
chr22_KI270739v1_random 73985
chr2_KI270715v1_random  161471
chr2_KI270716v1_random  153799
chr3    198295559
``````

Looks like it should be reported as a bug at the Google Groupsage: https://groups.google.com/forum/#!forum/bedtools-discuss

There may be some internal filter that's not reported in the manual. It should be calling those as unique to A, though.

Just to clarify, `bedtools intersect` does report correctly only overlapping peaks and ignores non-overlapping peaks like the ones above.