bedtools fisher interpretation of "in -a" "not in -b"
0
1
Entering edit mode
5.6 years ago
jomo018 ▴ 720

I am using bedtools fisher as follows:

bedtools fisher -a $fA -b$fB -g $fG >$fOut


Where fA consists of wide peaks and fB narrower peaks. The results are:

 Number of query intervals: 741
Number of db intervals: 18079
Number of overlaps: 1481
Number of possible intervals (estimated): 146318
phyper(1481 - 1, 741, 146318 - 741, 18079, lower.tail=F)
Contingency Table Of Counts
_________________________________________
|  in -b       | not in -b    |
in -a | 1481         | 0            |
not in -a | 16598        | 128239       |
_________________________________________
p-values for fisher's exact test
left    right   two-tail    ratio
1   0   0   inf


The cell in -a in -b (1481) is fine because several of fB peaks may fall within one fA peak. However, there are about 150 fA peaks which do not have any overlap with fB (as found using bedtools intersect (counting unique ID only) and also verified in a browser).

So what is the meaning of cell in -a not in -b showing zero?

fisher bedtools intersection • 3.3k views
0
Entering edit mode

Just to make absolutely sure, you should sort your BED files prior to running this command. BEDTools assumes that they are sorted:

Also be aware of the default parameters regarding overlap:

• -f Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
• -r Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
• -s Force “strandedness”. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
• -S Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand.
• -split Treat “split” BAM (i.e., having an “N” CIGAR operation) or BED12 entries as distinct BED intervals.
0
Entering edit mode

Thank you Kevin. The files are all sorted as required. I have posted the command so you can see I am just using default values. The problem is that any constraints on -f, -r, -s and -S will further reduce overlaps and increase non-overlapping peaks. But I already have 150 non-overlapping peaks (appearing in a but not in b) which are not reported in the top-right cell of the table report.

0
Entering edit mode

Can you paste some of these example regions (from the #150 non-overlappers) from the input BED files? Also which genome are you using as \$fG ?

0
Entering edit mode

Genome is hg38.

Here is an example of non-overlappers:

From File a:

chr1    28641883    28645188    1_MACS_peak_686_lociStitched    521 .
chr1    28881968    28903124    6_MACS_peak_699_lociStitched    66  .
**chr1  30718699    30738774    6_MACS_peak_715_lociStitched    621**   .
chr1    30759728    30784018    8_MACS_peak_726_lociStitched    46  .
chr1    31927138    31958218    8_MACS_peak_759_lociStitched    120 .


From file b:

chr1    30659797    30659940    U651    1000    .
chr1    30678239    30678635    U652    730 .
chr1    30707503    30707899    U653    889 .
chr1    30745741    30746137    U654    1000    .
chr1    30768889    30769285    U655    1000    .
chr1    30778226    30778622    U656    1000    .


And another example: From file a:

chr3    14230153    14236363    1_MACS_peak_19060_lociStitched  617 .
chr3    14402570    14415604    3_MACS_peak_19067_lociStitched  63  .
***chr3 33899725    33915857    5_MACS_peak_19183_lociStitched  237 .***
chr3    36990979    36994791    1_MACS_peak_19199_lociStitched  566 .
chr3    37985706    37999065    3_MACS_peak_19208_lociStitched  474 .


From file b:

chr3    33798287    33798707    U11726  1000    .
chr3    33799701    33800097    U11727  582 .
chr3    33889576    33889935    U11728  1000    .
chr3    33936202    33936615    U11729  1000    .
chr3    33962503    33962899    U11730  770 .
chr3    33962702    33963098    U11731  770 .

0
Entering edit mode

Weird... and there's nothing odd about your genome file such that these particular intervals could be entirely overlooked?

0
Entering edit mode

The genome file is just a list of CHR sizes. Here is the section containing CHR1 through CHR3 which is relevant for the examples above:

chr1    248956422
chr10   133797422
chr11   135086622
chr11_KI270721v1_random 100316
chr12   133275309
chr13   114364328
chr14   107043718
chr14_GL000009v2_random 201709
chr14_GL000225v1_random 211173
chr14_KI270722v1_random 194050
chr14_GL000194v1_random 191469
chr14_KI270723v1_random 38115
chr14_KI270724v1_random 39555
chr14_KI270725v1_random 172810
chr14_KI270726v1_random 43739
chr15   101991189
chr15_KI270727v1_random 448248
chr16   90338345
chr16_KI270728v1_random 1872759
chr17   83257441
chr17_GL000205v2_random 185591
chr17_KI270729v1_random 280839
chr17_KI270730v1_random 112551
chr18   80373285
chr19   58617616
chr1_KI270706v1_random  175055
chr1_KI270707v1_random  32032
chr1_KI270708v1_random  127682
chr1_KI270709v1_random  66860
chr1_KI270710v1_random  40176
chr1_KI270711v1_random  42210
chr1_KI270712v1_random  176043
chr1_KI270713v1_random  40745
chr1_KI270714v1_random  41717
chr2    242193529
chr20   64444167
chr21   46709983
chr22   50818468
chr22_KI270731v1_random 150754
chr22_KI270732v1_random 41543
chr22_KI270733v1_random 179772
chr22_KI270734v1_random 165050
chr22_KI270735v1_random 42811
chr22_KI270736v1_random 181920
chr22_KI270737v1_random 103838
chr22_KI270738v1_random 99375
chr22_KI270739v1_random 73985
chr2_KI270715v1_random  161471
chr2_KI270716v1_random  153799
chr3    198295559

0
Entering edit mode

Looks like it should be reported as a bug at the Google Groupsage: https://groups.google.com/forum/#!forum/bedtools-discuss

There may be some internal filter that's not reported in the manual. It should be calling those as unique to A, though.

0
Entering edit mode

Just to clarify, bedtools intersect does report correctly only overlapping peaks and ignores non-overlapping peaks like the ones above.

0
Entering edit mode

Thanks - that's encouraging at least.

0
Entering edit mode

5 years later, I have the same issue.... with several files. This means that these pvalues aren't accurate. What other tool could we use to test the statistics ? Thanks

0
Entering edit mode

Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply as appropriate. I've moved your post to a comment this time.