Bedtools coverage adding feature scores
2
0
Entering edit mode
6.1 years ago

I am trying to calculate the coverage of features in a gff file and a bed file containing the windows.

The bed file (windows_test.bed), note 0 based:

chr1    0       50      1
chr1    50      100000  2


The gff file (test.gff), note 1 based:

chr1    Program     HS4     51      100353571       5       +       0       .


The score for this feature is 5

When I run bedtools coverage I get (note, using version v2.22.1 "Returns the depth and breadth of coverage of features from A on the intervals in B":

bedtools coverage -a test.gff -b windows_test.bed

chr1    50      100000  2       1       99950   99950   1.0000000
chr1    0       50      1       0       0       50      0.0000000


I get the total number of features in the chr1 50 100000 range (1) but not the total scores (5), how can I do to get this?

Thanks!

bedtools coverage score • 2.9k views
0
Entering edit mode
4.4 years ago

I think that you will want to use bedtools map rather than (or in addition to) bedtools coverage.

The documentation for map is at http://bedtools.readthedocs.io/en/latest/content/tools/map.html ; it states "bedtools map allows one to map overlapping features in a B file onto features in an A file and apply statistics and/or summary operations on those features." Sum is one of the operations available, and in fact is the default statistic calculated: "By default, map computes the sum of the 5th column (the score field for BED format) for all intervals in B that overlap each interval in A." There's an example of using it for average (which should be easily modifiable to get the sum instead) in another biostars post at How To Extract Scores From Bedgraph File Using Bed Tools .

0
Entering edit mode
4.4 years ago

You can duplicate this and add more operations with bedmap, i.e. start with:

$bedmap --echo --count --bases --echo-ref-size --delim '\t' windows.bed <(gff2bed test.gff) | awk '{ print$0"\t"($5/$6); }' > answer.bed


The --echo option reports the elements in windows.bed.

The --count option reports the number of features in the map file (the GFF file) that overlap the reference file (the windows.bed file) by one or more bases.

The --bases option reports the number of bases that overlap between window and GFF annotations.

The --echo-ref-size reports the length of the target element (for each target element).

Piping to awk does the same calculations as coverageBed.

To get to the point of your question, you can then start adding other statistical operands available in bedmap, such as --sum to get the sum of scores over the overlapping elements.

You could add this and other operands after --echo-ref-size — or even before it, with adjustments to the awk command downstream. Operand results are all separated by the tab delimiter (--delim '\t'), so whatever order you put the operands in will be the order in which answers come out.

See the following for a full review of score operations: http://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html#score-operations