coverage of dnase-seq narrow peak file of genome
2
0
Entering edit mode
3.4 years ago
kimkes25 ▴ 50

Hello. I want to know what percentage of the genome ( assembly hg19) is covered by the intervals in a bed file. is there a way you know? I saw some functions in bedtools but they dont do it directly. Thanks!

dnase-seq bed coverage • 1.2k views
ADD COMMENT
1
Entering edit mode
3.4 years ago

First, generate intervals for hg19 (perhaps stripping out non-nuclear and mitochondrial chromosomes):

$ fetchChromSizes hg19 | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2; }' | grep -v "[_*_|MT]" | sort-bed - > hg19.nuc.bed

To calculate coverage:

$ bedmap --skip-unmapped --delim '\t' --echo --bases-uniq --echo-ref-size --bases-uniq-f hg19.nuc.bed <(sort-bed peaks.bed) > coverage.bed

The last three columns of coverage.bed are per-chromosome calculations:

  • The number of bases in each hg19 chromosome that have coverage by peaks
  • The length of the chromosome
  • The fraction of each hg19 chromosome that has coverage by peaks (i.e., bases of coverage, divided by chromosome length)

The order of the specified bedmap operands --bases-uniq, --echo-ref-size, and --bases-uniq-f write column values in that same order.

If you want a whole-genome percentage, you could sum the number of bases, sum the chromosome lengths, and divide the two sums at the end:

$ bedmap --skip-unmapped --delim '\t' --echo --bases-uniq --echo-ref-size hg19.nuc.bed <(sort-bed peaks.bed) | awk -v FS="\t" -v OFS="\t" '{ b += $(NF-1); l += $NF; } END { print b/l; }'

Using a toolkit like bedops lets you do this calculation more easily with more complex inputs. For example, you might want GC content over all of the mappable genome, i.e. everything except for blacklisted regions. This is a bit more difficult without a set-operations based approach.

ADD COMMENT
0
Entering edit mode

Thank you , I will try it this way too!

ADD REPLY
1
Entering edit mode
3.4 years ago
ATpoint 81k

A solution with awk and the bc (basic calculator):

#/ Get chromsizes hg19:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes

#/ Dummy BED file:
$ cat test.bed 
chr1    100 100000
chr2    20  2000000

#/ Calculate percent of genome represented by test.bed:
bc <<< "scale=10; 100 * $(awk '{sum=$3-$2}END{print sum}' test.bed) / $(awk '{sum+=$2}END{print sum}' hg19.chrom.sizes)"

Answer here would be .0637512652%

ADD COMMENT
0
Entering edit mode

Thanks! I will try this

ADD REPLY
0
Entering edit mode

When I try your example it comes out as 0. Is it possibly missing something?

ADD REPLY
0
Entering edit mode

I was also getting 0. After fiddling with the solution, I got the answer by changing sum=$3-$2 to sum+=$3-$2. Hope this helps

ADD REPLY

Login before adding your answer.

Traffic: 1559 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6