I am trying to calculate percentage repeats across a 50kb window in a genome
I have made a 50kb window bed file for my genome:
test.50kb.bed:
chr18 14300000 14350000
chr18 14350000 14400000
chr18 14400000 14450000
chr18 14450000 14500000
chr18 14500000 14550000
chr18 14550000 14600000
and I have a bed file of the repeats in the genome test.repeats.bed:
chr18 14300459 14301011 .
chr18 14308214 14308253 .
chr18 14308718 14308843 .
chr18 14308992 14309021 .
chr18 14310484 14310517 .
chr18 14312857 14312890 .
chr18 14314212 14314229 .
chr18 14314263 14314389 .
chr18 14318565 14318780 .
However, when I run the following to calculate the percentage bases which are repeats in the 50kb window:
bedmap --delim '\t' --echo --count --bases --bases-uniq genome.extents.sorted.50kb.bed genome.all_repeats.bed | awk '{ print $0"\t"($5/($3 - $2)*100); }' | grep "chr18 14300000 14350000"
I don't get what I expect! The result:
chr18 14300000 14350000 10 31103 31103 62.206
The results from where I used awk to pull out the repeat regions first (cat genome.all_repeats.bed | grep "chr18" | awk '{if ($2>=14300000 && $3<=14350000) {print ;}}' > chr18_repeats.bed
)
bedmap --delim '\t' --echo --count --bases --bases-uniq STAR.extents.sorted.50kb.bed chr18_repeats.bed | awk '{ print $0"\t"($5/($3 - $2)*100); }' | grep "chr18 14300000 14350000"
chr18 14300000 14350000 9 1169 1169 2.338
I know from sanity checking that the second result is correct. Can anyone advise me what I am doing wrong please?!
Thanks!!
Hi j.r.paris,
A small educational note: I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below: