Question

bedmap window base overlap not reporting what I would expect

0

Entering edit mode

4.8 years ago

nketchi ▴ 10

I am trying to calculate percentage repeats across a 50kb window in a genome

I have made a 50kb window bed file for my genome:

test.50kb.bed:

chr18   14300000    14350000
chr18   14350000    14400000
chr18   14400000    14450000
chr18   14450000    14500000
chr18   14500000    14550000
chr18   14550000    14600000

and I have a bed file of the repeats in the genome test.repeats.bed:

chr18   14300459    14301011    .
chr18   14308214    14308253    .
chr18   14308718    14308843    .
chr18   14308992    14309021    .
chr18   14310484    14310517    .
chr18   14312857    14312890    .
chr18   14314212    14314229    .
chr18   14314263    14314389    .
chr18   14318565    14318780    .

However, when I run the following to calculate the percentage bases which are repeats in the 50kb window:

bedmap --delim '\t' --echo --count --bases --bases-uniq genome.extents.sorted.50kb.bed genome.all_repeats.bed | awk '{ print $0"\t"($5/($3 - $2)*100); }'  | grep "chr18    14300000    14350000"

I don't get what I expect! The result:

chr18   14300000    14350000    10  31103   31103   62.206

The results from where I used awk to pull out the repeat regions first (cat genome.all_repeats.bed | grep "chr18" | awk '{if ($2>=14300000 && $3<=14350000) {print ;}}' > chr18_repeats.bed)

bedmap --delim '\t' --echo --count --bases --bases-uniq STAR.extents.sorted.50kb.bed chr18_repeats.bed | awk '{ print $0"\t"($5/($3 - $2)*100); }'  | grep "chr18   14300000        14350000"

chr18   14300000    14350000    9   1169    1169    2.338

I know from sanity checking that the second result is correct. Can anyone advise me what I am doing wrong please?!

Thanks!!

bedmap • 829 views

ADD COMMENT • link updated 4.8 years ago by Alex Reynolds 35k • written 4.8 years ago by nketchi ▴ 10

0

Entering edit mode

Hi j.r.paris,

A small educational note: I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

score 1 · Answer 1 · 2019-06-24

1

Entering edit mode

4.8 years ago

Alex Reynolds 35k

Use of bedmap requires sort-bed-sorted inputs in order to walk through files in correct order and to calculate correct statistics.

If extracting a chromosome from your repeats works, then chromosomes in the original file are perhaps out of order. So I would suggest ensuring that genome.all_repeats.bed contains sorted intervals, i.e.:

$ sort-bed genome.all_repeats.bed > genome.all_repeats.sorted.bed

Then use genome.all_repeats.sorted.bed as input to bedmap, instead of genome.all_repeats.bed.

You could first use sort-bed --check-sort <file> to verify a file's sort status, but it is easy enough to resort directly.

I'd also look at making sure that STAR.extents.sorted.50kb.bed and/or genome.extents.sorted.50kb.bed are sorted.

ADD COMMENT • link 4.8 years ago by Alex Reynolds 35k

0

Entering edit mode

Thanks Alex!

Unfortunately, this doesn't seem to solve the problem. I checked my bed files were all sorted and they were. I re-sorted them anyway just to check, and still I get the same result.

It was suggested to be that perhaps there are overlaps in the repeats.bed file which are causing problems, and although I can see that the window I'm interested in has no overlap:

chr18   14300459    14301011    FALSE
chr18   14308214    14308253    FALSE
chr18   14308718    14308843    FALSE
chr18   14308992    14309021    FALSE
chr18   14310484    14310517    FALSE
chr18   14312857    14312890    FALSE
chr18   14314212    14314229    FALSE
chr18   14314263    14314389    FALSE
chr18   14318565    14318780    FALSE

When I include the 50kb windows either side of my 50kb window of interest, I get some overlap, e.g. :

chr18   14314263    14314389    FALSE
chr18   14318565    14318780    FALSE
chr18   14320066    14357919    FALSE
chr18   14358889    14359706    TRUE
chr18   14359696    14360296    TRUE
chr18   14360285    14360538    TRUE
chr18   14360345    14360706    TRUE
chr18   14360549    14360882    FALSE
chr18   14360929    14361219    FALSE
chr18   14361219    14361514    FALSE
chr18   14361514    14362954    TRUE

And then if I sanity check these regions together, I get the same "wrong" result as before

Can you suggest a way of overcoming this?

thanks again!

ADD REPLY • link 4.8 years ago by nketchi ▴ 10

1

Entering edit mode

That makes sense. If your repeats overlap, i.e. are not disjoint, then you could potentially get multiple counts over a (50kb) window.

Merging or partitioning your repeats would give you disjoint regions:

$ bedops --merge repeats.bed > repeats.merged.bed
$ bedops --partition repeats.bed > repeats.partitioned.bed

Using disjoint regions for mapping should give single counts, I think.

Making repeats disjoint would lose information about how many repeats overlap any given window, but other bedmap operations could be used with the original repeats for a separate calculation.

ADD REPLY • link 4.8 years ago by Alex Reynolds 35k