Question: How can I bin my bed files into 500bp bins?
0
gravatar for Tawny
4 months ago by
Tawny160
United States
Tawny160 wrote:

I have several hundred bed files that will be processed by another colleague. I have been searching for a way to bin these by 500bp bins to reduce their size and have universal bins across sample results. Here is an example bed file:

chr1    33  52  1
chr1    52  53  2
chr1    53  69  3
chr1    369 405 2
chr1    669 632 1

So each chromosome in each sample's file will have the same 500bp bins and I will have a sum of the fourth column in my new bed file. Which I imagine to look like this after processing my example bed file:

chr1    1    500    8
chr1    501    1001    1

What tools should be used to generate my output bed file?

bed • 327 views
ADD COMMENTlink modified 4 months ago • written 4 months ago by Tawny160

Check bedtools map. This can take a file of bins (e.g. from bedtools makewindows and then map your data onto these bins. It has options to perform an operation on column4, e.g. sum.

ADD REPLYlink written 4 months ago by ATpoint40k
2
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

You can use BEDOPS and UCSC Kent tools to do this efficiently.

Get your assembly of interest via fetchChromSizes, strip out non-nuclear chromosomes, and turn the rest into a BED file, split into 500nt bins.

Then use bedmap to map it against your example BED file, modified to a BED5 file.

Based on your question, I'll assume you want the --sum operator, but bedmap has several statistical operations. Run bedmap --help or the online docs for more detail.

For example, for assembly hg38:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 500 - | bedmap --echo --sum --delim '\t' - <( awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, ".", $4 }' signal.bed ) > answer.bed

The file answer.bed will be in your desired format, but with a 0-based index.

If you want a 1-based index, you can add one to each start coordinate:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 500 - | bedmap --echo --sum --delim '\t' - <( awk -v FS="\t" -v OFS="\t" '{ print $1, $2, $3, ".", $4 }' signal.bed ) | bedops --everything --range 1:0 - > answer.1based.bed

What is the difference between one- and zero-based indexing, you might ask. There is a previous biostars question on that subject, located here: https://bit.ly/2X1HgF8

ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds31k

@Alex Reynolds your code seems appropriate however this created bins with a small overlap:

chr2    81993000        81993500
chr2    81993500        81994000
chr2    81994000        81994500

I corrected my bins using --stagger 501 with the --chop 500 command.

ADD REPLYlink modified 4 months ago • written 4 months ago by Tawny160
1

The additional comments from ATpoint and Alex Reynolds helped me clarify my understanding. I see now that I do not need to use the --stagger 501 option.

I have focused on using the BEDOPS solution so that I can become more familiar with it and because I can see a need to use other functionality from it for other work. In the end, this bedmap command works for my needs

bedmap --fraction-map 0.5 --ec --echo --skip-unmapped --sum --delim '\t' hg38_chromSizes.bed  Sample_sort.bed \
        > Sample_sort_500bpBins.bed

Using --fraction-map overcame an issue where when a coordinate spanned two bins in my reference, it was printed out twice with the same sum values making the output BED file report what look like duplicates.

I believe the Bedtools version would be a workable solution as well.

Thank you for providing such a thorough response to my question.

ADD REPLYlink written 4 months ago by Tawny160

BED is 0-based, there is no overlap in the intervals you show.

ADD REPLYlink written 4 months ago by ATpoint40k

These zero-based intervals do not overlap, even though an interval's start position number matches the previous interval's stop position number.

Take a look at the bit.ly link at the end of my answer, which goes into the difference between zero- and one-based indexing. It's an important detail for doing set operations, because the choice of indexing decides what is called an overlap.

It is convention to use zero-based indexing for BED intervals, though not required. However, I suggest sticking with zero-based indexing to minimize errors when processing with other BEDOPS or other tools, or inputting to a visualization tool like the UCSC Genome Browser, etc.

ADD REPLYlink modified 4 months ago • written 4 months ago by Alex Reynolds31k
1
gravatar for cpad0112
4 months ago by
cpad011214k
India
cpad011214k wrote:

adding example to @ATPpoint solution:

$ cat file.txt 
chr1    2000

$ bedtools makewindows -g file.txt -w 500 | bedtools map -a - -b test.txt -c 4 -o sum 
chr1    0   500 8
chr1    500 1000    1
chr1    1000    1500    .
chr1    1500    2000    .
ADD COMMENTlink written 4 months ago by cpad011214k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1399 users visited in the last hour