Looking For A Way To Count Total Length Of Regions Of Bed File
3
6
Entering edit mode
8.5 years ago
Yunfei Li ▴ 310

as title, there is no overlap region in bed file. Thanks

bed • 20k views
ADD COMMENT
0
Entering edit mode

Can you post the head of your bed file, most of them, has a start and end co-ordinate, so you could just subtract start from end and sum everything!!

ADD REPLY
0
Entering edit mode

Do your BED data contain overlapping regions, or are your regions disjoint? If the latter, or if you don't care if regions overlap, then a basic awk statement as shown in one answer will suffice. Otherwise, let us know and I'll suggest another method that accounts for both cases.

ADD REPLY
28
Entering edit mode
8.5 years ago
Raygozak ★ 1.4k

You can do it with the following command line:

cat file.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
ADD COMMENT
0
Entering edit mode

Actually, I think it would be $3-$2 + 1. If you have an interval 5 to 10, subtracting counts the nucleotides 6 to 10, but misses out the first one. Otherwise looks good!

ADD REPLY
6
Entering edit mode

Actually the bed files are 0-based for the positions with the end not included (see UCSC website http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1). Hence, a bed with positions: 5-10 means positions 6-10 on the genome. So no need to add 1.

ADD REPLY
2
Entering edit mode

Indeed, one reason to use 0-based indexing is to minimize calculations on a computer: the arithmetic is simpler.

ADD REPLY
0
Entering edit mode

Huh. Well, that was helpful to find out - thanks!

ADD REPLY
0
Entering edit mode

I am surprised bedtools does not have a command for that. Or does it?

ADD REPLY
2
Entering edit mode

Given sorted input BED files A, B, C, etc., an input BED file that defines bounds of chromosomes for the organism, e.g. hg19.extents.bed (link) and BEDOPS 2.4.1 (or greater), you could do something like the following:

bedops --merge A B C ... | bedmap --echo --echo-map-size hg19.extents.bed - > answer.bed

If you just have one BED file, then the following will merge overlapping regions in one file, so as to calculate unique base length:

bedops --merge A | bedmap --echo --echo-map-size hg19.extents.bed - > answer.bed

See docs for merging and mapping for more detail.

Or you can pipe merged data into the aforementioned awk statement:

bedops --merge A | awk ...

But the bedops | bedmap pipeline preserves chromosome names and extent data.

ADD REPLY
1
Entering edit mode
2.2 years ago
demolidd77 ▴ 60

For a BED file , you can cacalculate its total size to use

awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' file.bed .
It works correctly .

ADD COMMENT
0
Entering edit mode
11 months ago
Bing • 0

In my opinion, some bed files contain forward and reverse strands.

In addition, some bed files contain regions overlaping with each other for paticular purpose.

The regions of repeated sequencing should be combined before calculating the sum of base numbers.

library(GenomicRanges)
library(dplyr)
library(data.table)

bed <- "/path/to/your/bed_file.bed"

bed.df <- fread(bed) %>% as.data.frame()

bed.gr <- makeGRangesFromDataFrame(
    df = bed.df, keep.extra.columns = TRUE,
    ignore.strand = FALSE, 
    starts.in.df.are.0based = TRUE
)

# Total bases before combining:
widthbed.gr) %>% sum()

# Combinine overlapping regions:
bed.gr.reduced <- reducebed.gr)
width(bed.gr.reduced) %>% sum()
ADD COMMENT

Login before adding your answer.

Traffic: 3087 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6