Question

Looking For A Way To Count Total Length Of Regions Of Bed File

6

Entering edit mode

11.0 years ago

Yunfei Li ▴ 310

as title, there is no overlap region in bed file. Thanks

bed • 31k views

ADD COMMENT • link updated 2.1 years ago by chunshi • 0 • written 11.0 years ago by Yunfei Li ▴ 310

0

Entering edit mode

Can you post the head of your bed file, most of them, has a start and end co-ordinate, so you could just subtract start from end and sum everything!!

ADD REPLY • link 11.0 years ago by Sukhi Singh 11k

0

Entering edit mode

Do your BED data contain overlapping regions, or are your regions disjoint? If the latter, or if you don't care if regions overlap, then a basic awk statement as shown in one answer will suffice. Otherwise, let us know and I'll suggest another method that accounts for both cases.

ADD REPLY • link 11.0 years ago by Alex Reynolds 35k

1

Entering edit mode

4.8 years ago

demolidd77 ▴ 60

For a BED file , you can cacalculate its total size to use

awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' file.bed .
It works correctly .

ADD COMMENT • link 4.8 years ago by demolidd77 ▴ 60

0

Entering edit mode

3.5 years ago

Bing • 0

In my opinion, some bed files contain forward and reverse strands.

In addition, some bed files contain regions overlaping with each other for paticular purpose.

The regions of repeated sequencing should be combined before calculating the sum of base numbers.

library(GenomicRanges)
library(dplyr)
library(data.table)

bed <- "/path/to/your/bed_file.bed"

bed.df <- fread(bed) %>% as.data.frame()

bed.gr <- makeGRangesFromDataFrame(
    df = bed.df, keep.extra.columns = TRUE,
    ignore.strand = FALSE, 
    starts.in.df.are.0based = TRUE
)

# Total bases before combining:
widthbed.gr) %>% sum()

# Combinine overlapping regions:
bed.gr.reduced <- reducebed.gr)
width(bed.gr.reduced) %>% sum()

ADD COMMENT • link 3.5 years ago by Bing • 0

0

Entering edit mode

2.1 years ago

chunshi • 0

def main(): fs = FS() args = get_arguments()

simple_length = 0
positions = set()
for line in fs.read2list(args.bed_file):
    fields = line.split("\t")
    chromosome = fields[0]
    start_pos = int(fields[1]) + 1
    end_pos = int(fields[2]) + 1

    simple_length = simple_length + end_pos - start_pos
    #print line
    for i in xrange(start_pos, end_pos):
        positions.add(chromosome + ":" + str(i))
if args.full_output:
    print "Size of " + os.path.basename(args.bed_file) + ":" 
    print '{:,}'.format(len(positions))
    if len(positions) == simple_length:
        print "There are NO overlap regions in the bed file!"
    else:
        print "There are OVERLAP regions in the bed file!"
else:
    print len(positions)

fs.close()

ADD COMMENT • link 2.1 years ago by chunshi • 0

Istvan Albert · Accepted Answer · 2013-04-04

32

Entering edit mode

11.0 years ago

Raygozak ★ 1.4k

You can do it with the following command line:

cat file.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

ADD COMMENT • link updated 11.0 years ago by Istvan Albert 100k • written 11.0 years ago by Raygozak ★ 1.4k

0

Entering edit mode

Actually, I think it would be $3-$2 + 1. If you have an interval 5 to 10, subtracting counts the nucleotides 6 to 10, but misses out the first one. Otherwise looks good!

ADD REPLY • link 11.0 years ago by Jelena Aleksic ▴ 920

6

Entering edit mode

Actually the bed files are 0-based for the positions with the end not included (see UCSC website http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1). Hence, a bed with positions: 5-10 means positions 6-10 on the genome. So no need to add 1.

ADD REPLY • link 11.0 years ago by Anthony Mathelier ▴ 910

2

Entering edit mode

Indeed, one reason to use 0-based indexing is to minimize calculations on a computer: the arithmetic is simpler.

ADD REPLY • link 9.8 years ago by Alex Reynolds 35k

0

Entering edit mode

Huh. Well, that was helpful to find out - thanks!

ADD REPLY • link 11.0 years ago by Jelena Aleksic ▴ 920

0

Entering edit mode

I am surprised bedtools does not have a command for that. Or does it?

ADD REPLY • link 10.1 years ago by 14134125465346445 ★ 3.6k

2

Entering edit mode

Given sorted input BED files A, B, C, etc., an input BED file that defines bounds of chromosomes for the organism, e.g. hg19.extents.bed (link) and BEDOPS 2.4.1 (or greater), you could do something like the following:

bedops --merge A B C ... | bedmap --echo --echo-map-size hg19.extents.bed - > answer.bed

If you just have one BED file, then the following will merge overlapping regions in one file, so as to calculate unique base length:

bedops --merge A | bedmap --echo --echo-map-size hg19.extents.bed - > answer.bed

See docs for merging and mapping for more detail.

Or you can pipe merged data into the aforementioned awk statement:

bedops --merge A | awk ...

But the bedops | bedmap pipeline preserves chromosome names and extent data.

ADD REPLY • link 9.8 years ago by Alex Reynolds 35k