Question

How To Get Annotation For Bed File From Another Bed File

1

Entering edit mode

12.6 years ago

k.nirmalraman ★ 1.1k

Hello All,

I have a bed file (with Chr, Start, End, Name, Score and Strand)

Chr1 5678 5680 NA 7  +
Chr1 700  800  NA 8  -
Chr1 900  1200 NA 10 -

and would like to know, how can I get the annotation for the name column from another bed file

Chr1 5500 6000 Gene1 x +
Chr1  500 1000 Gene2 x -

or any standard genome file formats like gbk or .fna files or for that matter another bed file? So mu output file will be a bed file with Chr, Start, End, Name and Strand.

Chr1 5678 5680 Gene1 7 +
Chr1 700  800  Gene2 8 -
Chr1 900  1200 Gene2 10 -

Any easy and standard way to do this??

Bedtools usually operates more on the features but not sure if annotation from one bedfile can be extracted into the other based on overlapping feaures.

Thanks in advance!

bedtools annotation • 13k views

ADD COMMENT • link updated 10.6 years ago by Biostar 20 • written 12.6 years ago by k.nirmalraman ★ 1.1k

1

Entering edit mode

Thanks. I posted an answer using bedtools.

ADD REPLY • link 12.6 years ago by Aaronquinlan 12k

0

Entering edit mode

Could you provide a specific example (with example data) of what you are trying to do?

ADD REPLY • link 12.6 years ago by Aaronquinlan 12k

0

Entering edit mode

Now, I have added the example....

ADD REPLY • link 12.6 years ago by k.nirmalraman ★ 1.1k

score 7 · Answer 1 · 2012-11-24

The bedtools intersect command, combined with some relatively simple awk magic will accomplish what you want. The intersect command will identify the intervals in your two files that overlap. By using the -wb option, intersect will report the entire original interval from the B (second file in your example) file on the same line as the A interval. This is illustrated in the third command below. By having the A and B intervals on the same line, one can simply use awk to select the columns one wants to include in the final file. To follow your example, we choose the 1st, 2nd, 3rd, 10th, 5th, and 6th columns from the output to create the final final in your example. If you are dealing with a very large B file, you should first sort your files by chromosome and start coordinate (i.e., sort -k1,1 -k2,2n) and use the -sorted option in bedtools intersect. This exploits the fact that the intervals are pre-sorted to employ a low-memory algorithm for detecting overlaps between two files. I hope this helps.

> cat a.bed 
chr1    5678    5680    NA    7    +
chr1    700    800    NA    8    -
chr1    900    1200    NA    10    -

> cat b.bed 
chr1    5500    6000    Gene1    x    +
chr1    500    1000    Gene2     x    -

> bedtools intersect -a a.bed -b b.bed -wb
chr1    5678    5680    NA    7    +    chr1    5500    6000    Gene1    x    +
chr1    700    800    NA    8    -    chr1    500    1000     Gene2    x    -
chr1    900    1000    NA    10    -    chr1    500    1000    Gene2    x    -

> bedtools intersect -a a.bed -b b.bed -wb | awk -v OFS="\t" '{print $1,$2,$3,$10,$5,$6}'
chr1    5678    5680    Gene1    7    +
chr1    700    800    Gene2    8    -
chr1     900    1000    Gene2    10    -

score 5 · Answer 2 · 2012-11-24

You can use the bedmap tool in the BEDOPS suite to quickly and efficiently generate an answer, if your reference and annotation inputs are both sorted BED files.

(Sorting is defined by using the sort-bed or bbms utilities, which apply a lexicographical sort on chromosome and coordinate. Sorting allows BEDOPS tools to operate in constant memory and linear time, which is very useful for whole-genome inputs. The sort-bed utility works on input of sizes up to system memory, which can be a limitation for whole-genome work. The bbms tool therefore applies a lexicographical sort to whole-genome scale data on what would otherwise be a limitation on personal workstations, while use of GNU sort in this way may require some extra tricks. BEDOPS v2 will merge the functionality of sort-bed and bbms into sort-bed in a more transparent manner.)

The bedmap tool allows you to collect annotations, scores, positional or full element data from "map" elements in one UCSC-standard BED file, which overlap an element in a "reference" BED file.

To demonstrate, let's say your annotations are formatted as a BED file called annotations.bed, like so:

chr1    1000    1500    annotation-1
chr1    4000    4300    annotation-2
chr1    6200    6450    annotation-3
chr1    9120    9675    annotation-4

Here, we are following UCSC's standard definition of a BED file, with the first column representing the chromosome name, the second and third columns the genomic coordinates, and the fourth column the ID or name of the annotation.

Let's say you want to do an ad-hoc search through this file, to find annotations on the chromosome chr1 which are contained within the half-open range [2000, 7000). You could use bedmap as follows:

$ echo -e "chr1\t2000\t7000\treference-1" | bedmap --echo --echo-map-id - annotations.bed
chr1    2000    7000    reference-1|annotation-2;annotation-3

The result indicates that annotation-2 and annotation-3 are contained within the BED coordinate range [2000, 7000) on chromosome chr1.

The echo -e command passes bedmap that range via standard input, which bedmap consumes with the - hyphen character. The bedmap tool then maps annotations to that input range and shows you these results.

Results are a semi-colon-delimited list of annotations for that range, taken from the ID (or "name") column of annotations.bed. This is because we used the --echo-map-id operator, which outputs mapped IDs.

There are four --echo-map-* operators that offer the entire mapped annotation (similar to bedTools), or various pieces of it (ID, genomic range, scores) for parsing convenience. Please refer to the documentation for more detail.

Instead of a trivial example that uses echo and standard input, you can instead specify a BED file containing multiple ranges, each on their own line. For example, let's say we have a file called rangesOfInterest.bed that contains two ranges (you can specify as many ranges as you like, each on their own line):

chr1    2000    7000    reference-1
chr1    5000    10000   reference-2

(This example only has four columns — chromosome name, start and stop coordinates, and the reference ID/name — but you can have additional columns, like score and strand data, and the operations will work identically as described.)

You can then use bedmap as follows to get results containing the annotations of interest:

$ bedmap --echo --echo-map-id rangesOfInterest.bed annotations.bed
chr1    2000    7000    reference-1|annotation-2;annotation-3
chr1    5000    10000   reference-2|annotation-3;annotation-4

Further, if you want BED-formatted output, use the --delim operator in conjunction with --echo, as follows:

$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed

This replaces the pipe character (|) with a tab (\t), so that the output is a minimal BED-formatted result:

$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
chr1    2000    7000    reference-1    annotation-2;annotation-3
chr1    5000    10000   reference-2    annotation-3;annotation-4

This trick is very useful if you want to pipe the results to another BEDOPS command or another utility that consumes tab-delimited data, like awk, cut, Perl or Python scripts, etc.

For example, if your output needs to be something like: "Chr, Start, End, Name and Strand" along with annotations in a sixth column, then you would pipe UCSC-standard BED data into cut to strip the score column from your UCSC-standard BED input:

$ bedmap ... | cut -f1-4,6,7 - > answer.bed

Note that your output is no longer a standard BED file, but BEDOPS tools that operate on coordinates will work with these data.

If you do not have UCSC-standard BED reference input and instead need to rearrange columns, simply pipe results into awk to print out your columns in a different order, e.g.:

$ bedmap ... | awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$7}' - > answer.bed

Your particular awk statement will depend on which columns you decide to extract and print.

(BEDOPS tools are modular and written to accept standard input (such as what was done above with echo -e) and submit results to standard output, so it is trivially easy to perform further set or mapping operations with bedops and bedmap, respectively, or compress the output with starch, or pipe to bedTools utilities, etc. in an extended pipeline. Piping standard output from one program to standard input of another program reduces overall I/O load on a computer and can improve the speed of calculations, in addition to the performance enhancements already in BEDOPS.)