If you can convert your gene annotations into a BED file - say, converting FASTA-formatted annotations to BED with Galaxy or other tools - and your genomic range in a set of BED coordinates, then you can use the bedmap tool in the BEDOPS suite to calculate a result containing those genes located within your genomic range(s) of interest.
To demonstrate, let's say your gene annotations are formatted as a BED file called annotations.bed
, like so:
chr1 1000 1500 gene-1
chr1 4000 4300 gene-2
chr1 6200 6450 gene-3
chr1 9120 9675 gene-4
We're following UCSC's definition of a BED file, with the first column representing the chromosome name, the second and third columns the genomic coordinates, and the fourth column the ID or name of the annotation.
Let's say you want to do an ad-hoc search through this file, to find genes on the chromosome chr1
which are contained within the range [2000, 7000)
. You could use bedmap
as follows:
$ echo -e "chr1\t2000\t7000" | bedmap --echo --echo-map-id - annotations.bed
chr1 2000 7000|gene-2;gene-3
The result indicates that gene-2
and gene-3
are contained within the BED coordinate range [2000, 7000)
on chromosome chr1
.
The echo -e
command passes bedmap
that range via standard input, which bedmap
consumes with the -
hyphen character. The bedmap
tool then maps annotations to that input range and shows you these results.
Results are a semi-colon-delimited list of genes for that range, taken from the ID column of annotations.bed
. This is because we used the --echo-map-id
operator, which outputs mapped IDs. There are four --echo-map-*
operators that offer the entire mapped annotation, or various pieces of it (ID, genomic range, scores). Please refer to the documentation for more detail.
The output from a bedmap
operation is trivial to parse with Perl scripts or Java programs, through the usual split()
or regular expression calls, etc.
Instead of a trivial example that uses echo
and standard input, you can instead specify a BED file containing multiple ranges, each on their own line. For example, let's say we have a file called rangesOfInterest.bed
that contains two ranges (you can specify as many ranges as you like, each on their own line):
chr1 2000 7000
chr1 5000 10000
You can then use bedmap
as follows to get results containing the annotations of interest:
$ bedmap --echo --echo-map-id rangesOfInterest.bed annotations.bed
chr1 2000 7000|gene-2;gene-3
chr1 5000 10000|gene-3;gene-4
Further, if you want BED-formatted output, use the --delim
operator in conjunction with --echo
, as follows:
$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
This replaces the pipe character (|
) with a tab (\t
), so that the output is a minimal BED-formatted result:
$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
chr1 2000 7000 gene-2;gene-3
chr1 5000 10000 gene-3;gene-4
This trick is very useful if you want to pipe the results to another BEDOPS command or another utility that consumes BED-formatted data, like awk, Perl or Python scripts, etc.
BEDOPS tools are modular and written to accept standard input (such as what was done above with echo -e
), so it is easy to perform further set or mapping operations with bedops
and bedmap
, respectively, or compress the output with starch
, etc. in an extended pipeline. Piping standard output from one program to standard input of another program reduces overall I/O load on a computer and can improve the speed of calculations.
The bedmap
tool is also useful for quantitative work. While the gene annotations in this example do not contain any numerical data, we might want to quickly count how many genes are binned within our ranges of interest, using the --count
operator:
$ bedmap --echo --echo-map-id --count --delim '\t' rangesOfInterest.bed annotations.bed
chr1 2000 7000 gene-2;gene-3 2
chr1 5000 10000 gene-3;gene-4 2
Granted, this is a pretty mundane example, but you could imagine asking how many annotations are contained within a moving window over a genome. This would tell you where your annotations are concentrated, which could have biological significance.
Other numerical operators are available for bedmap
, useful particularly if your mapped elements contain score information, like tag counts, p-values, expression levels, etc.
To perform all these operations within R, use a system() call.