Question: Annotating Genomic Intervals
gravatar for Stephen
8.0 years ago by
Charlottesville Virginia
Stephen2.7k wrote:

How can I annotate human genomic intervals (BED file) from a ChIP-seq experiment with information such as whether the interval overlaps with a gene(s)? Upstream of a gene? Overlaps with an exon? Intron? 5kb upstream/downstream of TSS? Intergenic? Does it overlap with a DNAse I hypersensitive site?

Surely bedtools can help me with this, but I'm looking for the best workflow / data sources to use for this that will require the least amount of scripting.


ADD COMMENTlink modified 8.0 years ago by Alex Reynolds30k • written 8.0 years ago by Stephen2.7k
gravatar for Alex Reynolds
8.0 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

If a lot of your data are in UCSC or three-column BED format, the BEDOPS suite contains tools that will help you with your work. These tools are quick and powerful, offering a few useful features:

  • support for standard input and output streams (you can pass in the results from one analysis to another, using standard UNIX pipes)
  • support for multiple BED inputs for set operations
  • low memory overhead

As examples, we can go through some of your to-do list below:

To find regions (Regions.bed) which overlap DNase hypersensitive sites (DHSes.bed) by one or more bases, the bedops tool performs this particular set operation:

$ bedops --element-of -1 Regions.bed DHSes.bed

The --element-of option supports other overlap criteria, and the bedops tool supports a variety of operations that include set complement, difference, intersection, merging, multiset union and others. See --help or the documentation for a fuller listing.

Using bedextract is a very fast way to perform bedops --element-of -1 A B calculations, if your set of DHS elements do not contain nested elements (see the documentation for their definition):

$ bedextract DHSes.bed Regions.bed

(This tool uses the information in a sorted BED file to locate elements in logarithmic time, instead of linearly walking through the input. The speedup can be very significant, as a result. A lot of BED inputs can take advantage of the performance gain that bedextract provides, so this is worth investigating.)

To find the closest genes (Genes.bed) to a set of input regions (Regions.bed), use closest-features (documentation here):

$ closest-features Regions.bed Genes.bed

To get a delimited set of exons which overlap a set of regions, use bedmap (documentation here):

$ bedmap --echo --echo-map Regions.bed Exons.bed

The bedmap tool has options that allow refinement of overlap criteria, as well as reporting of other data that can be found in a typical UCSC BED file, such as score and ID information in the mapping input (in this example, the exons). This would be useful for data that contain scores (DHS density tracks, etc.) or IDs (exons, genes, SNPs, etc.).

Additionally, calculations from one tool can be piped into another BEDOPS tool, using standard UNIX pipes. UNIX pipes are a good way to "glue together" calculations, which minimizes time spent on scripting and script maintenance. Piping can also reduce file I/O, which can improve overall performance considerably.

As a further example, let's say you are interested in the set of nearest genes to BED elements that overlap DHSes. To get this answer, we can pipe the results of a bedops operation into a closest-features search:

$ bedops --element-of -1 Regions.bed DHSes.bed | closest-features - Genes.bed

(BEDOPS shares the GNU convention of using the hyphen to denote standard input.)

The bedops tool supports multiple inputs. So if you have several sets of DHSes in separate BED files, you can find overlaps between input regions and all of them in one go:

$ bedops --element-of -1 Regions.bed DHSesFirstSet.bed DHSesSecondSet.bed ... DHSesLastSet.bed

The BEDOPS toolkit aims to have a low memory overhead and to perform quickly, taking advantage of the structure of lexicographically-sorted BED streams. The sort-bed utility will do this, if inputs are unsorted:

$ sort-bed Unsorted.bed > Sorted.bed

Other BEDOPS tools (bedops, bedextract, etc.) import and export sorted data, so sorting is only needed once at the beginning of analysis, if the inputs are unsorted or are of uncertain order. Some tools also include error checking options (--ec), to help validate BED input.

ADD COMMENTlink modified 11 months ago by RamRS30k • written 8.0 years ago by Alex Reynolds30k

Thanks for the detailed how-to. I will definitely give this a shot.

ADD REPLYlink written 8.0 years ago by Stephen2.7k
gravatar for Steve Lianoglou
8.0 years ago by
Steve Lianoglou5.1k
Steve Lianoglou5.1k wrote:

If you're comfortable with R and Bioconductor, the ChIPpeakAnno package has some canned functionality to do this -- the annotatePeakInBatch function, in particular.

Check out Section 2: Examples of using ChIPPeakAnno, for an example.

ADD COMMENTlink written 8.0 years ago by Steve Lianoglou5.1k

Thanks, very comfortable in R/BioC. Edited question to include R/BioC tags. Will take a look at this package. Thanks.

ADD REPLYlink written 8.0 years ago by Stephen2.7k

ChIPpeakAnno is not using strand information therefore introducing error. Use the more recent ChIPseeker R package which essentially does the same, or the perl program HOMER

ADD REPLYlink written 5.4 years ago by Tom230
gravatar for Khader Shameer
8.0 years ago by
Manhattan, NY
Khader Shameer18k wrote:

Another option is Ensembl Region Report. API here.

ADD COMMENTlink written 8.0 years ago by Khader Shameer18k
gravatar for Sean Davis
8.0 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

Bioconductor tools of interest include rtracklayer and GenomicFeatures to gather genomic feature information and GenomicRanges and IRanges for doing very fast overlap and range manipulation. In particular, you'll want to look at findOverlaps and range operations like reduce, for example. Combinations of these can be an awesome toolset for nearly any genomic range operations including filtering and modifying, all from the comfort of the R command line....

ADD COMMENTlink written 8.0 years ago by Sean Davis26k
gravatar for vu.bush
8.0 years ago by
vu.bush10 wrote:

I would use BedIntersect -- it's an awesome tool and can compute the overlap between two bed files. You can find it in the UCSC FTP directory

ADD COMMENTlink written 8.0 years ago by vu.bush10

Thanks. I couldn't seem to find BedIntersect. Is this different from the intersectBed function in the bedtools package? I'm aware of how to use intersectBed, but it's more so finding a consolidated data source and workflow to get at this annotation with the least amount of scripting (that has likely already been done by someone else). If BedIntersect has this kind of annotation data and functionality built-in, I'd love to know where to find it.

ADD REPLYlink written 8.0 years ago by Stephen2.7k
gravatar for Pierre Lindenbaum
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

A Biodas server could be a way. I've just tested the Biodas server of the UCSC

$ curl -s ",100100" | head -n 50">
<GFF version="1.0" href="&lt;a href=" http:="""" cgi-bin="" das="" hg19="" features"="" rel="nofollow">">
<SEGMENT id="1" start="100000" stop="100100" version="1.00" label="1">
<FEATURE id="HIT000010175.chr1.73842.0" label="HIT000010175">
 <TYPE id="HInvGeneMrna" category="transcription" reference="no">HInvGeneMrna</TYPE>
 <GROUP id="HIT000010175.chr1.73842">

unfortunately , it is broken:

$ curl -s ",100100" | xmllint --format -
-:402: parser error : Premature end of data in tag SEGMENT line 5

-:402: parser error : Premature end of data in tag GFF line 4

-:402: parser error : Premature end of data in tag DASGFF line 3


and it is poorly documented:

BIODAS : can i retrieve some metadata about a SEGMENT ?

but you can find some other DAS server elsewhere:

ADD COMMENTlink written 8.0 years ago by Pierre Lindenbaum130k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1668 users visited in the last hour