Question: Annotaion Based On The Genomic Range
gravatar for ancient_learner
6.8 years ago by
ancient_learner610 wrote:

Hello all I have some data like this related to mouse genome

chr1 3000000 3000090
chr2 4339993 4389898
chr5 3000330 3003339
chr7 3323233 3390393

I know that by using UCSC genome browser we can get the information related to the presence of genes, proteins at those regions. however i am more interested in identifying all functional elements (may be promoters, enhancers tfs etc) with in that region. is there any way to do that. With in UCSC is there any option like that?

genome annotation R ucsc • 3.7k views
ADD COMMENTlink modified 6.5 years ago by Emily_Ensembl18k • written 6.8 years ago by ancient_learner610
gravatar for Alex Reynolds
6.8 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If you can put your gene annotations into a BED file — say, GENCODE features obtained via the ENCODE project site and converted from GTF to BED with the Noble lab's gtf2bed script — and your genomic ranges in a set of BED coordinates (which it looks like you've already done), then you can use the bedmap tool in the BEDOPS suite to calculate a result containing those genes located within your genomic range(s) of interest.

To demonstrate, let's say your gene annotations are formatted as a BED file called annotations.bed, like so:

chr1    1000    1500    gene-1
chr1    4000    4300    gene-2
chr1    6200    6450    gene-3
chr1    9120    9675    gene-4

We're following UCSC's definition of a BED file, with the first column representing the chromosome name, the second and third columns the genomic coordinates, and the fourth column the ID or name of the annotation. We use a tab character (\t) as a field delimiter.

Let's say you want to do an ad-hoc search through this file, to find genes on the chromosome chr1 which are contained within the range [2000, 7000). You could use bedmap as follows:

$ echo -e "chr1\t2000\t7000" | bedmap --echo --echo-map-id - annotations.bed
chr1    2000    7000|gene-2;gene-3

The result indicates that gene-2 and gene-3 are contained within the BED coordinate range [2000, 7000) on chromosome chr1.

The echo -e command passes bedmap that range via standard input, which bedmap consumes with the - hyphen character. The bedmap tool then maps annotations to that input range and shows you these results.

Results are a semi-colon-delimited list of genes for that range, taken from the ID column of annotations.bed. This is because we used the --echo-map-id operator, which outputs mapped IDs. There are four --echo-map-* operators that offer the entire mapped annotation, or various pieces of it (ID, genomic range, scores). Please refer to the documentation for more detail.

The output from a bedmap operation is trivial to parse with Perl scripts or Java programs, through the usual split() or regular expression calls, etc.

Instead of a trivial example that uses echo and standard input, you can instead specify a BED file containing multiple ranges, each on their own line. For example, let's say we have a file called rangesOfInterest.bed that contains two ranges (you can specify as many ranges as you like, each on their own line):

chr1    2000    7000
chr1    5000    10000

You can then use bedmap as follows to get results containing the annotations of interest:

$ bedmap --echo --echo-map-id rangesOfInterest.bed annotations.bed
chr1    2000    7000|gene-2;gene-3
chr1    5000    10000|gene-3;gene-4

Further, if you want BED-formatted output, use the --delim operator in conjunction with --echo, as follows:

$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed

This replaces the pipe character (|) with a tab (\t), so that the output is a minimal BED-formatted result:

$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
chr1    2000    7000    gene-2;gene-3
chr1    5000    10000   gene-3;gene-4

This trick is very useful if you want to pipe the results to another BEDOPS command or another utility that consumes BED-formatted data, like awk, Perl or Python scripts, etc.

BEDOPS tools are modular and written to accept standard input (such as what was done above with echo -e), so it is easy to perform further set or mapping operations with bedops and bedmap, respectively, or compress the output with starch, etc. in an extended pipeline. Piping standard output from one program to standard input of another program reduces overall I/O load on a computer and can improve the speed of calculations.

The bedmap tool is also useful for quantitative work. While the gene annotations in this example do not contain any numerical data, we might want to quickly count how many genes are binned within our ranges of interest, using the --count operator:

$ bedmap --echo --echo-map-id --count --delim '\t' rangesOfInterest.bed annotations.bed
chr1    2000    7000    gene-2;gene-3    2
chr1    5000    10000   gene-3;gene-4    2

Granted, this is a pretty mundane example, but you could imagine asking how many annotations are contained within a moving window over a genome. This would tell you where your annotations are concentrated or enriched, which could have biological significance.

Other numerical operators are available for bedmap, useful particularly if your map elements contain score information, like tag counts, p-values, expression levels, etc.

ADD COMMENTlink modified 6.5 years ago • written 6.8 years ago by Alex Reynolds28k

@Alex Can you please help me with this question How to assigning Active Enhancer to every genes within 10 kb region?. Thanks

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Bioinformatist Newbie230
gravatar for Sukhdeep Singh
6.8 years ago by
Sukhdeep Singh9.8k
Sukhdeep Singh9.8k wrote:

You have to define promoters and enhancers by yourself, there is no proper definition. Get a list of all genes, refer this Fetching Transcription Start And End For A Custom Gene List From Ucsc (Hg18/Ncbi36) for that, change the organism and build. If you know R or any other language, add and subtract the number of bases or a region of some KB (eg +/-1KB) from the TSS (labelled as txStart in the table) strand specifically. This number depends on how you define promters and then use the intersectBed tool from Bedtools. Check this How To Determine Overlaps From Coordinates or manual for usage.

For Enhancers, some people say they are 5-10KB far, but a way to do it would be overlay the ChIP-Seq data(peaks) of p300 (marker for enhancers) on the genome to get the list of enhancers and then intersect with you own file If you know Galaxy, then this might be helpful, From BED Coordinates to Genes


ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Sukhdeep Singh9.8k

Thank you sukhdeep for your reply. i have obtained the chip-seq from GEO database. An enhancer might be of 500 bps (avg). but the chip-seq data shows only the areas where p300 binding is present. so i can regard +/- 200bps from the peak start region of chip-seq as enhancer?

ADD REPLYlink written 6.8 years ago by ancient_learner610
gravatar for Irsan
6.8 years ago by
Irsan6.9k wrote:

If you want information on annotating genomic intervals in general see some similar Biostars-posts:

ADD COMMENTlink written 6.8 years ago by Irsan6.9k
gravatar for Emily_Ensembl
6.5 years ago by
Emily_Ensembl18k wrote:

No idea about UCSC, but you can do that using the Ensembl Region Report tool.

This allows you to inout genomic coordinates, then see everything that's within them. There's a tick box list where you can choose what to see. The options are:

Genes, Transcripts and Proteins

Genomic Sequence

Constrained Elements (Conserved Regions)

Variations (SNPs and InDels)

Structural Variations (CNVs etc)

Regulatory Features

ADD COMMENTlink written 6.5 years ago by Emily_Ensembl18k

Thank you for the reply. i am more interested in Constrained Elements (Conserved Regions) feature. does this tool supports graphical view? I know ECR browser does but I cannot give each coordinate manually.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by ancient_learner610

This will just give you a list.

ADD REPLYlink written 6.4 years ago by Emily_Ensembl18k
gravatar for Ashutosh Pandey
6.8 years ago by
Ashutosh Pandey11k wrote:

If you are comfortable with little programming and unix or you can use snpEff software and set up databases for different genomic elements like genes, transcription factor binding sites, enhancers etc then it is pretty simple thing to do. You can get most of the files you need from ENSEMBL

The cis-regulatory elements information could be derived from Regulations gff file and Regulation data files.

ADD COMMENTlink written 6.8 years ago by Ashutosh Pandey11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1106 users visited in the last hour