Question: Identify gene symbols given a list of chromosome positions
gravatar for cw00137
5.0 years ago by
United Kingdom
cw0013710 wrote:

I have downloaded ChIP-Seq data and managed to get to a point where I have a long list of chromosome positions and some expression data. My question is, how to map these chromosome locations to HUGO gene symbols? 


An example of my data is:

Peak GSM365925_ER_minus_ligand_align.bed GSM365926_ER_E2_align.bed
chr20:257411-257873| 7 49
chr20:363265-363667| 0 98
chr20:373762-374404| 3 170
chr20:549324-550256| 1 23

I would preferably like a solution in python, though I could stretch to one in perl and a solution with a GUI interface would also be welcome as I'm still a relatively inexperienced programmer. Thanks

ADD COMMENTlink modified 2.7 years ago by Tommy Carstensen190 • written 5.0 years ago by cw0013710
gravatar for brentp
5.0 years ago by
Salt Lake City, UT
brentp23k wrote:

Using python and cruzdb:

from cruzdb import Genome

hg19 = Genome('hg19')

for i, line in enumerate(open('dat.txt')):
    toks = line.split()
    if i == 0:
        print "\t".join(['gene'] + toks)
        chrom, posns = toks[0].split(":")
        start, end = map(int, posns.rstrip("|").split("-"))
        genes = hg19.bin_query('refGene', chrom, start, end)
        print "\t".join(["|".join(set(g.name2 for g in genes))] + toks)



ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by brentp23k
gravatar for Garan
5.0 years ago by
United Kingdom
Garan600 wrote:

You could try Biomart
There's a query engine under Tools > Browse data, use the Region filter field to fill in multiple regions

For example; chr20:257411-257873 would be (you might want to check both strands)


Under output section select HGNC Symbol from External references plus any others you want.

Or if you want to do this programmatically there's an example script here of a BioMart interface in Ruby which could be altered to return gene symbols by genomic position.


ADD COMMENTlink written 5.0 years ago by Garan600
gravatar for Alex Reynolds
5.0 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

First, grab RefSeq names via UCSC:

  1. Visit:
  2. Pick your assembly from the Assembly pull-down menu
  3. Choose RefSeq Genes from the Track pull-down menu
  4. Pick BED - browser extensible data from the Output Format pull-down menu
  5. Click on Get Output to download data in BED format (say, a file called "refseq.bed")

Second, convert your peaks to a sorted BED file:

$ tail -n+1 peaks.txt | cut -f1 -d '|' - | tr -s ':' '-' | sed s/-/\\t/g | sort-bed - > peaks.bed

Then do an efficient set operation with BEDOPS:

$ bedmap --echo --echo-map-id-uniq peaks.bed refseq.bed > peaks_with_refseq_annotations.bed
ADD COMMENTlink modified 7 weeks ago by RamRS25k • written 5.0 years ago by Alex Reynolds29k

Hi Alex - I am assuming that I need to upload my data using the 'Add Custom Tracks' thing? ... Also, the files I'm analyzing were downloaded and are already in .BED files. For each condition I have one 'peaks' file which I'm assuming is a positive control for analysis and a ' align' file. Are you suggesting I should sort them into another .BED file. Thanks. 

ADD REPLYlink written 5.0 years ago by cw0013710

No, I would not upload data, but download RefSeq gene names to a local BED file, and then do the mapping between peaks and RefSeq gene names with BEDOPS and standard UNIX command line tools.

If your data are already in BED format, you can just do something like: $ sort-bed peaks.bed > sorted_peaks.bed to ensure that the data are sorted. BEDOPS tools were written to take advantage of sorted BED data.

I know you said that you are not a programmer, but learning the command line is important enough that I figured I would suggest that approach with some example commands that give you the answer you want. I hope this answer gives you a taste of what you can do.

ADD REPLYlink written 5.0 years ago by Alex Reynolds29k

Hi alex! I'm wanting to do something similar but I only have the start site position in my sorted BED file e.g.

chr1 9061386
chr1 9061390

it has about ~20 lines, unfortunately my system doesn't have the most up to date version of gcc req'd for BEDOPS installation... is there another way of doing this at the command line without using BEDOPS? (yes I could install more up to date versions of gcc and g++ but that's currently out of my hands)

Look forward to your response!

ADD REPLYlink modified 15 months ago • written 15 months ago by s16671530
gravatar for Ian
5.0 years ago by
University of Manchester, UK
Ian5.5k wrote:

GALAXY is your friend here.  If you are going to be working with data based on genome coordinates then it will be you friend for "life" :)

You can upload you ChIP-seq regions, then (following Alex Reynolds instructions steps 1 to 4) upload the BED file of refseq genes.  You can overlap the genome coordinates using the intersect tool.  You may want to extend the refseq coordinates beyond the boundaries of the transcripts to increase the range of the possible overlap.


ADD COMMENTlink written 5.0 years ago by Ian5.5k
gravatar for Ming Tang
5.0 years ago by
Ming Tang2.5k
Houston/MD Anderson Cancer Center
Ming Tang2.5k wrote:

easiest way:



ADD COMMENTlink written 5.0 years ago by Ming Tang2.5k
gravatar for Tommy Carstensen
2.7 years ago by
United Kingdom
Tommy Carstensen190 wrote:

Here is yet another Python solution using pyensembl:

import pyensembl
ensembl = pyensembl.EnsemblRelease(release=75)
gene_names = ensembl.gene_names_at_locus(contig=chrom, position=pos)
ADD COMMENTlink written 2.7 years ago by Tommy Carstensen190
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1915 users visited in the last hour