Question

How to quickly obtain a BED file of genes using their gene symbols?

0

Entering edit mode

7.7 years ago

Sinji ★ 3.2k

Looking to solve a relatively easy problem that I can't seem to find a simple answer to.

I conducted some DGE analysis on some RNA-seq data (using DESeq2) and have a list of ~2000 differentially expressed genes. I'm interested in overlaying some ChIP-seq data at promoters of these ~2000 genes, and am trying to find a way to obtain a bed file of these ~2000 genes using gene symbol, or ensembl ID as input.

Is there a easy way to do this?

ChIP-Seq bed gene symbol • 2.2k views

ADD COMMENT • link updated 7.7 years ago by Alex Reynolds 35k • written 7.7 years ago by Sinji ★ 3.2k

score 5 · Accepted Answer · 2016-08-24

Here's another way, using the command line, which could be useful for automation or scripting.

You can do a mysql query of the UCSC Genome Browser for a specific gene and build, e.g., human CTCF:

$ gene="ctcf"
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}';" hg19
+-------+----------+----------+------+
| chr16 | 67596309 | 67673088 | CTCF |
+-------+----------+----------+------+

Just replace the value of gene with your gene-of-interest, and modify the build (hg19) if you're interested in a different reference genome or organism.

Some genes have more than one transcript and are localized to a strand. You can add LIMIT 1 to the SQL query to just grab the first hit, and add the kg.strand field to get back the strand:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19

These commands print to standard output, so to get promoters (say, a window 500 bases upstream of the 5' end), you could pipe to awk:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19 | awk -vWindow=500 '{if ($6=="+") { $3 = $2; $2 = $2 - Window; print $0; } else { $2 = $3 - 1; $3 = $3 + Window; print $0; }' - > ${gene}.promoter.bed

To do this over a set of genes, you could pass in a formatted string of gene names:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol IN ('gene1', 'gene2', ...) LIMIT 1;" hg19 > genes.bed

where gene1, gene2 etc. are names of genes of interest.

score 3 · Accepted Answer · 2016-08-24

3

Entering edit mode

7.7 years ago

GenoMax 141k

How about Table browser @UCSC?

ADD COMMENT • link 7.7 years ago by GenoMax 141k

0

Entering edit mode

Should have probably searched the Table Browser harder. There's very clearly an option to input a list of identifiers there.

Thank you! If you put this as an answer I can accept it.

ADD REPLY • link 7.7 years ago by Sinji ★ 3.2k

0

Entering edit mode

And an output as BED option too.

ADD REPLY • link 7.7 years ago by GenoMax 141k