How to quickly obtain a BED file of genes using their gene symbols?
2
0
Entering edit mode
7.7 years ago
Sinji ★ 3.2k

Looking to solve a relatively easy problem that I can't seem to find a simple answer to.

I conducted some DGE analysis on some RNA-seq data (using DESeq2) and have a list of ~2000 differentially expressed genes. I'm interested in overlaying some ChIP-seq data at promoters of these ~2000 genes, and am trying to find a way to obtain a bed file of these ~2000 genes using gene symbol, or ensembl ID as input.

Is there a easy way to do this?

ChIP-Seq bed gene symbol • 2.2k views
ADD COMMENT
5
Entering edit mode
7.7 years ago

Here's another way, using the command line, which could be useful for automation or scripting.

You can do a mysql query of the UCSC Genome Browser for a specific gene and build, e.g., human CTCF:

$ gene="ctcf"
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}';" hg19
+-------+----------+----------+------+
| chr16 | 67596309 | 67673088 | CTCF |
+-------+----------+----------+------+

Just replace the value of gene with your gene-of-interest, and modify the build (hg19) if you're interested in a different reference genome or organism.

Some genes have more than one transcript and are localized to a strand. You can add LIMIT 1 to the SQL query to just grab the first hit, and add the kg.strand field to get back the strand:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19

These commands print to standard output, so to get promoters (say, a window 500 bases upstream of the 5' end), you could pipe to awk:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19 | awk -vWindow=500 '{if ($6=="+") { $3 = $2; $2 = $2 - Window; print $0; } else { $2 = $3 - 1; $3 = $3 + Window; print $0; }' - > ${gene}.promoter.bed

To do this over a set of genes, you could pass in a formatted string of gene names:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol IN ('gene1', 'gene2', ...) LIMIT 1;" hg19 > genes.bed

where gene1, gene2 etc. are names of genes of interest.

ADD COMMENT
3
Entering edit mode
7.7 years ago
GenoMax 141k

How about Table browser @UCSC?

ADD COMMENT
0
Entering edit mode

Should have probably searched the Table Browser harder. There's very clearly an option to input a list of identifiers there.

Thank you! If you put this as an answer I can accept it.

ADD REPLY
0
Entering edit mode

And an output as BED option too.

ADD REPLY

Login before adding your answer.

Traffic: 2896 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6