Here's another way, using the command line, which could be useful for automation or scripting.
You can run a mysql query against the UCSC Genome Browser for a specific gene and build, e.g., human CTCF:
$ gene="ctcf"
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}';" hg19
+-------+----------+----------+------+
| chr16 | 67596309 | 67673088 | CTCF |
+-------+----------+----------+------+
Just replace the value of gene with your gene-of-interest, and modify the build (hg19) if you're interested in a different reference genome or organism.
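If you plan to script this, it may help to wrap the query in a small function so the gene and build become arguments. This is only a sketch: the function names gene_coords_sql and gene_coords are made up for illustration, and it assumes the build you pass has the same knownCanonical, knownGene and kgXref tables as hg19, which may not hold for every organism.

```shell
# Hypothetical wrapper around the query above; function names are illustrative.
# Usage: gene_coords SYMBOL [BUILD]   (BUILD defaults to hg19)

gene_coords_sql() {
  # Print the SQL for one gene symbol, so it can be inspected before running.
  # Assumes the symbol contains no quote characters.
  printf "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '%s';" "$1"
}

gene_coords() {
  # Run the query against the public UCSC server (needs network access).
  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N \
    -e "$(gene_coords_sql "$1")" "${2:-hg19}"
}
```

With that defined, gene_coords ctcf would run the same query as above, and gene_coords ctcf hg38 would point it at a different build.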
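Some genes have more than one transcript, and each gene sits on a particular strand. You can add LIMIT 1 to the SQL query to grab just the first hit, and add the kg.strand field to get back the strand. The literal 0 is a placeholder score, so the output follows the six-column BED layout: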
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19
These commands print to standard output, so to get promoters (say, a window 500 bases upstream of the 5' end), you could pipe to awk:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19 | awk -v Window=500 -v OFS='\t' '{ if ($6=="+") { $3 = $2; $2 = $2 - Window; print $0; } else { $2 = $3 - 1; $3 = $3 + Window; print $0; } }' - > ${gene}.promoter.bed
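To check the window arithmetic without hitting the server, you can feed the awk step a hand-written line. The sketch below wraps it in a function called promoter_window (a name invented here) and mirrors the strand logic of the one-liner. Two things are assumptions on my part: the clamp to 0 for genes near the start of a chromosome, and the + strand on the sample CTCF line, since the strand isn't shown in the earlier output.

```shell
# Hypothetical helper: read six-column BED on stdin, print promoter windows.
# Usage: promoter_window [WINDOW]   (WINDOW defaults to 500 bases)
promoter_window() {
  awk -v Window="${1:-500}" -v OFS='\t' '
    {
      if ($6 == "+") {
        # Plus strand: the five-prime end is txStart, so the window ends there.
        $3 = $2
        $2 = $2 - Window
        if ($2 < 0) $2 = 0   # assumption: clamp at the chromosome start
      } else {
        # Minus strand: the five-prime end is txEnd, so the window starts there
        # (same arithmetic as the one-liner above).
        $2 = $3 - 1
        $3 = $3 + Window
      }
      print
    }'
}

# Sample line built from the CTCF coordinates above; the strand is assumed.
printf 'chr16\t67596309\t67673088\tCTCF\t0\t+\n' | promoter_window 500
```

In a pipeline you would put promoter_window 500 after the mysql command in place of the inline awk program.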
To do this over a set of genes, you could pass in a formatted string of gene names:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol IN ('gene1', 'gene2', ...);" hg19 > genes.bed
where gene1, gene2, etc. are names of genes of interest.
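If the gene names live in a file, one per line, the quoted list can be generated rather than typed. A minimal sketch, assuming a helper named quote_gene_list (not part of any tool) and a file you supply yourself. It silently skips blank lines and any name containing characters other than letters, digits, dot, underscore or hyphen, which is a cautious choice on my part so that stray quotes can't end up inside the SQL.

```shell
# Hypothetical helper: turn gene symbols (one per line on stdin) into a
# quoted, comma-separated list suitable for an SQL IN (...) clause.
quote_gene_list() {
  # \047 is the octal escape for a single quote inside an awk string.
  awk 'NF && $1 ~ /^[A-Za-z0-9._-]+$/ {
         printf "%s\047%s\047", (n++ ? ", " : ""), $1
       }
       END { if (n) print "" }'
}
```

You could then capture the list with something like genes=$(quote_gene_list < my_genes.txt), where my_genes.txt is your own file, and write IN (${genes}) in the query instead of spelling the names out by hand.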
Should have probably searched the Table Browser harder. There's very clearly an option to input a list of identifiers there.
Thank you! If you put this as an answer I can accept it.
And an output as BED option too.