Question: How to quickly obtain a BED file of genes using their gene symbols?
0
gravatar for Sinji
4.2 years ago by
Sinji3.0k
UT Southwestern Medical Center
Sinji3.0k wrote:

Looking to solve a relatively easy problem that I can't seem to find a simple answer to.

I conducted some DGE analysis on some RNA-seq data (using DESeq2) and have a list of ~2000 differentially expressed genes. I'm interested in overlaying some ChIP-seq data at promoters of these ~2000 genes, and am trying to find a way to obtain a bed file of these ~2000 genes using gene symbol, or ensembl ID as input.

Is there a easy way to do this?

chip-seq gene symbol bed • 1.3k views
ADD COMMENTlink modified 4.2 years ago by Alex Reynolds31k • written 4.2 years ago by Sinji3.0k
5
gravatar for Alex Reynolds
4.2 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Here's another way, using the command line, which could be useful for automation or scripting.

You can do a mysql query of the UCSC Genome Browser for a specific gene and build, e.g., human CTCF:

$ gene="ctcf"
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}';" hg19
+-------+----------+----------+------+
| chr16 | 67596309 | 67673088 | CTCF |
+-------+----------+----------+------+

Just replace the value of gene with your gene-of-interest, and modify the build (hg19) if you're interested in a different reference genome or organism.

Some genes have more than one transcript and are localized to a strand. You can add LIMIT 1 to the SQL query to just grab the first hit, and add the kg.strand field to get back the strand:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19

These commands print to standard output, so to get promoters (say, a window 500 bases upstream of the 5' end), you could pipe to awk:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol LIKE '${gene}' LIMIT 1;" hg19 | awk -vWindow=500 '{if ($6=="+") { $3 = $2; $2 = $2 - Window; print $0; } else { $2 = $3 - 1; $3 = $3 + Window; print $0; }' - > ${gene}.promoter.bed

To do this over a set of genes, you could pass in a formatted string of gene names:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol, 0, kg.strand FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name AND x.geneSymbol IN ('gene1', 'gene2', ...) LIMIT 1;" hg19 > genes.bed

where gene1, gene2 etc. are names of genes of interest.

ADD COMMENTlink written 4.2 years ago by Alex Reynolds31k
3
gravatar for genomax
4.2 years ago by
genomax91k
United States
genomax91k wrote:

How about Table browser @UCSC?

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by genomax91k

Should have probably searched the Table Browser harder. There's very clearly an option to input a list of identifiers there.

Thank you! If you put this as an answer I can accept it.

ADD REPLYlink written 4.2 years ago by Sinji3.0k

And an output as BED option too.

ADD REPLYlink written 4.2 years ago by genomax91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2020 users visited in the last hour