Would you advise how to get promoter sequences for all or many human genes - in flat file(s) or by SQL query? I understand there can be multiple definitions of a promoter region, but anything universal would work.
My best bet would be to use BioMart's MartView: select a database, filter by the gene IDs you have (there are other ID options there too), and then use the sequence option in the attributes to choose which parts of the gene you want, be it exon, intron, promoter, upstream, downstream, etc.
I used this tool to get many upstream regions for mouse genes using just NCBI's gene IDs as input.
Here I query the UCSC anonymous MySQL server for the coordinates of the region between the transcription start and the CDS start (the 5' UTR, but you can extend this position to get a longer 'promoter'). The query below only handles strand "+"; for the reverse strand use cdsEnd and txEnd. It builds a cURL command for the UCSC DAS server, which is then piped into sh to fetch the genomic sequences.
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N \
  -e 'select concat("curl http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=",chrom,":",txStart+1,",",cdsStart+1) from knownGene where strand="+" and txStart!= cdsStart limit 10' |\
  sh > result.concatenated.xml
http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="11874" stop="12190" version="1.00">
<DNA length="317">
cttgccgtcagccttttctttgacctcttctttctgttcatgtgtatttg
ctgtctcttagcccagacttcccgtgtcctttccaccgggcctttgagag
gtcacagggtcttgatgctgtggtcttcatctgcaggtgtctgacttcca
gcaactgctggcctgtgccagggtgcaagctgagcactggagtggagttt
tcctgtggagaggagccatgcctagagtgggatgggccattgttcatctt
ctggcccctgttgtctgcatgtaacttaataccacaaccaggcatagggg
aaagattggaggaaaga
</DNA>
</SEQUENCE>
</DASDNA>
http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="322037" stop="324343" version="1.00">
<DNA length="2307">
gggtctccctctgttgtccaaggctggagtgtagtagtgctatcgcagct
gactgcagcctcaaccttccaggctgaagcgatcctcccacctcaacctc
ccacgtggctgagactacaggtgcttgccactatgcccaactaacatttg
gaattttcgtatacgtggattccagaggggtgacagcgaaacgtgagtaa
(...)
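The strand handling described above can also be done after the query instead of inside it. The following is only a sketch: `utr5_segments` is a hypothetical helper name, and it assumes you have dumped tab-separated rows of chrom, strand, txStart, txEnd, cdsStart, cdsEnd from knownGene (for example with `mysql -N`). It converts UCSC's 0-based, half-open table coordinates to the 1-based inclusive `chrom:start,stop` segments that DAS expects, so on the plus strand the segment stops at cdsStart rather than cdsStart+1. It also assumes non-coding entries can be recognised by cdsStart being equal to cdsEnd.

```shell
# Hypothetical helper: print DAS segment strings for the 5-prime UTR
# of each coding transcript, handling both strands.
# Input (stdin, tab-separated): chrom strand txStart txEnd cdsStart cdsEnd
utr5_segments() {
  awk -F'\t' '
    # assumption: non-coding transcripts have cdsStart == cdsEnd
    $5 == $6 { next }
    # plus strand: UTR lies between txStart and cdsStart
    $2 == "+" && $3 < $5 { printf "%s:%d,%d\n", $1, $3 + 1, $5 }
    # minus strand: UTR lies between cdsEnd and txEnd
    $2 == "-" && $6 < $4 { printf "%s:%d,%d\n", $1, $6 + 1, $4 }
  '
}

# Example usage with two made-up rows:
# printf 'chr1\t+\t11873\t14409\t12189\t13639\nchr2\t-\t100\t500\t150\t400\n' | utr5_segments
```

Keep in mind that the DAS server returns the forward-strand sequence, so segments from minus-strand genes would still need to be reverse-complemented.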
This is trivial to do with the UCSC table browser.
Select the gene track of interest. Then, select "sequence" for output option. Click "get output". On the next page, select "genomic" and click submit. On the next page, click the appropriate boxes, one of which is upstream by N bases. Your output will be the actual sequence. Alternatively, you can get just the coordinates by changing the parameters on the first table browser page.
The Regulatory Sequence Analysis Tools website is very handy for obtaining promoter sequences, and it lets you customize the upstream and/or downstream size relative to certain landmarks. Even better, a bunch of species are supported.
Here's a command-line based method using UCSC and bedtools. It assumes you have a local copy of the genome, bedtools installed, and the promoter is some number that you choose relative to the transcription start site (TSS). At UCSC the left coordinate is always txStart, whereas the TSS is where transcription starts and can be the left or right coordinate depending on strand. Thus to get TSS I grab txStart for positive strand genes and txEnd for negative strand genes. I'm not sure if there's a way to combine this into a single SQL statement.
Step 1 - get TSS from UCSC using MySQL, use tail to remove header line:
# TSS for plus strand genes
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e \
  'SELECT chrom,txStart,txStart,"TSS",".",strand FROM knownGene WHERE strand = "+";' \
  | tail -n +2 > tss.bed
# concatenate TSS for negative strand genes
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e \
  'SELECT chrom,txEnd,txEnd,"TSS",".",strand FROM knownGene WHERE strand = "-";' \
  | tail -n +2 >> tss.bed
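On combining the two strands into one pass: MySQL's `IF()` function should allow a single query along the lines of `SELECT chrom, IF(strand="+", txStart, txEnd), ...`, though I have not run that against the UCSC server. An alternative that avoids the question is to pull chrom, txStart, txEnd and strand once and pick the TSS afterwards. The sketch below does that with awk; `tss_bed` is a hypothetical helper name, and it keeps the same six-column layout as the two queries above.

```shell
# Hypothetical helper: choose the TSS per strand in a single pass.
# Input (stdin, tab-separated): chrom txStart txEnd strand
# Output: chrom tss tss "TSS" "." strand
tss_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         # plus strand starts at txStart, minus strand at txEnd
         tss = ($4 == "+") ? $2 : $3
         print $1, tss, tss, "TSS", ".", $4
       }'
}

# Possible usage (needs network access, not run here):
# mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N \
#   -e 'SELECT chrom,txStart,txEnd,strand FROM knownGene;' | tss_bed > tss.bed
```

Using `-N` in the mysql call suppresses the header line, so the `tail -n +2` step is not needed in this variant.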
Step 2 - use bedtools flank or slop to adjust the coordinates. The -s flag will take strand into account for determining left or right. Flank takes the bases next to your TSS; slop includes the TSS itself. A nice advantage of bedtools for this is that you hand it a file of chromosome sizes, so it doesn't create coordinates beyond chromosomal ends.
# create promoter coordinates, 1000 bases upstream of TSS for example
bedtools flank -i tss.bed -g hg19.chrom.sizes -s -l 1000 -r 0 > promoter_coords.bed
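To make the coordinate arithmetic in this step explicit, here is a rough awk sketch of what I understand `flank -s -l N -r 0` to do on the TSS file: take N bases upstream on the correct side for each strand and clamp the result to the chromosome bounds. It is an illustration, not a replacement for bedtools; `promoter_coords` is a hypothetical helper name, and the chromosome sizes file is assumed to be tab-separated chrom and size.

```shell
# Hypothetical helper: strand-aware upstream window, clamped to chromosome ends.
# usage: promoter_coords CHROM_SIZES UPSTREAM < tss.bed
promoter_coords() {
  awk -v up="$2" 'BEGIN { FS = OFS = "\t" }
    NR == FNR { size[$1] = $2; next }   # first file: chromosome sizes
    !($1 in size) { next }              # skip chromosomes with unknown size
    {
      if ($6 == "+") { s = $2 - up; e = $2 }   # upstream is to the left
      else           { s = $3; e = $3 + up }   # upstream is to the right
      if (s < 0) s = 0                         # clamp at chromosome start
      if (e > size[$1]) e = size[$1]           # clamp at chromosome end
      if (s < e) print $1, s, e, $4, $5, $6    # drop empty intervals
    }' "$1" -
}

# Example usage:
# promoter_coords hg19.chrom.sizes 1000 < tss.bed > promoter_coords.bed
```

Writing it out this way also shows why the strand flag matters: without it, minus-strand genes would get the window on the wrong side of the TSS.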
Step 3 - use bedtools to extract DNA sequences:
# extract DNA sequence from fasta file
# (add -s to getfasta if you want minus-strand promoters reverse-complemented)
bedtools getfasta -fi genomes/hg19/all_chr.fa -bed promoter_coords.bed -fo promoter_seq.fa
This should work for any model organism at UCSC; just select the right db and table names.
Look at the web site below for about 13,000 human promoter sequences.
At the University of Virginia we have collected more than 13,000 promoters of human genes. These are available online for download at the URL given below.