Question: How To Get Promoter Sequences For Human Genes?
Yuri1.1k wrote:

Could you advise how to get promoter sequences for all or many human genes, either as flat file(s) or via an SQL query? I understand there can be multiple definitions of a promoter region, but anything reasonably universal would work.

ADD COMMENTlink modified 4.0 years ago by Gurado260 • written 4.0 years ago by Yuri1.1k

I asked a similar question here: http://biostar.stackexchange.com/questions/544/suggestions-developing-a-pipe-line-for-scanning-genomic-regions-to-identify-kno
My question was generic in nature and there was not much response. Looking forward to others' comments on this specific question.

ADD REPLYlink written 4.0 years ago by Khader Shameer14k

The problem is that there is no standard definition of a promoter: for some it means a number of bases upstream of the ATG, for others the TATA box, etc.

ADD REPLYlink written 4.0 years ago by Giovanni M Dall'Olio17k

@giovanni: I completely agree.

ADD REPLYlink written 4.0 years ago by Yuri1.1k
Paulo Nuin3.5k wrote:

My best bet would be to use BioMart's MartView: select a database, filter by the gene IDs you have (there are other ID options there too), and then use the sequence option in the attributes to choose which parts of the gene you want, be it exon, intron, promoter, upstream, downstream, etc.

I used this tool to get many upstream regions for mouse genes using just NCBI's gene IDs as input.

ADD COMMENTlink written 4.0 years ago by Paulo Nuin3.5k

Which database would you recommend? I have HUGO gene symbols.

ADD REPLYlink written 4.0 years ago by Yuri1.1k

This url will give you an idea about what you need to do: http://www.biomart.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.sequences.ensembl_gene_id|hsapiens_gene_ensembl.default.sequences.ensembl_transcript_id|hsapiens_gene_ensembl.default.sequences.5utr&FILTERS=&VISIBLEPANEL=attributepanel

Just click on Filters, enter the HGNC ids in the box and run the search.
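
The same kind of request can also be written as a BioMart XML query instead of being clicked together in MartView. This is a minimal sketch, not a tested recipe: the dataset name hsapiens_gene_ensembl comes from the URL above, while the filter names hgnc_symbol and upstream_flank, the attribute gene_flank, and the function name biomart_query are assumptions that should be checked against the attribute and filter lists of the mart you actually use.

```shell
#!/bin/sh
# Hypothetical helper: print a BioMart XML query asking for the flanking
# sequence upstream of each gene in a comma-separated list of symbols.
# Assumed names (verify in your mart): hgnc_symbol, upstream_flank, gene_flank.
biomart_query() {
    symbols=$1          # e.g. "TP53,BRCA1"
    flank=${2:-1000}    # number of upstream bases, default 1000
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="hgnc_symbol" value="$symbols"/>
    <Filter name="upstream_flank" value="$flank"/>
    <Attribute name="gene_flank"/>
    <Attribute name="ensembl_gene_id"/>
  </Dataset>
</Query>
EOF
}

# Possible use (the service URL is an assumption, not taken from this thread):
#   curl -s --data-urlencode "query=$(biomart_query TP53,BRCA1 2000)" \
#        http://www.ensembl.org/biomart/martservice > upstream.fa
```

Keeping the query in a function makes it easy to change the flank size or the gene list without touching the web interface.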

ADD REPLYlink written 4.0 years ago by Paulo Nuin3.5k

I like both solutions, but this one is more straightforward and allows downloading a large number of sequences at once. Pierre's solution would also be very useful in other cases. Thank you, guys.

ADD REPLYlink written 4.0 years ago by Yuri1.1k
Pierre Lindenbaum58k wrote:

Here I query the UCSC anonymous MySQL server for the coordinates of the region between the transcription start and the CDS start (the 5' UTR, but you can extend this position to get a longer 'promoter'). This works only for strand "+"; for the reverse strand use cdsEnd and txEnd. The query builds a cURL command for the UCSC DAS server, which is then piped into sh to fetch the genomic sequences.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N \
 -e 'select concat("curl http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=",chrom,":",txStart+1,",",cdsStart+1) from knownGene where strand="+" and  txStart!= cdsStart limit 10'  |\
 sh > result.concatenated.xml

result:


<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="11874" stop="12190" version="1.00">
<DNA length="317">
cttgccgtcagccttttctttgacctcttctttctgttcatgtgtatttg
ctgtctcttagcccagacttcccgtgtcctttccaccgggcctttgagag
gtcacagggtcttgatgctgtggtcttcatctgcaggtgtctgacttcca
gcaactgctggcctgtgccagggtgcaagctgagcactggagtggagttt
tcctgtggagaggagccatgcctagagtgggatgggccattgttcatctt
ctggcccctgttgtctgcatgtaacttaataccacaaccaggcatagggg
aaagattggaggaaaga
</DNA>
</SEQUENCE>
</DASDNA>

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="322037" stop="324343" version="1.00">
<DNA length="2307">
gggtctccctctgttgtccaaggctggagtgtagtagtgctatcgcagct
gactgcagcctcaaccttccaggctgaagcgatcctcccacctcaacctc
ccacgtggctgagactacaggtgcttgccactatgcccaactaacatttg
gaattttcgtatacgtggattccagaggggtgacagcgaaacgtgagtaa
(...)
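
If FASTA is more convenient than the concatenated DAS XML shown above, the output can be converted with a little awk. This is a rough sketch that assumes the layout seen in the result (one SEQUENCE tag per line, sequence on its own lines); the function name das_to_fasta is hypothetical, and a real XML parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: turn concatenated DASDNA XML into FASTA.
# Each <SEQUENCE ...> line becomes a header ">id:start-stop";
# lines made only of nucleotide letters are passed through as sequence.
das_to_fasta() {
    awk '
        /<SEQUENCE / {
            id = start = stop = ""
            # Pull the attribute values out of the tag by position.
            if (match($0, /id="[^"]*"/))    id    = substr($0, RSTART + 4, RLENGTH - 5)
            if (match($0, /start="[^"]*"/)) start = substr($0, RSTART + 7, RLENGTH - 8)
            if (match($0, /stop="[^"]*"/))  stop  = substr($0, RSTART + 6, RLENGTH - 7)
            printf ">%s:%s-%s\n", id, start, stop
            next
        }
        /^[acgtnACGTN]+$/ { print }
    ' "$@"
}

# Possible use:
#   das_to_fasta result.concatenated.xml > result.fa
```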
ADD COMMENTlink written 4.0 years ago by Pierre Lindenbaum58k

Thanks, Pierre, I'll try it. Do you know what the highest allowed LIMIT is?

ADD REPLYlink written 4.0 years ago by Yuri1.1k

@yuri don't be evil with UCSC :-) http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum58k
Sean Davis15k wrote:

This is trivial to do with the UCSC table browser.

http://genome.ucsc.edu/cgi-bin/hgTables?command=start

Select the gene track of interest. Then, select "sequence" for output option. Click "get output". On the next page, select "genomic" and click submit. On the next page, click the appropriate boxes, one of which is upstream by N bases. Your output will be the actual sequence. Alternatively, you can get just the coordinates by changing the parameters on the first table browser page.
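
If only the coordinates are exported, an "upstream by N bases" window can be derived from them afterwards. This is a minimal sketch under stated assumptions: the input is tab-separated chrom, strand, txStart, txEnd with a 0-based start (as in UCSC gene tables), the output uses the chrom:start,end segment syntax with 1-based inclusive positions, and the function name upstream_segments is hypothetical. For "-" strand genes the window lies after txEnd, and the fetched sequence would still need to be reverse-complemented.

```shell
#!/bin/sh
# Hypothetical helper: print the N bases immediately upstream of the
# transcription start as "chrom:start,end" (1-based, inclusive).
# Reads tab-separated lines from stdin: chrom, strand, txStart, txEnd.
upstream_segments() {
    awk -F'\t' -v n="${1:-1000}" '
        $2 == "+" { s = $3 - n + 1; e = $3 }       # window ends just before txStart+1
        $2 == "-" { s = $4 + 1;     e = $4 + n }   # window starts just after txEnd
        $2 == "+" || $2 == "-" {
            if (s < 1) s = 1                        # clamp at the chromosome start
            if (e >= s) printf "%s:%d,%d\n", $1, s, e
        }
    '
}

# Possible use:
#   printf 'chr1\t+\t11873\t14409\n' | upstream_segments 1000
```

The window is not clamped at the chromosome end, so segments for "-" strand genes close to a chromosome end may need trimming.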

Sean

ADD COMMENTlink written 4.0 years ago by Sean Davis15k

But you cannot do that for thousands of genes...

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum58k

In fact you can do it for thousands of genes, but it will be slow as molasses: just click on "upload identifiers" and you get an input box. One problem with the UCSC approach is that it sometimes doesn't find the IDs you are looking for and silently leaves them out of the output. BioMart outputs everything, even the "empty" ones.
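
Since missing IDs are not reported, it may be worth comparing the submitted list with the returned FASTA headers. This is a minimal sketch that assumes the identifier is the first whitespace-delimited token after ">" in each header; real header formats differ between services, so the parsing will likely need adjusting. The function name missing_ids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list requested IDs that have no FASTA record.
# $1: file with one requested ID per line
# $2: FASTA file returned by the service
missing_ids() {
    awk '
        # First pass (the FASTA file): remember the first token of each header.
        FILENAME == ARGV[1] {
            if (/^>/) { sub(/^>/, ""); split($0, a, /[ \t]/); seen[a[1]] = 1 }
            next
        }
        # Second pass (the ID list): print IDs never seen in a header.
        NF && !($1 in seen) { print $1 }
    ' "$2" "$1"
}

# Possible use:
#   missing_ids ids.txt sequences.fa > not_found.txt
```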

ADD REPLYlink written 4.0 years ago by Paulo Nuin3.5k

Just to be clear, what I described in my answer is for ALL transcripts in a track of interest, and it returns almost instantaneously (within network bandwidth constraints, of course). nuin points out that it is straightforward to give a list of IDs, but the point about the "empty" ones is valid.

ADD REPLYlink written 4.0 years ago by Sean Davis15k
Gurado260 wrote:

The Regulatory Sequence Analysis Tools (RSAT) website is very handy for obtaining promoter sequences and allows nice customization of the upstream and/or downstream size with respect to certain landmarks. Even better, a bunch of species are supported.

http://rsat.ulb.ac.be/rsat/

ADD COMMENTlink written 4.0 years ago by Gurado260