Retrieve promoter sequences of all human genes
6.0 years ago
JJ ▴ 680

Hi,

I am hoping for some insight and suggestions concerning retrieving promoter sequences:

  • The quickest solution I found was using the UCSC Table Browser as described here. However, I would rather use the GENCODE v27 basic annotation instead of the knownGene table to limit the number of sequences a little. (Is there a way to get only the promoter sequences of protein-coding genes through the UCSC Table Browser?)
  • The prepackaged 1000 bp downloads from UCSC are unfortunately not available for hg38!
  • The biomaRt R package with getSequence(), as described here, which I think is the nicest solution. I retrieved the gene identifiers from the GENCODE v27 basic annotation, prefiltered for protein-coding genes.
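
For the prefiltering step in the last point, here is a minimal Python sketch of how the protein-coding gene identifiers could be pulled out of a GENCODE GTF. The function name is hypothetical, and it assumes the usual GENCODE attribute layout (`gene_id "..."; gene_type "...";`). Stripping the version suffix is also an assumption, made because Ensembl BioMart filters typically expect unversioned IDs.

```python
import re


def protein_coding_gene_ids(gtf_lines):
    """Collect unversioned gene IDs of protein-coding 'gene' records from GTF lines."""
    gene_ids = []
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level records
        # Parse key "value" pairs from the attribute column.
        attrs = dict(re.findall(r'(\w+) "([^"]*)"', fields[8]))
        if attrs.get("gene_type") != "protein_coding":
            continue
        # Assumption: drop the version suffix (e.g. ".6") from the ID.
        gene_id = attrs.get("gene_id", "").split(".")[0]
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            gene_ids.append(gene_id)
    return gene_ids
```

The resulting list could then be written to a file and read into R as the `values` for a biomaRt query.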

Any other suggestions as to how to retrieve them?

I have a couple of fundamental questions here:

  • I would define promoter sequences without the UTRs. However, the biomaRt description suggests using seqType="coding_gene_flank" to get the promoter sequences. I would use "gene_flank" to exclude the UTRs. I am not sure what the UCSC Table Browser does here: do I get the sequences upstream of the TSS, so no UTRs? What happens if the UTRs are not known?
  • Should I retrieve the promoter sequences by transcript or by gene? The UCSC Table Browser does it by transcript. biomaRt allows both; I assume that at the gene level it uses the main transcript? Which is more sensible to use? I tend towards the gene level, as it again limits the number of sequences.
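
To make the "upstream of the TSS, no UTRs" definition concrete, here is a small sketch of a strand-aware promoter window. It is only an illustration of the coordinate arithmetic, not what biomaRt or the UCSC Table Browser do internally. The function name and parameters are hypothetical; it assumes a 1-based TSS position (as in GTF) and returns a 0-based, half-open interval (as in BED).

```python
def promoter_interval(tss, strand, upstream=1000, downstream=0, chrom_length=None):
    """Return a 0-based, half-open (start, end) window around a 1-based TSS.

    `upstream` bases precede the TSS in transcript orientation; `downstream`
    bases start at the TSS itself. With downstream=0 the TSS base and any
    5' UTR are excluded.
    """
    if strand == "+":
        start = tss - 1 - upstream
        end = tss - 1 + downstream
    elif strand == "-":
        # On the minus strand, "upstream" means higher genomic coordinates.
        start = tss - downstream
        end = tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    start = max(start, 0)  # clip at the chromosome start
    if chrom_length is not None:
        end = min(end, chrom_length)  # clip at the chromosome end
    return start, end
```

For a gene-level window, one would pass the gene's 5' end as the TSS (the gene start on the plus strand, the gene end on the minus strand); for a transcript-level window, the transcript's 5' end.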

I want to look for TFBSs in the retrieved promoter sequences.

Thanks in advance for any advice!!


Apologies that nobody has answered. If I were interested in promoter sequences, I would first have gone to the FANTOM5 data. I would download the coordinates of all of their human promoters, obtain the sequences for these with gffread, and then proceed from there with the TFBS analysis.


Thank you for your answer and the tip! While searching through the FANTOM5 data and reading up on it, I also found this database. It integrates the FANTOM5 data, and one can readily download the most representative promoter per gene. I haven't found this for the FANTOM5 data itself: there I only found the coordinates of all promoters, without annotation to genes. Maybe I just overlooked it... Also, is FANTOM5 only available for hg19?


Yes, I believe that it is only available for hg19, but you could 'lift' the coordinates over to hg38 using the UCSC LiftOver tool (be wary, though, as some regions in hg19 may have been excluded from hg38, for whatever reason).

Once you have the FASTA for each promoter region from gffread (it came bundled with Cufflinks and StringTie, last time I checked), you can identify the TFBS motifs in these sequences. There are other threads regarding this part, I believe. I can suggest a few databases too, if needed.
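
Whichever tool produces the FASTA, the extraction has to be strand-aware so that each promoter reads 5' to 3' relative to its gene. Here is a minimal Python sketch of that step, assuming the chromosome sequence is already loaded as a string and the interval is 0-based and half-open. The function names are hypothetical and this is not how gffread is implemented.

```python
# Translation table for complementing DNA, preserving case and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq, start, end, strand):
    """Slice [start, end) from a chromosome and orient it 5'->3' for the gene."""
    seq = chrom_seq[start:end]
    # Minus-strand promoters must be reverse-complemented.
    return reverse_complement(seq) if strand == "-" else seq
```

Forgetting the reverse complement for minus-strand genes is an easy mistake, and it would silently place motif hits on the wrong strand.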


Thank you for your answer and the tip with gffread - very useful utility.

Is there any reason not to use the EPD TSSs? They use RIKEN/ENCODE CAGE data, FANTOM5 data and EPD (old) to define their TSSs. I am thinking of using gffread or biomaRt to retrieve the sequences of EPD TSS + 1000 bp upstream.

It would be great if you could suggest some databases. So far I have looked at JASPAR, of course, and downloaded the motifs of the TFs I am interested in. I have tried FIMO (MEME Suite) and scanMotifGenomeWide.pl (HOMER) for now. I would then feed these the promoter sequences. Any other suggestions here?
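
To illustrate what such scanners compute at their core, here is a rough Python sketch of log-odds scoring with a position weight matrix built from a count matrix. This is not FIMO's or HOMER's actual algorithm (those add p-values, thresholds and reverse-strand scanning); the function names, the pseudocount and the uniform background of 0.25 are all assumptions.

```python
import math


def pwm_from_counts(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts mapping 'A', 'C', 'G', 'T' to counts, one per position.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Yield (offset, score) for every forward-strand window of motif width."""
    width = len(pwm)
    seq = sequence.upper()
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        yield i, sum(col[base] for col, base in zip(pwm, window))
```

Scanning the reverse complement of each promoter as well would cover sites on the other strand.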

JASPAR also provides a BED file with all predicted motif hits for hg38, as far as I have seen. Finally, I discovered ReMap, which provides merged ChIP-seq peaks for a number of TFs (unfortunately not for all I am interested in), based on numerous ChIP-seq experiments. I could intersect both with the promoter regions.
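
The intersection itself boils down to an interval-overlap test per chromosome. Here is a minimal sketch, assuming both inputs are already parsed into tuples with 0-based, half-open BED coordinates; the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def hits_in_promoters(promoters, hits):
    """Map each promoter name to the labels of hits that overlap it.

    promoters: iterable of (chrom, start, end, name)
    hits:      iterable of (chrom, start, end, label)
    """
    hits_by_chrom = defaultdict(list)
    for chrom, start, end, label in hits:
        hits_by_chrom[chrom].append((start, end, label))

    result = {}
    for chrom, p_start, p_end, name in promoters:
        # Half-open intervals overlap when each starts before the other ends.
        result[name] = [
            label
            for h_start, h_end, label in hits_by_chrom.get(chrom, [])
            if h_start < p_end and h_end > p_start
        ]
    return result
```

This compares every promoter with every hit on the same chromosome, which is fine for a handful of TFs but slow genome-wide; a dedicated tool such as bedtools would be the more practical choice at that scale.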


Please refer to GenoMax's answer. I was not sure of an answer, so I originally posted only a comment in the hope of drawing attention to your question (which seems to have worked!). The DB that I was going to suggest was mainly JASPAR. There are others that I cannot quite recall right now.

6.0 years ago
GenoMax 141k

You can download experimentally validated promoter sequences from EPD (Eukaryotic Promoter Database).


Thank you for your answer! I just discovered it as well. So they provide the TSSs annotated to the corresponding genes. To get the promoter, I was thinking of retrieving EPD TSS + 1000 bp upstream. Does this sound sensible to you?
