I am hoping for some insight and suggestions concerning retrieving promoter sequences:
- The quickest solution I found was using the UCSC Table Browser as described here. However, I would rather use the GENCODE v27 basic annotation instead of the knownGene table to limit the number of sequences a little. (Is there a way to get only the promoter sequences of the protein-coding genes via the UCSC Table Browser?)
- The prepackaged downloads of 1000 bp from UCSC are unfortunately not available for hg38!
- The `biomaRt` R package with `getSequence()`, as described here, which I think is the nicest solution. I retrieved the gene identifiers from the GENCODE v27 basic annotation, prefiltered for protein coding.
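To make the prefiltering step concrete, here is a minimal Python sketch of how I pull protein-coding gene IDs out of a GENCODE-style GTF. It assumes the usual nine tab-separated columns with `gene_id` and `gene_type` attributes on the `gene` records; the function name and the example records are made up for illustration and are not taken from the real v27 file. The version suffix is stripped on the assumption that the IDs are going to be queried as unversioned Ensembl gene IDs.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def protein_coding_gene_ids(gtf_lines):
    """Return unversioned gene IDs of protein-coding genes, in file order."""
    ids = []
    for line in gtf_lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue                      # only look at gene-level records
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get("gene_type") == "protein_coding":
            # Drop the version suffix, e.g. "ENSG00000000001.1" -> "ENSG00000000001"
            ids.append(attrs["gene_id"].split(".")[0])
    # De-duplicate while keeping order (IDs can repeat once versions are stripped).
    return list(dict.fromkeys(ids))


# Toy records in GTF layout (hypothetical IDs, for illustration only).
example = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t5000\t9000\t.\t+\t.\t'
    'gene_id "ENSG00000000001.1"; gene_type "protein_coding"; gene_name "TOY1";',
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\t'
    'gene_id "ENSG00000000002.3"; gene_type "lincRNA"; gene_name "TOY2";',
]
print(protein_coding_gene_ids(example))
```

With a real file, the same function can be fed an open file handle instead of the toy list.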
Any other suggestions as to how to retrieve them?
I have a couple of fundamental questions here:
- I would define promoter sequences without the UTRs. However, the `biomaRt` description suggests using `seqType="coding_gene_flank"` to get the promoter sequences. I would use "gene_flank" to exclude the UTRs. Here, I am not sure what the UCSC Table Browser does: do I get the sequences upstream of the TSS, so no UTRs? What happens if the UTRs are not known?
- Should I retrieve the promoter sequences by transcript or by gene? The UCSC Table Browser does it by transcript. biomaRt allows both; I assume that at the gene level it uses the main transcript? Which is more sensible to use? I tend towards the gene level, as it again limits the number of sequences.
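To pin down what I mean by "upstream of the TSS, no UTRs" and by collapsing transcripts, here is a small Python sketch. It only reflects my own working definition, not what the UCSC Table Browser or biomaRt do internally. It assumes 1-based inclusive coordinates as in a GTF, takes the transcript start (plus strand) or transcript end (minus strand) as the TSS, and merges transcripts of a gene that share the same TSS. All names and coordinates are hypothetical, and clipping at the chromosome end is not handled.

```python
def promoter_window(tss, strand, upstream=1000):
    """1-based inclusive (start, end) of the bases immediately 5' of the TSS.

    The TSS itself is excluded, so no transcribed sequence (and hence no
    5' UTR) ends up in the window.
    """
    if strand == "+":
        return max(1, tss - upstream), tss - 1
    if strand == "-":
        return tss + 1, tss + upstream
    raise ValueError(f"unexpected strand: {strand!r}")


def unique_promoters(transcripts, upstream=1000):
    """Collapse transcripts of a gene that share a TSS into one promoter.

    `transcripts` is an iterable of
    (gene_id, transcript_id, chrom, start, end, strand) tuples.
    """
    by_tss = {}
    for gene_id, tx_id, chrom, start, end, strand in transcripts:
        tss = start if strand == "+" else end   # strand-aware TSS
        by_tss.setdefault((gene_id, chrom, tss, strand), []).append(tx_id)
    return [
        (gene_id, chrom, *promoter_window(tss, strand, upstream), strand, tx_ids)
        for (gene_id, chrom, tss, strand), tx_ids in by_tss.items()
    ]


# Toy transcripts (made-up identifiers and coordinates).
transcripts = [
    ("GENE_A", "TX_A1", "chr1", 5000, 9000, "+"),
    ("GENE_A", "TX_A2", "chr1", 5000, 8000, "+"),  # same TSS as TX_A1
    ("GENE_A", "TX_A3", "chr1", 6200, 9000, "+"),  # alternative TSS
    ("GENE_B", "TX_B1", "chr2", 2000, 4000, "-"),
]
for promoter in unique_promoters(transcripts):
    print(promoter)
```

In this toy case the three GENE_A transcripts give two promoters rather than three, which sits somewhere between the per-transcript and per-gene options I am asking about.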
I want to look for TFBSs in the retrieved promoter sequences.
Thanks in advance for any advice!!