I am working on a non-model organism using customised oligo microarrays. I want to study the regulatory elements of differentially expressed genes (DEGs) to see whether they are controlled by shared transcription factors (TFs). We can map the DEGs to the recently sequenced genome of this species by selecting the best BLAST hit. I want to retrieve the regulatory genomic sequences of all DEGs and predict their possible TF binding sites. Which tool can do this? Many thanks!
So I would break this into two problems:
- Getting Genomic data
- Scanning for TFs
Problem 1 is pretty easy. I would use Galaxy. It has tools for uploading entire genomes from arbitrary organisms and a set of tools for extracting genomic intervals. Within ~10 minutes you should be able to get the upstream region of every gene in the locust genome. For this type of analysis I prefer to use 10 kb upstream, but that number is certainly open for debate.
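If you would rather do the interval step outside Galaxy, here is a minimal sketch of the idea in plain Python. It assumes 0-based, half-open gene coordinates on the forward strand and a chromosome (or scaffold) already loaded as a string; the function names and the `flank` parameter are my own, not part of any tool mentioned above. The main thing it shows is that "upstream" depends on strand, and that the window has to be clipped at scaffold ends.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (other characters such as N pass through)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def upstream_interval(gene_start, gene_end, strand, chrom_length, flank=10_000):
    """Return the (start, end) interval upstream of a gene, clipped to the sequence.

    Coordinates are 0-based, half-open, on the forward strand (an assumption;
    adjust if your annotation is 1-based like GFF).
    """
    if strand == "+":
        start, end = gene_start - flank, gene_start
    elif strand == "-":
        # For minus-strand genes, upstream lies to the right of the gene.
        start, end = gene_end, gene_end + flank
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return max(0, start), min(chrom_length, end)


def upstream_sequence(chrom_seq, gene_start, gene_end, strand, flank=10_000):
    """Extract the upstream region, oriented 5'->3' relative to the gene."""
    start, end = upstream_interval(gene_start, gene_end, strand, len(chrom_seq), flank)
    region = chrom_seq[start:end]
    return reverse_complement(region) if strand == "-" else region
```

For a draft genome with short scaffolds, many genes will get less than the full 10 kb after clipping, so it is worth keeping track of the actual length of each region.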
Problem 2 is a little more difficult and will require some approximations. My knowledge of insect taxonomy is not very good, but I would guess that Drosophila melanogaster (DM) would be a pretty good approximation. You can use the JASPAR database to get the Position Weight Matrices (PWMs) for all TFs in DM (I would download them from the FTP site). Then scan the upstream region of each gene in locust. I've done this a few times before, so here is a link to a Python script that should get you started. It requires the MOODS package, which is here (it's not easy to find with Google): Paper, code. The script can be git-cloned/downloaded from here. It's currently pretty general (it can process any JASPAR file and SEQInteral file), but it should be easy to modify if you need to do so.
If the DM PWMs aren't a reasonable approximation then you'll have to find a tool to predict them. However, you'll need data other than microarray data to get reasonable predictions (ChIP-seq would be wonderful).
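To make the scanning step concrete, here is a small pure-Python sketch of what a PWM scan does. This is not the MOODS API and not the linked script; it is a slow, simplified stand-in that converts a count matrix to log-odds scores and slides it along both strands. The pseudocount, the uniform background of 0.25 and the threshold are assumptions you would want to tune, and the example matrix in the usage note is made up rather than a real JASPAR profile.

```python
import math

BASES = "ACGT"


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Turn per-position base counts into log2-odds scores.

    counts maps each base to a list of counts, one per motif position.
    """
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(matrix, seq, threshold):
    """Return (position, strand, score) for every window scoring >= threshold.

    position is the 0-based start of the window on the forward strand.
    """
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(base not in BASES for base in window):
                continue  # skip windows containing N or other ambiguity codes
            score = sum(col[base] for col, base in zip(matrix, window))
            if score >= threshold:
                pos = i if strand == "+" else len(seq) - width - i
                hits.append((pos, strand, score))
    return hits
```

Usage would look something like `scan(log_odds_matrix({"A": [0, 0, 10], "C": [0, 0, 0], "G": [0, 10, 0], "T": [10, 0, 0]}), upstream_seq, threshold=4.0)`, where `upstream_seq` is one of your extracted regions. For whole-genome scans with hundreds of matrices you will want a dedicated scanner such as MOODS, since this loop is far too slow.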
When it comes to "analysis" of which TFs are enriched in your DEG list, I would suggest one of two methods. A hypergeometric test (or Fisher's exact test) can find TFs whose predicted binding sites occur upstream of your DEGs more often than one would expect by chance. This will give you a pretty good "back of the envelope" answer, but the hypergeometric and Fisher's tests are not truly representative of the biology. With a little more work you could get the data into the format required by GSEA, think of each TF as a "signature", and then see how well these signatures match your DEG ranking. This method tends to be a little more accepted by the general community.
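As a rough illustration of the hypergeometric option, the sketch below computes the one-sided enrichment p-value for a single TF using only the standard library. The argument names are my own; the inputs are simply the four counts you get from intersecting the scan results with the DEG list.

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_with_site, n_deg, n_deg_with_site):
    """P(X >= n_deg_with_site) under the hypergeometric null.

    n_genes          total genes on the array (the background)
    n_with_site      background genes with a predicted site for this TF
    n_deg            number of differentially expressed genes
    n_deg_with_site  DEGs with a predicted site for this TF
    """
    max_overlap = min(n_with_site, n_deg)
    total = comb(n_genes, n_deg)
    tail = sum(
        comb(n_with_site, k) * comb(n_genes - n_with_site, n_deg - k)
        for k in range(n_deg_with_site, max_overlap + 1)
    )
    return tail / total
```

You would run this once per TF, so the resulting p-values need a multiple-testing correction before you read much into them. Note also that the background should be the genes actually represented on your custom array, not every gene in the genome.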
-Hope that helps,
I agree with Will but would add that you might be able to use your own expression data to filter for candidate transcription factors (TFs) that are a) turned on in either of your samples or more stringently b) differentially expressed between your samples. This assumes that your custom arrays have probes for the TFs in question.
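A minimal sketch of that filter, assuming you have a log2 expression value per TF in each of the two samples. The cutoffs (`min_expr`, `min_abs_log2fc`) are placeholders I made up for illustration; pick values that suit your platform and normalisation.

```python
def filter_candidate_tfs(tf_expression, min_expr=8.0, min_abs_log2fc=1.0,
                         require_differential=False):
    """Keep TFs that are expressed, optionally only those that also change.

    tf_expression maps a TF name to (log2 expression in sample A,
    log2 expression in sample B). Thresholds are hypothetical defaults.
    """
    kept = []
    for tf, (expr_a, expr_b) in tf_expression.items():
        # (a) turned on in either sample
        expressed = max(expr_a, expr_b) >= min_expr
        # (b) more stringent: also differs between the two samples
        differential = abs(expr_a - expr_b) >= min_abs_log2fc
        if expressed and (differential or not require_differential):
            kept.append(tf)
    return sorted(kept)
```

In practice you would take the differential call from the same statistical test you used for the DEGs rather than from a raw fold-change cutoff.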
Finally some time to address this interesting and relevant topic.
First, I am not certain that transcription factors (TFs) will show accurate, reliable, detectable expression differences between control and treatment. You are likely to see a few, but detecting many may not be feasible.
Second, I would have a look at the papers from Manolis Kellis' group on sequence comparisons to detect transcription control motifs. They did this work in yeast and in fruit flies. You may be able to use their fly data to identify motifs. JASPAR and TRANSFAC don't have much in terms of insect motifs. An alternative approach is to use the genome sequences from fruit fly, mosquito, honeybee and others to build alignments in order to identify likely regulatory motifs or TF binding sites (TFBSs). You'll have to do this on a gene-by-gene basis, as synteny won't exist across these species the way it does for the yeasts or Drosophila species Kellis et al. examined. Nonetheless, such an effort may identify conserved core motifs. You can then check whether the presence of a motif is associated with upregulated, downregulated or non-responding genes.
Because you are looking at a "new" or different species, linking a motif to a TF may be difficult. I do like Will's GSEA approach, but I am unsure what success rate you should expect with it. It could work well; it could be problematic.