I am working with some bacterial genomes. For the moment, it is fine to assume that I am dealing with just E. coli. I want to get an idea of which genes are highly expressed in E. coli.
One of the most common techniques is to first identify a set of genes known to be highly expressed (generally ribosomal protein genes), then use this reference set to build a codon usage table (CUT) for highly expressed genes. The next step is to calculate the Codon Adaptation Index (CAI) of each gene using that CUT, and finally to rank the genes by their CAI values. This is a classical method that was first published by Paul Sharp and his colleagues in 1987. Since then people have come up with minor variants of CAI, but the principle remains essentially the same.
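For concreteness, here is a rough sketch of the CAI procedure as I understand it. It is only a minimal illustration, not a reference implementation: the function names are hypothetical, the input sequences are assumed to be in-frame coding sequences, and the 0.5 pseudocount for codons that never occur in the reference set is an assumption of this sketch. Met, Trp and stop codons are skipped because they have no synonymous alternatives to choose between.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
EXCLUDED = {"M", "W", "*"}  # single-codon amino acids and stop codons


def codons(seq):
    """Split an in-frame coding sequence into codons (trailing bases are dropped)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Build the weights w = count / max count within each synonymous family.

    The pseudocount (an assumption here) replaces zero counts so that a codon
    absent from the reference set does not force CAI to zero.
    """
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TO_AA)
    weights = {}
    for aa in set(AMINO_ACIDS) - EXCLUDED:
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        adjusted = {c: counts[c] or pseudocount for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights over the scored codons of one gene."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def rank_by_cai(genes, weights):
    """Return (name, CAI) pairs sorted from highest to lowest CAI."""
    scored = ((name, cai(seq, weights)) for name, seq in genes.items())
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In practice the reference sequences passed to `relative_adaptiveness` would be the ribosomal protein genes, and `rank_by_cai` would be run over every annotated coding sequence in the genome.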
However, I was wondering whether I could use publicly available microarray data to get a list of highly expressed genes, instead of relying on a theoretical measure like CAI. I do not know if this is really possible, since microarrays are designed for relative studies: the differentially expressed genes they identify are overexpressed or underexpressed relative to another condition. For example, when a wild type is compared to, say, a drug-treated condition, the genes found to be overexpressed in the wild type are not necessarily highly expressed in absolute terms. They may only be highly expressed relative to their expression in the drug-treated state. So is there a way to use data from NCBI's GEO to get a list of experimentally determined highly expressed genes in the wild-type condition?
Thanks and regards Sankalp