Hi, I'm analyzing 50+ complete genomes of Enterococcus faecium and have extracted their core genome. Now I want to narrow down the core genome by searching for genes that are essential in these bacteria. Searching the web, I found DEG, CEG and OGEE. DEG seems the most appropriate for this analysis, since we can run BLAST searches against the database, but unfortunately their websites are apparently down. Do you know of any other database of bacterial essential genes that could be used for this purpose?
To find what you need, phrase your question like this
In NCBI: ‘database containing essential genes of gut bacteria’
That search returned a lot of articles, including this one:
A Comprehensive Overview of Online Resources to Identify and Predict Bacterial Essential Genes, by Chong Peng, Yan Lin, Hao Luo and Feng Gao
They list the databases you already know about, but they also suggest some approaches for predicting essential genes:
Sequence Derived Features of Essential Genes
(1) GC content. DNA with high GC content is believed to be more robust and stable (Seringhaus et al., 2006).
(2) Codon usage. The codon usage of essential genes suffers from more evolutionary constraints than non-essential genes (Jordan et al., 2002).
(3) Strand bias. Essential genes tend to be encoded on the leading strand of the chromosome (Lin et al., 2010; Rocha and Danchin, 2003).
(4) Protein length. Although protein length tends to become longer through evolution, essential genes, compared to non-essential genes, have a significantly higher proportion of large and small proteins relative to medium-sized proteins (Lipman et al., 2002; Gong et al., 2008).
(5) Z-curve parameter. The Z-curve theory is a bioinformatic algorithm to display base composition distributions along DNA sequences (Zhang and Zhang, 1994; Zhang, 1997; Gao and Zhang, 2004).
All the information that a given DNA sequence carries is included in the corresponding Z-curve. So Z-curve features can be used as sequence derived features for essential gene prediction (Song et al., 2014; Lin et al., 2017). Based on the Z-curve theory, Guo et al. (2017) created a λ-interval Z-curve, which considered the interval range association. They then built a support vector machine-based model to predict human gene essentiality with the λ-interval Z-curve, and obtained excellent performance (Guo et al., 2017).
(6) Hurst exponent. The Hurst exponent is a characteristic parameter which describes the degree of self-similarity of a data set. For genes of similar length, the average Hurst exponent of essential genes is smaller than that of non-essential genes (Zhou and Yu, 2014).
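As a rough illustration of how features (1) and (5) above could be computed for a single gene, here is a minimal Python sketch. The function names gc_content and z_curve are made up for this example (they are not from the paper or from any library). The three Z-curve components follow the usual definition from Zhang and Zhang (1994): x separates purines from pyrimidines, y separates amino from keto bases, and z separates weak from strong hydrogen-bonding bases. Skipping ambiguous bases such as N is an assumption of this sketch.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a DNA sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def z_curve(seq):
    """Cumulative Z-curve points (x_n, y_n, z_n) for a DNA sequence.

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
    z_n = (A_n + T_n) - (C_n + G_n)   weak vs. strong hydrogen bonding
    where A_n, C_n, G_n, T_n are base counts in the first n positions.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            continue  # assumption: ambiguous bases (N, R, ...) are skipped
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append((a + g - c - t, a + c - g - t, a + t - c - g))
    return points


# Placeholder sequence for illustration only, not a real gene.
example = "ATGGCGTTAA"
print(gc_content(example))
print(z_curve(example)[-1])
```

The last point of the curve, divided by the sequence length, gives three numbers per gene that could be used as classifier features alongside GC content. The published predictors use more elaborate Z-curve variables (such as the λ-interval Z-curve mentioned above), so treat this only as a starting point.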
Context-Dependent Features of Essential Proteins
(1) Domain properties. Protein essentiality is not likely to be conserved through the conservation of overall proteins but through the function of protein domains or domain combinations (Deng et al., 2011).
(2) Protein-protein interaction (PPI) network. Genes or their protein products are connected rather than isolated. Compared with non-essential genes, essential genes tend to be more highly connected in protein interaction networks. Network topology features, such as degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), eigenvector centrality (EC), subgraph centrality (SC) have been used for detecting essential proteins (Estrada, 2006; Acencio and Lemke, 2009; Hwang et al., 2009; Wang et al., 2013; Xiao et al., 2015).
(3) Protein localization. Compared with non-essential proteins, a higher proportion of essential proteins are located in the cytoplasm, and a much lower proportion are located in the cell envelope (cytoplasmic membrane, periplasm, cell wall) or are extracellular (Seringhaus et al., 2006; Peng and Gao, 2014).
(4) Gene expression. Genes whose expression levels are higher and more stable under given conditions are more likely to be essential (Jansen et al., 2002).
(5) Gene Ontology. The Gene Ontology (GO) project provides a set of hierarchical controlled vocabularies for describing the biological process, molecular function, and cellular component of gene products (Ashburner et al., 2000). GO terms related to cellular localization and biological process are shown to be reliable predictors of essential genes (Acencio and Lemke, 2009).
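To make the network feature in (2) above more concrete, here is a minimal Python sketch of degree centrality, the simplest of the topology measures listed there. It assumes you already have a list of interacting protein pairs for your organism. The function name degree_centrality and the toy protein IDs are hypothetical, and the normalisation by (number of nodes - 1) is the common convention rather than anything specific to the paper.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalised degree centrality for an undirected interaction network.

    edges: iterable of (protein_a, protein_b) pairs.
    Returns {protein: degree / (number_of_proteins - 1)}.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # assumption: self-interactions are ignored
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbours.items()}


# Hypothetical toy network; the protein names are placeholders, not real data.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(toy_edges)
ranking = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Proteins at the top of such a ranking are the highly connected ones, which according to the review are more likely to be essential. This is only a prioritisation heuristic, not a substitute for experimental essentiality data.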
See also this post:
There is a lot of information inside.