Can you recommend me a database of bacterial gene essentiality?
4.9 years ago

Hi, I'm analyzing 50+ complete genomes of Enterococcus faecium and have extracted their core genome. Now I want to narrow down the core genome by identifying the genes that are essential for these bacteria. Searching the web, I found DEG, CEG and OGEE. DEG seems the most appropriate for this analysis, since we can run BLAST searches against the database. Unfortunately, their websites are apparently down. Do you know of any other database of bacterial essential genes that would serve this purpose?

database essential genes bacteria • 1.4k views

other answers here are already good. just wanted to add that Craig Venter (of the J Craig Venter Institute - hah) is considered a leading synthetic biologist. the institute has been working for some time on creating "minimal genomes" and "minimal cells"; i.e. bacterial cells that have had more and more genes stripped out of the genome until they are reduced to the minimum number of genes necessary to sustain life over time.

see, for instance: (or just google "minimal cells")

4.9 years ago
natasha.sernova ★ 3.9k

To find what you need, phrase your question like this in an NCBI search:

‘database containing essential genes of gut bacteria’

That returned a lot of articles for me.

For example:

A Comprehensive Overview of Online Resources to Identify and Predict Bacterial Essential Genes, by Chong Peng, Yan Lin, Hao Luo and Feng Gao.

They list the databases you already know, but they also describe features that can be used to predict essential genes:

Sequence Derived Features of Essential Genes

(1) GC content. DNA with high GC content is believed to be more robust and stable (Seringhaus et al., 2006).

(2) Codon usage. The codon usage of essential genes suffers from more evolutionary constraints than non-essential genes (Jordan et al., 2002).

(3) Strand bias. Essential genes tend to be encoded on the leading strand of the chromosome (Lin et al., 2010; Rocha and Danchin, 2003).

(4) Protein length. Although protein length tends to become longer through evolution, essential genes, compared to non-essential genes, have a significantly higher proportion of large and small proteins relative to medium-sized proteins (Lipman et al., 2002; Gong et al., 2008).

(5) Z-curve parameter. The Z-curve theory is a bioinformatic algorithm to display base composition distributions along DNA sequences (Zhang and Zhang, 1994; Zhang, 1997; Gao and Zhang, 2004).

All the information that a given DNA sequence carries is included in the corresponding Z-curve, so Z-curve features can be used as sequence-derived features for essential gene prediction (Song et al., 2014; Lin et al., 2017). Based on the Z-curve theory, Guo et al. (2017) created a λ-interval Z-curve, which considered the interval range association. They then built a support vector machine-based model to predict human gene essentiality with the λ-interval Z-curve, and obtained excellent performance (Guo et al., 2017).

(6) Hurst exponent. The Hurst exponent is a characteristic parameter which describes the degree of self-similarity of a data set. For genes of similar length, the average Hurst exponent of essential genes is smaller than that of non-essential genes (Zhou and Yu, 2014).
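To make features (1) and (5) a bit more concrete, here is a minimal Python sketch that computes GC content and the cumulative Z-curve coordinates of a sequence. The function names `gc_content` and `z_curve` are my own, not from the review, and the Z-curve here is the basic form (x: purine vs. pyrimidine, y: amino vs. keto, z: weak vs. strong hydrogen bonding), not the λ-interval variant of Guo et al.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


# Per-base steps for the three Z-curve axes:
#   x = (A + G) - (C + T)   purine vs. pyrimidine
#   y = (A + C) - (G + T)   amino vs. keto
#   z = (A + T) - (G + C)   weak vs. strong hydrogen bonding
_STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq: str) -> List[Tuple[int, int, int]]:
    """Cumulative (x, y, z) after each base; ambiguous bases add no step."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = _STEP.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points
```

The final point of `z_curve`, or per-codon-position variants of it, could then be fed to a classifier together with GC content. Which features actually help for a given organism is something you would have to test.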

Context-Dependent Features of Essential Proteins

(1) Domain properties. Protein essentiality is not likely to be conserved through the conservation of overall proteins but through the function of protein domains or domain combinations (Deng et al., 2011).

(2) Protein-protein interaction (PPI) network. Genes or their protein products are connected rather than isolated. Compared with non-essential genes, essential genes tend to be more highly connected in protein interaction networks. Network topology features, such as degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), eigenvector centrality (EC), subgraph centrality (SC) have been used for detecting essential proteins (Estrada, 2006; Acencio and Lemke, 2009; Hwang et al., 2009; Wang et al., 2013; Xiao et al., 2015).

(3) Protein localization. Compared with non-essential proteins, a higher proportion of essential proteins is found in the cytoplasm, while a much lower proportion is located in the cell envelope, such as the cytoplasmic membrane, periplasm, cell wall and extracellular space (Seringhaus et al., 2006; Peng and Gao, 2014).

(4) Gene expression. Genes whose expression levels are higher and more stable under given conditions are more likely to be essential (Jansen et al., 2002).

(5) Gene Ontology. The Gene Ontology (GO) project provides a set of hierarchical controlled vocabularies for describing the biological process, molecular function, and cellular component of gene products (Ashburner et al., 2000). GO terms related to cellular localization and biological process are shown to be reliable predictors of essential genes (Acencio and Lemke, 2009).
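As a small illustration of the network features in (2), here is a sketch of degree centrality computed from a list of interactions, in plain Python. The function name `degree_centrality` and the protein labels `P1` to `P5` are placeholders I made up, not real interaction data. Degree centrality is taken here as a node's number of distinct partners divided by (n - 1), where n is the number of proteins in the network.

```python
from typing import Dict, Iterable, Set, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Degree centrality for an undirected interaction network.

    Duplicate edges are counted once and self-interactions are ignored.
    """
    neighbors: Dict[str, Set[str]] = {}
    for a, b in edges:
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Toy network with made-up protein labels (not real data).
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P3", "P5")]
centrality = degree_centrality(toy_edges)
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

In this toy network `P1` ranks first because it has the most partners. Under the hypothesis in (2), highly connected proteins like that would be the first candidates to check for essentiality.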

See also this post:

Database Of Essential Genes

There is a lot of information inside.


Really interesting ideas. I gave the article a quick read, and apparently OGEE is the only database available at the moment. Technically, one can take its list of essential genes and screen it against the desired database. Besides the databases, screening for essential genes using the parameters above would also be an interesting study. As for the post you recommended, I had read that topic before, and it also offers great insights.

I'll need to decide whether it's better to use the database only or to combine the database with essential gene prediction, given the time available. Predicting essential genes would be a great separate study in itself. The essential genes would constitute a subset of the core genome (together with other subsets of interesting genes), both to narrow down the number of sequences to be analyzed and to form a group of more reliable/important genes in the bacteria. I'm not sure how to screen for essential genes using the parameters you cited, so I'll need to look into the methodology. Also, for the study I'm conducting, predicting essential genes rather than using experimentally tested genes might not be a good idea.

Thank you very much! Your contribution really pointed me towards a solution to my problem.

12 weeks ago
Sapphire ▴ 10


I have around 10,000 sequences that I would like to BLAST against the DEG database, and I'm looking for guidance on how to proceed with this analysis. Additionally, I have 200 sequences from E. coli, and I used grep to extract hypothetical proteins. However, the number of hypothetical proteins I obtained is much larger than what is typically reported in the literature, where the average is around 300 to 500. I'm unsure of the reasons behind this discrepancy and would greatly appreciate any insights into what might be causing it.

Thank you in advance for your support.
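For the BLAST part, one possible workflow (a sketch, assuming you can obtain the DEG protein sequences as a FASTA file) is to build a local BLAST+ database, run `blastp` with tabular output, and then keep the best hit per query. The file names `deg_proteins.faa`, `queries.faa` and `hits.tsv`, the function names, and the thresholds (E-value 1e-5, 30% identity) are my own assumptions, not DEG recommendations.

```python
# Assumed BLAST+ commands, run in a shell beforehand (file names are placeholders):
#   makeblastdb -in deg_proteins.faa -dbtype prot -out deg
#   blastp -query queries.faa -db deg -outfmt 6 -evalue 1e-5 -out hits.tsv
import csv
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def parse_outfmt6(lines: Iterable[str]) -> List[Hit]:
    """Parse default BLAST tabular output (-outfmt 6, 12 columns)."""
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(row[0], row[1], float(row[2]), int(row[3]),
                        float(row[10]), float(row[11])))
    return hits


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5,
              min_pident: float = 30.0) -> Dict[str, Hit]:
    """Keep, per query, the highest-bitscore hit that passes both thresholds."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.evalue > max_evalue or h.pident < min_pident:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


# Made-up rows in outfmt 6 layout, only to show the call pattern.
example = [
    "q1\tDEG_A\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-100\t350",
    "q1\tDEG_B\t40.0\t180\t100\t8\t5\t184\t3\t182\t1e-20\t90.5",
    "q2\tDEG_C\t25.0\t60\t45\t0\t1\t60\t1\t60\t0.5\t22.1",
]
candidates = best_hits(parse_outfmt6(example))
```

With a real run you would pass `open("hits.tsv")` instead of `example`. Queries left in `candidates` have a DEG homolog under these cutoffs. That suggests, but does not prove, essentiality in your organism.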

