Question

Forum: Sharing frequently asked questions about microbial genome sequencing II

3.1 years ago

Novogene ▴ 420

Hi all,

I have summarized answers to some common bioinformatics questions about microbial genome analysis and am sharing them here:

Q1: Why can a gene of interest fail to be annotated?

If the gene is absent from the assembly, it cannot be annotated. If the gene is present in the assembly but still has no annotation, the likely cause is that the database does not yet contain annotation information for it.

Q2: In the ncRNA annotation, why are so few 5S/16S/23S sequences annotated?

De novo prediction of ncRNA requires the complete ncRNA sequence so that its structure can be confirmed. However, ncRNA genes, especially the 16S and 23S rRNA genes, often contain repetitive sequence, which makes them hard to assemble correctly. If an rRNA gene is not assembled into one complete sequence, it cannot be predicted. If the assembly is correct and few rRNAs are still annotated, the species corresponding to the sample is probably poorly represented in the database, or not represented at all. For the 18S sequence of certain species, neither the database nor the commonly used software includes an annotation.

Q3: In the functional annotation results, what is the difference between Identity, E-value, and Score?

Identity is the sequence identity of the alignment, that is, the percentage of aligned positions at which the two sequences match. The higher the value, the more similar the sequences and the more likely they are homologous, so the gene is more likely to perform the same function.

Score is the alignment score. It is calculated from the scoring matrix and depends on the search algorithm. The larger the value, the better your sequence matches the target sequence.

E-value assesses the reliability of the Score. It is the number of alignments with a score at least this high that would be expected by chance in a search of this database. So, the lower the E-value, the better.
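
To make the relationship between the three values concrete, here is a minimal Python sketch. It is not how any particular annotation pipeline computes them. The function names are made up for illustration, identity is taken over all alignment columns (gaps included), and the E-value uses the textbook relation E = m * n * 2^(-S') for a bit score S', with the raw query and database lengths standing in for the effective lengths that BLAST actually uses.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns where both sequences carry the same residue.

    Both strings are rows of one pairwise alignment, with '-' marking gaps.
    Assumption: gap columns count toward the alignment length.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def evalue_from_bit_score(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'). Real tools apply edge corrections to m and n,
    so treat the result as an approximation.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a better (higher) score gives a smaller E-value.
weak = evalue_from_bit_score(40.0, query_len=300, db_len=5_000_000)
strong = evalue_from_bit_score(80.0, query_len=300, db_len=5_000_000)
print(percent_identity("ACGT-A", "ACGTTA"), weak, strong)
```

Note that the E-value grows with the database size, so the same Score can be significant against a small database and insignificant against a large one.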

Q4: Why does the Venn diagram in the analysis of shared and unique genes differ from the statistics in the table?

In the Venn diagram, each ellipse represents a sample, and the number in each area is the number of groups found in, and only in, the samples that cover that area. As shown in the figure below, a group is a set of gene sequences with greater than 50% similarity and a length difference of less than 0.7. The diagram therefore counts groups rather than individual genes, which is why its numbers can differ from the per-gene statistics in the table.
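
As a rough illustration of how the numbers in the Venn areas come about, the sketch below counts groups per exact combination of samples. It assumes the clustering has already been done; the function name, the group names, and the sample names are all hypothetical.

```python
from collections import Counter


def venn_region_counts(group_to_samples):
    """Count groups per exact set of samples, i.e. per area of the Venn diagram.

    `group_to_samples` maps a group (cluster) name to the set of samples that
    contribute at least one gene to that group.
    """
    counts = Counter()
    for samples in group_to_samples.values():
        if samples:  # ignore groups with no sample assigned
            counts[frozenset(samples)] += 1
    return counts


# Hypothetical clustering result for three samples A, B and C.
groups = {
    "group1": {"A", "B", "C"},  # shared by all samples
    "group2": {"A", "B"},
    "group3": {"A"},            # unique to A
    "group4": {"A"},            # unique to A
    "group5": {"C"},
}
regions = venn_region_counts(groups)
print(regions[frozenset({"A"})])            # groups found in A and only in A
print(regions[frozenset({"A", "B", "C"})])  # groups shared by all three
```

A group can contain several genes from the same sample, so the sum over areas is a number of groups and need not equal a number of genes.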

Q5: How are SNPs identified in a comparative genomics analysis?

Use the MUMmer alignment software to globally align each sample with the reference sequence.
Find the sites that differ between the sample sequence and the reference sequence, and apply preliminary filtering to detect potential SNP sites.
Extract 100 bp of reference sequence on each side of each SNP site, then use BLAST to align the extracted sequence against the assembly to verify the SNP site. *If the alignment is shorter than 101 bp, the SNP is considered unreliable and is removed. If the extracted sequence aligns at multiple locations, the SNP is considered to lie in a repeated region and is also removed.
Use BLAST, TRF, and RepeatMasker to predict the repeat regions of the reference sequence, and then filter out the SNPs located in those regions.
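
The flank extraction and the two verification rules in the steps above can be sketched in Python as follows. This only covers the bookkeeping, not the MUMmer or BLAST runs themselves. The function names are hypothetical, positions are assumed to be 0-based, and treating a flank with no alignment at all as unreliable is my own assumption.

```python
def extract_flank(reference: str, snp_pos: int, flank: int = 100) -> str:
    """Return the SNP base plus up to `flank` bases on each side.

    `snp_pos` is a 0-based index into `reference`. Near the sequence ends the
    window is truncated instead of padded.
    """
    if not 0 <= snp_pos < len(reference):
        raise IndexError("SNP position lies outside the reference sequence")
    start = max(0, snp_pos - flank)
    end = min(len(reference), snp_pos + flank + 1)
    return reference[start:end]


def snp_is_reliable(alignment_lengths, min_length: int = 101) -> bool:
    """Apply the two verification rules to the alignments of one flank sequence.

    `alignment_lengths` holds the length of every alignment the flank produced
    against the assembly. The SNP is kept only if there is exactly one
    alignment (more than one suggests a repeated region) and that alignment is
    at least `min_length` bp long. Assumption: zero alignments means unreliable.
    """
    if len(alignment_lengths) != 1:
        return False
    return alignment_lengths[0] >= min_length


# Hypothetical 300 bp reference with a SNP at index 150.
reference = "ACGT" * 75
window = extract_flank(reference, 150)
print(len(window))                  # 100 + 1 + 100 bases
print(snp_is_reliable([201]))       # single full-length alignment
print(snp_is_reliable([201, 198]))  # flank aligns twice
```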

Q6: What are the methods for phylogenetic tree construction?

There are three ways to construct a phylogenetic tree:

Build a tree based on SNPs: i. Use the SNP matrix of the sample and the reference strain population. ii. Concatenate all SNPs in the same order to obtain sequences of the same length, and use PhyML to construct the phylogenetic tree.
Build a tree based on core-pan analysis: i. Use core-pan analysis to identify the single-copy core genes of the samples. ii. Use MUSCLE for multiple protein sequence alignment, and then use TreeBeST to build the phylogenetic tree.
Build a tree based on gene families: i. Use the single-copy orthologous genes identified by gene family clustering. ii. Use MUSCLE for multiple protein sequence alignment, and then use TreeBeST to build the phylogenetic tree.
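
For the SNP-based method, "connecting all SNPs in the same order" can be pictured with the small Python sketch below. It assumes the SNP matrix has one row per SNP site and one column per strain. The function names and the toy matrix are hypothetical, and the output is a simple sequential PHYLIP-style text, so check the PhyML documentation for the exact input requirements before using it.

```python
def concatenate_snps(snp_matrix, sample_names):
    """Join the SNP bases of every sample in site order.

    `snp_matrix` is a list of rows, one per SNP site in genome order; each row
    lists the base of every sample in the order given by `sample_names`.
    All returned sequences therefore have the same length.
    """
    bases = {name: [] for name in sample_names}
    for row in snp_matrix:
        if len(row) != len(sample_names):
            raise ValueError("every SNP row needs one base per sample")
        for name, base in zip(sample_names, row):
            bases[name].append(base)
    return {name: "".join(chars) for name, chars in bases.items()}


def to_phylip(sequences):
    """Format equal-length sequences as sequential PHYLIP-style text."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    lines = [f"{len(sequences)} {lengths.pop()}"]
    lines += [f"{name}  {seq}" for name, seq in sequences.items()]
    return "\n".join(lines) + "\n"


# Hypothetical matrix: three SNP sites for a reference and two samples.
names = ["ref", "s1", "s2"]
matrix = [
    ["A", "A", "G"],
    ["C", "T", "C"],
    ["G", "G", "T"],
]
snp_sequences = concatenate_snps(matrix, names)
print(to_phylip(snp_sequences))
```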

If I come across more questions, I will summarize the answers and share them with you in the future.

microbial faq genome sequencing meta • 712 views

ADD COMMENT • link updated 3.1 years ago by Istvan Albert 100k • written 3.1 years ago by Novogene ▴ 420

Nice summaries. Answers to a whole bunch of questions on a single page. Very handy.

ADD REPLY • link 3.1 years ago by Istvan Albert 100k