I would like to identify as many bacteria as possible from 16S rRNA sequencing data. I found that more than 60% of the reads align to multiple bacterial species. I don't think I should ignore them. I am trying to assign them to specific species according to the count distribution of the remaining 40% of the reads, which align to a single species. Does this make sense? Is there a standard protocol to follow in this field? Thanks.
You don't say which region of the 16S gene you amplified. Your ability to discriminate between representative sequences at different taxonomic levels depends on the region sequenced. You should use highly curated databases to determine what you have (such as those associated with MOTHUR and/or QIIME, which are software packages designed specifically for analyzing 16S sequences). Misidentification can arise from errors or missing data in your sequences as well as in the database you are using. Most likely, the reads that hit multiple species can't be resolved at the species level, only at a higher level such as family.
Check out the QIIME tutorial; it's pretty thorough and should allow you to do everything you need. Notably, you typically don't view the data at the species level, because species-level assignments are very variable, but at the genus or family level, which is built into this kind of analysis.
I think that for 16S studies there is no better option than the resources offered by SILVA. They rely on a manually curated database that has been used extensively and is considered a gold standard for bacterial phylogeny.
Hope it helps
Here is how the MEGAN tool does it:
The main problem addressed by MEGAN is to compute a “species profile” by assigning the reads from a metagenomics sequencing experiment to appropriate taxa in the NCBI taxonomy. At present, this program implements the following naive approach to this problem:
- Compare a given set of DNA reads to a database of known sequences, such as NCBI-NR or NCBI-NT, using a sequence comparison tool such as BLAST.
- Process this data to determine which taxa each read hits.
- For each read r, let H be the set of all taxa that r hits.
- Find the lowest node v in the NCBI taxonomy that encompasses the set of hit taxa H and assign the read r to the taxon represented by v. We call this the naive LCA-assignment algorithm (LCA = “lowest common ancestor”). In this approach, every read is assigned to some taxon.
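To make the steps above concrete, here is a minimal Python sketch of the naive LCA idea. It is not MEGAN's actual code: the toy taxonomy in `PARENT`, the read names, and the helper functions `lineage`, `lca`, and `assign_reads` are all hypothetical and only illustrate the logic. In practice the taxonomy would come from NCBI and the hits from your BLAST output.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
# A real analysis would load the NCBI taxonomy instead.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Enterobacteriaceae": "Bacteria",
    "Escherichia": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Salmonella": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
}


def lineage(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]  # root first


def lca(taxa: List[str], parent: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the lowest node that is an ancestor of (or equal to) every taxon."""
    lineages = [lineage(t, parent) for t in taxa]
    common = None
    # Walk down all lineages in parallel and stop at the first disagreement.
    for level in zip(*lineages):
        if all(name == level[0] for name in level):
            common = level[0]
        else:
            break
    return common


def assign_reads(hits: Dict[str, List[str]],
                 parent: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Assign each read with at least one hit to the LCA of its hit taxa."""
    return {read: lca(taxa, parent) for read, taxa in hits.items() if taxa}


# Hypothetical hits: read -> set H of taxa it matched.
hits = {
    "read1": ["Escherichia coli"],
    "read2": ["Escherichia coli", "Escherichia fergusonii"],
    "read3": ["Escherichia coli", "Salmonella enterica"],
}
assignments = assign_reads(hits, PARENT)
for read, taxon in assignments.items():
    print(read, "->", taxon)
```

In this toy example, the read that hits only one species stays at species level, the read that hits two Escherichia species is assigned to the genus, and the read that hits both Escherichia and Salmonella is pushed up to the family. This is the same effect described in the earlier answers: ambiguous reads are kept, but reported at a higher rank.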