I am looking at some shotgun (not amplicon) metagenomics data, and have observed that among the reads classified as belonging to specific bacteria, most come from ribosomal genes (as determined afterwards by BLAST). My interpretation is that most of the bacteria in the sample are absent from the reference database used for classification, but that because ribosomal genes are so highly conserved, reads from those regions still get classified: those portions of the genomes are "close enough" to previously sequenced genomes.
My first question is: is this a plausible interpretation of what I'm observing? As a follow-up: is it a common issue with shotgun metagenomics? My second question (let me know if this should be posted separately): is there an efficient way to "fish out" previously unclassified reads based on their overlap with a particular set of ribosomal reads from the data? I suppose this would amount to doing genome assembly, but using certain selected reads as a target or seed for the assembly.
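
To make the "fishing" idea concrete, here is a minimal sketch of k-mer-based read recruitment, assuming the reads are already loaded as plain strings. The function names (`recruit_reads`, `canonical_kmers`), the default k-mer size and the round limit are hypothetical choices for illustration, not an existing tool, and a real dataset would need something far more memory-efficient:

```python
from typing import Iterable, List, Set

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """All k-mers of seq, each stored as the smaller of itself and its
    reverse complement, so that reads from either strand match."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # skip k-mers containing ambiguous bases
        kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def recruit_reads(seeds: Iterable[str], pool: Iterable[str], k: int = 31,
                  min_shared: int = 1, max_rounds: int = 5) -> List[str]:
    """Iteratively pull reads out of `pool` that share at least
    `min_shared` k-mers with the seeds or with reads recruited earlier."""
    bait: Set[str] = set()
    for seed in seeds:
        bait |= canonical_kmers(seed, k)

    recruited: List[str] = []
    remaining = list(pool)
    for _ in range(max_rounds):
        newly, still = [], []
        for read in remaining:
            if len(canonical_kmers(read, k) & bait) >= min_shared:
                newly.append(read)
            else:
                still.append(read)
        if not newly:
            break  # nothing new overlaps the current bait set
        # Extend the bait set only after the round, so each round walks
        # roughly one read length further out from the seeds.
        for read in newly:
            bait |= canonical_kmers(read, k)
        recruited.extend(newly)
        remaining = still
    return recruited
```

Here `seeds` would be the reads that BLAST to ribosomal sequences and `pool` the unclassified reads. Each round extends outwards from the ribosomal region, which is effectively a crude seeded assembly. The round limit matters: conserved or repetitive k-mers can otherwise pull in reads from unrelated genomes.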
Background: What we have is Illumina paired-end (2×150 bp) data from shotgun metagenomics, which I have run through Kraken (using the 8 GB MiniKraken database). The first thing I notice is that 99.9% of the reads are unclassified. The same seems to hold with other classification methods (MetaPhlAn2 and a cursory BLAST of a few reads). A small fraction of reads are classified as belonging to certain bacteria. I mapped those reads to the corresponding genome using Bowtie2, hoping to validate the presence of that bug in the sample. After mapping, I see very clear peaks in coverage, rather than reads mapping throughout the genome. Furthermore, the mapped reads BLAST to ribosomal sequences.
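
One way to put numbers on the "peaks" is to compare breadth of coverage with depth over the covered positions. Below is a minimal sketch, assuming the alignments have already been exported as 0-based, half-open `(start, end)` intervals on a single reference sequence. The function names and the interval format are assumptions for illustration, not output of any particular tool:

```python
from typing import Dict, List, Tuple


def depth_profile(intervals: List[Tuple[int, int]], genome_len: int) -> List[int]:
    """Per-base read depth from 0-based, half-open alignment intervals."""
    diff = [0] * (genome_len + 1)
    for start, end in intervals:
        start, end = max(0, start), min(genome_len, end)
        if start < end:
            diff[start] += 1   # depth goes up where an alignment begins
            diff[end] -= 1     # and back down just past where it ends
    depth, running = [], 0
    for i in range(genome_len):
        running += diff[i]
        depth.append(running)
    return depth


def coverage_summary(depth: List[int]) -> Dict[str, float]:
    """Breadth (fraction of positions covered), mean depth over the whole
    reference, and mean depth over covered positions only."""
    n = len(depth)
    covered = sum(1 for d in depth if d > 0)
    total = sum(depth)
    return {
        "breadth": covered / n if n else 0.0,
        "mean_depth": total / n if n else 0.0,
        "mean_depth_covered": total / covered if covered else 0.0,
    }
```

A very low `breadth` combined with a `mean_depth_covered` far above `mean_depth` is the pattern described above: reads piling up on a few loci instead of spreading across the genome, which is what I would expect if only conserved regions are attracting reads.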