Classifying shotgun metagenomes with a 16S rRNA database
15 months ago
Konstantin ▴ 20

Hello all. My question is for people experienced in metagenomic analysis of complex microbial communities such as soil. Suppose I have several soil shotgun metagenomes and need to classify all of the reads, recovering as much of the diversity present in my data as possible. As far as I know, current whole-genome databases can be divided into curated (e.g. RefSeq, proGenomes) and non-curated (e.g. nr, nr-euc) ones. Curated, moderated databases are safe to use, but they cannot capture all the diversity in the data because they are quite limited in size. In my case, approximately 45% of the data is classified as bacteria with the proGenomes db and nearly 5% as fungi. The largest possible db (such as nr) could in theory classify all my sequences, but the price, as far as I understand, is an unacceptable share of misclassified data, because anyone can upload data of any quality to these databases. Also, I do not currently have enough RAM to run a whole-metagenome taxonomic analysis against a db as large as nr.

My idea was to use kraken2 to classify the whole metagenome against the RDP or SILVA database. The idea rests on the fact that 16S rRNA dbs are much bigger than whole-genome ones in terms of the taxa they cover, so in theory this approach could uncover more taxa. No sooner said than done: the attempt classified less than 0.1% of the reads, which is ~50,000 reads in absolute numbers. To my knowledge, this corresponds to the average read yield of a typical soil metabarcoding analysis sequenced on a MiSeq.
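As a rough sanity check on that 0.1% figure, the expected share of 16S rRNA reads in a shotgun metagenome can be estimated from gene length, copy number, and genome size. The sketch below is a back-of-the-envelope calculation, not a measurement: a 16S gene length of about 1,500 bp is standard, but the copy number (4) and average genome size (5 Mb) are illustrative assumptions, and the function names are made up for this example.

```python
def expected_16s_fraction(gene_len_bp, copies_per_genome, genome_size_bp):
    """Fraction of a genome's bases (and so, roughly, of its shotgun reads)
    that come from 16S rRNA genes."""
    return gene_len_bp * copies_per_genome / genome_size_bp


def implied_total_reads(classified_reads, classified_fraction):
    """Total library size implied by a classified read count and its share."""
    return classified_reads / classified_fraction


# Assumed values for illustration only: 4 copies per genome, 5 Mb genome.
fraction = expected_16s_fraction(1500, 4, 5_000_000)
print(f"expected 16S share: {fraction:.2%}")

# ~50,000 reads at a 0.1% share implies a library of roughly 50 million reads.
print(f"implied library size: {implied_total_reads(50_000, 0.001):,.0f} reads")
```

Under these assumptions the expected share is on the order of 0.1%, so a yield of that size from a 16S-only database is about what shotgun data should give, rather than a sign that the classification itself failed.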

So here is my question: does my idea make sense, and if so, is a sample of this size (~50,000 reads) enough to represent soil biodiversity with an acceptable level of accuracy? I suspect that all copies of the 16S rRNA gene from N grams of soil are not proportional to all DNA from the same quantity of soil, in terms of diversity and share of community, but I have no further thoughts on that.
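Both parts of this question can be made concrete with two small calculations. The sketch below assumes per-taxon 16S read counts are already available (for example from a classifier report) and that a 16S copy number is known for each taxon; the taxon names, counts, and copy numbers are made up for illustration. The first function divides counts by copy number before computing shares, and the second computes Good's coverage, 1 - (singleton taxa / total reads), as one common rough indicator of whether the sampling depth is adequate.

```python
def copy_number_corrected_shares(read_counts, copy_numbers):
    """Convert 16S read counts into relative abundances after dividing each
    count by that taxon's 16S copy number."""
    corrected = {
        taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts
    }
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


def goods_coverage(read_counts):
    """Good's coverage: 1 - (number of taxa seen exactly once / total reads)."""
    total = sum(read_counts.values())
    singletons = sum(1 for n in read_counts.values() if n == 1)
    return 1 - singletons / total


# Hypothetical data: taxon_A carries 7 copies of the 16S gene, the others 1.
counts = {"taxon_A": 700, "taxon_B": 100, "taxon_C": 1}
copies = {"taxon_A": 7, "taxon_B": 1, "taxon_C": 1}

for taxon, share in copy_number_corrected_shares(counts, copies).items():
    print(f"{taxon}: raw {counts[taxon] / sum(counts.values()):.3f}, "
          f"corrected {share:.3f}")
print(f"Good's coverage: {goods_coverage(counts):.4f}")
```

In this toy case taxon_A dominates the raw read shares only because of its copy number; after correction it is on par with taxon_B. Copy numbers for many soil taxa are unknown, so any such correction is itself an approximation.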

I would appreciate any critical comments on my idea. If you have accomplished this task (uncovering the maximum possible biodiversity with minimum bias in share of community) with another approach, I would love to know exactly which one.

Thank y'all for your attention to my question.

metagenomics shotgun annotation RDP biodiversity • 747 views

You can't use one phylogenetic marker to capture the sample biodiversity.

MetaPhlAn does that, but with a database of unique marker genes specific to SGBs (species-level genome bins). Link to the preprint.

15 months ago
joe ▴ 510

I suspect that all copies of the 16S rRNA gene from N grams of soil are not proportional to all DNA from the same quantity of soil

16S classification isn't reliable unless the majority of your reads are rRNA. So you either need targeted enrichment of rRNA (which isn't necessarily preferable, because you might miss things), or you can extract total RNA and sequence that, since most of the sequences there will be rRNA. That is still not ideal.

However, based on my understanding of what you describe, I think you should proceed in two ways: 1) Figure out how to use the NCBI nt dataset as a reference db for kraken2, or perhaps some environmental database that might be representative of your samples. 2) Assemble contigs and BLAST those against whatever reference you choose (NCBI nt, etc.).
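The two routes above could be sketched roughly as follows. This is only an outline under stated assumptions: all file and database paths are hypothetical, MEGAHIT is swapped in as one possible assembler (the answer does not name one), and the commands are only built and printed here, not executed. The flags are the ones I understand kraken2, megahit, and blastn to accept; check them against the versions you have installed. `--memory-mapping` is included because it lets kraken2 read the database from disk instead of loading it all into RAM, at the cost of speed.

```python
import shlex
from pathlib import Path


def build_commands(reads_1, reads_2, kraken_db, blast_db, out_dir, threads=8):
    """Return the command lines for both routes as argument lists."""
    out = Path(out_dir)
    contigs = out / "megahit" / "final.contigs.fa"  # MEGAHIT's default output name

    # Route 1: classify raw paired reads with kraken2 against a large database.
    kraken = [
        "kraken2", "--db", str(kraken_db), "--threads", str(threads),
        "--memory-mapping",  # read the db from disk rather than loading into RAM
        "--paired",
        "--report", str(out / "kraken2.report"),
        "--output", str(out / "kraken2.out"),
        str(reads_1), str(reads_2),
    ]

    # Route 2a: assemble contigs (assembler choice is an assumption).
    assembly = [
        "megahit", "-1", str(reads_1), "-2", str(reads_2),
        "-t", str(threads), "-o", str(out / "megahit"),
    ]

    # Route 2b: search the contigs against the reference, tabular output.
    blast = [
        "blastn", "-query", str(contigs), "-db", str(blast_db),
        "-outfmt", "6", "-evalue", "1e-10",
        "-num_threads", str(threads),
        "-out", str(out / "contigs_vs_ref.tsv"),
    ]
    return [kraken, assembly, blast]


# Hypothetical inputs; replace with real paths before running anything.
for cmd in build_commands("soil_R1.fastq.gz", "soil_R2.fastq.gz",
                          "kraken2_nt_db", "nt", "results"):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or passed to `subprocess.run(cmd, check=True)` once the paths point at real data. The e-value cutoff is an arbitrary placeholder, not a recommendation.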
