Question: Reference genome for microbiome WGS metagenomics?
richard0 wrote:

Which reference database is really the best? I know MetaPhlAn's database has 17,000 species and 1M marker genes, but the Ensembl database has 40,000 bacterial genomes. Any ideas on which is best?

modified 8 weeks ago by mschmid150 • written 8 weeks ago by richard0

Well, NCBI has 228,000 prokaryotic genomes as of today.

It will depend on what you need or want to get from your dataset.

modified 8 weeks ago • written 8 weeks ago by genomax80k

The goal is to do taxonomic profiling of WGS shotgun sequencing data from human stool samples. I'm trying to understand the benefits of interrogating a larger database like NCBI or Ensembl versus using the smaller, curated database of MetaPhlAn2.

written 8 weeks ago by richard0

Larger databases are going to have redundancy, which will cause problems with alignments (multi-mapping). It sounds like the paper referenced by @Asaf below may be the way to go.
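To make the redundancy point concrete, here is a toy sketch in Python. It is not how any particular classifier works internally; it only counts how many k-mers are unique to each genome, which is a rough proxy for how many reads could be assigned unambiguously. The sequences, the genome names and the tiny `k=4` are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmer_counts(genomes, k=4):
    """For each genome, count the k-mers that occur in no other genome."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    counts = {}
    for name, own in sets.items():
        # Union of the k-mers of every other genome in the database.
        others = set().union(*(s for n, s in sets.items() if n != name))
        counts[name] = len(own - others)
    return counts


# Toy database: two unrelated "genomes" (invented sequences).
small_db = {
    "strain_A": "ACGTTGCAAGGCT",
    "species_B": "TTTTCCCCGGGGAAAA",
}
# Same database plus a near-identical strain (one base differs).
large_db = dict(small_db, strain_A2="ACGTTGCAAGGCA")

before = unique_kmer_counts(small_db)
after = unique_kmer_counts(large_db)
print("unique k-mers for strain_A:", before["strain_A"], "->", after["strain_A"])
```

Adding the near-identical strain removes almost all k-mers that were unique to `strain_A`, so most reads from it can no longer be placed on a single genome. Real tools deal with this in different ways (for example by assigning at a higher taxonomic rank), but the loss of strain-level signal is the same idea.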

written 8 weeks ago by genomax80k

What do you want to achieve? What kind of samples do you have?

You can also subset the NCBI/Ensembl databases if you have specific targets or can exclude specific bacteria. That way you get high resolution for your target group.

EDIT: Another strategy is to pre-screen your data with a subset of bacterial genomes, for example all RefSeq assemblies tagged "Reference" or "Representative". Then you download all genomes related to the ones found in the pre-screening.
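A minimal Python sketch of that subsetting, assuming you work from an NCBI-style `assembly_summary.txt` table. The column names `assembly_accession` and `refseq_category` and the values "reference genome" / "representative genome" are my understanding of that file, so check them against the copy you download. The accessions and organism names in the sample are placeholders, not real records.

```python
def select_representatives(summary_text):
    """Return accessions tagged as reference or representative genomes.

    Expects tab-separated text whose header line starts with '#' and
    contains the column names (as in NCBI's assembly_summary.txt).
    """
    wanted = {"reference genome", "representative genome"}
    header = None
    rows = []
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Other comment lines are skipped; the header names the columns.
            if "assembly_accession" in line:
                header = line.lstrip("# ").split("\t")
            continue
        rows.append(line.split("\t"))
    if header is None:
        raise ValueError("no header line with 'assembly_accession' found")
    acc_col = header.index("assembly_accession")
    cat_col = header.index("refseq_category")
    return [row[acc_col] for row in rows if row[cat_col] in wanted]


# Placeholder table with only three of the real file's columns.
SAMPLE = "\n".join([
    "#   toy excerpt, not real data",
    "#assembly_accession\trefseq_category\torganism_name",
    "GCF_TOY_1\treference genome\tToy organism one",
    "GCF_TOY_2\tna\tToy organism two",
    "GCF_TOY_3\trepresentative genome\tToy organism three",
])

print(select_representatives(SAMPLE))
```

The returned accessions would then be the download list for the pre-screening database.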

EDIT2: So you have sequencing data you want to screen? What type of data?

modified 8 weeks ago • written 8 weeks ago by mschmid150

The goal is to do taxonomic profiling of WGS shotgun sequencing data from human stool samples. I'm trying to understand the benefits of interrogating a larger database like NCBI or Ensembl versus using the smaller, curated database of MetaPhlAn2.

written 8 weeks ago by richard0

Do you have many samples for high-throughput processing, or just a few for a more in-depth analysis?

written 8 weeks ago by mschmid150

Ideally in-depth analysis but for many samples. At first I want to do taxonomy, but later may look into gene pathways present.

written 8 weeks ago by richard0

I would suggest using the corrected GTDB from this paper: https://www.biorxiv.org/content/10.1101/712166v1.full.pdf. See also this GitHub repository: https://github.com/rrwick/Metagenomics-Index-Correction

written 8 weeks ago by Asaf7.2k

Thank you, will check it out

written 8 weeks ago by richard0
mschmid150 wrote:

If you have many samples, want to do an in-depth analysis, and have enough time, I personally would do the following:

1) Get a basic set of RefSeq or Ensembl bacterial genome assemblies covering all taxonomic groups. Either take all of them or subset them cleverly (for example, by taking representative genomes). You can do the same for fungi and protists, or other eukaryotes, if you think you might have them in the sample. I guess archaea are not necessary, but there are not that many, so you may as well add some of them. You can remove genomes from surveillance projects, as many of them can be virtually identical.

2) With those you do a first screening of all samples to roughly identify what you have. I would use something like Kraken2 or Centrifuge.

3) Now extend the groups you find with closely related genomes from RefSeq and maybe GenBank. Be careful with GenBank genomes: they could have wrong taxonomic annotation, so it may be worth checking this. Do the same for bacteria, fungi, protists and whatever else you have. Kick out the genomes where you do not see any hits. Maybe add some or all viral RefSeq genomes.

4) Now do another screening and check whether you have a good representation of what is there. You can check the reads that did not get any hits with BLAST or similar to get an indication of what you are missing.

5) Improve your DB a bit more if necessary.

6) Enjoy :)
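The screen / extend / re-screen loop in steps 2 to 5 could be sketched roughly like this in Python. This is only an illustration of the bookkeeping, not a pipeline: the per-genome read counts would come from whatever classifier you run, and `catalog`, `hits`, the genome IDs and the `min_reads` cutoff are all hypothetical. Grouping by genus is an assumption for what "closely related" means here.

```python
def refine_database(catalog, hits, min_reads=10):
    """Pick the genomes for the next, refined database.

    catalog: dict genome_id -> genus, for every genome you could download.
    hits: dict genome_id -> reads assigned to it in the first screening.
    Genera with a well-supported hit are extended with all their catalog
    genomes (step 3); genera without such a hit are dropped.
    """
    supported = {
        catalog[genome]
        for genome, reads in hits.items()
        if genome in catalog and reads >= min_reads
    }
    return {genome for genome, genus in catalog.items() if genus in supported}


def unclassified_fraction(total_reads, hits):
    """Fraction of reads without any hit (step 4: what are we missing?)."""
    return 1 - sum(hits.values()) / total_reads


# Hypothetical catalog and first-pass screening result.
catalog = {
    "A1": "Bacteroides",
    "A2": "Bacteroides",
    "B1": "Escherichia",
    "C1": "Vibrio",
}
hits = {"A1": 500, "B1": 3}

next_db = refine_database(catalog, hits)
print(sorted(next_db))
print(round(unclassified_fraction(1000, hits), 3))
```

If the unclassified fraction stays high after the second screening, that is the cue for step 4: pull those reads out and check them with BLAST before changing the database again.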

written 8 weeks ago by mschmid150