Question: Reference database for metagenomics
pignottisimone wrote, 2.6 years ago:

I am working on a comparison of current metagenomics tools, and I am having trouble finding a good, complete and up-to-date reference database. My dream would be a selection of bacterial genomes from NCBI RefSeq with representatives from each species, covering strains with high phylogenetic diversity, as proposed in GEBA. Another nice feature would be easy downloading, since I don't find NCBI very user-friendly (it is not easy to select the interesting genomes, and downloading file by file over FTP takes ages, or I am simply not doing it properly). The best option I have found is HMP, but I would prefer a complete bacterial database. Another option would be SILVA, but I would like to compare performance on whole genomes rather than on 16S only.

Do you know of any free databases with these characteristics? What do people use as reference databases when dealing with metagenomics? Thanks in advance for any suggestions.


I think it really depends on the metagenomics project, but in general a database of the full reference genomes of bacteria, viruses, archaea and environmental samples would make a good starting database for genome-sequence-based comparison.

(comment by Sej Modha, 2.6 years ago)

Thank you for your answer. For now the comparison will be limited to a simulated read set, so that would be more than a good start. The problem is which database to get, where to find it and how to download it. Do you have any advice? Furthermore, tools like Kraken build huge databases, and their construction takes more than 100 GB of RAM just for the old bacterial RefSeq (~2500 sequences). That's why I am interested in selecting the "best" candidates to build the database upon.

(reply by pignottisimone, 2.6 years ago)
5heikki (Finland) wrote, 2.6 years ago:
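#!/bin/bash
mkdir ref_prok_rep_genomes
cd ref_prok_rep_genomes
# Quote the URL so that wget, not the shell, expands the ?? wildcard
wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/ref_prok_rep_genomes.??.tar.gz'
# tar reads one archive per invocation, so extract each volume in turn
for f in ref_prok_rep_genomes.??.tar.gz; do tar zxvf "$f"; done
# Could play with -outfmt to get easier parsing for a tax map
blastdbcmd -db ref_prok_rep_genomes -entry all > ref_prok_rep_genomes.fna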

This is the representative/reference archaea + bacteria set (~21 GB file). What actually counts as "representative" is of course relative, e.g. this db includes just one Salmonella genome (Salmonella enterica subsp. enterica serovar Typhi str. CT18).
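The -outfmt hint in the script can be sketched as follows. This is only a minimal sketch, assuming the downloaded BLAST database carries taxonomy information; the function name build_taxmap is hypothetical. It relies on the blastdbcmd format specifiers %a (accession) and %T (taxid) and rewrites the output as a two-column, tab-separated map.

```shell
#!/bin/bash
# Hypothetical helper: print "accession<TAB>taxid" for every entry in a BLAST db.
# Assumes the db was built with taxids; entries without one may be reported as 0.
build_taxmap() {
    db=$1
    # %a = accession, %T = taxid; awk normalises the separator to a tab
    blastdbcmd -db "$db" -entry all -outfmt '%a %T' |
        awk -v OFS='\t' '{print $1, $2}'
}

# Usage (run inside the directory holding the extracted db volumes):
#   build_taxmap ref_prok_rep_genomes > ref_prok_rep_genomes.taxmap
```

Whether this map is directly usable depends on the classifier being benchmarked; some tools expect their own taxonomy files in addition.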


That's great, thanks! I didn't think of checking the BLAST directory.

(reply by pignottisimone, 2.6 years ago)
pignottisimone wrote, 2.6 years ago:

I also want to share what I came up with, even though 5heikki's answer is very good for my purposes. I found a somewhat more modular way:

awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest"\
&& $5~/^(reference genome|representative genome)$/ {print $20}'\
assembly_summary_refseq.txt | awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}\
{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' > ftpfilepaths

This prints the FTP paths of the latest versions of all reference/representative genomes in RefSeq's bacterial database into the file 'ftpfilepaths', which you can then download with wget -i ftpfilepaths. To obtain their taxids instead:

awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest"\
&& $5~/^(reference genome|representative genome)$/ {print $1, $7}'\
assembly_summary_refseq.txt > acc2taxid.map

The first column of acc2taxid.map contains the assembly accessions (GCF_...), and the second column the species taxid of each row (use $6 instead of $7 for strain-level taxids).
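Since column 1 of assembly_summary_refseq.txt is an assembly accession while FASTA headers carry sequence accessions, a per-sequence map may be needed for some tools. The following is a rough sketch only: the function name seq2taxid is hypothetical, and it assumes the downloaded files keep NCBI's naming, where the file name starts with the assembly accession (e.g. GCF_000005845.2_ASM584v2_genomic.fna.gz).

```shell
#!/bin/bash
# Hypothetical helper: expand an assembly-level map to one line per sequence.
# Usage: seq2taxid acc2taxid.map *_genomic.fna.gz > seq2taxid.map
seq2taxid() {
    map=$1; shift
    for f in "$@"; do
        # Assembly accession = first two underscore-separated parts of the file name
        asm=$(basename "$f" | cut -d_ -f1,2)
        # Look up the taxid recorded for this assembly
        taxid=$(awk -F '\t' -v a="$asm" '$1 == a {print $2; exit}' "$map")
        # Skip files whose assembly is not in the map
        [ -n "$taxid" ] || continue
        # Emit "sequence accession<TAB>taxid" for each FASTA header
        gzip -dc "$f" |
            awk -v t="$taxid" -v OFS='\t' '/^>/ {sub(/^>/, "", $1); print $1, t}'
    done
}
```

Every sequence in a file (chromosome and plasmids alike) receives the taxid of its assembly, which seems the natural choice here but is an assumption of this sketch.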
