Balanced, nonredundant set of bacterial genomes
6.4 years ago
LeWöps • 0

Hi,

For a large-scale genome analysis, I am looking for a set of bacterial reference genomes. An ideal set would contain around 1-10k whole genomes that cover a broad range of bacteria, have a certain dissimilarity to each other and do not cover the same organisms multiple times (i.e. not multiple strains per organism).

Sequence databases offer sequences in large quantities of course, but I am unsure how to select a sensible subset. However, I feel like a lot of people must have had similar problems in the past. Is anyone aware of a) any data collection that might fit my needs, or b) a piece of work dealing with how to choose reference sequences?

sequence • 1.2k views

You can find the current list of bacterial genomes available at NCBI here. Since the choice of genomes is somewhat subjective, you will need to decide which combination works for you.
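
To get a feel for how redundant such a list is before choosing from it, one option is to tally the entries per assembly level. The sketch below assumes the list is a tab-separated file laid out like NCBI's assembly_summary.txt, with the assembly level in column 12. Both that layout and the function name `count_by_assembly_level` are assumptions for illustration, so check the column number against the header of your own file.

```bash
count_by_assembly_level() {
    # $1: path to an assembly_summary.txt-style TSV (assumed layout)
    awk -F '\t' '
        /^#/ { next }        # skip comment/header lines
        { n[$12]++ }         # column 12 is assumed to hold the assembly level
        END { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' "$1" | sort
}
```

Calling `count_by_assembly_level assembly_summary.txt` prints one line per level with its count, which shows how many entries would survive if you kept only, say, complete genomes.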


Isn't NCBI RefSeq what you are looking for?

a curated, non-redundant collection of reference sequences

Or maybe you can look into previous consortium efforts to build a non-redundant database, such as HMP.


NB: In case you want to download bacterial RefSeq, here is the recipe:

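# Prepare dir & url (in bash, $_ expands to the last argument of the previous command)
mkdir -p DL_refseq_bacteria && cd "$_"
ftp_url='ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria'
# Create list of urls to DL (the trailing slash makes curl list the directory)
curl --silent --list-only "$ftp_url/" | grep -i 'genomic.fna.gz' > list.txt
sed -i "s|^|$ftp_url/|" list.txt
# Actual DL
while read -r url; do axel -q "$url"; done < list.txt
# Merge everything into one file (concatenated gzip files remain valid gzip)
cat *.fna.gz > ncbi_refseq_release_bacteria.fna.gz
# Clean files: the !(...) pattern needs extglob; this also removes list.txt
shopt -s extglob
rm -f !(ncbi_refseq_release_bacteria.fna.gz)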

PS: Be aware that downloading 100+ GB of data is going to take some time.


Isn't NCBI RefSeq what you are looking for?

Not quite, since the following was listed as a requirement:

An ideal set would contain around 1-10k whole genomes that cover a broad range of bacteria, have a certain dissimilarity to each other and do not cover the same organisms multiple times (i.e. not multiple strains per organism)


Hi, same issue here. Did you end up finding a way to get this set? Thanks.


My comment above has a link to the list of current bacterial genomes.

  1. Pick the ones you like and make a download list of names. Then use this tool to download the genomes: https://github.com/kblin/ncbi-genome-download
  2. Alternatively, you could parse out the FTP paths in the file above and use them.
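
As a rough sketch of option 2, the function below keeps a single FTP path per species, which addresses the "no multiple strains per organism" requirement but not the dissimilarity one. It assumes the file is tab-separated in the layout of NCBI's assembly_summary.txt, with the RefSeq category in column 5, the species taxid in column 7, the assembly level in column 12 and the FTP path in column 20. Those column positions, the scoring rule and the name `pick_one_per_species` are all assumptions for illustration; verify the columns against the header of your file.

```bash
pick_one_per_species() {
    # $1: path to an assembly_summary.txt-style TSV (assumed layout)
    awk -F '\t' '
        /^#/ { next }                                   # skip comment/header lines
        {
            # Illustrative ranking: reference > representative > other,
            # with a small bonus for complete genomes.
            score = 0
            if ($5 == "reference genome")           score += 4
            else if ($5 == "representative genome") score += 2
            if ($12 == "Complete Genome")           score += 1
            # Keep the best-scoring entry per species taxid (first one wins ties).
            if (!($7 in best) || score > best[$7]) {
                best[$7] = score
                path[$7] = $20
            }
        }
        END { for (t in path) print path[t] }
    ' "$1" | sort
}
```

Something like `pick_one_per_species assembly_summary.txt > ftp_paths.txt` then gives a list of paths that can be fed to whatever download loop you prefer.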