Question: Downloading All The Incomplete Bacterial Genomes
1
6.8 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Following the post at download all the bacterial genomes from ncbi, I was able to download all the completed bacterial genomes easily from here: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

However, there are a lot of bacteria for which only genome drafts of varying qualities exist.

The 'draft' portion of the NCBI bacterial genomes (ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/) also lists some, but is it complete? Also, there is no compiled file (e.g. all_draft_bacterial_genomes.fna) like the one in ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. There seem to be 6970 drafts in there.

My question is: where could I download all the sequences (contigs / scaffolds) from all those incomplete genomes?

I would exclude species where only a small proportion of the genome, say less than 5 or 10%, is available.

For now, it looks like I will have to retrieve all of the files ending in scaffold.fna.tgz from the 6970 draft folders with wget. This is satisfactory for ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/, but are there other sources I should consider?
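
A minimal sketch of that wget retrieval, assuming the draft folders sit directly under the FTP directory and that the files of interest end in scaffold.fna.tgz. The function name fetch_draft_scaffolds and the destination folder are made up for illustration.

```shell
# Hypothetical helper: recursively fetch only the scaffold archives.
# $1 = base FTP URL, $2 = local destination folder (both chosen by the caller).
fetch_draft_scaffolds() {
    base_url=$1
    dest_dir=$2
    # -r   recurse into the per-genome folders
    # -np  never climb above the starting directory
    # -A   keep only files whose names match the pattern
    # -P   save everything under the destination folder
    wget -r -np -A '*scaffold.fna.tgz' -P "$dest_dir" "$base_url"
}
```

It could be called as: fetch_draft_scaffolds ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/ drafts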

bacteria • 3.4k views
ADD COMMENTlink modified 4.3 years ago by rattus840 • written 6.8 years ago by Eric Normandeau10k
1
4.3 years ago by
rattus840
rattus840 wrote:

I had to write to NCBI about this.

Here is my recipe, adapted from Case 1 in this document: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

-- at Bash/Mac OSX prompt in the desired directory:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | \
awk '{FS="\t"} \!/^#/ {print $20}' | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genomic_file

-- final command, in the same directory, where you want to install the files:

wget -i genomic_file
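
The sed pattern in the recipe expects the assembly folder to sit directly under genomes/all/. If the paths in your copy of assembly_summary.txt look different, a variant that simply reuses the last path component may be easier to adapt. This is only a sketch: it assumes, as the recipe does, that field 20 holds the assembly directory, and the function name summary_to_urls is made up for illustration.

```shell
# Hypothetical helper: read an assembly_summary.txt on stdin and print one
# download URL per assembly. Assumes field 20 holds the assembly directory.
summary_to_urls() {
    awk -F '\t' '
        /^#/ { next }                        # skip header and comment lines
        $20 == "" || $20 == "na" { next }    # skip rows without a directory
        {
            n = split($20, parts, "/")       # last path component = assembly name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }'
}
```

For example: curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | summary_to_urls > genomic_file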

GenBank is where all current complete and incomplete sequences have been stored and updated since Dec 12, 2015. Note that if you want a different taxonomic branch, you have to look at the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) and replace "bacteria" in the ftp address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq" in the curl command and "GCA" with "GCF" in the sed command. This is also described in the factsheet, which words it the other way around: "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command's "refseq" to "genbank". Since these assembly's accession initial are different, you will need change sed command's "GCF" to "GCA"."

If copying and pasting the lines above, or from the factsheet, gives you an error, I would paste them into a text editor with syntax highlighting (Emacs, textedit, etc.) and re-type any weirdly colored characters, and quotes on principle (right-slanted double quotes are interpreted differently than left-slanted, etc.).

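Some shells require an escape character ("\") in front of special characters used within AWK commands: csh and tcsh treat ! as special even inside single quotes, so it must be written \! there. Bash does not need the escape inside single quotes, so if you still have errors, as a second round of treatment, try removing the \ in front of the !.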
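
A quick way to check how your own shell treats the exclamation mark, sketched here for Bash (the function name is made up for illustration). Inside single quotes, Bash hands the ! to AWK unchanged.

```shell
# Hypothetical helper: drop lines that start with "#", as the recipe does
# for the header of assembly_summary.txt.
drop_comment_lines() {
    awk '!/^#/'
}

# Two test lines: one comment, one data row.
printf '# header line\ndata line\n' | drop_comment_lines
```

In Bash this prints only "data line". If your shell complains about an event not being found, it is applying history expansion and the \! form is the one to use.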

Finally, the folder content is not a final database for standalone BLAST. (I used blastn, part of the BLAST+ suite of command-line tools provided by NCBI.) You will have to use makeblastdb (included if you downloaded the suite to get blastn) to convert and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), the number of files in such a complete taxonomic download is too large for makeblastdb to handle. However, if you decompress the .gz files and cat them into one file, it's fine!

cat *.fna > all_bacteria_fna_files.fna

makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria
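
The downloaded files end in _genomic.fna.gz, so they need decompressing before the cat step, and with tens of thousands of files a plain *.fna glob may run into the shell's argument-length limit. A sketch of one way around both, with a made-up function name and assuming gzip and find are available:

```shell
# Hypothetical helper: decompress every *_genomic.fna.gz under a folder
# into one FASTA file. find hands the file names to gzip in batches, so a
# very large number of files does not overflow the command line.
# $1 = folder holding the downloads, $2 = combined output file.
concat_genomes() {
    src_dir=$1
    out_file=$2
    find "$src_dir" -type f -name '*_genomic.fna.gz' -exec gzip -dc {} + > "$out_file"
}
```

For example, concat_genomes . all_bacteria_fna_files.fna, followed by the makeblastdb command above.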

Then, you have to make sure blastn can find the folder containing the new database by setting the BLASTDB environment variable.

export BLASTDB="$HOME/genomes/bacteria/genbank_2_3_2016"
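
Before launching a long search, it can help to confirm that BLASTDB really points at the folder holding the database. A small sketch, with a made-up function name; it relies only on the fact that nucleotide database files carry extensions beginning with "n".

```shell
# Hypothetical helper: succeed if files for the named nucleotide database
# exist in the folder that BLASTDB points to.
db_files_present() {
    name=$1
    # Single-volume files look like name.nin; multi-volume databases add
    # numbered volumes (name.00.nin) plus an alias file (name.nal).
    for f in "$BLASTDB/$name".n* "$BLASTDB/$name".[0-9]*.n*; do
        [ -e "$f" ] && return 0
    done
    return 1
}
```

For example: db_files_present bacteria && blastn -db bacteria -query my_query.fna -outfmt 6 -out hits.tsv (my_query.fna and hits.tsv are placeholder file names).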

Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!

Kim


Here are some of my extremely messy notes on the process. Feel free to ignore.

  • bacteria
      - Use awk/sed/curl recipe from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf) to get files by parsing the local genome/...assembly_summary.txt file for directories for species of interest.
      - Get subdirectory "bacteria" from genbank. Content of this directory, from the NCBI ftp genomes/genbank README (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/README.txt): "2) genbank: content includes primary submissions of assembled genome sequence and associated annotation data, if any, as exchanged among members of the International Nucleotide Sequence Database Collaboration, of which NCBI's GenBank database is a member. The GenBank directory area includes genome sequence data for a larger number of organisms than the RefSeq directory area; however, some assemblies are unannotated. The sub-directory structure includes: a. archaea b. bacteria c. fungi d. invertebrate e. other - this directory includes synthetic genomes f. plant g. protozoa h. vertebrate_mammalian i. vertebrate_other"
      - http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq5_010.html
      - 5.11. My script aborts with an error message, "event not found". This error is generated by the csh or tcsh shells, not by sed. The exclamation mark (!) is special to csh/tcsh, and if you use it in command-line or shell scripts--even within single quotes--it must be preceded by a backslash. Thus, under the csh/tcsh shell:
            sed '/regex/!d'     # will fail
            sed '/regex/\!d'    # will succeed
        The exclamation mark should not be prefixed with a backslash when the script is called from a file, as "-f script.file".
      - Put into emacs and re-typed anything that it colored as being a… strange character (some underscores were replacing spaces), as well as all single and double quotes.
      - Final command: wget -i genomic_file
      - FINISHED --2016-02-10 01:56:15--
      - Downloaded: 58953 files, 62G in 18h 8m 34s (1002 KB/s)
ADD COMMENTlink written 4.3 years ago by rattus840
0
6.8 years ago by
Russian Federation
irinagaranina2410 wrote:

A very useful site for working with bacterial genes and genomes is MicrobesOnline: http://meta.microbesonline.org/programmers.html#Locus I gave you a link to the SQL server, where you can download scaffolds from the tables Scaffold, ScaffoldSeq, etc.

ADD COMMENTlink written 6.8 years ago by irinagaranina2410

After scanning the site, it appears to contain information about only a few bacteria and a handful of metagenome data sets. Am I missing something?

ADD REPLYlink written 6.8 years ago by Eric Normandeau10k

Eric, try for example this query to get strain names and scaffold IDs:

mysql -h pub.microbesonline.org -u guest -pguest genomics -B -e 'source scaf.sql' > scaf.out

"scaf.sql":

SELECT Taxonomy.name, Scaffold.scaffoldId
FROM ScaffoldSeq
INNER JOIN Scaffold ON Scaffold.scaffoldId=ScaffoldSeq.scaffoldId
INNER JOIN Taxonomy ON Taxonomy.taxonomyId=Scaffold.taxonomyId;

To get the scaffold sequence, add ScaffoldSeq.sequence to the first line. Try exploring this page: http://meta.microbesonline.org/programmers.html#Taxonomy
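
One way to sidestep shell-quoting problems with -e is to keep the query in a file and feed it to mysql on standard input. A sketch, assuming the same host, guest credentials, and table names as in the query above; the helper name write_scaffold_query is made up for illustration.

```shell
# Hypothetical helper: write the query from this post into the file named
# by the first argument. The quoted 'SQL' marker keeps the shell from
# touching anything inside the query text.
write_scaffold_query() {
    cat > "$1" <<'SQL'
SELECT Taxonomy.name, Scaffold.scaffoldId
FROM ScaffoldSeq
INNER JOIN Scaffold ON Scaffold.scaffoldId = ScaffoldSeq.scaffoldId
INNER JOIN Taxonomy ON Taxonomy.taxonomyId = Scaffold.taxonomyId;
SQL
}
```

After write_scaffold_query scaf.sql, the query could be run with: mysql -h pub.microbesonline.org -u guest -pguest -B genomics < scaf.sql > scaf.out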

ADD REPLYlink written 6.8 years ago by irinagaranina2410

All I get in scaf.out is the mysql help, so it looks like there is a mistake somewhere. At this point, I am not sure that this resource will help me.

ADD REPLYlink written 6.8 years ago by Eric Normandeau10k