I am trying to download all completely assembled bacterial genomes together with their associated plasmid sequences. I download the complete genome sequences using Biopython:
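    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

    search_term = "bacteria[orgn] AND complete genome[title]"
    handle = Entrez.esearch(db="nucleotide", retmax=100000, term=search_term)
    genome_id = Entrez.read(handle)['IdList']
    print("Fetched Id list...")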
This gives me a list of the ID numbers of all matching bacterial genomes. I then use Entrez efetch to download each one, in both FASTA and GenBank format:
    # genome is one ID from genome_id
    record = Entrez.efetch(db='nucleotide', id=genome, rettype='fasta', retmode='text')
    time.sleep(1)
    seq_record = Entrez.efetch(db='nucleotide', id=genome, rettype='gbwithparts', retmode='text')
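For context, the per-ID download above could be wrapped in a batched loop roughly as follows. This is only a sketch written for Python 3: the helper names `batches`, `download_all` and `entrez_fetch`, the batch size and the pause length are all hypothetical choices, not part of my actual script. The fetch function is passed in as an argument so the loop can be tried without touching the network.

```python
import time


def batches(ids, size):
    """Yield successive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def download_all(ids, fetch, size=100, pause=1.0):
    """Yield (batch, fasta_text, genbank_text) for every batch of IDs.

    `fetch(id_string, rettype)` must return the record text. It is injected
    so that a stand-in can replace the real network call.
    """
    for batch in batches(ids, size):
        id_string = ",".join(batch)  # efetch accepts comma-separated IDs
        fasta = fetch(id_string, "fasta")
        time.sleep(pause)  # stay polite towards the NCBI servers
        genbank = fetch(id_string, "gbwithparts")
        time.sleep(pause)
        yield batch, fasta, genbank


def entrez_fetch(id_string, rettype):
    """Thin wrapper around Entrez.efetch (needs Biopython and network access)."""
    from Bio import Entrez  # imported here so the helpers above work without it
    handle = Entrez.efetch(db="nucleotide", id=id_string,
                           rettype=rettype, retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()
```

With the ID list from the search, `for batch, fasta, genbank in download_all(genome_id, entrez_fetch): ...` would then write each result to disk; `Entrez.email` should be set before the first call.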
However, to my knowledge plasmids are not included in the 'complete genome' sequences. I know that the data I need are organized on NCBI. Typing 'bacteria[ORGN]' as the search criterion in the NCBI search gives me a page listing the different bacteria that have sequence data at NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=bacteria%5BORGN%5D). Clicking on a bacterium and then on the 'Organism overview: Genome assembly and annotation report' link leads me to a table listing every assembly and the corresponding plasmid sequences (https://www.ncbi.nlm.nih.gov/genome/genomes/154). After deselecting contigs, chromosome and scaffolds in the table, it contains exactly the data I want: the genome accession for each complete assembly, plus the corresponding plasmid IDs. I can even download it in .csv format.
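To make clear what I would do with one such table once it is exported, here is a minimal Python 3 sketch that pulls the chromosome and plasmid accessions out of the CSV. The column names `Level` and `Replicons`, the `name:accession; name:accession` layout, and the sample rows with their made-up accessions are all assumptions for illustration; they would need to be adapted to the real export.

```python
import csv
import io


def replicon_accessions(csv_text, level_col="Level", replicon_col="Replicons",
                        complete_label="Complete"):
    """Return (replicon name, accession) pairs for complete assemblies only.

    Column names and the 'name:accession; name:accession' layout are
    assumptions about the exported table, not a documented format.
    """
    found = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[level_col].strip() != complete_label:
            continue  # skip contig/scaffold/chromosome-level assemblies
        for entry in row[replicon_col].split(";"):
            name, _, accession = entry.strip().partition(":")
            if accession:
                found.append((name, accession))
    return found


# Made-up placeholder rows; the accessions are not real NCBI records.
SAMPLE = (
    "Strain,Level,Replicons\n"
    "strainA,Complete,chromosome:ACC0001.1; plasmid pA:ACC0002.1\n"
    "strainB,Contig,\n"
)

pairs = replicon_accessions(SAMPLE)
```

The resulting accessions could then be passed as `id` to `Entrez.efetch`, which is the step I already have working; what is missing is getting this table for every bacterium without clicking through the site.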
The problem: NCBI contains 8900 different sequenced species/strains of bacteria. If I have to download the data manually for each bacterium, I will have to prolong my education by at least 10 years. Is there any Biopython or NCBI guru out there who knows how to automate this?