Hey all!
I am trying to download all completely assembled bacterial genomes together with their associated plasmid sequences. I download the complete sequences using Biopython:
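from Bio import Entrez
Entrez.email = "you@example.com"  # placeholder address; NCBI asks for a contact email
search_term = "bacteria[orgn] AND complete genome[title]"
handle = Entrez.esearch(db="nucleotide", retmax=100000, term=search_term)
genome_id = Entrez.read(handle)['IdList']
handle.close()
print("Fetched Id list...")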
This gives me a list of the ID numbers of all the bacterial genomes. Then I use Entrez efetch to download each one like this (in both GenBank and FASTA format):
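import time
for genome in genome_id:  # loop over the IDs returned by esearch
    record = Entrez.efetch(db='nucleotide', id=genome, rettype='fasta', retmode='text')
    time.sleep(1)  # pause between requests to avoid overloading NCBI
    seq_record = Entrez.efetch(db='nucleotide', id=genome, rettype='gbwithparts', retmode='text')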
However, plasmids are, to my knowledge, not included in 'complete genome' sequences. I know that the data I need are organized on NCBI. When typing 'bacteria[ORGN]' as the search criterion in the NCBI search, I get a page listing the different bacteria that have sequence data on NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=bacteria%5BORGN%5D). Clicking on a bacterium and then the 'Organism overview: Genome assembly and annotation report' link leads me to a table listing every assembly and the corresponding plasmid sequences (https://www.ncbi.nlm.nih.gov/genome/genomes/154). After deselecting contigs, chromosome and scaffolds in the table, it contains exactly the data I want, with the genome accession for each complete assembly, plus the corresponding plasmid ID. I can even download it in .csv format.
The problem: NCBI contains 8900 different sequenced species/strains of bacteria. If I have to download the data manually for each bacterium, I have to prolong my education for at least 10 years. Is there any biopython or NCBI guru out there who knows how to automate this?
Not answering your question directly, but the information you need about the plasmids is available in this directory. Get the *.genomic.fna files, which contain the sequences, with the names of the organisms the plasmids are associated with in the FASTA headers. You will have to split them, but it sounds like you know Python, so that should be simple for you.
Hey, thanks for your answer! The problem with this is that several strains of each species exist, but not every strain carries the plasmid. Most of the time, strain information is not in the FASTA header. In the table I linked, the plasmids are associated with their respective genomes, which is important for me. I need to know exactly which plasmid is associated with which genome.
You can retrieve that table easily, correct? You could then use that information to match the plasmids to the right genomes.
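As a rough sketch of that matching step, the downloaded .csv could be parsed so that each assembly maps to its chromosome and plasmid entries. Note the assumptions: the column names "Assembly" and "Replicons", and the replicon layout "label:accession; label:accession", are guesses at what the export looks like and should be checked against the header of the real file. The function name plasmids_by_assembly is made up for this example.

```python
import csv
import io


def plasmids_by_assembly(csv_text, assembly_col="Assembly", replicon_col="Replicons"):
    """Map each assembly to its chromosome and plasmid replicons.

    Assumes (not verified against the real export) that one column holds the
    assembly accession and another holds entries such as
    "chromosome:ACC1/ACC2; plasmid pX:ACC3/ACC4".
    """
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        chromosomes, plasmids = [], []
        for item in (row.get(replicon_col) or "").split(";"):
            item = item.strip()
            if not item:
                continue
            # Split "plasmid pX:ACC3/ACC4" into its label and its accessions.
            label, _, accessions = item.partition(":")
            target = plasmids if label.lower().startswith("plasmid") else chromosomes
            target.append((label.strip(), accessions.strip()))
        result[row[assembly_col]] = {"chromosomes": chromosomes, "plasmids": plasmids}
    return result
```

Calling it with the text of one downloaded table, for example plasmids_by_assembly(open("table.csv").read()) where table.csv is a placeholder file name, gives a dictionary keyed by assembly, so every plasmid stays attached to the genome it was sequenced with.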
Edit: Based on this thread, "What does 'complete genome' in NCBI include", the information is there in the records for genomic DNA.
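If the link really is in the records themselves, one possible way to use it is to group the downloaded GenBank records by a shared identifier. The sketch below is only an illustration under two assumptions that should be verified on real data: that a chromosome and its plasmids carry the same BioSample accession in the DBLINK field, and that plasmid records have the word "plasmid" in their DEFINITION line. It works on the plain GenBank text (as returned with rettype 'gbwithparts'), and the function names read_headers and group_by_biosample are made up for this example.

```python
import re
from collections import defaultdict

WANTED = ("DEFINITION", "VERSION", "DBLINK")


def read_headers(genbank_text):
    """Yield a dict of selected header fields for each GenBank record."""
    fields, key = {}, None
    for line in genbank_text.splitlines():
        if line.startswith("//"):                 # end of one record
            if fields:
                yield fields
            fields, key = {}, None
        elif line and not line[0].isspace():      # top-level keyword in column 1
            key, _, value = line.partition(" ")
            if key in WANTED:
                fields[key] = value.strip()
        elif key in WANTED and line.startswith(" " * 12):
            fields[key] += " " + line.strip()     # wrapped continuation line
        else:                                     # sub-keyword, feature or sequence
            key = None
    if fields:                                    # last record without a terminator
        yield fields


def group_by_biosample(genbank_text):
    """Group record versions by BioSample (assumed to link genome and plasmids)."""
    groups = defaultdict(lambda: {"chromosome": [], "plasmid": []})
    for fields in read_headers(genbank_text):
        match = re.search(r"BioSample:\s*(\S+)", fields.get("DBLINK", ""))
        sample = match.group(1) if match else "unknown"
        # Heuristic: treat a record as a plasmid if its definition says so.
        is_plasmid = "plasmid" in fields.get("DEFINITION", "").lower()
        version = fields.get("VERSION", "").split()
        kind = "plasmid" if is_plasmid else "chromosome"
        groups[sample][kind].append(version[0] if version else "unknown")
    return dict(groups)
```

Feeding it the concatenated text of all fetched records would give one entry per BioSample, each listing the chromosome and plasmid accessions found for it. Records without a BioSample end up under "unknown" and would need to be matched some other way.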