Question

Download complete bacterial genomes and associated plasmid sequences from NCBI

7.6 years ago

wanderingstefan ▴ 30

Hey all!

I am trying to download all completely assembled bacterial genomes together with the associated plasmid sequences. I download the complete sequences using biopython:

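from Bio import Entrez
Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
search_term = "bacteria[orgn] AND complete genome[title]"
handle = Entrez.esearch(db="nucleotide", retmax=100000, term=search_term)
genome_id = Entrez.read(handle)["IdList"]  # list of matching nucleotide IDs
handle.close()
print("Fetched Id list...")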

This gives me a list of the ID numbers of all matching bacterial genomes. Then, for each ID, I use Entrez efetch to download the record like this (in both GenBank and FASTA format):

import time
# `genome` is one ID taken from the genome_id list
record = Entrez.efetch(db="nucleotide", id=genome, rettype="fasta", retmode="text")
time.sleep(1)  # pause between requests
seq_record = Entrez.efetch(db="nucleotide", id=genome, rettype="gbwithparts", retmode="text")
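For anyone adapting the two snippets above, here is a minimal sketch of how the ID list could be fetched in batches instead of one record at a time. The helper names (`chunked`, `fetch_all`, `entrez_fasta`), the batch size and the e-mail address are illustrative assumptions, not part of the original code. The only Biopython call used is `Entrez.efetch`, which accepts a comma-separated ID string.

```python
import time


def chunked(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_all(ids, fetch, batch_size=20, delay=1.0):
    """Call fetch(batch) for every batch of IDs and collect the results."""
    results = []
    for batch in chunked(ids, batch_size):
        results.append(fetch(batch))
        time.sleep(delay)  # stay polite towards the NCBI servers
    return results


def entrez_fasta(batch):
    """Fetch one batch as FASTA text via Biopython (needs network access)."""
    from Bio import Entrez  # imported here so the helpers above run without Biopython

    Entrez.email = "you@example.org"  # placeholder address (assumption)
    handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()
```

With the ID list from the search, `fetch_all(genome_id, entrez_fasta)` would return one FASTA text block per batch. Passing the fetch function as an argument keeps the batching logic usable with any other download function.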

However, to my knowledge plasmids are not included in 'complete genome' sequences. I know that the data I need are organized on NCBI. When typing 'bacteria[ORGN]' as the search criterion in the NCBI search, I get a page listing the different bacteria that have sequence data on NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=bacteria%5BORGN%5D). Clicking on a bacterium and then on the 'Organism overview: Genome assembly and annotation report' link leads me to a table listing every assembly and the corresponding plasmid sequences (https://www.ncbi.nlm.nih.gov/genome/genomes/154). After deselecting contigs, chromosome and scaffolds in the table, it contains exactly the data I want: the genome accession for each complete assembly, plus the corresponding plasmid IDs. I can even download it in .csv format.

The problem: NCBI contains 8900 different sequenced species/strains of bacteria. If I have to download the data manually for each bacterium, I have to prolong my education for at least 10 years. Is there any biopython or NCBI guru out there who knows how to automate this?

ncbi genbank biopython • 4.7k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 7.6 years ago by wanderingstefan ▴ 30

Not answering your question directly, but the plasmid information you need is available in this directory. Get the *.genomic.fna files, which contain the sequences; the FASTA headers name the organism each plasmid is associated with. You will have to split them, but it sounds like you know Python, so that should be simple for you.
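As a rough illustration of the splitting step, the sketch below groups the records of a multi-FASTA file by the organism text in each header. It assumes headers of the form `>ACCESSION Organism name plasmid pXYZ, complete sequence`; that layout and the function names are assumptions, so check them against the real files first.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def organism_of(description):
    """Return the description text that precedes the word ' plasmid'."""
    return description.split(" plasmid", 1)[0].strip()


def group_by_organism(lines):
    """Map organism text -> list of (accession, sequence) tuples."""
    groups = defaultdict(list)
    for header, sequence in read_fasta(lines):
        accession, _, description = header.partition(" ")
        groups[organism_of(description)].append((accession, sequence))
    return dict(groups)
```

Calling `group_by_organism(open("some_file.genomic.fna"))` (file name hypothetical) gives one entry per organism string. Note that this only groups by whatever the header says, so it cannot separate strains when the strain is missing from the header.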

ADD REPLY • link 7.6 years ago by GenoMax 141k

Hey, thanks for your answer! The problem with this is that several strains of each species exist, but not each of those carries the plasmid. Most of the time, strain information is not in the fasta header. In the table I provided, the plasmids are associated with the respective genomes, which is important for me. I need to know exactly which plasmid is associated with which genome.

ADD REPLY • link 7.6 years ago by wanderingstefan ▴ 30

You can retrieve the table easily, correct? Could you use that information to correlate the plasmids with the right genomes?

Edit: Based on this thread ("What does 'complete genome' in NCBI include"), the information is there in the records for genomic DNA.

ADD REPLY • link 7.6 years ago by GenoMax 141k

score 0 · Answer 1 · 2017-10-20

6.5 years ago

alceal • 0

Did you manage to automate it? I need to do the same. What I did was download the CSV and load it into a pandas data frame, but I'd like to know how to do that with Biopython.
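For reference, a minimal sketch of pulling the genome-to-plasmid mapping out of such a CSV with the standard library only. It assumes the table has an `Assembly` column and a `Replicons` column whose entries look like `chromosome:NC_x/CP_x; plasmid pA:NC_y/CP_y`. Both the column names and that layout are assumptions and should be checked against the downloaded file.

```python
import csv
import io


def parse_replicons(field):
    """Split a replicon string into chromosome and plasmid accession lists."""
    chromosomes, plasmids = [], []
    for item in field.split(";"):
        name, sep, accessions = item.strip().partition(":")
        if not sep:
            continue  # skip entries without a 'name:accession' structure
        accession = accessions.split("/")[0].strip()  # keep the first accession
        if not accession:
            continue
        if name.lower().startswith("plasmid"):
            plasmids.append(accession)
        else:
            chromosomes.append(accession)
    return chromosomes, plasmids


def plasmids_by_assembly(csv_text, assembly_col="Assembly", replicon_col="Replicons"):
    """Map each assembly accession to its chromosome and plasmid accessions."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        chromosomes, plasmids = parse_replicons(row[replicon_col])
        table[row[assembly_col]] = {"chromosomes": chromosomes, "plasmids": plasmids}
    return table
```

The resulting accessions could then be passed to `Entrez.efetch` one assembly at a time, which keeps every plasmid tied to its genome.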

ADD COMMENT • link 6.5 years ago by alceal • 0

Hey,

Yes, in the end I did it as described in the question. For many of the genomes, the associated plasmids are contained in the multi-FASTA file, if I remember correctly :-)

ADD REPLY • link 6.4 years ago by wanderingstefan ▴ 30

score 0 · Answer 2 · 2018-01-29

You could also do this using the "Download Assemblies" button in NCBI Assembly. Start with a query like this:

https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria%5Borgn%5D+AND+has_plasmid%5BProperties%5D+AND+latest_refseq%5Bfilter%5D+AND+complete_genome%5Bfilter%5D

Click "Download Assemblies", select RefSeq or GenBank (for FASTA, the only difference should be the accessions), and choose genomic FASTA. That'll give you a tarball with one file for each assembly (genomic + plasmid(s)).
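Once the tarball is downloaded, something like the sketch below could list which sequences in each assembly file are plasmids. It assumes the archive members are gzip-compressed FASTA files ending in `.fna.gz` and that plasmid records carry the word "plasmid" in their header; both are assumptions about the download, and the function name is made up for illustration.

```python
import gzip
import io
import tarfile


def replicons_per_assembly(tar_fileobj):
    """Map each *.fna.gz member name to its chromosome/plasmid accessions."""
    summary = {}
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.endswith(".fna.gz")):
                continue
            raw = archive.extractfile(member).read()
            text = gzip.decompress(raw).decode("ascii", errors="replace")
            entry = {"chromosomes": [], "plasmids": []}
            for line in text.splitlines():
                if line.startswith(">"):
                    accession = line[1:].split(" ", 1)[0]
                    # Assumption: plasmid records mention 'plasmid' in the header
                    kind = "plasmids" if " plasmid" in line else "chromosomes"
                    entry[kind].append(accession)
            summary[member.name] = entry
    return summary
```

Usage would be along the lines of `replicons_per_assembly(open("genome_assemblies.tar", "rb"))`, where the file name is a placeholder for whatever the download is called.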