Question: Download complete bacterial genomes and associated plasmid sequences from NCBI
2.4 years ago by
wanderingstefan30 wrote:

Hey all!

I am trying to download all completely assembled bacterial genomes together with the associated plasmid sequences. I download the complete sequences using biopython:

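 from Bio import Entrez
 Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address
 search_term = "bacteria[orgn] AND complete genome[title]"
 handle = Entrez.esearch(db="nucleotide", retmax=100000, term=search_term)
 genome_id = Entrez.read(handle)["IdList"]  # list of nucleotide IDs as strings
 handle.close()
 print("Fetched Id list...")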

This gives me a list of the ID numbers of all the bacterial genomes. Then I use Entrez efetch to download each one like this (in both FASTA and GenBank format):

 import time
 for genome in genome_id:
     record = Entrez.efetch(db="nucleotide", id=genome, rettype="fasta", retmode="text")
     time.sleep(1)  # pause between requests
     seq_record = Entrez.efetch(db="nucleotide", id=genome, rettype="gbwithparts", retmode="text")
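
Fetching one record per request gets slow for thousands of IDs. Since efetch accepts several IDs in one call, one possible refinement is to request them in batches. A minimal sketch; the helper name `batches` and the batch sizes are illustrative assumptions, not part of the original post:

```python
def batches(ids, size=200):
    """Yield successive lists of at most `size` IDs (hypothetical helper)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Example with placeholder IDs; real IDs would come from Entrez.esearch.
example_ids = [str(n) for n in range(1, 8)]
for batch in batches(example_ids, size=3):
    # With Biopython, each batch could be passed as a comma-separated string:
    # Entrez.efetch(db="nucleotide", id=",".join(batch), rettype="fasta", retmode="text")
    print(",".join(batch))
```

Each response would then hold several concatenated records, so the output has to be split per record afterwards.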

However, to my knowledge plasmids are not included in 'complete genome' sequences. I know that the data I need are organized on NCBI. Typing 'bacteria[ORGN]' as the search criterion in the NCBI search gives a page listing the different bacteria that have sequence data on NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=bacteria%5BORGN%5D). Clicking on a bacterium and then the 'Organism overview: Genome assembly and annotation report' link leads to a table listing every assembly and the corresponding plasmid sequences (https://www.ncbi.nlm.nih.gov/genome/genomes/154). After deselecting contigs, chromosome and scaffolds in the table, it contains exactly the data I want: the genome accession for each complete assembly plus the corresponding plasmid ID. I can even download it in .csv format.

The problem: NCBI contains 8900 different sequenced species/strains of bacteria. If I have to download the data manually for each bacterium, I have to prolong my education for at least 10 years. Is there any biopython or NCBI guru out there who knows how to automate this?

download genbank biopython ncbi • 2.1k views
ADD COMMENTlink modified 12 months ago by tdmurphy160 • written 2.4 years ago by wanderingstefan30

Not answering your question directly but the information you need about the plasmids is available in this directory. Get the *.genomic.fna files which have the sequence and the names of the organisms the plasmids are associated with in the fasta header. You will have to split them but sounds like you know python so that should be simple for you.
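
The splitting step could look roughly like this. It is a sketch using only the standard library, and it assumes that plasmid records carry the word "plasmid" in their FASTA header, which may not hold for every file. The function names and the example records are made up for illustration:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_plasmid(header):
    """Assumption: plasmid records carry the word 'plasmid' in their header."""
    return "plasmid" in header.lower()


# Made-up records for illustration only; accessions and names are placeholders.
example = """>ACC_0001.1 Example bacterium strain X chromosome, complete genome
ACGTACGT
ACGT
>ACC_0002.1 Example bacterium strain X plasmid pEX1, complete sequence
GGCC
""".splitlines()

for header, seq in read_fasta(example):
    kind = "plasmid" if is_plasmid(header) else "chromosome/other"
    print(kind, header.split()[0], len(seq))
```

For a real file, an open file handle can be passed to `read_fasta` in place of the example lines.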

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax62k

Hey, thanks for your answer! The problem with this is that several strains of each species exist, but not each of those carries the plasmid. Most of the time, strain information is not in the fasta header. In the table I provided, the plasmids are associated with the respective genomes, which is important for me. I need to know exactly which plasmid is associated with which genome.

ADD REPLYlink written 2.4 years ago by wanderingstefan30

You could retrieve the table easily, correct? Then you could use that information to correlate the plasmids with the right genomes.

Edit: Based on this thread, "What does 'complete genome' in NCBI include", the information is there in the records for genomic DNA.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax62k
16 months ago by
alceal0 wrote:

Did you manage to automate it? I need to do the same. What I did was download the CSV and load it into a pandas data frame, but I'd like to know how to do that with Biopython.
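
One way to get from the downloaded table to a genome-to-plasmid mapping is sketched below with the standard `csv` module rather than Biopython. The column names `Assembly` and `Replicons` and the layout of the replicon field (`name:accession/accession; ...`) are assumptions that should be checked against the actual file. The example rows are placeholders, not real records:

```python
import csv
import io


def parse_replicons(field):
    """Split a replicon string into (name, accession) pairs.

    Assumed layout (not confirmed by the thread):
    'chromosome:ACC1/ACC2; plasmid pX:ACC3/ACC4'
    """
    pairs = []
    for item in field.split(";"):
        item = item.strip()
        if not item or ":" not in item:
            continue
        name, accessions = item.split(":", 1)
        # Keep the first accession when two are given, separated by '/'.
        pairs.append((name.strip(), accessions.split("/")[0].strip()))
    return pairs


def plasmids_by_assembly(csv_text, assembly_col="Assembly", replicon_col="Replicons"):
    """Map each assembly to its plasmid accessions (column names are assumptions)."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        plasmids = [acc for name, acc in parse_replicons(row[replicon_col])
                    if name.lower().startswith("plasmid")]
        result[row[assembly_col]] = plasmids
    return result


# Made-up table for illustration; accessions are placeholders, not real records.
example_csv = (
    "Assembly,Replicons\n"
    "ASM_A,chromosome:CHR_A1/CHR_A2; plasmid pOne:PL_A1/PL_A2\n"
    "ASM_B,chromosome:CHR_B1/CHR_B2\n"
)
print(plasmids_by_assembly(example_csv))
```

The resulting accessions could then be passed to `Entrez.efetch` in the same way as the genome IDs in the question.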

ADD COMMENTlink written 16 months ago by alceal0

Hey,

Yes, in the end I did it as described in the question. For many of the genomes the associated plasmids are contained in the multifasta file if I remember correctly :-)

ADD REPLYlink modified 14 months ago • written 14 months ago by wanderingstefan30
12 months ago by
tdmurphy160 wrote:

You could also do this using the "Download Assemblies" button in NCBI Assembly. Start with a query like this:

https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria%5Borgn%5D+AND+has_plasmid%5BProperties%5D+AND+latest_refseq%5Bfilter%5D+AND+complete_genome%5Bfilter%5D

Then click "Download Assemblies", select RefSeq or GenBank (for FASTA, the only difference should be the accessions), and choose genomic FASTA. That will give you a tarball with one file per assembly, each containing the genomic sequence plus any plasmid(s).
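
For a scripted variant, the same search term could probably be sent to the E-utilities esearch endpoint against the assembly database. The sketch below only decodes the term from the page URL above and builds the request URL without sending anything. Whether esearch returns exactly the same set as the web page is an assumption, and the function names and the `retmax` value are illustrative:

```python
from urllib.parse import urlencode, urlparse, parse_qs

ASSEMBLY_PAGE = (
    "https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria%5Borgn%5D+AND+"
    "has_plasmid%5BProperties%5D+AND+latest_refseq%5Bfilter%5D+AND+"
    "complete_genome%5Bfilter%5D"
)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term_from_page_url(url):
    """Decode the 'term' query parameter of an NCBI search page URL."""
    return parse_qs(urlparse(url).query)["term"][0]


def esearch_url(term, db="assembly", retmax=20):
    """Build an E-utilities esearch URL for the given query (no request is sent)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = term_from_page_url(ASSEMBLY_PAGE)
print(term)
print(esearch_url(term))
```

With Biopython, the decoded term could be passed as `Entrez.esearch(db="assembly", term=term)` instead of building the URL by hand.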

ADD COMMENTlink written 12 months ago by tdmurphy160