Question: Is there any method to download a complete genome fasta file of a certain species from NCBI automatically
0
gravatar for thustar
3.1 years ago by
thustar100
thustar100 wrote:

Hi biostars!

I'd like to write a program to download some complete genome fasta files of a list of species from NCBI. However, when I use BioPython, I can not get the result I want.

from Bio import Entrez Entrez.email = "thustar@mlp.edu" search_term = 'Acidaminococcus sp. D21' handle = Entrez.esearch(db='nucleotide', term=search_term) record = Entrez.read(handle) ids = record['IdList']

ids will return lots of numbers and some of ids are not fasta files. Is there a better method to fetch exactly a file contain all contigs in a certain genome or at least automatically remove those ids whose corresponding files are not sequence files?

Thanks

genome • 1.9k views
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by thustar100
1
gravatar for thustar
3.1 years ago by
thustar100
thustar100 wrote:

Thanks for everybody answering this question!

I would like to share my python codes to solve this problem partially.

import os
import cPickle

database_file = "assembly_summary_genbank.txt"
destination='/data/thustar/composition_explore/genomes/'
miss_list = []
def fetch_genome(genome_name,database_file,destination):
    global miss_list
    exist = False
    temp = os.popen('grep "'+genome_name+'" '+database_file).read()#read bash result in python, if grep did not search, return ''
    if len(temp)>0:
        link = temp.split('\t')[19]
        extra_name = link.split('/')[-1]
        link = link+'/'+extra_name+'_genomic.fna.gz'
        exist = True
        print "HIT "+genome_name
        if not os.path.isfile(destination+genome_name.replace(' ','_')+'.gz'):
            genome_name = genome_name.replace('/','_')
            os.system("wget -c "+link+" -O "+destination+genome_name.replace(' ','_')+'.gz')

        if not exist:
            print genome_name+" not exist"
            miss_list.append(genome_name)

with open('genomes.txt','r') as f:
    for line in f:
        genome_name = line.split('\t')
        genome_name = genome_name[0]
        fetch_genome(genome_name,database_file,destination)

with open('miss_list.txt','wb') as g:
    cPickle.dump(miss_list,g)
ADD COMMENTlink modified 3.1 years ago by genomax70k • written 3.1 years ago by thustar100
0
gravatar for piet
3.1 years ago by
piet1.7k
planet earth
piet1.7k wrote:

You should enclose the name of the taxon in double quotes, and you should restrict to sequences of complete replicons, see also A: Assembly Accession Number

search_term = '"Pseudomonas aeruginosa"[Organism] AND complete[Properties]'
ADD COMMENTlink written 3.1 years ago by piet1.7k

Unfortunately, after I change search_term to "Acidaminococcus sp. D21[orgn] AND complete genome", I got ids = ['224815814', '224815813', '224815811', '224815805', '224815803']. The lengths of the fasta file fetched are 118 645 690 1817 3389

There must be something wrong because the length of reference genome is several million bases. The codes of fetching procejure are these: for i in xrange(len(ids)): handle = Entrez.efetch(db="nucleotide", id=ids[i], rettype="fasta", retmode="text") record = handle.read() print len(record)

Any further information?

ADD REPLYlink written 3.1 years ago by thustar100

You have quoted the search term in the wrong way. Please try:

"Acidaminococcus sp. D21"[orgn] AND (complete[Properties] or "wgs master[Properties]")

https://www.ncbi.nlm.nih.gov/nuccore/?term=%22Acidaminococcus%20sp.%20D21%22%5BOrganism%5D%20AND%20(complete%5BProperties%5D%20or%20%22wgs%20master%22%5BProperties%5D)

This will show you that there is currently no complete genome available for "Acidaminococcus sp. D21", but one WGS set.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by piet1.7k

There is one in RefSeq genomes.

http://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Acidaminococcus_sp._D21/latest_assembly_versions/GCF_000174215.1_ASM17421v1/

ADD REPLYlink written 3.1 years ago by genomax70k

Thanks for your reply. I guess @5heikki provides a good solution, but still have some steps to make it perfect. I would really appreciate it if you could tell me where to download the missing 38 genomes in my reply to @5heikki

ADD REPLYlink written 3.1 years ago by thustar100
0
gravatar for kapil.joshi036
3.1 years ago by
Student ,School of life sciences, Manipal University, Manipal, India
kapil.joshi03680 wrote:

Different genomes have been sequenced by different institutes with different motivations and interests. As such there is no single site where you can find all the genome information that you may want.

Thus said the NCBI is a good place to start as they curate GenBank database whose contents get mirrored and exchanged with other meta-genomic warehouses such as EMBL and DDBJ.

Please have a look at this as well http://www.ncbi.nlm.nih.gov/sites/genome and this to download genome data for various organisms. ftp://ftp.ncbi.nlm.nih.gov/genomes/

I would suggest you refine your question to be more specific.

ADD COMMENTlink written 3.1 years ago by kapil.joshi03680

OK. Thanks for your advice. I would like to download 290 genomes in this tabel: https://bitbucket.org/berkeleylab/metabat/wiki/genomes_table . Therefore I want to write a program to download them automatically. It is a waste of time to search and download 290 fasta files of complete genome manually.

I did search some genomes on http://www.ncbi.nlm.nih.gov/sites/genome, but I prefer to use some api or something automatical to download all files I need.

Do you have any suggestions?

ADD REPLYlink written 3.1 years ago by thustar100
2

grep the names from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

Column 20 provides the base ftp address for the assembly data.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by 5heikki8.4k

Thanks a lot! That file is quite useful. Most of genomes I want could be found in it.

But there are still 38 genomes not found. They are

'Alistipes onderdonkii DSM 19147',
 'Bacteroidales bacterium ph8',
 'Bacteroides sp. 2_1_7',
 'Bacteroides pectinophilus ATCC 43243',
 'Clostridiales bacterium OBRC5-5',
 'Clostridium asparagiforme DSM 15981',
 'Clostridium bartlettii DSM 16795',
 'Clostridium bolteae ATCC BAA-613',
 'Clostridium cf. saccharolyticum K10',
 'Clostridium citroniae WAL-17108',
 'Clostridium clostridioforme 2_1_49FAA',
 'Clostridium hathewayi DSM 13479',
 'Clostridium leptum DSM 753',
 'Clostridium methylpentosum DSM 5476',
 'Clostridium nexile DSM 1787',
 'Clostridium ramosum DSM 1402',
 'Clostridium saccharolyticum WM1',
 'Clostridium spiroforme DSM 1552',
 'Clostridium symbiosum WAL-14163',
 'Clostridium symbiosum WAL-14673',
 'Eubacterium biforme DSM 3989',
 'Eubacterium cylindroides T2-87',
 'Eubacterium dolichum DSM 3991',
 'Eubacterium eligens ATCC 27750',
 'Eubacterium hallii DSM 3353',
 'Eubacterium rectale ATCC 33656',
 'Eubacterium rectale DSM 17629',
 'Eubacterium rectale M104/1',
 'Eubacterium siraeum DSM 15702',
 'Eubacterium siraeum V10Sc8a',
 'Eubacterium siraeum 70/3',
 'Lachnospiraceae bacterium 4_1_37FAA',
 'Lactobacillus acidophilus 30SC',
 'Odoribacter splanchnicus DSM 220712',
 'Ruminococcus obeum ATCC 29174',
 'Ruminococcus obeum A2-162',
 'Streptococcus salivarius JIM8780',
 'butyrate-producing bacterium SSC/2'

Is there any other source to find them?

ADD REPLYlink modified 3.1 years ago by genomax70k • written 3.1 years ago by thustar100

Have you looked in here: http://ftp.ncbi.nih.gov/genomes/refseq/bacteria/ The names may be different since you may have strain designation from a different repository.

ADD REPLYlink written 3.1 years ago by genomax70k

Yeah, I looked in that website. I can not find the genome names which could not be found by my program on that website. It is possible that name designation from different repository leads to name mismatch. If the problem is that the name on NCBI is different from the genome name I want to get, there should be some file like a dictionary to map different names in the same strain. Then the question will be how to translate those names which can not be found directly in the list to genome names corresponding to the same strain in the list?

ADD REPLYlink written 3.1 years ago by thustar100

With one random sample, this is with Entrez direct:

grep "$(esearch -db taxonomy -query "Clostridium nexile DSM 1787" | efetch -format docsum | xtract -element ScientificName)" assembly_summary_genbank.txt | awk 'BEGIN{OFS=FS="\t"}{print $8,$20}'
Tyzzerella nexilis DSM 1787 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000156035.2_ASM15603v2

You can see this info also here. Maybe the other ones are also like this..

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by 5heikki8.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1558 users visited in the last hour