Question: How would I use BioPython to mass download some assemblies from NCBI?
Tom wrote:

So here's the situation. I have a spreadsheet of a few genome assemblies I need to pull from NCBI. I have the accession numbers for them, like "GCF_003031525", in a row (that accession number leads to https://www.ncbi.nlm.nih.gov/assembly/GCF_003031525.1/).

I need to download a few dozen assemblies, changing only the accession each time, and save them all to my drive.

I hear BioPython can access NCBI and do this. I was wondering how to set this up, or whether anyone has already automated something like this for a list of assemblies.

Tags: biopython, ncbi
written 21 days ago by Tom
JC wrote:

Entrez tools can be used to avoid coding.

written 21 days ago by JC

It would look like this with entrez direct:

esearch -db assembly -query GCA_003031525 | elink -target nuccore | efetch -format fasta > out.fa
written 21 days ago by Istvan Albert

If you specifically want to incorporate this into a (Bio)Python script, Biopython has a submodule for Entrez (Bio.Entrez). The syntax is very similar.

http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec139

written 21 days ago by Joe
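
For anyone who wants to try the Biopython route, here is a minimal sketch of how it might look. It assumes the assembly summary returned by Entrez carries `FtpPath_RefSeq` and `FtpPath_GenBank` fields, and that the genomic FASTA inside each FTP directory is named `<directory name>_genomic.fna.gz`. The function names, the email argument and the output directory are made up for illustration, so treat this as a starting point rather than a tested pipeline.

```python
from pathlib import Path
from urllib.request import urlretrieve


def genomic_fasta_url(ftp_path):
    """Build the URL of the genomic FASTA inside an assembly's FTP directory."""
    # Assumption: the file is named "<directory name>_genomic.fna.gz".
    https_path = ftp_path.replace("ftp://", "https://", 1).rstrip("/")
    dir_name = https_path.rsplit("/", 1)[-1]
    return f"{https_path}/{dir_name}_genomic.fna.gz"


def download_assembly(accession, email, out_dir="."):
    """Look up one assembly accession via Bio.Entrez and download its FASTA."""
    # Imported here so the URL helper above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: find the UID of the assembly record.
    handle = Entrez.esearch(db="assembly", term=accession)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        raise ValueError(f"no assembly record found for {accession}")

    # Step 2: fetch the summary, which is assumed to hold the FTP path.
    handle = Entrez.esummary(db="assembly", id=ids[0], report="full")
    summary = Entrez.read(handle, validate=False)
    handle.close()
    doc = summary["DocumentSummarySet"]["DocumentSummary"][0]

    # GCF_ accessions are RefSeq assemblies, GCA_ accessions are GenBank ones.
    key = "FtpPath_RefSeq" if accession.startswith("GCF_") else "FtpPath_GenBank"
    ftp_path = str(doc[key])
    if not ftp_path:
        raise ValueError(f"no FTP path listed for {accession}")

    # Step 3: download the file next to the other assemblies.
    url = genomic_fasta_url(ftp_path)
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    urlretrieve(url, str(target))
    return target


# Example use (needs network access and Biopython):
# for acc in ["GCF_003031525"]:
#     print(download_assembly(acc, email="you@example.org"))
```

The accession list could come straight from the spreadsheet column, for example by reading an exported CSV with the standard `csv` module.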
Istvan Albert wrote:

No need to use Biopython.

To mass download assemblies you can use the FTP site (note the FTP links in the right-hand sidebar of the assembly page) with tools such as wget or curl, fetching from locations such as:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/031/525/GCF_003031525.1_Neophocaena_asiaeorientalis_V1/

Alternatively, there are also scripts to streamline the process:

https://github.com/kblin/ncbi-genome-download

written 21 days ago by Istvan Albert
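
The FTP location quoted above follows a visible pattern: the nine digits of the accession are split into three groups of three, and the final directory is the versioned accession joined to the assembly name. A small Python sketch of that pattern is below. The function name is hypothetical, and it assumes you already know the version and the assembly name (here `Neophocaena_asiaeorientalis_V1`), since the bare accession alone is not enough to build the path.

```python
import re

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_dir_url(accession, assembly_name):
    """Rebuild the FTP directory URL for a versioned assembly accession."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"expected an accession like GCF_003031525.1, got {accession!r}")
    prefix, digits, _version = match.groups()
    # "003031525" becomes "003/031/525".
    triplets = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    return f"{FTP_ROOT}/{prefix}/{triplets}/{accession}_{assembly_name}/"


print(assembly_dir_url("GCF_003031525.1", "Neophocaena_asiaeorientalis_V1"))
```

The resulting URL can then be handed to wget or curl, or fetched from Python.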
genomax wrote:

Since non-Python solutions have been mentioned, consider NCBI Datasets, NCBI's command-line tool for downloading genomic data.

Note: The web interface is limited to eukaryotic genomes, and other constraints are described at the link posted by @Istvan in the comment below.

written 21 days ago by genomax

I have been looking at datasets. I am not too pleased with it so far; it feels like a half-baked solution that has no champion.

It is not documented properly beyond a few examples. In addition, the command-line interface is a bit verbose and rudimentary. But this post also demonstrates my biggest gripe with it.

Let's check what happens for the accession number that the original poster needs:

   datasets download assembly GCF_003031525

it prints:

Some of the accessions provided ('GCF_003031525') are invalid NCBI Assembly Accessions.

See https://www.ncbi.nlm.nih.gov/datasets/docs/which-genomes-are-in-datasets/ for more information.

ok, let's go to the website. Here is the first message there:

NCBI Datasets has been designed to give scientists the data that they want--which means we are leaving out some of the data that we think most users won't need.

So NCBI thinks you should not need the accession above and won't even bother including it. Let's be serious now, what kind of scientist studies GCF_003031525 anyway?

Here is a command-line tool that will not give you all the information because ... seemingly they don't want to bother with things that are not popular.

written 21 days ago by Istvan Albert

There is an explanation of what is excluded at the link you included above, so they are not doing this without telling users. They also tell users where the excluded genomes can be found.

This is one additional tool, like the others mentioned in this thread, and it comes with its own limitations. A major one is that the web interface only gives access to eukaryotic genomes.

written 21 days ago by genomax

What I find super irritating is that the error message says: "invalid NCBI Assembly Accessions".

Are these really invalid NCBI Assembly Accessions, or are they valid but deliberately not included?

We don't know; we need to search NCBI manually.

One should not need to copy-paste a link from an error message in a terminal, then visit NCBI and search, just to figure out whether the accession is actually valid and that some data was deliberately left out because "NCBI Datasets has been designed to give scientists the data that they want".

Perhaps it is the wording of that help message that ticks me off most.

written 21 days ago by Istvan Albert

EDIT: Curiously, using the fully qualified accession number (with version) works fine, so that error message is not appropriate (the accession number per se is not invalid):

$ ./datasets download assembly GCF_003031525.1
Downloading: ncbi_dataset.zip    836kB 1.12MB/s

So someone must be doing an over-zealous/literal check for matches (perhaps the thinking is that you will identify a specific accession first and then use it for downloads, who knows).

There is a way to send feedback:

We welcome feedback from the community. Please send any questions, comments or ideas to info@ncbi.nlm.nih.gov

written 21 days ago by genomax

Nice job tracking that down; it looks like it does work in the end.

Usually one would omit the version to make sure they get the latest build, but I guess here the tool really wants it.

written 21 days ago by Istvan Albert
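
Since the tool apparently only accepts versioned accessions, it may help to check a list up front before looping over it. The sketch below is one way to do that from Python. The `./datasets download assembly <accession>` syntax and the `ncbi_dataset.zip` output name are copied from the commands shown in this thread and may differ in other releases of the tool. The function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Matches accessions that carry a version suffix, e.g. GCF_003031525.1
VERSIONED = re.compile(r"GC[AF]_\d{9}\.\d+")


def has_version(accession):
    """Return True if the accession includes a version suffix."""
    return VERSIONED.fullmatch(accession) is not None


def datasets_command(accession, executable="./datasets"):
    """Build the command line used in this thread for one accession."""
    if not has_version(accession):
        raise ValueError(f"{accession} has no version suffix (e.g. '.1')")
    return [executable, "download", "assembly", accession]


def download_all(accessions, executable="./datasets"):
    """Run the tool once per accession, keeping each zip under its own name."""
    for accession in accessions:
        subprocess.run(datasets_command(accession, executable), check=True)
        # Assumption: every run writes ncbi_dataset.zip in the current
        # directory, so rename it before the next run overwrites it.
        Path("ncbi_dataset.zip").rename(f"{accession}.zip")


# Example use (needs the datasets binary in the current directory):
# download_all(["GCF_003031525.1"])
```

Unversioned accessions from the spreadsheet are rejected before anything is downloaded, which avoids the misleading "invalid accession" message partway through a batch.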
$ ./datasets assembly-descriptors taxon "Neophocaena asiaeorientalis"
{"assemblies":[{"assembly":{"annotation_metadata":{"file":[{"estimated_size":"13363717","type":"GENOME_GFF"},{"estimated_size":"966592048","type":"GENOME_GBFF"},{"estimated_size":"24351233","type":"RNA_FASTA"},{"estimated_size":"7796429","type":"PROT_FASTA"}],"name":"NCBI Annotation Release 100","release_date":"Apr 12, 2018","release_number":"100","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Neophocaena_asiaeorientalis_asiaeorientalis/100/","source":"NCBI"},"assembly_accession":"GCF_003031525.1","assembly_category":"representative genome","assembly_level":"Scaffold","chromosomes":["Un","MT"],"contig_n50":86003,"display_name":"Neophocaena_asiaeorientalis_V1","estimated_size":"1672080519","org":{"assembly_counts":{"node":2,"subtree":2},"breed":"wild","common_name":"Yangtze finless porpoise","key":"1706337","parent_tax_id":"189058","rank":"SUBSPECIES","sci_name":"Neophocaena asiaeorientalis asiaeorientalis","sex":"male","tax_id":"1706337","title":"Yangtze finless porpoise"},"seq_length":"2284611699","submission_date":"2018-04-03"}},{"assembly":{"annotation_metadata":{},"assembly_accession":"GCA_003031525.1","assembly_category":"representative genome","assembly_level":"Scaffold","chromosomes":["Un"],"contig_n50":86003,"display_name":"Neophocaena_asiaeorientalis_V1","estimated_size":"659931204","org":{"assembly_counts":{"node":2,"subtree":2},"breed":"wild","common_name":"Yangtze finless porpoise","key":"1706337","parent_tax_id":"189058","rank":"SUBSPECIES","sci_name":"Neophocaena asiaeorientalis asiaeorientalis","sex":"male","tax_id":"1706337","title":"Yangtze finless porpoise"},"seq_length":"2284611699","submission_date":"2018-04-03"}}],"total_count":2}

The assembly accession is embedded in that output.

$ ./datasets assembly-descriptors taxon "Neophocaena asiaeorientalis" | jq | grep assembly_accession
        "assembly_accession": "GCF_003031525.1",
        "assembly_accession": "GCA_003031525.1",
written 21 days ago by genomax
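
The same extraction can be done in Python instead of jq and grep, which may be handier if the accessions feed into a download script. This is a small sketch that assumes the JSON layout shown above (an `assemblies` list whose entries each hold an `assembly` object with an `assembly_accession` field). The sample string is a trimmed copy of that output and the function name is made up.

```python
import json


def assembly_accessions(descriptor_json):
    """Collect the versioned accessions from assembly-descriptors JSON."""
    report = json.loads(descriptor_json)
    return [
        entry["assembly"]["assembly_accession"]
        for entry in report.get("assemblies", [])
    ]


# Trimmed copy of the output above, keeping only the fields used here.
SAMPLE = (
    '{"assemblies":['
    '{"assembly":{"assembly_accession":"GCF_003031525.1"}},'
    '{"assembly":{"assembly_accession":"GCA_003031525.1"}}'
    '],"total_count":2}'
)

print(assembly_accessions(SAMPLE))
# prints ['GCF_003031525.1', 'GCA_003031525.1']
```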