Question

Access genomes/proteomes by BioSample ID

20 months ago

bvm ▴ 20

I'd like to download multiple genome assemblies or proteomes using a set of BioSample IDs from NCBI.

I can find the assemblies belonging to the BioSample IDs in a browser (via the search field at https://www.ncbi.nlm.nih.gov/), but I couldn't find a command-line solution.

E.g. for BioSample SAMN09405588 the assembly ID is PDT000806148.1, and from https://www.ncbi.nlm.nih.gov/assembly/GCA_014136285.1/ I can download the proteome: GCA_014136285.1_PDT000806148.1_protein.faa.gz. Thank you for your help!

NCBI BioSample assembly • 1.5k views

ADD COMMENT • link updated 20 months ago by MirianT_NCBI ▴ 720 • written 20 months ago by bvm ▴ 20

score 3 · Accepted Answer · 2022-08-26

20 months ago

GenoMax 141k

Using EntrezDirect:

$ esearch -db biosample -query SAMN09405588  | elink -target assembly | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,AssemblyName,FtpPath_GenBank
GCA_014136285.1 PDT000806148.1  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/136/285/GCA_014136285.1_PDT000806148.1
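
Since the question asks about a set of BioSample IDs, the pipeline above could be wrapped in a small script. The following is a minimal Python sketch, assuming EDirect is installed and on the PATH, and that xtract emits its default tab-separated columns. The names `parse_xtract_line` and `assemblies_for_biosample` are made up for illustration.

```python
import shlex
import subprocess

# Same EDirect pipeline as above, with a placeholder for the BioSample ID.
PIPELINE = (
    "esearch -db biosample -query {query} | elink -target assembly | esummary | "
    "xtract -pattern DocumentSummary "
    "-element AssemblyAccession,AssemblyName,FtpPath_GenBank"
)


def parse_xtract_line(line):
    """Split one tab-separated xtract row into (accession, name, ftp_path).

    Missing trailing columns are padded with empty strings.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    fields += [""] * (3 - len(fields))
    return tuple(fields[:3])


def assemblies_for_biosample(biosample_id):
    """Run the pipeline for one BioSample ID and return the parsed rows."""
    cmd = PIPELINE.format(query=shlex.quote(biosample_id))
    result = subprocess.run(
        cmd, shell=True, check=True, capture_output=True, text=True
    )
    return [parse_xtract_line(l) for l in result.stdout.splitlines() if l.strip()]
```

Calling `assemblies_for_biosample("SAMN09405588")` in a loop over the IDs would collect one row per linked assembly; a short pause between calls may be needed to stay within NCBI's request limits.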

Once you have the assembly accession, I suggest using NCBI Datasets or a tool like Kai Blin's "ncbi-genome-download".


Hi, after you retrieve the list of accessions, you can download them using NCBI Datasets like this:

datasets download genome accession --inputfile list.txt

This command will download a zip file with metadata and genomic sequences and, if available, protein, transcript, and GFF3 files. Feel free to reach out if you have any questions.

ADD REPLY • link 20 months ago by MirianT_NCBI ▴ 720
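
The `list.txt` passed to `--inputfile` can be produced from whatever accessions were collected earlier. This is a minimal Python sketch, assuming the file should hold one accession per line; `write_accession_list` is a hypothetical helper name, not part of NCBI Datasets.

```python
from pathlib import Path


def write_accession_list(accessions, path="list.txt"):
    """Write unique, non-empty accessions to `path`, one per line, in input order."""
    unique = []
    for acc in accessions:
        acc = acc.strip()
        if acc and acc not in unique:
            unique.append(acc)
    Path(path).write_text("".join(a + "\n" for a in unique))
    return unique
```

For example, `write_accession_list(["GCA_014136285.1"])` writes a one-line `list.txt` in the current directory, ready for the `datasets download genome accession --inputfile list.txt` command.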

score 2 · Accepted Answer · 2022-08-26

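Not the best solution, but it still gives usable results (with Python):

import requests

bs = "SAMN09405588"  # BioSample ID to look up (the example from the question)

# Fetch the Assembly search page for this BioSample. The string splits below
# depend on the page's current HTML layout and will break if NCBI changes it.
html = requests.get("https://www.ncbi.nlm.nih.gov/assembly/?term={}".format(bs)).text
gca = html.split("GenBank assembly accession: </dt><dd>")[1].split()[0]
assembly_id = html.split("<title>")[1].strip().split()[0]
# FTP directory layout: GCA/014/136/285/<accession>_<assembly id>/
link = "https://ftp.ncbi.nlm.nih.gov/genomes/all/{0}/{1}_{2}/{1}_{2}_protein.faa.gz".format("/".join([gca[:3],gca[4:7],gca[7:10],gca[10:13]]), gca, assembly_id)

The proteome can then be downloaded from the resulting link.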
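
The link construction above can be isolated from the HTML scraping. Here is a minimal Python sketch that builds the same URL from an accession and an assembly name obtained by any means; the function name `protein_url` and the accession pattern are assumptions for illustration, and the assembly name is assumed to appear unchanged in the FTP directory name.

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, a dot, and a version.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{3})(\d{3})(\d{3})\.\d+$")


def protein_url(accession, assembly_name):
    """Build the protein FASTA URL using the directory layout shown above."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError("unexpected accession format: {!r}".format(accession))
    subdirs = "/".join(match.groups())  # e.g. GCA/014/136/285
    directory = "{}_{}".format(accession, assembly_name)
    return "https://ftp.ncbi.nlm.nih.gov/genomes/all/{0}/{1}/{1}_protein.faa.gz".format(
        subdirs, directory
    )


print(protein_url("GCA_014136285.1", "PDT000806148.1"))
```

Validating the accession with a regular expression instead of fixed slices makes a malformed input fail early rather than produce a wrong path.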

score 2 · Accepted Answer · 2022-08-26

It would be a two-step process. First, extract the download URL using the E-utilities, then use that URL to fetch the genomic, protein, or assembly files.

Assembly-specific URLs can be extracted using:

esearch -db assembly -query "SAMN09405588" | esummary | xtract -pattern FtpSites -sep "\n" -element FtpPath | sed -n 2p

This would output: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/136/285/GCA_014136285.1_PDT000806148.1
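
For the second step, the individual file URLs can be derived from that directory path. The sketch below is written in Python under stated assumptions: files are named after the directory plus a suffix, as with the `_protein.faa.gz` file in the question, and the `_genomic.fna.gz` and `_genomic.gff.gz` suffixes are assumptions not confirmed in this thread. The `ftp://` scheme is swapped for `https://` on the same host, and `file_urls` is a hypothetical name.

```python
import posixpath
from urllib.parse import urlsplit

# Suffix per file type; only the protein suffix appears in the thread,
# the other two are assumptions.
SUFFIXES = {
    "proteome": "_protein.faa.gz",
    "genome": "_genomic.fna.gz",
    "annotation": "_genomic.gff.gz",
}


def file_urls(ftp_path):
    """Map each file type to an HTTPS URL inside the assembly directory."""
    parts = urlsplit(ftp_path.rstrip("/"))
    base = posixpath.basename(parts.path)  # directory name doubles as file prefix
    root = "https://{}{}".format(parts.netloc, parts.path)
    return {
        kind: "{}/{}{}".format(root, base, suffix)
        for kind, suffix in SUFFIXES.items()
    }
```

Not every assembly has a protein file, so a download step built on this should handle a missing file gracefully.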