Question

Accessing the reference genome from the NCBI Genome database with Biopython

2.5 years ago

Daniel • 0

Hello all,

I would like to get the RefSeq UID of the reference genome for a given taxonomy ID, using the Genome database with Biopython.

I will try to explain what I mean with images. I search the Genome database using a taxonomy ID. It returns a single result, and then I click on the "Reference genome" link.

[Image: search for a specific genome by taxonomy ID]

Now I scroll to the bottom of the page and get the RefSeq reference genome UID for the given taxonomy ID.

[Image: after clicking the link, I can get the RefSeq UID]

Is it possible to achieve this using Biopython?

taxonomyID genome reference biopython • 1.1k views

updated 2.5 years ago by GenoMax 141k • written 2.5 years ago by Daniel • 0

If you must use Biopython, then you should be able to use the Bio.Entrez package (LINK).

Using Entrez Direct (EDirect), you can simply do:

$ efetch -db nuccore -id NC_000913.3 -format fasta > NC_000913.fa
>NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
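
For the Biopython route, a rough equivalent of the efetch call above might look like the sketch below. It assumes Biopython is installed and that you supply your own contact e-mail, which NCBI asks for. The function names `fetch_fasta` and `first_accession` are made up for illustration and are not part of Biopython; only `Entrez.efetch` is.

```python
def first_accession(fasta_text):
    """Return the accession from the first FASTA header line.

    Hypothetical helper: a quick sanity check that the record returned
    is the one that was requested.
    """
    header = fasta_text.lstrip().splitlines()[0]
    if not header.startswith(">"):
        raise ValueError("text does not start with a FASTA header")
    return header[1:].split()[0]


def fetch_fasta(accession, email):
    """Fetch one nucleotide record as FASTA text (needs network access)."""
    # Imported here so the helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks every client to identify itself
    handle = Entrez.efetch(
        db="nuccore", id=accession, rettype="fasta", retmode="text"
    )
    try:
        return handle.read()
    finally:
        handle.close()
```

Calling `fetch_fasta("NC_000913.3", "you@example.org")` (placeholder address) should return the same FASTA text as the command above, which you can then write to a file yourself.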

2.5 years ago by GenoMax 141k

Sorry, I think I did not explain myself clearly. I want to write code that takes only a taxonomy ID as input and returns the RefSeq reference genome UID, so that I can later fetch it from the nucleotide database as you posted (that part I already know how to do).

2.5 years ago by Daniel • 0

score 1 · Answer 1 · 2021-10-17

Using Entrez Direct (output truncated to save space):

$ esearch -db taxonomy -query "1005566  [taxID]" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if SourceDb -contains refseq -element Caption,Title,SourceDb
NZ_AMUP00000000 Escherichia coli 07798, whole genome shotgun sequencing project refseq
NZ_JH964525 Escherichia coli 07798 strain 7798 E07798.contig.252, whole genome shotgun sequence refseq
NZ_JH964524 Escherichia coli 07798 strain 7798 E07798.contig.251, whole genome shotgun sequence refseq
NZ_JH964523 Escherichia coli 07798 strain 7798 E07798.contig.249, whole genome shotgun sequence refseq
NZ_JH964522 Escherichia coli 07798 strain 7798 E07798.contig.248, whole genome shotgun sequence refseq
NZ_JH964521 Escherichia coli 07798 strain 7798 E07798.contig.247, whole genome shotgun sequence refseq
NZ_JH964520 Escherichia coli 07798 strain 7798 E07798.contig.246, whole genome shotgun sequence refseq
NZ_JH964519 Escherichia coli 07798 strain 7798 E07798.contig.245, whole genome shotgun sequence refseq
NZ_JH964518 Escherichia coli 07798 strain 7798 E07798.contig.244, whole genome shotgun sequence refseq
NZ_JH964517 Escherichia coli 07798 strain 7798 E07798.contig.241, whole genome shotgun sequence refseq

If you only want NC_* accessions, then:

$ esearch -db taxonomy -query "511145  [taxID]" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if SourceDb -contains refseq -element Caption,Title,SourceDb | grep NC
NC_000913   Escherichia coli str. K-12 substr. MG1655, complete genome  refseq
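
Since the question asked for Biopython, here is a minimal sketch of one way to get from a taxonomy ID to RefSeq accessions with Bio.Entrez. Note that it is not a line-by-line port of the pipeline above: instead of taxonomy plus elink, it searches nuccore directly with a `txid...[Organism:exp] AND refseq[filter]` query and then reads the `Caption` field from the summaries. The function names (`refseq_term`, `only_nc`, `refseq_accessions`) and the `retmax` default are assumptions for illustration. It lists RefSeq nucleotide records for the taxon (including child taxa, because of `:exp`), which is not necessarily the single "Reference genome" entry shown on the Genome page.

```python
def refseq_term(taxid):
    """Build a nuccore query limited to one taxon and to RefSeq records."""
    return f"txid{int(taxid)}[Organism:exp] AND refseq[filter]"


def only_nc(accessions):
    """Keep only accessions with the NC_ prefix (like `grep NC` above)."""
    return [acc for acc in accessions if acc.startswith("NC_")]


def refseq_accessions(taxid, email, retmax=200):
    """Return RefSeq nuccore accessions for a taxonomy ID (needs network)."""
    # Imported here so the helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks every client to identify itself

    # Step 1: search nuccore for RefSeq records belonging to the taxon.
    handle = Entrez.esearch(db="nuccore", term=refseq_term(taxid), retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []

    # Step 2: fetch document summaries; "Caption" holds the accession,
    # the same field the xtract command above prints.
    handle = Entrez.esummary(db="nuccore", id=",".join(ids))
    summaries = Entrez.read(handle)
    handle.close()
    return [str(summary["Caption"]) for summary in summaries]
```

For example, `only_nc(refseq_accessions(511145, "you@example.org"))` (placeholder address) would be the rough counterpart of the second command. For taxa with many contigs you may need a larger `retmax` or paging.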