Question

Efficiently retrieve the molecular type for a list of NCBI accessions

0

2.3 years ago

john ▴ 130

I have the following problem: I want to select only unspliced sequences from a list of BLAST results. I use a local copy of the nt database. The only way I see at the moment is to use Entrez to fetch the GenBank file for each accession and then check the LOCUS line for the molecule type.

This comes with a few drawbacks. I can query Entrez with a list of accessions, in which case I get one continuous stream of all the GenBank files in the list, which can take quite some time to parse if the list contains accessions for full chromosomes. The other way would be to make an Entrez request for each accession individually. That produces a lot of requests, so I need to add some wait time between them or I get an HTTP 429 error, which again prolongs the process.

As this is meant to run behind a web service, I do not know which sequences will be submitted, and hence I may need to know the molecule type for every sequence in the nt database. So the best solution for me would be a local database that maps every NCBI accession to its molecule type. This would speed up the process tremendously. The best way I see at the moment is to download all https://ftp.ncbi.nlm.nih.gov/genbank/gbbct*.seq.gz files and parse them locally. Or is there a better way?
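For reference, the local-parsing idea could be sketched roughly as below. This is only a sketch: the function name `genbank_moltypes` is made up for illustration, and it assumes the usual whitespace-separated LOCUS layout (name, length, "bp", molecule type, topology, division, date) and that every record has a VERSION line carrying the accession.version.

```shell
# Hypothetical helper: read GenBank flat-file text on stdin and print
# "accession.version<TAB>molecule type", one line per record.
# Assumption: the molecule type is the 5th whitespace-separated field
# of the LOCUS line, which holds for the common modern layout.
genbank_moltypes() {
  awk '
    /^LOCUS/   { moltype = $5 }            # remember the type for this record
    /^VERSION/ { print $2 "\t" moltype }   # emit it once the accession is known
  '
}
```

Something like `gzip -dc gbbct*.seq.gz | genbank_moltypes > moltypes.tsv` would then build the lookup table in a single pass, without loading the sequence data into memory.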

GenBank NCBI • 929 views

ADD COMMENT • link 2.2 years ago by john ▴ 130

score 1 · Answer 1 · 2022-01-13

1

2.3 years ago

GenoMax 141k

Can you provide a couple of example accession numbers? There should be a way to do this using Entrez Direct.

Here is one example

$ esearch -db nuccore -query NM_000059.4 | efetch -format gb | grep LOCUS
LOCUS       NM_000059              11954 bp    mRNA    linear   PRI 09-JAN-2022

ADD COMMENT • link 2.3 years ago by GenoMax 141k

0

Okay, interesting: while trying to clarify my question for that comment, I figured out that your approach is way better. Until now I used the Biopython Entrez module, which scaled really badly, especially when parsing a lot of big GenBank files. The bash pipeline does not seem to have this downside. So I would say thanks.

ADD REPLY • link 2.3 years ago by john ▴ 130

2

Be sure to sign up for an NCBI API key if you are planning to do a lot of look-ups.
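As a minimal sketch of how the key is typically used: Entrez Direct picks it up from the `NCBI_API_KEY` environment variable. The value shown is a placeholder, not a real key.

```shell
# Placeholder value: substitute the key from your own NCBI account settings.
# With a key, E-utilities allow up to 10 requests per second instead of 3.
export NCBI_API_KEY="your_api_key_here"
```

Putting this line in the shell profile of the account that runs the look-ups makes it apply to every `esearch`, `efetch`, and `esummary` call.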

ADD REPLY • link 2.3 years ago by GenoMax 141k

0

Okay, I figured out another way to do this, and it is even quicker.

$ esummary -db nuccore -id NG_011749.1,NM_000240,F12345,AF223456 | xtract -pattern DocumentSummary -element AccessionVersion Biomol

This is much quicker than your version when the GenBank files get bigger.
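The two-column output (accession, then Biomol, tab-separated) can then be filtered directly. The sketch below is one possible way; the function name `keep_unspliced` is made up, and treating every Biomol value that contains "rna" as a transcript is an assumption that should be checked against the values actually returned.

```shell
# Hypothetical helper: read "accession<TAB>biomol" lines on stdin and
# print only the accessions whose biomol does not look like an RNA type.
# Assumption: any value containing "rna" (case-insensitive) is a transcript.
keep_unspliced() {
  awk -F '\t' 'tolower($2) !~ /rna/ { print $1 }'
}
```

Appending `| keep_unspliced` to the `esummary ... | xtract ...` pipeline would leave only the accessions to keep.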

ADD REPLY • link 2.2 years ago by john ▴ 130