Efficiently retrieve the molecular type for a list of NCBI accessions
4 months ago
john ▴ 80

I have the following problem: I want to select only unspliced sequences from a list of BLAST results. I use a local copy of the nt database. The only way I see at the moment is to use Entrez to fetch the GenBank file for each accession and then check the LOCUS line for the molecule type.

This comes with a few drawbacks. I can call Entrez with a list of accessions, in which case I get one continuous stream of all GenBank records in the list, which can take quite some time to parse if the list contains accessions for full chromosomes. The other way would be to make an Entrez request for each accession individually. That means a lot of requests, so I need to add some downtime between requests or I will get an HTTP 429 error, which again prolongs the process.
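For the one-request-per-accession route, the downtime between requests could be handled by a small retry wrapper. This is only a sketch: `with_backoff` is a hypothetical helper name, and the retry count and starting delay in the usage line are assumptions, not NCBI recommendations.

```shell
# Hypothetical helper: run a command, retrying on failure with a doubling delay.
# Usage: with_backoff MAX_TRIES INITIAL_DELAY_SECONDS command [args...]
with_backoff() {
  wb_max=$1
  wb_delay=$2
  shift 2
  wb_try=1
  while :; do
    # Success: stop retrying.
    if "$@"; then
      return 0
    fi
    # Give up once the allowed number of attempts is used.
    if [ "$wb_try" -ge "$wb_max" ]; then
      return 1
    fi
    sleep "$wb_delay"
    wb_delay=$((wb_delay * 2))
    wb_try=$((wb_try + 1))
  done
}

# Example call (not executed here; needs EDirect and network access):
#   with_backoff 5 1 efetch -db nuccore -id NM_000059.4 -format gb
```

The wrapper only reacts to a non-zero exit status, so it assumes the wrapped command actually fails when the server answers with a 429.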

As this whole thing should be used for a web service, I do not know which sequences will be submitted, so I potentially need to know the molecule type for every sequence in the nt database. The best solution for me would therefore be a local database that maps every NCBI accession to its molecule type. This would speed up the process tremendously. The best way I see at the moment is to download all https://ftp.ncbi.nlm.nih.gov/genbank/gbbct*.seq.gz files and parse them locally. Or is there a better way?
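If the local-parsing route is taken, the flat files do not need a full GenBank parser, since only the LOCUS and VERSION lines matter. A minimal sketch, assuming the usual layout in which the molecule type is the field right after `bp` on the LOCUS line; `genbank_moltypes` and the output file name are hypothetical:

```shell
# Hypothetical helper: read GenBank flat-file text on stdin and print
# "accession.version<TAB>molecule type", one line per record.
genbank_moltypes() {
  awk '
    /^LOCUS / {
      # Assumption: the molecule type is the field following "bp".
      mol = "unknown"
      for (i = 3; i < NF; i++) {
        if ($i == "bp") { mol = $(i + 1); break }
      }
    }
    # The VERSION line carries the versioned accession used by BLAST hits.
    /^VERSION / { print $2 "\t" mol }
  '
}

# Example call (not executed here):
#   gzip -dc gbbct*.seq.gz | genbank_moltypes > moltype.tsv
```

Records whose LOCUS line lacks a molecule type would be misreported by this sketch, so the resulting table is worth spot-checking before it is used in a service.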

GenBank NCBI • 399 views
4 months ago
GenoMax 115k

Can you provide a couple of example accession numbers? There should be a way to do this using Entrez Direct (EDirect).

Here is one example

    $ esearch -db nuccore -query NM_000059.4 | efetch -format gb | grep LOCUS
    LOCUS NM_000059 11954 bp mRNA linear PRI 09-JAN-2022

Okay, interesting: while trying to clarify myself for that comment, I figured out that your approach is way better. Until now I used the Biopython Entrez module, which scaled really badly, especially when parsing a lot of big GenBank files. The bash pipeline does not seem to have this downside. So I would say thanks.

Be sure to sign up for an NCBI API key if you are planning to do a lot of look-ups.

Okay, I figured out another way this can be done, and it is even quicker.

    $ esummary -db nuccore -id NG_011749.1,NM_000240,F12345,AF223456 | xtract -pattern DocumentSummary -element AccessionVersion Biomol

This is much quicker than your version when the GenBank files get bigger.
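For a long accession list, the esummary call above could be run in batches rather than with one huge `-id` argument. The sketch below is an assumption-laden illustration: `chunk_ids` and `fetch_moltypes` are hypothetical names, the batch size of 200 is an arbitrary guess rather than a documented limit, and the input file name is made up.

```shell
# Hypothetical helper: read one accession per line on stdin and print
# comma-separated batches of at most N accessions (default 200).
chunk_ids() {
  awk -v n="${1:-200}" '
    NF { buf = (count ? buf "," : "") $1; count++ }
    count == n { print buf; buf = ""; count = 0 }
    END { if (count) print buf }
  '
}

# Hypothetical wrapper: look up the molecule type for each batch.
# Requires EDirect (esummary, xtract) and network access.
fetch_moltypes() {
  chunk_ids "${1:-200}" | while IFS= read -r batch; do
    # Redirect stdin so esummary does not swallow the remaining batches.
    esummary -db nuccore -id "$batch" < /dev/null |
      xtract -pattern DocumentSummary -element AccessionVersion Biomol
  done
}

# Example call (not executed here):
#   fetch_moltypes 200 < accessions.txt > moltype.tsv
```

As far as I know, EDirect picks up an API key from the `NCBI_API_KEY` environment variable, which would cover the API key advice given earlier in the thread.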