I have the following problem: I want to select only unspliced sequences from a list of BLAST results. I use a local copy of the nt database. The only way I can see at the moment is to use Entrez to fetch the GenBank file for each accession and then check the molecule type in the LOCUS line.
This comes with a few drawbacks. I can query Entrez with a list of accessions, in which case I get back one continuous stream of all the GenBank files in the list, which can take quite some time to parse if the list contains accessions for full chromosomes. The other way would be to make an Entrez request for each accession individually. That means a lot of requests, so I need to pause between requests or I get an HTTP 429 error, which again prolongs the process.
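For reference, the batched variant with throttling could look roughly like the sketch below. It uses only the standard library and the E-utilities `efetch` endpoint; the helper names (`chunked`, `fetch_genbank`, `fetch_in_batches`), the batch size, and the delay are my own illustrative choices, not a recommendation from NCBI.

```python
import time
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_genbank(accessions):
    """Fetch GenBank flat-file text for one batch of accessions (network call)."""
    data = urllib.parse.urlencode({
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "gb",
        "retmode": "text",
    }).encode()
    # Passing `data` makes this a POST, which avoids very long URLs.
    with urllib.request.urlopen(EFETCH_URL, data=data) as response:
        return response.read().decode()


def fetch_in_batches(accessions, batch_size=100, delay=0.4, fetch=fetch_genbank):
    """Yield one response per batch, sleeping `delay` seconds between requests.

    batch_size and delay are assumed values; tune them to stay under the
    request rate that triggers HTTP 429.
    """
    for index, batch in enumerate(chunked(list(accessions), batch_size)):
        if index:
            time.sleep(delay)
        yield fetch(batch)
```

The `fetch` parameter is only there so the batching logic can be exercised without hitting the network. This still downloads the full records, so it does not remove the cost of pulling whole chromosomes.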
As this whole thing will be used for a web service, I do not know in advance which sequences will be submitted, so I may need to know the molecule type for every sequence in the nt database.
So the best solution for me would be a local database that gives me the molecule type for every NCBI accession. This would speed up the process tremendously.
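To make the idea concrete, here is a minimal sketch of such a lookup table using SQLite from the Python standard library. The table name `mol_type`, the column names, and the function names are hypothetical; any key-value store would do.

```python
import sqlite3


def build_lookup(db_path, records):
    """Create or update a table mapping accession.version to molecule type.

    `records` is an iterable of (accession, molecule) tuples.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mol_type ("
        "accession TEXT PRIMARY KEY, molecule TEXT NOT NULL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO mol_type VALUES (?, ?)", records
        )
    return conn


def molecule_types(conn, accessions):
    """Return {accession: molecule} for the accessions found in the table."""
    accessions = list(accessions)
    if not accessions:
        return {}
    # One placeholder per accession; very long hit lists may need to be
    # split to stay under SQLite's bound-parameter limit.
    placeholders = ",".join("?" * len(accessions))
    rows = conn.execute(
        "SELECT accession, molecule FROM mol_type "
        f"WHERE accession IN ({placeholders})",
        accessions,
    )
    return dict(rows)
```

With the table in place, filtering a list of BLAST hits becomes a single indexed query instead of one Entrez request per accession.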
The best approach I can see at the moment is to download all https://ftp.ncbi.nlm.nih.gov/genbank/gbbct*.seq.gz files and parse them locally. Or is there a better way?
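What I have in mind for the local parsing is something like the sketch below: stream each gzipped flat file and read only the LOCUS and VERSION lines, so the sequence itself is never parsed. It assumes the usual whitespace-separated LOCUS layout (name, length, `bp`, molecule type, topology, division, date); the function name `iter_molecule_types` is just illustrative.

```python
import gzip


def iter_molecule_types(path):
    """Yield (accession.version, molecule type) for each record in a
    gzipped GenBank flat file, reading header lines only."""
    molecule = None
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith("LOCUS"):
                fields = line.split()
                # Assumed layout: LOCUS name length bp molecule topology ...
                if len(fields) > 4 and fields[3] == "bp":
                    molecule = fields[4]
                else:
                    molecule = None  # unexpected layout, skip this record
            elif line.startswith("VERSION") and molecule is not None:
                parts = line.split()
                if len(parts) > 1:
                    yield parts[1], molecule
                molecule = None  # at most one result per record
```

Records whose LOCUS line does not match the assumed layout are silently skipped here; in practice I would probably want to log them.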