Joining multiple query keys in entrez Efetch
2
0
Entering edit mode
12 months ago

Hi All,

I'm trying to pull the sequences for a list of 3-400,000 accessions from the NCBI protein database via the biopython entrez api. For my specific use case, I'm processing them in 20,000 groups of around 1000 sequences (for context, each group is an smcog, but that doesn't really matter here), but for the purposes of this question I'm keen to just get as many as possible in one go. For example, if I can find a way to extract 100,000 sequences, I could get the sequences in one go and split them into groups later.

I've heard people talking about extracting e.g. 100,000 sequences with EFetch (https://www.biostars.org/p/280872/), but I'm struggling to get more than 200 sequences at a time. I've tried various methods, including making a http POST as suggested by the docs, but they're all hitting similar limits beyond around 200 accessions. I think this is because there seems to be a limit on the length of URLs that can be taken by browsers (https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers), which presumably impacts the length of the URLs that can be accepted by http POST calls. But if this is true, how can someone manage to get 100,000 sequences?

My latest attempt, below, is iteratively calling EPost to make a list of query keys for a single web env, with each key corresponding to 200 accessions (the limit I've been running up against). So for a group of 1000 accessions, the final output will be a web env and a list of query keys e.g. ['1','2','3','4','5']. My question is, can I join these query keys into a single value for the query_key parameter in EFetch? This would be more efficient than going through each key in a for loop I think, and the url size being generated should be fairly low as it is a list of keys rather than 1000 accessions. I've looked here https://dataguide.nlm.nih.gov/eutilities/history.html, and have tried building a query key string term with %23 or # with the AND key word, but neither work (HTTP Error 500: Internal Server Error), although the link implies what I want to do is possible.

Also, please let me know if this all sounds hideously inefficient, most of this is coming from the docs so there is probably a much better way of doing this. Due to the nature of the accessions (they're from 2012 and NCBI has since done a massive re-org with non-redundant ref seqs, so the local databases may not have these accessions) and my crappy laptop, I'm trying to avoid downloading the database locally, although this might be better. I'd also really like to get this api stuff working if I can, as a learning experience, rather than give up.

This is all part of a bigger script, but the issues described above are in this bit:

#make a web environment and add query keys for each accession_list,
#which together make accession_list_whole

webenv_assigned=False
webenv=''
query_keys=[]
server_limit=200

#accession_list_whole is a big list of accessions e.g. ['accession 1', 'accession 2',...] for >1000 accessions.
for index in range(0,len(accession_list_whole), server_limit):

#make useable accession list
accession_list=accession_list_whole[index:index+server_limit]

#feed into epost.
teststr=','.join(accession_list)
if webenv_assigned:
#build up a list of query keys for a single web env
epost=Entrez.epost(db="protein", id=','.join(accession_list), WebEnv=webenv)
else:
#for first loop, assign a web env
epost=Entrez.epost(db="protein", id=','.join(accession_list))
webenv = search_results["WebEnv"]
query_keys.append(search_results["QueryKey"])
webenv_assigned=True

'''I now have a web env (MCID_5fa80e1470e80f0dc50d04fb) and a list of query keys [1,2,3,4,5].  The next bit of code feeds the web env and query keys to efetch.  I could do this in a for loop and iteratively call efetch for each query key, appending the resultant seqRecords to object_new, but if I can do this all at once i think it will be more efficient.'''

for attempt in range(5):
#for loop = this is just in case there's a weird server error, probably unnecessary
try:
object_new=[]
################################
#What is the correct query_key  parameter format here?
#I have tied %23<key> and #<key> with AND, neither work.
################################
join='#'
bool='+AND+'
queryKey=join+query_keys[0]+bool
for key in query_keys[1:-1]:
queryKey+=join+key+bool
queryKey+=join+query_keys[-1]
################################
#--> 1+AND+#2+AND+#3+AND+#4+AND+#5+AND+#6+AND+#7+AND+#8+AND+#9
################################

fetch=Entrez.efetch(
db="protein",
rettype='fasta',#rettype="gb",
retmax=len(accession_list_whole),
webenv=webenv,
#the annoying param:
query_key=queryKey
)
object_new=list(SeqIO.parse(fetch, "fasta"))
fetched=True
break
except Exception as e:
print(e)
#HTTP Error 500: Internal Server Error
print('str used to generate web env and query key:  ',teststr)
print('query_key used to generate web env and query key:  ',queryKey)
fetched=False
continue

#I then write the id and seq for each seqRecord in object_new to a txt file in fasta format

entrez Efetch biopython NCBI • 661 views
2
Entering edit mode
12 months ago
GenoMax 109k

please let me know if this all sounds hideously inefficient

I would have to say yes. When you need that many accessions extracted you should consider using blastdbcmd utility from blast+ and relevant pre-made nr blast indexes from NCBI. It would be a whole lot simpler, though it does require you to download the indexes.

If you wish to pursue the Entrez API method then have you signed up for NCBI_API_KEY? Use Entrezdirect (the command line version of Entrez utils) to avoid this whole python/URL overhead. Keep a reasonable amount of gap between consecutive queries. Can't easily tell if you are "querying" the data or simply fetching it. If you have accessions then efetching them would be efficient than doing queries.

0
Entering edit mode

Hi genomax, thanks for the reply.

My problem with the blastdbcmd option is that a lot of the accessions I'm using are not easily available now. They're taken from a paper that used data extracted in 2012, and shortly after this NCBI removed a load of redundant sequences and replaced them with non-redundant ref seqs instead, which have different accessions beginning with WP_. The api seems to be finding them OK despite this, but I'm guessing it's doing things behind the scenes which may not happen with whatever I'm trying to do locally. So my worry is that these old accessions, or links to them, wont be in whatever database I download. Does that make sense/is this correct?

As an example, if I search for YP_005532033.1 in the NCBI web page, it doesn't come up in any of the databases as the record is removed, but it still shows a link to the details of the removed record: https://www.ncbi.nlm.nih.gov/search/all/?term=YP_005532033.1. The web page is linking to what I'm guessing is an archived record in the protein database, and I'm not sure if this archived record would be present in whatever databases I downloaded locally.

Also, I've had a look at the NCBI FTP site before but I can't see an option for the 'protein' database, would this just be nr?

Yes, I've got an NCBI API key, it's further up the script this is all part of but isn't shown here. Entezdirect looks good, but unfortunately I've not got a unix OS - this might sound thick but would it work on the linux subsystem for windows, linux and unix are meant to be pretty similar?

Thanks again! Tim

1
Entering edit mode

I guess that is a valid issue and would not be solved by using nr db.

I am going to show an example of the accession you posted above using Entrezdirect on command line (truncated to show just 2 lines) :

\$ efetch -db protein -id "YP_005532033" -format fasta | grep -A 2 ">"
>YP_005532033.1 hypothetical protein RAM_20480 [Amycolatopsis mediterranei S699]
MGIIIALLVVWAVLVVLGIVVKSLFWLIIVGAVLFVATGVIGFVKREALGRRH


massa.kassa.sc3na has some good suggestions that you should consider as well.

I can't see an option for the 'protein' database, would this just be nr?

That is the superset of all protein accessions.

0
Entering edit mode

Ah yes that does look like it would be ideal. I'll have a play and see if I can get it working on windows (seems possible - http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/NCBI/edirect/chapter6.pdf). Thanks a lot for your help with this!

2
Entering edit mode
12 months ago

Hi, I've had a glimpse at your code and it does seems overly complicated. I'm regularly obtaining 500 - 1000 records per query. The queries can fail, but it is not very common (at least for me).

You should know, that you should pass the accessions directly to the Entrez.efetch (without the "AND" stuff). Also you don't need to call the epost by yourself. The Entrez.efetch do that by itself when the number of accessions is > 200 (see this https://github.com/biopython/biopython/blob/0788383d63ca7210d6242ae126d76f6166fe6a0f/Bio/Entrez/__init__.py#L203)

From my experience, the number of accessions obtainable depends on the record size, since you talk about proteins, It should be OK, to retrieve 1k at a time, (be prepared to handle fails).

Hope that helps somewhat.

0
Entering edit mode

That has actually helped massively! Maybe it's because I know what its doing better now and had some mistake before, but when I tried using efetch originally I swear this approach wasn't working, which sent me down the rabbit hole. Now >3000 is easy, no fails (I already had a catch for that)! Thanks a lot.

It's still quite slow but its now something I can just leave running over a weekend from the looks it.

Both you and genomax have been great, but I don't think I can accept either answer as they're both comments?