3 months ago by
Seattle, WA USA
You might consider how Google approached building a search engine on top of HTTP.
HTTP has almost no notion of metadata beyond the MIME type (Content-Type), stream size (Content-Length), and last-updated tags (Last-Modified). Even then, MIME types are set entirely at the discretion of the server or developer, and the rest of the metadata is mostly useless for scientific work. It is safe to assume that whatever you retrieve will have the wrong MIME type. You might be pleasantly surprised if it does match and you can interpret what you get back.
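If you go this route, one way to cope is to look at the bytes themselves rather than the declared type. Here is a minimal sketch of that idea. The function name `sniff_format` and the handful of formats it checks are my own assumptions for illustration, and the checks are crude heuristics, not a real format detector:

```python
def sniff_format(declared_mime: str, body: bytes) -> str:
    """Guess a payload's format from its leading bytes.

    The declared MIME type is only used as a fallback when no
    heuristic matches. All heuristics here are illustrative.
    """
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    if body[:2] == b"\x1f\x8b":
        return "gzip"

    # For text formats, ignore leading whitespace in the first chunk.
    head = body[:512].lstrip()
    if head.startswith(b">"):
        return "fasta"  # FASTA records begin with a '>' header line
    if head.startswith((b"{", b"[")):
        return "json"
    if head.startswith(b"<"):
        return "xml-or-html"

    # Nothing matched: fall back to whatever the server claimed.
    return declared_mime or "unknown"


# A server labels a FASTA file as HTML; the content says otherwise.
print(sniff_format("text/html", b">chr1\nACGTACGT\n"))
```

In practice you would feed this the Content-Type header and the first chunk of the response body from whatever HTTP client you use.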
Maybe you'll be lucky and some developer adds custom, domain-specific x-* headers to HTTP responses. I will bet that practically no one does this for public services, because there is virtually no agreed-upon standard for custom HTTP headers where biological databases are shared via REST or other HTTP-trafficked services.
Shrug. The "shruggle" is real.
The way web search engines work is by downloading, processing, and indexing. There's almost nothing inherent in HTTP to help search engines with the kind of searches 99.9999% of all users make.
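To make the "download, process, index" loop concrete, here is a toy sketch of the processing and indexing steps as an inverted index. It assumes the downloading has already happened and you hold a dict of URL to text. The URLs, the function names, and the simple AND-style query are all hypothetical choices of mine, and a real engine does far more than this:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical, already-downloaded documents.
docs = {
    "ftp://example.org/mm10/README": "Mouse genome assembly mm10 annotation",
    "https://example.org/hg38": "Human genome assembly hg38 annotation",
}
index = build_index(docs)
print(search(index, "mouse assembly"))
```

Note that nothing in the index comes from HTTP itself. Everything searchable is derived from the downloaded content.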
In your case, perhaps, you might combine HTTP GETs with FTP gets, using the content offered by databases and the content of cited publications (including the citations themselves, and where they lead) to tag or index resources for searching, and to draw a weighted graph of interconnected or related resources with which to rank search results.
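For the ranking part, one option is a PageRank-style score over that weighted graph. The sketch below is my own simplified take using plain power iteration, not a description of how Google actually does it. The node names, the edge weights, and the `rank` function are made up for illustration:

```python
def rank(edges, damping=0.85, iterations=50):
    """PageRank-style scores for a weighted, directed graph.

    edges maps each source to a dict of {target: weight}, e.g. a
    publication pointing at the resources it cites.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    if not nodes:
        return {}

    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share on each pass.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = edges.get(node, {})
            total = sum(targets.values())
            if total == 0:
                # No outgoing links: spread this node's score evenly.
                for other in nodes:
                    new[other] += damping * score[node] / n
            else:
                # Pass score along in proportion to edge weight.
                for target, weight in targets.items():
                    new[target] += damping * score[node] * weight / total
        score = new
    return score


# Hypothetical citation graph: two papers citing genome resources.
edges = {
    "paperA": {"mm10": 2.0, "hg38": 1.0},
    "paperB": {"mm10": 1.0},
}
scores = rank(edges)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

The weights are where your domain knowledge would come in, for example counting a citation in a methods section more heavily than a passing mention.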
You're effectively rebuilding a web search engine at this point, but you have domain knowledge (i.e., a background in biological sciences) that most Google engineers (nearly all, in fact) do not. Maybe you can build a better biology-focused search engine, informed by what you know would be useful to other biologists.
Build a better mm10 trap!