Forum: Airport.bio: First class ticket to any biological database in the world
Bohdan Khomtchouk 320 (Stanford University), 5 months ago, wrote:

Dear Biostars Community:

I'm looking to gather feedback and suggestions on a recent project: http://airport.bio

Github repository: https://github.com/airportbio/airport-web

Specifically, I would appreciate any advice about how to relate metadata with FTP. In other words, is there a way to integrate metadata into the current FTP search engine?

Any other thoughts on other topics are welcome.

snp rna-seq chip-seq forum • 544 views
modified 5 months ago by Alex Reynolds 25k • written 5 months ago by Bohdan Khomtchouk 320

It looks like you're trying to build a registry or catalog of biological databases. This has been done several times before with mostly limited success.

written 5 months ago by Jean-Karim Heriche 16k

Airport.bio works well but is still in its early stages.

written 5 months ago by Bohdan Khomtchouk 320

It's not that the previous attempts were not working. It's mostly that they didn't get many users, and many eventually died because they were not maintained: biological databases are continuously being produced, change URLs as labs move, and tend to have a short life on the web, so any catalog has to be continually updated or it loses relevance. If you want your resource to be useful, you have to offer a service that's better than a baseline such as Google, both in terms of usability and information provided.

written 5 months ago by Jean-Karim Heriche 16k

This is why we added a "Suggest new server" button. You can always add new databases as they come out (e.g., new publications) or change the URL of existing databases as needed (if they move). Google's crawlers don't crawl this deep, and Google doesn't index FTP like HTTP.

modified 5 months ago • written 5 months ago by Bohdan Khomtchouk 320

"You can always add new databases as they come out (e.g., new publications) or change the URL of existing databases as needed (if they move)"

This is precisely what nobody did in previous attempts.

written 5 months ago by Jean-Karim Heriche 16k

FTP as in file transfer protocol?

written 5 months ago by Alex Reynolds 25k

Yes, that's correct.

written 5 months ago by Bohdan Khomtchouk 320

Serious question. What is the benefit of this tool? You would keep checking and make sure those FTP links are not stale? Would there be a free form search interface that would suggest multiple sites?

written 5 months ago by genomax 55k

What do you mean by "Would there be a free form search interface that would suggest multiple sites?"

written 5 months ago by Bohdan Khomtchouk 320

Someone would type "rat genes" and you would show them available databases. Or am I missing the scope of the tool?

written 5 months ago by genomax 55k

My goal is ultimately to let users type any query into the search bar and have it search the metadata of all the files in all the selected biological databases. However, I'm seeking advice on how to integrate metadata into the current FTP search engine. FTP is a great protocol to simultaneously connect to multiple databases at once and search them in parallel. However, FTP exposes only the full paths of the files; it does not include any metadata describing them. Certain directories include a README file describing their contents, but that's pretty much it. This is nothing like the level of metadata description you would expect from tools like metaSRA or GEO.

modified 5 months ago • written 5 months ago by Bohdan Khomtchouk 320

In what sense is FTP a

"great protocol to simultaneously connect to multiple databases at once and search them in parallel"?

What do you mean by "searching via FTP"? AFAIK FTP has no support for searching.

written 5 months ago by Istvan Albert ♦♦ 77k

It searches through a database of already traversed paths and connects to the respective path via FTP.
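
Read literally, that is a two-step process: match the query against a stored list of file paths, then open each matching path over FTP. A minimal sketch of the first step follows, assuming a hypothetical in-memory index. The host names, paths, and the `search_paths` function are illustrative and are not taken from airport.bio.

```python
from urllib.parse import urlunsplit

# Hypothetical index built by an earlier crawl: (host, path) pairs.
TRAVERSED_PATHS = [
    ("ftp.example.org", "/genomes/refseq/bacteria/Photorhabdus_luminescens/README.txt"),
    ("ftp.example.org", "/genomes/refseq/vertebrate_mammalian/Rattus_norvegicus/genes.gff.gz"),
    ("ftp.example.net", "/pub/cancer/expression/samples.tsv"),
]


def search_paths(query, index=TRAVERSED_PATHS):
    """Return FTP URLs whose path contains every query term (case-insensitive)."""
    terms = query.lower().split()
    hits = []
    for host, path in index:
        lowered = path.lower()
        # A path matches only if each term literally occurs somewhere in it.
        if all(term in lowered for term in terms):
            hits.append(urlunsplit(("ftp", host, path, "", "")))
    return hits
```

In this sketch a query only succeeds when its terms literally occur in a stored path, so a species or gene name that is absent from the directory layout returns nothing. That is the limitation the rest of the thread describes.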

written 5 months ago by Bohdan Khomtchouk 320

Your descriptions in general, and the presentation of the service itself, suffer greatly from the wrong kind of specificity. There is little information about your service from the point of view of what an end user needs.

This has nothing to do with it being an "early" attempt or not.

There is a lack of clarity in explaining, in simple terms, what the service does. I tried it a few times, read through this thread, and still do not understand the very basic idea behind it. What does this do?

When you say:

"It searches through a database of already traversed paths and connects to the respective path via FTP."

This explanation is a good example of a recurring pattern of writing answers that have little actionable information:

  1. "Searches through a database" - What is a database? A single file? Several files connected via common columns? All files contained in RefSeq? An actual database dump?
  2. "already traversed paths" - what does that mean? As an end user why would I need to know or care about whether a path was already traversed or not? Why does it matter to me?
  3. "Connects to the respective path" - respective to what? What gets connected to what?

modified 5 months ago • written 5 months ago by Istvan Albert ♦♦ 77k

Istvan Albert ♦♦ 77k (University Park, USA), 5 months ago, wrote:

The user interface is severely flawed:

  1. It is not clear what a "keyword" means. Is "cancer" a valid keyword? Apparently not. Is a gene name a keyword? Apparently not. I am confused: what is a valid keyword? What does this site do? How would users know what to search for?

  2. I perform a search and it gives me no results. I press the browser's back button and it takes me back to Biostar (the referring site) rather than back to the search interface. This is confusing and counterintuitive.

modified 5 months ago • written 5 months ago by Istvan Albert ♦♦ 77k

"Cancer" is a valid keyword when all databases are selected, as are certain gene names, depending on whether they show up in the respective databases' file paths. However, the output isn't very useful because it returns exact matches (or similar words) only for keywords that appear in the file paths, rather than metadata that describes each file well enough to understand what you're looking at. Tentatively, we included a separate link to search for metadata and any README files (if they exist), but this is not enough. So, from an engineering perspective, airport.bio achieves its task, but it needs additional integration with other metadata sources to make the output biologically useful.

modified 5 months ago • written 5 months ago by Bohdan Khomtchouk 320

I just tried a similar thing, searching for the name of a bacterium I know to be in RefSeq. I selected every database but still got no results, even from RefSeq.

written 5 months ago by jrj.healey 6.8k

Give Bohdan Khomtchouk some slack:

"Airport.bio works well but is still in its early stages."

Bohdan Khomtchouk: It may be best to remove the tool tag on this post, since that signifies the tool is ready for action, which it admittedly is not.

modified 5 months ago • written 5 months ago by genomax 55k

Correct, I'm just gathering feedback and asking for advice on how to solve my "metadata problem" (or rather "lack of metadata").

written 5 months ago by Bohdan Khomtchouk 320

I would suggest you explain the problem in a bit more detail.

written 5 months ago by WouterDeCoster 32k

@jrj.healey: What was your search query? I'm almost certain that your search string does not appear anywhere in the FTP site of RefSeq (or of the other databases). This is the problem that I'm facing. I need to somehow bring metadata into airport.bio because FTP is not enough. I'm looking for advice on how I could potentially do this.

modified 5 months ago • written 5 months ago by Bohdan Khomtchouk 320

The search string was literally the name of a bacterial genus, in this case "Photorhabdus".

written 5 months ago by jrj.healey 6.8k

bradford.condon 30, 5 months ago, wrote:

Your tool would benefit from more documentation on what it does and why it does it. Example searches, a mission statement about the scope of the databases included, that sort of thing.

written 5 months ago by bradford.condon 30

Alex Reynolds 25k (Seattle, WA, USA), 5 months ago, wrote:

You might consider how Google approached building a search engine using the HTTP protocol.

HTTP has almost no notion of metadata, outside of perhaps specifications for MIME types and stream size or last-updated tags. Even then, MIME types are specified entirely at the discretion of the server or developer, and the rest of the metadata is mostly useless for scientific work. Assuming that what you're retrieving will have the wrong MIME type is safe. You might be pleasantly surprised if it does match and you can interpret what you get back.
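
One way to act on that assumption is to treat the declared type as a hint only and cross-check it against the file name. A minimal sketch using Python's standard `mimetypes` module follows. The function name `mime_mismatch` and the example file names are illustrative assumptions, not part of any existing service.

```python
import mimetypes


def mime_mismatch(url_path, declared_content_type):
    """Compare a server-declared Content-Type with a guess from the file extension.

    Returns (declared, guessed, agrees). `guessed` is None when the
    extension is not known to the `mimetypes` module.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    declared = declared_content_type.split(";")[0].strip().lower()
    guessed, _encoding = mimetypes.guess_type(url_path)
    return declared, guessed, declared == guessed


# Example: a README served with a generic binary type.
print(mime_mismatch("README.txt", "application/octet-stream"))
```

A disagreement does not say which side is right. It only flags responses whose content should be inspected before being indexed.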

Maybe you'll be lucky and some developer adds domain-specific custom x-* headers to HTTP responses. I will bet that practically no one does this for public services, because there is virtually no agreed-upon standard for custom HTTP headers where biological databases are shared via REST or other HTTP-based services.

Shrug. The "shruggle" is real.

The way web search engines work is by downloading, processing, and indexing. There's almost nothing inherent in HTTP to help search engines with the kind of searches 99.9999% of all users make.
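
The "processing and indexing" step can be sketched as a small inverted index. This is an illustration only: the resource URLs and description strings are made up, and the text is assumed to have been downloaded already (for example from README files).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of resource IDs whose text contains it."""
    index = defaultdict(set)
    for resource_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(resource_id)
    return index


def search(index, query):
    """Return the sorted resource IDs that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in tokens)))


# Hypothetical downloaded text, keyed by the resource it describes.
DOCS = {
    "ftp://ftp.example.org/bacteria/assembly_001/": "Photorhabdus luminescens genome assembly, annotated proteins",
    "ftp://ftp.example.org/mammals/rat/": "Rattus norvegicus (rat) genes and transcripts",
    "ftp://ftp.example.net/expr/": "Cancer cell line expression, rat and mouse samples",
}
INDEX = build_index(DOCS)
```

Here `search(INDEX, "photorhabdus")` finds the first resource even though the word does not appear in its path, because the match is made against the downloaded text rather than the URL.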

In your case, perhaps, you might combine HTTP GETs with FTP gets, using the content offered by the databases and the content of cited publications (including the citations themselves, and where they lead) to tag or index resources for searching, and to draw a weighted graph of interconnected or related resources for ranking search results.
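
For the ranking part, one well-known option is a PageRank-style score over the citation graph. The post does not name a specific algorithm, so treat this as one possible choice. A minimal sketch, in which the node names, edge weights, and the `rank_resources` function are illustrative assumptions:

```python
def rank_resources(links, damping=0.85, iterations=50):
    """Score nodes with a basic PageRank over a weighted, directed graph.

    `links` maps a source node to {target: weight}, e.g. a paper that
    cites a database twice has weight 2.0 on that edge.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # Node with no outgoing links: spread its score evenly.
                share = damping * scores[source] / n
                for node in nodes:
                    new_scores[node] += share
            else:
                for target, weight in targets.items():
                    new_scores[target] += damping * scores[source] * weight / total
        scores = new_scores
    return scores


# Hypothetical graph: two papers and one database pointing at resources.
LINKS = {
    "paper_A": {"db_refseq": 2.0, "db_geo": 1.0},
    "paper_B": {"db_refseq": 1.0},
    "db_geo": {"db_refseq": 1.0},
}
SCORES = rank_resources(LINKS)
```

Resources that are cited more often, or cited by well-cited resources, end up with higher scores, which can then be used to order the hits returned by a keyword search.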

You're effectively rebuilding a web search engine at this point, but you have domain knowledge (i.e., a background in biological sciences) that most Google engineers (nearly all, in fact) do not, so maybe you can build a better biology-focused search engine informed by what you know would be useful to other biologists.

Build a better mm10 trap!

modified 5 months ago • written 5 months ago by Alex Reynolds 25k