Forum:Airport.bio: First class ticket to any biological database in the world
6.5 years ago

Dear Biostars Community:

I'm looking to gather feedback and suggestions on a recent project: http://airport.bio

Github repository: https://github.com/airportbio/airport-web

Specifically, I would appreciate any advice about how to relate metadata with FTP. In other words, is there a way to integrate metadata into the current FTP search engine?

Any other thoughts on other topics are welcome.

RNA-Seq ChIP-Seq SNP • 3.0k views
ADD COMMENT
1
Entering edit mode

It looks like you're trying to build a registry or catalog of biological databases. This has been done several times before with mostly limited success.

ADD REPLY
0
Entering edit mode

Airport.bio works well but is still in its early stages.

ADD REPLY
0
Entering edit mode

It's not that the previous attempts didn't work. It's mostly that they didn't get many users, and many eventually died because they were not maintained: biological databases are continuously being produced, change URLs as labs move, and tend to have a short life on the web, so any catalog has to be continually updated or lose relevance. If you want your resource to be useful, you have to offer a service that's better than, e.g., Google as a baseline, both in terms of usability and information provided.

ADD REPLY
0
Entering edit mode

This is why we added a "Suggest new server" button. You can always add new databases as they come out (e.g., new publications) or change the URL of existing databases as needed (if they move). Google's crawlers don't crawl this deep, and Google doesn't index FTP like HTTP.

ADD REPLY
2
Entering edit mode

You can always add new databases as they come out (e.g., new publications) or change the URL of existing databases as needed (if they move)

This is precisely what nobody did in previous attempts.

ADD REPLY
0
Entering edit mode

FTP as in file transfer protocol?

ADD REPLY
0
Entering edit mode

Yes, that's correct.

ADD REPLY
0
Entering edit mode

Serious question. What is the benefit of this tool? You would keep checking and make sure those FTP links are not stale? Would there be a free form search interface that would suggest multiple sites?

ADD REPLY
0
Entering edit mode

What do you mean by "Would there be a free form search interface that would suggest multiple sites?"

ADD REPLY
0
Entering edit mode

Someone would type "rat genes" and you would show them available databases. Or am I missing the scope of the tool?

ADD REPLY
0
Entering edit mode

My goal is to ultimately allow users to type any query into the search bar and have it search the metadata of all the files in all the selected biological databases. However, I'm seeking advice on how to integrate metadata into the current FTP search engine. FTP is a great protocol to simultaneously connect to multiple databases at once and search them in parallel. However, FTP does not provide any metadata describing the files; all you get is their full paths. Certain directories include a README file describing their contents, but that's pretty much it. Nothing like the level of metadata description you would expect from tools like metaSRA or GEO.

ADD REPLY
0
Entering edit mode

In what sense is FTP:

great protocol to simultaneously connect to multiple databases at once and search them in parallel

What do you mean by "searching via FTP"? AFAIK, FTP has no support for searching.

ADD REPLY
0
Entering edit mode

It searches through a database of already traversed paths and connects to the respective path via FTP.
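
As a rough illustration of that description (a hypothetical sketch, not airport.bio's actual code): paths recorded during an earlier crawl are kept in a list, a keyword is matched against them, and each hit is turned into an FTP URL. The host names, paths, and the function name `search_paths` are all made up for the example.

```python
from urllib.parse import quote

# Hypothetical index: (host, path) pairs recorded during an earlier crawl.
TRAVERSED = [
    ("ftp.example.org", "/genomes/rat/genes.gff"),
    ("ftp.example.org", "/genomes/mouse/genes.gff"),
    ("ftp.example.net", "/pub/cancer/README"),
]

def search_paths(keyword, index=TRAVERSED):
    """Return FTP URLs for every stored path containing the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        f"ftp://{host}{quote(path)}"  # quote() escapes unsafe characters, keeps "/"
        for host, path in index
        if needle in path.lower()
    ]
```

A keyword that never appears literally in a stored path returns an empty list, since nothing but the path text is searched.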

ADD REPLY
1
Entering edit mode

Your descriptions in general, and the presentation of the service itself, suffer greatly from the wrong kind of specificity. There is little information about your service from the point of view of what an end user needs.

This has nothing to do with it being an "early" attempt or not.

There is a lack of clarity in explaining, in simple terms, what the service does. I tried it a few times, read through this thread, and still do not understand the very basic idea behind it. What does this do?

When you say:

It searches through a database of already traversed paths and connects to the respective path via FTP.

This explanation is a good example of a recurring pattern of writing answers that have little actionable information:

  1. "Searches through a database" - What is a database? A single file? Several files connected via common columns? All files contained in RefSeq? An actual database dump?
  2. "already traversed paths" - what does that mean? As an end user why would I need to know or care about whether a path was already traversed or not? Why does it matter to me?
  3. "Connects to the respective path" - respective to what? What gets connected to what?
ADD REPLY
3
Entering edit mode
6.5 years ago

The user interface is severely flawed:

  1. It is not clear what a "keyword" means. Is "cancer" a valid keyword ... apparently not. Is a gene name a keyword? Apparently not. I am confused ... what is a valid keyword? What does this site do? How would users know what to search for?

  2. I perform a search and it gives me no results. I press the browser's back button and it takes me back to Biostar (the referring site) rather than back to the search interface. Confusing and counterintuitive.

ADD COMMENT
0
Entering edit mode

"Cancer" is a valid keyword when all databases are selected, as are certain gene names, depending on whether they show up in the respective databases' file paths. However, the output returned isn't very useful because it returns exact matches (or similar words) for just the keywords that appear in the search paths, instead of returning something like metadata giving you a good description and understanding of what you're looking at on a file-by-file basis. Tentatively, we included a separate link to search for metadata and any README files (if they exist), but this is not enough. So, from an engineering perspective, airport.bio achieves its task, but it needs additional integration with other metadata sources that would make the output biologically useful.
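
One possible way to fold README files into the search, sketched under assumptions (this is not how airport.bio currently works): treat the text of a directory's README as a description of every file in that directory, and match keywords against both the path and that description. The function names and the input shapes below are hypothetical.

```python
import posixpath

def attach_readmes(paths, readme_text):
    """Map each file path to the README text found in the same directory, if any.

    paths: iterable of file paths from a crawl (hypothetical data).
    readme_text: dict of README path -> its already-downloaded text.
    """
    by_dir = {posixpath.dirname(p): text for p, text in readme_text.items()}
    return {p: by_dir.get(posixpath.dirname(p), "") for p in paths}

def search(keyword, paths, readme_text):
    """Match the keyword against the path or the README text of its directory."""
    needle = keyword.lower()
    described = attach_readmes(paths, readme_text)
    return [
        p for p, desc in described.items()
        if needle in p.lower() or needle in desc.lower()
    ]
```

With this, a file whose path contains only an accession number can still be found by an organism name, provided the README in its directory mentions that name.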

ADD REPLY
0
Entering edit mode

I just tried a similar thing, searching for the name of a bacterium I know to be in RefSeq. I selected every database, but still got no results returned, even from RefSeq.

ADD REPLY
2
Entering edit mode

Give some slack to Bohdan Khomtchouk:

Airport.bio works well but is still in its early stages.

Bohdan Khomtchouk: It may be best to remove the tool tag on this post, since that signifies it is ready for action. Which it admittedly is not.

ADD REPLY
0
Entering edit mode

Correct, I'm just gathering feedback and asking for advice on how to solve my "metadata problem" (or rather "lack of metadata").

ADD REPLY
0
Entering edit mode

I would suggest you explain the problem in a bit more detail.

ADD REPLY
0
Entering edit mode

@jrj.healey: What was your search query? I'm almost certain that your search string does not appear anywhere in the FTP site of RefSeq (or of the other databases). This is the problem that I'm facing. I need to somehow bring metadata into airport.bio because FTP is not enough. I'm looking for advice on how I could potentially do this.

ADD REPLY
0
Entering edit mode

The search string was literally the name of a bacterial genus, in this case "Photorhabdus".

ADD REPLY
3
Entering edit mode
6.5 years ago

Your tool would benefit from more documentation on what it does and why it does it. Example searches, a mission statement about the scope of the databases included, that sort of thing.

ADD COMMENT
1
Entering edit mode
6.5 years ago

You might consider how Google approached building a search engine using the HTTP protocol.

HTTP has almost no notion of metadata, outside of perhaps specifications for MIME types and stream size or last-updated tags. Even then, MIME types are specified entirely at the discretion of the server or developer, and the rest of the metadata is mostly useless for scientific work. Assuming that what you're retrieving will have the wrong MIME type is safe. You might be pleasantly surprised if it does match and you can interpret what you get back.
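
To make the point about unreliable MIME types concrete, here is a small sketch that compares the `Content-Type` a server declares with a guess based on the file extension. The function name `check_declared_type` is hypothetical, and the sketch assumes the response headers are available as a plain dict keyed by `Content-Type`.

```python
import mimetypes

def check_declared_type(url, headers):
    """Compare the server's Content-Type with a guess based on the file extension.

    headers: dict of HTTP response headers (assumed shape for this sketch).
    Returns (declared, guessed, agree); agree is None when no guess is possible.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    declared = headers.get("Content-Type", "").split(";")[0].strip().lower()
    guessed, _encoding = mimetypes.guess_type(url)
    agree = None if guessed is None else (declared == guessed)
    return declared, guessed, agree
```

A crawler could log the disagreements and fall back to sniffing the content itself rather than trusting either value.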

Maybe you'll be lucky and some developer adds custom, domain-specific x-* headers to HTTP responses. I will bet that practically no one does this for public services, because there is virtually no agreed-upon standard for custom HTTP headers for biological databases shared via REST or other HTTP-based services.

Shrug. The "shruggle" is real.

The way web search engines work is by downloading, processing, and indexing. There's almost nothing inherent in HTTP to help search engines with the kind of searches 99.9999% of all users make.
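
The "processing and indexing" step can be sketched as a minimal inverted index: each token maps to the set of documents that contain it, and a query returns the documents containing all of its tokens. This is a toy illustration with hypothetical names, assuming the text has already been downloaded; real search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """documents: dict of identifier -> text. Returns token -> set of identifiers."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def lookup(index, query):
    """Return the sorted identifiers of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)
```

The documents fed to `build_index` could be README files, database landing pages, or abstracts, each keyed by the resource it describes.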

In your case, perhaps, you might combine HTTP GETs with FTP gets, making use of the content offered by databases and the content in cited publications (including the citations themselves, and where they go) as a way to tag or index resources for searching, as well as drawing a weighted graph of interconnected or related resources in order to rank search results.

You're effectively rebuilding a web search engine at this point, but you have domain knowledge (i.e., a background in biological sciences) that most Google engineers, nearly all in fact, do not. Maybe you can build a better biology-focused search engine informed by your knowledge and by what you know would be useful to other biologists.

Build a better mm10 trap!

ADD COMMENT
