Hello all!
As many people here are aware, bioinformaticians and biologists alike spend a huge number of man-hours on Biostars asking questions and giving answers. In fact, I would go as far as to say Biostars is one of the biggest bioinformatic collaborations on the planet. As a result, it’s one of the few places with somewhat reliable data on what people are struggling with the most in their research.
As part of my thesis [ before anyone points out I should be writing ;~) ] I’d like to include some rough statistics on what sorts of questions people are asking the most, with a view to seeing these questions better addressed by future bioinformatic software.
To achieve this I’ll read through an entire year’s worth of posts and try to categorise them in a way that lets us do a little reflection. Here are the tags/categories I have thought of so far. Note that a post can have more than one category.
- Unsure which software to use
- How to convert data (into a standard format)
- How to convert data (into a non-standard format)
- Problem installing software
- Problem installing software dependencies
- Problem using software correctly (insufficient documentation)
- Problem using software correctly (insufficient reading of the documentation)
- Help needed for experimental design (in silico)
- Help needed for experimental design (biological)
- Problem with software (bug)
- Problem with software (feature request)
- Problem making sense of software’s result (insufficient documentation)
- Problem making sense of software’s result (insufficient reading of the documentation)
If anyone else would like to add/remove some categories, or could suggest an alternative approach I hadn’t considered, that would be fantastic. But please do so before this coming Saturday (22nd), when I will start the process of reading and categorizing. Once I’m done I’ll post a link to the SQLite database (or Excel spreadsheet) with the post ID as the unique ID, a column for the posted date, and a column for each category, with a value of 1 if it’s true and 0 if it’s false. I think anything beyond a boolean will be a bit subjective so, yeah, true/false categories only please ;-)
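To make the planned layout concrete, here is a minimal sketch of what such a table could look like using Python's built-in sqlite3 module. The table name, the column names (shortened from the category list above) and the helper functions are all my own placeholders, not a final schema, so treat it as an illustration only.

```python
import sqlite3

# Hypothetical column names, abbreviated from the category list in the post.
CATEGORIES = [
    "unsure_which_software",
    "convert_standard_format",
    "convert_nonstandard_format",
    "install_software",
    "install_dependencies",
    "usage_insufficient_docs",
    "usage_docs_not_read",
    "design_in_silico",
    "design_biological",
    "software_bug",
    "feature_request",
    "result_insufficient_docs",
    "result_docs_not_read",
]


def create_db(path=":memory:"):
    """Create the posts table: one row per post, one 0/1 column per category."""
    conn = sqlite3.connect(path)
    category_cols = ", ".join(
        f"{name} INTEGER NOT NULL DEFAULT 0 CHECK ({name} IN (0, 1))"
        for name in CATEGORIES
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "post_id INTEGER PRIMARY KEY, "
        "posted_date TEXT NOT NULL, "
        f"{category_cols})"
    )
    return conn


def record_post(conn, post_id, posted_date, flagged):
    """Store one post; `flagged` is the set of category names that are true."""
    unknown = set(flagged) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    values = [1 if name in flagged else 0 for name in CATEGORIES]
    placeholders = ", ".join("?" for _ in range(len(CATEGORIES) + 2))
    conn.execute(
        f"INSERT OR REPLACE INTO posts (post_id, posted_date, "
        f"{', '.join(CATEGORIES)}) VALUES ({placeholders})",
        [post_id, posted_date, *values],
    )
    conn.commit()


def tally(conn):
    """Return {category: number of posts flagged with it}."""
    return {
        name: conn.execute(f"SELECT SUM({name}) FROM posts").fetchone()[0] or 0
        for name in CATEGORIES
    }
```

Usage would be something like `conn = create_db("categories.db")` followed by `record_post(conn, 101, "2014-11-20", {"software_bug"})`, where the file name, post ID and date are made-up values. Keeping the post ID as the primary key means the per-category totals can always be recomputed, and someone else can join the table against other years later.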
Hopefully not manually. That would be a great way of postponing submission of your thesis for another year. If you used a rolling window you could stay a student forever :)
On a serious note: it may be useful to get an idea of how many posts originally had incomplete information or posed an unclear question. Since people sometimes go back and edit the original posts, this may have to be determined from the chain of comments associated with the post.
Hahah, yes manually :) I don't trust my machine-learning-fu enough to do it any other way :P "Question required clarification" is a fantastic category idea - thanks genomax, I'll add it in.
Umm, no. That's not worth your time. Write up your actual thesis and stop procrastinating.
OK, if I don't get it done by Monday then I'll admit defeat and be back in the library with this face -> :(
Just remember that when it comes to dissertation writing...
Are you serious?
That is 13,500-odd posts this year at the time of writing.
Oh, well I'd aimed to do 10,000 and spend roughly 10 seconds on each, which is just under 28 hours. I think I can get it done from Friday night to Monday morning. You have to remember I have absolutely no life.
http://www.datacommunitydc.org/blog/2014/07/natural-language-processing-python-r
There are also quite a few statistics-related questions. As an aside, I sometimes get the impression that questions are being asked by students who don't get adequate supervision/training at their home institution. I am not sure how this could affect or be reflected in the results you'll get but if it is widespread, I don't think that future bioinformatics software can solve this issue. Or put another way, what I am suggesting is that the problem(s) leading to the questions may not be software-related but have an upstream cause such as poor training/education/supervision.
Nice project. A kind of post I sometimes see (not the most entertaining IMO) :
Btw, how are you going to discriminate between "insufficient documentation" and "insufficient reading"? Seems a bit subjective too.
Great idea - added :)
Are you thinking of using latent semantic indexing to read all posts?
Just hire a whole bunch of undergrads to do this :P.
Please don't encourage John to do that! I have to vote on all hires (John and I are at the same place) and our meetings are already long enough...
I was hoping that after grabbing a year's worth of training data, someone else could come along with the LSI (with Istvan's consent) to fill out the rest of the years. That's why I'm including the post numbers in the database and not just a tally of categories (which I could do a lot faster).
I should say that I'm not going to manually browse (via browser) each and every post on Biostars; that would take too long. I've written a tiny Python app to grab a post (without following image/JS links etc.), dump the text to my terminal, grab a second post and hold it until I'm done reading/categorizing the first, then, once I'm done, grab a third post and show me the second. So it's not a crawler, since it's not automatic, it's just "buffered"; but it's not a browser either, since it doesn't request large assets like images. And it makes creating the database and entering categories really quick, since the date and post ID are filled in automatically. It's literally just a lot of scrolling and pushing the number buttons.
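For the curious, the "buffered" idea can be sketched roughly like this. It is not the actual app, just a minimal illustration in Python: `buffered_posts` is a made-up name, and `fetch` stands in for whatever function downloads the text of a single post (an assumption, so no real URL or site API is shown here).

```python
from concurrent.futures import ThreadPoolExecutor


def buffered_posts(post_ids, fetch):
    """Yield (post_id, text) pairs, staying at most one post ahead.

    `fetch` is a caller-supplied function mapping a post ID to its text.
    While the caller is busy reading the current post, the next one is
    fetched in a background thread and held until it is asked for.
    """
    ids = iter(post_ids)
    first = next(ids, None)
    if first is None:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending_id, pending = first, pool.submit(fetch, first)
        for next_id in ids:
            # Queue the following post before handing over the current one.
            upcoming = pool.submit(fetch, next_id)
            yield pending_id, pending.result()
            pending_id, pending = next_id, upcoming
        # The last post has nothing left to prefetch behind it.
        yield pending_id, pending.result()
```

In use, the reading loop would be something like `for post_id, text in buffered_posts(ids, fetch): print(text)` followed by a prompt for the category numbers. Nothing new is requested until the loop comes back for the next post, so the site only ever sees one request ahead of what is being read.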
Glad to see that you are not that desperate to remain a student :)
Are you only interested in putting technical aspects in your thesis, or does it also involve some biological interpretation?
It's 50:50 technical implementations of stuff and interpreting data.
You should also add a category for "Witty, chatty and 'other' responses unrelated to the subject matter/question in the post". Also one for posts that use "other Biostars posts for providing answers", so we get an idea of how many threads could have been saved if the person posting the thread had done a search beforehand.
I was only going to classify the threads, not the comments on threads - but a "question previously answered" category is a good one.