The following is a brief review and evaluation of the HarvardX: PH525x Data Analysis for Genomics MOOC distributed via the EdX network: https://www.edx.org/. The course ran between April 7 and June 13, 2014.
The instructors for the course are:
- Rafa Irizarry, PhD, Professor of Biostatistics, Department of Biostatistics, Harvard School of Public Health
- Michael Love, PhD, Postdoctoral Fellow, Department of Biostatistics, Harvard School of Public Health
Lectures and Labs
The course focuses on what to do with genomic data that has been brought to some sort of "tabular form". What this means is that the topics covered in the lectures assume that other bioinformatics tools have already been run and that their results are available in standard formats: for example, SAM files for alignments, BED files for peak callers, or tabular gene expression data.
The course took place over 8 weeks and the lectures were released on a weekly basis. One week's worth of material was divided into 2 to 3 topics, where each topic was covered by 3 to 6 videos with an average length of about 5 minutes. The longest video seems to be around 15 minutes; the shortest is less than 2 minutes.
There are purely statistical, theory-oriented lectures, there are lectures describing genomic technologies, and there are lectures discussing Bioconductor packages. Each topic may also include one or more labs that demonstrate the use of the various libraries. A very large number of subjects is covered in quite a bit of detail. I liked that a lot (but with some reservations, see later).
There is an open source book for the lectures:
And there is a github repository for the labs
The lectures do improve with time, but especially at the beginning one gets the sense of being presented with filler material. This is especially true for the microarray-related lectures, whose concepts are not revisited later. I couldn't help but feel that the course was also in the wrong order, with too much theoretical discussion early on. The sections of most interest to the people who sign up for a course like this are presented too late, in the last two weeks, by which time most of the student churn has taken place. Furthermore, there is a disconnect between the theoretical discussions and the practical applications: it is not clear how many of the concepts presented in lectures 1-6 apply to the problem solving in the final part of the course.
Still, criticism aside, there is overall great value and reference material to be had there.
I consider the homework to be the least developed element of the course. I believe that when it comes to data analysis, learning takes place only while doing the task itself, so for me the homework is the most important component.
About half of the homework problems can be solved by simple trial and error (sometimes the alternative code snippets are not even valid R code, so those options are really easy to rule out). The remainder of the problems seem to fall into two categories. Some felt like mind games where I was trying to figure out what was actually meant by words that were never defined.
The others were "deep water" exercises where seemingly overcomplicated R code was tossed the student's way to serve as a starting point. This can be very intimidating, yet most of the time the actual solution was trivially obtainable from these starting steps. Most homework assignments could be solved in 15 minutes or less. They lacked the organization and design that would allow the student to work methodically through a problem, and they did not support "learning by doing". Everything was over very quickly and each question felt either too easy or too complicated.
This is a bit worrisome, especially since there is a commercial component to this. Those paying $250 or so can earn a Verified Certificate of Achievement, but in my estimation achieving the 75% completion should be extremely easy and would not demonstrate particular skills. (BTW I believe that in HW 5 either the answer key or the image of the dendrogram is incorrect, so if you can't pass because of losing points there make sure to lodge a complaint ;-) )
Undoubtedly it is fun and feels greatly empowering to be able to learn at your own pace and revisit topics as necessary. I had forgotten what it is like to be a student. Moreover, watching lectures grew on me; one just needs to make it a habit to start appreciating the experience.
This was my first course in the so-called Massive Open Online Course setup. But frankly I expected a lot more MOOC-ness, and a sense of being part of a group. Strangely, the EdX interface has very few community aspects to it. All along I felt alone, and while I know that there is a link that I could visit to go to a forum section, the main site always felt empty and desolate. There is a lot of missed potential there.
Suggestion to improve the course
Too much time is spent on data wrangling with regular R constructs and too little time is spent on explaining and interacting with the biological data representations within Bioconductor. There are a large number of excellent resources for learning R and "data wrangling" in general, and far fewer for understanding how Bioconductor works and the typical usage patterns that actually make sense.
Even when the code deals with Bioconductor, the examples and plotting are needlessly complicated by overly long R constructs that slice, dice, reformat and match, all while plotting and performing the stats. Some of these issues are caused by R itself; the language is powerful but at the same time obtuse and wildly inconsistent. It takes a lot of effort to write simple code. In the end all of that adds up to a substantial cognitive overhead.
In my opinion a few weeks' worth of lectures should be dedicated to each of the concepts of Genomic Ranges, Genomic Alignments and Gene Ontologies, and to the methods for processing and combining the information in some of these very arcane object representations of Bioconductor. The authors of the course keep suggesting that the students study the vignettes (help files for the tools), but that is not really a feasible idea. The quality and content of vignettes vary greatly, and they are rarely written with a well-defined educational purpose. I found most vignettes overwhelming and difficult to comprehend; they may be a good reference to remind us what the tools do, but not a way to learn them. Experts should tell us what is worth knowing about each library.
It was fun and free, so it is hard to argue with that. And I'd like to thank the instructors for taking on this challenge. It is hard to be first in any field, especially one as complex as genomic data analysis.
So who is this course for? Will someone new to the field learn the 101 of genomic data analysis? I doubt it. This would be a very difficult course for a newcomer, and the difficulty varies greatly. I would argue that this course is best for someone who already has experience in the field. I got a lot out of it in very little time. I think I am now a better scientist because of this course, so let me again thank the instructors for their effort and work.