The following is a brief review and evaluation of the **HarvardX: PH525x Data Analysis for Genomics** MOOC course distributed via the EdX network (https://www.edx.org/). The course ran between April 7 and June 13, 2014.

The instructors for the course are

- Rafa Irizarry, PhD, Professor of Biostatistics, Department of Biostatistics, Harvard School of Public Health
- Michael Love, PhD, Postdoctoral Fellow, Department of Biostatistics, Harvard School of Public Health

## Lectures and Labs

The course focuses on what to do with genomic data that has been brought to some sort of "tabular form". What this means is that the topics covered in the lectures assume that other bioinformatics tools have already been run and that their results are available in standard formats: for example SAM files for alignments, BED files from peak callers, or tabular gene expression data.
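As a small illustration of what "tabular form" means in practice, here is a minimal sketch in base R that reads a BED-like table. The three peaks and the column names are made up for the example and are not taken from the course material.

```r
# Three made-up peaks in BED layout: chromosome, start, end, name.
bed_text <- "chr1\t100\t250\tpeak1
chr1\t400\t460\tpeak2
chr2\t10\t90\tpeak3"

# The column names are an assumption; BED files themselves carry no header.
peaks <- read.table(text = bed_text, sep = "\t",
                    col.names = c("chrom", "start", "end", "name"),
                    stringsAsFactors = FALSE)

# BED starts are 0-based and ends are exclusive, so width is end - start.
peaks$width <- peaks$end - peaks$start
print(peaks)
```

Once the data is in a data frame like this, the statistical methods from the lectures can be applied without knowing which tool produced the file.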

The course took place over 8 weeks and the lectures were released on a weekly basis. One week's worth of material was divided into 2 to 3 topics, where each topic was covered by 3 to 6 videos with an average length of about 5 minutes. The longest video seems to be around 15 minutes; the shortest is less than 2 minutes.

There are purely statistical, theory-oriented lectures, there are lectures describing genomic technologies, and there are lectures discussing Bioconductor packages. Each topic may also include one or more labs that demonstrate the use of the various libraries. A very large number of subjects are covered in quite a bit of detail. I liked that a lot (but with some reservations, see later).

There is an open source book for the lectures:

http://genomicsclass.github.io/book/

And there is a GitHub repository for the labs:

https://github.com/genomicsclass/labs

The lectures do improve with time, but especially at the beginning one gets a sense of being presented filler material. This is especially true for the microarray-related lectures, whose concepts are not revisited later. I couldn't help but feel that the course was in the wrong order as well, with too much theoretical discussion early on. The sections of most interest to the people who sign up for a course like this are presented too late, in the last two weeks, by which time most of the student churn has taken place. Furthermore, there is a disconnect between the theoretical discussions and the practical applications: it is not clear how many of the concepts presented in lectures 1-6 apply to the problem solving of the final weeks.

Still, criticism aside, overall there is great value and reference material to be had there.

## Homework/Evaluation

I consider this to be the least developed element of the course. I believe that when it comes to data analysis, learning takes place only while doing the task itself, so for me the homework is the most important component.

About half of the homework problems can be solved by simple trial and error (sometimes the alternative code snippets are not even valid R code, so that option is really easy to rule out). The remainder of the problems seem to fall into two categories. Some felt like mind games where I was trying to figure out what was actually meant by words that were never defined.

The others were "deep water" exercises where seemingly overcomplicated R code was tossed the student's way to serve as a starting point. This can be very intimidating, yet most of the time the actual solution was trivially obtainable from these starting steps. Most homework could be solved in 15 minutes or less. The assignments lacked the organization and design that would allow the student to work methodically through a problem, and they did not support "learning by doing". Everything was over very quickly and each question felt either too easy or too complicated.

This is a bit worrisome, especially since there is a commercial component to this. Those paying $250 or so can earn a Verified Certificate of Achievement, but in my estimate reaching the 75% completion mark should be extremely easy and would not demonstrate particular skills. (BTW I believe that in HW 5 either the answer key or the image of the dendrogram is incorrect, so if you can't pass because of losing points there make sure to lodge a complaint ;-) )

## About MOOCs

Undoubtedly it is fun and feels greatly empowering to be able to learn at your own pace and revisit topics as necessary. I had forgotten what it is like to be a student again. Moreover, watching lectures grew on me; one just needs to make it a habit to start appreciating the experience.

This was my first course in the so-called Massive Open Online Course setup. But frankly I expected a lot more MOOC-ness, and a sense of being part of a group. Strangely, the interface that EdX has offers very few community aspects. All along I felt alone, and while I know that there is a link that I could visit to go to a forum section, the main site always felt empty and desolate. There is a lot of missed potential there.

## Suggestion to improve the course

Too much time is spent on data wrangling with regular R constructs and too little time is spent on explaining and interacting with the biological data representations within Bioconductor. There are a large number of excellent resources to learn R and "data wrangling" in general, and far fewer to understand how Bioconductor works and the typical usage patterns that actually make sense.

Even when the code deals with Bioconductor, the examples and plotting are needlessly complicated by overly long R constructs that slice, dice, reformat and match, all while plotting and performing the stats. Some of these issues are caused by R itself; the language is powerful but at the same time obtuse and wildly inconsistent. It takes a lot of effort to write simple code. In the end all of that adds up to a substantial cognitive overhead.

In my opinion a few weeks' worth of lectures should be dedicated to each of the concepts of Genomic Ranges, Genomic Alignments and Gene Ontologies, and to the methods to process and combine the information in some of these very arcane object representations of Bioconductor. The authors of the course keep suggesting that the students study the vignettes (help files for the tools), but that is not really a feasible idea. The quality and content of vignettes vary greatly, and they are rarely written with a well-defined educational purpose. I found most vignettes overwhelming and difficult to comprehend; they may be a good reference to remind us what the tools do, but not a way to learn them. Experts should tell us what is worth knowing about each library.

## Overall Evaluation

It was fun and free, so it is hard to argue with that. And I'd like to thank the instructors for taking on this challenge. It is hard to be first in any field, especially one as complex as genomic data analysis.

So who is this course for? Will someone new to the field learn the 101 of genomic data analysis? I doubt that. This would be a very difficult course for a newcomer, and the difficulty varies greatly. I would argue that this course is best for someone who already has experience in the field. I got a lot out of it in very little time. I think I am now a better scientist because of this course, so let me again thank the instructors for their effort and work.

Thanks for the constructive feedback and the review, it's really useful. We were also hoping to make some more interactive homeworks in the future, rather than the simple multiple choice quizzes. At the midway point, some of the students pointed us to The Analytics Edge from MITx, which has longer R data analysis assessments that proceed step by step, and which we should look to copy. Fully interactive quizzes where the students enter code to be evaluated would be ideal.

I checked on the dendrogram also ;-) The 'trick' answer is the two samples which are close to each other horizontally but are joined high on the y-axis (height). The correct answer (I went and checked with `dist()`!) is the pair of samples which are farther apart horizontally but join lower on the y-axis.

I spent a lot of time with this problem since it puzzled me to no end. I believe that I understand how to read a tree, or shall I say I always thought I did.

The distance between samples E and G should indeed be, as you state it, the vertical space between the point where E and G are joined relative to E and G. No disagreement there. But then no matter how much I look at this (see image below) I see only a tiny (almost zero) separation vertically between E and G.

Let's all start at E, we'll go up a teeny-tiny amount then down to G. The distance is almost zero. Anyone that follows your instructions would do this. That is how the concept is explained.

This is why I say that either the answer key or your image is incorrect. That is not what a dendrogram should look like anyhow. Why are nodes E and G drawn at height 20 to begin with?

The only logical explanation I have is that what you really meant to draw was a dendrogram where all labels A, B, C, D, E, G were drawn at height zero and the joining points stayed where they are right now. And in fact that would actually look like a typical dendrogram that we get when we perform hierarchical clustering.

I still stick to my opinion that this image does not match the data that you are describing. The only reason I point this out in so much detail is that the correct answers have monetary value here - someone could lose $250.

Distance is calculated by the height at which the branches join on the y-axis. Try this:

`plot(hclust(dist(c(1,2,20,30))))`

Observations 3 and 4 join at 10 on the y-axis (height) because they are a distance of 10 apart. It's tricky but we do say for this question to go back and listen to the exact definition of dendrogram in the lecture.
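The join heights from the command above can also be inspected without drawing anything. The following is a minimal sketch using the same four numbers; it relies only on the `merge` and `height` components of the object returned by `hclust()`, which uses complete linkage when no method is given.

```r
# Four one-dimensional observations, as in the plot command above.
x <- c(1, 2, 20, 30)
d <- dist(x)      # Euclidean distances between all pairs
hc <- hclust(d)   # default method is complete linkage

# Each row of hc$merge is one join; negative numbers are original observations,
# positive numbers refer to clusters formed in earlier rows.
print(hc$merge)

# The height of each join, in the same order as the rows of hc$merge.
print(hc$height)

# The pairwise distance between observations 3 and 4, for comparison.
d34 <- as.matrix(d)[3, 4]
print(d34)
```

The second entry of `hc$height` is the join of observations 3 and 4, and it equals their pairwise distance of 10.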

I understand that but that is not the full story. It is always a relative distance. It matters where one starts on the Y axis and where one stops. One does not just read off the Y scale and call that the distance. As we follow the branches we sum the vertical paths that we traverse between the start and end locations, this sum is the distance.

But now I understand what is going on.

What that particular R command does when producing the dendrogram is some sort of optimization where it tries to minimize the length of the branches that are drawn, probably for aesthetic reasons. As a result the sample labels are drawn right under the join, so their branch lengths look like zero when in reality they are not. I think that this optimization is ill-conceived as it defeats the purpose of visualization in the first place.

Below is what an unambiguous and typical dendrogram would look like for the same problem.

I didn't really get the problem because I missed the course, but it's `plot(hclust(dist(c(1,2,20,30))),hang=-1)` to get the labels to the bottom. Isn't this just an aesthetic thing? Or are the distances in your newly drawn example different on purpose from the first picture you gave? D now has a larger distance to CAB than F to EG. In the upper picture F is more distant from EG than the whole D / C / AB cluster.

I drew this by quick eyeballing, since I did not have access to the original data, hence the inconsistencies in the relative distances.

I do however think that your suggestion is right: with hang=-1 it draws the traditional looking dendrograms, so it is a matter of aesthetics and not a material difference.
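A quick way to see that `hang` only affects the drawing is to plot the same clustering object twice. This is a sketch using the four numbers from the earlier command, not the homework data.

```r
# Cluster once; both plots below use the same object.
hc <- hclust(dist(c(1, 2, 20, 30)))

# Default: each label hangs a small, fixed amount below its own join.
plot(hc, main = "default hang")

# hang = -1: all labels are drawn at the bottom of the plot.
plot(hc, hang = -1, main = "hang = -1")

# The join heights live in the object and are untouched by plotting options.
heights <- hc$height
print(heights)
```

Since `hang` is an argument to the plot method and not to `hclust()`, the join heights are identical in both pictures; only the position of the labels changes.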

You may be thinking of phylogenetic trees, but this is not true of the clustering dendrogram, which is why we teach in the lecture to read the distance between clusters by reading off the y-axis. The phylogenetic tree is trying to capture the true, underlying tree. The hierarchical clustering algorithm is imposing a tree (maybe the data has nothing to do with a tree) by greedily combining clusters. I don't believe the algorithm accomplishes a global minimum of preserving sample to sample distances which can be traced along branch verticals.
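The difference between the distance read off the y-axis and the original sample-to-sample distance can be checked with `cophenetic()` from base R's stats package. The sketch below reuses the four numbers from the earlier command as an illustration; it is not the data from the homework.

```r
x <- c(1, 2, 20, 30)
d <- dist(x)
hc <- hclust(d)   # complete linkage by default

# cophenetic() returns, for every pair, the height at which the two
# observations first end up in the same cluster.
coph <- cophenetic(hc)

orig_mat <- as.matrix(d)
coph_mat <- as.matrix(coph)
print(orig_mat)
print(coph_mat)

# Observations 2 and 3: original distance versus height read off the tree.
true_23 <- orig_mat[2, 3]
tree_23 <- coph_mat[2, 3]
print(c(true_23, tree_23))
```

Here observations 2 and 3 are 18 apart in the data, but the tree joins their clusters at height 29, the largest distance between the two groups under complete linkage. The y-axis height describes the distance between clusters at the join and not every individual pair inside them.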

Here's another example, where drawing the branches to 0 still doesn't provide a clustering dendrogram which follows the logic of being able to trace the distance by following the lines.

Neither:

nor

gives an indication that samples 3 and 4 are three times closer together than samples 1 and 6, using the logic of tracing the branch verticals.

Your example demonstrates a failure of clustering and not of displaying it. Due to average linkage, 3 is pulled towards the group formed by 1 and 2 whereas 4 is pulled towards the group of 5 and 6. So the result of the clustering is wrong. Hence the distances in the clustering results are not right either.

But then the data is uniform to begin with, so it is not surprising that clustering generates incorrect results.

Incidentally I have learned a lot about clustering and more importantly that displaying it can end up looking surprisingly different. I still think that some of the explanations that you give will not generalize correctly and work only because you show them between immediate neighbors. But I will leave it at that.

Thanks for chiming in and I appreciate the effort to set the record straight.