Monday, February 08, 2021

Why I Hated One Genapsys Slide

I claimed in my Miscellanea piece that I was one post away from being done with J.P. Morgan -- oops, forgot I had drafted a minor screed on data display, which I'll push out before the last piece -- particularly since I hinted I would be taking Genapsys to task on this subject.  Unexpectedly good timing too: maybe new Genapsys CEO Jason Myers' first big initiative can be to fix this plot!

There was one slide in Genapsys' J.P. Morgan presentation that just didn't seem right, and the more I looked at it the more annoyed I got at the exhibition of poor graphical design choices.  I'm going to walk through my complaints in the hope that others might learn from it.  Before I tear it apart, here is the figure -- what do you not like?



I've kvetched before about graphical design -- I don't lack for opinions.  I'll confess my own execution isn't always quite up to my principles, but that pains me.  My main influences have been dear old Dad and Edward Tufte.  Dad often liked to say that the most legible diagram of all is a blank sheet of paper -- there is absolutely no confusion in the message it sends.  So don't add anything to the sheet unless it actually adds to your message -- that was his credo.  I took Tufte's course while a grad student and highly recommend it.  He's not perfect either -- damn close but not there.  There's a diagram he praises in his first book that gives me a headache from moiré effects.

We can quickly spot violations of the minimalist ethic of Robison père and Dr. Tufte: why are there these extra unlabeled datapoints on the swirl graph, which I've marked with red arrows?  Actually, that whole grey swirl has issues, but we'll get to that later -- and it really isn't important to Genapsys' point.  Indeed, it is something you might expect to be lifted from an Illumina plot.

As a scientist, my first concern is accuracy.  If a graph is deliberately deceiving, it is unacceptable.  The Genapsys plot doesn't appear to have any gross violations of this ethic, though the ambiguity from plotting with giant markers isn't ideal.  No, what really toasted my cookies here is that this is supposed to visually argue a point -- and the execution of the plot utterly botches that objective.

The axes are the first sign of serious trouble -- both have breaks in them (red arrows).  Those are sometimes unavoidable, but you really, really want to try avoiding them.  The Y-axis is scaled as a log axis, which I am fond of, but with the break it's now neither linear nor continuous, potentially inducing confusion.  Plus the swirl is unbroken -- plotting a continuous trend line on discontinuous axes is starting to get into misleading territory -- though here it appears to be cluelessness rather than malice.


Of course, the artist hasn't done themselves any favors with the big blue text box eating space (orange arrow) -- particularly when there is so much empty plot space in the upper right corner.

Now I'm not any good at making fancy graphics, so I roughed out what the plot should look like if plotted without breaks.  It really illustrates that the Y-axis break is pure poor planning -- there was very little Y-axis omitted.  It also emphasizes that the NovaSeq is just a totally different beast, requiring a huge capital investment to gain operational savings.  
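For anyone who wants to try the same exercise, here is a minimal sketch of plotting on unbroken log axes, assuming Python with matplotlib.  It is not the code behind my rough plot.  The instrument names and numbers are invented placeholders rather than values from the Genapsys slide, the choice of which quantity goes on which axis is my assumption, and `log_limits` and `plot_unbroken` are hypothetical helper names.

```python
import math

# Placeholder values only: NOT figures from the Genapsys slide or any price list.
# Each entry: (name, instrument price in USD, running cost in USD per gigabase)
PLACEHOLDER_POINTS = [
    ("Instrument A", 10_000, 300.0),
    ("Instrument B", 100_000, 100.0),
    ("Instrument C", 1_000_000, 10.0),
]


def log_limits(values, pad_decades=0.25):
    """Return (low, high) limits covering every value on one unbroken log axis."""
    low = math.log10(min(values)) - pad_decades
    high = math.log10(max(values)) + pad_decades
    return 10 ** low, 10 ** high


def plot_unbroken(points, path="unbroken.png"):
    """Scatter the points on continuous log-log axes and label each directly."""
    # Imported here so log_limits can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    prices = [price for _, price, _ in points]
    costs = [cost for _, _, cost in points]

    fig, ax = plt.subplots()
    ax.scatter(prices, costs, s=20)  # small markers keep positions unambiguous
    for name, price, cost in points:
        ax.annotate(name, (price, cost),
                    textcoords="offset points", xytext=(5, 5))
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlim(*log_limits(prices))  # one continuous range, no axis break
    ax.set_ylim(*log_limits(costs))
    ax.set_xlabel("Instrument price (USD)")
    ax.set_ylabel("Cost per gigabase (USD)")
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_unbroken(PLACEHOLDER_POINTS)` writes a PNG.  To focus on a subset of instruments, filter the list before plotting and the limits will tighten on their own, so no break is ever needed.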


If one wants to just focus on the desktop instruments, we can toss NovaSeq to get this plot, which perhaps better emphasizes how the 144M chip -- if successfully launched -- would be radically different for these metrics.


Now that I've plotted it, I realize the effect isn't as huge as I once thought -- but I'll double down and say that makes the Genapsys graph execution even worse!  Their overly complex presentation of what is really simple still obscured their message -- or at least gave an opening for someone to claim that the presentation was distorting things.  By cleaning up the plot they can make their point more cleanly and remove an angle for a competing salesperson to try to discount it.  

In reality, the cost question is much more complex and nuanced.  The slide doesn't include the newest NextSeq instruments and their price/performance.  Comparing cost per gigabase makes sense only if there aren't significant other costs.  For example, if I must spread my run across multiple Genapsys chips that would fit on a single Illumina flowcell for the machine I have, there is a cost in terms of multiple loadings (lab-side labor!) and in making sure the right data is amalgamated (compute-side labor!).  There's also the ongoing advantage of less expensive hardware: after the first year there are service contracts (and/or licenses) that are percentages of the purchase price.  But that gets into really messy territory, as it depends on the sizes of your projects.  So it's more a reminder that in making these decisions you should consider your projects and their sizes and quality demands, and only use published numbers as starting guides.
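To make that arithmetic concrete, here is a rough sketch in Python of an "effective" cost per gigabase that folds in per-run labor and a share of the annual service contract.  The function name, its parameters, and the example numbers are all hypothetical and purely illustrative, not vendor pricing, and a real comparison would have more terms than this.

```python
import math


def effective_cost_per_gb(project_gb, gb_per_run, consumable_cost_per_run,
                          labor_cost_per_run, instrument_price,
                          service_fraction, projects_per_year):
    """Rough all-in cost per gigabase for one project (illustrative model only).

    Assumes every run is charged in full even if partly used, and that the
    yearly service contract is split evenly across the year's projects.
    """
    # A project larger than one chip or flowcell needs several runs,
    # and each run carries its own loading and data-merging labor.
    runs = math.ceil(project_gb / gb_per_run)
    run_costs = runs * (consumable_cost_per_run + labor_cost_per_run)

    # The service contract is modeled as a fraction of the purchase price per year.
    service_share = instrument_price * service_fraction / projects_per_year

    return (run_costs + service_share) / project_gb


# Hypothetical numbers, chosen only to exercise the function.
example = effective_cost_per_gb(
    project_gb=100, gb_per_run=40,
    consumable_cost_per_run=200, labor_cost_per_run=50,
    instrument_price=10_000, service_fraction=0.1, projects_per_year=10,
)
print(f"effective cost per Gb: {example:.2f}")
```

Changing `gb_per_run` or `projects_per_year` shows how quickly the ranking of two instruments can flip with project size, which is why a published cost per gigabase is only a starting guide.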

Good graphic layout -- it matters!






1 comment:

Unknown said...

Of all "How to display data" nerds, this guy is my favorite. Saw him in grad school, and had UC Santa Cruz pay to bring him multiple times. Was worth every penny. https://www.principiae.be/X0000.php