Why Are Bits Preferred Over Probability In Logoplots?
I have made some logoplots using http://weblogo.threeplusone.com/ and I have also seen these plots in various papers. I'm a cell biologist, so unfortunately I don't understand much about "information content" or entropy when reading about what bits really are. However, most people still seem to prefer bits over probability. So, could anyone tell me (or point me to an easy-to-understand explanation) why it is better? What benefits am I missing if I go with probability? Is it still "ok" to go with probability, or is it a big no-no amongst bioinformaticians?

• I think it is very easy to understand conservation with probability. The total height simply represents all the input sequences, so if an amino acid takes up half of the total height, it is present in half of the sequences.
• It is very easy to compare conservation between different residues, since the total height of every position is the same.
• Show a probability logo to a non-bioinformatician and they will get it with no prior knowledge. Show a bit logo and they will most likely be confused (not only about what a "bit" is, but also about why the total height differs between residues). Today my PI asked me if I had photoshopped away some amino acids, since the residues didn't add up to the same height :(
From a "Tuftean" viewpoint, perhaps the question is: what is this figure trying to communicate? When looking at a logo, most people are not really interested in the bases where each residue is equally probable. But in a probability rendering, these bases are presented to the viewer with the same visual emphasis as bases where there is an (interesting, potentially biologically relevant) overabundance of one or two residues. In the entropy rendering, however, these bases are de-emphasized, so that the viewer's eye is directed to the residues that contribute most to the motif's information content. In both cases the figure shows the same underlying data; the difference is in the presentation and in what is being communicated.

If a nucleotide is 100% conserved it contains 2 bits of information about that position since it tells you enough to nail down one of the four possibilities. Two bits can represent four values:

00 01 10 11
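
To make the "2 bits" figure concrete, here is a minimal sketch of the usual calculation: information content is the maximum possible entropy, log2 of the alphabet size, minus the Shannon entropy of the observed frequencies. The function name `information_content` is just an illustrative choice, and the sketch ignores the small-sample correction that some logo tools apply, so treat it as an approximation rather than exactly what any particular program draws.

```python
import math

def information_content(probs, alphabet_size=4):
    """Bits of information at one position.

    Computed as log2(alphabet_size) minus the Shannon entropy of probs.
    No small-sample correction is applied (a simplifying assumption).
    """
    # Terms with p == 0 contribute nothing to the entropy, so skip them.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(alphabet_size) - entropy

conserved = information_content([1.0, 0.0, 0.0, 0.0])    # one base in every sequence
uniform = information_content([0.25, 0.25, 0.25, 0.25])  # all four bases equally likely

print(f"fully conserved: {conserved:.2f} bits")
print(f"uniform:         {uniform:.2f} bits")
```

With four nucleotides, a fully conserved position comes out at 2 bits and a position where all four bases are equally likely comes out at 0 bits. For a 20-letter amino acid alphabet the same calculation tops out at log2(20), about 4.32 bits.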


Does this sound kind of gimmicky and irrelevant, maybe even pretentious? You might be onto something there.

Personally I dislike LOGOs but for a different reason: one residue has to be "on top" even in a dead heat.

Take a look at my berryLogo implementation if you are interested in alternatives: https://github.com/leipzig/berrylogo

There is also this glyph-based representation, which does not fix the dead heat problem but is quite attractive:

https://github.com/ISA-tools/SequenceLogoVis

I don't see how the use of the bit is pretentious. It wasn't chosen to sound "cool"; it was chosen because sequence logos build directly on information theory.

I can't say why one method is preferred over the other, but the two are related by a nonlinear scaling, and your final point about photoshopping is pertinent. If you had sequences with nothing in common but one base, the other sites, where all four bases are equally likely, should NOT show any motif, and displaying four little letters there would be visually noisy and unhelpful. I'd rather those bases not be drawn, especially in a small diagram. Drawing them would obscure a weak motif like 20/20/20/40; the bit system would highlight the 40% better.
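
As a rough illustration of that last point, the sketch below compares letter heights under the two renderings for a uniform column and a 20/20/20/40 column. It assumes the common convention that each letter's height in a bit logo is its frequency times the column's information content, and it leaves out any small-sample correction. The helper names `column_bits` and `letter_heights` are made up for this example.

```python
import math

def column_bits(probs, alphabet_size=4):
    """Information content of one column: log2(alphabet_size) - entropy."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(alphabet_size) - entropy

def letter_heights(probs, alphabet_size=4):
    """Per-letter heights in a bit logo: frequency times column information."""
    bits = column_bits(probs, alphabet_size)
    return [p * bits for p in probs]

columns = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "weak motif": [0.20, 0.20, 0.20, 0.40],
}

for name, probs in columns.items():
    heights = letter_heights(probs)
    print(name)
    print("  probability logo heights:", probs)
    print("  bit logo heights:        ", [round(h, 4) for h in heights])
    print("  total stack height (bits):", round(sum(heights), 4))
```

In the probability rendering both columns are the same total height. In the bit rendering the uniform column collapses to zero while the 20/20/20/40 column keeps a short stack of roughly 0.08 bits, so it is the only one of the two that is drawn at all, although it remains small next to a fully conserved 2-bit position.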