Maximum bits in ggseqlogo is less then 2?
3
1
Entering edit mode
9 days ago
Mattan ▴ 10

Hi all, I'm using this code as a toy example to see if I understand how ggseqlogo works

# Create a data frame with the sequences
sequences <- c("ACA","ATA","AGA","AAA","ATA","AGA")
df <- data.frame(Sequence = sequences)

# Plot the sequence logo
ggseqlogo(df$Sequence)  I fail to understand the results I'm getting. Why isn't the maximuum 2 bits for positions 0,2? Also, why is the middle poistion (1) isn't populated? We do know that G and T are more common there. ggseqlogo • 303 views ADD COMMENT 2 Entering edit mode 8 days ago Mensur Dlakic ★ 27k Most sequence logo implementations have a small sample correction by default. This is easy to test by using 15-20 sequences, in which case the height of terminal residues should be closer to 2 bits and the middle stack might show up. This is what I get after increasing the number of sequences to ~100. You may need to turn the small sample correction off. ADD COMMENT 2 Entering edit mode 8 days ago The original paper describes the information at a position as: Information(l) = 2 - (H(l) + e(n))  Here, Information(l) is the value along the y-axis in your plots. H(l) is the uncertainty at position l (which decreases when you have more of the same bases at that position in the sequence), while e(n) is a correction factor required when the number n of sample sequences is small. That correction factor is apparently described in this paper but it is also defined in the source code for ggseqlogo. ADD COMMENT 1 Entering edit mode 8 days ago bk11 ★ 2.7k Did you check with method="prob" too? ggseqlogo(df$Sequence, method="prob")