Hi, I'm interested in generating sequence logos for a series of defensin-related proteins that I've clustered using CD-HIT. There are approximately 7600 sequences in about 2500 clusters, but many clusters contain only a few sequences. What is the minimum number of sequences a cluster should have for me to generate a reliable sequence logo from it? Thanks
I remember seeing some time ago that the minimum should be 6, but I lost that information when my computer was hacked. I'm aware of the small-sample correction, and I'm trying to understand the formula for the information content: R_i = log2(20) - (H_i + e_n), with height = f_{b,i} × R_i. If I suppose that a specific position contains only one kind of amino acid (f_{b,i} = 1, so H_i = 0), then for n = 3 the height is -0.2466, and I don't know what a negative value means here; for n = 4 the height is 0.8955, and for n = 6 it is 2.0377. The height should be in bits.
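For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the small-sample correction is the usual approximation e_n = (s - 1) / (2 · ln 2 · n) with s = 20 amino acids; that expression is not written out in the post, but it reproduces the three heights quoted there to four decimals. The function names are made up for illustration and do not come from any logo package.

```python
import math


def small_sample_correction(n, alphabet_size=20):
    """Approximate correction e_n in bits (assumed form: (s - 1) / (2 ln(2) n))."""
    return (alphabet_size - 1) / (2.0 * math.log(2) * n)


def information_content(entropy_bits, n, alphabet_size=20):
    """R_i = log2(s) - (H_i + e_n), in bits."""
    correction = small_sample_correction(n, alphabet_size)
    return math.log2(alphabet_size) - (entropy_bits + correction)


def letter_height(freq, entropy_bits, n, alphabet_size=20):
    """Height of one letter in the logo column: f_{b,i} * R_i."""
    return freq * information_content(entropy_bits, n, alphabet_size)


if __name__ == "__main__":
    # Fully conserved column: one residue with frequency 1, so H_i = 0.
    for n in (3, 4, 6):
        print(n, round(letter_height(1.0, 0.0, n), 4))
```

Under this assumed form, the correction is larger than log2(20) whenever n is below about 3.17, which is why n = 3 gives a negative R_i even for a perfectly conserved column, and n = 4 is the first cluster size where the value turns positive.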