I'm looking for methods to build logo from fasta sequences. I'm not sure what strategy to adopt. My input is a fasta file containing various sequences of various sizes (precisely binding sites, so they're rather short). Looks like this:
>Chromosome:180614-180629 GTAAATTACCGTCAG >Chromosome:200346-200361 GTAAAACCTGGTAAG >Chromosome:461646-461661 GTAAAGAGATCACCA >Chromosome:461694-461709 GTAAAGCACTGAAAG >Chromosome:461714-461731 GTAAACTAAGCGTTGTC
What I have done is to first align the sequences, using an R package named DECIPHER.
>Chromosome:180614-180629 -CATTTAATGGCAGTC------- >Chromosome:200346-200361 ---GTAAAACCTGGTAAG----- >Chromosome:461646-461661 ---GTAAAGAGATCACCA----- >Chromosome:461694-461709 ----GAAAGTCACGAAATG---- >Chromosome:461714-461731 ---GTAAACTAAGCGTTGTC---
I then generated logos using the aligned fasta file, using the R library ggseqlogo, and the online version of weblogo. However I'm not satisfied with the result in both cases.
It doesn't seem to take into account the "N" characters from my sequences. For example the initial "C" looks big, but in reality it only appears twice among 50 sequences, otherwise this position is unknown.
How can I correct this? Either by changing some parameter, or maybe "trimming" the extremities somehow beforehand?