Question: Building sequence logo from uneven sequences
0
gravatar for rioualen
3 days ago by
rioualen360
France
rioualen360 wrote:

Hi,

I'm looking for methods to build logo from fasta sequences. I'm not sure what strategy to adopt. My input is a fasta file containing various sequences of various sizes (precisely binding sites, so they're rather short). Looks like this:

>Chromosome:180614-180629
GTAAATTACCGTCAG
>Chromosome:200346-200361
GTAAAACCTGGTAAG
>Chromosome:461646-461661
GTAAAGAGATCACCA
>Chromosome:461694-461709
GTAAAGCACTGAAAG
>Chromosome:461714-461731
GTAAACTAAGCGTTGTC

What I have done is to first align the sequences, using an R package named DECIPHER.

>Chromosome:180614-180629
-CATTTAATGGCAGTC-------
>Chromosome:200346-200361
---GTAAAACCTGGTAAG-----
>Chromosome:461646-461661
---GTAAAGAGATCACCA-----
>Chromosome:461694-461709
----GAAAGTCACGAAATG----
>Chromosome:461714-461731
---GTAAACTAAGCGTTGTC---

I then generated logos using the aligned fasta file, using the R library ggseqlogo, and the online version of weblogo. However I'm not satisfied with the result in both cases.

ggseqlogo

ggseqlogo

weblogo

weblogo

It doesn't seem to take into account the "N" characters from my sequences. For example the initial "C" looks big, but in reality it only appears twice among 50 sequences, otherwise this position is unknown.

How can I correct this? Either by changing some parameter, or maybe "trimming" the extremities somehow beforehand?

Thanks!

ADD COMMENTlink modified 2 days ago • written 3 days ago by rioualen360

Using weblogo: https://weblogo.berkeley.edu/cache/fileFo0Hto.png

ADD REPLYlink written 3 days ago by genomax68k

The second one I posted is from weblogo, but maybe I'm missing some parameter here? I just posted 5 sequences here but they're 52 in this particular case.

ADD REPLYlink modified 3 days ago • written 3 days ago by rioualen360

I see. Have considered limiting the logo to just the part where there is something in each position?

ADD REPLYlink written 3 days ago by genomax68k

Yeah but I'm afraid of losing information, and there has to be a cleaner way to do it, it seems like quite a basic problem to me...

ADD REPLYlink written 3 days ago by rioualen360
1
gravatar for SMK
3 days ago by
SMK1.3k
Ghent, Belgium
SMK1.3k wrote:

Hi rioualen,

What you can do is:

ggseqlogo

seqs <- chartr('ATCG-', 'ATCGN', aln)
ggseqlogo(seqs, method = 'p', namespace = 'ATCGN')

Where aln is your alignment characters. (note that here I switched to the "probability" method, you can turn it back to "bits" by removing method = 'p')

ADD COMMENTlink modified 3 days ago • written 3 days ago by SMK1.3k

Yeah that's the idea, only I don't want the "N" to appear, for it kinda disturbs the visualization!

ADD REPLYlink written 3 days ago by rioualen360

Indeed. A workaround is to make "N" in light grey:

example_cs

by:

cs_gaps <-
  make_col_scheme(
    chars = c("A", "T", "C", "G", "N"),
    cols = c("#178739", "#CA112C", "#1D4887", "#F3A522", "grey97")
  )
ggseqlogo(seqs, method = 'p', namespace = 'ATCGN', col_scheme = cs_gaps)

But also, base 1, 21, 22, and 23 contain not much info and might be trimmed.

ADD REPLYlink modified 3 days ago • written 3 days ago by SMK1.3k

Apart from the apparent gaps that it generates, it also introduces a bias in the evaluation of nucleotide weights, and impacts the visualisation:

enter image description here enter image description here

ADD REPLYlink written 2 days ago by rioualen360

Can you show the raw alignment for this one?

ADD REPLYlink written 2 days ago by SMK1.3k
>ECK120011216(+)
----GTAAATTACCGTCAG-----
>ECK120011212(+)
----GTAAAGACGAACAATAAAT-
>ECK120016885(+)
----ACAATAAATTTTTAC-----
>ECK120017106(+)
----GTAAAACCTGGTAAG-----
>ECK120016896(+)
----GTAAAGAGATCACCA-----
>ECK120016894(+)
----GTAAAGCACTGAAAG-----
>ECK120016892(+)
----GTAAACTAAGCGTTGTC---
>ECK120029870(-)
----GTAATTTTTCGTAAT-----
>ECK120029882(+)
----GTAAAACCCCGTTTA-----
>ECK125230694(+)
----GCAACTCCCTGAAACG----
>ECK120016460(-)
---TCTTTTTGAAACCAAAT----
>ECK120016458(-)
----TTTACTTTTGGTTAC-----
>ECK120016462(-)
----GTTACGGAATATTAC-----
>ECK120016464(-)
--AGTTATCTGTTTGTTAA-----
>ECK120029872(-)
----GTAAAGATGGGTAAA-----
>ECK120029876(+)
----TTTATAAACATTCCG-----
>ECK120016233(-)
----GAAATCAGATGTAAT-----
>ECK120016235(-)
----TTTAAGATTTGTAAT-----
>ECK125110235(-)
---AGTTACATGTTTAAC------
>ECK120016237(-)
--------CATTTAGTTACATGT-
>ECK125141167(+)
---TGTAACTAAATGTAA------
>ECK120016239(-)
----GTTACATTTAGTTAC-----
>ECK125110233(-)
---GGTTAACATTTTAAT------
>ECK125110231(-)
----TCTAAACTTAATAAA-----
>ECK120016241(-)
----CATTTCTAAACTTAA-----
>ECK125110229(-)
----CTTTTCTATCATTTC-----
>ECK125110227(-)
----GCAATAAAAACCAAATG---
>ECK125110225(-)
----TTAAAATTGTGCAAT-----
>ECK125110223(-)
--AAGTTATCACCATTT-------
>ECK125110221(-)
----TGCATTATTAGTAA------
>ECK125110219(-)
----TTTTTATATGCATTA-----
>ECK125141169(+)
----TGGACAATTTTTTAC-----
>ECK120011322(+)
----GTAACGCAGCGTTAA-----
>ECK120029874(-)
---ATTTACAAATTCTTTGC----
>ECK125202681(+)
----TTGACTTATACTTGCC----
>ECK120029880(-)
---TTCTATGAAAATATTGAC---
>ECK120029639(-)
----CTTTCATTGTTTTAC-----
>ECK120029878(-)
----TTTGTCTCGATATAC-----
>ECK120016120(-)
----GTAAAAAGACGTAAA-----
>ECK120029884(+)
----GTAAAAATATATAAA-----
>ECK120019268(+)
----CTAACGCGTAGATAAAA---
>ECK125141177(+)
----CATTCAGACGATTCC-----
>ECK120016466(-)
-GCATTTACATTT-----------
>ECK120016468(-)
----CTTATTTCGCCATTC-----
>ECK120016470(-)
----GTAAAGAAGGGTAAA-----
>ECK125141175(+)
GAAAGTCAGTTAATGTAAT-----
>ECK125141173(+)
-----CAAAGAATACTTGC-----
>ECK125141171(+)
----AAAAGGGCATGATAATTT--
>ECK120016801(-)
----GTTACGGAACTTTAC-----
>ECK120015590(+)
----TTTACATTTTTTTGC-----
>ECK120012502(+)
----GCAAAACGCCGTAAG-----
>ECK125230685(-)
---CGTAAATGAGAGTAAA-----
>ECK125230685(-)
---CGTAAATGAGAGTAAA-----
>ECK125141179(-)
----TGAAGAGGCGGTTTA-----
>ECK120017108(-)
---GGTAAAGGCAACAAA------
>ECK120012504(-)
----GTAAAGCGGCGAAAA-----
>ECK120029886(+)
----GTCAGCCTGTGTAAA-----
>ECK120011218(-)
----GTAAAATTAGGTAAA-----
>ECK120017107(+)
----GTAATGGTTTTCGTCTAC--
>ECK125141183(+)
---GTTAAACAAACGGTCAAA---
>ECK125141181(+)
----GTAAAGCGGGGATAAAT---
>ECK120011214(+)
----GTAAAAGCTTGTAAG-----
>ECK120016114(+)
----TTTACGTTGTTTTAC-----
>ECK120016114(-)
----GTAAAACAACGTAAA-----
>ECK120016186(-)
----GCAAACATGCGTCAG-----
>ECK120016186(+)
----CTGACGCATGTTTGC-----c
>ECK120016116(+)
----GTAAACTCTCTATCGTTGAA
>ECK120012506(-)
----GTAAAAACGCGTAAA-----
>ECK120016123(+)
----GTAAAGTAAGGTAAA-----
>ECK125230687(+)
----GTAACGTGGCGTAAAC----
ADD REPLYlink written 2 days ago by rioualen360
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 567 users visited in the last hour