Question: Multiple alignment software
0
5 weeks ago by
juanjo75es60
juanjo75es60 wrote:

I have these sequences:

>a
GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA
>b
GTCCGGCCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA
>c
GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC
>d
GTAGGCCGGGCCGAAGGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA
>e
GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCTGGCTCCA

I want to make a multiple alignment. That's what I get from Clustal Omega:

------GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA------------
GTCCGGCCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA----------------
--------------GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC-
------GTAGGCCGGGCCGAA-----GGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA
------GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCT----GGCTCCA-

That's what I get from T-Coffee:

--------------GC------------ATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA
G-TCCGG-------CCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA----------------
GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC-----------------------
G-----TAGGCCGGGCCGAAGGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA--------------
--------------GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCTGGCTCCA-----

That's what I get from an algorithm I designed and developed in one day while watching tv (also tested with longer sequences and larger datasets, which ClustalO and T-Coffee either reject as too long or crash on):

G--CAT-C-CC-G--ATG-GTC--ACGGTC-GC---CAC--CA-GCCTC-G-CCAA--C-GA-C-G----GC-A
G----T-C-CG-GCCAT--G-C--CCAG-CAG----CACGGCAAGGGTCAGGCCCA--C-AATC-G----G--A
GTGCAA-A-CG-G--AT--GCC--AC-G-C-GC---GAC-GCACGCTTC-G-CC-G--C-GG-C-GTTTAGC--
G--TAG-G-CC-G----G-GCC-GAAGG-C-GC---GAC-GCTTGTAT--G-CGGAGGC-GAACTG-CTTGCGA
G--C-TGCTCCAG-TAACCGCCCTATGGTC-GGGGTCA---CA-TTCTG-G-CC----CTGG-C-T----CC-A

Is it only me who finds the last alignment the most accurate? Am I missing something?

ADD COMMENTlink modified 5 weeks ago by Istvan Albert ♦♦ 81k • written 5 weeks ago by juanjo75es60
1

From what I gather from some comments, it looks like the problem here is that my algorithm and the other ones are designed for different goals. Mine is (actually) conceived to align non-conserved sequences, while the other two are mostly designed to align conserved regions in which INDELs are less frequent than SNPs. Because of those different goals, my impression of which alignment is best seems to be biased.

Is it only me who finds the last alignment the most accurate?

Likely yes, given that the alignment of unconserved regions is not widely explored.

Am I missing something?

Yes. That people are usually (by far) more interested in genes and proteins than in DNA in general.

Therefore, it looks like most people will be safe for now using the other software for their usual needs.

ADD REPLYlink written 5 weeks ago by juanjo75es60
2

IMO the thing you're missing is that opening a new gap is a lot more expensive than expanding on an existing gap or putting up with a mismatch. Biology works by keeping what works, and what works is a functional unit that does things. Genes are functional units, and so are protein domains. Changes that affect those units are selected against in nature, so while your algorithm makes the sequences look more "aligned" in the sense of Microsoft Word's Justify setting, it is not biological alignment.
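The difference between opening and extending a gap is commonly modelled with an affine gap penalty. Here is a minimal sketch of that idea. The helper names `gap_runs` and `affine_gap_cost` and the penalty values (10 to open, 1 to extend) are illustrative assumptions, not the defaults of Clustal Omega, T-Coffee, or any other tool mentioned in this thread.

```python
import re

def gap_runs(row):
    """Return the lengths of the contiguous gap runs ('-') in one aligned row."""
    return [len(m.group()) for m in re.finditer(r"-+", row)]

def affine_gap_cost(row, gap_open=10.0, gap_extend=1.0):
    """Affine cost: every run pays gap_open once, plus gap_extend per extra column."""
    return sum(gap_open + gap_extend * (length - 1) for length in gap_runs(row))

one_long_gap = "ACGT------ACGT"      # 6 gap columns in a single run
many_short_gaps = "A-C-G-T-A-C-GT"   # 6 gap columns spread over 6 runs

print(affine_gap_cost(one_long_gap))     # 10 + 5 * 1 = 15.0
print(affine_gap_cost(many_short_gaps))  # 6 * 10 = 60.0
```

Both rows contain six gap characters, but the scattered version costs four times as much under these assumed penalties. Note that this sketch also charges for gaps at the row ends, which real aligners often treat more leniently.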

ADD REPLYlink written 5 weeks ago by RamRS24k

I am not aligning genes... Biology is not only about genes. The biomedical industry is what seems to be only about genes. People here are biased, and they don't seem to take any scientific method seriously. Sorry if I generalize. People who do not agree with all the bullshit displayed here should say so.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by juanjo75es60

It doesn’t matter whether you’re aligning genes or not, indels are still less frequent than polymorphisms, and your algorithm doesn’t reflect that.

Now, please stop calling everyone here biased because we haven’t agreed with you. If you want honest feedback, we have provided it - if you don’t, you’re welcome to go elsewhere.

ADD REPLYlink written 5 weeks ago by Joe14k

We are all trying to keep this discussion on track and not take things personally, but incensing statements make that difficult. I agree with you that biology is not just about genes. However, we are aligning sequences here, so there needs to be a reason we align sequences as well as a metric to measure how close we are to that goal. What is your goal/reason for aligning these five sequences, and what metric are you using to show that your algorithm got us closer (than the other two algorithms) to that goal?

If your algorithm is better to reach that goal, it is better - as simple as that. It does need to be a well-defined and biologically relevant goal though, or we are not solving a biological problem.

ADD REPLYlink written 5 weeks ago by RamRS24k
7

To be blunt, yours is the worst there.

As ATpoint commented, gaps and gap extensions are (and should be) much more costly than mismatches to reflect biological reality (SNPs are more common than INDELs).

An alignment as ‘gappy’ as yours would be a nightmare to do subsequent phylogenetic analysis with (where entire or mostly gap columns are often discarded since they lack evolutionary signal).
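For readers unfamiliar with that filtering step, here is a rough sketch of discarding gap-dominated columns before a phylogenetic analysis. The function name `drop_gappy_columns` and the 50% threshold are assumptions for illustration; dedicated trimming tools use their own criteria.

```python
def drop_gappy_columns(rows, max_gap_fraction=0.5):
    """Keep only the columns whose fraction of gaps is <= max_gap_fraction."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    n = len(rows)
    keep = [
        i for i, col in enumerate(zip(*rows))
        if col.count("-") / n <= max_gap_fraction
    ]
    return ["".join(row[i] for i in keep) for row in rows]

toy = ["AC-GT-A",
       "A--GTCA",
       "ACTG--A"]

# Columns 2 and 5 are gaps in two of three rows, so they are removed.
print(drop_gappy_columns(toy))  # ['ACGTA', 'A-GTA', 'ACG-A']
```

The more columns an alignment spreads its gaps over, the more of it is thrown away by a filter like this, which is one practical cost of a very gappy alignment.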

ADD REPLYlink written 5 weeks ago by Joe14k
4

I can only agree with ATpoint and Joe here I'm afraid.

Apart from the points they have already made, I want to add one more: it makes no sense to align random sequences to each other. Where did you get the sequences you are using? From a gene family? If you really want to benchmark or showcase your tool's performance, take sequences from a well-defined gene family and align those. There are many papers on this specific topic (nothing really recent, but all still valid).

ADD REPLYlink written 5 weeks ago by lieven.sterck6.1k

These are not random sequences. These are sub-sequences of wider sequences which were found by finding local alignments of a SINE sequence into a genome (if I remember well). Therefore, the sequences in which they are included are likely related. I am not interested in aligning conserved regions. If they are conserved then they will obviously have more SNPs than INDELs. Not sure in other cases.

ADD REPLYlink written 5 weeks ago by juanjo75es60
3

pfff, not sure where to start, but it's apparent (and I mean this in the most helpful way) that you are lacking a serious level of basic (molecular) biology, especially to get involved in such a topic.

of a SINE sequence into a genome (if I remember well). Therefore, the sequences in which they are included are likely related.

Sorry, but this is a totally wrong assumption. Moreover, aligning (or working with) transposon and TE-related sequences adds an extra level of difficulty.

I am not interested in aligning conserved regions.

"Funny" you mention this, as this is exactly what people do to benchmark alignment methods, since for those we at least have some good indication of what the result should look like.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by lieven.sterck6.1k

So, finding two almost identical SINE sequences in two different places of a genome and assuming they are related means "Sorry, but this is a totally wrong assumption". Could you expand on that?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by juanjo75es60
1

Yes but let me first mention that the statement you make here in your comment is not the same as the one in the original I quoted.

The SINE elements themselves might be related, but they don't necessarily have to be! SINEs, just like many other transposon/TE sequences, are part of big families, and it's still not really clear whether they all share a common origin (and hence are related). Regardless, the locations where they integrate into a genome are not related simply because they contain the same kind of SINE. Integration sites of SINEs do not imply a biological relationship.

You could in theory align the SINE sequences (there is at least some sort of relationship); however, they are prone to rapid mutation accumulation, which makes aligning them difficult.

Anyway, this is all a minor comment compared to the ones others have posted here (and which I fully support), the main one being that there is a biological reasoning behind doing sequence alignments that cannot be neglected.

ADD REPLYlink written 5 weeks ago by lieven.sterck6.1k
3

Not really my field, but shouldn't an alignment algorithm try to introduce as few gaps as possible to find the best overall alignment between sequences? I think this is biologically meaningful since, e.g., in protein-coding sequences in an evolutionary context, gaps would probably lead to frameshifts. While a few might make sense, your alignment is basically gapping sequences within the (let's call it here) coding or "core" sequence until they fit each other. In contrast, the other two algorithms try to keep those "sub-sequences" with local similarity gap-free and instead introduce gaps at the ends. I think keeping things as ungapped as possible in the "core" sequence is a better a priori assumption than forcing sequences to match each other over the full length.

I cannot really think of a meaningful biological process where that many gaps would indicate conservation, be it protein-coding functions, transcription factor binding, etc. What makes you think that this is a good strategy? Again, not really my core field, just thinking aloud.

Edit: "That's what I get from an algorithm I designed and developed in one day while watching tv": Not sure why you write something like this but, to be honest, this is exactly how this alignment of yours looks like. If you do not like people giving snarky comments, maybe better avoid this kind of sentence, to keep everything serious.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint24k

Not sure why you write something like this

That usually helps to instantly identify those who do not have honest intentions when writing an answer. They usually lose their cool for no apparent reason. Apart from that, it's (more or less) the truth, and it's usually healthy to display the truth. Glad it "helped" you to identify "how this alignment of yours looks like".

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by juanjo75es60
3

If ATpoint was prejudiced in his answer, that's because you wrote your post in what appears to be an arrogant manner. It alienates people. You might want to consider this in your future interactions.

It doesn't change the fact that, at least on this data set, your algorithm, whatever it is, doesn't appear particularly performant.

Based on the current data, there is nothing to suggest your algorithm is good on divergent data as you suggest. All we know is it is not very good on notionally similar or related sequences (assuming that is the case for your example data).

I don't believe that DNA based alignment will ever be particularly good at resolving poorly conserved sequences (unless you know something about the ground truth of the evolution of that set of sequences and can calibrate accordingly). For this purpose, protein alignment and particularly HMM based alignment methods excel. I cannot see your algorithm outperforming them at the moment; certainly not until you provide some equivalent benchmark data anyway.

Some general advice: don't be so precious about your algorithm. You seem to be asking for opinions (repeatedly) but are not prepared to listen to answers or opinions which do not align with your preconceived ideas. It is possible your algorithm is no good. Be open to that possibility.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Joe14k
2

Unfortunately, that is not true either. An incensing statement runs counter to the assumption that people are here in good faith, and given that tone is not apparent online, it propagates the worst assumptions on all sides. Let's just stick to the science and leave our personal feelings out of the conversation.

ADD REPLYlink written 5 weeks ago by RamRS24k

So you're calling AT dishonest? Really classy.

ADD REPLYlink written 5 weeks ago by swbarnes26.7k

Let's please not continue this line of conversation. We are all professionals here and nothing needs to be taken personally.

ADD REPLYlink written 5 weeks ago by RamRS24k

For completeness: I actually noticed this elaborate sentence of yours after I commented, which is why I used an edit, as I always do, to indicate changes. Believe it or not, the section above the edit therefore represents my "honest" opinion, which was intended to be helpful. Negative criticism is unpleasant but sometimes necessary. Anyway, the fact that you repeatedly, in this thread and the one you posted before, act offended after receiving negative criticism makes me pull out of this thread and those you will post in the future. Good luck with your research effort.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint24k
1

An important bioinformatics rule: trash in, trash out. If you have that situation, there is nothing to benchmark or to show. If you want to show that your algorithm is good, then at least do it with a more real-life situation.

Another thing: look at the title of this forum... it is called Biostars and is made for bioinformatics. What you are doing now does not make any sense. If it is just an algorithm for string comparison, then Stack Overflow is maybe a better place.

ADD REPLYlink written 5 weeks ago by gb1.0k

At this stage it's worse even than that, I think. Given that we know nothing of what the algorithm does or what it's for (other than in theory being good for divergent sequences), garbage in -> garbage out could be gold in -> garbage out while it is black-boxed in some mystery algorithm.

ADD REPLYlink written 5 weeks ago by Joe14k
1

Just some updates, given that people seem to be quite interested in this topic.

(I) I didn't read all the comments. Too much nonsense IMHO. I will try to focus only on what is interesting for me. Some people here seem to be very young and/or lacking many vital experiences. That's the best explanation I can find (not everybody, of course).

(II) I have been running more tests with my algorithm. I improved it a bit to make it discard gaps (just discarding columns with too many gaps). My conclusions for now are: 1) It performs similarly to Clustal Omega when making phylogenetic trees for conserved regions (but faster). 2) It apparently performs better with non-conserved regions.

What kind of evaluation am I doing? I just make trees from the alignments and then compare the trees and the pairwise alignments in the tree. I find that the trees make about as much sense as the trees obtained with Clustal, and the pairwise alignments are similar between my algorithm and Clustal for conserved regions, but the pairwise alignments are better with my algorithm for unconserved regions. That's all. Quite subjective so far. I will continue doing some tests and will likely report back. But one important thing is that you can use it as a replacement for Clustal and there will not be much difference. I tested it, for example, with the 35 mammals alignment data (from Ensembl), and indeed my tree seems to make more sense than the Clustal one.

(III) If you remember, it's not only that I made software to replace Clustal Omega; I also made a replacement for BLAST, and I also use my own software for generating phylogenetic trees. I have spent much more time on these than on the multiple alignment software. I don't care much whether the multiple alignment is better or not. I just implemented it because the alignments I had been obtaining from Clustal and T-Coffee were not good enough in my opinion. People here are making a big thing of something I clearly stated I did "in a day while watching tv". That's not something one says to try to give special value to his work... I wasn't trying to impress anyone with that part of my work... People seem quite confused, BTW that's their problem...

This is a tree using Clustal: [image: 35 mammals tree from Clustal Omega alignment]

A tree with my algorithm: [image: 35 mammals tree from alternative alignment]

I will likely continue reporting when I have more data...

ADD REPLYlink written 4 weeks ago by juanjo75es60
5

Too much nonsense

Some people here seem to be very young and/or lacking many vital experiences.

People seem quite confused, BTW that's their problem...

Stop it with the bad faith and ad hominem comments or you will be banned from the forum.

ADD REPLYlink written 4 weeks ago by RamRS24k
1

Can you share the actual Newick tree files for those trees?

How are you calculating branch lengths in your algorithm?

ADD REPLYlink written 4 weeks ago by Joe14k

Could someone explain to me why my answer (the one I consider the right one) was moved to a comment while a deeply wrong answer is kept?

ADD REPLYlink written 5 weeks ago by juanjo75es60

Because yours isn’t an answer, it is more suitable as a comment. You came here looking for our opinions (ostensibly), so your own opinion that just ignores everything we said cannot by definition be an answer to the question.

ADD REPLYlink written 5 weeks ago by Joe14k

Your content adds to the discussion but does not really answer the question as it looks like a defensive argument and not an objective statement. It would fit as an "answer" if this were a Forum discussion post. However, since this is a "Question", Mensur's post addresses the question better and is thus better suited as an answer.

Would you prefer if your content was an answer so you could accept it? As with all online communities, the community gets to decide what is most helpful but if you'd like for your post to be made an answer, we can go ahead and do that.

ADD REPLYlink written 5 weeks ago by RamRS24k
7
5 weeks ago by
Mensur Dlakic1.8k
USA
Mensur Dlakic1.8k wrote:

Is it only me who finds the last alignment the most accurate? Am I missing something?

I think you are missing the purpose of alignments of biological sequences. These are not abstract problems where the goal is to make the prettiest picture, or the one that has the highest number of match columns and the lowest number of mismatched columns. The goal of alignments is to find a proper representation of the evolutionary history that connects two sequences.

It may help to think about the alignment of two sequences first (say, top two sequences as you wrote them), and then extrapolate that onto multiple alignments. Let us say that biological events that shape the alignment of two sequences are mutations, insertions and deletions. How many of those biological events would be enough to explain a given alignment? More importantly, if we can explain it adequately with 2-3 events (like Clustal alignment did), should we try to explain it better with 50 or so events (like yours did)?

This is already implied in previous responses: knowing how to code an alignment algorithm is good, but knowing how to optimize biological alignments requires some knowledge of biology as well. I told you a week or so ago that BLAST has been developed by a team of people over decades. Same for Clustal. Maybe you are that good to surpass their combined effort in a day while watching TV - ask the right questions and study the field, and we will find out.

ADD COMMENTlink written 5 weeks ago by Mensur Dlakic1.8k

You hit the nail on the head with the "look pretty" part. The underlying measure of accuracy is the problem here. It is quite impressive that OP wrote an algorithm to align multiple sequences in a day.

ADD REPLYlink written 5 weeks ago by RamRS24k

Mensur, thanks for replying again. I guessed you had realized from last time that I didn't agree with your reply. I don't think it's the point of this forum to discuss philosophical or political issues. That's why I didn't explain to you last time why I think you are deeply wrong. I won't do so this time either. Not here. But you can open a thread to discuss philosophical issues if you like, and I'd likely participate.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by juanjo75es60

Again you make a deeply inaccurate statement: "if we can explain it adequately with 2-3 events". You need a lot more events to explain their alignment. Indeed you need more events than with mine. Unless you believe that DNA replication makes a lot of mistakes in just one event. It's quite ridiculous to have to discuss such basic concepts here... And yes, I'm not an expert. I have only been working in this field for around 6 months. Maybe that's why I have the basic concepts fresher in mind and am not as biased as people here seem to be. As someone commented, we should focus on science. And science is not about reputation, it's about facts and methods.

ADD REPLYlink written 5 weeks ago by juanjo75es60
3

And yes, I'm not an expert. I have only been working in this field for around 6 months. Maybe that's why I have the basic concepts fresher in mind and am not as biased as people here seem to be.

You can't first ask our opinion and then disagree with the opinions you receive. If you think your algorithm is superior and all of us are dumb, then fine, go ahead and leave us alone in our ignorance.

You can also get my opinion. You are currently not contributing anything meaningful to this forum. You don't have the experience to back up your ideas, and your understanding of biology is flawed at best. You engage in fights with everyone here who tries to help you with making your algorithm better and biologically sound. That's absurd and a bad attitude. You cannot handle criticism. I consider this behavior abuse of the volunteer contributors of biostars and will take actions to prevent this going further if necessary.

You can consider this the only warning you will receive.

Have a nice day,
Wouter

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by WouterDeCoster41k

Without wishing to put words in his mouth, Mensur was no doubt simplifying to illustrate the point. It's not trivial to count the total number of events. It would be approximately the total number of contiguous gaps (~2 in the Clustal alignment if you ignore the sequence ends, which you should, since you don't know what sequence was on either side in the example), plus however many mismatches there are. But as you admit in any case, it's fewer than with your algo.

DNA replication does make lots of mistakes potentially (e.g. strand slippage on highly repetitive sequences); however, an event that looks like ATG-----TA is (likely) a single event, from a spontaneous deletion, not several deletion events. Now, it's never possible to know for sure (such is phylogenetics), so you choose a parsimonious explanation. This is where and why we say, with the best will in the world, that you are missing some of the subtleties of the biological reality; it's more than just an exercise in playing ‘match the letters’.
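The counting rule described above (one event per contiguous gap, terminal gaps ignored, plus mismatches) can be sketched roughly as follows. `pairwise_events` is a hypothetical helper written for this illustration and is a crude parsimony proxy, not a phylogenetic model; the two toy row pairs are made up.

```python
def pairwise_events(row_a, row_b):
    """Return (mismatches, indel_events) for two rows taken from one alignment.

    Columns where both rows have a gap are dropped, leading and trailing
    columns containing a gap are ignored, and each contiguous run of gaps
    in the same row counts as a single indel event.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    while cols and "-" in cols[0]:    # ignore gaps at the left end
        cols.pop(0)
    while cols and "-" in cols[-1]:   # ignore gaps at the right end
        cols.pop()
    mismatches = indels = 0
    prev = None                       # which row was gapped in the previous column
    for a, b in cols:
        state = "a" if a == "-" else "b" if b == "-" else None
        if state is None:
            mismatches += a != b
        elif state != prev:
            indels += 1               # a new gap run starts here
        prev = state
    return mismatches, indels

print(pairwise_events("--ACGTACGTTT", "TTACG---GTA-"))  # (1, 1)
print(pairwise_events("A-CG-TA", "AC-GT-A"))            # (0, 4)
```

The second pair has no mismatches at all, yet it needs four separate indel events to explain, which is the trade-off being discussed in this thread.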

I’m not sure why you think us biased (beyond us not agreeing with you?). It’s not like we get up every morning and swear allegiance to Altschul while saluting a flag of the NCBI.

ADD REPLYlink written 5 weeks ago by Joe14k
2
5 weeks ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

An important detail that should be mentioned here is that none of the alignments you present there are inherently "good" or "bad".

Alignment algorithms maximize a score. The result (if implemented correctly) is the alignment that produces the maximal score. The score is built from the rewards and penalties associated with each match, mismatch, and gap.

Every single alignment above should be reproducible by any tool when the scores are chosen identically. That being said some aligner implementations may not be tunable enough to cover all use cases (often for performance reasons).

Long story short, your alignment seems "better" to you because you chose to value certain matches a certain way. That is fine.

This does not make the other alignment "bad", even less so makes the other tools "bad" - it just means that the other alignment method chose different scoring (rewards and penalties) and tries to address a different use case.
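To make the role of the scoring scheme concrete, here is a small sketch that scores two alignments of the same pair of toy sequences under two different sets of rewards and penalties. The function names, the toy sequences, and both scoring schemes are assumptions for illustration, and the gap penalty here is linear (per column) for simplicity.

```python
from itertools import combinations

def pair_score(a, b, match, mismatch, gap):
    """Score two aligned rows column by column with a linear gap penalty."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                  # gap-gap columns carry no information
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores over every pair of rows in the alignment."""
    return sum(pair_score(a, b, match, mismatch, gap)
               for a, b in combinations(rows, 2))

compact = ["ACGTAC",  "AGGTTC"]    # 4 matches, 2 mismatches, 0 gaps
gappy   = ["AC-GTAC", "A-GGTTC"]   # 4 matches, 1 mismatch, 2 gaps

# Scheme 1: gaps cost more than mismatches, so the compact alignment wins.
print(sum_of_pairs(compact), sum_of_pairs(gappy))                  # 2 -1
# Scheme 2: mismatches cost more than gaps, so the gappy alignment wins.
print(sum_of_pairs(compact, 1, -3, -1), sum_of_pairs(gappy, 1, -3, -1))  # -2 -1
```

Neither alignment is "the" answer in the abstract; which one an optimizer returns depends on the penalties it was given.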

If you can easily implement algorithms that produce multiple sequence alignments (while watching TV) then congrats, not many people can do that, write it up, publish it.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Istvan Albert ♦♦ 81k
1
5 weeks ago by
fishgolden420
fishgolden420 wrote:

At a glance, I thought that all alignments are nonsense. I wouldn't do any downstream analysis with any of those alignments.

However, if you want to discuss the accuracy of the algorithms (T-Coffee, Clustal Omega, and yours), you should evaluate them with a benchmark dataset.

I'm a protein person, so I don't know much about nucleotide benchmark datasets, but I think ROSE (a dataset generator) is very well known.

https://www.ncbi.nlm.nih.gov/pubmed/9545448

There are many papers using ROSE to evaluate algorithms https://scholar.google.com/scholar?client=firefox-b-d&um=1&ie=UTF-8&lr&cites=3626960321284721983 thus you can follow them.
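As a rough idea of what such an evaluation measures, the sketch below computes a sum-of-pairs accuracy: the fraction of residue pairs aligned in a trusted reference alignment that a candidate alignment also places in the same column. It assumes the benchmark supplies a reference alignment of the same sequences in the same order; `aligned_pairs` and `sp_accuracy` are hypothetical names, and the toy rows are made up.

```python
from itertools import combinations

def aligned_pairs(rows):
    """Set of residue pairs (seq_i, pos_i, seq_j, pos_j) that share a column."""
    consumed = [0] * len(rows)        # residues seen so far in each row
    pairs = set()
    for col in zip(*rows):
        present = []
        for s, ch in enumerate(col):
            if ch != "-":
                present.append((s, consumed[s]))
                consumed[s] += 1
        for (i, pi), (j, pj) in combinations(present, 2):
            pairs.add((i, pi, j, pj))
    return pairs

def sp_accuracy(candidate_rows, reference_rows):
    """Fraction of reference residue pairs reproduced by the candidate."""
    ref = aligned_pairs(reference_rows)
    if not ref:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref & aligned_pairs(candidate_rows)) / len(ref)

reference = ["ACGT", "A-GT"]
candidate = ["ACGT", "AG-T"]

print(sp_accuracy(candidate, reference))  # 2 of the 3 reference pairs recovered
```

A score like this is only as meaningful as the reference alignment, which is why simulated or structure-based benchmarks are used instead of judging alignments by eye.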

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by fishgolden420
1

Agreed, none of the alignments are very good. I also tested MAFFT and that didn’t produce great results either, so I suspect this task requires some optimisation of alignment parameters rather than running with defaults.

In any case, a good resource for alignment benchmarking is BAliBASE.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Joe14k

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29792/ BAliBASE is built from proteins that have 3D structures... Nucleotide people may use it for benchmarking algorithms; however, it's not suitable for his analysis, at least....

ADD REPLYlink written 4 weeks ago by fishgolden420

Why not? It’s literally for benchmarking nucleotide alignments as you say.

It’s a good litmus test for an aligner, because if it can’t manage those best case scenarios, it’s no good.

ADD REPLYlink written 4 weeks ago by Joe14k
1

They are amino acid sequences, so they must be reverse-translated into nucleotides when you benchmark nucleotide alignment software. It then becomes a codon-based alignment benchmark (the gaps are placed between codon frames; it's quite biased), and it cannot be applied to aligning SINEs.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by fishgolden420