Multiple alignment software
3
0
3.4 years ago
juanjo75es ▴ 130

I have these sequences:

>a
GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA
>b
GTCCGGCCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA
>c
GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC
>d
GTAGGCCGGGCCGAAGGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA
>e
GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCTGGCTCCA


I want to make a multiple alignment. That's what I get from Clustal Omega:

------GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA------------
GTCCGGCCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA----------------
--------------GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC-
------GTAGGCCGGGCCGAA-----GGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA
------GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCT----GGCTCCA-


That's what I get from t-coffee:

--------------GC------------ATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA
G-TCCGG-------CCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA----------------
GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC-----------------------
G-----TAGGCCGGGCCGAAGGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA--------------
--------------GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCTGGCTCCA-----


That's what I get from an algorithm I designed and developed in one day while watching tv (also tested with larger sequences and larger datasets which ClustalO and t-Coffee refuse for being too long or crash when aligning):

G--CAT-C-CC-G--ATG-GTC--ACGGTC-GC---CAC--CA-GCCTC-G-CCAA--C-GA-C-G----GC-A
G----T-C-CG-GCCAT--G-C--CCAG-CAG----CACGGCAAGGGTCAGGCCCA--C-AATC-G----G--A
GTGCAA-A-CG-G--AT--GCC--AC-G-C-GC---GAC-GCACGCTTC-G-CC-G--C-GG-C-GTTTAGC--
G--TAG-G-CC-G----G-GCC-GAAGG-C-GC---GAC-GCTTGTAT--G-CGGAGGC-GAACTG-CTTGCGA
G--C-TGCTCCAG-TAACCGCCCTATGGTC-GGGGTCA---CA-TTCTG-G-CC----CTGG-C-T----CC-A


Is it only me who finds the last alignment the most accurate? Am I missing something?

alignment clustal omega tcoffee • 3.2k views
1

From some of the comments, it looks like the problem here is that my algorithm and the other two are designed for different goals. Mine is (actually) conceived to align non-conserved sequences, while the other two are mostly designed to align conserved regions, where INDELs are less frequent than SNPs. Because of those different goals, my impression of which alignment is best seems to be biased.

Is it only me who finds the last alignment the most accurate?

Likely yes given that alignment of unconserved regions is not widely explored.

Am I missing something?

Yes: people are usually (by far) more interested in genes and proteins than in DNA in general.

Therefore, it looks like most people will be safe for now using the other software for their usual needs.

3

IMO the thing you're missing is that opening a new gap is a lot more expensive than expanding on an existing gap or putting up with a mismatch. Biology works by keeping what works, and what works is a functional unit that does things. Genes are functional units, and so are protein domains. Changes that affect those units are selected against in nature, so while your algorithm makes the sequences look more "aligned" in the sense of Microsoft Word's Justify setting, it is not biological alignment.
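
To make the "opening a gap is expensive, extending it is cheap" point concrete, here is a minimal Python sketch of affine gap scoring for two aligned rows. The function name `affine_pair_score` and all parameter values are illustrative assumptions, not the defaults of Clustal Omega or T-Coffee.

```python
def affine_pair_score(row_a, row_b, match=1, mismatch=-1,
                      gap_open=-5, gap_extend=-1):
    """Score two aligned rows with an affine gap model.

    A gap run of length k costs gap_open + (k - 1) * gap_extend,
    so one long gap is cheaper than many scattered ones.
    All weights are assumed values chosen for illustration.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None  # row that held the gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            # Column empty in both rows: no pairwise information.
            continue
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Extending a run in the same row is cheap; anything else opens.
            score += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


# Same number of gap characters, very different cost:
print(affine_pair_score("ACGTACGT", "AC----GT"))  # one run of four
print(affine_pair_score("ACGTACGT", "A-G-A-G-"))  # four runs of one
```

Under these assumed weights the single four-column gap scores far better than four isolated gaps, which is why aligners using this kind of model avoid alignments peppered with short gaps.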

0

I am not aligning genes... Biology is not only about genes. The biomedical industry is what seems to be only about genes. People here are biased, and they don't seem to take any scientific method seriously. Sorry if I generalize. People who do not agree with all the bullshit displayed here should say so.

0

It doesn’t matter whether you’re aligning genes or not, indels are still less frequent than polymorphisms, and your algorithm doesn’t reflect that.

Now, please stop calling everyone here biased because we haven’t agreed with you. If you want honest feedback, we have provided it - if you don’t, you’re welcome to go elsewhere.

0

We are all trying to keep this discussion on track and not take things personally, but inflammatory statements make that difficult. I agree with you that biology is not just about genes. However, we are aligning sequences here, so there needs to be a reason we align sequences as well as a metric to measure how close we are to that goal. What is your goal/reason for aligning these five sequences, and what metric are you using to show that your algorithm got us closer (than the other two algorithms) to that goal?

If your algorithm is better to reach that goal, it is better - as simple as that. It does need to be a well-defined and biologically relevant goal though, or we are not solving a biological problem.

8

To be blunt, yours is the worst there.

As ATpoint commented, gaps and gap extensions are (and should be) much more costly than mismatches to reflect biological reality (SNPs are more common than INDELs).

An alignment as ‘gappy’ as yours would be a nightmare to do subsequent phylogenetic analysis with (where entire or mostly gap columns are often discarded since they lack evolutionary signal).
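
As a rough illustration of the column filtering mentioned above, here is a small Python sketch that drops alignment columns made up mostly of gaps. The function name `drop_gappy_columns` and the 50% threshold are assumptions for this example; dedicated trimming tools apply their own, more careful criteria.

```python
def drop_gappy_columns(rows, max_gap_fraction=0.5):
    """Return the alignment with gap-heavy columns removed.

    rows: list of equal-length aligned strings using '-' for gaps.
    max_gap_fraction: assumed cutoff; a column is kept only if its
    fraction of gaps is at most this value.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("aligned rows must have equal length")
    keep = [
        i for i in range(width)
        if sum(r[i] == "-" for r in rows) / len(rows) <= max_gap_fraction
    ]
    return ["".join(r[i] for i in keep) for r in rows]


print(drop_gappy_columns(["A-CG", "A-C-", "AT-G"]))
```

The more gap columns an alignment has, the less of it survives this kind of filter, so a very gappy alignment leaves little signal for tree building.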

4

I can only agree with ATpoint and Joe here, I'm afraid.

Apart from the points they have already made, I want to add one more: it makes no sense to align random sequences to each other. Where did you get the sequences you are using, from a gene family? If you really want to benchmark or showcase your tool's performance, take sequences from a well-defined gene family and align those. There are many papers on this specific topic (nothing super recent, but still all valid).

0

These are not random sequences. These are sub-sequences of wider sequences which were found by finding local alignments of a SINE sequence into a genome (if I remember well). Therefore, the sequences in which they are included are likely related. I am not interested in aligning conserved regions. If they are conserved then they will obviously have more SNPs than INDELs. Not sure in other cases.

3

pfff, not sure where to start, but it's apparent (and I mean this in the most helpful way) that you are seriously lacking in basic (molecular) biology, especially for getting involved in such a topic.

of a SINE sequence into a genome (if I remember well). Therefore, the sequences in which they are included are likely related.

Sorry, but this is a totally wrong assumption. Moreover, aligning (or working with) transposon and TE-related sequences adds an extra level of difficulty.

I am not interested in aligning conserved regions.

"funny" you mention this as this is exactly what people do to benchmark alignment methods as for those we at least have some good indication what the result should be like.

0

So, finding two almost identical SINE sequences in two different places of a genome and assuming they are related means "Sorry, but this is a totally wrong assumption". Could you expand on that?

1

Yes but let me first mention that the statement you make here in your comment is not the same as the one in the original I quoted.

The SINE elements themselves might be related, but they don't necessarily have to be! SINEs, like many other transposon/TE sequences, are part of big families, and it's still not really clear whether they all share a common origin (and hence are related). Regardless, the locations where they integrate into a genome are not related simply because they contain the same kind of SINE. Integration sites of SINEs do not imply a biological relationship.

You could in theory align the SINE sequences (there is at least some sort of relationship); however, they are prone to rapid mutation accumulation, which makes aligning them difficult.

Anyway, this is all a minor comment compared to the ones others have posted here (and which I fully support), the main one being that there is biological reasoning behind doing sequence alignments that cannot be neglected.

3

Not really my field, but shouldn't an alignment algorithm try to introduce as few gaps as possible to find the best overall alignment between sequences? I think this is biologically meaningful: in protein-coding sequences, for example, gaps in an evolutionary context would probably lead to frameshifts. While a few might make sense, your alignment is basically gapping sequences within the (let's call it) coding or "core" sequence until they fit each other. In contrast, the other two algorithms try to keep those "sub-sequences" with local similarity gap-free and rather introduce gaps at the ends. I think keeping things as ungapped as possible in the "core" sequence is a better a priori assumption than forcing sequences to match each other over the full length.

I cannot really think of a meaningful biological process where that many gaps would indicate conservation, be it protein-coding functions, transcription factor binding etc. What makes you think that this is a good strategy? Again, not really my core field, just thinking aloud.

Edit: "That's what I get from an algorithm I designed and developed in one day while watching tv": Not sure why you write something like this, but to be honest, this is exactly how this alignment of yours looks like. If you do not like people giving snarky comments, maybe better to avoid this kind of sentence, to keep everything serious.

0

Not sure why you write something like this

That usually helps to instantly identify those who do not have honest intentions when writing an answer. They usually lose their cool for no apparent reason. Apart from that, it's (more or less) the truth, and it's usually healthy to display the truth. Glad it "helped" you to identify "how this alignment of yours looks like"

3

If ATpoint was prejudiced in his answer, that's because you wrote your post in what appears to be an arrogant manner. It alienates people. You might want to consider this in your future interactions.

It doesn't change the fact that, at least on this data set, your algorithm, whatever it is, doesn't appear particularly performant.

Based on the current data, there is nothing to suggest your algorithm is good on divergent data as you suggest. All we know is it is not very good on notionally similar or related sequences (assuming that is the case for your example data).

I don't believe that DNA based alignment will ever be particularly good at resolving poorly conserved sequences (unless you know something about the ground truth of the evolution of that set of sequences and can calibrate accordingly). For this purpose, protein alignment and particularly HMM based alignment methods excel. I cannot see your algorithm outperforming them at the moment; certainly not until you provide some equivalent benchmark data anyway.

Some general advice: don't be so precious about your algorithm. You seem to be asking for opinions (repeatedly) but are not prepared to listen to answers or opinions which do not align with your preconceived ideas. It is possible your algorithm is no good. Be open to that possibility.

2

Unfortunately, that is not true either. An inflammatory statement runs counter to the assumption that people are here in good faith, and given that tone is not apparent online, it propagates the worst assumptions on all sides. Let's just stick to the science and leave our personal feelings out of the conversation.

0

So you're calling AT dishonest? Really classy.

0

Let's please not continue this line of conversation. We are all professionals here and nothing needs to be taken personally.

0

For completeness: I actually noticed this elaborate sentence of yours after I commented, which is why I used "Edit", as I always do to indicate changes. Believe it or not, the section above the edit therefore represents my "honest" opinion, which was intended to be helpful. Negative criticism is unpleasant but sometimes necessary. Anyway, given that you repeatedly, in this thread and the one you posted before, act offended after receiving negative criticism, I am pulling out of this thread and those you will post in the future. Good luck with your research effort.

2

Just some updates, given that people seem to be quite interested in this topic.

(I) I didn't read all the comments. Too much nonsense IMHO. I will try to focus only on what is interesting for me. Some people here seem to be very young and/or lacking many vital experiences. That's the best explanation I can find (not everybody, of course).

(II) I have been running more tests with my algorithm. I improved it a bit to make it discard gaps (just discarding columns with too many gaps). My conclusions for now are: 1) It performs similarly to Clustal Omega when making phylogenetic trees for conserved regions (but faster). 2) It apparently performs better with non-conserved regions. What kind of evaluation am I doing? I just make trees from the alignments and then compare the trees and the pairwise alignments in the tree. I find that the trees make similar sense to the trees obtained with Clustal. With conserved regions, the pairwise alignments from my algorithm and from Clustal are similar; with unconserved regions, the pairwise alignments are better with my algorithm. That's all. Quite subjective still. I will continue doing some tests and will likely report back. But one important thing is that you can use it as a replacement for Clustal and there will not be much difference. Tested, for example, with the 35 mammals alignment data (from Ensembl), and indeed my tree seems to make more sense than the Clustal one.

(III) If you remember, it's not only that I made software to replace Clustal Omega; I also made a replacement for BLAST, and I also use my own software for generating phylogenetic trees. I have spent quite a bit more time on these than on the multiple alignment software. I don't care much whether the multiple alignment is better or not. I just implemented it because the alignments I had been obtaining from Clustal and T-Coffee were not good enough in my opinion. People here are making a big thing out of something I clearly stated I did "in a day while watching tv". That's not something one says to try to give special value to one's work... I wasn't trying to impress anyone with that part of my work... People seem quite confused, BTW that's their problem...

This is a tree using Clustal: [image not preserved]

A tree with my algorithm: [image not preserved]

Will likely continue reporting when I have more data...

5

Too much nonsense

Some people here seem to be very young and/or lacking many vital experiences.

People seem quite confused, BTW that's their problem...

Stop it with the bad faith and ad hominem comments or you will be banned from the forum.

1

Can you share the actual Newick tree files for those trees?

How are you calculating branch lengths in your algorithm?

1

An important bioinformatics rule: trash in, trash out. If that is your situation, there is nothing to benchmark or to show. If you want to show that your algorithm is good, then at least do it with a more real-life situation.

Another thing: look at the title of this forum... it is called Biostars and is made for bioinformatics. What you are doing now does not make any sense; if it is just an algorithm for string comparison, then Stack Overflow is maybe a better place.

0

At this stage it's even worse than that, I think: given that we know nothing of what the algorithm does or what it's for (other than, in theory, being good for divergent sequences), garbage in -> garbage out could just as well be gold in -> garbage out while it is black-boxed in some mystery algorithm.

0

Could someone explain to me why my answer (the one I consider the right one) was moved to a comment while a deeply wrong answer is kept?

0

Because yours isn’t an answer, it is more suitable as a comment. You came here looking for our opinions (ostensibly), so your own opinion that just ignores everything we said cannot by definition be an answer to the question.

0

Your content adds to the discussion but does not really answer the question as it looks like a defensive argument and not an objective statement. It would fit as an "answer" if this were a Forum discussion post. However, since this is a "Question", Mensur's post addresses the question better and is thus better suited as an answer.

Would you prefer if your content was an answer so you could accept it? As with all online communities, the community gets to decide what is most helpful but if you'd like for your post to be made an answer, we can go ahead and do that.

7
3.4 years ago
Mensur Dlakic ★ 22k

Is it only me who finds the last alignment the most accurate? Am I missing something?

I think you are missing the purpose of alignments of biological sequences. These are not abstract problems where the goal is to make the prettiest picture, or the one that has the highest number of matched columns and the lowest number of mismatched columns. The goal of an alignment is to find a proper representation of the evolutionary history that connects the sequences.

It may help to think about the alignment of two sequences first (say, top two sequences as you wrote them), and then extrapolate that onto multiple alignments. Let us say that biological events that shape the alignment of two sequences are mutations, insertions and deletions. How many of those biological events would be enough to explain a given alignment? More importantly, if we can explain it adequately with 2-3 events (like Clustal alignment did), should we try to explain it better with 50 or so events (like yours did)?

This is already implied in previous responses: knowing how to code an alignment algorithm is good, but knowing how to optimize biological alignments requires some knowledge of biology as well. I told you a week or so ago that BLAST has been developed by a team of people over decades. Same for Clustal. Maybe you are that good to surpass their combined effort in a day while watching TV - ask the right questions and study the field, and we will find out.

0

You hit the nail on the head with the "look pretty" part. The underlying measure of accuracy is the problem here. It is quite impressive that OP wrote an algorithm to align multiple sequences in a day.

0

Mensur, thanks for replying again. I assumed you had realized from last time that I didn't agree with your reply. I don't think it's the point of this forum to discuss philosophical or political issues. That's why I didn't explain to you last time why I think you are deeply wrong. I won't do so this time either. Not here. But you can open a thread to discuss philosophical issues if you like, and I'd likely participate.

0

Again you make a deeply inaccurate statement: "if we can explain it adequately with 2-3 events". You need a lot more events to explain their alignment. Indeed, you need more events than with mine. Unless you believe that DNA replication makes a lot of transcription mistakes in just one event. It's quite ridiculous having to discuss such basic concepts here... And yes, I'm not an expert. I have only been working in this field for around 6 months. Maybe that's why the basic concepts are fresher in my mind and I am not as biased as people here seem to be. As someone commented, we should focus on science. And science is not about reputation, it's about facts and methods.

3

And yes, I'm not an expert. I have only been working in this field for around 6 months. Maybe that's why the basic concepts are fresher in my mind and I am not as biased as people here seem to be.

You can't first ask our opinion and then disagree with the opinions you receive. If you think your algorithm is superior and all of us are dumb, then fine, go ahead and leave us alone in our ignorance.

You can also get my opinion. You are currently not contributing anything meaningful to this forum. You don't have the experience to back up your ideas, and your understanding of biology is flawed at best. You engage in fights with everyone here who tries to help you with making your algorithm better and biologically sound. That's absurd and a bad attitude. You cannot handle criticism. I consider this behavior abuse of the volunteer contributors of biostars and will take actions to prevent this going further if necessary.

You can consider this the only warning you will receive.

Have a nice day,
Wouter

0

Without wishing to put words in his mouth, Mensur was no doubt simplifying to illustrate the point. It's not trivial to count the total number of events. It would be approximately the total number of contiguous gaps (~2 in the Clustal alignment if you ignore the sequence ends, which you should, since you don't know what sequence was on either side in the example) plus however many mismatches there are. But, as you admit, it's in any case fewer than with your algo.

DNA replication does potentially make lots of mistakes (e.g. strand slippage on highly repetitive sequences); however, an event that looks like ATG-----TA is (likely) a single event, from a spontaneous deletion, not several deletion events. Now, it's never possible to know for sure (such is phylogenetics), so you choose a parsimonious explanation. This is where and why we say, with the best will in the world, that you are missing some of the subtleties of the biological reality; it's more than just an exercise in playing ‘match the letters’.
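
The rough counting rule described above (each contiguous gap run is one event, terminal gaps are ignored, mismatches are added on top) can be sketched in Python as follows. The function name `count_events` and the `ignore_ends` flag are hypothetical, and this is a crude parsimony-style tally rather than a real evolutionary model.

```python
import re


def count_events(row_a, row_b, ignore_ends=True):
    """Return (gap_runs, mismatches) for two aligned rows.

    A contiguous run of gaps in one row counts as a single indel event.
    With ignore_ends=True, gaps at either end are not counted, since the
    flanking sequence is unknown.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Encode columns: 'a' = gap in row_a, 'b' = gap in row_b,
    # 'M' = match, 'X' = mismatch. Columns gapped in both rows are skipped.
    code = "".join(
        "a" if x == "-" else "b" if y == "-" else "M" if x == y else "X"
        for x, y in zip(row_a, row_b)
        if not (x == "-" and y == "-")
    )
    if ignore_ends:
        code = code.strip("ab")  # drop leading and trailing gap columns
    gap_runs = len(re.findall(r"a+|b+", code))
    return gap_runs, code.count("X")


print(count_events("ATGCCCCCTA", "ATG-----TA"))  # one five-base deletion
print(count_events("A-C-G", "ATCTG"))            # two separate one-base gaps
```

Feeding two rows from each of the alignments in the question into such a function gives a quick, if simplistic, way to compare how many events each alignment implies.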

I’m not sure why you think us biased (beyond us not agreeing with you?). It’s not like we get up every morning and swear allegiance to Altschul while saluting a flag of the NCBI.

3
3.4 years ago

An important detail that should be mentioned here is that none of the alignments you present are inherently "good" or "bad".

Alignment algorithms maximize a score. The result (if implemented correctly) is the alignment that produces the maximal score. The score is built from the rewards and penalties associated with each match, mismatch, and gap.

Every single alignment above should be reproducible by any tool when the scores are chosen identically. That being said some aligner implementations may not be tunable enough to cover all use cases (often for performance reasons).
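
To illustrate how the chosen scores drive the result, here is a compact Python sketch of Needleman-Wunsch global alignment with a linear gap penalty. It is a textbook method used here for illustration, not the algorithm inside Clustal Omega, T-Coffee, or the OP's tool, and the weights are assumed values.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring the diagonal.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Same sequences, different gap cost, different "best" alignment.
print(needleman_wunsch("ACGT", "AGCT", gap=-2))
print(needleman_wunsch("ACGT", "AGCT", gap=0))
```

With a costly gap the optimum is the ungapped alignment with two mismatches; with free gaps the optimum introduces gaps to avoid every mismatch. Neither is wrong as an optimization result; they answer different questions.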

Long story short, your alignment seems "better" to you because you chose to value certain matches a certain way. That is fine.

This does not make the other alignment "bad", even less so makes the other tools "bad" - it just means that the other alignment method chose different scoring (rewards and penalties) and tries to address a different use case.

If you can easily implement algorithms that produce multiple sequence alignments (while watching TV) then congrats, not many people can do that, write it up, publish it.

1
3.4 years ago
fishgolden ▴ 460

At a glance, I thought that all alignments are nonsense. I wouldn't do any downstream analysis with any of those alignments.

However, if you want to discuss the accuracy of the algorithms (T-Coffee, Clustal Omega, and yours), you should evaluate them on some benchmark dataset.

I'm a protein person, so I don't know much about nucleotide benchmark datasets, but I think ROSE (a dataset generator) is very well known.

https://www.ncbi.nlm.nih.gov/pubmed/9545448

There are many papers using ROSE to evaluate algorithms https://scholar.google.com/scholar?client=firefox-b-d&um=1&ie=UTF-8&lr&cites=3626960321284721983 thus you can follow them.
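
One common way to score an aligner against a benchmark is a sum-of-pairs style score: the fraction of residue pairs aligned in a reference alignment that the test alignment also aligns. Below is a minimal Python sketch, assuming a trusted reference alignment of the same sequences in the same row order is available (for example, the true alignment from simulated data). The function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column.

    pos is the index of the residue in the ungapped sequence.
    """
    if any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("aligned rows must have equal length")
    next_pos = [0] * len(rows)
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for s, row in enumerate(rows):
            if row[col] != "-":
                present.append((s, next_pos[s]))
                next_pos[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test_rows, reference_rows):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference_rows)
    if not ref:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref & aligned_pairs(test_rows)) / len(ref)


reference = ["ACGT", "AC-T"]
print(sp_score(["ACGT", "AC-T"], reference))
print(sp_score(["ACGT", "A-CT"], reference))
```

A score like this replaces "which alignment looks best" with a number that can be compared across tools on the same dataset.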

1

Agreed, none of the alignments are very good. I also tested MAFFT and that didn’t produce great results either, so I suspect this task requires some optimisation of alignment parameters rather than running with defaults.

In any case, a good resource for alignment benchmarking is BAliBASE.

0

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29792/ BAliBASE is built from proteins which have 3D structures... Nucleotide people may use it for benchmarking algorithms; however, it's not suitable for his analysis, at least...

0

Why not? It’s literally for benchmarking nucleotide alignments as you say.

It’s a good litmus test for an aligner, because if it can’t manage those best case scenarios, it’s no good.

1

They are amino acid sequences, so they must be reverse-translated into nucleotides when you benchmark nucleotide alignment software. It then becomes a codon-based alignment benchmark (the gaps are placed between codon frames, which is quite biased), and it cannot be applied to aligning SINEs.
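
A small Python sketch of why a reverse-translated protein benchmark is codon-biased: threading a coding sequence onto an aligned protein row can only place gaps in whole-codon blocks. The function name `protein_to_codon_alignment` is hypothetical, and the sketch assumes the coding sequence has no stop codon and matches the protein residue for residue (it does not check the translation).

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread an ungapped coding sequence onto an aligned protein row.

    Each residue becomes its codon and each protein gap becomes '---',
    so every gap in the nucleotide row is a multiple of three long and
    sits on a codon boundary.
    """
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be 3x the number of residues")
    out = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


print(protein_to_codon_alignment("M-K", "ATGAAA"))
```

Non-coding sequences such as SINEs have no reading frame, so gaps constrained this way do not reflect how their indels are actually distributed.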