Question: MSA of very long sequences?
majeedaasim (United States) wrote 9 months ago:

I have 9 very long sequences with an average length of 620,513. I need to align them for phylogenetic analysis. How can I align such long sequences?

Thanks

very long sequences msa • 593 views
ADD COMMENTlink modified 9 months ago by jrj.healey8.5k • written 9 months ago by majeedaasim20

BWA is a good option

ADD REPLYlink written 9 months ago by Buffo1.2k

It is an aligner, but I need to perform MSA (multiple sequence alignment) for phylogenetic analysis.

ADD REPLYlink written 9 months ago by majeedaasim20

However, given the length of the sequences, I guess it would be too much for most existing tools unless they are run on a high-capacity computer cluster.

ADD REPLYlink written 9 months ago by The90

Are you certain that those sequences are related by phylogeny, so that an MSA can be logically constructed? If you are not sure about that, trying to align the sequences may result in a meaningless alignment.

You could use a program like Mauve to check whether the sequences are related (i.e. that there are no rearrangements etc.) before trying the MSA.

ADD REPLYlink written 9 months ago by genomax58k

Yes, I am certain, because I obtained these through orthology detection tools. I obtained 356 genes for each species, so there are 356 orthologous groups in which each species contributes no more than one gene. In other words, each group contains a single-copy ortholog across all the species. I then merged all the single-copy orthologs of a species into one super-sequence per species, which is why these sequences are so long. I did this to generate a species tree rather than a gene tree.

ADD REPLYlink written 9 months ago by majeedaasim20

If you merge different genes and then align them, regions at the boundaries between genes can accidentally align to the next or previous, unrelated gene. Those regions are just "noise" and will disturb the downstream analysis. (If the genes are ordered by their position on the chromosomes, the story may be different.)

Thus, in my opinion, the procedure should not be

merge -> align

but

align -> merge
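A minimal sketch of the "align -> merge" order, assuming each ortholog group has already been aligned separately (with any MSA tool) and that the same species name is used as the FASTA header in every group. The function names `parse_fasta` and `concatenate_alignments` are hypothetical helpers written for this illustration, not part of any library.

```python
def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dict (insertion-ordered)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the species name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def concatenate_alignments(gene_alignments, species):
    """Join per-gene alignments column-wise into one supermatrix.

    gene_alignments: list of {species: aligned_sequence} dicts, one per gene.
    species: list of species names to include in the output.
    A species missing from a gene is padded with gaps for that gene.
    """
    supermatrix = {sp: [] for sp in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one gene alignment differ in length")
        width = lengths.pop()
        for sp in species:
            supermatrix[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in supermatrix.items()}


# Two toy, already-aligned genes (made-up data for illustration only).
gene1 = parse_fasta(">spA\nATG-CA\n>spB\nATGGCA\n")
gene2 = parse_fasta(">spA\nTTAA\n>spB\nTT-A\n")
merged = concatenate_alignments([gene1, gene2], ["spA", "spB"])
for sp, seq in merged.items():
    print(f">{sp}\n{seq}")
```

Because every gene is aligned on its own, columns from one gene can never be paired with columns from a neighboring gene, which is the boundary noise described above.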

ADD REPLYlink written 9 months ago by fishgolden260

Very good point. Merge the individual gene alignments!

ADD REPLYlink written 9 months ago by jrj.healey8.5k

After that I merged all the single copy orthologs of a species to create a single super sequence for each species.

By doing what you had asked in the other thread (HOw to merge multifasta sequence into a single sequence having only one header? )? I am not sure how you can do meaningful phylogenetic analysis by concatenating sequence of multiple genes into a single artificial sequence for each species.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax58k

I think I've heard of people doing concatenated multi-FASTA alignments before. I wouldn't like to vouch for how good an idea it is, but I think it's somewhat accepted (presumably the sequences are reasonably similar anyway, as they were probably identified as orthologs with something like 70% nt ID).

ADD REPLYlink written 9 months ago by jrj.healey8.5k

orthologs with like a 70% nt ID

That is the critical piece. Hopefully OP has done the due diligence.

One could still make a phylogeny by incorporating the species and gene names in the headers and keeping the sequences separate. It would be an interesting way to see whether the identified orthologs follow a logical pattern or whether there are mistakes.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax58k

Consider using ASTRAL-II to infer a species tree from gene trees of all the ortholog groups instead. It may well be faster than trying to align extremely long sequences, not to mention accuracy often suffers for very long alignments.
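As a rough sketch of the preparation step for that route: summary methods of this kind take a single file of gene trees in Newick format, so the per-group trees need to be collected into one file. The helper name `collect_gene_trees` and the file names below are hypothetical, and the sketch assumes each gene tree sits in its own file with no whitespace inside taxon labels. The exact ASTRAL invocation should be checked against its documentation.

```python
import tempfile
from pathlib import Path


def collect_gene_trees(tree_paths, output_path):
    """Write one Newick tree per line into a single file; return the count."""
    count = 0
    with open(output_path, "w") as out:
        for path in tree_paths:
            newick = Path(path).read_text().strip()
            if not newick:
                continue  # skip empty files rather than writing blank lines
            # Remove all whitespace so each tree occupies exactly one line
            # (assumes taxon labels contain no spaces).
            out.write("".join(newick.split()) + "\n")
            count += 1
    return count


# Demo with two toy gene trees written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, nwk in enumerate(["((A,B),C);", "((A,C),\nB);"]):
        p = Path(tmp) / f"gene{i}.nwk"
        p.write_text(nwk + "\n")
        paths.append(p)
    combined = Path(tmp) / "all_gene_trees.nwk"
    n_trees = collect_gene_trees(paths, combined)
    combined_lines = combined.read_text().splitlines()

print(n_trees, combined_lines)
```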

ADD REPLYlink written 9 months ago by jrj.healey8.5k

You could try using MAFFT.

ADD REPLYlink written 9 months ago by popayekid5540
The (United States) wrote 9 months ago:

The MUSCLE manual says the following:

"2.3 Large alignments. If you have a large number of sequences (a few thousand), or they are very long, then the default settings of MUSCLE may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to T-Coffee and speeds much faster than CLUSTALW. This is done by the option -maxiters 2, as in the following example.

muscle -in seqs.fa -out seqs.afa -maxiters 2
"

ADD COMMENTlink written 9 months ago by The90

I ran MUSCLE with -maxiters 2, but the process was killed automatically. I am using a computer with 64 GB of RAM.
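A back-of-the-envelope estimate may explain why the process gets killed. It assumes, purely for illustration, that aligning two sequences of this length needs a full quadratic dynamic-programming table at one byte per cell. How MUSCLE actually allocates memory is not stated in this thread, so treat the numbers as an order-of-magnitude guess only.

```python
def dp_matrix_bytes(len_a, len_b, bytes_per_cell=1):
    """Memory for a full (len_a + 1) x (len_b + 1) dynamic-programming table."""
    return (len_a + 1) * (len_b + 1) * bytes_per_cell


avg_len = 620_513          # average sequence length from the question
ram_bytes = 64 * 1024**3   # 64 GB machine, taken here as 64 GiB

needed = dp_matrix_bytes(avg_len, avg_len)
print(f"full table: {needed / 1024**3:.0f} GiB, available: {ram_bytes / 1024**3:.0f} GiB")
print("fits in RAM:", needed <= ram_bytes)
```

Under that assumption, a single pairwise table is already several times larger than the available RAM, which points back to aligning each gene separately.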

ADD REPLYlink written 9 months ago by majeedaasim20
jrj.healey (United Kingdom) wrote 9 months ago:

I've tested:

  • MUSCLE
  • T-COFFEE
  • Kalign
  • MAFFT
  • LASTZ
  • DIALIGN
  • PASTA

With varying degrees of accuracy/quality.

I've also aligned sequences of up to about 30 kb with CLUSTALO in the past, which seemed to work reasonably well.

Kalign and LAST are specifically intended for long sequences though, so start there.

ADD COMMENTlink modified 9 months ago • written 9 months ago by jrj.healey8.5k

This doesn't answer the question as such, since it's only pairwise, but MUMmer will also handle long sequences.

ADD REPLYlink written 9 months ago by jrj.healey8.5k