MSA of very long sequences?
3
0
4.7 years ago
majeedaasim ▴ 60

I have 9 very long sequences with an average length of 620,513. I need to align them for phylogenetic analysis. How can I align such long sequences?

Thanks

MSA VERY LONG SEQUENCES • 4.2k views
1

Are you certain that those sequences are phylogenetically related, so that an MSA can be logically constructed? If you are not sure, trying to align the sequences may produce a meaningless alignment.

You could use a program like Mauve to check whether the sequences are related (i.e. that there are no rearrangements, etc.) before trying the MSA.

0

Yes, I am certain, because I obtained these through orthology detection tools. I obtained 356 genes for each species, so there are 356 orthologous groups in which each species contributes no more than one gene. In other words, each group contains a single-copy ortholog across all the species. After that I merged all the single copy orthologs of a species to create a single super sequence for each species. That is why these sequences are so long. I did this to generate a species tree rather than a gene tree.

3

If you merge different genes and then align them, regions at the boundaries between genes will accidentally be aligned to the next or previous, unrelated gene. Those regions are just "noise" and disturb the downstream analysis. (If the genes are ordered according to their positions on the chromosomes, the story may be different.)

Thus the procedure should not be

merge -> align

but rather

align -> merge

in my opinion.
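To make the align -> merge idea concrete, here is a minimal sketch (not a definitive implementation) of the merge step: it joins already-aligned per-gene FASTA alignments column-wise into one supermatrix. The function names `read_fasta` and `concatenate_alignments` are hypothetical, and the sketch assumes that the first token of each FASTA header is the species name and that a species missing from a group should be padded with gaps.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} mapping.

    Assumption: the first whitespace-delimited token of each header
    is the species name.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def concatenate_alignments(gene_alignments, species):
    """Join per-gene alignments column-wise into one supermatrix.

    gene_alignments: list of (gene_name, {species: aligned_sequence})
    species: list of species names, in the desired output order
    Returns (supermatrix, partitions); partitions hold 1-based,
    inclusive column ranges for each gene.
    """
    pieces = {sp: [] for sp in species}
    partitions = []
    start = 1
    for gene, alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        width = lengths.pop()
        for sp in species:
            # Pad with gaps if a species is absent from this group.
            pieces[sp].append(alignment.get(sp, "-" * width))
        partitions.append((gene, start, start + width - 1))
        start += width
    supermatrix = {sp: "".join(chunks) for sp, chunks in pieces.items()}
    return supermatrix, partitions
```

Because every gene is aligned on its own first, no column of the supermatrix ever pairs residues from two different genes, and the recorded column ranges can be reused if per-gene partitions are wanted later.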

1

Very good point. Merge the individual gene alignments!

1

Consider using ASTRAL-II to infer a species tree from gene trees of all the ortholog groups instead. It may well be faster than trying to align extremely long sequences, not to mention accuracy often suffers for very long alignments.

0

"After that I merged all the single copy orthologs of a species to create a single super sequence for each species."

By doing what you had asked in the other thread (How to merge multifasta sequence into a single sequence having only one header?)? I am not sure how you can do a meaningful phylogenetic analysis by concatenating the sequences of multiple genes into a single artificial sequence for each species.

1

I think I've heard of people doing concatenated multifasta alignments before. I wouldn't like to vouch for how good an idea it is, but I think it's somewhat accepted (presumably the sequences are reasonably similar anyway, as they were probably identified as orthologs with like a 70% nt ID or something).

0

"orthologs with like a 70% nt ID"

That is the critical piece. Hopefully OP has done the due diligence.

One could still build a phylogeny by incorporating the species/gene_names in the headers and keeping the sequences separate. It would be an interesting way to see whether the identified orthologs follow a logical pattern or whether there are mistakes.

1

You could try using MAFFT.

0

BWA is a good option

0

It is an aligner; I need to perform MSA (multiple sequence alignment) for phylogenetic analysis.

0

However, given the length of the sequences, I guess it would be too much for most existing tools unless they are run on a high-capacity computer cluster.

1
4.7 years ago
Joe 20k

I've tested:

• MUSCLE
• T-COFFEE
• Kalign
• MAFFT
• LASTZ
• DIALIGN
• PASTA

With varying degrees of accuracy/quality.

I've also done up to about 30 kb with CLUSTALO in the past, which seemed to work reasonably well.

Kalign and LAST are specifically intended for long sequences though, so start there.

0

This doesn't answer the question as such, since it's only pairwise, but MUMmer will also handle long sequences.

0

For some 30k-60k sequences, comparing Kalign, MAFFT and CLUSTALO, MAFFT with the following options performed best:

mafft --retree 1 --maxiterate 0 in.fa > out.aln
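If the align -> merge route suggested earlier in the thread is taken, the same command can simply be run once per ortholog group. A minimal sketch, assuming `mafft` is on the PATH and that each group sits in its own `*.fa` file; the directory layout and the helper names `mafft_command` and `align_each_group` are hypothetical:

```python
import subprocess
from pathlib import Path


def mafft_command(fasta_path):
    """Build the MAFFT call quoted above for one input file."""
    return ["mafft", "--retree", "1", "--maxiterate", "0", str(fasta_path)]


def align_each_group(input_dir, output_dir):
    """Run MAFFT on every *.fa file in input_dir.

    MAFFT writes the alignment to stdout, so stdout is redirected
    to <output_dir>/<group>.aln. Returns the list of output paths.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for fasta in sorted(Path(input_dir).glob("*.fa")):
        target = out_dir / (fasta.stem + ".aln")
        with open(target, "w") as handle:
            # check=True raises CalledProcessError if MAFFT fails.
            subprocess.run(mafft_command(fasta), stdout=handle, check=True)
        outputs.append(target)
    return outputs
```

A call such as `align_each_group("groups", "aligned")` would then leave one alignment per group, each of which is only gene-sized rather than 600 kb long.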

0
4.7 years ago
The ▴ 180

The MUSCLE documentation mentions the following:

"2.3 Large alignments. If you have a large number of sequences (a few thousand), or they are very long, then the default settings of MUSCLE may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to T-Coffee and speeds much faster than CLUSTALW. This is done by the option -maxiters 2, as in the following example.

muscle -in seqs.fa -out seqs.afa -maxiters 2

"

0

I ran MUSCLE with -maxiters 2, but the process was killed automatically. I am using a computer with 64 GB of RAM.
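A rough back-of-the-envelope check suggests why a kill like this is plausible. The sketch below estimates the memory for a naive, full dynamic-programming matrix between two sequences of the average length given in the question. This is only an illustration: the 4 bytes per cell and the full-matrix layout are assumptions, and it does not describe how MUSCLE actually allocates memory.

```python
def full_dp_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Bytes needed for a full (len_a + 1) x (len_b + 1) DP matrix.

    Assumption: one fixed-size score per cell and no banding or
    linear-space tricks, which real aligners may well use.
    """
    return (len_a + 1) * (len_b + 1) * bytes_per_cell


avg_len = 620513             # average sequence length from the question
ram_bytes = 64 * 1024**3     # 64 GB of RAM, as stated above

needed = full_dp_matrix_bytes(avg_len, avg_len)
print(f"naive full matrix: {needed / 1024**3:.0f} GiB")
print(f"fits in RAM: {needed <= ram_bytes}")
```

Under these assumptions a single pairwise matrix is already on the order of a terabyte, far beyond 64 GB, which is another argument for aligning the genes individually.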

0
8 months ago
Enhancer • 0

Just as fishgolden stated, the alignment should have come before concatenation. I am not sure whether aligning before concatenation and aligning after concatenation would produce the same supermatrix. I once faced this choice and opted to align before concatenating; I think that gives a higher level of certainty than aligning very long sequences, which would be prone to error.