Hello folks, I am seeking advice on a multiple sequence alignment (MSA) that I am trying to build.
The FASTA file I am trying to align contains SARS-CoV-2 Spike protein sequences; it has nearly 600k sequences ranging from 1270 to 1275 aa in length. I have aligned them with clustalo and mafft using default parameters, and the resulting alignment extends to about 60k aa. I tried passing the --globalpair flag to mafft, but it fails every time. It takes more than 5 days to align with 80 threads and 400 GB RAM.
Lately, I came across a tool called MAGUS (GitHub and article are here), which looks pretty impressive to me. In short, it splits the FASTA file into chunks, aligns each chunk, and then merges the resulting sub-alignments. MAGUS takes about 2 days to produce a result (with the same hardware configuration). However, with MAGUS my alignment still stretches to around 15k.
So there are some sequences that are messing up my alignment. Usually, when I come across an issue like this, I eyeball the alignment file in an MSA viewer, find the culprit sequences, and discard them, but this is really hard when you have 600k sequences.
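A rough sketch of how that eyeballing might be automated: count, for each sequence, how many residues it places in columns that are gaps in almost every other sequence, since those are the columns that stretch the alignment. The function name `insertion_culprits` and the `max_occupancy` cutoff are hypothetical choices for illustration, not part of MAGUS or mafft, and the cutoff would need tuning on real data.

```python
import numpy as np

def insertion_culprits(records, max_occupancy=0.01):
    """Rank sequences by how many residues they place in sparse columns.

    records: list of (name, aligned_sequence) tuples of equal length.
    max_occupancy: a column counts as "sparse" when at most this fraction
    of sequences has a residue there (arbitrary default, an assumption).
    """
    # One byte per character keeps memory use modest for large alignments.
    arr = np.array([np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
                    for _, seq in records])
    nongap = arr != ord("-")
    occupancy = nongap.mean(axis=0)           # fraction of residues per column
    sparse = occupancy <= max_occupancy       # columns filled by very few sequences
    counts = nongap[:, sparse].sum(axis=1)    # residues each sequence puts there
    order = np.argsort(-counts, kind="stable")
    return [(records[i][0], int(counts[i])) for i in order if counts[i] > 0]
```

Sequences at the top of the returned list are the ones responsible for the most insertion columns, which makes them candidates for a closer look rather than automatic removal.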
What I am trying at the moment:
MAGUS produced 100 subset alignments, which it merges at the final stage. I took these 100 alignment files.
- Check the length of each alignment.
- If the alignment length is anything longer than my threshold (1290), filter it.
- Build a pairwise identity matrix.
- Find the least identical sequences in each column of the matrix and rank them, so that the top-ranked sequence is the most divergent one in that alignment file.
- Open that subset alignment file in an MSA viewer and compare the highest-ranking sequences.
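The steps above could be sketched roughly as follows. This is a minimal sketch under several assumptions: the subset alignments are plain FASTA files under a hypothetical `subsets/` directory, "filter" is read as setting aside the over-long subsets for inspection, identity is computed only over columns where both sequences have a residue, and ranking by mean identity to all other sequences is one possible reading of the ranking step. All function names are made up for illustration.

```python
import glob
import numpy as np

MAX_COLUMNS = 1290  # alignment-length threshold from the post

def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def alignment_length(records):
    """Number of columns; raises if the sequences are not all the same length."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; not an alignment")
    return lengths.pop()

def pairwise_identity(records):
    """n x n matrix of identities over columns where both sequences have a residue."""
    arr = np.array([list(seq.upper()) for _, seq in records])
    nongap = arr != "-"
    n = len(records)
    ident = np.zeros((n, n))
    for i in range(n):
        both = nongap & nongap[i]
        same = (arr == arr[i]) & both
        ident[i] = same.sum(axis=1) / np.maximum(both.sum(axis=1), 1)
    return ident

def rank_divergent(records, top=10):
    """Sequences with the lowest mean identity to all others, most divergent first."""
    ident = pairwise_identity(records)
    np.fill_diagonal(ident, np.nan)  # ignore self-comparisons
    mean_identity = np.nanmean(ident, axis=1)
    order = np.argsort(mean_identity, kind="stable")
    return [(records[i][0], float(mean_identity[i])) for i in order[:top]]

if __name__ == "__main__":
    for path in sorted(glob.glob("subsets/*.fasta")):  # hypothetical location
        records = read_fasta(path)
        length = alignment_length(records)
        if length > MAX_COLUMNS:
            print(path, length)
            for name, score in rank_divergent(records):
                print(f"  {name}\t{score:.3f}")
```

The identity loop is quadratic in the number of sequences, so it is only practical per subset, not on the full 600k-sequence alignment.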
Now, I am not sure if this is gonna work, and it is also gonna take some time. I just wanted to drop by here and ask if there is any advice I can get.