Multiple Sequence Alignment for Full-Length Genomes
3.8 years ago
amer_ghl • 0

Hello everyone,

I am trying to align full-length coronavirus genomes. I have 1800 sequences, each about 30,000 nt (roughly 50 Mb in total). I tried the web servers for MAFFT, MUSCLE, and Clustal Omega, but they only work for small datasets (4 Mb at most); with anything larger, they crash.

I would be thankful for any recommendations or suggestions.

RNA-Seq alignment next-gen • 2.2k views

There aren't really any good options for large-scale multiple genome alignment; this is still something of an unsolved computational challenge.

That said, you could take a look at mugsy or LASTZ, which can handle larger data but, in my experience, make pretty crappy alignments. Several tens of kilobases is pretty much the limit for most tools.

What exactly is your end goal? There may be a simpler orthogonal way you could approach the task.


Thanks for your reply. I am trying to identify all the possible mutations in the genome of the coronavirus. I checked some similar studies, but their samples were way smaller than mine (around 300 genomes).

I just thought it would be better to explore the mutations on a bigger sample.

Any ideas?


Just map each one to the reference and find the mutations; there's no need for an MSA.


We are talking about 1800 sequences, so how many individual alignments will you need? :)


Just choose one reference; this one is commonly used: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/. Compare every other sequence to it and call mutations. Should be a nice exercise.


What Asaf said. Unless the genomes are really close, MSA will introduce alignment artefacts anyway. What you can do, relatively easily, is multiple pairwise alignment, e.g. with mummer or similar tools that others have suggested, and compare all the sequences to a particular reference.
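A rough sketch of the "compare each sequence to the reference" step, assuming you already have a gapped pairwise alignment of one genome against the reference (for example, exported from whichever aligner you end up using). The function name `call_mutations` and the choice to skip `N` in the query are my own assumptions, not something taken from any of the tools mentioned here. With one reference, 1800 genomes means 1800 pairwise comparisons rather than an all-against-all problem.

```python
def call_mutations(ref_aln: str, qry_aln: str):
    """Compare a gapped reference/query pair column by column.

    Returns (ref_position, ref_base, query_base) tuples. Positions are
    1-based on the ungapped reference. '-' marks a gap, so an insertion
    is reported at the position of the preceding reference base.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    mutations = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1          # advance only on real reference bases
        if r == q or q == "N":    # identical column, or ambiguous query base
            continue
        mutations.append((ref_pos, r, q))
    return mutations


# Toy example: one substitution, one insertion, one deletion.
for pos, ref_base, alt_base in call_mutations("ACGT-ACGT", "ACTTGAC-T"):
    print(pos, ref_base, alt_base)
```

Running this over every genome and tallying the tuples (e.g. with `collections.Counter`) would give a per-position mutation count without ever building a full MSA.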


These are coronavirus genomes, so they should be very close. Many may be redundant at the sequence level, so the number can be culled down to something smaller.
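One simple way to cull redundancy is to collapse exactly identical sequences before aligning. This is only a sketch: `read_fasta` and `collapse_identical` are hypothetical helper names, and "redundant" is assumed here to mean byte-for-byte identical after upper-casing, which is stricter than clustering by similarity.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Group record names by exact (case-insensitive) sequence identity."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    return groups


# Toy input; for a real file, pass an open file handle to read_fasta().
toy = [">a", "ACGT", "ACGT", ">b", "acgtacgt", ">c", "ACGA"]
groups = collapse_identical(read_fasta(toy))
print(len(groups), "unique sequences")
```

You would then align one representative per group and keep the name lists to map results back to the original genomes.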


You can give minimap2 a try, and also mummer.


minimap2 is not going to generate a multiple sequence alignment.


Yeah, I didn't read the question through.


Not sure if it does multiple sequence alignment, but you could have a look at https://github.com/genotoul-bioinfo/dgenies


Two additional options. Not that you don't have many already.

  1. Use the method that NextStrain folks are using to generate their genome alignments for SARS: https://nextstrain.org/help/general/about-nextstrain
  2. Since this is a small enough genome, Mauve may work as well.

NC_045512.2 (linked earlier in the thread) is designated as the RefSeq genome in NCBI.


Another option:

Rob Lanfear has a repo up where he has already run the MSA https://github.com/roblanf/sarscov2phylo/

He uses MAFFT too in this script https://github.com/roblanf/sarscov2phylo/blob/master/scripts/global_profile_alignment.sh
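For running MAFFT locally instead of through the web server, here is a minimal sketch that adds the genomes to a single reference rather than building a de novo MSA. It assumes MAFFT is installed and on your PATH, and it uses the `--thread`, `--keeplength`, and `--add` options as I understand them from the MAFFT manual; the linked script may use different settings, so check both before relying on this. The file names are placeholders.

```python
import subprocess


def mafft_add_command(reference, queries, threads=1):
    """Build a MAFFT command that adds `queries` to `reference`.

    --keeplength is meant to keep the alignment at the reference's length,
    so that columns correspond to reference coordinates.
    """
    return ["mafft", "--thread", str(threads), "--keeplength",
            "--add", str(queries), str(reference)]


def run_mafft(reference, queries, output, threads=1):
    """Run MAFFT and write the alignment (its stdout) to `output`."""
    cmd = mafft_add_command(reference, queries, threads)
    with open(output, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


# Print the command that would be run; call run_mafft(...) to execute it.
print(" ".join(mafft_add_command("reference.fasta", "genomes.fasta", threads=4)))
```

Because insertions relative to the reference are dropped under this assumption, this suits mutation surveys anchored on one reference better than it suits studies of novel insertions.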


Have you considered doing MSA on specific regions of interest instead? For example, you could run an MSA for each annotated protein-coding region in the reference genome across your 1800 genomes. I think the protein-coding regions would be smaller than the size limit of most tools, and you could also run these MSAs in parallel.

3.8 years ago
amer_ghl • 0

Thanks, everyone, for the interesting ideas. It seems MAFFT does the job well!
