Question: Multiple Sequence Alignment for Full length genomes
5 months ago, amer_ghl wrote:

Hello everyone,

I am trying to align full-length coronavirus genomes. I have 1,800 sequences, each about 30,000 nt long (roughly 50 Mb in total). I tried the web servers of MAFFT, MUSCLE, and Clustal Omega, but they only work for small datasets (4 Mb at most); with anything larger, they crash.

I would be thankful for any recommendations or suggestions.

rna-seq next-gen alignment • 226 views
modified 5 months ago • written 5 months ago by amer_ghl

There aren't really any good options for large-scale multiple genome alignment; this is still something of an unsolved computational challenge.

That said, you could take a look at mugsy or LASTZ, which can handle larger data but, in my experience, make pretty crappy alignments. Several tens of kilobases is pretty much the limit for most tools.

What exactly is your end goal? There may be a simpler orthogonal way you could approach the task.

modified 5 months ago • written 5 months ago by Joe

Thanks for your reply. I am trying to identify all the possible mutations in the genome of the coronavirus. I checked some similar studies, but their samples were way smaller than mine (around 300 genomes).

I just thought it would be better to explore the mutations on a bigger sample.

Any ideas?

written 5 months ago by amer_ghl

Just map each one to the reference and find the mutations; there is no need for an MSA.

written 5 months ago by Asaf

We are talking about 1,800 sequences, so how many individual alignments will you need? :)

written 5 months ago by amer_ghl

Just choose one reference. This one is commonly used: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/ Compare every other sequence to it and call mutations. Should be a nice exercise.
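To make the "compare to the reference and call mutations" step concrete, here is a minimal Python sketch (an illustration, not code from the original reply). It assumes each genome has already been pairwise-aligned to the reference by some external aligner, and that the two aligned strings, with gaps written as `-`, are at hand. The function name `call_mutations` and the output notation are assumptions made for this example; an `N` in the query is treated as uncalled and skipped.

```python
def call_mutations(ref_aln: str, qry_aln: str) -> list[str]:
    """List differences between two aligned sequences in 1-based reference coordinates.

    Both strings must have the same length and use '-' for gaps.
    Notation (illustrative only): 'G3C' = substitution at reference position 3,
    'del7G' = reference base 7 deleted, 'ins4T' = T inserted after reference position 4.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    mutations = []
    ref_pos = 0  # number of reference bases consumed so far
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == q:
            continue  # match, or a gap in both sequences
        if r == "-":
            mutations.append(f"ins{ref_pos}{q}")
        elif q == "-":
            mutations.append(f"del{ref_pos}{r}")
        elif q != "N":  # assumption: N means "no call", not a mutation
            mutations.append(f"{r}{ref_pos}{q}")
    return mutations


if __name__ == "__main__":
    print(call_mutations("ACGT-ACGT", "ACCTTAC-T"))
```

Running this once per genome against the same reference gives one mutation list per sequence, and those lists can then be tallied across all 1,800 genomes without ever building a full MSA.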

written 5 months ago by Asaf

What Asaf said. Unless the genomes are really close, MSA will introduce alignment artefacts anyway. What you can do, relatively easily, is multiple pairwise alignment, e.g. with mummer or similar tools that others have suggested, and compare all the sequences to a particular reference.

written 5 months ago by Joe

These are coronavirus genomes, so they should be very close. Many may be redundant at the sequence level, so that number can be culled down to something smaller.
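A rough sketch of that culling step, assuming "redundant" means exactly identical sequences (ignoring case) and that the genomes are in plain FASTA. The helper names `read_fasta` and `collapse_identical` are made up for this example. Near-identical genomes would need a clustering tool instead.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Map each distinct sequence to the names of all records that carry it."""
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return groups


if __name__ == "__main__":
    demo = [">a", "ACGT", "AC", ">b", "acgtac", ">c", "ACGTAA"]
    for seq, names in collapse_identical(read_fasta(demo)).items():
        print(len(names), names[0], seq)
```

Keeping one representative per group, plus the list of names it stands for, lets any mutation found in the representative be credited back to every identical genome.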

modified 5 months ago • written 5 months ago by genomax

You can give minimap2 a try, and also mummer.

written 5 months ago by Asaf

minimap2 is not going to generate a multiple sequence alignment.

written 5 months ago by genomax

Yeah, didn't read the question through

written 5 months ago by Asaf

I'm not sure whether it does multiple sequence alignment, but you could have a look at https://github.com/genotoul-bioinfo/dgenies

written 5 months ago by lieven.sterck

Two additional options. Not that you don't have many already.

  1. Use the method that NextStrain folks are using to generate their genome alignments for SARS: https://nextstrain.org/help/general/about-nextstrain
  2. Since this is a small enough genome, Mauve may work as well.

This is designated as the RefSeq Genome in NCBI.

modified 5 months ago • written 5 months ago by genomax

Another option:

Rob Lanfear has a repo up where he has already run the MSA: https://github.com/roblanf/sarscov2phylo/

He also uses MAFFT, in this script: https://github.com/roblanf/sarscov2phylo/blob/master/scripts/global_profile_alignment.sh
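For orientation, one common way to run MAFFT locally on many closely related genomes is to add them to a single reference with its `--add` and `--keeplength` options. The sketch below only illustrates that general pattern and does not reproduce the linked script. The file names are hypothetical, and the wrapper names `mafft_add_command` and `run_mafft` are made up for this example.

```python
import shutil
import subprocess


def mafft_add_command(reference_fasta, new_fasta, threads=1):
    """Build a MAFFT command that adds sequences to an existing reference."""
    return [
        "mafft",
        "--thread", str(threads),
        "--keeplength",      # keep reference length; insertions in added sequences are dropped
        "--add", new_fasta,  # sequences to add
        reference_fasta,     # existing alignment (here, a single reference genome)
    ]


def run_mafft(reference_fasta, new_fasta, out_fasta, threads=1):
    """Run MAFFT if it is installed; MAFFT writes the alignment to stdout."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft not found on PATH")
    with open(out_fasta, "w") as out:
        subprocess.run(
            mafft_add_command(reference_fasta, new_fasta, threads),
            stdout=out,
            check=True,
        )


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(mafft_add_command("reference.fasta", "genomes.fasta", threads=4)))
```

Because `--keeplength` discards insertions relative to the reference, the output stays in reference coordinates. That is convenient for listing mutations, but insertions have to be looked for separately.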

written 5 months ago by Philipp Bayer

Have you considered doing the MSA on specific regions of interest? For example, you could align each annotated protein-coding region of the reference genome against the corresponding region in your 1,800 genomes. I think the protein-coding regions would be smaller than the size limit of most software, and you can also run these MSAs in parallel.

written 5 months ago by manaswwm
5 months ago, amer_ghl wrote:

Thanks, everyone, for the interesting ideas. It seems that MAFFT does the job well!

written 5 months ago by amer_ghl