Question: Split A Multiple Sequence Alignment (Possibly A Programming Challenge)
2
gravatar for Lee Katz
8.7 years ago by
Lee Katz2.9k
Atlanta, GA
Lee Katz2.9k wrote:

I have a very large multiple sequence alignment from MAUVE, which represents a large homologous region between several genomes. I want to fine-tune this alignment using MUSCLE or ClustalW. However, both of these programs cannot handle the multiple sequence alignment because it is so large. Is there a way to split this multiple sequence alignment into smaller pieces?

The split should be safe. In other words, split the multiple sequence alignment along homologous regions. I'm not sure if there is a standard algorithm or program to do this already, or if someone needs to write an innovative subroutine/function for it. Thank you for your help.

* Edit * Is it possible that the MAUVE command line option --max-backbone-gap or the option --max-gapped-aligner-length is my answer? --max-gapped-aligner-length? That MAUVE has a way of limiting the size of each LCB in the first place? The documentation doesn't explain it well though...

MAUVE CLI documentation: http://asap.ahabs.wisc.edu/mauve-aligner/mauve-user-guide/using-mauvealigner-from-the-command-line.html

clustalw multiple programming • 3.2k views
ADD COMMENTlink modified 8.7 years ago by Jarretinha3.3k • written 8.7 years ago by Lee Katz2.9k
2

How large is your sequence? It will be nice to know the round figure so people here can know your challenges.

ADD REPLYlink written 8.7 years ago by Thaman3.2k

How large is your sequence?

ADD REPLYlink written 8.7 years ago by Thaman3.2k

I guess you should split on syntenic regions. How large is very large?

ADD REPLYlink written 8.7 years ago by Dave Gerrard190

Have you tried to use Exonerate instead of MUSCLE or Clustal?

ADD REPLYlink written 8.7 years ago by Jarretinha3.3k

One of my large alignments is 86636 nucleotides, 1.4 megabytes. Another one is 861 kilobytes.

I have not tried Exonerate yet but guessing from "exon" it is not exactly what I want.

ADD REPLYlink written 8.7 years ago by Lee Katz2.9k

Exonerate is a general purpose aligner aimed at genome projects. You can even use it for gene models. By the way, the sizes of your aligments are small. MUSCLE should work . . .

ADD REPLYlink written 8.7 years ago by Jarretinha3.3k

Well, not so small . . .

ADD REPLYlink written 8.7 years ago by Jarretinha3.3k

I guess you should split on syntenic regions. How large is very large?

ADD REPLYlink written 8.7 years ago by Dave Gerrard190
1
gravatar for Lee Katz
8.7 years ago by
Lee Katz2.9k
Atlanta, GA
Lee Katz2.9k wrote:

My solution:

# split a large MSA into smaller chunks

use Bio::Perl;
use Bio::AlignIO;
use Data::Dumper;
use List::Util qw[min max];
use strict;
use warnings;

exit(main(@ARGV));

sub main{
  my($lcbFile,$format)=@_;
  my @aln=splitLcb($lcbFile,$format);

  # write the split alignment
  for(my $i=0;$i<@aln;$i++){
    my $outFile="$lcbFile.part$i";
    print "Writing $outFile\n";
    my $out=Bio::AlignIO->new(-format=>"fasta",-file=>">$outFile");
    $out->write_aln($aln[$i]);
  }
  return 0;
}

# only looks at a single MSA in the input file (does not accept more than one aln in an aln file)
sub splitLcb{
  my($lcb,$format)=@_;
  # look for a set of xx nt that are homologous and split in the middle.
  my $minHomologousNucleotides=499;

  my @slicedAln; # result
  my $in=Bio::AlignIO->new(-file=>$lcb,-format=>$format);
  my $aln=$in->next_aln;

  # walk along the LCB to find a region of 100% identity
  my $alnLength=$aln->length;
  my $matchLine=$aln->match_line;

  # use the match line to find regions of homology
  SEARCH_FOR_HOMOLOGY: for(my $col=$minHomologousNucleotides*2;$col<$alnLength;$col++){
    if(is_col_identical($matchLine,$col)){
      # now see if the next xx nts are identical too
      my $length=min($alnLength,$col+$minHomologousNucleotides-1);
      for($col=$col;$col<$length;$col++){
        if(!is_col_identical($matchLine,$col)){
          next SEARCH_FOR_HOMOLOGY;
        }
      }
      my $homologyStart=($col-$minHomologousNucleotides+1);
      my $homologyStop=$col-1;
      print "Region of homology is $homologyStart - $homologyStop, and the minHomologousNucleotides=$minHomologousNucleotides, alignmentLength=$alnLength\n";

      # slice the alignment into two chunks
      $homologyStart++;$homologyStop++; # base-1 coordinates
      my $sliceCoordinate=int ($minHomologousNucleotides/2+$homologyStart);
      $slicedAln[0]=$aln->slice(1,$sliceCoordinate);
      $slicedAln[1]=$aln->slice($sliceCoordinate+1,$alnLength);
      last SEARCH_FOR_HOMOLOGY;
    }
  }
  return (@slicedAln);
}

sub is_col_identical{
  my($matchLine,$col)=@_;
  my $t=substr($matchLine,$col,1);
  if($t eq '*'){
    return 1;
  }
  return 0;
}
ADD COMMENTlink modified 11 days ago by RamRS24k • written 8.7 years ago by Lee Katz2.9k
0
gravatar for Jarretinha
8.7 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

Hi Lee,

I don't see the point in splitting your MSA. MUSCLE and CLUSTAL can be compiled to run in 64 bit machines. So, your limit will be your memory (RAM + swap). Right now I'm running MUSCLE on a dozen of 150 kbp Herpesviridae genomes. It consumed all my RAM and a large piece of my swap. But it's running!

Nevertheless, I think you are in a much more delicate situation. Mauve is quite good in detecting indels/inversions/translocations and similar genome variations. You shouldn't use a progressive MSA program to refine your MSA at the cost of introducing much more gaps than necessary.

ADD COMMENTlink written 8.7 years ago by Jarretinha3.3k

I want to use MUSCLE on a MAUVE LCB because MAUVE gives a coarse alignment, and MUSCLE (or clustal) can give a more accurate alignment on a local level.

I need to split the LCB (mauve output) because it is extremely slow right now. It's been running since last week on two out of 70 of the LCBs. However, if I split the larger LCBs, then MUSCLE should run faster.

I don't believe I am introducing any gaps by splitting the LCB based on a region of homology, and so I don't think that I am introducing any inaccuracies.

ADD REPLYlink written 8.7 years ago by Lee Katz2.9k

You should have said that you needed a speed optimization. In my experience with mauve (and progressiveMauve) if you are certain that you don't have large scale events in your genome the --collinear option with the right matrix will give results as good as MUSCLE. On the other hand, if do have inversions or similar MUSCLE will mess up your alignment. Anyway, if you genome segment is all collinear the solution is trivial: just split equally or randomly. If not, you'll to find collinear boundaries(your anchors). You're using homology. A region could be homologous without being collinear.

ADD REPLYlink written 8.7 years ago by Jarretinha3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1528 users visited in the last hour