Mummer To Viewable Alignment Format (Fasta Or Aln...)
2
2
Entering edit mode
13.2 years ago
Yannick Wurm ★ 2.5k

Hi,

I'm aligning pairs of genomic scaffolds, looking for 1000bp-size insertions/deletions. Its easy to get a dotplot for an overview. It's also easy to find SNPs at a small-scale. But I need to find putative insertions and subsequently confirm them in the lab. Thus I need something I can open in Jalview or something else, where it is easy to see the insertions and to copy-paste the relevant sequences into something for primer design.

I can get such alignments in useable time with clustal (sorry), but not if both sequences are 1MB or bigger. Mummer is super fast, but has a weird output format. Anything else I could use that is fast but gives a user-friendly output? (BLAST would be great, but it divides the aligning bits up into multiple HSPs instead of one long contiguous alignment).

Cheers yannick

genomics alignment • 8.6k views
ADD COMMENT
1
Entering edit mode

Would the 'show-aligns' command in Mummer package help? I used that for relative small regions. Btw, from the Mummer output (like 'show-snps') you can get snp/indel positions with flanking regions. I used that for grabbing sequences and pipe into primer3 to get primers.

ADD REPLY
0
Entering edit mode

Hi Vitis & thanks for the suggestion. Yes, for small inserts it works ... but I'm interested in multiple-kb insertions. Those are split into different "aligns" by mummer...

ADD REPLY
4
Entering edit mode
13.2 years ago
ALchEmiXt ★ 1.9k

You can parse the 'coords' file of MUMmer (which you can get using the --coords option) into a so-called BLAST-crunch file. That file can directly be read into for instance Artemis Comparsion Tool (ACT of Sanger see their website).

An example on the galaxy platform we have posted a while ago. here where you can just grab the perl nucmer_coords2ACT_galaxy.pl file.

Basically all it does is to calculate a score. For example:

while (<COORDS>)
    {
    unless ($_ =~ /^(\s*)\d/){next}
    $_ =~ s/\|//g;

    my @f = split;
          # create crude match score = ((length_of_match * %identity)-(length_of_match * (100 - %identity))) /20
    my $crude_plus_score=($f[4]*$f[6]);
    my $crude_minus_score=($f[4]*(100-$f[6]));
    my $crude_score=  int(($crude_plus_score  - $crude_minus_score) / 20);
          # reorganise columns and print crunch format to stdout
          # score        %id   S1    E1    seq1  S2    E2    seq2  (description)
    print OUT " $crude_score $f[6] $f[0] $f[1] $f[7] $f[2] $f[3] $f[8] nucmer comparison coordinates\n"
    }
ADD COMMENT
0
Entering edit mode

thanks for the quick reply. can you copy-paste nucleotides out from ACT?

ADD REPLY
0
Entering edit mode

yes. The ACT and Artemis suites are meant for annotation/curation onto the single nt level: http://www.sanger.ac.uk/resources/software/artemis/ There is Artemis for single sequence and ArtemisComparisonTool (ACT) for multiple sequences.

ADD REPLY
0
Entering edit mode
3.4 years ago
krinsman • 0

I created a Python script for doing this, although it appears to be quite slow. The main bottleneck appears to be show-aligns, but I could be wrong about that.

The main/important function to use is delta_to_fasta. It requires Pandas and Biopython as dependencies. If it is going really slowly you can try turning on verbose=True.

Update/edit: Sometimes for really large query sequences (e.g. 25GB but not 500 MB), writing to file only occurs in chunks, so the output file won't be empty until after verbose mode says about 20-25 pairs of ref+query sequences should have already been written. I don't know what the cause or reason for this is. In any case the fact that it is streaming/uses generators means that you can see partial output as the writing progresses and get some output if it's canceled partway through, which is helpful given how incredibly slow this can be.

import subprocess
import re

import pandas as pd

from io import StringIO

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqIO import write

## NOTE: `mummer` command returns `.mum` format files, for which this script is powerless
## NOTE: Only `nucmer` or `promer` commands return files in required `.delta` format.
## NOTE: I learned this the hard way -- you don't have to.


def delta_file_to_dataframe(delta_filename, skiprows=4):
    """
    I believe '1' and 'R' always refer to reference sequences, and '2' and 'Q' to query sequences.
    Reference and query corresponding mostly to the order of input of the files to the nucmer or promer commands.
    """
    coords_file = subprocess.run(['show-coords', '-c', '-l', '-r', '-T', delta_filename],
                                  stdout=subprocess.PIPE).stdout.decode('utf-8')

    column_names =[ 'S1', 'E1', 'S2', 'E2', 'LEN 1', 'LEN 2', '% IDY', 'LEN R', 'LEN Q', 'COV R', 'COV Q', 'Ref Id', 'Query Id']
    dataframe = pd.read_table(StringIO(coords_file), skiprows=skiprows, index_col=False, header=None, names=column_names)
    return dataframe

def get_alignments_from_ids(delta_filename, ref_id, query_id):
    alignments = subprocess.run(['show-aligns', delta_filename, ref_id, query_id],
                                  stdout=subprocess.PIPE).stdout.decode('utf-8')
    # Note that no sorting is done by default for the output of `show-aligns`, so we _may_ assume
    # that the order of the matches is the same as their order of appearance in the deltafile

    # "Beginning delimiter" of every alignment in the `show-aligns` output
    begin_alignment_regex = '-- BEGIN alignment \[ (?P<ref_direction>[+\-])1 (?P<ref_start>[0-9]+) - (?P<ref_end>[0-9]+) \|' + \
    ' (?P<query_direction>[+\-])1 (?P<query_start>[0-9]+) - (?P<query_end>[0-9]+) \]\n\n'
    # "End delimiter" of every alignment in the `show-aligns` output
    end_alignment_regex = '\n\n--\s+END alignment \[ [+\-]1 [0-9]+ - [0-9]+ \| [+\-]1 [0-9]+ - [0-9]+ \]'

    # Goal is to capture everything between the begin alignment strings and the end alignment strings
    parse_regex = '(?s)'+begin_alignment_regex+'(?P<alignment_string>.*?)'+end_alignment_regex
    # FYI:    have to use (?s) at beginning to ensure '.' will also match new lines
    # See:    https://stackoverflow.com/questions/42302482/python-find-a-string-between-two-strings-repeatedly#comment116031644_42302556
    parsed_alignments = [match.groupdict() for match in re.finditer(parse_regex, alignments)]   

    # Now have a DataFrame with the first columns containing information about the
    # alignment and the last column containing everything in between one of the
    # "BEGIN alignment" and "END alignment" blocks of the `show-aligns` output
    parsed_alignments = pd.DataFrame(parsed_alignments)

    alignment_strings = list(parsed_alignments['alignment_string']) 

    parsed_alignments = parsed_alignments.drop(columns=['alignment_string'])

    ref_sequences, query_sequences = tuple(zip(*[clean_alignment_string(alignment_string) for alignment_string in alignment_strings]))

    parsed_alignments['ref_sequence'] = ref_sequences
    parsed_alignments['query_sequence'] = query_sequences
    return parsed_alignments

def extract_ref_and_query_seq(cleaned_lines):
    return re.findall('[0-9]+\s+(.*)\n[0-9]+\s+(.*)', cleaned_lines)[0]

def clean_alignment_string(alignment_string):
    # The line consisting of spaces, |'s, and numbers above each reference-query sequence pair in show-aligns output
    seqs_overline_regex = '[\n]?[\s0-9\|]+\n'
    # The line beneath each reference-query sequence pair in show-aligns output pointing to mismatches with ^'s
    seqs_underline_regex = '\n[\s\^]+[\n]?'
    # We REALLY need those question marks, or otherwise the matching will delete the first and last lines

    cleaned_seqs = re.findall('(?s)'+seqs_overline_regex+'(.*?)'+seqs_underline_regex, alignment_string)

    extracted_sequences = [extract_ref_and_query_seq(line) for line in cleaned_seqs]
    extracted_sequences = list(zip(*extracted_sequences))
    extracted_sequences = [''.join(extracted_sequence).upper() for extracted_sequence in extracted_sequences]
    ref_sequence, query_sequence = tuple(extracted_sequences)

    return ref_sequence, query_sequence


def get_seq_record(group, delta_filename, ref_sequences=True, query_sequences=True, ungap=False, verbose=False):

    assert ref_sequences or query_sequences, "The generator has nothing to do if neither reference nor query sequences are yielded"

    ref_id, query_id = group[0]

    match_information = group[1]
    percent_identities = list(match_information['% IDY'])
    # Note that no sorting is done by default for the output of `show-aligns`, so we _may_ assume
    # that the order of the matches is the same as their order of appearance in the deltafile
    alignments = get_alignments_from_ids(delta_filename, ref_id, query_id)

    for row_number in range(len(alignments.index)):
        if verbose:
            print("\tWriting match #{}".format(row_number+1))
        if ref_sequences:
            ref_start = alignments['ref_start'][row_number]
            ref_end = alignments['ref_end'][row_number]
            ref_direction = alignments['ref_direction'][row_number]
            ref_seq_id = '{}:{}-{}({})'.format(ref_id, ref_start, ref_end, ref_direction)

            pct_identity = percent_identities[row_number]
            ref_description = '{}% identity match to query'.format(pct_identity)

            ref_sequence = Seq(alignments['ref_sequence'][row_number])
            if ungap:
                ref_sequence = ref_sequence.ungap(".")

            ref_seqrecord = SeqRecord(ref_sequence, id=ref_seq_id, description=ref_description)
            yield ref_seqrecord

        if query_sequences:
            query_start = alignments['query_start'][row_number]
            query_end = alignments['query_end'][row_number]
            query_direction = alignments['query_direction'][row_number]
            query_seq_id = '{}:{}-{}({})'.format(query_id, query_start, query_end, query_direction)

            pct_identity = percent_identities[row_number]
            query_description = '{}% identity match to reference'.format(pct_identity)

            query_sequence = Seq(alignments['query_sequence'][row_number])
            if ungap:
                query_sequence = query_sequence.ungap(".")

            query_seqrecord = SeqRecord(query_sequence, id=query_seq_id, description=query_description)
            yield query_seqrecord

def delta_to_fasta(delta_filename, output_filename, ref_sequences=True, query_sequences=True, ungap=True, output_format="fasta", verbose=False, skiprows=4):
    """
    The created SeqRecord objects only have id, description, and sequence fields.
    So the output formats BioPython supports for this besides "fasta" include
    e.g. "tab" and do not include "nexus" or "fastq".

    That being said, I imagine you could easily modify the code for `get_seq_record`
    to add the desired/required extra attributes to each SeqRecord such that your
    preferred output format is supported.
    """
    dataframe = delta_file_to_dataframe(delta_filename, skiprows=skiprows) 

    seqrecord_iterators = [get_seq_record(group, delta_filename, ref_sequences=ref_sequences,
                                         query_sequences=query_sequences, ungap=ungap, verbose=verbose)
                          for group in iter(dataframe.groupby(by=['Ref Id', 'Query Id']))]

    def concatenate_iterables(*iterables, verbose=False):
        for counter, iterable in enumerate(iterables, 1):
            if verbose:
                print('Starting group #{}'.format(counter))
            yield from iterable
    ultimate_seqrecord_iterator = concatenate_iterables(*seqrecord_iterators, verbose=verbose)

    write(ultimate_seqrecord_iterator, output_filename, output_format)
ADD COMMENT
0
Entering edit mode

For anyone who doesn't like the use of external libraries, here's a different approach I wrote six months ago that only uses built-in Python, and in general doesn't have as many batteries included, uses fewer regexes, and has fewer options. Note that it is even slower though and not streaming at all, so if cancelled midway, there's no output and all results are lost. Perhaps adding a verbose mode would not be difficult but it also currently doesn't have one. Also it's only able to output to FASTA and no other formats.

Anyway it works well enough, better than nothing, and the only reason I bothered to spend the time making the new version today is because I mistakenly concluded that this old version doesn't work for a dumb reason. I would still recommend the newer version however if you have BioPython and pandas anyway.

import os
import subprocess
import re

def delta_to_fasta(delta_filename, output_filename=None):
    if output_filename is None:
        assert len(delta_filename.rsplit('.')) == 2, 'Give the delta file a file extension so that a basename can be identified and used for the output filename, otherwise supply a filename of your choosing with the `output_filename` parameter.'
        delta_file_basename = delta_filename.rsplit('.')[0]

        fasta_filename = '{}.fasta'.format(delta_file_basename)
    else:
        fasta_filename = output_filename

    assert os.path.isfile(fasta_filename) == False, 'The filename {} is already used on the system. Please move or rename that file, or choose a different filename using the `output_filename` parameter.'.format(fasta_filename)

    coords_file = subprocess.run(['show-coords', '-c', '-l', '-r', '-T', delta_filename],
                                  stdout=subprocess.PIPE).stdout.decode('utf-8')

    coords_file_contents = coords_file.split('\n')[4:-1]

    seq_names = None
    alignments = []

    fasta_strings = []

    for line in coords_file_contents:
        pct_identity = line.split('\t')[6]

        if seq_names != tuple(line.split('\t')[-2:]): # this if clause should also work correctly in base case
#            assert len(alignments) == 0
            seq_names = tuple(line.split('\t')[-2:])
            first_seq_name, second_seq_name = seq_names

            aligns_file = subprocess.run(['show-aligns', delta_filename, first_seq_name, second_seq_name],
                                        stdout=subprocess.PIPE).stdout.decode('utf-8')
            alignments = alignments_from_aligns_file(aligns_file)

        alignment = alignments.pop(0)
        fasta_strings.append(sequences_lines_from_alignment(alignment, first_seq_name, second_seq_name, pct_identity))

    with open(fasta_filename, 'w') as fasta_file:
        fasta_file.write('\n\n'.join(fasta_strings))
    print('Finished, aligned sequences with coordinates and percent identities of matches should have been written in FASTA format to {}'.format(fasta_filename))

def alignments_from_aligns_file(aligns_file):
    file_contents = '\n'.join(aligns_file.split('\n')[3:-3]) # get rid of cruft in first 3 and last 3 lines of output
    alignments = file_contents.split('\n-- BEGIN alignment ')[1:]
    return alignments

def sequences_lines_from_alignment(alignment_string, first_seq_name='first_sequence', second_seq_name='second_sequence', pct_identity=None):
    header, lines, footer = alignment_string.split('\n\n')

    coordinates_regex = r'\[ ([\+\-])1 ([0-9]+) \- ([0-9]+) \| ([\+\-])1 ([0-9]+) \- ([0-9]+) \]'
    first_strand_dir, first_seq_start_coord, first_seq_end_coord, second_strand_dir, second_seq_start_coord, second_seq_end_coord = re.findall(coordinates_regex, header)[0]

    lines = lines.split('\n')
    lines = ["\n".join(lines[i:i+4]) for i in range(0, len(lines), 4)]

    first_sequence = ""
    second_sequence = ""

    for line in lines:
        _, first_sequence_part, second_sequence_part, _ = line.split('\n')
        remove_coord_numbers_regex = r'[0-9]+\s+(.*)'
        first_sequence_part = re.search(remove_coord_numbers_regex, first_sequence_part).group(1)
        second_sequence_part = re.search(remove_coord_numbers_regex, second_sequence_part).group(1)

        first_sequence += first_sequence_part
        second_sequence += second_sequence_part

    if pct_identity is not None:
        pct_identity_string = ' {}% identity'.format(pct_identity)
    else:
        pct_identity_string = ''

    first_fasta_string = '>{}:{}-{}({}){}\n{}'.format(first_seq_name, first_seq_start_coord, first_seq_end_coord,
                                                      first_strand_dir, pct_identity_string, first_sequence)
    second_fasta_string = '>{}:{}-{}({}){}\n{}'.format(second_seq_name, second_seq_start_coord, second_seq_end_coord,
                                                       second_strand_dir, pct_identity_string, second_sequence)

    fasta_string = '\n'.join([first_fasta_string, second_fasta_string])
    return fasta_string
ADD REPLY

Login before adding your answer.

Traffic: 809 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6