Finding pairwise alignment between two multifasta files for similarity search
6.6 years ago
Siddharth • 0

I'm trying to compare genomic elements (think of genes, for example) from two closely related organisms to find out whether they are "conserved" (I'm fairly sure this is not the right term for a pairwise comparison) or, putting it coarsely, have high sequence similarity (or identity, if you will; I'm not sure how much difference that makes for nucleotides). My current pipeline has identified and extracted these elements of interest, and I store them in two FASTA files, each containing the sequences from one of the two organisms I'm studying. I initially decided to use BLAST, but given the high frequency of SNPs and the large size of these elements, I thought good old Smith-Waterman alignment might be more suitable for measuring similarity. However, the Python library I used (scikit-bio) for the alignment, and thereafter for creating a score matrix (not a scoring matrix!), didn't work out for reasons I don't understand; I believe it has something to do with the wrapper the library uses to perform SSW. So I'm back to square one: either I try to do the same using Biopython or R, or I change my strategy if it is fundamentally flawed. In machine learning terms, this is akin to creating a similarity matrix between two different sets of vectors.

Initial question: I'm trying to create a score matrix of M*N dimensions by aligning M sequences from one FASTA file to N sequences from another. Currently I'm using scikit-bio, but I think its wrapper for striped Smith-Waterman alignment is a bit buggy (or maybe I'm making a major mistake). My current code works fine if I compare a multi-FASTA file with itself, but it fails as soon as I try to compare two different files (Error: Alignment score and position are not consensus.). Ultimately I want to compare which sets of sequences from two organisms are similar, and I can change my strategy if there is something better than alignment-based similarity/identity comparison, or if there is something equivalent that works in Biopython or R (and can be parallelized). My current code looks like this:

# -*- coding: utf-8 -*-
import os
from itertools import repeat
from multiprocessing import Pool

import pandas as pd
from skbio.alignment import StripedSmithWaterman


def read_fasta(path):
    '''Return {header: sequence}; a sequence may span several lines.'''
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


def compute_scores(dna1, dnas2):
    # StripedSmithWaterman docs:
    # http://scikit-bio.org/docs/0.4.2/generated/skbio.alignment.StripedSmithWaterman.html
    ssw1 = StripedSmithWaterman(dna1)
    # AlignmentStructure docs:
    # http://scikit-bio.org/docs/0.4.2/generated/skbio.alignment.AlignmentStructure.html
    # https://github.com/biocore/scikit-bio/blob/9dc60b4248912a4804c90d0132888d6979a62d51/skbio/alignment/_lib/ssw.c
    return [ssw1(dna2).optimal_alignment_score for dna2 in dnas2]


class sequenceCompare:

    '''Common class for comparing multifasta files'''

    def __init__(self, fasta1, fasta2):
        self.fasta1 = fasta1 + ".fasta"
        self.fasta2 = fasta2 + ".fasta"

    def computeScore(self):
        sequenceList1 = read_fasta(self.fasta1)
        sequenceList2 = read_fasta(self.fasta2)
        values2 = list(sequenceList2.values())
        with Pool(os.cpu_count()) as p:
            # One row of N scores per sequence in fasta1, so M rows in total.
            data = p.starmap(compute_scores, zip(sequenceList1.values(), repeat(values2)))
        # Rows follow fasta1 and columns follow fasta2, matching the shape of data.
        df = pd.DataFrame(data, index=list(sequenceList1.keys()), columns=list(sequenceList2.keys()))
        # df contains the resulting data frame
        output = self.fasta1 + "_x_" + self.fasta2 + ".tsv"
        df.to_csv(output, sep='\t')
Python FASTA alignment R • 3.0k views

Is this an assignment? Why not just use BLAST, and make one of your FASTA files the db?


No, this is not an assignment. I wish it was. But anyway, I'm avoiding BLAST because my sequences range from 100ish nucleotides to 40kbp, and I'm not sure whether BLAST is a suitable tool for similarity search at such extremes.


Yes, it is. Why do you have that suspicion?


I just don't know how it handles interspersed low-complexity regions and repetitive elements. The larger sequences especially have those characteristics. While it can be argued that they are not "functionally" important, I would still like to have them accounted for.


So it's the content that's the problem, not the query length :-)


As Ram says: BLAST will do just fine here. You might opt to disable the low-complexity filtering, but otherwise it's your weapon of choice!
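
To make the BLAST suggestion concrete, here is a minimal sketch of how one FASTA file could be turned into a nucleotide database and searched with the other, with DUST low-complexity filtering switched off. It assumes BLAST+ (`makeblastdb`, `blastn`) is on the PATH; the file names, the database name and the helper functions `blast_commands`, `best_bitscores` and `run_blast` are made up for illustration. The parser assumes the default 12-column tabular output (`-outfmt 6`), where the last column is the bit score.

```python
import subprocess
from collections import defaultdict


def blast_commands(query_fasta, subject_fasta, db_name="subject_db", out_tsv="hits.tsv"):
    """Build the makeblastdb and blastn command lines (all names are placeholders)."""
    make_db = ["makeblastdb", "-in", subject_fasta, "-dbtype", "nucl", "-out", db_name]
    search = [
        "blastn",
        "-query", query_fasta,
        "-db", db_name,
        "-dust", "no",    # turn off low-complexity (DUST) filtering
        "-outfmt", "6",   # tabular: qseqid sseqid pident ... evalue bitscore
        "-out", out_tsv,
    ]
    return make_db, search


def best_bitscores(tabular_text):
    """Collapse outfmt 6 rows into {query: {subject: best bit score}}."""
    matrix = defaultdict(dict)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Several HSPs may exist per pair; keep only the highest-scoring one.
        if bitscore > matrix[query].get(subject, 0.0):
            matrix[query][subject] = bitscore
    return {query: dict(row) for query, row in matrix.items()}


def run_blast(query_fasta, subject_fasta):
    """Run both steps and return the sparse score matrix; needs BLAST+ installed."""
    make_db, search = blast_commands(query_fasta, subject_fasta)
    subprocess.run(make_db, check=True)
    subprocess.run(search, check=True)
    with open(search[-1]) as handle:
        return best_bitscores(handle.read())
```

Pairs without any hit are simply absent from the result, so the M*N matrix comes out sparse; missing entries can be read as "no detectable similarity".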


100ish nucleotides to 40kbp

Doing pairwise comparisons does not appear wise in that case.


I agree with that. Looking at the matrix generated from subsampled data certainly tells so. But then again, all those relatively shorter reads have a score close to zero, so I don't think it should be a big problem.
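
The length effect described here follows from how raw local alignment scores work: they grow with the length of the matching region, so a short element that matches perfectly still scores far below a partial match between two long elements. One possible way to put the matrix on a common scale, which is an assumption on my part and not something proposed in this thread, is to divide each raw score by the smaller of the two self-alignment scores. The function name `normalize_scores` and its arguments are hypothetical.

```python
def normalize_scores(scores, self_scores_rows, self_scores_cols):
    """Divide each raw score by the smaller of the two self-alignment scores.

    scores           -- M x N list of raw local alignment scores
    self_scores_rows -- M scores of each row sequence aligned to itself
    self_scores_cols -- N scores of each column sequence aligned to itself
    """
    normalized = []
    for i, row in enumerate(scores):
        new_row = []
        for j, raw in enumerate(row):
            ceiling = min(self_scores_rows[i], self_scores_cols[j])
            # Guard against empty sequences, whose self-score is zero.
            new_row.append(raw / ceiling if ceiling > 0 else 0.0)
        normalized.append(new_row)
    return normalized


# Toy numbers: a 100 nt element (self-score 200) and a 4 kbp element (self-score 8000).
raw = [[200, 100], [50, 4000]]
print(normalize_scores(raw, [200, 8000], [200, 8000]))
```

With a simple match/mismatch scheme the normalized value should fall between 0 and 1, where 1 means the shorter sequence is contained in the longer one without mismatches.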


Ultimately I want to compare which sets of sequences from two organisms are similar

Could you expand on this? We may be able to suggest better approaches, but to do so, we would need more details.

Do you have sets of alignments? And you want to compare one set of alignments to another set of alignments? What exactly are your alignments?

Or maybe you want to merge alignments, without performing full alignment again?


I added a bit more explanation at start. Sorry for the confusion!


I googled the error message "Alignment score and position are not consensus." and think it is caused by the overflow of a variable. One or more of your sequences might be too long for that module.
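
If the overflow explanation is right, a scorer built on Python's unbounded integers should not hit the same limit, which makes it usable as a cross-check on a small subsample. Below is a minimal sketch of a score-only Smith-Waterman in pure Python. It is a simplification: it uses a linear gap penalty rather than separate gap-open and gap-extend penalties, the parameter values are illustrative, and `smith_waterman_score` is a name I made up. Its scores will therefore not necessarily match the library's numbers exactly.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-3, gap=-5):
    """Best local alignment score with a linear gap penalty (no traceback)."""
    previous = [0] * (len(seq2) + 1)   # DP row for the previous base of seq1
    best = 0
    for base1 in seq1:
        current = [0]                  # column 0 is always 0 in local alignment
        for j, base2 in enumerate(seq2, start=1):
            diagonal = previous[j - 1] + (match if base1 == base2 else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            if score > best:
                best = score
        previous = current
    return best


print(smith_waterman_score("ACGTT", "TTACG"))  # shared "ACG" gives 3 * 2 = 6
```

The running time is quadratic in the sequence lengths, so in pure Python this is only practical for short sequences or a handful of pairs, not for a full matrix of 40 kbp elements.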

https://github.com/10XGenomics/cellranger/blob/master/tenkit/lib/python/striped_smith_waterman/ssw.c
