Any alignment programs that accept regular expressions within input sequences?
6.3 years ago
Ghoti ▴ 90

I've been working in Geneious with a pseudo-consensus amino acid sequence (pseudo in the sense that, because of sampling bias, it is not necessarily an accurate representation of the population). To create the consensus sequence, a conservation threshold must be set. If the proportion of matching residues at a given position meets the threshold, that residue is preserved in the consensus. Otherwise, the position is listed as unknown. It would better suit my purposes if I could instead use a lower, more lenient threshold and list every residue that meets it at a given position, written as a regular expression. Example:

Sequence 1 = ABCABC

Sequence 2 = BBCDAA

Sequence 3 = BBACAC

Sequence 4 = ACAABC

Output with required threshold at >=75% and no regular expressions = XBXXXC

Output with required threshold at >=50% and regular expressions allowed = [AB]B[CA]A[AB]C
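To make the thresholding rule in the example concrete, here is a minimal Python sketch. The function name `regex_consensus` and its parameters are made up for illustration and are not part of Geneious or any alignment tool. It assumes the input sequences are already aligned to the same length, and it sorts residues alphabetically inside each bracket, so the result is equivalent to the example above but may differ in the order of residues within a character class.

```python
import re
from collections import Counter


def regex_consensus(seqs, threshold, unknown="X"):
    """Build a consensus string, keeping every residue whose column
    frequency is >= threshold (hypothetical helper for illustration)."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*seqs):
        counts = Counter(column)
        # Residues that meet the threshold, sorted for a stable output.
        kept = sorted(r for r, n in counts.items() if n / len(column) >= threshold)
        if not kept:
            out.append(unknown)          # nothing is conserved enough
        elif len(kept) == 1:
            out.append(kept[0])          # a single conserved residue
        else:
            out.append("[" + "".join(kept) + "]")  # regex character class
    return "".join(out)


seqs = ["ABCABC", "BBCDAA", "BBACAC", "ACAABC"]
strict = regex_consensus(seqs, 0.75)
lenient = regex_consensus(seqs, 0.50)
print(strict)
print(lenient)
# The lenient consensus is a valid regular expression and can be matched directly.
print(bool(re.fullmatch(lenient, "ABCABC")))
```

Note that a pattern like this only supports exact, ungapped matching through `re`; it does not score partial matches the way an aligner would.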

My thought is that this would make alignment scoring more accommodating, much as scoring matrices score residue pairs by similarity/dissimilarity rather than by identity alone. I'm trying to optimize alignments of sequences from single-stranded RNA viruses with high rates of mutation and recombination. I'm also more interested in unique sequences/residues (which are harder to match/align) than in the prevalence of recurring residues. This brings me to my question: are there any alignment programs that accept regular expressions as input?

Edit: I feel it's necessary to emphasize that the odds of coincidentally homologous/similar regions within "my" genome are low because of its size (15 kb in total length).

alignment regular expression • 1.4k views

Why not make a HMM profile instead?


Thanks for the suggestion. A profile HMM is basically what I described, but after reading up on it, I worry it will be too computationally demanding. Perhaps I'll adjust the scoring matrix instead. MAFFT in particular allows a user-defined 248x248 matrix (in alpha testing), which could use unique characters to represent paired residues at a given position within an aligned consensus.

https://mafft.cbrc.jp/alignment/software/textcomparison.html#userdefinedmatrix
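As a rough illustration of the "unique characters for paired residues" idea, the sketch below assigns placeholder symbols to residue sets and derives a score for every symbol pair by averaging a base score over the residues each symbol stands for. Everything here is an assumption made for illustration: the placeholder symbols `1` and `2`, the toy match/mismatch scores, and the choice of averaging (taking the maximum would be another option). It does not produce or follow MAFFT's matrix file format.

```python
def class_score(set_a, set_b, base_score):
    """Average the base score over all residue pairs drawn from two sets."""
    pairs = [(a, b) for a in set_a for b in set_b]
    return sum(base_score(a, b) for a, b in pairs) / len(pairs)


def toy_score(a, b):
    # Hypothetical match/mismatch values, not a real substitution matrix.
    return 2 if a == b else -1


# Hypothetical symbol table: plain residues plus placeholders for residue pairs.
symbols = {
    "A": frozenset("A"),
    "B": frozenset("B"),
    "C": frozenset("C"),
    "1": frozenset("AB"),  # stands for [AB]
    "2": frozenset("AC"),  # stands for [AC]
}

# Expanded scoring table covering every pair of symbols.
matrix = {
    (s, t): class_score(symbols[s], symbols[t], toy_score)
    for s in symbols
    for t in symbols
}

print(matrix[("1", "A")])  # [AB] against A
print(matrix[("1", "2")])  # [AB] against [AC]
```

With this scheme an ambiguous position scores between a full match and a full mismatch against any residue it contains, which is the behaviour the modified matrix is meant to capture.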


What makes you think that? Maybe I missed it, but what is the size of the dataset you're working on (length and number of sequences)?


I need to perform tens of thousands of pairwise alignments in a few hours. So far I've been using MAFFT localpair (which uses the Smith-Waterman algorithm). To save computation time, I thought a simple change to the scoring matrix would be sufficient. However, the required MAFFT version is in alpha, and I'm not sure it's accessible to the public. I also thought that updating MAFFT could resolve an issue with terminal gaps (which I've been discussing with you).

I was overly dismissive of the Profile HMM method. I'll attempt to find a tool that can be deployed through Python.
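For intuition about what a profile captures compared with a bracketed consensus, here is a small pure-Python sketch of a position-specific log-odds profile. It is a deliberate simplification and not a profile HMM: there are no insert or delete states, so it only scores ungapped sequences of the same length as the profile. The function names, the toy alphabet `ABCD`, the uniform background, and the pseudocount of 1 are all assumptions for illustration.

```python
import math
from collections import Counter


def build_profile(seqs, alphabet, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            # Pseudocounts keep unseen residues from scoring minus infinity.
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return profile


def score(profile, seq):
    """Sum the per-column scores for an ungapped sequence of profile length."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(col[r] for col, r in zip(profile, seq))


seqs = ["ABCABC", "BBCDAA", "BBACAC", "ACAABC"]
profile = build_profile(seqs, alphabet="ABCD")
print(score(profile, "BBCABC"))  # built from the most common residues
print(score(profile, "DDDDDD"))  # mostly unseen residues
```

Unlike the bracketed consensus, every residue gets a graded score at every position, so rare residues are penalized without being treated as impossible.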
