Question

'position-aware' aligning of sequences with letter annotations

1

Entering edit mode

4.5 years ago

Joe 21k

I've been racking my brains about the best way to do this for a while now but am no closer to the answer it seems, so I could use some help!

I'll explain the problem conceptually, then in context as that may help. Essentially the problem is this:

Given a pair of sequences, at least one of which has per-letter annotations in biopython, I need to align the shorter of the 2 sequences (since one is likely to have gaps, where as the annotated sequence will be 'full'), and be able to 'carry' the per-letter annotations with it.

For example, if I have these 2 dummy sequences (with some scores immediately below the longer sequence):

>1
ATGCAT
135198
>2
ATCAT

In practice these scores are not all ints and will be floats, but for the purpose of demonstration random numbers will suffice.

Now if these sequences were aligned, and the scores were carried/transferred to the other sequence we'd get:

ATGCAT
AT-CAT
13-198

So the question is, how can I align these in general, and preserve these annotations? So far, I've been looking to see if pairwise2 can handle SeqRecords and deal with moving those characters around as the object would preserve the character-score relationship, but it seems to only deal with literal sequence strings. In the above case, it would be pretty trivial to copy the annotations over to the new sequence and then just 'drop' any position where there is a gap.

In more complicated real-world examples though, I'm expecting that this logic may not always hold, but I am still trying to kind of tease this thread through in my head to understand whether I'm overcomplicating things.

I said I'd offer the real world motivation for this too, so the reason I need to do this is that I have a number of fasta files which have been analysed for their immunogenicity properties, and I am in the process of 'rendering' this on to the 3D structure. Its common for 3D structures to be missing stretches of residues though, so in order to correctly 'associate' the sequence with its structure, and the right residues together, I need to be able to align them and (as far as possible) unambiguously identify which residue in the structure corresponds to the value in the fasta/scores file.

Thoughts/feelings/code gratefully recieved.

python biopython alignment pairwise • 1.7k views

ADD COMMENT • link 14 months ago by Joe 21k

1

Entering edit mode

Re-posted it at SO with R/Bioconductor tags: https://stackoverflow.com/q/59682107/680068

ADD REPLY • link 4.5 years ago by zx8754 11k

0

Entering edit mode

I don't think there is yet an elegant way to do this in Biopython. In theory the MutableSeq class would be well suited for this task, since it handles sequences that are not single characters, but in practise it's not currently maintained in a way that could be used. You can at least use the newer Bio.Align module rather than pairwise2

ADD REPLY • link 4.5 years ago by MutationalMeltdown ▴ 190

0

Entering edit mode

Yeah I did take a look at the PairwiseAligner() but it doesn't seem to offer any additional functionality.

I'm actually a bit surprised that the alignment modules don't ingest SeqRecords since that's pretty much the primary currency of biopython.

Shame. Perhaps there's a feature request in the making if its confirmed there's definitely no elegant way of doing it. I thought there might be since when it comes to slicing SeqRecords, it's quite intelligent in how it handles features, annotations and per-letter annotations.

ADD REPLY • link 4.5 years ago by Joe 21k

0

Entering edit mode

The PairwiseAligner() offers some different alignment algorithms and a fast C-implementation, but yes it does not solve your issues. The alignment module will 'ingest' Seq/MutableSeq/UnknownSeq objects (not just str) and I'm not sure it would make sense for them to ingest SeqRecords since alignment is purely based on the sequence

What is the input format, since it holds floats, it can't be the modified FASTA format you posted? If there is a standard input format, then this could possibly be an enhancement for Biopython to read via SeqIO, then it could create a SeqRecord with letter annotations as you suggested. (However, you'll still have to do some work to parse the resulting alignment and map the score annotations back)

ADD REPLY • link 4.5 years ago by MutationalMeltdown ▴ 190

0

Entering edit mode

At the moment the input format is entirely undefined and can be whatever I want it to be. The fasta above was just a clear example

One could even encode it as a sort of fastq style format, however the real values are floats as I mentioned so they cannot be encoded in a single character easily (unlike, say fastq scores are).

Biopython's per letter annotations can simply be set as SeqRecord.letter_annotations["Name"] = [...] so the data format can be pretty flexible as all you need is a list.

ADD REPLY • link 4.5 years ago by Joe 21k

score 2 · Accepted Answer · 2023-05-23

This just popped up again in my inbox, so thought I'd come back with the solution I came up with.

I believe I arrived at the eventual solution based on an idea from Devon Ryan.

It's embedded in to a much more convoluted codebase, but the general idea was this:

Create a pairwise alignment of the sequences as normal.
Iterate through the alignment and create a sort of CIGAR string which captures the various sequence 'movements'.
Apply/map this same set of movements to the list of annotations corresponding to each position, thus repeating the set of operations

Code here:

https://github.com/jrjhealey/ISIS/blob/master/ImmunoRender.py#L35-L103