I've been racking my brains about the best way to do this for a while now but am no closer to the answer it seems, so I could use some help!
I'll explain the problem conceptually, then in context as that may help. Essentially the problem is this:
Given a pair of sequences, at least one of which has
per-letter annotations in
biopython, I need to align the shorter of the 2 sequences (since one is likely to have gaps, where as the annotated sequence will be 'full'), and be able to 'carry' the per-letter annotations with it.
For example, if I have these 2 dummy sequences (with some scores immediately below the longer sequence):
>1 ATGCAT 135198 >2 ATCAT
In practice these scores are not all
ints and will be
floats, but for the purpose of demonstration random numbers will suffice.
Now if these sequences were aligned, and the scores were carried/transferred to the other sequence we'd get:
ATGCAT AT-CAT 13-198
So the question is, how can I align these in general, and preserve these annotations? So far, I've been looking to see if
pairwise2 can handle SeqRecords and deal with moving those characters around as the object would preserve the character-score relationship, but it seems to only deal with literal sequence strings. In the above case, it would be pretty trivial to copy the annotations over to the new sequence and then just 'drop' any position where there is a gap.
In more complicated real-world examples though, I'm expecting that this logic may not always hold, but I am still trying to kind of tease this thread through in my head to understand whether I'm overcomplicating things.
I said I'd offer the real world motivation for this too, so the reason I need to do this is that I have a number of fasta files which have been analysed for their immunogenicity properties, and I am in the process of 'rendering' this on to the 3D structure. Its common for 3D structures to be missing stretches of residues though, so in order to correctly 'associate' the sequence with its structure, and the right residues together, I need to be able to align them and (as far as possible) unambiguously identify which residue in the structure corresponds to the value in the fasta/scores file.
Thoughts/feelings/code gratefully recieved.