Hola peeps :)
I develop a Python package which has a module that aligns gene nucleotide sequences to contigs with Minimap2.
Given the query (gene) and target (contig) nucleotide sequences, I can reconstruct the aligned sequences using the CIGAR. I can also extract the resulting gene sequence from the contig, and translate to get the resulting amino acid sequence.
However, I would like to calculate the number of amino acid matches between the query protein sequence and protein sequence extracted from the contig to get a protein percent identity value for reporting.
I recognise that this can be achieved with a pairwise protein aligner, however I want to avoid including an additional dependency just for this one function.
I have tried iterating through the aligned sequences in chunks of 3 (codons) and comparing the codon translations, however this does not handle insertions and deletions in nucleotide space.
So my question is: is it possible to infer the pairwise amino acid alignment from a pairwise nucleotide alignment (with CIGAR) and the corresponding amino acid sequences, and how could I implement this in Python?
An example problematic CIGAR is:
'71M4I14M4D4M1I18M2D3M2I2M1D14M1D3M1I130M2I15M1I4M1D1M2D4M2D2M1I6M4D8M2I3M3I6M3I7M1D4M2D31M1D1M2I3M1D31M1I7M1D52M2I1M1D3M1D41M1I8M1D6M6I101M2D4M2I45M1I6M1D116M1D8M1I562M'
(the sequences are quite long so I will paste on request)
Thanks in advance!
Honestly, given that the first 71M could introduce any number of mutations, I doubt there's any easy/reliable way to infer amino acid alignment. For example, what if in the first 71M, you create an amino acid sequence that matches the last 22 amino acids, but then ends in a stop codon. I don't see any way you would know that the amino acid sequence matches the last half. Plus the frameshift mutations on top of that.
It seems it would be easier to just implement your own pairwise algorithm if you really don't want a dependency.