Hello all :)
I am running a Python script that aligns transcripts of certain genes to the transcripts of their paralogs (using pairwise2.align.localxx). About 1500 alignments are carried out, and most of them finish pretty quickly, but some take a disproportionately long time (20 minutes and longer). I noticed that the results of these slow alignments all have a very low alignment score, so I assume the aligner is just having a hard time finding a way to fit two very different sequences together.
Currently, because of the long alignment time, the script runs for many hours without finishing (if it doesn't crash, which also happens). So my question is: how can I make the script more efficient, so that it does not linger so much on sequences that have very little in common? For me, it would be fine if all alignments scoring below a certain minimum were skipped/dropped.
Thank you in advance!
Python is not very efficient for that type of work. There are all kinds of existing tools that are much faster: BLAST, FASTA and DIAMOND are probably the best known among them. These are local aligners; if you need global alignments, they can be adjusted to produce them. Unless you need something peculiar that these programs can't do, I suggest you switch from your Python script to one of them.
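If you want to stay in Python for the time being, one cheap workaround is to screen each pair before aligning it and skip the pairs that share almost no short exact substrings (k-mers). Below is a minimal sketch of that idea using only the standard library. The function names, the k-mer length of 8 and the 5% cutoff are all my own assumptions for illustration, not anything from Biopython, and you would need to tune them on pairs whose alignment scores you already know.

```python
def kmer_set(seq, k=8):
    """Return the set of all length-k substrings of seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(seq_a, seq_b, k=8):
    """Fraction of the smaller k-mer set that also occurs in the other sequence.

    Returns 0.0 if either sequence is shorter than k.
    """
    kmers_a, kmers_b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


def pairs_worth_aligning(pairs, k=8, min_fraction=0.05):
    """Yield only the (name_a, seq_a, name_b, seq_b) tuples that pass the screen.

    min_fraction is a hypothetical cutoff; calibrate it on your own data.
    """
    for name_a, seq_a, name_b, seq_b in pairs:
        if shared_kmer_fraction(seq_a, seq_b, k) >= min_fraction:
            yield name_a, seq_a, name_b, seq_b


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    candidates = [
        ("geneA", "ACGTACGTTAGCCGATAGCTAGGCTA", "geneA_par", "ACGTACGTTAGCCGATAGCTAGGCTA"),
        ("geneB", "AAAAAAAAAAAAAAAAAAAAAAAA", "geneB_par", "CCCCCCCCCCCCCCCCCCCCCCCC"),
    ]
    for name_a, _, name_b, _ in pairs_worth_aligning(candidates):
        # This is where the real alignment call would go.
        print("align", name_a, "vs", name_b)
```

Only the pairs that come out of pairs_worth_aligning would then be passed to the aligner. Separately, it may be worth checking whether the slow cases are spending their time enumerating many equally good alignments rather than scoring: pairwise2 accepts one_alignment_only=True and score_only=True as keyword arguments, and I would try those first, though I have not tested them on your data.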
Seconding Mensur: use a dedicated program. Have you looked at programs designed specifically for aligning transcripts?