Hi all, I am trying to create a tool which utilises BLATv.35 in order to pairwise-align all transcripts in a gene (for all genes in the genome). I am using the commandline BLAT version which requires a .fasta file for each gene and produces a .psl for each gene, this is a lot of IO (as i produce 20,000 fasta files and 20,000 psl files for instance).
My starting point is a .fasta file with all the transcripts in the entire genome and a .txt file with the clustering of which transcripts are in which gene. Since their are hundreds of thousands of transcripts blatting the fasta file against itself is very time consuming and is very inefficient (since i only need to pairwise align transcripts from within the same gene and not the other 219,999 genes), so i want to split this into a blat instance per gene. But in order to do so i need a fasta file per gene, therein lies the problem.
My tool is a python tool (and i run blat as a command to the shell from within). I am trying to speed up my program by bypassing all the IO. Ideally i would pass a list (where each element was a transcript sequence) for each gene and blat the list against itself pairwise and then store the output all within a python framework (without needing to create a .fasta file and .psl for each gene).
My plan is one of the following:
1) modify the blat source code such that i can simply input arrays of strings, and output strings and encorporate this function into my python code using cython.
2) Use a pairwise aligner native to python (i have tested this already, these are much slower than BLAT)
3) By pass blat altogether, write/import my own aligner in C/C++ and use that instead.
I was simply wondering whether anyone else had thought or implemented a solution and which of the above solutions the biostar community thought would be best?
Thanks for your time, Anthony