I am trying to identify the A-site bound codons for a Ribo-Seq dataset. I know the A-site offsets for the three reading frames for the various lengths of reads, but I am puzzled about which reading frame to pick for each read. I can of course manually do this by getting the coding sequence for the gene it was mapped to, do a local alignment, and see which reading frame the read maps to, but since I have several million reads, this is not feasible.
The only way I can think of to do it by brute force is create a script that goes through my SAM file, extracts the Read ID, Transcript ID, and Sequence, then looks up the coding sequence in another fasta by Transcript ID. This would then run ClustalOmega on the two sequences, take the output from the query sequence, looks at the number of "-" placeholders before the query sequence, and then use the length modulus 3 to figure out which reading frame the read is in, and export the reading frame (0, 1, 2) along with the Read ID to a tsv or something...but that seems ridiculously convoluted and would probably take days to run.
These are ribo-seq reads, so the read lengths are only as long as a ribosome is wide, about 15-30 nts. Usually I would do the stop codon trick! I guess the main thing is that the rest of my analysis script is in Python and Python is so slow for anything where you have to read through files line by line or open a bunch compared to bash...I've been meaning to learn parallelization in Python though, this might be the project to do it.