Question

Identifying reading frame for Ribo-Seq reads

3.2 years ago

plberry ▴ 30

I am trying to identify the A-site-bound codons for a Ribo-Seq dataset. I know the A-site offsets for the three reading frames at each read length, but I am unsure how to determine which reading frame each read is in. I could of course do this manually by getting the coding sequence for the gene a read mapped to, doing a local alignment, and seeing which reading frame the read falls in, but since I have several million reads, this is not feasible.

The only brute-force way I can think of is to write a script that goes through my SAM file, extracts the Read ID, Transcript ID, and Sequence, then looks up the coding sequence in a separate FASTA file by Transcript ID. It would then run ClustalOmega on the two sequences, take the aligned query sequence from the output, count the number of "-" placeholders before the query sequence, take that count modulo 3 to figure out which reading frame the read is in, and export the reading frame (0, 1, 2) along with the Read ID to a TSV or similar. But that seems ridiculously convoluted and would probably take days to run.
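The lookup described above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: it swaps the ClustalOmega alignment for an exact substring search (`str.find`), so the returned offset plays the role of the number of leading "-" placeholders, and it assumes reads match the coding sequence without mismatches or indels. The function names (`read_fasta`, `read_frame`, `frames_from_sam`) are made up for illustration and do not come from any library.

```python
def read_fasta(path):
    """Return {record_id: sequence}; the ID is the first word of each header."""
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def read_frame(read_seq, cds_seq):
    """Frame (0, 1, 2) of the read's first base within the CDS.

    Assumption: the read occurs exactly in the CDS. The first occurrence
    is used, so repeated sequences can give a misleading answer.
    Returns None when there is no exact match.
    """
    offset = cds_seq.find(read_seq.upper())
    return None if offset < 0 else offset % 3


def frames_from_sam(sam_lines, cds_by_id):
    """Yield (read_id, frame) for each mapped, sense-strand alignment."""
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        read_id, flag = fields[0], int(fields[1])
        transcript_id, seq = fields[2], fields[9]
        # Skip unmapped reads (0x4) and reverse-strand alignments (0x10).
        if flag & 0x4 or flag & 0x10 or transcript_id not in cds_by_id:
            continue
        yield read_id, read_frame(seq, cds_by_id[transcript_id])
```

Writing the pairs out with `csv.writer(..., delimiter="\t")` would give the TSV. If the reads were aligned directly against the same coding sequences, the 1-based POS column of the SAM record should already hold this offset, so `(POS - 1) % 3` would give the frame without any search.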

RNA-Seq Ribo-Seq A-site • 686 views

updated 3.2 years ago by Mensur Dlakic ★ 27k • written 3.2 years ago by plberry ▴ 30

score 0 · Answer 1 · 2021-02-09

"...but since I have several million reads, this is not feasible."

It doesn't strike me as infeasible, provided you are willing to do some scripting and optimization. Neither does the approach you propose in your second paragraph. So what if it takes days to run? Other than possibly your own impatience, is there a problem with waiting days for the result?

That aside, if you have reads that are 150 bp or longer, a reading frame with no stop codon along the whole read length is almost guaranteed to be the correct one. For most coding sequences you are not going to have two different reading frames running 50 residues without a stop codon. There will be a small fraction of reads where that's not the case, but whichever frame gives you the longest ORF is almost guaranteed to be correct. This advice applies only if you don't want to wait days to do it properly.
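As a rough illustration of the stop-codon heuristic, here is a minimal Python sketch. It assumes sense-strand reads and the standard stop codons (TAA, TAG, TGA). The function names are hypothetical. Note that "frame" here means the offset of the first complete codon within the read, which is not the same convention as the position of the read's first base within a codon of the CDS.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_frames(seq):
    """Return the frames (0, 1, 2) that contain no stop codon.

    Frame f reads codons starting at offset f of the read; a trailing
    partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


def guess_frame(seq):
    """Return the single stop-free frame, or None if it is not unique."""
    frames = stop_free_frames(seq)
    return frames[0] if len(frames) == 1 else None


# Example: only frame 0 of this made-up sequence is free of stop codons.
print(guess_frame("ATGACTAACTGGCTAGCC"))
```

The sketch returns `None` when zero or several frames are stop-free rather than guessing. The shorter the read, the more often more than one frame will be stop-free, so the ambiguous fraction should be checked before relying on this. The longest-ORF tie-break mentioned above is not implemented here.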