Identifying reading frame for Ribo-Seq reads
1
0
Entering edit mode
3.2 years ago
plberry ▴ 30

I am trying to identify the A-site bound codons for a Ribo-Seq dataset. I know the A-site offsets for the three reading frames for the various lengths of reads, but I am puzzled about which reading frame to pick for each read. I can of course manually do this by getting the coding sequence for the gene it was mapped to, do a local alignment, and see which reading frame the read maps to, but since I have several million reads, this is not feasible.

The only way I can think of to do it by brute force is create a script that goes through my SAM file, extracts the Read ID, Transcript ID, and Sequence, then looks up the coding sequence in another fasta by Transcript ID. This would then run ClustalOmega on the two sequences, take the output from the query sequence, looks at the number of "-" placeholders before the query sequence, and then use the length modulus 3 to figure out which reading frame the read is in, and export the reading frame (0, 1, 2) along with the Read ID to a tsv or something...but that seems ridiculously convoluted and would probably take days to run.

RNA-Seq Ribo-Seq A-site • 686 views
ADD COMMENT
0
Entering edit mode
3.2 years ago
Mensur Dlakic ★ 27k

but since I have several million reads, this is not feasible.

It doesn't strike me as infeasible, given that you are willing to do some scripting and optimization. Neither does the thing you propose in the second paragraph. What if it takes even days to run? Other than possibly your own impatience, is there a problem waiting days to find out?

That aside, if you have reads that are 150 bp or longer, it is almost a guarantee that a reading frame that doesn't have a stop codon for the whole read length is a correct one. For most coding sequences you are not going to have 2 different reading frames going 50 residues without a stop codon. There will be a small fraction of reads where that's not the case, but whichever frame gives you the longest ORF is almost guaranteed to be correct. This advice applies only if you don't want to wait days to do it properly.

ADD COMMENT
0
Entering edit mode

These are ribo-seq reads, so the read lengths are only as long as a ribosome is wide, about 15-30 nts. Usually I would do the stop codon trick! I guess the main thing is that the rest of my analysis script is in Python and Python is so slow for anything where you have to read through files line by line or open a bunch compared to bash...I've been meaning to learn parallelization in Python though, this might be the project to do it.

ADD REPLY

Login before adding your answer.

Traffic: 1748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6