3.1 years ago
LeeLee
I am using the ribotricer software to search for ORFs in Ribo-seq data. It returns the start and end positions of each ORF on the genome, as below:
ENST00000327044.7_944697_959240_2247 chr1 -
The fields in the above line are, in order: gene information, gene start position, gene end position, ORF length, chromosome, and strand (plus or minus). I want to use this to predict the amino acid sequence, but I found that the nucleotide sequence obtained this way contains introns. How can I get around this? Should I use an annotation file to remove the introns from such a sequence?
That's a transcript ID and the transcript start and end, not the gene.
Yes, this is the transcript ID. But as I understand it, the start and end refer to positions on the chromosome, because 959240 minus 944697 is far greater than the ORF length of 2247.
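To make that comparison concrete, here is a minimal sketch that splits the ORF ID into its parts and compares the genomic span with the reported ORF length. The helper name `parse_orf_id` is hypothetical (it is not part of ribotricer), and the code assumes the ID always ends in `_start_end_length` and that the coordinates are 1-based and inclusive.

```python
from typing import NamedTuple


class OrfId(NamedTuple):
    transcript_id: str
    start: int
    end: int
    length: int


def parse_orf_id(orf_id: str) -> OrfId:
    """Split an ID of the assumed form <transcript>_<start>_<end>_<length>."""
    # Split from the right so any underscore inside the transcript ID survives.
    transcript_id, start, end, length = orf_id.rsplit("_", 3)
    return OrfId(transcript_id, int(start), int(end), int(length))


orf = parse_orf_id("ENST00000327044.7_944697_959240_2247")

# Assumption: coordinates are 1-based and inclusive, hence the "+ 1".
genomic_span = orf.end - orf.start + 1

# A span much larger than the ORF length means the interval contains introns.
print(orf.transcript_id, genomic_span, orf.length)
```

If the span and the ORF length are equal, the ORF sits in a single exon and no splicing is needed.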
I guess a better approach is to take the CDS coordinates from the annotation file, extract the sequence at each of those coordinates (there may be multiple CDS segments), and join the pieces together to build the complete CDS of the transcript.
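A rough sketch of that splicing step, in plain Python, might look like the following. Everything here is an assumption for illustration: the function name `spliced_orf_sequence` is hypothetical, `blocks` stands for the exon or CDS intervals of the transcript as read from the annotation file (1-based, inclusive), and `chrom_seq` is the chromosome sequence as a string. Loading the GTF and the genome FASTA is left out.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_orf_sequence(chrom_seq, blocks, orf_start, orf_end, strand):
    """Join the parts of `blocks` that fall inside [orf_start, orf_end].

    All coordinates are assumed 1-based and inclusive, on the plus strand,
    as in a GTF file. `blocks` is a list of (start, end) tuples.
    """
    pieces = []
    for block_start, block_end in sorted(blocks):
        # Clip each exon/CDS block to the ORF interval; skip blocks outside it.
        start = max(block_start, orf_start)
        end = min(block_end, orf_end)
        if start <= end:
            pieces.append(chrom_seq[start - 1:end])
    seq = "".join(pieces)
    # Minus-strand ORFs read in the opposite direction on the other strand.
    return reverse_complement(seq) if strand == "-" else seq


# Toy chromosome: two exons (positions 3-8 and 15-20) around one intron.
toy_chrom = "CC" + "ATGGCC" + "GTAAAG" + "AAATAA" + "GG"
toy_blocks = [(1, 8), (15, 22)]
print(spliced_orf_sequence(toy_chrom, toy_blocks, 3, 20, "+"))
```

Because each block is clipped to the ORF interval, it does not matter whether the ORF starts in the first exon or a later one, which makes the function easy to apply in a loop over many ORFs. A useful sanity check is that the returned sequence has the length given in the ORF ID.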
Yes, this is a good idea, but it is still difficult for me to implement. Because the start of the ORF may not be in the first exon, I don't know how to process this in batches.
From what I understand, the start of the ORF will be at the first CDS segment (not the first exon, which may contain UTR). You can do a simple sanity check by testing whether the start codon is at the beginning of the first CDS.
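That sanity check, followed by the translation into amino acids, could be sketched as below. The function names are hypothetical, the standard genetic code is assumed, and the input is assumed to be an already spliced, strand-corrected ORF sequence. Note that the check only tests for ATG, so ORFs called with a non-canonical start codon would be flagged.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def looks_like_orf(seq):
    """Check ATG start, length divisible by 3, and a terminal stop codon."""
    seq = seq.upper()
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0
        and seq.startswith("ATG")
        and CODON_TABLE.get(seq[-3:]) == "*"
    )


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


orf_seq = "ATGGCCAAATAA"  # toy spliced ORF
if looks_like_orf(orf_seq):
    print(translate(orf_seq))
```

An ORF that fails the check is a hint that the coordinates, the strand, or the choice of exon versus CDS blocks was handled incorrectly upstream.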
Are you looking to get the sequence from your sequencing data or the reference sequence for this ORF?
Yes. In more detail, what I want is the 2247-nucleotide sequence of the exonic regions between 944697 and 959240 on the minus strand of chromosome 1. The problem is how to do this for a large number of ORFs, which is very difficult for me. I am looking for a suitable tool to do it.
The reference sequence in the reference genome?
Just download the cDNA sequence file