I would like to find the lengths of the 3' UTR, 5' UTR and ORF for each transcript of protein-coding genes. My plan was to get the ORF length from the coordinates of the start and stop codons, and the UTR lengths from the UTR coordinates. The problem is that some transcripts have 0 (or 2) start/stop codons and 0 (or multiple) UTRs, whereas I expected 1 start codon, 1 stop codon and 2 UTRs.
I could discard those transcripts by counting how many start/stop codons and UTRs each one has, but I wanted to know if there is a better way to do it.
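For concreteness, a counting step like the one I describe could look roughly like this. It is only a sketch: the records and the names `EXAMPLE_GTF`, `count_features` and `unexpected_transcripts` are made up for illustration, and I am assuming a GENCODE-style GTF in which both UTR types share the feature name `UTR` (other sources may use different feature names).

```python
import re
from collections import Counter, defaultdict

# Made-up GTF records (9 tab-separated columns) for two hypothetical transcripts.
EXAMPLE_GTF = [
    'chr1\tHAVANA\tUTR\t100\t150\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\tstart_codon\t151\t153\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\tstop_codon\t400\t402\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\tUTR\t403\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\tUTR\t100\t150\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tHAVANA\tstart_codon\t151\t153\t.\t+\t0\tgene_id "G1"; transcript_id "T2";',
]

def count_features(gtf_lines, wanted=("start_codon", "stop_codon", "UTR")):
    """Count the wanted feature types per transcript_id."""
    counts = defaultdict(Counter)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in wanted:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            counts[match.group(1)][fields[2]] += 1
    return counts

def unexpected_transcripts(counts):
    """Transcripts that do not have exactly 1 start codon, 1 stop codon and 2 UTRs.

    Transcripts with none of the wanted features never appear in `counts`,
    so they would need to be handled separately.
    """
    return {
        tid for tid, c in counts.items()
        if (c["start_codon"], c["stop_codon"], c["UTR"]) != (1, 1, 2)
    }

counts = count_features(EXAMPLE_GTF)
print(unexpected_transcripts(counts))
```

In this toy input only the second transcript is flagged, because it has a single UTR and no stop codon.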
Looking at the tags of some transcripts, I noticed that transcripts with multiple UTRs had the 'alternative_3_UTR'/'alternative_5_UTR' tags, transcripts with no stop or start codon had the 'cds_end_NF'/'cds_start_NF' tags, and transcripts with fewer than 2 UTRs had the 'mRNA_start_NF'/'mRNA_end_NF' tags.
After filtering on these tags, the number of transcripts with an unexpected number of start/stop codons or UTRs was greatly reduced, but it was still not 0.
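The tag filter I mean is along these lines. Again this is a sketch, not a definitive implementation: `EXCLUDE_TAGS`, `tags_of` and `keep_transcript` are hypothetical names, the attribute strings are invented, and I am assuming the GENCODE convention of repeated `tag "..."` entries in the ninth GTF column.

```python
import re

# Tags used for exclusion (names as I understand them from the GENCODE tag list).
EXCLUDE_TAGS = {
    "alternative_3_UTR", "alternative_5_UTR",
    "cds_start_NF", "cds_end_NF",
    "mRNA_start_NF", "mRNA_end_NF",
}

def tags_of(attributes):
    """Return every tag value found in a GTF attribute string."""
    return set(re.findall(r'\btag "([^"]+)"', attributes))

def keep_transcript(attributes, exclude=EXCLUDE_TAGS):
    """True if the transcript carries none of the excluded tags."""
    return tags_of(attributes).isdisjoint(exclude)

# Invented attribute strings for two hypothetical transcripts.
complete = 'gene_id "G1"; transcript_id "T1"; tag "basic"; tag "CCDS";'
partial = 'gene_id "G1"; transcript_id "T2"; tag "cds_start_NF"; tag "mRNA_start_NF";'

print(keep_transcript(complete), keep_transcript(partial))
```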
Am I making wrong assumptions about the tags? Am I missing something?
I am new to the topic and quite confused by this. I understand why a transcript can have no start/stop codon, but I don't get why some transcripts have 2. Looking at the coordinates, it seems that the 3 bases of a codon are split across different regions: one entry has length 1 and the other length 2, so I guess that together they make up one codon, but noticing that didn't really help me understand how this works. Any suggestions for material where I could learn more are welcome.
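To show what I mean by the split codons, here is a small sketch with made-up coordinates. It sums the lengths of the codon entries per transcript, taking GTF coordinates as 1-based and inclusive, so that each entry has length end - start + 1. The rows and the name `codon_lengths` are hypothetical.

```python
from collections import defaultdict

# Made-up (transcript_id, feature, start, end) rows: the start codon of T1
# is listed as two entries, of length 1 and length 2.
ROWS = [
    ("T1", "start_codon", 1000, 1000),
    ("T1", "start_codon", 1501, 1502),
    ("T1", "stop_codon", 2400, 2402),
]

def codon_lengths(rows):
    """Total number of bases covered by each feature type per transcript."""
    totals = defaultdict(int)
    for transcript_id, feature, start, end in rows:
        totals[(transcript_id, feature)] += end - start + 1  # inclusive coordinates
    return dict(totals)

print(codon_lengths(ROWS))
```

With these rows both the start codon and the stop codon add up to 3 bases, even though the start codon appears as two separate entries.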