Hi all,
I'm mining viruses in RNA-Seq data, and I often see that the nucleotide gaps (~50-100 nucleotides) differentiating isoforms lie outside the predicted coding regions. I've been working under the assumption that this type of isoform represents alternatively spliced variants, but that doesn't seem consistent with my data if the variation sits outside the coding regions. One possible explanation is that the predicted coding regions are wrong, although for some transcripts I have >90% amino acid similarity to references and near-complete viral genomes, so I think that is less likely in those cases. I suppose they could also be different viruses altogether. Or could these be artifacts of the assembly process? Any information that sheds light on why this happens would be greatly appreciated! Thanks!
I'm not an expert in viruses, but in other organisms this does not have to be the case. Isoforms, in the sense of alternative transcripts (arising from splicing differences, exon skipping, intron retention, ...), are defined at the transcript level (i.e. the mRNA), and as such the definition has nothing to do with the coding region. You can have different isoforms that encode identical proteins, where the difference is not in the coding region but, for instance, in a UTR.
I should add that most attention goes to variants where the variation affects the translated protein, but strictly speaking it does not have to.
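To make the point concrete, here is a minimal Python sketch with made-up toy sequences (not real viral data). It assumes forward-strand ORFs only, the standard genetic code, and that the longest ATG-to-stop ORF is the coding region; the helper names `longest_orf` and `first_difference` are just for illustration. Two transcripts that differ by a 10-nt gap in the 3' UTR still give the same protein:

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf(seq):
    """Return (start, end, protein) of the longest ATG..stop ORF on the forward strand.

    ORFs that run off the end of the sequence without a stop codon are ignored.
    """
    seq = seq.upper()
    best = (0, 0, "")
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        protein = []
        for pos in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X for ambiguous codons
            if aa == "*":
                if len(protein) > len(best[2]):
                    best = (start, pos + 3, "".join(protein))
                break
            protein.append(aa)
    return best


def first_difference(a, b):
    """Index of the first position where two sequences differ (or the shorter length)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


# Toy isoforms: same 5' UTR and CDS, but the short one lacks 10 nt in the 3' UTR.
utr5 = "GGCACGTTAGC"
cds = "ATGGCTAAAGGTTGA"
isoform_long = utr5 + cds + "CCTT" + "CCGGAACCTT" + "GGCC"
isoform_short = utr5 + cds + "CCTT" + "GGCC"

start_l, end_l, prot_l = longest_orf(isoform_long)
start_s, end_s, prot_s = longest_orf(isoform_short)
diff_pos = first_difference(isoform_long, isoform_short)

print("same protein:", prot_l == prot_s)
print("difference starts in 3' UTR:", diff_pos >= end_l)
```

On real assemblies you would use the CDS coordinates from your ORF predictor or annotation instead of this naive ORF search, but the check is the same: if the gap falls entirely outside the CDS interval, the isoforms can differ at the transcript level while encoding the same protein.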
Thank you! This was very helpful!
Can you clarify how you are doing that?