Dear community,
I have a rather broad question on which I would like to have your input. I am trained in structural biology and proteomics and, following the needs of my project, I have recently ventured into learning how to analyze transcript-level RNA-seq data. As with protein identification in proteomics, my fellow biologists seem resistant to the idea that RNA-seq experimental data can be used to infer gene isoforms and, better still, to calculate their abundances. It is literally impossible to convey the message that the Excel spreadsheet they get at the end of the analysis is not just a list of predictions that one needs to validate with a vast number of convoluted PCR and/or cloning experiments. In the case of mass spec data, I have well-founded arguments to show that an MS/MS fragmentation pattern explains a peptide sequence or a phosphorylation site, for example, if the statistical parameters are good enough. In the case of isoform reconstruction from RNA-seq data, I am not sure I have all the arguments at hand.
I would therefore appreciate it if any (or many) of you could give me the point of view of bioinformaticists. A few specific questions: Are the available bioinformatics tools (e.g. TopHat-Cufflinks-Cuffdiff) mature enough to reconstruct isoforms? Furthermore, in a reference-guided analysis, what can I make of novel isoforms, in particular those tagged as class_code "j"?
Your input will be appreciated.
G.
As a computational biologist who understands a bit about how these programs work but who has never worked with this data, I'm not comfortable with any isoform reconstructions yet unless the splicing graph allows only one or a small number of possible isoforms. With most RNA-seq technologies, a read covers only 1-3 exons, so you have to infer the existence and abundance of isoforms by matching exons and exon-exon junctions that have the same read depth. This seems problematic to me. It is an active area of research, and people have published algorithms to infer isoform abundance, but I suspect that they work well only when (a) a small number of isoforms are expressed and (b) the isoforms all have relatively high abundances.
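To make the concern concrete, here is a minimal sketch of a toy gene, not a model of what Cufflinks or any other tool actually does. Everything in it is hypothetical: the five exons, the two cassette exons E2 and E4, the isoform names, and the abundance values. It assumes that expected coverage is simply proportional to isoform abundance and that no read is long enough to span both cassette events. Under those assumptions, two very different isoform mixtures produce exactly the same exon and junction depths.

```python
# Hypothetical toy gene: five exons, where E2 and E4 are cassette exons
# separated by a constitutive exon E3 that is longer than a read.
EXONS = ["E1", "E2", "E3", "E4", "E5"]

# The four isoforms that this splicing graph allows (names are made up).
ISOFORMS = {
    "both":    ("E1", "E2", "E3", "E4", "E5"),
    "only_E2": ("E1", "E2", "E3", "E5"),
    "only_E4": ("E1", "E3", "E4", "E5"),
    "neither": ("E1", "E3", "E5"),
}


def expected_coverage(abundances):
    """Return (exon_depth, junction_depth) for a mixture of isoforms.

    Assumption: depth is the plain sum of the abundances of the isoforms
    that contain the exon or junction (no length or bias correction).
    """
    exon_depth = {exon: 0.0 for exon in EXONS}
    junction_depth = {}
    for name, level in abundances.items():
        exons = ISOFORMS[name]
        for exon in exons:
            exon_depth[exon] += level
        # Short reads only see junctions between adjacent exons of an isoform.
        for left, right in zip(exons, exons[1:]):
            key = (left, right)
            junction_depth[key] = junction_depth.get(key, 0.0) + level
    return exon_depth, junction_depth


# Two different mixtures with the same total abundance.
mix_a = {"both": 5.0, "only_E2": 0.0, "only_E4": 0.0, "neither": 5.0}
mix_b = {"both": 0.0, "only_E2": 5.0, "only_E4": 5.0, "neither": 0.0}

# True means exon and junction depths cannot tell the two mixtures apart.
same_signal = expected_coverage(mix_a) == expected_coverage(mix_b)
print(same_signal)
```

In this toy case the data cannot distinguish the mixtures at all, so any abundance estimate comes from the model's extra assumptions rather than from the reads. With a single alternative event, or with reads or read pairs that link the two events, the ambiguity disappears.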
I do believe the splice boundary calls and I also believe the gene abundance levels (where all reads to the gene are counted, regardless of the isoform). What kind of novel isoforms are you talking about? If they contain a previously unobserved splice boundary that is supported by substantial read depth, I would be more confident than if they don't contain any new exons or splice boundaries.
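The kind of check I have in mind for a novel splice boundary can be sketched as follows. This is only an illustration: the coordinates, the read counts, and the cutoff of 10 reads are invented, and "substantial read depth" would have to be defined for your own data.

```python
# Hypothetical annotated junctions for one gene, as (donor, acceptor) positions.
known_junctions = {(1200, 1500), (1650, 2100)}

# Hypothetical counts of reads spanning each observed junction.
observed_junction_reads = {
    (1200, 1500): 84,
    (1650, 2100): 77,
    (1200, 2100): 31,  # not annotated, many supporting reads
    (1230, 1500): 2,   # not annotated, almost no support
}

MIN_SUPPORT = 10  # arbitrary illustrative cutoff, not a recommended value


def classify_junctions(observed, known, min_support):
    """Label each observed junction by annotation status and read support."""
    calls = {}
    for junction, reads in observed.items():
        if junction in known:
            calls[junction] = "annotated"
        elif reads >= min_support:
            calls[junction] = "novel, supported"
        else:
            calls[junction] = "novel, weak"
    return calls


calls = classify_junctions(observed_junction_reads, known_junctions, MIN_SUPPORT)
for junction, label in sorted(calls.items()):
    print(junction, observed_junction_reads[junction], label)
```

A novel isoform that rests on a junction in the "novel, supported" group is backed by direct evidence from reads, whereas one that only recombines known exons and junctions depends entirely on the abundance inference.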