Question

Differences between "gene prediction" vs. "transcript reconstruction"

3.8 years ago

ljason2020 • 0

Hi all,

I am rather new to genome/transcriptome analysis, so I apologize if this is a basic question. I have done some reading of different analysis methods in the literature; specifically, I'm confused about the difference between the tasks of "gene prediction" (e.g., AUGUSTUS, BRAKER) versus genome-guided "transcript reconstruction" (e.g., Cufflinks, StringTie), both of which seem capable of taking in pre-aligned RNA-seq reads (.bam) and computing relevant genomic regions. Note that I am NOT referring to de novo transcript assembly algorithms like Trinity.

What I currently believe the difference is: Gene prediction is ultimately trying to annotate the features of a genome—there is pretty much a "right answer" for each organism. Additionally, gene prediction tends to incorporate many data sources and prediction methods, like sequence homology, searching for known nucleotide patterns ab initio, etc. Transcript reconstruction is ultimately trying to characterize transcription at the very time of RNA-sequencing; the output is passed on to other analysis tasks such as differential expression. Thus, there is more focus on estimating expression, as well as searching for different transcript isoforms for the same gene.

However, even if that is the case (and I hope I'm not mixing biological concepts here): since transcripts are produced from a gene, aren't the genomic regions that both tasks are seeking to locate one and the same? In other words, for organisms with reference sequences, why use Cufflinks/StringTie at all if you can ostensibly assemble reads more accurately using gene prediction software?

Any corrections, insights, or suggestions for further reading are appreciated!

RNA-Seq alignment gene prediction Assembly • 1.1k views

updated 3.8 years ago by lieven.sterck 15k • written 3.8 years ago by ljason2020 • 0

score 3 · Accepted Answer · 2020-06-29

In essence you are correct.

The main difference, in a nutshell, is that gene prediction tries to annotate/predict all genes in a genome (i.e., the complete proteome), while transcript reconstruction methods, as the name suggests, try to correctly re-create the transcripts from the RNA-seq data provided.

Gene prediction usually draws on many more resources than RNA-seq data alone, because even today it is practically impossible to sample each and every transcript/gene present in the genome (for biological reasons: you would need a massive number of different samplings to catch every gene being expressed). To overcome this, gene prediction does not rely solely on RNA-seq data but, as you said, also on protein information and, for instance, ab initio methods.

In transcript reconstruction, you are not aiming to get all transcripts/genes from the genome, only those that are present in your current sample.
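To make the distinction concrete, here is a rough toy sketch (not how AUGUSTUS, BRAKER, Cufflinks, or StringTie work internally). It assumes you already have gene-level intervals from a gene prediction and transcript-level intervals reconstructed from one RNA-seq sample, and simply asks which predicted genes were actually observed in that sample. The `Interval` type, the function names, and all coordinates are made up for illustration, and strand is ignored for simplicity.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A named genomic interval (hypothetical type; 1-based, inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def split_by_sample_support(
    predicted_genes: List[Interval],
    sample_transcripts: List[Interval],
) -> Tuple[List[str], List[str]]:
    """Split predicted genes into those overlapped by a reconstructed transcript and those not."""
    supported, unsupported = [], []
    for gene in predicted_genes:
        if any(overlaps(gene, tx) for tx in sample_transcripts):
            supported.append(gene.name)
        else:
            unsupported.append(gene.name)
    return supported, unsupported


# Made-up gene prediction output: aims to cover every gene in the genome.
predicted_genes = [
    Interval("chr1", 100, 900, "geneA"),
    Interval("chr1", 1500, 2400, "geneB"),
    Interval("chr2", 300, 1200, "geneC"),
]

# Made-up transcript reconstruction output: only what was expressed in this sample,
# possibly with several isoforms at one locus.
sample_transcripts = [
    Interval("chr1", 120, 880, "tx1"),
    Interval("chr1", 150, 700, "tx2"),  # second isoform at the geneA locus
    Interval("chr2", 350, 1100, "tx3"),
]

supported, unsupported = split_by_sample_support(predicted_genes, sample_transcripts)
print("observed in this sample:", supported)
print("predicted but not observed:", unsupported)
```

In this toy data, geneB is a perfectly valid gene that a transcript-only approach would miss because it was not expressed in the sample, while the geneA locus shows two isoforms that a gene-level prediction alone would not distinguish.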