What is the difference between gene-level and transcript-level quantification?
Is gene-level quantification performed on genome ORF sequences, and transcript-level quantification on mRNA sequences with introns removed and isoforms taken into account, or am I getting this all wrong? If that's right, where do I get transcript reference sequences, given that usually only genome annotations exist?
The difference between gene-level and transcript-level quantification is, well, that gene-level summarizes counts over genes, and transcript-level summarizes counts over transcripts.
Both gene-level and transcript-level may be calculated in several ways:
1) mapping to the genome and using an annotation to count reads overlapping the features of interest. The difference is in how multi-mapping reads are treated: in general they are discarded when summarizing over genes directly, and apportioned using an expectation-maximization algorithm when summarizing over transcripts.
2) mapping to the transcriptome (with all isoforms from each gene represented as sequences). Counts are apportioned using an expectation-maximization algorithm, and the counts from all isoforms of each gene are summed if summarizing at the gene level.
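To make the "apportion with EM, then sum isoforms per gene" idea concrete, here is a minimal toy sketch in Python. It is not how RSEM, salmon or kallisto are actually implemented: it ignores transcript length, fragment bias and everything else those tools model, and the names `em_apportion`, `sum_to_genes`, the read list and the `tx2gene` table are all made up for illustration.

```python
from collections import defaultdict

def em_apportion(read_compat, n_iter=100):
    """Toy EM: read_compat is a list of tuples, one per read, holding the
    transcript ids that read is compatible with. Returns expected counts
    per transcript. Assumption: all transcripts have equal length."""
    transcripts = sorted({t for compat in read_compat for t in compat})
    # Start from equal relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    n_reads = len(read_compat)
    counts = dict.fromkeys(transcripts, 0.0)
    for _ in range(n_iter):
        counts = dict.fromkeys(transcripts, 0.0)
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = {t: counts[t] / n_reads for t in transcripts}
    return counts

def sum_to_genes(tx_counts, tx2gene):
    """Gene-level counts are the sum over all isoforms of each gene."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)

# Hypothetical data: two unique T1 reads, one read shared by T1/T2, one T3 read.
reads = [("T1",), ("T1",), ("T1", "T2"), ("T3",)]
tx2gene = {"T1": "G1", "T2": "G1", "T3": "G2"}

tx_counts = em_apportion(reads)
gene_counts = sum_to_genes(tx_counts, tx2gene)
print(tx_counts)
print(gene_counts)
```

Note that the read shared between T1 and T2 is ambiguous at the transcript level but not at the gene level, since both isoforms belong to G1. That is why gene-level sums tend to be more stable than the individual isoform estimates.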
If you want to use the transcriptome to do the quantification, Ensembl provides fasta downloads for (coding and non-coding) transcript sequences, or you can extract transcript sequences from a genome and its annotation - gffread from StringTie and rsem-prepare-reference from RSEM are two programs to perform this task.
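Conceptually, extracting a transcript sequence from a genome plus annotation just means concatenating the exon sequences and reverse-complementing for minus-strand transcripts. A rough Python sketch of that idea follows, using a made-up chromosome string and made-up exon coordinates (1-based, inclusive, as in GTF). The helper names are hypothetical; for real data you would use the tools mentioned above rather than this.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only in this sketch)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def transcript_sequence(chrom_seq, exons, strand):
    """Splice together the exons of one transcript.
    exons: list of (start, end) pairs, 1-based inclusive (GTF convention)."""
    # Concatenate exons in genomic order; introns are simply skipped.
    spliced = "".join(chrom_seq[start - 1:end] for start, end in sorted(exons))
    # Minus-strand transcripts are the reverse complement of the genomic sequence.
    return revcomp(spliced) if strand == "-" else spliced

# Toy chromosome with an "intron" (CCC) between two exons.
chrom = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]

plus_tx = transcript_sequence(chrom, exons, "+")
minus_tx = transcript_sequence(chrom, exons, "-")
print(plus_tx, minus_tx)
```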
There's a difference between the read alignment step (which needs the actual sequence) and the quantification (which basically just counts the numbers of reads overlapping with defined loci, i.e., genes or transcripts). The standard workflow for model organisms with well established genome sequences and annotation is to:
1) align to the genome (sequence in a fasta file), perhaps using transcriptome information (usually a gtf file), using a read alignment tool such as STAR
2) count reads overlapping with genes, where a gene is often defined as the union of all exons from all transcripts of that gene (introns are usually excluded); typical tools for this step are featureCounts or htseq-count.
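The counting step can be sketched roughly like this in Python. This is a simplified illustration only: it assumes a single chromosome, ignores strand and spliced reads, and discards reads that hit more than one gene. The function names and the toy coordinates are invented, and featureCounts or htseq-count handle many more cases (and options) than this.

```python
def merge_intervals(intervals):
    """Collapse overlapping or adjacent exons into a non-redundant union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def count_reads(reads, gene_exons):
    """Count reads per gene; a read counts only if it overlaps the exon
    union of exactly one gene (ambiguous and intronic reads are dropped)."""
    union = {gene: merge_intervals(exons) for gene, exons in gene_exons.items()}
    counts = dict.fromkeys(gene_exons, 0)
    for r_start, r_end in reads:
        hits = [gene for gene, ivs in union.items()
                if any(r_start <= e and s <= r_end for s, e in ivs)]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

# Toy annotation: G1 has exons from two overlapping isoforms; G2 overlaps G1's last exon.
gene_exons = {
    "G1": [(100, 200), (150, 250), (400, 500)],
    "G2": [(480, 600)],
}
# Toy aligned reads as (start, end) on the same chromosome.
reads = [(120, 170), (260, 300), (450, 470), (490, 510), (590, 640)]

counts = count_reads(reads, gene_exons)
print(counts)
```

Here the read at 260-300 falls in an intron of G1 and is not counted, and the read at 490-510 overlaps both genes and is dropped as ambiguous.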
I have the feeling that your original confusion may stem from the rise of kallisto and salmon, which are being sold as tools for transcript quantification. These tools tend to not do the traditional read alignment, instead they try to focus on the sequence representing the transcriptome only and perform "pseudoalignment" and quantification. How to obtain the transcriptome sequence is well described here.
Hi Friederike! I think that's exactly why I got confused. If I understand correctly, kallisto and salmon replace the alignment-to-genome step (for instance with STAR) and instead just return quantification information for whatever sequences are provided as "transcripts" (could be entire chromosomes or, let's say, gene ORFs). Is that correct?
I believe they will take whatever sequence file you provide them as is, yes. But I haven't explored that in detail.