STAR is an aligner. Its job is to work out where in the genome each sequencing read came from, and that is (primarily) what it outputs - a list of reads and their co-ordinates on the genome (in the form of a BAM file). It is very concerned about getting the location correct at a base-by-base level.
In order to do expression analysis, you need to turn these locations into gene expression values. There are several ways to do this, but the simplest is just to count the number of reads that align to the part of the genome overlapping each gene (an example of a program that does this is featureCounts). The alternative is to employ statistical models to assign reads to likely transcripts (where several transcripts might overlap on the genome). An example of a program that does this is RSEM.
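To make the counting idea concrete, here is a toy sketch in Python. It is not how featureCounts is implemented; the gene co-ordinates, read positions and the `__no_feature` / `__ambiguous` labels are all made up for illustration. It counts a read for a gene only when the read overlaps exactly one gene.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 50, 300),
}

def count_reads(alignments, genes):
    """Count reads per gene; reads hitting no gene or several genes are tallied separately."""
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in alignments:
        # A read overlaps a gene if they share a chromosome and their intervals intersect.
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return dict(counts)

# Made-up aligned reads as (chromosome, start, end).
READS = [
    ("chr1", 120, 170),  # inside geneA only
    ("chr1", 460, 510),  # overlaps both geneA and geneB
    ("chr1", 600, 650),  # inside geneB only
    ("chr2", 400, 450),  # overlaps no annotated gene
    ("chr2", 250, 300),  # inside geneC
]

result = count_reads(READS, GENES)
print(result)
```

Note that the read overlapping two genes is simply set aside here. That is the weakness of plain counting which the statistical approaches try to address.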
Both Kallisto and Salmon are quantifiers - they take a file containing sequencing reads and output expression levels. Of course, working out where in the transcriptome these reads come from is the first step in this process. After locating each read on the transcriptome, they employ statistical models to turn this into transcript expression levels that take into account how certain they are about which transcript a read comes from.
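As a rough illustration of what "statistical models" means here, the sketch below runs a bare-bones expectation-maximisation (EM) loop over groups of reads that are compatible with the same set of transcripts. It is a deliberately simplified toy, not the model used by Kallisto, Salmon or RSEM: it ignores transcript length, fragment length and sequence bias, and the function name and read counts are invented for the example.

```python
def em_abundance(eq_classes, n_transcripts, n_iter=200):
    """Estimate the fraction of reads coming from each transcript.

    eq_classes: list of (tuple of compatible transcript indices, read count).
    """
    total = sum(count for _, count in eq_classes)
    # Start by assuming every transcript is equally abundant.
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for transcripts, count in eq_classes:
            # E-step: split each group's reads in proportion to current abundances.
            weight = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += count * theta[t] / weight
        # M-step: new abundances are the expected share of reads.
        theta = [e / total for e in expected]
    return theta

# Made-up data: 60 reads match only transcript 0, 10 match only transcript 1,
# and 30 reads are compatible with both.
EQ_CLASSES = [((0,), 60), ((1,), 10), ((0, 1), 30)]

theta = em_abundance(EQ_CLASSES, n_transcripts=2)
print(theta)
```

The 30 ambiguous reads end up shared out roughly 6:1 in favour of transcript 0, because the unambiguous reads suggest it is about six times more abundant.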
It might seem that Kallisto and Salmon are doing more, so they should take longer, but this is not in fact true - it turns out they are much quicker. This generation of tools was originally based on the observation that if you are quantifying genes, you care more about the set of transcripts a read could have come from than about its precise location within that transcript or on the genome. They do this in different ways - Kallisto uses "pseudo-alignment" and the latest version of Salmon uses something called "selective-alignment", which is somewhere between what STAR and Kallisto do (as I understand it), although older versions used something called quasi-alignment.
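The "set of transcripts a read could have come from" idea can be sketched with a simple k-mer lookup. This is only a toy under stated assumptions: the sequences and function names are made up, the real tools use far more efficient index structures, and treating a read with any unseen k-mer as unmappable is a simplification chosen here rather than a rule taken from Kallisto or Salmon.

```python
def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def compatible_transcripts(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            return set()  # simplification: an unseen k-mer makes the read unmappable
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Two made-up transcripts that share their first 12 bases and then diverge.
TRANSCRIPTS = {
    "tx1": "ACGTACGGTTCAGGCTA",
    "tx2": "ACGTACGGTTCATTTGC",
}
K = 5
INDEX = build_kmer_index(TRANSCRIPTS, K)

shared = compatible_transcripts("GTACGGTT", INDEX, K)   # from the shared region
unique = compatible_transcripts("GTTCAGGC", INDEX, K)   # spans the tx1-specific end
unknown = compatible_transcripts("GGGGGGGG", INDEX, K)  # not in any transcript
print(shared, unique, unknown)
```

No base-by-base position is ever reported, only the compatible set. The last read also shows why sequence missing from the input transcriptome cannot be quantified.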
Pros and Cons:
- Kallisto and Salmon are much quicker and less memory intensive than STAR + stand-alone quantification.
- They give transcript-level expression information (whereas STAR + counting only gives gene-level, although STAR + RSEM gives transcript-level).
- They gracefully deal with cases where reads map to multiple transcripts or genes.
- Normally Salmon and Kallisto only map to the transcriptome (the sequence of the transcripts a cell produces) rather than the genome.
- They have a different set of false positives and negatives than alignment-based approaches.
- Their results are only as good as the transcript annotation that is input into them. They can't quantify genes or splice-variants that are not in their input. Differences between the real transcriptome and your annotation of the transcriptome will reduce accuracy.
- They cannot be used for finding new genes, transcripts or splice forms, or for any analysis other than quantification.
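If you only need gene-level values, transcript-level estimates can be summed per gene. A minimal sketch, assuming you have a transcript-to-gene mapping from your annotation (the names and numbers below are hypothetical, and this ignores any length correction a dedicated import tool might apply):

```python
from collections import defaultdict

def summarise_to_genes(transcript_counts, tx2gene):
    """Sum estimated transcript counts into one value per gene."""
    gene_counts = defaultdict(float)
    for tx, count in transcript_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)

# Made-up transcript-level estimates and transcript-to-gene mapping.
TRANSCRIPT_COUNTS = {"tx1": 85.5, "tx2": 14.5, "tx3": 40.0}
TX2GENE = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarise_to_genes(TRANSCRIPT_COUNTS, TX2GENE)
print(gene_counts)
```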
modified 7 weeks ago
7 weeks ago by
i.sudbery ♦ 6.3k