Question

Could you explain the difference between STAR, Kallisto, Salmon etc. to an experimental biologist/non-bioinformatician?

29

4.6 years ago

WUSCHEL ▴ 750

Could you explain the difference between STAR, Kallisto, Salmon etc. to an experimental biologist/non-bioinformatician?

If possible, the pros and cons of each pipeline.

Edit below

I ask because three of my colleagues use these three different tools for RNA-Seq, basically to answer the same type of biological questions.

RNA-Seq alignment next-gen R assembly • 31k views

ADD COMMENT • link updated 3 months ago by Ram 43k • written 4.6 years ago by WUSCHEL ▴ 750

6

@Devon has about the best answer for that here: A: Alignment and mapping

STAR is an aligner. Kallisto/Salmon are mapping programs (technically pseudo-aligners, as @lieven points out).

For getting a biological answer, either pipeline should be fine. Mappers will save time compared to aligners.

ADD REPLY • link 4.6 years ago by GenoMax 141k

0

Well, now I am even more confused :(

Mapping is part of alignment. So why were new mapping tools like Kallisto/Salmon created, if "For getting a biological answer either of pipelines should be fine" and STAR is already available? Why use mapping on its own if it is included in alignment?

ADD REPLY • link 4.6 years ago by WUSCHEL ▴ 750

7

Kallisto and Salmon are not really mappers in the strict sense of the word (they are pseudoaligners). STAR does mapping in the old-school sense of the word (start with a seed, find an exact match and extend), whereas the others work more like 'BAC fingerprinting': they create some sort of approximation of the reference and of the reads (e.g. something like a k-mer profile) and then match the profiles rather than the actual sequences.

Speed and accuracy are the major drivers for Kallisto/Salmon, I guess. They are, for instance, much better suited to isoform quantification.
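To make the "match profiles rather than sequences" idea concrete, here is a minimal Python sketch. It is not how Kallisto or Salmon are actually implemented (they use far more compact index structures); the function names and the two toy isoform sequences are made up for illustration. It only shows the principle: index the k-mers of each transcript, then ask which transcripts are compatible with all k-mers of a read, without ever computing a base-by-base alignment.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # some k-mer rules out every transcript
    return compatible or set()

# Hypothetical toy transcripts: two isoforms sharing their first 12 bases.
transcripts = {
    "isoform_A": "ATGGCGTACGTTAGCCTAGGA",
    "isoform_B": "ATGGCGTACGTTCCGATTACA",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("GGCGTACG", index, k=5)   # read from the shared region
unique = pseudoalign("TAGCCTAGG", index, k=5)  # read from the isoform_A-specific end
print(shared, unique)
```

The first read is compatible with both isoforms and the second only with isoform_A. Note that no coordinates come out of this at all, which is exactly why these tools are fast but do not give you something to look at in a genome browser.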

ADD REPLY • link 4.5 years ago by lieven.sterck 15k

1

Thank you very much genomax & Lieven, This is very helpful to understand. Thank you for your time.

ADD REPLY • link 4.6 years ago by WUSCHEL ▴ 750

1

Dear All,

Thank you for your feedback. Based on your answers I decided to use "Kallisto" and have done my analysis.

Now I have a confusing outcome. Out of my 5 T-DNA knockout plants, two are showing expression of the gene (even higher than the WT).

These are published lines and I have already genotyped them by PCR before.

Is this because of the pseudoaligning?

How can I check this? Is there any way of mapping?

P.S. I do not believe there is cross-contamination!

Edit: This comment is linked to this question; How to map RNASeq data to reference genome to check T-DNA insertions?

ADD REPLY • link 4.5 years ago by WUSCHEL ▴ 750

1

When things like that happen it's good to get a BAM file. My guess is that the KO is deleting a single exon and that the recycling of the resulting non-sense transcript is simply lower in some of your samples. Alternatively, if there are paralogs then perhaps they really are messing up the pseudoalignment.

ADD REPLY • link 4.5 years ago by Devon Ryan 104k

2

A good start would be to do a traditional alignment and creation of a browser track to visually examine the reads on a browser such as IGV. deeptools bamCoverage can conveniently create these tracks. Often by eye you can capture what is happening more intuitively than by all these (pseudo)alignment metrics.
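For anyone unsure what such a browser track contains: it is essentially the read depth at each position. The Python sketch below is only a conceptual illustration, not a replacement for bamCoverage; the function name and the toy intervals are made up, and real tools read the alignments from a BAM file and add binning and normalisation.

```python
def coverage(alignments, region_start, region_end):
    """Per-base read depth over [region_start, region_end).

    alignments is a list of (start, end) half-open intervals on one chromosome.
    """
    depth = [0] * (region_end - region_start)
    for start, end in alignments:
        # Only count the part of the read that falls inside the region.
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth

# Hypothetical toy alignments: two overlapping reads, then a gap with no reads.
track = coverage([(0, 4), (2, 6)], region_start=0, region_end=8)
print(track)
```

In a knockout line, a stretch of zero or sharply reduced depth inside the gene is the kind of thing that jumps out in such a track but is invisible in a single expression value.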

ADD REPLY • link 4.5 years ago by ATpoint 82k

0

Thank you Devon Ryan & ATpoint. Does this mean I cannot rely on the Kallisto output for my downstream data analysis? Because I cannot compare the mutant gene expression with this output... Btw, I posted the same question here: How to map RNASeq data to reference genome to check T-DNA insertions? Sorry about that.

ADD REPLY • link 4.5 years ago by WUSCHEL ▴ 750

2

Pseudoaligners are generally quite reliable; you may simply have one of the cases where they're not. That remains to be seen.

ADD REPLY • link 4.5 years ago by Devon Ryan 104k

0

But how can I interpret my data if my mutant shows higher expression than the WT by pseudoaligning? Is there any method?

ADD REPLY • link 4.5 years ago by WUSCHEL ▴ 750

1

How was the KO done? Deleting one exon, the entire gene? That is important to know for interpretation.

ADD REPLY • link 4.5 years ago by ATpoint 82k

1

Are you able to make the data public? It'd be much easier to figure out what's going on then.

Note that knocked out genes are often still highly expressed, they're just not translated.

ADD REPLY • link 4.5 years ago by Devon Ryan 104k

0

@ATpoint, This is T-DNA insertions (in one line in Exon and other lines in Intron)

@Devon Ryan, Sorry Devon at this point I can not make data available in public.

I was more irritated because mutant gene >>> WT gene RNASeq data ...

ADD REPLY • link 4.5 years ago by WUSCHEL ▴ 750

Ram · Accepted Answer · 2019-09-24

STAR is an aligner. Its job is to work out where in the genome each sequencing read came from, and that is (primarily) what it outputs - a list of reads and their co-ordinates on the genome (in the form of a BAM file). It is very concerned with getting the location correct at a base-by-base level.

In order to do expression analysis, you need to turn these locations into gene expression values. There are several ways to do this, but the simplest is just to count the number of reads that come from the location in the genome that overlaps with each gene (an example of a program that does this is featureCounts). The alternate method is to employ statistical models to assign reads to likely transcripts (where several transcripts might overlap on the genome). An example of a program that does this is RSEM.
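The simple counting approach can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not featureCounts itself: the function name, gene coordinates and reads are invented, genes are treated as single intervals rather than sets of exons, strand is ignored, and reads overlapping more than one gene are simply skipped.

```python
def count_reads_per_gene(alignments, genes):
    """Count aligned reads whose interval overlaps exactly one gene.

    alignments: list of (chrom, start, end) half-open intervals.
    genes: dict of gene name -> (chrom, start, end).
    """
    counts = {name: 0 for name in genes}
    for chrom, start, end in alignments:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:  # ambiguous and unassigned reads are not counted
            counts[hits[0]] += 1
    return counts

# Hypothetical annotation with two genes that overlap between 450 and 500.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
alignments = [
    ("chr1", 120, 170),  # inside geneA only
    ("chr1", 460, 510),  # overlaps both genes, so it is ambiguous
    ("chr1", 600, 650),  # inside geneB only
    ("chr2", 120, 170),  # no gene on this chromosome
]
gene_counts = count_reads_per_gene(alignments, genes)
print(gene_counts)
```

The ambiguous read is thrown away here, which is the weakness the statistical approach tries to address.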

Both Kallisto and Salmon are quantifiers - they take a file containing sequencing reads and output expression levels. Of course, working out where in the transcriptome these reads come from is the first step in this process. After locating each read on the transcriptome, they employ statistical models to turn this into transcript expression levels that take into account how certain they are about which transcript a read comes from.
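The statistical step can be illustrated with a bare-bones expectation-maximisation (EM) loop in Python. This is a sketch of the general idea only, assuming reads have already been grouped by the set of transcripts they are compatible with; the function name and the counts are invented, and real tools additionally account for transcript length, fragment length and sequence biases.

```python
def em_abundances(eq_classes, transcripts, n_iter=100):
    """Estimate relative transcript abundances with a simple EM loop.

    eq_classes maps a tuple of compatible transcripts to a read count.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    total = sum(eq_classes.values())
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        for compatible, n_reads in eq_classes.items():
            denom = sum(theta[t] for t in compatible)
            if denom == 0:
                continue
            # Split ambiguous reads in proportion to the current estimates.
            for t in compatible:
                assigned[t] += n_reads * theta[t] / denom
        theta = {t: assigned[t] / total for t in transcripts}
    return theta

# Hypothetical counts: 30 reads unique to A, 10 unique to B, 60 compatible with both.
eq_classes = {("A",): 30, ("B",): 10, ("A", "B"): 60}
theta = em_abundances(eq_classes, ["A", "B"])
print(theta)
```

Here the 60 ambiguous reads end up being shared roughly 3:1 between A and B, following the ratio suggested by the uniquely assigned reads, instead of being discarded.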

It might seem that Kallisto and Salmon are doing more, so they should take longer, but this is not in fact true - it turns out they are much quicker. This generation of tools was originally based on the observation that if you are quantifying genes, you care more about the set of transcripts a read could have come from than about its precise location within that transcript or on the genome. They do this in different ways - Kallisto uses "pseudo-alignment" and the latest version of Salmon uses something called "selective-alignment", which is somewhere between what STAR and Kallisto do (as I understand it), although older versions used something called quasi-alignment.

Pros and cons of Kallisto and Salmon:

Pros:
- They are much quicker and less memory intensive than STAR + stand-alone quantification.
- They give transcript-level expression information (whereas STAR + counting only gives gene-level, although STAR + RSEM gives transcript-level).
- They gracefully deal with cases where reads map to multiple transcripts or genes.

Normally Salmon and Kallisto map reads only to the transcriptome (the sequences of the transcripts a cell produces) rather than to the genome.

Cons:
- They have a different set of false positives and negatives than alignment-based approaches.
- Their results are only as good as the transcript annotation that is input into them. They can't quantify genes or splice-variants that are not in their input. Differences between the real transcriptome and your annotation of the transcriptome will reduce accuracy.
- They cannot be used for finding new genes, transcripts or splice forms, or for any analysis other than quantification.

GenoMax · Accepted Answer · 2019-09-24

STAR is a general purpose aligner. It can align reads to a genome or transcriptome, or whatever you provide. It can perform spliced alignment, and returns base-level alignments for the reads. These alignments can then be used for different purposes like being fed to transcript assembly tools, variant calling pipelines, or transcript quantification tools (e.g. you can use STAR to provide alignments to salmon).

The other tools are both transcript quantification tools. They map reads to the transcriptome and apply statistical inference to determine transcript abundances. They serve a different general purpose compared to STAR. It's worth noting that STAR has the ability to count reads mapping to different genes, and so can be used to estimate gene-level abundances, but it has no statistical model and so doesn't infer transcript abundance or deal with gene-multimapping reads in a principled manner.

score 9 · Accepted Answer · 2019-09-25

Regarding the pros and cons of the programs, you might find this article useful: https://www.nature.com/articles/s41598-017-01617-3 It finds that kallisto and Salmon produce near-identical results, and that STAR (with HTSeq for producing gene counts) is less accurate (due to some of the reasons explained above). In terms of speed/memory requirements, the difference between programs is substantial. In recent benchmarking of kallisto vs. STAR on workflows for single-cell RNA-seq https://www.biorxiv.org/content/10.1101/673285v2 we found that kallisto was 2.6 times faster than STAR. More importantly, kallisto used much less memory, in some cases 15x less RAM than STAR. This makes it possible to run kallisto on a laptop rather than a server, and facilitates reproducible workflows.