Question

kallisto: strand-specific and fragment length calculation

1

7.2 years ago

user230613 ▴ 380

Hi!

I'm starting to use kallisto to do transcript-level expression quantification. I have some questions:

1) Does kallisto infer the strandedness of the input data the way salmon does (--libType A)? I guess the answer is no.

2) On the other hand, kallisto has the following two options:

--fr-stranded             Strand specific reads, first read forward
--rf-stranded             Strand specific reads, first read reverse

Do these options only work for PE data?

3) Regarding the fragment length estimation when using SE datasets:

-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)

What does DOUBLE mean? Do we have to specify double the calculated value?

Thank you in advance

RNA-Seq kallisto • 10k views

Updated 7.2 years ago by pmelsted ▴ 120 • written 7.2 years ago by user230613 ▴ 380

2

7.2 years ago

pmelsted ▴ 120

1) No. If you are unsure, I would recommend BLATing a few reads to see which strand they map to. Also, if you happen to choose the wrong option, you'll have significantly fewer reads mapping.
2) These options work for both SE and PE data.
3) As pointed out by Devon, this means that it accepts a floating-point value or an integer. The -l and -s parameters are required for SE data and describe the fragment length distribution (mean and standard deviation); for PE data they can be estimated from the paired reads. Typical values for RNA-Seq are -l 200 and -s 30.

score 2 · Accepted Answer · 2017-05-15
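
The remark in this answer that the wrong strand option leaves significantly fewer reads mapping suggests a simple empirical check: quantify a subsample under each setting and compare the pseudoalignment rates. Below is a minimal Python sketch of that comparison. It assumes each run's `run_info.json` has already been loaded into a dict and that it exposes `n_processed` and `n_pseudoaligned` (treat these key names as an assumption to verify against your kallisto version). The helper name and the counts are made up for illustration.

```python
def pick_strand_setting(runs):
    """Given {label: run_info dict}, return (label with highest rate, all rates).

    Each run_info dict is assumed to carry 'n_processed' and
    'n_pseudoaligned' counts (an assumption about kallisto's run_info.json).
    """
    rates = {}
    for label, info in runs.items():
        processed = info["n_processed"]
        # Guard against an empty run to avoid dividing by zero.
        rates[label] = info["n_pseudoaligned"] / processed if processed else 0.0
    best = max(rates, key=rates.get)
    return best, rates


# Hypothetical counts from two test runs on the same subsample of reads.
example_runs = {
    "--fr-stranded": {"n_processed": 100000, "n_pseudoaligned": 4200},
    "--rf-stranded": {"n_processed": 100000, "n_pseudoaligned": 87500},
}

best, rates = pick_strand_setting(example_runs)
for label, rate in rates.items():
    print(f"{label}: {rate:.1%} pseudoaligned")
print("Higher rate:", best)
```

If both stranded settings give similar rates, the library is probably unstranded and neither flag is likely to be appropriate.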

2

7.2 years ago

Devon Ryan 104k

1) No.
2) They should work for SE data too (never tried, though). You probably want --rf-stranded for anything remotely recent.
3) An example of a double is 200.0 or 123.4, that is, any number with a decimal point. The documentation there should really be changed, since I don't expect those without C/C++/etc. programming experience to know that "double" means "double-precision floating-point value" (or what that even means).
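
To illustrate the point about DOUBLE: it names the data type of the argument, not a multiplier. Here is a small Python sketch of how such an option value is typically parsed. Python's `float` is a double-precision value, and the helper name is hypothetical rather than anything taken from kallisto.

```python
def parse_double(text):
    """Parse a command-line value the way a DOUBLE option would: as a float."""
    return float(text)


# An integer-looking value and a decimal value are both accepted.
for raw in ("200", "200.0", "123.4"):
    print(raw, "->", parse_double(raw))
```

So `-l 200` and `-l 200.0` describe the same fragment length; nothing is doubled.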

1

Just to add to the answer, there is an option for SE data (--single).

7.2 years ago by biofalconch ★ 1.1k
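
Putting the pieces of this thread together for single-end data, here is a hedged Python sketch that assembles a `kallisto quant` argument list with `--single`, `-l`, `-s` and an optional strand flag. The function name and file paths are placeholders. The defaults of 200 and 30 are the typical values quoted in the accepted answer, not a recommendation for any particular library.

```python
def kallisto_single_end_cmd(index, out_dir, fastq,
                            frag_len=200.0, frag_sd=30.0, strand=None):
    """Build the argument list for a single-end `kallisto quant` run.

    strand: None (unstranded), "fr" (--fr-stranded) or "rf" (--rf-stranded).
    """
    if strand not in (None, "fr", "rf"):
        raise ValueError("strand must be None, 'fr' or 'rf'")
    if frag_len <= 0 or frag_sd <= 0:
        raise ValueError("fragment length and sd must be positive")

    cmd = ["kallisto", "quant",
           "-i", index,
           "-o", out_dir,
           "--single",             # single-end mode
           "-l", str(frag_len),    # mean fragment length (required with --single)
           "-s", str(frag_sd)]     # sd of fragment length (required with --single)
    if strand is not None:
        cmd.append(f"--{strand}-stranded")
    cmd.append(fastq)
    return cmd


# Placeholder file names, for illustration only.
cmd = kallisto_single_end_cmd("transcripts.idx", "quant_out",
                              "reads.fastq.gz", strand="rf")
print(" ".join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
# (requires kallisto on PATH and real input files)
```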

0

Sorry, I have another question: "fragment-length" is not the same as read length, is it? I mean, it can't be inferred from the input SE FASTQ files.

7.2 years ago by user230613 ▴ 380

0

Correct, fragment length refers to the length of the fragments loaded onto the sequencer. If this is your own dataset, then either you or whoever did the sequencing should know this (it can be estimated from a bioanalyzer plot). If this is a public dataset, then hopefully the value is written down somewhere.

7.2 years ago by Devon Ryan 104k
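
As a rough illustration of turning a size profile into the -l and -s values, the following Python sketch computes a signal-weighted mean and standard deviation from (size, signal) pairs, such as points read off a trace. The input numbers and the `adapter_bp` parameter are assumptions for illustration. Whether and how much adapter length to subtract depends on the library prep, so treat that as something to check rather than a fixed rule.

```python
import math


def fragment_stats(profile, adapter_bp=0.0):
    """Return (mean, sd) of fragment length from (size_bp, signal) pairs.

    adapter_bp is a hypothetical correction subtracted from the mean when the
    measured sizes include adapters; it defaults to no correction.
    """
    total = sum(signal for _, signal in profile)
    if total <= 0:
        raise ValueError("profile must contain positive signal")
    mean = sum(size * signal for size, signal in profile) / total
    variance = sum(signal * (size - mean) ** 2 for size, signal in profile) / total
    return mean - adapter_bp, math.sqrt(variance)


# Made-up profile: most signal near 200 bp.
profile = [(150, 1.0), (200, 2.0), (250, 1.0)]
mean_len, sd_len = fragment_stats(profile)
print(f"-l {mean_len:.1f} -s {sd_len:.1f}")
```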

0

Hello

Sorry, I am a little confused by you saying that --rf-stranded is most likely the most appropriate option. For SE data, wouldn't you want to only process reads that align to the forward strand of the transcript?

Or have I made an error here?

6.5 years ago by Thomas ▴ 160

0

It doesn't matter whether you sequence SE or PE: for recent (since ~2013) data, read #1 aligns with the opposite orientation of the originating fragment. In a parlance that many prefer, read #1 should align to the opposite strand of the transcript/gene.

6.4 years ago by Devon Ryan 104k

0

By originating fragment, do you mean the transcriptome or genome sequences?

6.4 years ago by elzedleeu ▴ 20

1

Either way. If you align to the transcriptome then read #2 should always be aligned as its reverse complement.

6.4 years ago by Devon Ryan 104k

0

One more question.

The RSeQC package outputting "1+-,1-+,2++,2--" basically means that read #2 'sets' the strand, since it aligns to the same strand as the transcript/gene. Thus, read #1 aligns to the opposite strand of the transcript/gene (i.e. it is reverse-complemented).

For this library type (apparently the most common nowadays), --rf-stranded should be the option to use in 'kallisto quant' for abundance estimation against a reference transcriptome. Is that right?

The link below has confused me in this respect, and I just wanted to be sure:

https://github.com/griffithlab/rnaseq_tutorial/blob/master/manuscript/supplementary_tables/supplementary_table_5.md

5.1 years ago by kiran7 • 0

0

Correct, TruSeq is the most common and it's --rf-stranded (if that's wrong, you'll be able to tell from the terrible quantitation metrics).

5.1 years ago by Devon Ryan 104k
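
The mapping discussed in the last exchange, from an RSeQC strand report to a kallisto option, can be sketched in Python as follows. This is an illustrative helper, not part of RSeQC or kallisto: the function name, the 0.8 cutoff and the sample report are all made up, and the report layout is assumed to follow the "Fraction of reads explained by ..." lines quoted above.

```python
import re

# Patterns where read #1 (or the only read) matches the transcript strand.
FORWARD = {"1++,1--,2+-,2-+", "++,--"}
# Patterns where read #1 (or the only read) is on the opposite strand.
REVERSE = {"1+-,1-+,2++,2--", "+-,-+"}

LINE = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')


def suggest_kallisto_strand(report, threshold=0.8):
    """Return '--fr-stranded', '--rf-stranded', or None (no strand flag).

    threshold is an arbitrary cutoff chosen for this sketch.
    """
    fwd = rev = 0.0
    for pattern, fraction in LINE.findall(report):
        if pattern in FORWARD:
            fwd = float(fraction)
        elif pattern in REVERSE:
            rev = float(fraction)
    if rev >= threshold:
        return "--rf-stranded"
    if fwd >= threshold:
        return "--fr-stranded"
    return None  # neither orientation dominates: treat as unstranded


# Made-up report text in the assumed layout.
report = '''This is PairEnd Data
Fraction of reads failed to determine: 0.0123
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0211
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9666
'''
print(suggest_kallisto_strand(report))
```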