Question: kallisto: strand-specific and fragment length calculation
3.6 years ago, user230613 (Europe) wrote:

Hi!

I'm starting to use kallisto to do transcript-level expression quantification. I have some questions:

1) Does kallisto infer the strandedness of the input data the way salmon does (--libType A)? I guess the answer is no.

2) On the other hand, kallisto has the following two options:

--fr-stranded             Strand specific reads, first read forward
--rf-stranded             Strand specific reads, first read reverse

Do these options only work for paired-end (PE) data?

3) Regarding the fragment length estimation when using SE datasets:

-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)

What does DOUBLE mean? Do we have to specify double the calculated number?

Thank you in advance

3.6 years ago, Devon Ryan (Freiburg, Germany) wrote:
  1. No
  2. They should work for SE data too (never tried, though). You probably want --rf-stranded for anything remotely recent.
  3. An example of a double is 200.0 or 123.4. That is, any number with a decimal point. The documentation there should really be changed, since I don't expect those without C/C++/etc. programming experience to know that "double" means "double precision floating point value" (or what that even means).

Just to add to the answer, there is an option for SE data (--single).

(written 3.6 years ago by biofalconch)

Sorry, I have another question: "fragment-length" is not the same as read length, is it? I mean, it can't be inferred from the input SE fastq files.

(written 3.5 years ago by user230613)

Correct, fragment length refers to the length of the fragments loaded onto the sequencer. If this is your own dataset, then either you or whoever did the sequencing should know this (it can be estimated from a bioanalyzer plot). If this is a public dataset, then hopefully the value is written down somewhere.

(written 3.5 years ago by Devon Ryan)
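To make the "estimate it from a Bioanalyzer plot" step concrete, here is a minimal Python sketch that turns a size trace into the mean and standard deviation that -l and -s expect. The trace values, the function name fragment_stats and the adapter_length parameter are all made up for illustration. A trace measured on the finished library may include adapter sequence, so subtracting an adapter length may or may not apply to your data.

```python
import math

def fragment_stats(size_bins, adapter_length=0.0):
    """Signal-weighted mean and SD of fragment length.

    size_bins: iterable of (size_in_bp, signal) pairs read off a size trace.
    adapter_length: hypothetical correction, subtracted from the mean only.
    """
    total = sum(signal for _, signal in size_bins)
    if total <= 0:
        raise ValueError("signal must sum to a positive value")
    mean = sum(size * signal for size, signal in size_bins) / total
    variance = sum(signal * (size - mean) ** 2 for size, signal in size_bins) / total
    return mean - adapter_length, math.sqrt(variance)

# Hypothetical trace, not real data.
trace = [(150, 1.0), (200, 2.0), (250, 1.0)]
mean_len, sd_len = fragment_stats(trace)
print(f"-l {mean_len:.1f} -s {sd_len:.1f}")
```

The two numbers are only as good as the trace they come from, so treat them as a starting estimate rather than a measured property of the reads.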

Hello

Sorry, I am a little confused by you saying that --rf-stranded is most likely the most appropriate option. For SE data, wouldn't you want to only process reads that align to the forward strand of the transcript?

Or have I made an error here?

(written 2.8 years ago by Thomas)

It doesn't matter whether you sequence SE or PE: for recent (since ~2013) data, read #1 in a pair aligns with the opposite orientation of the originating fragment. In a parlance that many prefer, read #1 should align to the opposite strand of the transcript/gene.

(modified 2.7 years ago, written 2.8 years ago by Devon Ryan)
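The orientation described above can be illustrated with a toy check in Python: does a read match the transcript as-is, or only as its reverse complement? This is a sketch with a made-up transcript and exact substring matching, not how kallisto or an aligner works internally. The helper names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))

def read_orientation(read, transcript):
    """Classify a read as 'forward', 'reverse' or 'unmapped' by exact match."""
    read, transcript = read.upper(), transcript.upper()
    if read in transcript:
        return "forward"
    if reverse_complement(read) in transcript:
        return "reverse"
    return "unmapped"

# Made-up transcript; the reads simulate a library where read #1 is antisense.
transcript = "ATGGCCTTAGACCGTTGAAC"
read1 = reverse_complement(transcript[5:15])  # antisense to the transcript
read2 = transcript[2:12]                      # sense, same strand as transcript

print(read_orientation(read1, transcript))
print(read_orientation(read2, transcript))
```

If most first reads (or the only reads, for SE data) come out as "reverse" against the transcript sequences, that is the orientation --rf-stranded describes.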

By originating fragment, do you mean the transcriptome or genome sequences?

(written 2.7 years ago by elzedleeu)

Either way. If you align to the transcriptome then read #2 should always be aligned as its reverse complement.

(written 2.7 years ago by Devon Ryan)

One more question.

When the RSeQC package reports "1+-,1-+,2++,2--", it basically means that read #2 'sets' the strand, since it aligns to the same strand as the transcript/gene. Thus, read #1 aligns to the opposite strand of the transcript/gene (i.e. it is reverse-complemented).

For this library type (apparently the most common nowadays), --rf-stranded should be the parameter to use in 'kallisto quant' for abundance estimation against a reference transcriptome. Is that right?

The link below has confused me in this respect, and just wanted to be sure:

https://github.com/griffithlab/rnaseq_tutorial/blob/master/manuscript/supplementary_tables/supplementary_table_5.md

(written 17 months ago by kiran)

Correct, TruSeq is the most common and it's --rf-stranded (if that's wrong, you'll be able to tell from the terrible quantitation metrics).

(written 17 months ago by Devon Ryan)
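As a sketch of the mapping discussed in this exchange, the Python snippet below reads the "Fraction of reads explained by" lines from an RSeQC infer_experiment.py report and suggests a kallisto flag. Only the "1+-,1-+,2++,2--" to --rf-stranded pairing comes from the thread. The pairing of the other pattern with --fr-stranded, the single-end keys, the 0.8 cutoff and the function name are assumptions, and the report text is invented.

```python
import re

# Matches lines such as: Fraction of reads explained by "1+-,1-+,2++,2--": 0.95
PATTERN = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')

# Assumed mapping; the single-end keys ("+-,-+" and "++,--") are an assumption.
RF_KEYS = {"1+-,1-+,2++,2--", "+-,-+"}
FR_KEYS = {"1++,1--,2+-,2-+", "++,--"}

def kallisto_strand_flag(report, min_fraction=0.8):
    """Suggest a kallisto strand flag, or None if the library looks unstranded.

    min_fraction is an arbitrary cutoff chosen for this sketch.
    """
    fractions = {key: float(value) for key, value in PATTERN.findall(report)}
    rf = sum(v for k, v in fractions.items() if k in RF_KEYS)
    fr = sum(v for k, v in fractions.items() if k in FR_KEYS)
    if rf >= min_fraction:
        return "--rf-stranded"
    if fr >= min_fraction:
        return "--fr-stranded"
    return None

# Invented report text, not real results.
report = '''This is PairEnd Data
Fraction of reads failed to determine: 0.0200
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0300
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9500
'''
print(kallisto_strand_flag(report))
```

A result of None means neither orientation dominates, in which case neither strand flag would be passed.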
3.6 years ago, pmelsted (United States) wrote:
  1. No. If you are unsure, I would recommend BLATing a few reads to see which strand they map to. Also, if you happen to choose the wrong option, you'll have significantly fewer reads mapping.

  2. This works for SE and PE data.

  3. As pointed out by Devon, this means that it accepts a floating point value or an integer. The -l and -s parameters are required for SE data and describe the fragment length distribution; for PE data they can be estimated from the paired reads. Typical values for RNA-Seq are -l 200 and -s 30.
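
Putting the three points together, a single-end run could be assembled roughly as in the Python sketch below. It only builds and prints the command line; it does not run kallisto. The file names (transcripts.idx, out, reads.fastq.gz) and the helper name kallisto_quant_cmd are hypothetical, and the -l 200 / -s 30 values are the typical ones quoted above, not a recommendation for any particular dataset.

```python
import shlex

def kallisto_quant_cmd(index, out_dir, fastqs, single=False,
                       fragment_length=None, sd=None, strand=None):
    """Build an argument list for 'kallisto quant' (sketch, not exhaustive)."""
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir]
    if strand is not None:
        if strand not in ("--fr-stranded", "--rf-stranded"):
            raise ValueError("strand must be --fr-stranded or --rf-stranded")
        cmd.append(strand)
    if single:
        # With --single the fragment length cannot be estimated from pairs.
        if fragment_length is None or sd is None:
            raise ValueError("-l and -s are required with --single")
        # float() accepts 200, 200.0 or "123.4": a DOUBLE is just a number.
        cmd += ["--single", "-l", str(float(fragment_length)), "-s", str(float(sd))]
    cmd += list(fastqs)
    return cmd

cmd = kallisto_quant_cmd("transcripts.idx", "out", ["reads.fastq.gz"],
                         single=True, fragment_length=200, sd=30,
                         strand="--rf-stranded")
print(" ".join(shlex.quote(part) for part in cmd))
```

For paired-end input the same helper would be called with two fastq files and without single=True, leaving -l and -s to be estimated from the pairs.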
