What does the fragment length (-l) argument in Kallisto actually do?
1
3
4 months ago

What is the effect of setting "--fragment-length" (-l) either too low or too high for Kallisto single-end quantification and how could this affect your conclusions?

I've seen some variations of this question on here, but not a clear explanation for what the fragment length argument (which is required for running Kallisto on single-end data) actually does and how it affects quantification.

Some background/motivation: as others have noted, it is sometimes difficult to determine what this length should be, especially for data that you didn't generate yourself. I'm wary of just guessing a number for it, because in my experience changing the fragment length can produce very different quantification results, and I'm not sure how to interpret them. For example, in one experiment where there was ribosomal RNA contamination in an NEBNext library, using the Bioanalyzer-derived fragment size (~300 bp) gave an rRNA contamination content of about 10% of the reads, but setting the fragment length to 1 gave about 30%, more similar to what I got by mapping the reads with bowtie. How is this happening?

The reason I tried -l 1 is because I found a paper using QuantSeq data which used a fragment length of 1 (-l 1 -s 1) on their data, but I can no longer find that reference unfortunately. What is the effect of setting the fragment length to 1 and could this lead to incorrect conclusions?

Kallisto Pseudoalignment RNA-seq • 1.4k views
0

Thanks, I've already seen this post (and many others), but it doesn't answer my question. I'm not asking how to choose a value; I'm asking what it does. How is the fragment length actually used by the software, and what is the effect of choosing different fragment lengths to analyze the same data? I will try to find some time today to do a bit of analysis and post it here, but ideally I'm hoping someone can explain the actual algorithmic use of the fragment length in kallisto.

1

In single-end mode, in addition to determining how transcript length should be adjusted to arrive at the effective length, the --fragment-length parameter determines which pseudoalignments should be allowed and which should be discarded.

The -l and -s parameters define the expected distribution of fragment lengths, with the value passed to -l being the mean of this distribution. Let μ be the user-provided mean. When mapping reads in single-end mode, kallisto will discard alignments where the mapping position of the read is fewer than μ bases from the end of the transcript. In that case, any fragment starting at this position must be shorter than μ, and under the model the probability of choosing a fragment of length < μ is < 1/2, so the alignment is discarded from consideration. If all pseudoalignments of a read map in this manner, the read goes unmapped. This means the value provided to this parameter affects not only the effective transcript lengths, but also which pseudoalignments are considered during quantification.
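To make the rule concrete, here is a minimal Python sketch of the discard logic as described above (my own illustrative code and names, not kallisto's actual source):

```python
# Illustrative sketch (not kallisto source code) of the single-end discard
# rule described above: a pseudoalignment is dropped when the read maps
# fewer than `mean_frag_len` (-l) bases from the transcript's end.

def keep_pseudoalignment(read_start: int, transcript_len: int,
                         mean_frag_len: float) -> bool:
    """Return True if a read starting at 0-based `read_start` on a
    transcript of length `transcript_len` is kept under the rule."""
    # The longest fragment that could begin at this position:
    max_possible_fragment = transcript_len - read_start
    # If even that is shorter than the mean -l, the model assigns the
    # implied fragment a probability below 1/2, so the mapping is dropped.
    return max_possible_fragment >= mean_frag_len

# A read starting 100 bases from the end of a 1000 nt transcript:
print(keep_pseudoalignment(900, 1000, 240))  # False: dropped with -l 240
print(keep_pseudoalignment(900, 1000, 1))    # True: kept with -l 1
```

This is also why -l 1 effectively disables the filter: almost no mapping can start fewer than 1 base from the transcript end.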

1

I think this is right, though I don't quite understand the part about the probability of choosing a fragment of length < µ being < 1/2. I'm thinking that they must calculate the probability of the read matching the transcript from both the -l and -s parameters, i.e. some kind of Gaussian-derived probability? But you are definitely right that the -l parameter affects not only the effective length but also the pseudoalignment! I wasn't sure about this, which is part of why I asked the question. I've now done some empirical testing, which I'll post.

0

-l and -s are both required to generate a Gaussian distribution, which kallisto then truncates (e.g. everything in the distribution longer than 1000 bp is discarded) before calculating the mean of the truncated distribution.
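As a numerical illustration of that idea, here is a hedged Python sketch (my own code using numeric integration; the effective-length formula `L - mean + 1` is a common convention in this literature, not lifted from kallisto's source):

```python
import math

# Sketch of the idea above: take a Gaussian(-l, -s), truncate it (here to
# [0, min(1000, transcript length)]), and use the truncated mean to compute
# the effective length. Illustrative only, not kallisto's actual code.

def truncated_gaussian_mean(mu: float, sigma: float, lo: float, hi: float,
                            steps: int = 100_000) -> float:
    """Mean of a Gaussian(mu, sigma) truncated to [lo, hi] (midpoint rule)."""
    total_p = total_x = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * (i + 0.5) / steps
        p = math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        total_p += p
        total_x += x * p
    return total_x / total_p

def effective_length(transcript_len: int, mu: float, sigma: float) -> float:
    """Common convention: effective length = L - truncated mean + 1."""
    m = truncated_gaussian_mean(mu, sigma, 0.0, min(1000.0, transcript_len))
    return transcript_len - m + 1

print(effective_length(1000, 240, 110))  # well below 1000
print(effective_length(1000, 1, 1))      # close to the full length
```

Note how -l 1 makes the effective length nearly equal to the full transcript length, which matters for the TPM computation discussed below in the thread.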

0

You can disable the feature where kallisto discards pseudoalignments based on fragment lengths (it's disabled in "kallisto bus" mode for bustools and it can be disabled in "kallisto quant" via the --single-overhang option).

0

dsull may be able to address this specific request. That sort of detail may require a dive into code (if you are able to).

0
4 months ago

As answered by bioinfoguy, the fragment length (-l) affects both the effective length (used in the TPM calculation) and which transcripts the read can be assigned to.

I've now had the chance to revisit some old data to show this in action. The data come from an RNA-seq library with a Bioanalyzer-determined fragment length of 240 nt and an SD of 110 nt. I ran Kallisto on it in two different ways:

- first way: -l 240 -s 110 (using the Bioanalyzer lengths)
- second way: -l 1 -s 1 (pretending the fragment length is 1)

I added a pseudocount of 0.01 to the counts to allow log ratio calculation for all values, and I'm only showing transcripts detected in at least one of the analyses. I'm only showing the effect for transcripts under 400 nt to make it easier to see.
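For reference, the pseudocount/log-ratio comparison can be done along these lines (my own helper, assuming the estimated counts have already been read from each run's output):

```python
import math

# Hypothetical helper for the comparison described above: add a pseudocount
# of 0.01 to the estimated counts from each run, then take the log2 ratio.
def log2_ratio(counts_a, counts_b, pseudo=0.01):
    return [math.log2((a + pseudo) / (b + pseudo))
            for a, b in zip(counts_a, counts_b)]

# Two made-up transcripts: one gains counts with -l 1, one drops to zero.
ratios = log2_ratio([30.0, 0.0], [10.0, 5.0])
print(ratios)  # first value positive, second strongly negative
```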

You can see quite clearly that setting -l 1 increases the counts assigned to short transcripts. At and above 240 nt, some transcripts actually get fewer counts with -l 1, which I assume is due to competition with short transcripts (reads that would have been assigned to these >= 240 nt transcripts with -l 240 might now be assigned to shorter transcripts with -l 1).

One interesting thing to note is that this sample is a total RNA sample with rRNA depletion performed only for the longer rRNA transcripts (18S and 28S in Drosophila). So I know that this sample contains both 5S and 5.8S rRNA contamination (shown on the graph as "small rRNAs"). These transcripts are <150 nt long. Kallisto does not find them in the library with -l set to the Bioanalyzer-derived fragment length of 240 nt, even though I'm quite sure they should be there. I suppose it shouldn't matter for downstream differential expression analyses as long as you use the same protocol for all your samples, but I was surprised that changing the -l parameter had such a large effect.

I don't know exactly how this works mathematically, so if someone wants to elaborate further, feel free. It's quite interesting! I'm also curious why you don't need to specify fragment length for Salmon, which I thought worked almost the same way as Kallisto.

1

Salmon has a default fragment length hard-coded for single-end data, which is used unless overridden by the user.

1

Also, salmon will not discard alignments due to the fragment length distribution parameters. Rather, it will just adjust the assignment probability based on the implied fragment length and that length's probability under the model.

1

Re: salmon, salmon has the '--fldMean' and '--fldSD' options for single-end reads.

If you give -l 1, your effective lengths will be larger (which you can see by inspecting the abundance.tsv file), so when it comes to the TPM computation, counts are divided by a larger denominator.

The reason estimated counts themselves change is simply because the likelihood function optimized by the EM algorithm uses effective lengths.

For single-cell UMI data when we don't care about effective length normalization, the effective length is always 1.
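A small sketch of how effective length enters the TPM formula (this is the standard counts-per-effective-length normalization; the effective-length numbers below are made up for illustration):

```python
# Standard TPM computation: scale counts by effective length, then
# normalize so the values sum to one million.
def tpm(est_counts, eff_lengths):
    rates = [c / l for c, l in zip(est_counts, eff_lengths)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

counts = [100.0, 100.0]               # same estimated counts in both runs
print(tpm(counts, [61.0, 761.0]))     # hypothetical effective lengths, -l 240
print(tpm(counts, [299.0, 999.0]))    # hypothetical effective lengths, -l 1
# With -l 1 the short transcript's effective length grows the most, so its
# TPM share shrinks relative to the -l 240 run.
```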

0

Thanks for explaining! So I have to admit now that I'm struggling to understand the advantage of using the empirically determined fragment length instead of -l 1 (to force it to consider all the data this way). I get that doing that will not give you TPM using the correct effective length, but it seems like the estimated counts might be more accurate. At least in my case, I'm pretty sure that these libraries have 5S and 5.8S rRNAs in them at high levels, so using the empirically determined mean of 240 nt seems to be discarding data.

0

So, a few issues here: using length normalization is necessary to prevent super long transcripts from getting inflated abundances (which would happen if you compared raw estimated counts).

As for why we use effective length rather than actual length, a few good readings:

http://robpatro.com/blog/?p=235 - A good blog post from the salmon author

https://arxiv.org/abs/1104.3889 - A good paper from the kallisto author

As for discarding pseudoalignments (which is a separate issue altogether from effective lengths), you can disable that (see above).
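As a toy illustration of the first point above (why raw counts inflate long transcripts; all numbers are made up):

```python
# Two transcripts at equal molar abundance, differing only in length.
# Under uniform fragment sampling, the number of positions a 240 nt
# fragment can start at is proportional to (length - 240 + 1), so the
# longer transcript collects more reads despite equal abundance.
lengths = [500, 1000]
frag = 240
expected_reads = [l - frag + 1 for l in lengths]
normalized = [r / (l - frag + 1) for r, l in zip(expected_reads, lengths)]
print(expected_reads)  # [261, 761]: raw counts favor the long transcript
print(normalized)      # [1.0, 1.0]: dividing by effective length fixes it
```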

0

Thank you, sorry I missed your earlier comment about the --single-overhang option above. I think I will probably use that from now on to avoid this issue. Unless you can suggest any reason why that's not advisable / why it's not the default in normal kallisto use.

0

Hello and thank you for this extremely helpful question! Would you or anyone have any further insights on when using the --single-overhang option is or isn't advisable? I have played around with it in my own data and it increases the percent of reads pseudoaligned by ten percent. In addition, I wanted to verify (sorry if this is a newbie question): kallisto's fragment length refers to the length of the insert without the adapters added during library prep, correct? I assume the adapters are not included in its computation and their length should be subtracted? Thank you.

0

Generally, I haven't observed much of a difference between having that option on vs. off, but if you observe something funky, then enable single-overhang.

When it's turned off, for paired-end reads where only one of the mates maps, kallisto gets the position of the mapping along the transcript it is mapped to. If that position plus the average fragment length exceeds the length of the transcript, the mapping is considered "out of bounds" and discarded. This is supposed to increase the specificity of kallisto's mapping.

When the single overhang option is turned on, this procedure is not done.

With reads that can span the full length of transcripts, disabling single overhang (i.e. enabling the procedure above) may improve your mapping specificity a small amount. However, for short transcripts this can be problematic, since any k-mer mapping is necessarily going to be near the end of the transcript.

I personally (at least recently) usually use the single overhang option (i.e. I don't use the procedure described above), because fragment length is somewhat variable, and a mapping near the end of the transcript may have occurred simply because there was a very short fragment in the library, or because the transcript is short, or because your dataset is 3' biased, or because of the polyA tail, or because something was attached to the end of the transcript prior to fragmentation (e.g. the RT primer). Enabling --single-overhang makes kallisto more generalizable across different library preps.

When using kallisto+bustools, --single-overhang is always enabled.

0

And yes, fragment length is without adapters.