How to choose parameters for kallisto single-end mode?
4 months ago
bioinfo ▴ 50

Hello,

I have some FASTQ files produced on a NextSeq. The average fragment length is between 350 and 400 bp. The library prep adds 139 bp of adapters, so the inserts are roughly 200-250 bp.

Would the kallisto command look like this?

kallisto quant -i index -o output --single -l 350 -s 50 R1.fastq.gz --rf-stranded


Thank you

kallisto RNA-seq

My hunch is that the fragment length only matters when the run is in paired-end mode; in single-end mode I don't see how it would have any effect.


The kallisto manual says that you need to specify both -l and -s in single-end mode. I am just not sure whether the values I chose are correct.

4 months ago

I commented on this post earlier, but that comment turns out to be not quite right.

I was told by other sources that kallisto applies a correction based on the fragment length, essentially ignoring alignments that do not "fit" within the transcript once the whole fragment is considered.

If so, the value should reflect the original fragment length corresponding to the biologically relevant template and not the artificial construct with ligated adapters.
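The adapter subtraction can be sketched in a few lines of Python. The function name `insert_range` is hypothetical; the 350-400 bp library size and the 139 bp adapter total are the figures given in the question.

```python
def insert_range(library_min, library_max, adapter_total):
    """Return (min, max, midpoint) of the insert size once adapters are removed.

    Assumes adapter_total is the combined length of all ligated adapter
    sequence, as stated in the question (139 bp).
    """
    lo = library_min - adapter_total
    hi = library_max - adapter_total
    return lo, hi, (lo + hi) / 2


lo, hi, mid = insert_range(350, 400, 139)
print(f"insert size: {lo}-{hi} bp, midpoint {mid} bp")
```

With these numbers the insert range comes out at 211-261 bp, a little above the rounded 200-250 bp quoted in the question, so the value passed to -l is only ever an approximation.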


TPM values (which kallisto reports) require fragment length information. This is because, in TPM, counts are divided by the "effective length" rather than the actual transcript length. The effective length is the number of positions at which a fragment can start along a given transcript.
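As a rough illustration of why the effective length enters the calculation, here is a minimal sketch of the usual TPM definition. The function name `tpm` and the toy numbers are made up for illustration; this is not kallisto's own code.

```python
def tpm(counts, eff_lengths):
    """Convert per-transcript counts to TPM using effective lengths.

    Each count is divided by its transcript's effective length, and the
    resulting rates are rescaled so that they sum to one million.
    """
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all rates are zero; TPM is undefined")
    return [r / total * 1e6 for r in rates]


# Toy example: equal counts, but the second transcript is twice as long,
# so it receives half the TPM of the first.
values = tpm([100, 100], [1000, 2000])
print(values)
```

Because the effective length sits in the denominator, changing -l shifts the TPM of short transcripts far more than that of long ones.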

This is the primary reason kallisto requires fragment length information (for paired-end data, fragment length information is inferred automatically from your reads).


I always thought the effective length is computed based on read lengths (subtracting half the read length from each end).

As you and others pointed out, there is a second effective length correction at the 3' end based on fragment size.

I think this correction could have a more substantial effect in some circumstances, for example when the transcripts are incompletely characterized, which is very common for many organisms: we know the coding regions but not the full transcripts. In those cases it could be better to "lie" about the fragment size and claim that it is shorter or has a larger standard deviation, so that all the data can be used.


The effective length equals the transcript length minus the insert size, plus 1.

If transcripts don't get fragmented, then it's equal to transcript length minus read length + 1.
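A minimal sketch of that formula, to show how much the choice of -l matters. The function name `effective_length` is hypothetical, and clamping at zero for transcripts shorter than the fragment is my own assumption; kallisto works from a fragment length distribution and may differ in detail.

```python
def effective_length(transcript_len, mean_fragment_len):
    """Simple effective length: transcript length - fragment length + 1.

    Clamped at 0 (an assumption made for this sketch) so that transcripts
    shorter than the fragment do not get a negative length.
    """
    return max(transcript_len - mean_fragment_len + 1, 0)


# A 1000 bp transcript with the insert size (225) versus the
# adapter-inclusive library size (350) from the question.
print(effective_length(1000, 225))
print(effective_length(1000, 350))
```

For a 1000 bp transcript the two choices give 776 and 651 respectively, so counting the adapters shrinks the effective length noticeably for shorter transcripts.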

Reference: Pachter's 2011 arXiv paper.


Thank you. Does that mean that I should use -l 225 -s 25 as parameters?


Yes, if you are sure that that is the insert length (the adapters should not be counted).
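Putting the thread together, the command could be assembled like this. This is only a sketch: the helper name `kallisto_single_end_cmd` is hypothetical, the file names are the placeholders from the question, and the flags are the ones already used there.

```python
import shlex


def kallisto_single_end_cmd(index, out_dir, fastq, frag_len, frag_sd,
                            strand_flag=None):
    """Build a `kallisto quant` argument list for single-end reads.

    frag_len and frag_sd should describe the insert (adapters excluded).
    """
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir,
           "--single", "-l", str(frag_len), "-s", str(frag_sd)]
    if strand_flag:
        cmd.append(strand_flag)  # e.g. "--rf-stranded", as in the question
    cmd.append(fastq)
    return cmd


cmd = kallisto_single_end_cmd("index", "output", "R1.fastq.gz",
                              225, 25, "--rf-stranded")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if kallisto is on the PATH; printing it first makes it easy to check the -l and -s values before running.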