How to choose parameters for kallisto single-end mode?
4 months ago
bioinfo ▴ 50

Hello,

I have some FASTQ files produced on a NextSeq. The average fragment length is between 350 and 400 bp. The library prep adds 139 bp of adapters, so the inserts are roughly 200-250 bp.

Would the kallisto command look like this?

kallisto quant -i index -o output --single -l 350 -s 50 R1.fastq.gz --rf-stranded

Thank you

kallisto RNA-seq • 1.0k views

My hunch is that the fragment length only matters in paired-end mode. In single-end mode I don't see how it would have any effect.


The kallisto manual says that you need to specify both -l and -s for single-end mode. I am just not sure whether the values I chose are correct.

4 months ago

I commented on this post earlier, but that comment turns out to be not quite right.

I was told via other sources that kallisto applies a correction based on the fragment length, basically ignoring alignments that do not appear to "fit" into the transcript when the whole fragment is considered.

If so, the value should reflect the original fragment length corresponding to the biologically relevant template and not the artificial construct with ligated adapters.


TPM-normalized counts (which is what kallisto outputs) require fragment length information. This is because, in TPM, counts are divided by the "effective length" rather than the actual transcript length. The effective length is the number of positions at which a fragment can start along a given transcript.

This is the primary reason kallisto requires fragment length information (for paired-end data, fragment length information is inferred automatically from your reads).
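
To make the dependence concrete, here is a minimal Python sketch of that idea. It assumes a single fixed fragment length, whereas kallisto works with a fragment length distribution, so this is an illustration and not kallisto's actual code. The function names and the toy counts and transcript lengths are made up for the example.

```python
def start_positions(transcript_len, fragment_len):
    """Count start positions where the whole fragment fits inside the transcript."""
    return sum(
        1 for start in range(transcript_len)
        if start + fragment_len <= transcript_len
    )


def tpm(counts, transcript_lens, fragment_len):
    """Divide each count by its effective length, then rescale to sum to one million."""
    eff = [start_positions(n, fragment_len) for n in transcript_lens]
    # Transcripts shorter than the fragment get a rate of 0 (assumption for this sketch).
    rates = [c / e if e > 0 else 0.0 for c, e in zip(counts, eff)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [1e6 * r / total for r in rates]


# Hypothetical toy data: two transcripts with equal counts but different lengths.
counts = [100, 100]
lengths = [500, 2000]
print(tpm(counts, lengths, fragment_len=225))
print(tpm(counts, lengths, fragment_len=350))
```

With the same counts, the larger fragment length shifts more of the TPM toward the short transcript, because its effective length shrinks proportionally more.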


I always thought the effective length is computed based on read lengths (subtracting half the read length from each end).

As you and others pointed out, there is a second effective-size correction on the 3' end based on fragment size.

I think this correction could have a more substantial effect in some circumstances, for example when the transcripts are incompletely characterized. That is very common for many organisms: we know the coding regions but not the full transcripts. In those cases, it could be better to "lie" about the fragment size and claim either a shorter length or a larger standard deviation, so that all of the data can be used.


Effective length equals transcript length minus the insert size, plus 1.

If transcripts don't get fragmented, then it's equal to transcript length minus read length + 1.

Reference: Pachter's 2011 arXiv paper.
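
As a quick check of the two formulas, here is a small Python sketch. The function name, the 1000 bp transcript, and the 75 bp read length are hypothetical, and clamping negative values to 0 is my own assumption for transcripts shorter than the insert.

```python
def effective_length(transcript_len, read_len, insert_len=None):
    """Transcript length minus insert size + 1, or minus read length + 1 if unfragmented."""
    span = read_len if insert_len is None else insert_len
    # Clamp at 0 so a transcript shorter than the insert does not go negative (assumption).
    return max(transcript_len - span + 1, 0)


# Hypothetical 1000 bp transcript and 75 bp reads.
print(effective_length(1000, read_len=75, insert_len=225))  # 776
print(effective_length(1000, read_len=75, insert_len=350))  # 651
print(effective_length(1000, read_len=75))                  # 926
```

Counting the adapters (350 instead of 225) lowers the effective length here from 776 to 651, which is why the choice of -l matters.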


Thank you. Does that mean that I should use -l 225 -s 25 as parameters?


Yes, if you are sure that that is the insert length (the adapters should not be counted).
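
For reference, a minimal Python sketch of how those numbers fall out of the 200-250 bp insert range and how the command would be assembled. Taking the midpoint as the mean and half the range as the standard deviation is an assumption that simply mirrors -l 225 -s 25; it is not a statistical estimate. The helper names are made up, the file names are the placeholders from the original question, and --rf-stranded is copied from that question, so check that it matches your library.

```python
import shlex


def single_end_params(insert_min, insert_max):
    """Midpoint as mean and half-range as sd (a rough heuristic, not an estimate)."""
    mean = (insert_min + insert_max) / 2
    sd = (insert_max - insert_min) / 2
    return mean, sd


def kallisto_command(index, out_dir, fastq, mean, sd, stranded_flag=None):
    """Build the argument list for a single-end kallisto quant run (does not execute it)."""
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir,
           "--single", "-l", f"{mean:g}", "-s", f"{sd:g}"]
    if stranded_flag:
        cmd.append(stranded_flag)
    cmd.append(fastq)
    return cmd


mean, sd = single_end_params(200, 250)
cmd = kallisto_command("index", "output", "R1.fastq.gz", mean, sd, "--rf-stranded")
print(shlex.join(cmd))
# kallisto quant -i index -o output --single -l 225 -s 25 --rf-stranded R1.fastq.gz
```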
