Question

kallisto "gene" counts

0

2.0 years ago

sovrappensiero ▴ 90

I'm new to kallisto and have a newbie question. If I want to get "gene counts" from the abundance file (EDIT: I originally typed "pseudobam file" by mistake), is it as simple as mapping gene IDs to transcript IDs using the GTF file that matches the transcript reference? How does this differ from using the --genomebam and --gtf options in kallisto quant to project the transcript alignments onto genome coordinates? I did the latter, and the only additional file I got was a pseudoalignments.bam.bai file; the abundance file looks the same.

I thought it was not straightforward to get gene counts from a transcript quantification tool like kallisto, compared with a traditional aligner like STAR or bowtie2, but I know my knowledge is outdated.

gtf rna-seq kallisto • 3.0k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 2.0 years ago by sovrappensiero ▴ 90

1

2.0 years ago

Soheil ▴ 110

You can use the tximport package to summarize transcript-level abundances to the gene level. See the tximport vignette for details.

ADD COMMENT • link 2.0 years ago by Soheil ▴ 110

0

Summing the transcript-level TPMs to get gene-level TPMs is better if you're interested in gene-level expression.

tximport is useful if you want to use DESeq2 for downstream gene-level expression analysis.

ADD REPLY • link 2.0 years ago by dsull ★ 5.8k

0

I'm not sure how gene-level TPM can be directly calculated from transcript-level TPM.

ADD REPLY • link 2.0 years ago by Soheil ▴ 110

0

Summing the transcript-level TPMs gives you gene-level TPMs, and it is the mathematically correct way to do so. A gene can have multiple transcripts, each with a different length and read count, so the notion of dividing by a single "gene length" doesn't make sense (and summing raw counts doesn't make sense either). This is why you have to sum transcript-level TPMs to get gene-level TPMs, as discussed in https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1419-z (see the background and the supplemental text) and https://www.nature.com/articles/nbt.2450 (see figure 1).
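To make the arithmetic concrete, here is a minimal sketch with made-up numbers. The transcript names, effective lengths, and counts are hypothetical and do not come from any real dataset. It computes TPM per transcript from estimated counts and effective lengths, then sums per gene; the gene-level values still add up to one million, which is what you want from a TPM.

```python
# Toy data: (transcript, gene, effective length, estimated counts).
# All names and values are invented for illustration.
transcripts = [
    ("tx1", "geneA", 1000.0, 100.0),
    ("tx2", "geneA", 2000.0, 100.0),
    ("tx3", "geneB", 500.0, 50.0),
]

def tpm_per_transcript(rows):
    # Per-transcript rate = counts / effective length.
    rates = {tx: counts / eff_len for tx, _, eff_len, counts in rows}
    total = sum(rates.values())
    # TPM rescales the rates so that they sum to one million.
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}

def sum_tpm_by_gene(rows, tx_tpm):
    # Gene-level TPM is the sum of the TPMs of the gene's transcripts.
    gene_tpm = {}
    for tx, gene, _, _ in rows:
        gene_tpm[gene] = gene_tpm.get(gene, 0.0) + tx_tpm[tx]
    return gene_tpm

tx_tpm = tpm_per_transcript(transcripts)
gene_tpm = sum_tpm_by_gene(transcripts, tx_tpm)

for gene, value in sorted(gene_tpm.items()):
    print(gene, round(value, 1))
```

Note that geneA's two transcripts have the same counts but different lengths, so no single "gene length" would reproduce the summed value in general.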

This is what's implemented in sleuth and, more recently, in kallisto | bustools (when running the workflow on bulk or smart-seq2 RNA-seq datasets with the --tcc option).

ADD REPLY • link 2.0 years ago by dsull ★ 5.8k

score 3 · Accepted Answer · 2022-04-13

3

2.0 years ago

dsull ★ 5.8k

Don't get gene counts from a pseudobam file. Please. The kallisto bam options were invented so that the mappings could be viewed in a genome browser (e.g. IGV), not for quantification purposes.

Just get gene counts using the standard kallisto workflow.

ADD COMMENT • link 2.0 years ago by dsull ★ 5.8k

0

Ah ok. Thanks! Also, sorry I mistyped - I meant "If I want to get gene counts from the abundance file..."

So I can do it the first way ("mapping the gene ID to transcript ID using the gtf file based off the transcript reference"), using the abundance file?

ADD REPLY • link 2.0 years ago by sovrappensiero ▴ 90

2

Basically, you just sum the TPM abundances of all transcripts associated with a particular gene to get gene-level abundances.
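A minimal Python sketch of that summation, for illustration only. It relies on the target_id and tpm columns of kallisto's abundance.tsv. The layout of the mapping file (tab-separated, transcript ID in column 1, gene ID in column 2, no header) is an assumption, and the function names read_t2g and gene_level_tpm are hypothetical. The in-memory strings stand in for real files and contain invented values.

```python
import csv
import io
from collections import defaultdict

def read_t2g(handle):
    """Return {transcript_id: gene_id} from a tab-separated mapping file.

    Assumed layout: transcript ID in column 1, gene ID in column 2, no header.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping

def gene_level_tpm(abundance_handle, t2g):
    """Sum the 'tpm' column of an abundance.tsv per gene.

    Returns (gene totals, list of target_ids missing from the mapping).
    """
    totals = defaultdict(float)
    unmapped = []
    for row in csv.DictReader(abundance_handle, delimiter="\t"):
        gene = t2g.get(row["target_id"])
        if gene is None:
            unmapped.append(row["target_id"])
            continue
        totals[gene] += float(row["tpm"])
    return dict(totals), unmapped

# Tiny stand-ins for abundance.tsv and the mapping file (values invented).
abundance = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1200\t1000\t100\t400000\n"
    "tx2\t2200\t2000\t100\t200000\n"
    "tx3\t700\t500\t50\t400000\n"
    "tx4\t900\t700\t0\t0\n"
)
t2g_file = io.StringIO("tx1\tgeneA\ntx2\tgeneA\ntx3\tgeneB\n")

genes, unmapped = gene_level_tpm(abundance, read_t2g(t2g_file))
print(genes)
print("unmapped:", unmapped)
```

With real files you would pass open("abundance.tsv") and an open handle to your mapping file instead. The transcript IDs must match exactly between the two files, so it is worth checking the unmapped list before trusting the totals.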

For what it's worth, I recommend building kallisto indices with the kb-python package. Install it with pip install kb-python, then run kb ref on the genomic FASTA and GTF; it will output the kallisto index, the transcriptome FASTA, and the gene-to-transcript mapping.
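If you already have an index and only need the mapping, a transcript-to-gene table can also be pulled straight from the GTF. This is a minimal sketch, not what kb ref does internally. It assumes the GTF stores transcript_id and gene_id as quoted attributes in the ninth column, and the function name t2g_from_gtf is hypothetical. Depending on the annotation source, the IDs may need a version suffix appended before they match the target_id values in the abundance file.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def t2g_from_gtf(lines):
    """Build {transcript_id: gene_id} from an iterable of GTF lines.

    Every feature row carrying both attributes is used, so GTFs without
    explicit 'transcript' rows are handled as well.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        tx = attrs.get("transcript_id")
        gene = attrs.get("gene_id")
        if tx and gene:
            mapping.setdefault(tx, gene)
    return mapping

# Invented GTF records for illustration.
gtf_lines = [
    "#!genome-build example\n",
    'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "G1"; gene_name "Foo";\n',
    'chr1\tsrc\ttranscript\t1\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t1\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number 1;\n',
    'chr1\tsrc\texon\t1\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'chr2\tsrc\ttranscript\t10\t400\t.\t-\t.\tgene_id "G2"; transcript_id "T3";\n',
]

t2g = t2g_from_gtf(gtf_lines)
for tx, gene in sorted(t2g.items()):
    print(tx, gene, sep="\t")
```

With a real annotation you would pass an open file handle instead of the list.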

ADD REPLY • link 2.0 years ago by dsull ★ 5.8k