Question

Kallisto: Impact of chosen reference transcriptome on the resulting TPMs

1

Entering edit mode

4.2 years ago

iraun 6.2k

Hi there! I have run kallisto (version 0.43.1) twice, one run using the full human transcriptome (concatenating cDNA + ncRNAs into a fasta file) and another run using just cDNA sequences. Files were obtained from ensembl:

The used command, the input paired-fastq files.. etc, was the same in both runs, the only thing changing is the reference transcriptome. When I check the TPMs calculated by those runs, for those transcript in common between the two runs (cDNAs) I got different values. See for example the following 4 sequence IDs:

Only cDNAs run:

target_id   length  eff_length  est_counts  tpm
ENST00000364315.1   119 23.3642 21491.2 403952
ENST00000362545.1   121 22.3106 5657.45 111360
ENST00000355057.2   373 190.462 9049    20864.8
ENST00000564138.5   8707    8518.74 324053  16705.6

Full transcriptome run:

target_id   length  eff_length  est_counts  tpm
ENST00000364315.1   119 22.6864 666.502 2673.26
ENST00000362545.1   121 21.9112 0   0
ENST00000355057.2   373 189.952 9047    4333.78
ENST00000564138.5   8707    8515.74 43.7178 0.467134

The TPMs obtained are very different. Is this expected? If I am interested on the expression of a subset of transcripts, should I include only those as reference or all of them?

Thank you in advance,

RNA-Seq kallisto • 2.1k views

ADD COMMENT • link updated 4.0 years ago by Biostar 20 • written 4.2 years ago by iraun 6.2k

0

Entering edit mode

I am not 100% sure if this affects pseudoalignment methods such as kallisto, but with traditional aligners, you might misalign reads to a close enough reference sequence if their actual origin is not included in the reference. You'll always have that risk when using reduced reference sets, so I'd say better to align/pseudoalign against a set as complete as possible and then filter afterwards.

ADD REPLY • link 4.2 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Indeed, I totally agree about the misalignment risk. But kallisto authors recommend just to use cDNA, (see https://github.com/pachterlab/kallisto-transcriptome-indices/releases) so.. I am confused.

ADD REPLY • link 4.2 years ago by iraun 6.2k

1

Entering edit mode

With some caveats (see this Twitter thread or the corresponding blog post).

ADD REPLY • link 4.2 years ago by igor 13k

0

Entering edit mode

Jeez....oh come on ><;

ADD REPLY • link 4.2 years ago by cschu181 ★ 2.8k

0

Entering edit mode

OMG... this is even more confusing, since Kaliisto author recommends using cDNA.. but if the file contains haplotypes.. I don't think it is the most sensible approach.. Let's see if the author drops by this issue or if another user has faced the same problem. At the moment I don't know anymore what should be used as reference transcriptome.

ADD REPLY • link 4.2 years ago by iraun 6.2k

1

Entering edit mode

As for my two cents -- in your first table, the TPMs will sum up to 1000000 (as always is the case). Yet two transcripts (ENST00000364315.1 and ENST00000362545.1), representing 5S ribosomal pseudogenes, are already amounting to >500000. Two 5S ribosomal pseudogenes are expressed more highly than every other transcript combined -- probably due to contamination.

I'm wondering if, with a more complete transcriptome reference set, the reads are simply getting mapped to more transcripts (e.g. reads that originally mapped to A are now getting mapped among A, B, C, D, and E where B, C, D, and E are part of the extended transcriptome reference set). Might be helpful to do a gene-level summary analysis?

ADD REPLY • link 3.6 years ago by dsull ★ 5.8k

0

Entering edit mode

You could also try and run salmon or sailfish or ... on both sets and see if that results in similar tpm differences.

ADD REPLY • link 4.2 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Yes, that is right. One way to do this would be to extract transcript Ids with respect to non haplotypes using a simple grep command on the annotation(GTF) file and then extract information belonging to those Ids from the ref fasta file and save them as a new fasta file.

ADD REPLY • link 3.6 years ago by nageshprabhu.k • 0