Question: Kallisto: Impact of chosen reference transcriptome on the resulting TPMs
1
gravatar for iraun
3 months ago by
iraun3.7k
Norway
iraun3.7k wrote:

Hi there! I have run kallisto (version 0.43.1) twice, one run using the full human transcriptome (concatenating cDNA + ncRNAs into a fasta file) and another run using just cDNA sequences. Files were obtained from ensembl:

The used command, the input paired-fastq files.. etc, was the same in both runs, the only thing changing is the reference transcriptome. When I check the TPMs calculated by those runs, for those transcript in common between the two runs (cDNAs) I got different values. See for example the following 4 sequence IDs:

Only cDNAs run:

target_id   length  eff_length  est_counts  tpm
ENST00000364315.1   119 23.3642 21491.2 403952
ENST00000362545.1   121 22.3106 5657.45 111360
ENST00000355057.2   373 190.462 9049    20864.8
ENST00000564138.5   8707    8518.74 324053  16705.6

Full transcriptome run:

target_id   length  eff_length  est_counts  tpm
ENST00000364315.1   119 22.6864 666.502 2673.26
ENST00000362545.1   121 21.9112 0   0
ENST00000355057.2   373 189.952 9047    4333.78
ENST00000564138.5   8707    8515.74 43.7178 0.467134

The TPMs obtained are very different. Is this expected? If I am interested on the expression of a subset of transcripts, should I include only those as reference or all of them?

Thank you in advance,

rna-seq kallisto • 301 views
ADD COMMENTlink modified 7 weeks ago by Biostar ♦♦ 20 • written 3 months ago by iraun3.7k

I am not 100% sure if this affects pseudoalignment methods such as kallisto, but with traditional aligners, you might misalign reads to a close enough reference sequence if their actual origin is not included in the reference. You'll always have that risk when using reduced reference sets, so I'd say better to align/pseudoalign against a set as complete as possible and then filter afterwards.

ADD REPLYlink written 3 months ago by cschu1812.3k

Indeed, I totally agree about the misalignment risk. But kallisto authors recommend just to use cDNA, (see https://github.com/pachterlab/kallisto-transcriptome-indices/releases) so.. I am confused.

ADD REPLYlink written 3 months ago by iraun3.7k
1

With some caveats (see this Twitter thread or the corresponding blog post).

ADD REPLYlink modified 3 months ago • written 3 months ago by igor10k

Jeez....oh come on ><;

ADD REPLYlink written 3 months ago by cschu1812.3k

OMG... this is even more confusing, since Kaliisto author recommends using cDNA.. but if the file contains haplotypes.. I don't think it is the most sensible approach.. Let's see if the author drops by this issue or if another user has faced the same problem. At the moment I don't know anymore what should be used as reference transcriptome.

ADD REPLYlink written 3 months ago by iraun3.7k
1

@Lior Pachter -- we'll see if Lior drops by and provides an answer.

As for my two cents -- in your first table, the TPMs will sum up to 1000000 (as always is the case). Yet two transcripts (ENST00000364315.1 and ENST00000362545.1), representing 5S ribosomal pseudogenes, are already amounting to >500000. Two 5S ribosomal pseudogenes are expressed more highly than every other transcript combined -- probably due to contamination.

I'm wondering if, with a more complete transcriptome reference set, the reads are simply getting mapped to more transcripts (e.g. reads that originally mapped to A are now getting mapped among A, B, C, D, and E where B, C, D, and E are part of the extended transcriptome reference set). Might be helpful to do a gene-level summary analysis?

ADD REPLYlink modified 3 months ago • written 3 months ago by dsull1.2k

You could also try and run salmon or sailfish or ... on both sets and see if that results in similar tpm differences.

ADD REPLYlink written 3 months ago by cschu1812.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1999 users visited in the last hour