Kallisto: Impact of chosen reference transcriptome on the resulting TPMs
0
1
Entering edit mode
4.2 years ago
iraun 6.2k

Hi there! I have run kallisto (version 0.43.1) twice, one run using the full human transcriptome (concatenating cDNA + ncRNAs into a fasta file) and another run using just cDNA sequences. Files were obtained from ensembl:

The used command, the input paired-fastq files.. etc, was the same in both runs, the only thing changing is the reference transcriptome. When I check the TPMs calculated by those runs, for those transcript in common between the two runs (cDNAs) I got different values. See for example the following 4 sequence IDs:

Only cDNAs run:

target_id   length  eff_length  est_counts  tpm
ENST00000364315.1   119 23.3642 21491.2 403952
ENST00000362545.1   121 22.3106 5657.45 111360
ENST00000355057.2   373 190.462 9049    20864.8
ENST00000564138.5   8707    8518.74 324053  16705.6

Full transcriptome run:

target_id   length  eff_length  est_counts  tpm
ENST00000364315.1   119 22.6864 666.502 2673.26
ENST00000362545.1   121 21.9112 0   0
ENST00000355057.2   373 189.952 9047    4333.78
ENST00000564138.5   8707    8515.74 43.7178 0.467134

The TPMs obtained are very different. Is this expected? If I am interested on the expression of a subset of transcripts, should I include only those as reference or all of them?

Thank you in advance,

RNA-Seq kallisto • 2.1k views
ADD COMMENT
0
Entering edit mode

I am not 100% sure if this affects pseudoalignment methods such as kallisto, but with traditional aligners, you might misalign reads to a close enough reference sequence if their actual origin is not included in the reference. You'll always have that risk when using reduced reference sets, so I'd say better to align/pseudoalign against a set as complete as possible and then filter afterwards.

ADD REPLY
0
Entering edit mode

Indeed, I totally agree about the misalignment risk. But kallisto authors recommend just to use cDNA, (see https://github.com/pachterlab/kallisto-transcriptome-indices/releases) so.. I am confused.

ADD REPLY
1
Entering edit mode

With some caveats (see this Twitter thread or the corresponding blog post).

ADD REPLY
0
Entering edit mode

Jeez....oh come on ><;

ADD REPLY
0
Entering edit mode

OMG... this is even more confusing, since Kaliisto author recommends using cDNA.. but if the file contains haplotypes.. I don't think it is the most sensible approach.. Let's see if the author drops by this issue or if another user has faced the same problem. At the moment I don't know anymore what should be used as reference transcriptome.

ADD REPLY
1
Entering edit mode

As for my two cents -- in your first table, the TPMs will sum up to 1000000 (as always is the case). Yet two transcripts (ENST00000364315.1 and ENST00000362545.1), representing 5S ribosomal pseudogenes, are already amounting to >500000. Two 5S ribosomal pseudogenes are expressed more highly than every other transcript combined -- probably due to contamination.

I'm wondering if, with a more complete transcriptome reference set, the reads are simply getting mapped to more transcripts (e.g. reads that originally mapped to A are now getting mapped among A, B, C, D, and E where B, C, D, and E are part of the extended transcriptome reference set). Might be helpful to do a gene-level summary analysis?

ADD REPLY
0
Entering edit mode

You could also try and run salmon or sailfish or ... on both sets and see if that results in similar tpm differences.

ADD REPLY
0
Entering edit mode

Yes, that is right. One way to do this would be to extract transcript Ids with respect to non haplotypes using a simple grep command on the annotation(GTF) file and then extract information belonging to those Ids from the ref fasta file and save them as a new fasta file.

ADD REPLY

Login before adding your answer.

Traffic: 1993 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6