Hi there!
I have run kallisto
(version 0.43.1) twice, one run using the full human transcriptome (concatenating cDNA + ncRNAs into a fasta file) and another run using just cDNA sequences. Files were obtained from ensembl:
- cDNAs: ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
- ncRNAs: ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
The used command, the input paired-fastq files.. etc, was the same in both runs, the only thing changing is the reference transcriptome. When I check the TPMs calculated by those runs, for those transcript in common between the two runs (cDNAs) I got different values. See for example the following 4 sequence IDs:
Only cDNAs run:
target_id length eff_length est_counts tpm
ENST00000364315.1 119 23.3642 21491.2 403952
ENST00000362545.1 121 22.3106 5657.45 111360
ENST00000355057.2 373 190.462 9049 20864.8
ENST00000564138.5 8707 8518.74 324053 16705.6
Full transcriptome run:
target_id length eff_length est_counts tpm
ENST00000364315.1 119 22.6864 666.502 2673.26
ENST00000362545.1 121 21.9112 0 0
ENST00000355057.2 373 189.952 9047 4333.78
ENST00000564138.5 8707 8515.74 43.7178 0.467134
The TPMs obtained are very different. Is this expected? If I am interested on the expression of a subset of transcripts, should I include only those as reference or all of them?
Thank you in advance,
I am not 100% sure if this affects pseudoalignment methods such as kallisto, but with traditional aligners, you might misalign reads to a close enough reference sequence if their actual origin is not included in the reference. You'll always have that risk when using reduced reference sets, so I'd say better to align/pseudoalign against a set as complete as possible and then filter afterwards.
Indeed, I totally agree about the misalignment risk. But kallisto authors recommend just to use cDNA, (see https://github.com/pachterlab/kallisto-transcriptome-indices/releases) so.. I am confused.
With some caveats (see this Twitter thread or the corresponding blog post).
Jeez....oh come on ><;
OMG... this is even more confusing, since Kaliisto author recommends using cDNA.. but if the file contains haplotypes.. I don't think it is the most sensible approach.. Let's see if the author drops by this issue or if another user has faced the same problem. At the moment I don't know anymore what should be used as reference transcriptome.
As for my two cents -- in your first table, the TPMs will sum up to 1000000 (as always is the case). Yet two transcripts (ENST00000364315.1 and ENST00000362545.1), representing 5S ribosomal pseudogenes, are already amounting to >500000. Two 5S ribosomal pseudogenes are expressed more highly than every other transcript combined -- probably due to contamination.
I'm wondering if, with a more complete transcriptome reference set, the reads are simply getting mapped to more transcripts (e.g. reads that originally mapped to A are now getting mapped among A, B, C, D, and E where B, C, D, and E are part of the extended transcriptome reference set). Might be helpful to do a gene-level summary analysis?
You could also try and run salmon or sailfish or ... on both sets and see if that results in similar tpm differences.
Yes, that is right. One way to do this would be to extract transcript Ids with respect to non haplotypes using a simple grep command on the annotation(GTF) file and then extract information belonging to those Ids from the ref fasta file and save them as a new fasta file.