Question

kallisto index build difference according to version

4 months ago

estilo • 0

Hi all, I'm trying to run kallisto on a single-end RNA-seq dataset, and I started by building an index from the following files (downloaded from Ensembl):

Homo_sapiens.GRCh37.ncrna.fa.gz
Homo_sapiens.GRCh37.cdna.all.fa.gz

using the command

kallisto index -i index.idx Homo_sapiens.GRCh37.ncrna.fa.gz Homo_sapiens.GRCh37.cdna.all.fa.gz

Although this wasn't intended, I found differences between the indices built with different versions of kallisto (using the same files mentioned above).

So, my question is: should the size of the index files differ between versions of kallisto? My concern is the significantly lower mapping rate with version 0.50.1 (~10%), whereas the index from version 0.46.1 improves the rate to ~70%. I also have no proof or guidance on whether an index built with a previous version can be used for quant analysis with the newer version.

kallisto 0.46.1 generated index size of 2,218,101 kb

[build] target de Bruijn graph has 1202693 contigs and contains 117035043 k-mers

kallisto 0.50.1 generated index size of 285,725 kb

[build] target de Bruijn graph has 1013194 contigs and contains 117035043 k-mers
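
For anyone comparing the two build logs, here is a minimal Python sketch that parses the `[build]` summary lines quoted above. The function name `parse_build_line` is made up for illustration, and the regular expression assumes the log wording stays exactly as shown in this post. It only makes visible that the k-mer count is identical in both builds while the contig count differs.

```python
import re

# Pattern for the "[build]" summary line as quoted in this post
# (assumption: other kallisto versions may word this line differently).
BUILD_LINE = re.compile(
    r"target de Bruijn graph has (\d+) contigs and contains (\d+) k-mers"
)


def parse_build_line(line: str) -> tuple[int, int]:
    """Return (contigs, kmers) parsed from a '[build]' summary line."""
    match = BUILD_LINE.search(line)
    if match is None:
        raise ValueError(f"not a build summary line: {line!r}")
    return int(match.group(1)), int(match.group(2))


# The two log lines reported above (0.46.1 first, then 0.50.1).
old = parse_build_line(
    "[build] target de Bruijn graph has 1202693 contigs and contains 117035043 k-mers"
)
new = parse_build_line(
    "[build] target de Bruijn graph has 1013194 contigs and contains 117035043 k-mers"
)

same_kmers = old[1] == new[1]        # True: both builds contain the same number of k-mers
contig_difference = old[0] - new[0]  # how many more contigs the 0.46.1 graph reports
print(same_kmers, contig_difference)
```

An equal k-mer count with a different contig count suggests the same sequence content stored in a differently structured graph, though that reading is an inference from the logs rather than something the logs state.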

For reference, I used Windows (Intel) for version 0.46.1 and macOS (M2, ARM) for 0.50.1.

I would very much appreciate any feedback on what I'm missing here, or on what is considered best practice.

index version kallisto • 759 views

ADD COMMENT • link updated 4 months ago by dsull ★ 5.9k • written 4 months ago by estilo • 0

It is possible that different versions of a program improve the way indexes are created, so the file sizes may differ. You are also working on two separate operating systems, which may have their own nuances. File size is never a good criterion for comparison, and certainly not across two OSes.

"My concern is the significantly lower mapping rate with version 0.50.1 (~10%), whereas the index from version 0.46.1 improves the rate to ~70%."

I don't understand why, unless differences in program defaults between the two versions are causing this. @dsull from the Pachter lab participates here and may have additional insights to offer.

"I also have no proof or guidance on whether an index built with a previous version can be used for quant analysis with the newer version."

Unless the program authors advise against it, it should be fine to use an index built with an older version of the program with a newer one. If file formats change between versions, the authors will provide notice and/or the index may fail to work.

ADD REPLY • link 4 months ago by GenoMax 141k

Thanks for the comment; the point about file size not being a good basis for comparison is a big relief. As you said, the index built on a different OS does not work, but I will try to downgrade the macOS installation to 0.46.1. I will update the OP.

ADD REPLY • link 4 months ago by estilo • 0

As GenoMax has alluded to, kallisto 0.50.1 uses a DIFFERENT index version than kallisto 0.46.1 (the index data structure itself is different so, naturally, the file sizes will differ).
You CANNOT use an older version of a kallisto index with a newer kallisto version (or a newer kallisto index with an older version) -- the program should fail if you do so.
You should not see a 60% difference in mapping rate between the two versions. Can you show me the commands you ran for the kallisto read mapping+quantification step so I can diagnose the issue?
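
One way to avoid mixing index versions by accident is to put the kallisto version into the index file name. The sketch below is only an illustration: the helper names and the `GRCh37` prefix are invented, and it assumes that `kallisto version` prints text of the form `kallisto, version 0.50.1`.

```python
import re
import subprocess


def parse_kallisto_version(version_output: str) -> str:
    """Extract 'X.Y.Z' from text such as 'kallisto, version 0.50.1'."""
    match = re.search(r"(\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return match.group(1)


def versioned_index_name(prefix: str, version_output: str) -> str:
    """Build an index file name that records the kallisto version."""
    return f"{prefix}.kallisto-{parse_kallisto_version(version_output)}.idx"


def installed_kallisto_version_output() -> str:
    """Run 'kallisto version' (requires kallisto on PATH; not called here)."""
    result = subprocess.run(
        ["kallisto", "version"], capture_output=True, text=True, check=True
    )
    # Combine both streams, since which one carries the text is not assumed.
    return result.stdout + result.stderr


# Hypothetical prefix; the version string is the newer version from this thread.
example = versioned_index_name("GRCh37", "kallisto, version 0.50.1")
print(example)  # GRCh37.kallisto-0.50.1.idx
```

With a name like this, the index passed to `kallisto quant -i` can be checked against the installed version before a run starts.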

ADD REPLY • link 4 months ago by dsull ★ 5.9k

That explains it, thanks a lot.
Yes, the program failed.

kallisto quant -i index.idx -o /directory/file --single -l 50 -s 2 -t 8 file01.fastq.gz

In particular, the dataset was generated by Illumina single-end RNA-seq (HiSeq 4000), from GSE175718. I figured that specifying the length and SD as closely as I could would be okay. I used the same command for both versions; the only difference is the directory where the files are located and written.
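
To keep the quant call identical across both installations, the command above can be assembled from one place. This is a minimal sketch; `single_end_quant_command` is a made-up helper, and the values are the ones from this post. Note that kallisto documents `-l` as the estimated average fragment length and `-s` as its standard deviation, and requires both with `--single`.

```python
def single_end_quant_command(index, output_dir, fastq,
                             fragment_length, fragment_sd, threads=1):
    """Return the argument list for a single-end 'kallisto quant' run."""
    # Both values must be positive for a single-end run.
    if fragment_length <= 0 or fragment_sd <= 0:
        raise ValueError("fragment length and SD must be positive")
    return [
        "kallisto", "quant",
        "-i", index,
        "-o", output_dir,
        "--single",
        "-l", str(fragment_length),  # estimated average fragment length
        "-s", str(fragment_sd),      # standard deviation of fragment length
        "-t", str(threads),
        fastq,
    ]


# Same values as the command quoted in this post.
command = single_end_quant_command(
    "index.idx", "/directory/file", "file01.fastq.gz", 50, 2, threads=8
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where kallisto is installed.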

I also ran the same FASTQ files through CLC Genomics Workbench ('RNA sequencing') using the same version of the genome and cDNA, and the percentage of reads mapped was approximately 20% higher. I'm not sure a direct comparison between alignment-based mapping and pseudoalignment is appropriate, but the difference in mapping rate at the outset is quite confusing. I clearly have a lot to learn, and thanks for the thoughtful review and comments on my somewhat naive questions. Good day to all.

To give a specific result from the dataset, the output for the first file, GC080710_A01, using 0.50.1 was:

[quant] processed 4,354,791 reads, 2,721,629 reads pseudoaligned. (62.6%)
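
When comparing mapping rates between versions, it can help to recompute the rate from the read counts rather than copying the printed percentage. A minimal sketch follows; the function name is hypothetical, and the pattern assumes the `[quant]` line is worded exactly as quoted above.

```python
import re

# Pattern for the "[quant]" summary line as quoted in this post.
QUANT_LINE = re.compile(
    r"processed ([\d,]+) reads, ([\d,]+) reads pseudoaligned"
)


def pseudoalignment_rate(line: str) -> float:
    """Return the pseudoalignment rate in percent, computed from the counts."""
    match = QUANT_LINE.search(line)
    if match is None:
        raise ValueError(f"not a quant summary line: {line!r}")
    # Counts are printed with thousands separators, so strip the commas.
    processed = int(match.group(1).replace(",", ""))
    aligned = int(match.group(2).replace(",", ""))
    if processed == 0:
        raise ValueError("no reads were processed")
    return 100.0 * aligned / processed


rate = pseudoalignment_rate(
    "[quant] processed 4,354,791 reads, 2,721,629 reads pseudoaligned. (62.6%)"
)
print(f"{rate:.1f}%")  # 62.5%
```

For this file the counts work out to roughly 62.5%, close to the percentage shown in the log line.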

ADD REPLY • link 4 months ago by estilo • 0

OK, I tried similar commands on a 37-million-read dataset, and the mappings are nearly identical between the two versions (only one read differs, likely because the graph-based index structure is represented differently between the two versions). See the analysis below:

https://colab.research.google.com/drive/1ocxzswP2ZW8-CeLxupxCN7r-sHMlOfBB?usp=sharing

The older version is a bit faster on a small number of threads but consumes 2x as much memory. (The new version is faster on a larger number of threads, and supports key features important in single-cell and nucleus RNAseq analysis if you ever want to play around with those).

ADD REPLY • link 4 months ago by dsull ★ 5.9k