Should kallisto take hours to run?
4.1 years ago

Hello,

I'm trying to align some test .fastq files to a reference using kallisto.

The reference is composed of human transcriptome + a couple of plasmid sequences (~12000 characters) stored in .fa format. I generated the index using this command:

humFa="/path/to/ucsc/fasta/files/RefGenomes/H_sapiens/hg19/*.fa"
plasFa="/path/to/plasmid/fasta/files/*.fa"

kallisto index -i humPlas_kallisto_transcripts.idx --make-unique $humFa $plasFa


...and the resulting file is 70.49 GB.

I have tried to align paired end .fastq files to this index using kallisto, but I keep running into issues:

# On my Mac laptop (macOS 10.13.3, 3.5 GHz Processor, 16 GB memory):

The issue seems to be that the program dies prematurely, but I don't know why. I run this with only Terminal open, and I don't touch anything while it's running:

./kallisto quant -i ~/Desktop/humPlas_kallisto_transcripts.idx -o ~/Desktop/kallOutput/ -b 100 ~/Desktop/tstFastq/R1.fastq ~/Desktop/tstFastq/R2.fastq

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 50,798
[index] number of k-mers: 2,969,625,638
Killed: 9


# On a computational cluster:

I run the same command (as above):

kallisto quant -i $kallistoIdx -o $outputFileLoc -t 4 -b 100 $Read1 $Read2


with the following cluster queue settings (SLURM):

#SBATCH --nodes=4
#SBATCH --mem=256G


...and then the job auto-aborts after six hours because it hasn't completed in that time.

Am I correct in thinking this kallisto alignment is taking suspiciously long/is being suspiciously buggy? Has anyone run into either of these issues? Am I missing anything that might be making kallisto slower?

# EDIT - SOLUTION

Thanks for the feedback - indeed, the problem was that I was using UCSC's genome files rather than a transcriptome.

I got the transcriptome corresponding to hg19 from the Ensembl archive here: wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gz

Then I regenerated the index using that file in my $humFa path. The resulting index was much smaller.

Now everything is working well, and I can align to my plasmids + human transcriptome. Appreciate the help!


Kallisto has taken a few hours to run before. It should be fine.

4.1 years ago
h.mon 34k

The reference is composed of human genome + a couple of plasmid sequences (~12000 characters) stored in .fa format.

kallisto uses a reference transcriptome.
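
One quick way to catch this before building an index is to count the records and total bases in the FASTA input. The index log in the question reports about 2.97 billion k-mers, which is on the order of the whole human genome; a cDNA transcriptome is typically far smaller (hundreds of megabases). The helper below is only a sketch: `fasta_stats` is a made-up name, and it assumes plain-text (uncompressed) FASTA.

```shell
# Hypothetical helper: print "<number of records><TAB><total bases>" for
# one or more uncompressed FASTA files (or stdin when no file is given).
fasta_stats() {
    awk '
        /^>/ { records++; next }      # header line: one more sequence
        { bases += length($0) }       # sequence line: add its length
        END { printf "%d\t%d\n", records, bases }
    ' "$@"
}
```

For example, `fasta_stats /path/to/fasta/files/*.fa` summarises a directory of FASTA files, and `gzip -dc file.fa.gz | fasta_stats` handles a compressed one. A total in the billions of bases for human data suggests a genome rather than a transcriptome.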


Oops, I mistyped - you are correct. Modifying the original post.


Ah! I think I see your point. I thought the fasta files from hg19 represented a human transcriptome, but I see they represent a reference genome: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/


If your plasmids have transcribed and non-transcribed parts, or multiple genes, you may create their "transcriptome" fasta using gffread from StringTie, using the original fasta + gtf.
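
A minimal sketch of that gffread step, assuming you have the plasmid sequence as FASTA and a matching GTF. The function name, the file names in the usage line, and the `DRY_RUN` switch are all made up for illustration; only the `gffread -w ... -g ...` call itself is the real tool.

```shell
# Hypothetical wrapper: write the transcript (spliced exon) sequences
# annotated in a GTF to a new FASTA file.
# Usage: plasmid_transcripts plasmid.fa plasmid.gtf plasmid_transcripts.fa
# Set DRY_RUN=1 to print the gffread command instead of running it.
plasmid_transcripts() {
    local genome_fa=$1 annotation_gtf=$2 out_fa=$3
    # -g: FASTA to pull sequence from; -w: output FASTA of transcript sequences
    ${DRY_RUN:+echo} gffread -w "$out_fa" -g "$genome_fa" "$annotation_gtf"
}
```

The resulting FASTA can then be passed to kallisto index together with the human cDNA file, in place of the raw plasmid sequence.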
