Question

Should kallisto take hours to run?

3

6.1 years ago

Kristin Muench ▴ 620

Hello,

I'm trying to align some test .fastq files to a reference using kallisto.

The reference is composed of human transcriptome + a couple of plasmid sequences (~12000 characters) stored in .fa format. I generated the index using this command:

humFa="/path/to/ucsc/fasta/files/RefGenomes/H_sapiens/hg19/*.fa"
plasFa="/path/to/plasmid/fasta/files/*.fa"

kallisto index -i humPlas_kallisto_transcripts.idx $humFa $plasFa --make-unique

...and the resulting file is 70.49 GB.

I have tried to align paired-end .fastq files to this index using kallisto, but I keep running into issues:

On my Mac laptop (macOS 10.13.3, 3.5 GHz Processor, 16 GB memory):

The issue seems to be that the program dies prematurely, but I don't know why. I run this with only Terminal open, and I don't touch anything while it's running:

./kallisto quant -i ~/Desktop/humPlas_kallisto_transcripts.idx -o ~/Desktop/kallOutput/ -b 100 ~/Desktop/tstFastq/R1.fastq ~/Desktop/tstFastq/R2.fastq

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 50,798
[index] number of k-mers: 2,969,625,638
Killed: 9
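
"Killed: 9" is how the shell reports that the process received SIGKILL. With a ~70 GB index on a 16 GB machine, running out of memory is a plausible cause, since kallisto loads the index into memory, but that is an assumption and not something the log confirms. A rough pre-flight check could look like the sketch below; the function name is made up for illustration, and on-disk size is only a proxy for the in-memory footprint.

```shell
# Rough check: is the index file smaller than the available memory?
# Arguments: path to the index, memory size in bytes.
# (Hypothetical helper, not part of kallisto.)
index_fits_in_memory() {
    index_file="$1"
    mem_bytes="$2"
    # wc -c reports the file size in bytes; strip padding some wc builds add
    index_bytes=$(wc -c < "$index_file" | tr -d ' ')
    if [ "$index_bytes" -lt "$mem_bytes" ]; then
        echo "ok: index is $index_bytes bytes, memory is $mem_bytes bytes"
    else
        echo "too big: index is $index_bytes bytes, memory is $mem_bytes bytes"
        return 1
    fi
}

# Example on macOS, where sysctl reports physical memory in bytes:
# index_fits_in_memory ~/Desktop/humPlas_kallisto_transcripts.idx "$(sysctl -n hw.memsize)"
```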

On a computational cluster:

I run essentially the same command as above, with four threads added:

kallisto quant -i $kallistoIdx -o $outputFileLoc -t 4 -b 100 $Read1 $Read2

with the following cluster queue settings (SLURM):

#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem=256G

...and then the job auto-aborts after six hours because it hasn't completed in that time.

Am I correct in thinking that this kallisto run is taking suspiciously long or behaving suspiciously buggily? Has anyone run into either of these issues? Am I missing anything that might be making kallisto slower?

Thank you for your help!

EDIT - SOLUTION

Thanks for the feedback - indeed, the problem is that I was using UCSC's genome files, not a transcriptome.

I got the transcriptome corresponding to hg19 from the Ensembl archive here: wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gz

Then I regenerated the index using that file in my $humFa path. The resulting index was much smaller.
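
For anyone repeating this, the rebuild can be sketched roughly as below. The wrapper function and the `KALLISTO` override variable are made up for illustration; the `kallisto index -i ... --make-unique` call mirrors the command in the question. As far as I know kallisto reads gzipped FASTA directly, but if that turns out not to hold for your version, decompress the file first.

```shell
# Build a kallisto index from one or more transcriptome FASTA files.
# Arguments: output index path, then the FASTA files.
# Set KALLISTO=echo to preview the command without running it
# (hypothetical convenience, not a kallisto feature).
build_index() {
    idx="$1"
    shift
    # Fail early if any input FASTA is missing or unreadable
    for fa in "$@"; do
        [ -r "$fa" ] || { echo "missing FASTA: $fa" >&2; return 1; }
    done
    "${KALLISTO:-kallisto}" index -i "$idx" --make-unique "$@"
}

# Example, using the Ensembl cDNA file from above plus the plasmid FASTAs:
# wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gz
# build_index humPlas_kallisto_transcripts.idx \
#     Homo_sapiens.GRCh37.67.cdna.all.fa.gz /path/to/plasmid/fasta/files/*.fa
```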

Now everything is working well, and I can align to my plasmids + human transcriptome. Appreciate the help!

kallisto RNA-Seq • 5.8k views

ADD COMMENT • link 6.1 years ago by Kristin Muench ▴ 620

1

Kallisto has taken a few hours to run before. It should be fine.

ADD REPLY • link 6.1 years ago by Hussain Ather ▴ 990

score 5 · Accepted Answer · 2018-03-29

5

6.1 years ago

h.mon 35k

The reference is composed of human genome + a couple of plasmid sequences (~12000 characters) stored in .fa format.

kallisto uses a reference transcriptome.
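
One quick way to tell the two apart before building an index is to summarize the FASTA: a genome has a handful of very long records (chromosomes), while a transcriptome has many thousands of short ones. A minimal sketch, assuming uncompressed FASTA input; the function name is made up for illustration.

```shell
# Print record count, total bases, and longest record for each FASTA file.
# (Hypothetical helper; expects uncompressed FASTA.)
fasta_summary() {
    for fa in "$@"; do
        awk -v name="$fa" '
            # Header line: close out the previous record, start a new one
            /^>/ { if (len > max) max = len; n++; len = 0; next }
            # Sequence line: drop whitespace, then count bases
            { gsub(/[ \t\r]/, ""); len += length($0); total += length($0) }
            END {
                if (len > max) max = len
                # %.0f avoids integer overflow in some awk builds on genome-sized totals
                printf "%s\trecords=%d\tbases=%.0f\tlongest=%.0f\n", name, n, total, max
            }
        ' "$fa"
    done
}

# Example:
# fasta_summary /path/to/ucsc/fasta/files/RefGenomes/H_sapiens/hg19/*.fa
```

Records that are tens of millions of bases long point to a genome rather than a transcriptome.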

ADD COMMENT • link 6.1 years ago by h.mon 35k

0

Oops, I mistyped - you are correct. I'm modifying the original post.

ADD REPLY • link 6.1 years ago by Kristin Muench ▴ 620

0

Ah! I think I see your point. I thought the fasta files from hg19 represented a human transcriptome, but I see they represent a reference genome: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/

I'll re-download a human reference transcriptome and try again. Thank you!

ADD REPLY • link 6.1 years ago by Kristin Muench ▴ 620

1

If your plasmids have transcribed and non-transcribed parts, or multiple genes, you may create their "transcriptome" fasta using gffread from Stringtie, using the original fasta+gtf.
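
A minimal sketch of that gffread step, assuming you have a plasmid FASTA and a matching GTF; the file names, the wrapper function, and the `GFFREAD` override variable are placeholders for illustration. `-g` gives the sequence FASTA and `-w` names the output FASTA of spliced transcript sequences.

```shell
# Extract transcript sequences from a sequence FASTA and its annotation.
# Arguments: sequence FASTA, GTF/GFF annotation, output FASTA.
# Set GFFREAD=echo to preview the command without running it
# (hypothetical convenience, not a gffread feature).
plasmid_transcripts() {
    genome_fa="$1"
    annotation="$2"
    out_fa="$3"
    "${GFFREAD:-gffread}" -w "$out_fa" -g "$genome_fa" "$annotation"
}

# Example (placeholder file names):
# plasmid_transcripts plasmid.fa plasmid.gtf plasmid_transcripts.fa
```

The resulting FASTA can then be passed to kallisto index alongside the human cDNA file.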

ADD REPLY • link 6.1 years ago by h.mon 35k

0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

ADD REPLY • link 6.1 years ago by WouterDeCoster 47k