Should kallisto take hours to run?
4.1 years ago

Hello,

I'm trying to align some test .fastq files to a reference using kallisto.

The reference is composed of human transcriptome + a couple of plasmid sequences (~12000 characters) stored in .fa format. I generated the index using this command:

humFa="/path/to/ucsc/fasta/files/RefGenomes/H_sapiens/hg19/*.fa"
plasFa="/path/to/plasmid/fasta/files/*.fa"

kallisto index -i humPlas_kallisto_transcripts.idx $humFa $plasFa --make-unique

...and the resulting file is 70.49 GB.

I have tried to align paired end .fastq files to this index using kallisto, but I keep running into issues:

On my Mac laptop (macOS 10.13.3, 3.5 GHz Processor, 16 GB memory):

The issue seems to be that the program dies prematurely, but I don't know why. I run this with only Terminal open, and I don't touch anything while it's running:

./kallisto quant -i ~/Desktop/humPlas_kallisto_transcripts.idx -o ~/Desktop/kallOutput/ -b 100 ~/Desktop/tstFastq/R1.fastq ~/Desktop/tstFastq/R2.fastq

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 50,798
[index] number of k-mers: 2,969,625,638
Killed: 9

On a computational cluster:

I run the same command as above, with four threads:

kallisto quant -i $kallistoIdx -o $outputFileLoc -t 4 -b 100 $Read1 $Read2

with the following cluster queue settings (SLURM):

#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem=256G

...and then the job auto-aborts after six hours because it hasn't completed in that time.

Am I correct in thinking that this kallisto run is taking suspiciously long, or behaving in a buggy way? Has anyone run into either of these issues? Am I missing anything that might be making kallisto slower?

Thank you for your help!

EDIT - SOLUTION

Thanks for the feedback - indeed, the problem was that I was using UCSC's genome files, not a transcriptome.

I got the transcriptome corresponding to hg19 from the Ensembl archive here: wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gz

Then I regenerated the index using that file in my $humFa path. The resulting index was much smaller.
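For anyone repeating this, the rebuild can be sketched roughly as below. This is a minimal sketch, not the exact command I ran: `build_index` and the `KALLISTO` variable are hypothetical names introduced for illustration, and the file paths in the usage comment are placeholders. It relies on `kallisto index` accepting several FASTA files, gzipped or plain.

```shell
# Hypothetical wrapper: build one kallisto index from a gzipped Ensembl cDNA
# FASTA plus any number of plasmid FASTA files.
# KALLISTO can be overridden to point at a specific binary (assumption:
# plain "kallisto" is on PATH otherwise).
KALLISTO="${KALLISTO:-kallisto}"

build_index() {
    index_path="$1"   # output index file, e.g. humPlas_kallisto_transcripts.idx
    cdna_fasta="$2"   # transcriptome FASTA (may be gzipped)
    shift 2           # remaining arguments: plasmid FASTA files
    "$KALLISTO" index -i "$index_path" --make-unique "$cdna_fasta" "$@"
}

# Usage (placeholder paths):
# build_index humPlas_kallisto_transcripts.idx \
#     Homo_sapiens.GRCh37.67.cdna.all.fa.gz /path/to/plasmid/fasta/files/*.fa
```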

Now everything is working well, and I can align to my plasmids + human transcriptome. Appreciate the help!

kallisto RNA-Seq • 4.1k views

Kallisto has taken a few hours to run before. It should be fine.

4.1 years ago
h.mon 34k

"The reference is composed of human genome + a couple of plasmid sequences (~12000 characters) stored in .fa format."

kallisto uses a reference transcriptome.
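A quick sanity check before indexing is to count the records and total bases in the FASTA files. A transcriptome is far smaller than the roughly 3 billion bases of the human genome, and the ~2.97 billion k-mers in the log above are about what a genome-sized reference would produce. The helper below is only a sketch: `fasta_stats` is a hypothetical name, and it assumes uncompressed FASTA input.

```shell
# fasta_stats FILE...: print the number of records and the total number of
# sequence characters across one or more uncompressed FASTA files.
fasta_stats() {
    awk '
        /^>/ { records++; next }        # header line: one more sequence
             { bases += length($0) }    # sequence line: add its length
        END  { printf "records=%.0f bases=%.0f\n", records, bases }
    ' "$@"
}

# Usage (placeholder path):
# fasta_stats /path/to/reference/*.fa
```

A base count in the billions for a human reference suggests the files are a genome assembly rather than a transcriptome.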


Oops, I mistyped - you are correct. Modifying the original post.


Ah! I think I see your point. I thought the fasta files from hg19 represented a human transcriptome, but I see they represent a reference genome: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/

I'll re-download a human reference transcriptome and try again. Thank you!


If your plasmids have transcribed and non-transcribed parts, or multiple genes, you may create their "transcriptome" fasta using gffread from StringTie, using the original fasta+gtf.
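A minimal sketch of that gffread step, assuming you have a plasmid FASTA and a matching GTF: `plasmid_transcripts` and the `GFFREAD` variable are hypothetical names for illustration, and the file names in the usage comment are placeholders.

```shell
# GFFREAD can be overridden to point at a specific binary (assumption:
# plain "gffread" is on PATH otherwise).
GFFREAD="${GFFREAD:-gffread}"

# plasmid_transcripts PLASMID_FA ANNOTATION_GTF OUT_FA
# Writes one spliced sequence per annotated transcript to OUT_FA.
plasmid_transcripts() {
    # -g: FASTA with the full plasmid sequence(s)
    # -w: output FASTA with the exon sequences of each transcript
    "$GFFREAD" -w "$3" -g "$1" "$2"
}

# Usage (placeholder file names):
# plasmid_transcripts plasmid.fa plasmid.gtf plasmid_transcripts.fa
```

The resulting FASTA can then be passed to kallisto index alongside the human cDNA file.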


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.
