I'm trying to align some test .fastq files to a reference using kallisto.
The reference is composed of the human transcriptome plus a couple of plasmid sequences (~12000 characters), stored in .fa format. I generated the index using this command:
humFa="/path/to/ucsc/fasta/files/RefGenomes/H_sapiens/hg19/*.fa"
plasFa="/path/to/plasmid/fasta/files/*.fa"
kallisto index -i humPlas_kallisto_transcripts.idx $humFa $plasFa --make-unique
...and the resulting file is 70.49 GB.
I have tried to align paired end .fastq files to this index using kallisto, but I keep running into issues:
On my Mac laptop (macOS 10.13.3, 3.5 GHz Processor, 16 GB memory):
The issue seems to be that the program dies prematurely, but I don't know why. I run this with only Terminal open, and I don't touch anything while it's running:
./kallisto quant -i ~/Desktop/humPlas_kallisto_transcripts.idx -o ~/Desktop/kallOutput/ -b 100 ~/Desktop/tstFastq/R1.fastq ~/Desktop/tstFastqR2.fastq
[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 50,798
[index] number of k-mers: 2,969,625,638
Killed: 9
On a computational cluster:
I run the same command (as above):
kallisto quant -i $kallistoIdx -o $outputFileLoc -t 4 -b 100 $Read1 $Read2
with the following cluster queue settings (SLURM):
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem=256G
...and then the job auto-aborts after six hours because it hasn't completed in that time.
Am I correct in thinking that this kallisto run is taking suspiciously long or behaving abnormally? Has anyone run into either of these issues? Is there anything I'm missing that might be making kallisto slower?
Thank you for your help!
EDIT - SOLUTION
Thanks for the feedback - indeed, the problem was that I was using UCSC's genome files, not the transcriptome.
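For anyone who wants to catch this mix-up before building an index, here is a minimal sketch of a sanity check on the input FASTA files. A genome FASTA has a small number of very long sequences (the human genome totals roughly 3 billion bases, which matches the ~2.97 billion k-mers reported above), while a cDNA FASTA is far smaller. The helper name fasta_stats is my own invention and is not part of kallisto.

```shell
# Summarise FASTA input: number of records and total sequence length.
# fasta_stats is a hypothetical helper name, not a kallisto command.
# Reads the files given as arguments, or stdin if none are given.
fasta_stats() {
    awk '
        /^>/ { n++; next }                 # header line: count one record
        {
            gsub(/[ \t\r]/, "")            # drop whitespace / CR characters
            bases += length($0)            # sequence line: add its length
        }
        # %.0f avoids 32-bit overflow of %d in some awk implementations
        END { printf "%.0f sequences, %.0f bases\n", n, bases }
    ' "$@"
}
```

It can be pointed at the same globs used for the index, e.g. `fasta_stats $humFa`, or fed a compressed file with `gzip -dc file.fa.gz | fasta_stats`.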
I got the transcriptome corresponding to hg19 from the Ensembl archive here:
wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gz
Then I regenerated the index using that file in my $humFa path. The resulting index was much smaller.
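The index size matters because kallisto loads the whole index into memory. The original 70.49 GB index could not fit in my laptop's 16 GB, which is most likely why the run ended with "Killed: 9". Below is a rough sketch of a pre-flight check, assuming Linux or macOS. The helper names total_ram_bytes and index_fits are hypothetical, and the on-disk size is only a rough lower bound on what kallisto needs at run time.

```shell
# Total physical memory in bytes (Linux via /proc/meminfo, macOS via sysctl).
# total_ram_bytes is a hypothetical helper name.
total_ram_bytes() {
    if [ -r /proc/meminfo ]; then
        # MemTotal is reported in kB
        awk '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo
    else
        sysctl -n hw.memsize
    fi
}

# Succeeds (exit 0) if the file is no larger than the given byte budget.
# index_fits is a hypothetical helper name.
# Usage: index_fits <index_file> <bytes>
index_fits() {
    size=$(wc -c < "$1" | tr -d ' ')   # file size in bytes
    [ "$size" -le "$2" ]
}
```

For example: `index_fits humPlas_kallisto_transcripts.idx "$(total_ram_bytes)" || echo "index is larger than RAM"`.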
Now everything is working well, and I can align to my plasmids + human transcriptome. I appreciate the help!