I consistently get the following error message when I try to build a HISAT2 index for a mouse genome.
Error: Encountered internal HISAT2 exception (#1)
My call to hisat2-build is as follows:
hisat2-build -f -p 8 --ss genome.ss --exon genome.exon $GENOME genome_tran
Where "$GENOME" contains a comma-separated list of FASTA files (one for each chromosome).
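For illustration, a comma-separated list of this kind can be assembled without relying on arrays or GNU sed. This is only a minimal sketch: the helper name `join_fasta_list` is hypothetical (it is not part of HISAT2), and the `chr.<name>.fa` naming simply mirrors the file names shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: join chromosome names into "chr.1.fa,chr.2.fa,...".
join_fasta_list() {
    list=""
    for c in "$@"; do
        # ${list:+$list,} adds a comma only when list is already non-empty
        list="${list:+$list,}chr.$c.fa"
    done
    printf '%s\n' "$list"
}

GENOME=$(join_fasta_list 1 2 3 X Y MT)
echo "$GENOME"
```

The resulting string can be passed to hisat2-build as the reference argument, in the same position as `$GENOME` in the call above.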
I'm using the "build_index.sh" script with some minor modifications. Up until now, I've not had an issue with index building. I'm running this on a Unix server with Slurm job control; I've verified that my job is being assigned 8 CPUs and at least 500 GB of RAM.
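One way to double-check the allocation from inside the job is to print the environment variables that Slurm sets for it. The following is a minimal sketch, assuming a standard Slurm setup; `report_allocation` is a hypothetical helper name, and some variables may be unset depending on how the job was submitted.

```shell
#!/bin/sh
# Sketch: print what the scheduler reports for the current job.
# Each value falls back to "not set" when the variable is missing,
# for example when the script runs outside of Slurm.
report_allocation() {
    echo "job id: ${SLURM_JOB_ID:-not set}"
    echo "tasks: ${SLURM_NTASKS:-not set}"
    echo "cpus on node: ${SLURM_CPUS_ON_NODE:-not set}"
    echo "mem per node (MB): ${SLURM_MEM_PER_NODE:-not set}"
}

report_allocation
```

Calling this near the top of the batch script records the allocation in the job log. Note that `--ntasks=8` requests eight tasks, which Slurm may place on more than one node; `--cpus-per-task=8` is the usual request for a single multithreaded process. After the job ends, `sacct -j <jobid> --format=JobID,AllocCPUS,ReqMem,MaxRSS,State` reports peak memory use, if accounting is enabled on the cluster.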
My complete HISAT2 output and script are posted below. If anyone has any ideas about how to troubleshoot this, please let me know.
Complete HISAT2 output:
/home/abf/bin/hisat2-build
/home/abf/bin/hisat2_extract_splice_sites.py
/home/abf/bin/hisat2_extract_exons.py
Settings:
  Output files: "genome_tran.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  chr.1.fa
  chr.2.fa
  chr.3.fa
  chr.4.fa
  chr.5.fa
  chr.6.fa
  chr.7.fa
  chr.8.fa
  chr.9.fa
  chr.10.fa
  chr.11.fa
  chr.12.fa
  chr.13.fa
  chr.14.fa
  chr.15.fa
  chr.16.fa
  chr.17.fa
  chr.18.fa
  chr.19.fa
  chr.X.fa
  chr.Y.fa
  chr.MT.fa
Reading reference sizes
Reading reference sizes
  Time reading reference sizes: 00:00:22
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:18
  Time to read SNPs and splice sites: 00:00:01
Total time for call to driver() for forward index: 00:20:28
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -f -p 8 --ss genome.ss --exon genome.exon chr.1.fa,chr.2.fa,chr.3.fa,chr.4.fa,chr.5.fa,chr.6.fa,chr.7.fa,chr.8.fa,chr.9.fa,chr.10.fa,chr.11.fa,chr.12.fa,chr.13.fa,chr.14.fa,chr.15.fa,chr.16.fa,chr.17.fa,chr.18.fa,chr.19.fa,chr.X.fa,chr.Y.fa,chr.MT.fa genome_tran
Deleting "genome_tran.1.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.2.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.3.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.4.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.5.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.6.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.7.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.8.ht2" file written during aborted indexing attempt.
My Script:
#!/bin/bash
#SBATCH --job-name=BUILD_MOUSE_INDEX
#SBATCH --ntasks=8
#SBATCH --mem=512000
# Downloads sequence for the GRCm38 version of M. musculus (mouse) from
# Ensembl (release set by ENSEMBL_RELEASE below).
#
# By default, this script builds an index for just the base files,
# since alignments to those sequences are the most useful.  To change
# which chromosomes are built by this script, edit the CHR
# variable below.
#
export PATH=$PATH:/home/abf/bin
declare -a CHR=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y MT)
declare -a GENOME=()
ENSEMBL_RELEASE=98
ENSEMBL_GRCm38_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/fasta/mus_musculus/dna
ENSEMBL_GRCm38_GTF_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/mus_musculus
GTF_FILE=Mus_musculus.GRCm38.${ENSEMBL_RELEASE}.chr.gtf # Excludes unplaced contigs
# GTF_FILE=Mus_musculus.GRCm38.${ENSEMBL_RELEASE}.gtf
get() {
        file=$1
        if ! wget --version >/dev/null 2>/dev/null ; then
                if ! curl --version >/dev/null 2>/dev/null ; then
                        echo "Please install wget or curl somewhere in your PATH"
                        exit 1
                fi
                curl -o `basename $1` $1
                return $?
        else
                wget -nv $1
                return $?
        fi
}
HISAT2_BUILD_EXE=./hisat2-build
if [ ! -x "$HISAT2_BUILD_EXE" ] ; then
        if ! which hisat2-build ; then
                echo "Could not find hisat2-build in current directory or in PATH"
                exit 1
        else
                HISAT2_BUILD_EXE=`which hisat2-build`
        fi
fi
HISAT2_SS_SCRIPT=./hisat2_extract_splice_sites.py
if [ ! -x "$HISAT2_SS_SCRIPT" ] ; then
        if ! which hisat2_extract_splice_sites.py ; then
                echo "Could not find hisat2_extract_splice_sites.py in current directory or in PATH"
                exit 1
        else
                HISAT2_SS_SCRIPT=`which hisat2_extract_splice_sites.py`
        fi
fi
HISAT2_EXON_SCRIPT=./hisat2_extract_exons.py
if [ ! -x "$HISAT2_EXON_SCRIPT" ] ; then
        if ! which hisat2_extract_exons.py ; then
                echo "Could not find hisat2_extract_exons.py in current directory or in PATH"
                exit 1
        else
                HISAT2_EXON_SCRIPT=`which hisat2_extract_exons.py`
        fi
fi
#rm -f genome.fa
# Uncomment this block if retrieving individual chromosomes
for c in ${CHR[@]}; do
    F="Mus_musculus.GRCm38.dna.chromosome.$c.fa"
    G=$(echo $F | sed 's/Mus_musculus\.GRCm38\.dna\.chromosome\./chr./')
    if [ ! -f $G ] ; then
        # Use { ...; } rather than ( ... ) so that exit stops the script, not just a subshell
        get ${ENSEMBL_GRCm38_BASE}/$F.gz || { echo "Error getting $F"; exit 1; }
        gunzip $F.gz || { echo "Error unzipping $F"; exit 1; }
        mv $F "chr.$c.fa"
    fi
    GENOME=("${GENOME[@]}" "chr.$c.fa")
done
GENOME=$(echo ${GENOME[@]} | sed 's/\s/,/g')
if [ ! -f $GTF_FILE ] ; then
       get ${ENSEMBL_GRCm38_GTF_BASE}/${GTF_FILE}.gz || { echo "Error getting ${GTF_FILE}"; exit 1; }
       gunzip ${GTF_FILE}.gz || { echo "Error unzipping ${GTF_FILE}"; exit 1; }
fi
if [ ! -f genome.ss ] ; then
       ${HISAT2_SS_SCRIPT} ${GTF_FILE} > genome.ss
       ${HISAT2_EXON_SCRIPT} ${GTF_FILE} > genome.exon
fi
hisat2-build -f -p 8 --ss genome.ss --exon genome.exon $GENOME genome_tran
                    
                
                
You have the right idea. If the required amount of RAM is unavailable, use the pre-built indexes that HISAT2 provides.
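As a rough way to act on this, the build script could check available memory before starting and point to the pre-built index otherwise. This is only a sketch under stated assumptions: the helper names `available_gb` and `enough_memory` are hypothetical, `REQUIRED_GB` is a placeholder threshold rather than a figure from HISAT2, and the check reads `/proc/meminfo`, so it is Linux-only and reflects the whole node rather than any per-job limit enforced by Slurm.

```shell
#!/bin/sh
# Sketch: decide whether to attempt an index build based on available memory.
# REQUIRED_GB is a placeholder; choose a value that suits your genome and options.

# Print available memory in whole GB, read from a meminfo-style file
# (defaults to /proc/meminfo).
available_gb() {
    meminfo=${1:-/proc/meminfo}
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Succeed when at least $1 GB are available (optional $2: meminfo file).
enough_memory() {
    required_gb=$1
    have_gb=$(available_gb "${2:-/proc/meminfo}")
    [ -n "$have_gb" ] && [ "$have_gb" -ge "$required_gb" ]
}

REQUIRED_GB=${REQUIRED_GB:-200}
if enough_memory "$REQUIRED_GB"; then
    echo "At least ${REQUIRED_GB} GB available; run hisat2-build here"
else
    echo "Less than ${REQUIRED_GB} GB available; consider a pre-built HISAT2 index instead"
fi
```

In a real script, the first branch would hold the hisat2-build call and the second would exit early, so that a doomed build does not run for twenty minutes before failing.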