Question: Hisat2 index builder seems to be running indefinitely
15 months ago by
caranlove10 wrote:

Hello, I am attempting to create a new index from Ensembl reference files, and the index builder is taking far longer than I am used to when creating a new index. The builder command has now been running for >48 hrs, and I am confused about why it is taking so long and whether it is working at all.

I am running: hisat2-build -p 6 --ss /path/to/CanFam3.1.97_intron.bed --exon /path/to/CanFam3.1.97_exonsFile.table -f /path/to/Canis_familiaris.CanFam3.1.dna.toplevel.fa CanFam3.1.97

And the output I have gotten from this run so far is:

  Output files: "CanFam3.1.97.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
Reading reference sizes
  Time reading reference sizes: 00:00:17
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:13

But it has been stuck on this last line, 'Time to join reference sequences', for >12 hrs.
The .fa file appears to be formatted correctly: 

>1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF

As does the gtf file that the intron and exon files were created from:

X       ensembl gene    1575    5716    .       +       .       gene_id "ENSCAFG00000010935"; gene_version "3"; gene_source "ensembl"; gene_biotype "protein_coding";
X       ensembl transcript      1575    5716    .       +       .       gene_id "ENSCAFG00000010935"; gene_version "3"; transcript_id "ENSCAFT00000017396"; transcript_version "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";

Can anyone help me determine why this index is taking far more time to run than when I have created them in the past?

Thank you for your help!

hisat2 rna-seq
Is it still running? You can check with the top command in a new terminal window.
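
If you would rather script the check than watch top, something like the following might do. It is only a minimal sketch: `is_running` is a made-up helper name, not part of HISAT2, and it assumes you already know the PID of the build and that the job belongs to your own user.

```shell
# Hypothetical helper: report whether a process ID is still alive.
# kill -0 delivers no signal; it only checks that the PID exists and
# that you are allowed to signal it (so it works for your own jobs).
is_running() {
    if kill -0 "$1" 2>/dev/null; then
        echo "PID $1 is still running"
        return 0
    fi
    echo "PID $1 is not running (or belongs to another user)"
    return 1
}

# Example: check the current shell itself, which is always alive.
is_running "$$"
```

On most Linux systems `pgrep -f hisat2-build` should list the PID of the build, and `ps -o etime= -o time= -p <PID>` shows elapsed wall-clock time next to accumulated CPU time. If the CPU time keeps growing between checks, the job is presumably still computing rather than hung.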

Yes, it does appear to still be running.

Have you solved the problem yet? I have the same problem.

