Question: samtools index command running slow and truncated EOF warnings
13 months ago
vascoambrogi0 wrote:

Hi all,

I am new to FASTA, FASTQ, SAM, BAM and related formats.

I am learning as I go, so I apologize for any lack of substance.

I am working on human reads from my own whole-genome sequencing, downloaded from the service provider's website.

What I have is:

  • BAM file gz compressed
  • BAI file gz compressed
  • FASTQ R1 file gz compressed
  • FASTQ R2 file gz compressed

To speed things up I decompressed all the files, and this led me to the truncated EOF error in samtools. I don't get any error when I use the *.gz files.

Is there a way to avoid that? I tried to manually force the EOF, but I still get the warning and errors when using samtools view, which is an essential command.

But what is puzzling me at the moment is the CPU usage of the samtools jobs. If I use the *.gz files, 25% of each core is used; if I use the uncompressed files, only 2 to 5% of each core is used (I tried the -@ INT flag, but nothing changes).

Is that normal?

As an example, when I run the command:

  • samtools index -@ INT file.bam.gz >>>>> 25%
  • samtools index -@ INT file.bam >>>>>> 2-5 %

Many thanks to all :)

Thank you Pierre,

Thank you for the clarification about random access, I'll keep it in mind next time I compress a Jay-z flow.

What about the low core usage and the EOF warnings/errors?

For anyone concerned, and for the record:

  • the .gz file is actually a BAM, not plain gzip; and
  • samtools was graciously doing nothing when fed the decompressed .gz file, which is a SAM.
13 months ago
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum (133k) wrote:

A BAM file is already compressed in the BGZF format, so you don't need to recompress it with gzip (a plain gzip file is not a valid BGZF file and does not support random access).


samtools index your.bam

should be enough
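To see which kind of file you actually have before indexing, a quick look at the raw bytes can help. The following is only a minimal sketch, assuming a POSIX-like shell with head, tail, od and tr available. The helper names (hex, kind_of, has_bgzf_eof) are made up for illustration and are not part of samtools. It relies on the BGZF header layout and the 28-byte empty EOF block described in the SAM/BAM specification, and assumes the BC subfield comes first in the gzip extra field, which is the usual case.

```shell
# Print stdin as one continuous lowercase hex string.
hex() { od -An -v -tx1 | tr -d ' \n'; }

# The 28-byte empty BGZF block used as the end-of-file marker.
BGZF_EOF=1f8b08040000000000ff0600424302001b0003000000000000000000

# Classify a file from its first 16 bytes (illustrative helper):
#   bgzf  = gzip magic + extra-field flag + 'BC' subfield at bytes 12-13
#   gzip  = gzip magic only
#   other = anything else (for example plain SAM text)
kind_of() {
    head_hex=$(head -c 16 "$1" | hex)
    case "$head_hex" in
        1f8b0804????????????????4243*) echo bgzf ;;
        1f8b*) echo gzip ;;
        *) echo other ;;
    esac
}

# Succeed only if the last 28 bytes are the BGZF EOF marker.
has_bgzf_eof() {
    [ "$(tail -c 28 "$1" | hex)" = "$BGZF_EOF" ]
}
```

For example, kind_of file.bam.gz printing bgzf would suggest the download is really a BAM with a misleading extension, and has_bgzf_eof failing on a decompressed file would be consistent with the truncated EOF warning. samtools also ships a quickcheck subcommand that checks for the EOF block, which is probably the simpler route when samtools is at hand.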

