I am trying to automate extracting data from an indexed BAM file and I am getting read errors:
[E::bgzf_read_block] Invalid BGZF header at offset 51686517456 [E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
This failure happens after about 12 hours of analysis, near the end of the 52GB BAM file:
$ ll HUDEP.control.DS182418.chr8.bam* -rw-r--r-- 1 areynolds stamlab 51901713847 Jun 15 12:46 HUDEP.control.DS182418.chr8.bam -rw-r--r-- 1 areynolds stamlab 7250712 Jun 15 12:59 HUDEP.control.DS182418.chr8.bam.bai
samtools quickcheck on it, which reports no issues:
$ samtools quickcheck -vvv HUDEP.control.DS182418.chr8.bam verbosity set to 3 checking HUDEP.control.DS182418.chr8.bam opened HUDEP.control.DS182418.chr8.bam HUDEP.control.DS182418.chr8.bam is sequence data HUDEP.control.DS182418.chr8.bam has 25 targets in header. HUDEP.control.DS182418.chr8.bam has good EOF block.
I can otherwise read data from it with manual
pysam commands to random locations within the BAM file.
$ python Python 3.8.16 (default, Mar 2 2023, 03:21:46) [GCC 11.2.0] :: Anaconda, Inc. on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pysam >>> print(pysam.__version__) 0.21.0
Remaking the BAM file might be an option but it is very time-consuming. I'd like to understand what went wrong, so I don't have to remake files again.
Is that the only option or are there other tests and fixes I could apply, before this?