Question

Index files from alignments are always 96 bytes (Oxford Nanopore)

16 months ago

Ibrahim Tanyalcin ★ 1.2k

BACKGROUND

I have a dockerized app running IGV.js and serving files from the backend. For typical alignment files from shotgun sequencing (Illumina etc.), I can see that my server correctly responds to range requests sent by IGV:

Content-Range: bytes 1969716952-1969767576/2396323876
Content-Type: application/octet-stream

This works fine: my server responds with HTTP status code 206 (Partial Content) for almost all bam-file/fasta-ref pairs. The resulting payloads are always around 12-50 kB. This is the expected behavior, as it saves IGV from having to download the whole bam/fasta.

ISSUE

Given: A reference file of around 4 kB (reverse-transcribed RNA) with hundreds of reads (very high depth).

In certain cases where the bam files come from Oxford Nanopore data, IGV.js requests nearly the entire bam file, whether the bam file is 10 MB or 700 MB. The range request from the browser (initiated by IGV.js) looks like this:

Range: bytes=270-7327614

And of course the server has no choice but to send almost the entire file:

Content-Range: bytes 270-7327614/7612045
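To put numbers on the two responses quoted above, here is a minimal sketch that parses a `Content-Range` value and reports what fraction of the file was served. The helper names `parse_content_range` and `fraction_served` are made up for illustration; only the header values come from the question.

```python
import re


def parse_content_range(header_value: str) -> tuple[int, int, int]:
    """Parse 'bytes first-last/total' into (first, last, total)."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", header_value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {header_value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


def fraction_served(header_value: str) -> float:
    """Fraction of the whole file covered by the returned byte range."""
    first, last, total = parse_content_range(header_value)
    # Byte ranges are inclusive on both ends, hence the +1.
    return (last - first + 1) / total


# Header values copied from the question.
illumina = fraction_served("bytes 1969716952-1969767576/2396323876")
nanopore = fraction_served("bytes 270-7327614/7612045")
print(f"Illumina response: {illumina:.6%} of the file")
print(f"Nanopore response: {nanopore:.2%} of the file")
```

The Illumina response covers about 50 kB of a 2.4 GB file, while the Nanopore response covers roughly 96% of the bam.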

This basically requests the entire file and causes the browser to freeze when the bam file is 700 MB, until the entire bam is read into memory. At first it was difficult to understand why this happens. Then I inspected the index (.bai) files and realized that, regardless of bam file size, the generated *.bai file is always 96 bytes. This does not occur in normal bam/fasta pairs, where the *.bai files are generally around 1-2 MB.

I am guessing that this is the reason why IGV.js requests the entire file for ONT datasets. Why are these *.bai files always 96 bytes, and how can I fix it?

WHAT HAS BEEN TRIED

Setting the visibilityWindow option for IGV.js to 50 base pairs, so that IGV does not request anything until the viewport is small enough, did not work. Even at 50 bp, flanking regions are not requested, which means that if you scroll even slightly to the left or right, IGV re-requests the entire file every time.

(Below are pseudo descriptions, I did not run them literally as they are in bash)

SAMTools Mappings Sorter .mapped.bam > .mapped.sorted.bam

SAMTools Mappings Indexer .mapped.sorted.bam > mapped.sorted.bam.bai --> 96bytes!

Minimap2 Aligner for Long Reads .fastq.gz > .mapped.bam + .mmi + .mapped.bam.bai --> 96bytes!

In both cases above, the generated bai files were always 96 bytes. This file size did not vary with the bam file size.

ADDITIONAL INFO

I also posted this question as an issue on GitHub.

DISCLAIMER

I do not own the data, and I am not authorized to share it.

alignment bai oxford-nanopore bam • 936 views

ADD COMMENT • link updated 16 months ago by LChart 3.9k • written 16 months ago by Ibrahim Tanyalcin ★ 1.2k

score 3 · Accepted Answer · 2022-11-30

16 months ago

LChart 3.9k

The bam index provides a mapping between blocks of the block-gzipped bam file and positions on the reference genome. By default, the minimum interval size is 2^14 base pairs (16,384 bp), which exceeds the total length of your reference, so all of the blocks in your bam file belong to the first (and only) interval. The resulting .bai file is therefore very small, since it only needs to show that the EOF belongs to the first interval (1-4kb), and therefore all preceding blocks do as well.

I don't believe there is anything you can do about this; this behavior is normal. If you want to stop the browser from hanging, I think you will need to downsample or split your bam file into smaller files. There is a -m option in samtools index to alter the minimum interval size, but I believe that only applies to csi indexes and not bai indexes.
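The 96-byte figure can be reproduced from the BAI layout in the SAM specification. The sketch below is a rough size calculation, not a parser. It assumes a single reference sequence, that the indexer records one chunk for the single leaf bin, and that it also writes the metadata pseudo-bin (37450) and the trailing unplaced-read count, as samtools-generated indexes do. `bai_size` is a name chosen for illustration; `reg2bin` follows the binning function given in the specification.

```python
import math

LINEAR_WINDOW = 1 << 14  # 16,384 bp: smallest bin and linear-index window in BAI


def reg2bin(beg: int, end: int) -> int:
    """Bin for a zero-based, half-open interval [beg, end), per the SAM spec."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


def bai_size(refs: list[tuple[list[int], int]]) -> int:
    """Byte size of a BAI file.

    Each entry in refs is (chunk count per bin, number of linear windows).
    """
    size = 4 + 4                        # magic "BAI\1" + n_ref
    for chunk_counts, n_windows in refs:
        size += 4                       # n_bin
        for n_chunk in chunk_counts:
            size += 4 + 4 + 16 * n_chunk  # bin id + n_chunk + (beg, end) offsets
        size += 4 + 8 * n_windows       # n_intv + one 8-byte offset per window
    return size + 8                     # trailing count of unplaced reads


ref_length = 4000                       # roughly the 4 kb reference in the question
n_windows = math.ceil(ref_length / LINEAR_WINDOW)
print(reg2bin(0, ref_length))           # 4681, the first 16 kb leaf bin
# One leaf bin with one chunk, plus the pseudo-bin with two chunk-sized records.
print(bai_size([([1, 2], n_windows)]))  # 96
```

Under these assumptions every read lands in bin 4681 and in the first linear window, so the index size is fixed no matter how many reads the bam contains.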

ADD COMMENT • link 16 months ago by LChart 3.9k

Many thanks for the explanation! Before I accept, I would like to ask a few more things to clarify the situation for myself. So as far as I understand:

The only manageable way is to split the fastq/fast5 into multiple parts -> align them individually to produce smaller bams -> index the small bams

In this case, I had 7-10 MB bam files instead of the 700 MB bam file described in the question. Even with these small bam files, IGV requested the entire file again every time I scrolled left or right. From your explanation, I understand there is no workaround.

Is there a platform for Oxford Nanopore where I can post the issue, so that they can work with the IGV developers on a new format or indexing scheme to prevent this from happening? Requesting files over and over again, even small ones, is very inefficient.

ADD REPLY • link 16 months ago by Ibrahim Tanyalcin ★ 1.2k

You can split the bam easily after alignment (sort by read name, chop it up, sort by position).

You can post the issue in the appropriate IGV repository: https://github.com/igvteam and be sure to specify that your issue is an edge case where the reference length is smaller than the .bai interval size. My guess is that when there is only one "interval" in the index, any kind of scrolling re-triggers a request even if all the reads are already held in memory, when it otherwise wouldn't.

ADD REPLY • link 16 months ago by LChart 3.9k