Hello Timze W,
Indexing is a fascinating topic.
The index file produced by samtools faidx and the .bai index have very different structures. I guess there are two main reasons for this:
- In a bam file we typically have many more entries than in a fasta file.
- The way we query the data. For a fasta file we typically ask "Give me the sequence with the id XY". For bam files we ask "Give me all reads that overlap a region".
The fasta index is quite simple. For each sequence it just contains the name, the number of bases, the byte offset in the file where the sequence itself starts (right after the header line), and how many bases and bytes there are per line. See the specs for it. As the number of sequences in a fasta file is quite small (compared to a bam file), we can just iterate over the index file to find the offset of the sequence we want in a reasonable time.
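To make the fasta case concrete, here is a minimal Python sketch of how a lookup through a .fai index could work. It assumes the five tab-separated columns of the faidx format (name, length, offset, bases per line, bytes per line). The helper names (FaiEntry, parse_fai, byte_offset, fetch) and the tiny in-memory fasta are made up for illustration; this is not how samtools itself is implemented.

```python
from typing import Dict, NamedTuple


class FaiEntry(NamedTuple):
    name: str
    length: int     # number of bases in the sequence
    offset: int     # byte offset of the first base in the fasta file
    linebases: int  # bases per line
    linewidth: int  # bytes per line, including the newline


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Read the index into a dict; a plain scan is fine for few sequences."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        entries[name] = FaiEntry(
            name, int(length), int(offset), int(linebases), int(linewidth)
        )
    return entries


def byte_offset(entry: FaiEntry, pos: int) -> int:
    """Byte offset in the fasta file of the 0-based base `pos`."""
    if not 0 <= pos < entry.length:
        raise IndexError(f"position {pos} outside {entry.name}")
    full_lines, rest = divmod(pos, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + rest


def fetch(fasta: bytes, entry: FaiEntry, start: int, end: int) -> str:
    """Return bases [start, end) by slicing raw bytes and dropping newlines."""
    if start >= end:
        return ""
    first = byte_offset(entry, start)
    last = byte_offset(entry, end - 1)
    return fasta[first:last + 1].replace(b"\n", b"").decode("ascii")


# Hypothetical example data: a two-sequence fasta and its matching index.
FASTA = b">chr1 demo\nACGTACGTAC\nGGTTAACC\n>chr2\nTTTT\n"
FAI = "chr1\t18\t11\t10\t11\nchr2\t4\t37\t4\t5\n"

if __name__ == "__main__":
    index = parse_fai(FAI)
    print(fetch(FASTA, index["chr1"], 8, 12))  # bases spanning a line break
```

On a real file you would seek to the computed offset and read only the bytes you need instead of holding the whole fasta in memory, which is exactly what the index makes possible.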
In the case of bam, the index file is organized in bins, which contain the offsets of the reads that overlap a region. See the sam specs for it. To be able to say where a bin begins and ends, it is necessary to sort the