Question: Should FASTA files be sorted before being indexed with SAMtools?
Timze W wrote (4 months ago):

Hello everyone,

We can index files to get fast random access to sequences, and SAM/BAM files must be sorted before they are indexed.

But when it comes to FASTA files, I do not understand why we can index them directly without sorting first:

 samtools faidx <ref.fa>

Is there an easy explanation? Thanks.

finswimmer wrote (4 months ago):

Hello Timze W,

Indexing is a fascinating topic.

The index produced by samtools faidx (.fai) and the BAM index (.bai) have very different structures. I guess there are two main reasons for this:

  1. A BAM file typically has many more entries than a FASTA file.
  2. The way we query the data. For a FASTA file we typically ask "Give me the sequence with the ID XY". For a BAM file we ask "Give me all reads that overlap a region".

The FASTA index is quite simple. For each sequence it just contains the name, the number of bases, the byte offset in the file where the sequence starts, and how many bases and bytes each line has. See the specs for it. Because every entry records its own byte offset, the order of the sequences in the file does not matter. And as the number of sequences in a FASTA file is quite small (compared to the number of reads in a BAM file), we can just iterate over the index to find the offset of the sequence we want in a reasonable time.

In the case of BAM, the index file is organized in bins, which contain the file offsets of the reads that overlap a region. See the SAM specs for it. To be able to say where a bin begins and ends, the BAM file must be sorted by coordinate first.
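As a rough illustration of the binning idea, the SAM specification gives a small C function, `reg2bin`, that maps a 0-based, half-open interval to the smallest bin that fully contains it. Below is a Python transcription of that function. It is only a sketch of the bin numbering and does not read or write a real .bai file.

```python
def reg2bin(beg, end):
    """Bin number for the 0-based, half-open interval [beg, end),
    following the binning scheme in the SAM specification."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0                                       # whole 512 Mbp range


print(reg2bin(0, 100))          # 4681: fits in the first 16 kbp bin
print(reg2bin(16000, 17000))    # 585: crosses a 16 kbp boundary
```

A read that crosses a bin boundary moves up to a larger bin. The index then records, per bin, where in the file the reads of that bin are stored, and those file ranges are only compact and meaningful when the reads are in coordinate order.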

fin swimmer
