I'm confused by how an FM-index is built from a genome with multiple chromosomes (or more generically, any multi-sequence file). I understand the principles of the BWT, but do aligners such as BWA and Bowtie compute a separate BWT for each sequence or do they concatenate all sequences then compute a single BWT?
I'm interested for its own sake, but also because I need to include the mitochondrial and chloroplast genomes in an index. BWA has one indexing method (IS) that can't handle a 'database' larger than 2 Gbp, while the other method (BWTSW) can't handle databases smaller than 10 Mbp, and the organelle genomes are smaller than this.
I just don't know if 'database' in the documentation means the sum of all sequences or whether each sequence is considered a separate database. If the sequences are all concatenated then BWTSW should work fine, but otherwise it seems neither single indexing method works for both the large chromosomes I have to deal with and the tiny organelle genomes.
Thanks for your time!
Since a single index is built for one multi-fasta reference, all of the sequences should count towards the size of the database. Someone with the right programming chops will need to confirm whether that interpretation is technically correct.
Thanks for your answer. OK, brilliant, that is the crux of the matter. If a single Burrows-Wheeler transform is computed on the whole concatenated sequence, then yes, the BWA BWTSW indexer should work.
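For anyone who wants to see the concatenation idea concretely, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not BWA's or Bowtie's actual code: the sequences and the function names (`concatenate`, `bwt_via_suffix_array`, `locate`) are made up for the example, and the suffix sort is deliberately naive. It joins all records into one text, builds a single BWT over it, and keeps a table of offsets so that a position in the concatenated text can be mapped back to a sequence name and a local coordinate.

```python
def concatenate(records):
    """Join all sequences into one text and remember where each one starts."""
    offsets = []  # (name, start in concatenated text, length)
    parts = []
    pos = 0
    for name, seq in records:
        offsets.append((name, pos, len(seq)))
        parts.append(seq)
        pos += len(seq)
    # A single '$' sentinel, which sorts before A, C, G and T, ends the text.
    return "".join(parts) + "$", offsets


def bwt_via_suffix_array(text):
    """Naive suffix sort; only suitable for toy inputs."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix;
    # for the suffix at position 0, text[-1] wraps around to the sentinel.
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def locate(pos, offsets):
    """Map a position in the concatenated text back to (name, local position)."""
    for name, start, length in offsets:
        if start <= pos < start + length:
            return name, pos - start
    raise ValueError("position falls outside every sequence")


# Hypothetical toy 'genome': one chromosome and one organelle sequence.
records = [("chr1", "ACGTACGT"), ("chrM", "GATTACA")]
text, offsets = concatenate(records)
sa, bwt = bwt_via_suffix_array(text)

# Brute-force lookup over the suffix array, just to show the coordinate mapping.
# A real tool would also have to reject hits that span a sequence boundary.
pattern = "TTAC"
hits = [i for i in sa if text.startswith(pattern, i)]
print([locate(i, offsets) for i in hits])
```

In this sketch the size of the indexed text is the sum of all sequence lengths plus the sentinel, which is the interpretation of 'database' discussed above. Whether a given BWA version applies its size limits in exactly this way is best confirmed against its documentation.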