How are BWA/bowtie/etc indices built for multiple fasta entries?
7 days ago

I'm confused by how an FM-index is built from a genome with multiple chromosomes (or more generically, any multi-sequence file). I understand the principles of the BWT, but do aligners such as BWA and Bowtie compute a separate BWT for each sequence or do they concatenate all sequences then compute a single BWT?

I'm interested for its own sake, but also because I need to include the mitochondrial and chloroplast genomes in an index. BWA has one indexing method (IS) that can't handle a 'database' larger than 2 Gbp, while the other method (BWTSW) can't handle databases smaller than 10 Mbp, and the organelle genomes are smaller than that.

I just don't know if 'database' in the documentation means the sum of all sequences or whether each sequence is considered a separate database. If the sequences are all concatenated then BWTSW should work fine, but otherwise it seems neither single indexing method works for both the large chromosomes I have to deal with and the tiny organelle genomes.
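To make the "sum of all sequences" reading concrete, here is a minimal Python sketch that totals the lengths of every record in a multi-FASTA and checks that total against the two limits quoted above. The helper names `fasta_lengths` and `viable_algorithms` are hypothetical, the thresholds are simply the figures from this post, and the check only applies if 'database' really does mean the combined size.

```python
import io

def fasta_lengths(handle):
    """Return {record_name: length_in_bp} for every record in a FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def viable_algorithms(total_bp, small_limit=10_000_000, large_limit=2_000_000_000):
    """List the indexing methods whose quoted size limits the total satisfies.

    Assumption: the limits apply to the combined length of all records.
    The default limits are the figures quoted in the question, not values
    read from BWA itself.
    """
    options = []
    if total_bp <= large_limit:
        options.append("is")
    if total_bp >= small_limit:
        options.append("bwtsw")
    return options

# Toy reference: one "chromosome" and one "organelle" record.
toy = io.StringIO(">chr1 toy chromosome\nACGTACGT\n>chrM toy organelle\nACGT\n")
lengths = fasta_lengths(toy)
total = sum(lengths.values())
```

On a real reference you would pass an open file handle instead of the `io.StringIO` toy; the point is only that the organelle records add to the total rather than being judged on their own.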


Since a single index is built for one multi-FASTA reference, all of its sequences should count towards the size of the database. Someone with the right programming chops will need to confirm whether that interpretation is technically correct.


Thanks for your answer. OK, brilliant, that is the crux of the matter. If a single Burrows-Wheeler transform is computed over the whole concatenated sequence, then yes, the BWA BWTSW indexer should work.

7 days ago

I believe that all the sequences are concatenated into a single one.
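As a toy illustration of that idea (not BWA's or Bowtie's actual code), the sketch below joins the records into one string, computes a single BWT over it with a naive sorted-rotations approach, and keeps a table of start offsets so that a position in the concatenated text can be translated back to a sequence name and local coordinate. The function names are hypothetical, and real indexers use far more memory-efficient construction.

```python
from bisect import bisect_right

def concatenate(records):
    """Join (name, sequence) pairs into one text and record where each starts."""
    names, starts, parts = [], [], []
    pos = 0
    for name, seq in records:
        names.append(name)
        starts.append(pos)
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), names, starts

def bwt(text):
    """Naive BWT: sort all rotations of text + '$' and take the last column."""
    text += "$"  # sentinel that sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def locate(pos, names, starts):
    """Map a 0-based position in the concatenated text to (name, local offset)."""
    i = bisect_right(starts, pos) - 1
    return names[i], pos - starts[i]

records = [("chr1", "ACGT"), ("chrM", "GA")]
text, names, starts = concatenate(records)
transformed = bwt(text)
```

With this layout, `locate(4, names, starts)` gives `("chrM", 0)`: the index itself only knows about one long string, and the per-sequence names are recovered afterwards from the offset table.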

There is an interesting side effect to this: the aligner can detect certain chromosomal fusions this way.

Basically, if a read partially aligns to the end of one chromosome and the beginning of another, bwa can generate an alignment for that (at least it used to).

It was one of the weird, unexpected cases specific to bwa where the alignment would be labeled as unmapped (flag 4), yet it would still have a coordinate, a CIGAR string, an alignment score, etc.
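Here is a rough sketch of why a hit can straddle two sequences when everything lives in one concatenated coordinate system. It assumes plain end-to-end concatenation with an offset table; the names, offsets and the `sequences_spanned` helper are all made up for illustration and do not reflect how bwa reports such hits.

```python
from bisect import bisect_right

def sequences_spanned(start, length, names, starts):
    """Return the names of all sequences a hit touches.

    start is the 0-based position of the hit in the concatenated text,
    length is the number of reference bases it covers, and starts holds
    the concatenated start offset of each sequence in names.
    """
    end = start + length - 1
    first = bisect_right(starts, start) - 1
    last = bisect_right(starts, end) - 1
    return names[first:last + 1]

# Hypothetical layout: chr4 covers 0-99, chr5 covers 100-249, chr7 starts at 250.
names = ["chr4", "chr5", "chr7"]
starts = [0, 100, 250]

inside = sequences_spanned(10, 10, names, starts)    # lies wholly within chr4
straddle = sequences_spanned(95, 10, names, starts)  # runs off the end of chr4
```

A hit that returns more than one name is the kind of case described above: it has a perfectly good coordinate in the concatenated text but does not belong to a single reference sequence.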


Oooh thank you for your answer - it answers my question (i.e. I can use BWA with very large and very small reference sequences, because they are concatenated into a single 'database') plus that is a super interesting side effect you describe!

Presumably it could only detect fusions across chromosomes that happen to be listed adjacently in the reference FASTA? For example, a fusion across the end of chr4 and the start of chr5, but not the end of chr4 and the start of chr7 (assuming the reference sequences are supplied in numerical order, of course!).

Thanks again, Max


That's how I understood it worked, from a forum discussion on SourceForge, I believe, where the developer Heng Li mentioned some internal implementation details. It was not designed to find all possible fusions, but there was a "free" solution for some cases: it would have taken extra effort to remove certain information, so instead it was kept and labeled as unmapped.


Interesting stuff! I'm new to informatics but I'm very much enjoying learning the quirks of these bits of software... not a fan of launching things into black boxes and hoping for the best!

Thanks again for your thoughts. Is 'accepting' your answer the best way to credit you on the site?