I'm trying to learn the theory behind the various steps in variant calling with GATK. Before alignment with BWA-MEM, we first index the reference genome, which generates a set of files with the extensions
where chr13and17.fa is the FASTA file containing the reference genome.
The next step in the pipeline is generating a .fai file using samtools with the command:
samtools faidx chr13and17.fa
Followed by generating a .dict file using Picard:
java -jar picard.jar CreateSequenceDictionary R=chr13and17.fa O=chr13and17.dict
I want to know WHY we generate a .fai file and a .dict file even though we have already indexed the genome with BWA. In the samtools manual, the reason for creating a .fai file is given as:
Using an fai index file in conjunction with a FASTA/FASTQ file containing reference sequences enables efficient access to arbitrary regions within those reference sequences.
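To make the question concrete, here is my rough understanding of what the .fai buys you. As far as I can tell from the faidx documentation, each line of the .fai holds five columns per sequence: NAME, LENGTH, OFFSET (byte position of the first base), LINEBASES and LINEWIDTH (bases and bytes per line). With those, the byte position of any base can be computed and read with a single seek. The Python below is only my own sketch of that idea, not how samtools is actually implemented; `build_fai` and `fetch` are hypothetical helper names, and it assumes every line of a sequence except the last has the same length.

```python
import os
import tempfile


def build_fai(fasta_path):
    """Scan a FASTA once; return {name: (length, offset, linebases, linewidth)}."""
    index = {}
    name = None
    with open(fasta_path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # fh.tell() is now the byte position of this sequence's first base
                index[name] = [0, fh.tell(), 0, 0]
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                entry = index[name]
                if entry[2] == 0:
                    entry[2] = bases      # bases per line
                    entry[3] = len(line)  # bytes per line, newline included
                entry[0] += bases
    return {k: tuple(v) for k, v in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) (0-based) using a seek instead of a full scan."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Skip whole lines (counting their newline bytes), then the remainder
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + (end // linebases) * linewidth + end % linebases
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    # Tiny made-up FASTA, only for illustration
    path = os.path.join(tempfile.mkdtemp(), "demo.fa")
    with open(path, "w", newline="\n") as out:
        out.write(">chr13 demo\nACGTACGTAC\nGTACGTACGT\nACGT\n>chr17\nTTTTGGGGCC\nCCAA\n")
    idx = build_fai(path)
    print(idx)
    print(fetch(path, idx, "chr13", 8, 12))
```

So the .fai seems to answer "where in the file is position X of chromosome Y?", which is a lookup by coordinate rather than a search for where a read sequence matches.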
Isn't 'efficient access to arbitrary regions of the genome' also the aim of indexing? I understand that the files store different information in different, well, formats. But why do we need all these different files?