Question: Why do we need a .fai file and a .dict file of the reference during alignment and variant calling using GATK?
Cookie-san (Bengaluru, India) wrote, 4 weeks ago:

I'm trying to learn the theory behind various steps in variant calling using GATK. Before alignment using BWA-MEM we first index the reference genome and this generates a set of files with the extensions

chr13and17.fa.amb chr13and17.fa.ann chr13and17.fa.bwt chr13and17.fa.pac

where chr13and17.fa is the FASTA file containing the reference genome.

The next step in the pipeline is generating a .fai using samtools with the command:

samtools faidx chr13and17.fa

Followed by generating a .dict file using Picard:

java -jar picard.jar CreateSequenceDictionary R=chr13and17.fa O=chr13and17.dict

I want to know WHY we generate a .fai file and a .dict file despite also indexing the genome. In the samtools manual, the reason for creating a .fai file is specified as:

Using an fai index file in conjunction with a FASTA/FASTQ file containing reference sequences enables efficient access to arbitrary regions within those reference sequences.

Isn't 'efficient access to arbitrary regions of the genome' also the aim of indexing? I understand the files themselves store different information in different, well, formats. But why all the different files though?


AFAIK only GATK and the Picard tools need the dict file. You're right, fai and dict are both index files in a manner of speaking, but they serve different functions: fai is the more widely used of the two, while Picard and GATK rely on the dict file to describe the reference sequences. Check out this thread on a similar topic: ".dict file created by picard and by samtools".

About all the different files that you encounter, here's my take: Welcome to Bioinformatics :-)

(Reply written 4 weeks ago by RamRS)

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 4 weeks ago:

index for bwa-mem: Burrows-Wheeler transform index, used to map the reads.

index fai: used by tools to list the chromosomes and quickly fetch a sequence from the FASTA file
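To make the fai point concrete, here is a minimal Python sketch of the idea. It is not samtools' own code, and the function names `build_fai` and `fetch` are made up for illustration. A .fai line stores five values per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. With those, a tool can compute where a region sits in the file and seek straight to it instead of reading the whole FASTA. The sketch assumes uniform line wrapping within each sequence and 1-based inclusive coordinates with start <= end.

```python
import os
import tempfile


def build_fai(fasta_path):
    """Scan a FASTA once and return (name, length, offset, linebases, linewidth) per sequence."""
    records = []
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if name is not None:
                    records.append((name, length, offset, linebases, linewidth))
                name = line[1:].split()[0].decode()
                length = linebases = linewidth = 0
                offset = pos + len(line)  # first base starts right after the header line
            else:
                bases = len(line.rstrip(b"\r\n"))
                if linebases == 0:
                    # The first sequence line defines the wrapping for this record.
                    linebases, linewidth = bases, len(line)
                length += bases
            pos += len(line)
    if name is not None:
        records.append((name, length, offset, linebases, linewidth))
    return records


def fetch(fasta_path, record, start, end):
    """Return bases start..end (1-based, inclusive) using seek arithmetic only."""
    _, length, offset, linebases, linewidth = record
    first = start - 1
    last = min(end, length) - 1
    first_byte = offset + (first // linebases) * linewidth + first % linebases
    last_byte = offset + (last // linebases) * linewidth + last % linebases
    with open(fasta_path, "rb") as handle:
        handle.seek(first_byte)
        chunk = handle.read(last_byte - first_byte + 1)
    # The slice may span line breaks, so drop the newline bytes.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    demo = b">chr13 demo\nACGTACGTAC\nGTACGTACGT\nACGT\n>chr17\nTTTTGGGGCC\nCCAA\n"
    with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
        tmp.write(demo)
    index = build_fai(tmp.name)
    for rec in index:
        print("\t".join(str(field) for field in rec))
    print(fetch(tmp.name, index[0], 9, 12))
    os.remove(tmp.name)
```

This is a different job from the BWA index: the BWT index answers "where does this read occur in the genome?", while the fai answers "what sequence is at chr13:9-12?".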

dict: lists the chromosomes but also provides information such as the MD5 sum of each FASTA sequence (to be sure that you're using the same REF), the name of the organism(s), aliases, the URL where the sequences can be retrieved, etc. This dict file will be inserted in/compared with the BAM and VCF headers.
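As a rough illustration of the dict side, the Python sketch below builds the kind of `@SQ` header lines a sequence dictionary contains, with the `SN` (name), `LN` (length) and `M5` (MD5) tags from the SAM header format. It is a simplified sketch, not Picard's implementation: the helper names `read_fasta` and `sq_lines` are hypothetical, and a real .dict file also carries an `@HD` line and further tags such as `UR`. The MD5 is computed on the uppercased sequence with line breaks removed, so it does not depend on how the FASTA is wrapped.

```python
import hashlib


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sq_lines(text):
    """Return one @SQ line per sequence with SN, LN and M5 tags."""
    lines = []
    for name, seq in read_fasta(text):
        digest = hashlib.md5(seq.upper().encode("ascii")).hexdigest()
        lines.append(f"@SQ\tSN:{name}\tLN:{len(seq)}\tM5:{digest}")
    return lines


if __name__ == "__main__":
    wrapped = ">chr13 demo\nACGTACGTAC\nGTACGTACGT\nACGT\n"
    unwrapped = ">chr13\nacgtacgtacgtacgtacgtacgt\n"
    for line in sq_lines(wrapped):
        print(line)
    # Same sequence, different wrapping and case: the @SQ lines match.
    print(sq_lines(wrapped) == sq_lines(unwrapped))
```

Because these lines end up in BAM and VCF headers, a tool can compare names, lengths and checksums to detect that a file was produced against a different reference.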
