What is in the fasta.fai file?
8.3 years ago

Hi, this may sound like a trivial question. I indexed my reference genome, and along with the indexed genome there is also a file ending in .fasta.fai. My question is: what is in the fasta.fai? I opened this file and it does not have a header. Did I do anything wrong?

I then used the reference genome to call SNPs. I assume some sort of position information is used to tell me the position of the SNPs in the reference genome. Which column is the position? Thanks.

Here is a subset of the fasta.fai file

gi|394055774|gb|AGTA02000001.1|    5516    93    70    71
gi|394055773|gb|AGTA02000002.1|    2292    5781    70    71
gi|394055772|gb|AGTA02000003.1|    4668    8199    70    71
gi|394055771|gb|AGTA02000004.1|    1190    13027    70    71

bowtie • 25k views
8.3 years ago
Dan D 7.3k

The fasta.fai is the fasta index, and the one you posted looks legit.

For each row:

Column 1: The contig name. In your FASTA file, this is preceded by '>'

Column 2: The number of bases in the contig

Column 3: The byte offset in the file where the contig's sequence begins. (Notice how it keeps increasing by roughly the amount in column 2?)

Column 4: bases per line in the FASTA file

Column 5: bytes per line in the FASTA file
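
To make the five columns concrete, here is a minimal Python sketch that parses one row and uses columns 3 to 5 to find the byte in the FASTA file that holds a given base. The names `FaiRecord`, `parse_fai_line` and `byte_offset` are made up for this illustration and are not part of samtools or any library; the sketch also assumes every sequence line except the last has the same length, which is what the index format relies on.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    """One row of a .fai file (hypothetical helper type for this example)."""
    name: str        # column 1: contig name
    length: int      # column 2: number of bases in the contig
    offset: int      # column 3: byte offset of the first base
    line_bases: int  # column 4: bases per line
    line_bytes: int  # column 5: bytes per line


def parse_fai_line(line: str) -> FaiRecord:
    # Real .fai files are tab-separated; splitting on any whitespace also
    # handles the space-separated rows as pasted in the question.
    name, length, offset, line_bases, line_bytes = line.split()
    return FaiRecord(name, int(length), int(offset), int(line_bases), int(line_bytes))


def byte_offset(rec: FaiRecord, pos: int) -> int:
    """Byte offset in the FASTA file of the base at 1-based position pos."""
    if not 1 <= pos <= rec.length:
        raise ValueError("position outside contig")
    # Whole lines before the base cost line_bytes each, then add the
    # remaining bases on the base's own line.
    full_lines, within_line = divmod(pos - 1, rec.line_bases)
    return rec.offset + full_lines * rec.line_bytes + within_line


rec = parse_fai_line("gi|394055774|gb|AGTA02000001.1|    5516    93    70    71")
print(byte_offset(rec, 1))   # 93, the first base sits at the offset in column 3
print(byte_offset(rec, 71))  # 164, the first base of the second line (93 + 71)
```

This lookup is what lets a tool jump straight to a region of the FASTA file without reading everything before it.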


Just out of curiosity, about the information in column 3. Say this contig:

gi|394055774|gb|AGTA02000001.1|    5516    93    70    71

has a length of 5516 and starts at byte 93.

Wouldn't I expect the next contig to start at approximately 5516 + 93 = 5609? In reality it starts at 5781. Likewise, I would expect the contig after that to start at approximately 2292 + 5781 = 8073, but it actually starts at 8199. Why is this the case? This is related to the downstream analysis of SNPs: when I looked at the VCF file, each SNP has a position column, say gi|394055774|gb|AGTA02000001.1| 277. Does that mean the variant site is the 277th base of the 5516? Thank you for your explanation.


The contig names themselves of course take up bytes in the file, but also take another look at columns 4 and 5. Notice how the number in column 5 is one more than column 4? There are 70 bases per line, but 71 bytes, because the newline character at the end of each line is one byte long. Those newline bytes, together with the header line of the next contig, explain the apparent discrepancy you're seeing.
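
To check that arithmetic against the four rows posted in the question, here is a rough Python sketch. The function name `sequence_end` is invented for this example, and it assumes each line, including the last, shorter line of a contig, ends with a single one-byte newline.

```python
def sequence_end(offset: int, length: int, line_bases: int, line_bytes: int) -> int:
    """Byte offset just past the last sequence line of a contig."""
    newline_bytes = line_bytes - line_bases         # 71 - 70 = 1 here
    full_lines, remainder = divmod(length, line_bases)
    n_bytes = full_lines * line_bytes               # complete 70-base lines
    if remainder:
        n_bytes += remainder + newline_bytes        # final, shorter line
    return offset + n_bytes


# (length, offset, bases per line, bytes per line) from the question
rows = [
    (5516, 93, 70, 71),
    (2292, 5781, 70, 71),
    (4668, 8199, 70, 71),
    (1190, 13027, 70, 71),
]

# Whatever lies between the end of one contig's sequence and the start of
# the next contig's sequence must be the next contig's header line.
header_bytes = [
    nxt[1] - sequence_end(offset, length, line_bases, line_bytes)
    for (length, offset, line_bases, line_bytes), nxt in zip(rows, rows[1:])
]
print(header_bytes)  # [93, 93, 93]
```

For the first contig, 5516 bases fill 78 full lines plus one line of 56 bases, so the sequence occupies 5516 + 79 = 5595 bytes and ends at byte 5688. The remaining 93 bytes up to 5781 are the next header line, '>' and newline included, which matches the first sequence itself starting at byte 93.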

In your VCF file, the SNP position is the genomic position (number of bases into the chromosome/contig), and has no direct relationship to the .fai data. So yes, a position of 277 means the 277th base of that 5516-base contig.


This helped a lot!

In other words, does the SNP position in the genome really depend on how the contigs are organized or lined up with the other contigs in the genome?


Think of it like this:

Let's say we have a really simple genome represented in a FASTA file, with two contigs:

>contig1
AAAAAATTTTTT
>contig2
CCCCCCGGGGG


And you have a sequence you want to align: AGTTTTT

So you can align it like this to contig1, with a single mismatch:

AAAAAATTTTTT
     AGTTTTT
      ^


The SNP occurs at the seventh base of the 'contig1' sequence. So the VCF file should give you a position value of 7 for that SNP. Make sense?
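
The same toy example as a small Python sketch. It assumes the read is placed so that its first base lines up with base 6 of contig1, and the function name `mismatches` is made up for illustration. Positions are counted from 1 within the contig, the way the VCF position column counts them.

```python
def mismatches(contig: str, read: str, start: int) -> list:
    """List (position, contig_base, read_base) for every mismatch.

    start is the 1-based contig position where the read's first base aligns.
    Reported positions are 1-based and relative to the contig, not the file.
    """
    found = []
    for i, read_base in enumerate(read):
        contig_base = contig[start - 1 + i]  # convert to 0-based indexing
        if contig_base != read_base:
            found.append((start + i, contig_base, read_base))
    return found


contig1 = "AAAAAATTTTTT"
print(mismatches(contig1, "AGTTTTT", start=6))  # [(7, 'T', 'G')]
```

Nothing from the .fai file is needed here: the position depends only on the contig's own sequence, not on where that contig sits in the file or relative to other contigs.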


Makes a lot of sense now. Thank you so much.