faidx implementation with C++
3
0
Entering edit mode
4.2 years ago
bdelolmo ▴ 10

Hello,

I am learning C++, and after learning STL I would like to do my own implementation of faidx to perform sequence extraction from a genome. I know there are nice APIs out there (seqAns, etc) but I want to understand the principles behind it. I have tried to understand the faidx implementation on samtools C API, but it' too much for learning the basics so I am a bit lost at the beggining. Any suggestion?

C++ FASTA indexing • 1.4k views
ADD COMMENT
2
Entering edit mode
4.2 years ago

Read http://www.htslib.org/doc/faidx.html, which discusses the format of _.fai_ files and how they are used to index well-formatted FASTA files. Working from this will be clearer than working from the implementation in HTSlib, and when you are more familiar with it the HTSlib/Samtools implementation may make a bit more sense.

ADD COMMENT
1
Entering edit mode
4.2 years ago
sacha ★ 2.1k

with the following fasta file :

>chr1
ACGTATATATATGTAT
ACGTATATATATATAT
>chr2
ACGTATATATTATATT
ACTATATATATTATAT
ACTATATATATATATT
GTATATATAAATTATA

the faidx file contains :

chr1    32      6       16      17
chr2    64      46      16      17

One line per fasta header.

  • column 1 : header name
  • column 2 : sequence size ( how many nucleotid)
  • column 3 : file offset where the sequence start. ( How many caracter from the top of the file )
  • column 4 : line size ( how many nucleotid per line without cariage or line return )
  • column 5 : full line size ( how many caracter with line return )

Now, if you would like to get a sub sequence efficiently, read the index and create the following function:

string sequence(string chr, int start, int end)
{
           Index index = allIndex[chr];
           // where file reading start 
           int offset = index.offset + start + start / index.chompLineSize;
           // move into file 
           file.seek(offset);
           // int length = end - start
           return file.read(length + (length)/index.chompLineSize);

}

Here is my own implementation using Qt/C++ https://github.com/labsquare/CuteFasta/blob/master/src/FastaFile.cpp#L10

ADD COMMENT
1
Entering edit mode
4.2 years ago

You already have some pretty good answers, but I wanted to point out that I've done a Python implementation here: https://github.com/mdshw5/pyfaidx/blob/master/pyfaidx/__init__.py#L537

There's a lot of object-oriented stuff to get around, but the docstring explains the logic behind fetching a subsequence:

    Fetch the sequence ``[start:end]`` from ``rname`` using 1-based coordinates
    1. Count newlines before start
    2. Count newlines to end
    3. Difference of 1 and 2 is number of newlines in [start:end]
    4. Seek to start position, taking newlines into account
    5. Read to end position, return sequence
ADD COMMENT

Login before adding your answer.

Traffic: 1976 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6