I'm coding in c++ and reading in different reference genomes to examine regions across the chromosomes a few hundred base pairs at a time.
To do it in c++ I use the library htslib and the command
faidx_fetch_seq() which is similar to read specific regions using
The problem is when I'm reading in the reference genomes, there are of course multiple regions consisting only of
NNNNN (and so on). Since I only need the actual DNA sequence I wish to read in the entire file but removing the regions with
I have tried reading in the regions in c++ and then just creating an if-statement to check the actual bases consist of N. This works but is incredibly slow.
So my question is: Does anyone know how to only read in the actual DNA sequences, removing the
NNN regions? Either with samtools or with the c++ function