Question: Reading a compressed stream with SeqAn
pmarijon wrote (11 months ago):

Hi,

I want to read a sequence file (FASTA, FASTQ, BAM, etc.), so I read the SeqAn tutorial. To generate a progress bar I need to know my current position in the file, so I open the file with std::ifstream and hand the stream to SeqAn. That is not a problem in itself; I wrote this test code:

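#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>

int main(int argc, char ** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " <sequence file>" << std::endl;
        return 1;
    }

    // Open the file ourselves so that the stream position can be queried.
    std::ifstream myfile(argv[1], std::ios::in | std::ios::binary);
    if (!myfile)
    {
        std::cerr << "cannot open " << argv[1] << std::endl;
        return 1;
    }

    std::streampos begin = myfile.tellg();

    // Hand the already opened stream to SeqAn.
    seqan::SeqFileIn seq_file(myfile);
    seqan::CharString id;
    seqan::Dna5String seq;
    seqan::CharString qual;

    while (!seqan::atEnd(seq_file))
    {
        seqan::readRecord(id, seq, qual, seq_file);
        // Report how far the underlying file stream has been read.
        std::cout << "pos: " << myfile.tellg() << " id " << id << std::endl;
    }

    std::streampos end = myfile.tellg();

    myfile.close();

    std::cout << "begin: " << begin << " end: " << end << std::endl;
    std::cout << "size is: " << (end - begin) << " bytes.\n" << std::endl;
    return 0;
}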

But when I try this code on a compressed FASTQ file, SeqAn throws an exception: terminate called after throwing an instance of 'seqan::ParseError'

My questions:

  • Is std::ifstream the only way to get the current position in the file?
  • How can I tell SeqAn that this stream is a compressed stream?
  • Can I generate an uncompressed stream from my compressed stream (with SeqAn or zlib)?

Thanks.


Why would you want to know the position of a FASTQ record in a compressed file? Unless you're using BGZF, there is no way to 'fseek' in a gzip file...

written 11 months ago by Pierre Lindenbaum

I want to generate a progress bar; I have edited the post to say so. For a compressed file we can get a good approximation from the size of the compressed file and the current position in the compressed file.

written 11 months ago by pmarijon
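
The approximation described in this comment can be sketched as follows. This is only an illustration: progressFraction and progressBar are hypothetical helper names, not part of SeqAn, and the sketch assumes the caller can obtain the current byte offset in the compressed file (for example from tellg() on the raw file stream) and the compressed file size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: fraction of the compressed file consumed so far,
// clamped to [0, 1]. An empty file counts as fully read.
double progressFraction(std::uint64_t compressedPos, std::uint64_t compressedSize)
{
    if (compressedSize == 0 || compressedPos >= compressedSize)
        return 1.0;
    return static_cast<double>(compressedPos) / static_cast<double>(compressedSize);
}

// Hypothetical helper: renders a text bar such as "[#####     ] 50%".
// Expects a fraction in [0, 1] and a positive width.
std::string progressBar(double fraction, int width = 10)
{
    assert(width > 0);
    assert(fraction >= 0.0 && fraction <= 1.0);
    int filled = static_cast<int>(fraction * width);
    int percent = static_cast<int>(fraction * 100.0);
    return "[" + std::string(static_cast<std::size_t>(filled), '#')
               + std::string(static_cast<std::size_t>(width - filled), ' ')
               + "] " + std::to_string(percent) + "%";
}
```

The estimate is only as good as the assumption that the compression ratio is roughly uniform along the file, which is usually close enough for a progress bar.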

Then I would create a custom std::streambuf to count the number of bytes, e.g.: https://artofcode.wordpress.com/2010/12/12/deriving-from-stdstreambuf/

written 11 months ago by Pierre Lindenbaum
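
A minimal sketch of such a byte-counting std::streambuf, using only the standard library, could look like this. CountingStreambuf is a hypothetical name chosen for illustration; it is not taken from the linked article or from SeqAn, and whether SeqFileIn accepts a stream built on it has not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // for std::istringstream when trying this out in memory
#include <streambuf>
#include <string>

// Hypothetical helper: a read-only filter that forwards to another
// streambuf and counts the bytes it has fetched from it.
class CountingStreambuf : public std::streambuf
{
public:
    explicit CountingStreambuf(std::streambuf * source) : source_(source)
    {
        assert(source != nullptr);
    }

    // Bytes pulled from the source so far. This advances one buffer at a
    // time, so it can run ahead of what the reader has actually consumed.
    std::size_t bytesRead() const { return count_; }

protected:
    // Called when the get area is exhausted: refill it from the source.
    int_type underflow() override
    {
        std::streamsize n = source_->sgetn(buffer_, sizeof(buffer_));
        if (n <= 0)
            return traits_type::eof();
        count_ += static_cast<std::size_t>(n);
        setg(buffer_, buffer_, buffer_ + n);
        return traits_type::to_int_type(buffer_[0]);
    }

private:
    std::streambuf * source_;
    std::size_t count_ = 0;
    char buffer_[4096];
};
```

To use it, wrap the buffer of the raw file stream, e.g. CountingStreambuf counter(file.rdbuf()); std::istream in(&counter); and read from in. Dividing counter.bytesRead() by the file size then gives the progress. The wrapper does not implement seeking, so tellg() on in will not work.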

I use a std::ifstream to get the current position in the file during SeqAn parsing; that part is easy. But when I try my code on a compressed file, SeqAn's parsing fails. So either SeqAn didn't detect that my stream contains compressed data, or SeqAn can't work on a compressed stream and this isn't documented.

written 11 months ago by pmarijon

So either SeqAn didn't detect that my stream contains compressed data, or SeqAn can't work on a compressed stream and this isn't documented.

Usually it is the other way round: things don't work on compressed data unless documented.

written 11 months ago by kloetzl

It is documented:

These classes provide an API for accessing sequence files in different file formats, either compressed or uncompressed.

Source : https://seqan.readthedocs.io/en/master/Tutorial/InputOutput/SequenceIO.html

written 11 months ago by pmarijon

Well, there is compressed .bam and compressed .gz.

written 11 months ago by kloetzl

And bz2 according to http://docs.seqan.de/seqan/master/group_FileCompressionTags.html#FileCompressionTags%23BgzFile

written 11 months ago by pmarijon
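
The formats mentioned above (gzip, BGZF as used by BAM, and bzip2) can be told apart by their first bytes, which helps when checking what a stream actually contains before handing it to a parser. The sketch below uses only the standard library; classifyHeader and peekCompression are hypothetical names, not SeqAn functions. It relies on the gzip magic bytes 0x1f 0x8b, the bzip2 signature "BZh", and the BGZF convention of a gzip member whose first extra subfield is tagged 'B','C'. It checks only that first subfield and assumes a seekable stream positioned at its start.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // for std::istringstream when trying this out in memory
#include <string>

enum class Compression { None, Gzip, Bgzf, Bzip2 };

// Hypothetical helper: classify a file by its leading bytes.
Compression classifyHeader(const std::string & head)
{
    auto byte = [&head](std::size_t i) { return static_cast<unsigned char>(head[i]); };

    if (head.size() >= 3 && head.compare(0, 3, "BZh") == 0)
        return Compression::Bzip2;

    if (head.size() >= 2 && byte(0) == 0x1f && byte(1) == 0x8b)
    {
        // BGZF: FLG.FEXTRA (bit 2) is set and the first extra subfield,
        // at offsets 12 and 13, carries the identifier 'B','C'.
        bool hasExtra = head.size() >= 14 && (byte(3) & 0x04) != 0;
        if (hasExtra && byte(12) == 'B' && byte(13) == 'C')
            return Compression::Bgzf;
        return Compression::Gzip;
    }
    return Compression::None;
}

// Hypothetical helper: look at the first 14 bytes of a seekable stream
// and rewind it, so that a parser can still read from the beginning.
Compression peekCompression(std::istream & in)
{
    assert(in.tellg() == std::streampos(0));  // expects a fresh, seekable stream
    char buf[14];
    in.read(buf, sizeof(buf));
    std::size_t n = static_cast<std::size_t>(in.gcount());
    in.clear();   // a short read sets eofbit/failbit; reset before seeking
    in.seekg(0);
    return classifyHeader(std::string(buf, n));
}
```

If peekCompression reports Gzip or Bzip2 on a file that the parser rejects, the compressed data is probably reaching the parser without being decompressed.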