Question: Correct, Robust Fastq Parsing?
9
gravatar for Ketil
9.3 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

Hi

I'm currently using samtools.pl to generate consensus and quality from a resequencing effort. Like so many other Fastq parsers, I made the assumption of a four-line format, but samtools produces a fastq format where sequence and quality runs over multiple lines.

I could of course extend the parser to accomodate this format, but this seems pretty hard to get correct, since quality information may contain both + and @ as valid characters.

As I see it, the correct way to do it is to read sequence until a line with either a single '+' OR a '+' followed by the same read name as was used in the '@' line. And then read the same number of quality values.

I think this might work, but as there are multiple other Fastq parser implementation, I'm curious how these deal with this issue?

fastq parsing • 3.3k views
ADD COMMENTlink modified 9.3 years ago by iw9oel_ad6.1k • written 9.3 years ago by Ketil4.0k
6
gravatar for Peter
9.3 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

In general the quality string can contain both the @ and + characters, and if you allow line wrapping this means any quality line may start with those characters. The only way to be safe is to track the length of the sequence, and thus know how many quality characters there should be. This also has implications for quickly counting the number of reads, and for indexing.

Please read this paper http://dx.doi.org/10.1093/nar/gkp1137 and test your code with the provided example files which are used in the Biopython, BioPerl, BioJava, BioRuby and EMBOSS test suites.

ADD COMMENTlink written 9.3 years ago by Peter5.8k

Pedantic note: The paper says that in the sequence data "there is no explicit limitation on the characters expected". I don't think it's possible (and certainly not practical) to parse FastQ files unambigously unless you at least prohibit '+'. OTOH, "a gap character" is explicitly allowed, and the only thing not allowed is whitespace other than newline.

ADD REPLYlink written 9.3 years ago by Ketil4.0k
3
gravatar for brentp
9.3 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

You can see how biopython handles this here

Check the docstring (just below the linked line) for some of the worst-cases you'd have to account for if you re-implemented yourself.

ADD COMMENTlink written 9.3 years ago by brentp23k

Looks fairly complete. It fails on empty reads (e.g. "@foon+n"), and nameless reads (@nACGTn+nBCDE") and doesn't treat "r" as whitespace (i.e. it is silently ignored), but these are arguably not illegal or actual limitations.

ADD REPLYlink written 9.3 years ago by Ketil4.0k
2
gravatar for iw9oel_ad
9.3 years ago by
iw9oel_ad6.1k
iw9oel_ad6.1k wrote:

That's exactly how I parse Fastq; count the sequence characters, count the quality characters. I also catch pathological cases where the identity is an empty string. This is technically legal because the Fastq paper explicitly states that there is no length limit on the title field. The authors probably meant no upper length limit only.

The Fastq paper is a high-visibility source that describes the format as

@title and optional description
sequence line(s)
+optional repeat of title line
quality line(s)

which means a lot of people will expect general-purpose Fastq parsers to handle this by default.

ADD COMMENTlink modified 9.3 years ago • written 9.3 years ago by iw9oel_ad6.1k

Good point. Leaving out the identifier is a common compression trick - probably about to be explicitly allowed for SAM/BAM non-paired data. It makes sense to allow this in FASTQ (and FASTA and QUAL) too - I'd have to check my parser copes.

ADD REPLYlink written 9.2 years ago by Peter5.8k
1
gravatar for Pierre Lindenbaum
9.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

Edit: there is a Java parser for FASTQ in the picard library:

http://picard-tools.sourcearchive.com/documentation/1.25-1/FastqReader_8java-source.html

but as far as I can see, it also expects four lines.

I wrote a fastq parser, one day. I don't even remember if I used it :-)

It works as an iterator :

import java.io.BufferedReader;
import java.io.IOException;

/**
 * FastQReader
 * @author Pierre Lindenbaum
 *
 */
public class FastQReader
    {
    private BufferedReader in;
    private String prev_line=null;
    private String name=null;
    private StringBuilder sequence=new StringBuilder();
    private StringBuilder quality=new StringBuilder();
    private long nLine=0L;

    public FastQReader(BufferedReader in)
        {
        this.in=in;
        }

    public String toFasQ()
        {
        return getName()+"\n"+getSequence()+"\n+\n"+getQuality();
        }

    private String _readLine() throws IOException
        {
        if(prev_line!=null)
            {
            String s=prev_line;
            prev_line=null;
            return s;
            }
        String s=null;
        while((s=in.readLine())!=null)
            {
            ++nLine;
            if(s.trim().isEmpty())
                {
                continue;
                }
            return s;
            }
        return null;
        }

    public boolean next() throws IOException
        {
        name=null;
        sequence.setLength(0);
        quality.setLength(0);
        String line=_readLine();
        if(line==null) return false;
        if(!line.startsWith("@") ) throw new IOException("expected a line starting with @ but found "+line);
        name=line.substring(1).trim();
        while(true)
            {
            line=_readLine();
            if(line==null) throw new IOException("bad sequence for @"+name);
            if(line.startsWith("+"))
                {
                break;
                }
            sequence.append(line);
            }
        while(true)
            {
            line=_readLine();
            if(line==null) //last line ?
                {
                if(quality.length()!=sequence.length())
                    {
                    throw new IOException("bad quality for @"+name +" next line is null and sequence:\n"+
                            sequence +"("+sequence.length()+") and quality:\n"+quality+"("+quality.length()+")");
                    }
                break;
                }
            if(line.startsWith("@") &&
                quality.length()==sequence.length()) //quality can start with '@'
                {
                this.prev_line=line;
                break;
                }
            quality.append(line);
            }
        if(quality.length()!=sequence.length())
            {
            throw new IOException(
                    "length(quality)!=length(dna) for @"+name+"\nseq:"+sequence+"("+sequence.length()+")\nqual:"+quality+"("+quality.length()+")");
            }
        return true;
        }

    public String getName() {
        return name;
        }

    public StringBuilder getSequence()
        {
        return sequence;
        }

    public StringBuilder getQuality() {
        return quality;
        }

    public long getLine() {
        return nLine;
        }

    @Override
    public boolean equals(Object obj) {
        return obj==this;
        }

    @Override
    public String toString()
        {
        return getClass().getName()+":"+getLine();
        }
    }
ADD COMMENTlink modified 9 months ago by RamRS27k • written 9.3 years ago by Pierre Lindenbaum129k
1
gravatar for Martin A Hansen
9.3 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Wrapping of sequence, qualities, etc is a pain. I guess people insist on wrapping sequences for historical reason because in the old days they wanted a "human" readable sequence that printed nicely on paper so they could mark up their primers with a pencil.

In the days of NGS - are there ANY valid reasons for wrapping sequences and qualities?

ADD COMMENTlink written 9.3 years ago by Martin A Hansen3.0k

Human readability is always crucial. It is utmost important to "feel" the data rather than simply let programs to read it.

ADD REPLYlink written 9.3 years ago by lh332k

Human readability is always crucial. It is utmost important to "feel" the data rather than simply let programs do that for you.

ADD REPLYlink written 9.3 years ago by lh332k

So you think that a string of "ffedeee^eeeeedeeeeeee" is human readable? :o). I agree that it is important to get a "feel" for the data but the times of one page GenBank files is long gone ...

ADD REPLYlink written 9.2 years ago by Martin A Hansen3.0k

I mean the sequence should be readable, not quality.

ADD REPLYlink written 8.8 years ago by lh332k

If I use cat or less on a single-line sequence my terminal will display it perfectly. If you need a to view the sequenced wrapped - then wrap the sequence for that purpose only.

ADD REPLYlink written 8.8 years ago by Martin A Hansen3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1953 users visited in the last hour