How to read fastq file with Python
4
1
Entering edit mode
7.4 years ago
Qingyang Xiao ▴ 160

Hello! What's the meaning of reading fastQ file alone? And how to read fastQ file with Python?

sequencing next-gen • 37k views
ADD COMMENT
2
Entering edit mode

Is this an assignment question?

If it is not you need to clarify what you are trying to? If you have written some code then please post that in your original question if you want to get assistance with an error or question.

ADD REPLY
6
Entering edit mode
7.4 years ago

Assuming typical four-line FASTQ with no funny business:

#!/usr/bin/env python

import sys
import os

def process(lines=None):
    ks = ['name', 'sequence', 'optional', 'quality']
    return {k: v for k, v in zip(ks, lines)}

try:
    fn = sys.argv[1]
except IndexError as ie:
    raise SystemError("Error: Specify file name\n")

if not os.path.exists(fn):
    raise SystemError("Error: File does not exist\n")

n = 4
with open(fn, 'r') as fh:
    lines = []
    for line in fh:
        lines.append(line.rstrip())
        if len(lines) == n:
            record = process(lines)
            sys.stderr.write("Record: %s\n" % (str(record)))
            lines = []

Edited to use Wouter's faster dict comprehension, and edited to add some error checking and to remove an unnecessary last-pass at the process function.

Example input:

$ cat test.fq 
@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1
TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA
+
11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221

Example run:

$ ./parseFastqRecords.py test.fq
Record: {'name': '@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1', 'sequence': 'TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA', 'optional': '+', 'quality': '11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221'}
ADD COMMENT
1
Entering edit mode

I bet this beats the crap out of the Biopython parser :-)

Because the alternative was writing for my review, I looked for some variation to the process() function. It turns out that a dict comprehension is slightly more performant, see this SO post and replicated on my system:

{ks[idx]:val for idx, val in enumerate(lines)}: 475 ns ± 6.99 ns per loop
{k: v for k, v in zip(ks, lines)}: 408 ns ± 6.66 ns per loop

ADD REPLY
0
Entering edit mode

Nice test! I don't know why I get so much pushback in other questions when I mention that BioPython/SeqIO is a slow way to parse files. It's a useful library, no question, but it might be worth exploring why it is slow, and for devs to start working in some modern Pythonic ways to making this fast for typical use cases.

ADD REPLY
0
Entering edit mode

Well, I believe part of the reason is that it doesn't make assumptions. As far as I know a fastq record can be wrapped (I believe this is also what cschu181 means in the comment https://www.biostars.org/p/317524/#317602). Your code would break, Biopython wouldn't. Nevertheless, fastq files like that belong in hell.

ADD REPLY
1
Entering edit mode

I think mainly it’s because Biopython spends so much time testing for the compliance of the input and catching all the errors and as many of the edge cases as possible.

As ever:

enter image description here

ADD REPLY
1
Entering edit mode

Interestingly, if you dig down inside SeqIO, there are more basal functions that could run faster. From the source code header:

Some of these parsers are wrappers around low-level parsers which build up 
  38  SeqRecord objects for the consistent SeqIO interface. In cases where the 
  39  run-time is critical, such as large FASTA or FASTQ files, calling these 
  40  underlying parsers will be much faster - in this case these generator 
  41  functions which return tuples of strings: 
  42   
  43  >>> from Bio.SeqIO.FastaIO import SimpleFastaParser 
  44  >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
ADD REPLY
0
Entering edit mode

That dict function is sweet

ADD REPLY
0
Entering edit mode

That kinda makes me wonder, why FASTQ is not provided as either tabular or json (although here I understand that this causes overhead) instead...

ADD REPLY
0
Entering edit mode

That's probably a historical mistake.

ADD REPLY
0
Entering edit mode

I wouldn't be surprised if column modes on late-80s, early-90s terminal environments (vt100 etc.) determined how FASTA and other early genomic formats were written, so that files could be inspected and worked with by humans. Perhaps FASTQ is an offshoot of those formats, in that respect.

ADD REPLY
0
Entering edit mode

This implementation looks like it will run very slow when reading a very large FASTQ file (with 200 million reads). A dictionary and a user-defined Python function is called for every single read.

Here is a more efficient and fast implementation of a FASTQ reader in Python. https://github.com/lh3/readfq/blob/master/readfq.py

ADD REPLY
0
Entering edit mode

That entirely depends on what you need to do with the data after parsing the fastq file. For some goals it will be possible to just iterate over the file (like the example you linked) for other applications you might need a dictionary or a dataframe.

ADD REPLY
0
Entering edit mode

I love python, but if your files are exorbitantly large, it might be time to accept that BioPython is not the right hammer for this particular nail.

ADD REPLY
0
Entering edit mode

In this case pysam might be the right hammer ... :)

ADD REPLY
3
Entering edit mode
7.4 years ago
Joe 22k

A fastq file contains 4 lines per sequence:

@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1
TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA
+
11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221

The header (@.... on line 1) contains assorted information such as the machine identifier, run ID and so on (it will vary by origin).

On the next line comes the sequence itself. Nothing special about this.

Following that is a + on line 3. Again, nothing special about that. It just demarcates the sequence from the following line.

On the 4th line is the quality per-base. It uses a special encoding called "Phred" scores, which maps a quality value (0 - 40 usually), to an ASCII character. You can find out more about the encoding here for example. The score is therefore encoded by a single byte, and the ASCII code is equal to the quality score + 33.

You can read FASTQ files with BioPython, or any number of alternatives. What the right answer here is very much depends on what you want to do. BioPython would be a biblically slow way to process large FASTQ files for example.

ADD COMMENT
3
Entering edit mode

Upvote for "biblically slow"

ADD REPLY
0
Entering edit mode

For completeness sake...:

line 3 may contain, following the + the identifier which was specified after the @ on line 1. If the + is followed by anything, it has to be exactly this identifier.

Ridiculous, but oh well what can we do about it :)

ADD REPLY
0
Entering edit mode

Also for completeness sake, there are apparently tools out there (concrete example escapes me right now), which produce "formatted" fastq files, i.e. header, N lines sequence with fixed number of characters, +, N lines quality scores ...

ADD REPLY
0
Entering edit mode

I did not know this! No idea what use that serves haha

ADD REPLY
1
Entering edit mode

When datasizes were small(er) that made sure that the sequence/scores were easily identifiable as a pair visually.

ADD REPLY
0
Entering edit mode

Also for completeness sake, the 4-line per record (sequence) in FASTQ is not mandatory, but it is a practical decision as it may be a bit complicated to parse correctly "multi-line" fastq files:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with.

ADD REPLY
1
Entering edit mode
7.4 years ago

FASTQ format is human readable. There is no trick to simply reading a fastq file.

ADD COMMENT
1
Entering edit mode
7.4 years ago
arudhir ▴ 10

If you really want to read FASTQ files using Python, BioPython's SeqIO module should be able to read the files.

But as the other poster said, it's going to be really slow and inappropriate for the task if you're trying to analyze a large number of FASTQ files.

ADD COMMENT

Login before adding your answer.

Traffic: 3575 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6