Question: How to read fastq file with Python
1
gravatar for Qingyang Xiao
2.7 years ago by
Qingyang Xiao140
Stockholm
Qingyang Xiao140 wrote:

Hello! What's the meaning of reading fastQ file alone? And how to read fastQ file with Python?

sequencing next-gen • 14k views
ADD COMMENTlink modified 2.6 years ago by Alex Reynolds31k • written 2.7 years ago by Qingyang Xiao140
2

Is this an assignment question?

If it is not you need to clarify what you are trying to? If you have written some code then please post that in your original question if you want to get assistance with an error or question.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by GenoMax94k
6
gravatar for Alex Reynolds
2.6 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Assuming typical four-line FASTQ with no funny business:

#!/usr/bin/env python

import sys
import os

def process(lines=None):
    ks = ['name', 'sequence', 'optional', 'quality']
    return {k: v for k, v in zip(ks, lines)}

try:
    fn = sys.argv[1]
except IndexError as ie:
    raise SystemError("Error: Specify file name\n")

if not os.path.exists(fn):
    raise SystemError("Error: File does not exist\n")

n = 4
with open(fn, 'r') as fh:
    lines = []
    for line in fh:
        lines.append(line.rstrip())
        if len(lines) == n:
            record = process(lines)
            sys.stderr.write("Record: %s\n" % (str(record)))
            lines = []

Edited to use Wouter's faster dict comprehension, and edited to add some error checking and to remove an unnecessary last-pass at the process function.

Example input:

$ cat test.fq 
@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1
TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA
+
11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221

Example run:

$ ./parseFastqRecords.py test.fq
Record: {'name': '@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1', 'sequence': 'TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA', 'optional': '+', 'quality': '11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221'}
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Alex Reynolds31k

That dict function is sweet

ADD REPLYlink written 2.6 years ago by Joe18k

That kinda makes me wonder, why FASTQ is not provided as either tabular or json (although here I understand that this causes overhead) instead...

ADD REPLYlink written 2.6 years ago by cschu1812.5k

That's probably a historical mistake.

ADD REPLYlink written 2.6 years ago by WouterDeCoster45k

I wouldn't be surprised if column modes on late-80s, early-90s terminal environments (vt100 etc.) determined how FASTA and other early genomic formats were written, so that files could be inspected and worked with by humans. Perhaps FASTQ is an offshoot of those formats, in that respect.

ADD REPLYlink written 2.6 years ago by Alex Reynolds31k

I bet this beats the crap out of the Biopython parser :-)

Because the alternative was writing for my review, I looked for some variation to the process() function. It turns out that a dict comprehension is slightly more performant, see this SO post and replicated on my system:

{ks[idx]:val for idx, val in enumerate(lines)}: 475 ns ± 6.99 ns per loop
{k: v for k, v in zip(ks, lines)}: 408 ns ± 6.66 ns per loop

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by WouterDeCoster45k

Nice test! I don't know why I get so much pushback in other questions when I mention that BioPython/SeqIO is a slow way to parse files. It's a useful library, no question, but it might be worth exploring why it is slow, and for devs to start working in some modern Pythonic ways to making this fast for typical use cases.

ADD REPLYlink written 2.6 years ago by Alex Reynolds31k

Well, I believe part of the reason is that it doesn't make assumptions. As far as I know a fastq record can be wrapped (I believe this is also what cschu181 means in the comment https://www.biostars.org/p/317524/#317602). Your code would break, Biopython wouldn't. Nevertheless, fastq files like that belong in hell.

ADD REPLYlink written 2.6 years ago by WouterDeCoster45k
1

I think mainly it’s because Biopython spends so much time testing for the compliance of the input and catching all the errors and as many of the edge cases as possible.

As ever:

enter image description here

ADD REPLYlink written 2.6 years ago by Joe18k
1

Interestingly, if you dig down inside SeqIO, there are more basal functions that could run faster. From the source code header:

Some of these parsers are wrappers around low-level parsers which build up 
  38  SeqRecord objects for the consistent SeqIO interface. In cases where the 
  39  run-time is critical, such as large FASTA or FASTQ files, calling these 
  40  underlying parsers will be much faster - in this case these generator 
  41  functions which return tuples of strings: 
  42   
  43  >>> from Bio.SeqIO.FastaIO import SimpleFastaParser 
  44  >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Joe18k

This implementation looks like it will run very slow when reading a very large FASTQ file (with 200 million reads). A dictionary and a user-defined Python function is called for every single read.

Here is a more efficient and fast implementation of a FASTQ reader in Python. https://github.com/lh3/readfq/blob/master/readfq.py

ADD REPLYlink written 14 months ago by enxxx23240

That entirely depends on what you need to do with the data after parsing the fastq file. For some goals it will be possible to just iterate over the file (like the example you linked) for other applications you might need a dictionary or a dataframe.

ADD REPLYlink written 14 months ago by WouterDeCoster45k

I love python, but if your files are exorbitantly large, it might be time to accept that BioPython is not the right hammer for this particular nail.

ADD REPLYlink written 14 months ago by Joe18k

In this case pysam might be the right hammer ... :)

ADD REPLYlink written 13 months ago by finswimmer14k
3
gravatar for Joe
2.6 years ago by
Joe18k
United Kingdom
Joe18k wrote:

A fastq file contains 4 lines per sequence:

@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1
TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA
+
11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221

The header (@.... on line 1) contains assorted information such as the machine identifier, run ID and so on (it will vary by origin).

On the next line comes the sequence itself. Nothing special about this.

Following that is a + on line 3. Again, nothing special about that. It just demarcates the sequence from the following line.

On the 4th line is the quality per-base. It uses a special encoding called "Phred" scores, which maps a quality value (0 - 40 usually), to an ASCII character. You can find out more about the encoding here for example. The score is therefore encoded by a single byte, and the ASCII code is equal to the quality score + 33.

You can read FASTQ files with BioPython, or any number of alternatives. What the right answer here is very much depends on what you want to do. BioPython would be a biblically slow way to process large FASTQ files for example.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Joe18k
3

Upvote for "biblically slow"

ADD REPLYlink written 2.6 years ago by cschu1812.5k

For completeness sake...:

line 3 may contain, following the + the identifier which was specified after the @ on line 1. If the + is followed by anything, it has to be exactly this identifier.

Ridiculous, but oh well what can we do about it :)

ADD REPLYlink written 2.6 years ago by WouterDeCoster45k

Also for completeness sake, there are apparently tools out there (concrete example escapes me right now), which produce "formatted" fastq files, i.e. header, N lines sequence with fixed number of characters, +, N lines quality scores ...

ADD REPLYlink written 2.6 years ago by cschu1812.5k

I did not know this! No idea what use that serves haha

ADD REPLYlink written 2.6 years ago by Joe18k
1

When datasizes were small(er) that made sure that the sequence/scores were easily identifiable as a pair visually.

ADD REPLYlink written 2.6 years ago by GenoMax94k

Also for completeness sake, the 4-line per record (sequence) in FASTQ is not mandatory, but it is a practical decision as it may be a bit complicated to parse correctly "multi-line" fastq files:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with.

ADD REPLYlink written 2.6 years ago by h.mon32k
1
gravatar for swbarnes2
2.6 years ago by
swbarnes29.4k
United States
swbarnes29.4k wrote:

FASTQ format is human readable. There is no trick to simply reading a fastq file.

ADD COMMENTlink written 2.6 years ago by swbarnes29.4k
1
gravatar for arudhir
2.6 years ago by
arudhir10
arudhir10 wrote:

If you really want to read FASTQ files using Python, BioPython's SeqIO module should be able to read the files.

But as the other poster said, it's going to be really slow and inappropriate for the task if you're trying to analyze a large number of FASTQ files.

ADD COMMENTlink written 2.6 years ago by arudhir10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2556 users visited in the last hour
_