Question: Need help reading FASTQ file
0
gravatar for sarmademad
3.2 years ago by
sarmademad0
sarmademad0 wrote:

I need help to read fastq file on my server virtual machine im using the following python code

def readFastq(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()
            seq = fh.readline().rstrip()
            fh.readline()
            qual =   fh.readline().rstrip()
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities
seqs, quals = readFastq('sample-seqs.fastq')
print (seqs)

giving me the following

syntax error: invalid syntax      (pointing to the line before the last line)

thank you

rna-seq • 1.3k views
ADD COMMENTlink modified 3.2 years ago by Joe18k • written 3.2 years ago by sarmademad0

If it's pointing to the line before the last line, is your input filename/path correct?

You code runs for me, though the output only returns [+], [+], [+] - so some of your .readline() logic is off too.

ADD REPLYlink written 3.2 years ago by Joe18k
3
gravatar for skbrimer
3.2 years ago by
skbrimer640
United States
skbrimer640 wrote:

If it is not a homework question you could use biopython

from Bio import SeqIO

for seqrecord in SeqIO.parse("file.fastq", "fastq"):
    print(seqrecord.seq)
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by skbrimer640
2
gravatar for Joe
3.2 years ago by
Joe18k
United Kingdom
Joe18k wrote:

I fully agree with the others, that unless there is a very good reason to write your own fastq parser, you should use Biopython etc.

However, for completeness, here's some working code. I've changed it quite a lot since there really isn't a good reason to use readline() since files are inherently iterable.

# Usage: python readfastq.py myseqs.fastq

import sys
from itertools import izip

def readFastq(filename):
        headers = []
        sequences = []
        qualities = []
        with open(filename, 'r') as ifh:
                for header, sequence, sep, quality in izip(ifh, ifh, ifh, ifh):
                        headers.append(header.rstrip())
                        sequences.append(sequence.rstrip())
                        qualities.append(quality.rstrip())

        return headers, sequences, qualities


h, s, q = readFastq(sys.argv[1])
# Do whatever with h, s and q....

There may be more efficient ways to go about it, but use of izip at least means this won't eat up all your memory if you ran it on large files.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Joe18k
1

This is elegant for parsing clean fastq file. And in python3, zip can be used to replace izip.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by shoujun.gu370

Yeah I should probably add the caveat that I'm assuming a perfect input, but I would hope if someone was going to write their own parser, they'd have some idea of whether that was going to be the case or not.

ADD REPLYlink written 3.2 years ago by Joe18k

This parser assumes the user needs the entire file (h, s, q) in memory, while it's (I guess) more likely iterating over the fastq is more desirable.

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k
1

Yeah I wrote it like that because that's how the OP had the code (returning a list of everything in the file). OP should consider using yeild instead of return if performance becomes and issue and just make it a generator function.

ADD REPLYlink written 3.2 years ago by Joe18k
1
gravatar for Renesh
3.2 years ago by
Renesh1.9k
United States
Renesh1.9k wrote:

Your code has a lot of issues associated with data structures. Please, see below code,

def readFastq(filename):
     fastq_open = open(filename, "rU")     
     for line in fastq_open:
          header1 = line.rstrip()
          sequences = next(fastq_open).rstrip()
          header2 = next(fastq_open).rstrip()
          qualities = next(fastq_open).rstrip()
          if len(seq) == 0:
            break
          return sequences, qualities
     fastq_open.close()

seqs, quals = readFastq('sample-seqs.fastq')
print (seqs)
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Renesh1.9k

You're better off keeping the OP's with open() construct, as it handles file opening and closing. Your current code leaves a dangling file open.

ADD REPLYlink written 3.2 years ago by Joe18k

the close file line added

ADD REPLYlink written 3.2 years ago by Renesh1.9k

with is the preferred method for working with files, rather than "manually" opening and closing.

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k

if you ever tried any test files, you should know this code is not working.

ADD REPLYlink written 3.2 years ago by shoujun.gu370
1
gravatar for shoujun.gu
3.2 years ago by
shoujun.gu370
Rockville/MD
shoujun.gu370 wrote:

Do not try to write a code to parse the fastq file by yourself (unless you just want to practice), its a tricky one. Just use biopython like skbrimer mentioned above.

ADD COMMENTlink written 3.2 years ago by shoujun.gu370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1333 users visited in the last hour