Question: Processing Fastq Files In R/Bioconductor Without Reading Entire Files Into Memory?
6
Ryan Thompson (TSRI, La Jolla, CA) wrote 3.5 years ago:

I'm trying to do some analysis on some fastq files using R, and what I'm doing only requires me to read one sequence at a time and then discard it. I tried using the ShortRead package from Bioconductor, but readFastq reads the entire file at once. Or tries to, and runs out of RAM.

So is there an easy way to read a fastq file just a few sequences at a time, without writing my own half-broken parser?

modified 3.3 years ago by Michael Dondrup • written 3.5 years ago by Ryan Thompson

For what it's worth, I ended up using BioPython for this task.

written 3.3 years ago by Ryan Thompson
3
Ryan Thompson (TSRI, La Jolla, CA) wrote 3.1 years ago:

For what it's worth, the solution I ultimately settled on was to read the fastq in python using BioPython, and then call my R analysis code from python using rpy2.
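For readers who want the shape of that streaming step: Biopython's SeqIO.parse(handle, "fastq") yields one record at a time, and rpy2 can then hand each record to R. The sketch below uses a stdlib-only stand-in for the streaming part so it runs anywhere; the fastq_records helper is illustrative, not Biopython code, and assumes the plain 4-lines-per-record layout:

```python
import io
from itertools import islice

def fastq_records(handle):
    """Yield (name, seq, qual) one record at a time, reading only
    4 lines per iteration. Assumes no wrapped lines and a
    well-formed file (illustrative sketch, not Biopython code)."""
    while True:
        block = list(islice(handle, 4))
        if not block:
            return
        header, seq, plus, qual = (line.rstrip("\n") for line in block)
        yield header[1:], seq, qual

# Usage with a small in-memory file standing in for reads.fastq:
fq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nJJJJ\n")
names = [name for name, seq, qual in fastq_records(fq)]
print(names)  # ['r1', 'r2']
```

Because the generator holds only one record in memory at a time, the file size no longer matters; this is the same property SeqIO.parse provides.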

written 3.1 years ago by Ryan Thompson
2
Neilfws (Sydney, Australia) wrote 3.5 years ago:

EDIT: I just noticed in your question that you have one fastq file with many sequences. Obviously, this answer requires that you first split that file into separate files - which may not be what you want.

One solution might be to read the file names into a vector, then use each name as the argument to readFastq(pattern=...), so as to read one file at a time. For example, assuming you opened an R console in the same directory as the fastq files:

library(ShortRead)

files <- list.files(pattern = "\\.fastq$")
for (file in files) {
  fq <- readFastq(".", pattern = file)
  # do stuff with fq...
  # then remove it to free the memory
  rm(fq)
}

If you're not in the directory of files, you'll need to supply the path to list.files() and to readFastq(). I have not used readFastq(), so check that you can supply "." to mean "current directory".

written 3.5 years ago by Neilfws

Is there a way to create in-memory files in R? http://stackoverflow.com/questions/4125916/is-there-a-way-to-read-and-write-in-memory-files-in-r

written 3.5 years ago by Ryan Thompson
2
Oscar wrote 3.4 years ago:

A bit of an old question, but I just thought you might find FastqSampler in the ShortRead package interesting, since it seems to be aimed at exactly what you asked.
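FastqSampler draws a random subset of the reads in a single pass over the file, which is essentially reservoir sampling over records. A hypothetical Python sketch of the underlying idea (illustration only, not ShortRead code; the sample_fastq helper is invented for this sketch):

```python
import io
import random
from itertools import islice

def sample_fastq(handle, n, seed=None):
    """Reservoir-sample n fastq records in one pass, keeping only
    n records in memory at any time. Assumes strict 4-line records.
    Hypothetical helper, not ShortRead's implementation."""
    rng = random.Random(seed)
    reservoir = []
    # iter() with a sentinel stops when the joined 4-line chunk is empty
    for i, rec in enumerate(iter(lambda: "".join(islice(handle, 4)), "")):
        if i < n:
            reservoir.append(rec)
        else:
            # replace an existing entry with decreasing probability n/(i+1)
            j = rng.randrange(i + 1)
            if j < n:
                reservoir[j] = rec
    return reservoir

# 100 fake records, sample 5 of them:
fq = io.StringIO("".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(100)))
sample = sample_fastq(fq, 5, seed=1)
print(len(sample))  # 5
```

This also makes the limitation below concrete: the sample is random by construction, so you cannot ask for specific records.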

written 3.4 years ago by Oscar

This function shows it's possible in R, but it generates a random sample; you cannot control which records it reads.

written 3.4 years ago by Michael Dondrup
1
Keith James (UK) wrote 3.5 years ago:

ShortRead doesn't include a stream interface (to my knowledge - I haven't checked the devel version).

It's a bit of a hack, but you could convert your fastq to BAM format, e.g. using Picard's FastqToSam, and then use Rsamtools, which can create views on large BAM files. It still doesn't provide a stream interface as far as I can see, but it may mitigate your problem. It may also make things faster, because (at least where I work) fastq operations are mostly I/O-bound anyway.

Failing that, use a fastq chunker to break your file into blocks that fit in RAM, and then operate on those sequentially as neilfws suggested, or in parallel if you can avoid saturating your I/O.

modified 3.5 years ago • written 3.5 years ago by Keith James
1
Michael Dondrup (Bergen) wrote 3.4 years ago:

There is no really good way to do this directly in R.

But there are two better options:

1.

The good old unix tool split will help. (You could also use readLines() and writeLines() in R to read a number of lines and write them to a temporary file, but this is not very efficient.)

Using the fact that each fastq record has exactly 4 lines, with no empty lines allowed (if you have empty lines, remove them first):

split -l4000000 input.fastq split.fastq.

splits the input file into fastq files with at most 1000000 records each and produces output files like split.fastq.aa, ..., split.fastq.zz.

Then read in the files sequentially using readFastq().
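The same chunking can be sketched in a few lines of Python without writing intermediate files. The split_fastq helper below is hypothetical (invented for this sketch) and, like split -l, relies on the strict 4-lines-per-record layout:

```python
import io
from itertools import islice

def split_fastq(handle, records_per_chunk):
    """Split a fastq stream into chunks of at most records_per_chunk
    records (4 lines each), mirroring `split -l`; yields each chunk
    as a string instead of writing split.fastq.aa-style files."""
    lines_per_chunk = 4 * records_per_chunk
    while True:
        chunk = list(islice(handle, lines_per_chunk))
        if not chunk:
            return
        yield "".join(chunk)

# Three records, chunked two at a time -> chunks of 2 and 1 records:
fq = io.StringIO("@r1\nAC\n+\nII\n@r2\nGT\n+\nJJ\n@r3\nTT\n+\nKK\n")
chunks = list(split_fastq(fq, 2))
print(len(chunks))  # 2
```

Each yielded chunk is itself valid fastq, so it could be fed to any downstream parser one block at a time.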

2.

Using an R function, check out:

read.DNAStringSet(filepath, format="fastq",
                  nrec=-1L, skip=0L, use.names=TRUE)

with the parameters nrec and skip used to read one chunk at a time. However, this function ignores the quality values.

So, if you need the qualities, use option 1; if you only need the sequences, option 2 might be OK.

Edit: There is a 'format' definition, for those doubting option 1 is valid:

<fastq> :=  <block>+
<block> :=  @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n
<seqname>   :=  [A-Za-z0-9_.:-]+
<seq>   :=  [A-Za-z\n\.~]+
<qual>  :=  [!-~\n]+

http://maq.sourceforge.net/fastq.shtml

While wrapping sequence lines should in fact be punished hard, according to this definition it is unfortunately possible, though I have never seen such a file. So the split approach is situational and depends on the file not containing wrapped sequences. Another argument for FASTQ not being a real 'format' is that it is not easy to parse then, because the quality string could itself contain a '@' or '+'. So even if it's allowed, it must be bad practice to have wrapped sequences in FASTQ (unlike in FASTA).

Edit: I now assume again that the split method is safe, because fastq files have 4 lines per entry, and files containing wrapped sequence or quality don't exist, and nobody has ever seen one - so use method 1, it's safe. This is the correct answer. Period. lol

modified 2.7 years ago • written 3.4 years ago by Michael Dondrup
2

Fastq is quite straightforward, really. Just count the sequence characters for each record, and then you know how many quality characters to expect. A problem does arise if your sequence alphabet contains the '+' character, because then you can't distinguish the quality header. I agree that Fastq is not a viable format in the long run, but rather because it doesn't compress well. We use BAM files instead, even for unmapped reads.
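A sketch of that counting approach in Python (hypothetical parser, for illustration only): gather sequence lines until the '+' separator, then read quality characters until their count matches the sequence length, so wrapped lines are handled. As noted, a sequence line that itself begins with '+' would still break it:

```python
import io

def parse_wrapped_fastq(handle):
    """Parse fastq whose sequence/quality may wrap over several lines.
    Counts the sequence characters, then reads exactly that many
    quality characters. Assumes well-formed input; a sequence line
    starting with '+' would be misread as the separator."""
    line = handle.readline()
    while line:
        name = line[1:].rstrip("\n")  # drop the leading '@'
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        seq = "".join(seq_parts)
        # read quality lines until we have one char per sequence char
        qual = ""
        while len(qual) < len(seq):
            qual += handle.readline().strip()
        yield name, seq, qual
        line = handle.readline()

# A record with wrapped sequence and quality, plus an unwrapped one:
fq = io.StringIO("@r1\nACG\nTA\n+\nII\nIII\n@r2\nGG\n+\nJJ\n")
recs = list(parse_wrapped_fastq(fq))
print(recs[0])  # ('r1', 'ACGTA', 'IIIII')
```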

written 3.4 years ago by Keith James
1

Fastq records may have wrapped lines, much like Fasta, so the 4 lines per record assumption is not safe.

written 3.4 years ago by Keith James

Really? I've never seen such a file... Will look it up.

written 3.4 years ago by Michael Dondrup

Keith, I disagree; then that file would be 'non-standard' (there is no standard, unfortunately, so FASTQ is not a viable 'format' in the long run anyway):

written 3.4 years ago by Michael Dondrup

A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

written 3.4 years ago by Michael Dondrup

Given, of course, that FASTQ is not even a format, anything is possible, but I have never seen such a file, and if I got one I would just punish the provider hard.

written 3.4 years ago by Michael Dondrup


You are right, it just shouldn't be like this (and luckily most of the time it isn't)

written 3.4 years ago by Michael Dondrup

I know. I guess we could make another parser for R; I can easily think of 10 lines in Perl, but Ryan wanted something that works in R without building his own parser. So the first option was not that good anyway.

written 3.4 years ago by Michael Dondrup