Question: NGS - Huge (FASTQ) File Parsing - Which Language For Good Efficiency?
toni (Lyon) wrote, 7.7 years ago:

Dear Biostar members,

I am involved in analyzing data from NGS technologies. As you may know, raw data files are in FASTQ format, and these files are huge (several gigabytes each).

As a first step, we naturally want to do some quality control on the sequencing. My team and I are aware of tools like Picard or bamtools that give some information and statistics, but they do not let us extract all the information we want. So, at some point, I am afraid we will have to parse FASTQ files ourselves.

Which programming language would you choose to do the job (i.e. read and process a 10 GB file)? Obviously, the main goal is to make the processing as fast as possible.

  1. awk/sed
  2. C
  3. Perl
  4. Java
  5. Python

Do you think that we will be I/O bound anyway, so that the choice of language makes no big difference?

Would you read the file line by line, or load chunks of, say, hundreds of lines at a time?

Many thanks. T.

modified 4.0 years ago by daniel.nicorici • written 7.7 years ago by toni
Haskell!

Here's converting a 10G FASTQ file to FASTA:

ghc -e 'Bio.Sequence.readIllumina "100903_s_8_1_seq_GLW-4.txt" >>= Bio.Sequence.writeFasta "/dev/null"'

On my system, a 10G file took 2m20s using 67% CPU, so parsing isn't terribly CPU intensive, and using a lower-level language will make it even less of a problem.

written 7.7 years ago by Ketil
Hi. You forgot to mention something important: your CPU. Without this info, you cannot tell whether 2 minutes is slow or fast! Also, since your program used 67% CPU, I conclude it is multi-threaded. But we don't know how many cores your CPU has.

modified 4.0 years ago • written 4.3 years ago by BioApps
If the program takes 67% of one CPU core, then it's more reasonable to assume that it's I/O bound, and if any assumption about the number of threads is to be made from this, we would say it's a single thread that is waiting 33% of the time.

written 4.3 years ago by Matt Shirley

Have you checked out FastQC to make sure you aren't reinventing the wheel? http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

written 7.7 years ago by Aaron Statham
Make sure that you do everything with compression (gzip, etc.) when possible. This reduces both network bandwidth and disk I/O. If you are I/O bound, doing so can actually make the task faster in some cases.
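For instance, a minimal Python sketch of streaming a gzipped FASTQ directly (the filename is a made-up example):

import gzip

# Stream the compressed FASTQ; nothing is ever decompressed to disk.
# "reads.fastq.gz" is a hypothetical example filename.
with gzip.open("reads.fastq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:           # line 2 of each 4-line record is the sequence
            seq = line.rstrip()  # ...process the sequence here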

written 7.7 years ago by Sean Davis

I think it's time for another programming challenge

written 7.7 years ago by Jeremy Leipzig
Haskell: Here's converting a 10G FASTQ file to FASTA:

% time ghc -e 'Bio.Sequence.readIllumina "100903_s_8_1_seq_GLW-4.txt" >>= Bio.Sequence.writeFasta "/dev/null"'

ghc -e 84.68s user 9.93s system 67% cpu 2:20.56 total

Note that disk I/O is the bottleneck here, not CPU.

written 7.7 years ago by Ketil

Sorry about the idiot formatting, btw - I don't know how to get paragraphs in comments.

written 7.7 years ago by Ketil
Lars Juhl Jensen (Copenhagen, Denmark) wrote, 7.7 years ago:

The short answer is: it depends a lot on what exactly it is you need to do with the contents of the files.

Regarding being I/O bound, I think it is quite unlikely unless you have the data on very slow disks. Even a single 5400 rpm disk can read on the order of 50 MB/s, which means that reading a 10 GB file will not take much over 3 min. Since you are working with large NGS datasets, I would assume that you have a RAID system, in which case the combined speed of multiple disks will almost certainly exceed how fast you can process the data.

When it comes to speed, nothing beats a good C or C++ implementation. However, there is a huge price to pay in terms of development time. If what you do is pretty basic string operations, both Perl and Python should do fine, and the savings in development time are most likely well worth the slightly longer run time compared to a C or C++ implementation.

In my experience, Perl has a more efficient implementation of regular expressions. Conversely, if you need to load all the data into memory, the Python data structures appear to be more memory efficient.

Awk is the only one of the languages you list that I would advise against. It is very convenient for writing short command-line hacks, but speed is not one of its virtues. I don't have any experience with using Java for bioinformatics tasks.

written 7.7 years ago by Lars Juhl Jensen
+1 for pointing out that I/O is frequently not the bottleneck on local disks. Also a comment on regex: for short regexes without "|", Python 2 is >10X slower than Perl. On other things, Python is faster, but never by anything like 10X. If you rely on regexes a lot, like me, Perl is preferred (I do not know how Python 3 performs).
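If regex speed matters for your workload, it is easy to measure on your own patterns with Python's timeit module, in the same style as the timeit example further down in this thread (the pattern and sequence here are made-up examples; swap in your own):

$ python -m timeit -s "import re; p = re.compile('[^ACGTN]')" -s "seq = 'ACGT' * 25" "p.search(seq)"

Running the same measurement under different interpreter versions on a realistic line gives a quick comparison.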

written 7.7 years ago by lh3
Thanks a lot Lars. And you were definitely right about us not being I/O bound - that was not the case at all. I finally tried Perl: when I read the file (7 GB) line by line without doing any ops on the strings, it took roughly 2 minutes. Manipulating both the sequence and quality strings in a loop while parsing brought the runtime up to 1 h 25 min (I process each string character by character). I guess the culprit is Perl: as an interpreted language it is too slow. I would save a lot of execution time if I wrote this parser in a compiled language. Don't you think?

written 7.6 years ago by toni

If what you do is to loop over the strings character by character, I have no doubt that you can win a lot by using a compiled language. Would you mind elaborating a bit on what exactly it is you need to do to the sequences and quality strings?

written 7.6 years ago by Lars Juhl Jensen

If you're using python and regular expressions are a bottleneck, you can try re2: http://pypi.python.org/pypi/re2/
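Since re2 is intended as a largely API-compatible replacement for the standard re module, a drop-in alias is a common pattern - a hedged sketch (assuming your patterns avoid features RE2 lacks, such as backreferences):

# Use the RE2 engine when available, otherwise fall back to the stdlib.
try:
    import re2 as re
except ImportError:
    import re

pattern = re.compile(r"^@\S+")   # e.g. match a FASTQ header line
print(bool(pattern.match("@read1/1")))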

written 7.2 years ago by brentp
Alex (Theodosius Dobzhansky Center for Genome Bioinformatics) wrote, 7.7 years ago:

A cheap and fast solution is Amazon EC2.

For $2 per hour per instance you can rent one or more High-Memory Quadruple Extra Large instances:

68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

I usually use this NGS analysis workflow (a minimal sketch of the worker pattern follows the list):

  1. Upload data as tar.gz to the EC2 instance (usually this is a bottleneck), or upload data from NCBI.
  2. The main script (Python, but it can be any other language) loads data into memory and creates independent parallel tasks.
  3. The main script launches workers (usually in C if the task is compute-intensive, otherwise in Python; or it can be some tool for NGS analysis, e.g. bowtie). With one instance you get ~25 parallel workers. With two instances, ~50, and so on.
  4. Workers return computed data to the main script or save it to files.
  5. Then the main script or another Python script joins/reduces the data computed by the workers and packs the result into a tar.gz.
  6. Download data from Amazon (the next bottleneck =)
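A minimal Python sketch of steps 2-5 using the standard multiprocessing module (the GC-counting task, the chunking, and all names here are made-up illustrations; the real workers may be separate C programs or tools like bowtie):

from multiprocessing import Pool

def worker(chunk):
    # Hypothetical compute task: count GC bases in a chunk of sequences.
    return sum(seq.count("G") + seq.count("C") for seq in chunk)

def main(sequences, n_workers=25, chunk_size=10000):
    # Step 2: split the data into independent parallel tasks.
    chunks = [sequences[i:i + chunk_size]
              for i in range(0, len(sequences), chunk_size)]
    # Step 3: launch workers; step 4: collect their results.
    pool = Pool(processes=n_workers)
    partials = pool.map(worker, chunks)  # blocks until all workers finish
    pool.close()
    pool.join()
    # Step 5: join/reduce the per-worker results.
    return sum(partials)

if __name__ == "__main__":
    print(main(["ACGT", "GGCC", "ATAT"]))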

As an example, an enhanced suffix array computation for a >10 Gbase FASTQ takes ~3 hours and $7, of which 1-2 hours are data upload/download. And it can be improved with more sophisticated workers.

I use Python for the main script because I like/know it, although languages like Google's Go or Erlang would be better here. Or you can use a classical map/reduce engine such as Hadoop (http://en.wikipedia.org/wiki/Hadoop).

I've posted links to EC2 getting-started guides here: http://biostar.stackexchange.com/questions/3651/using-amazon-web-services/3653#3653

modified 7.7 years ago • written 7.7 years ago by Alex

@Alex, after the up and download times, does this save you time overall when compared to running it on your local machine--no matter how modest it may be?

written 7.7 years ago by brentp

@brentp, the same task on a Phenom II 965 x4 with 16 GB memory takes >18 h (the dataset was split into small parts to fit into memory). EC2 solves the memory limitation without buying expensive hardware.

written 7.7 years ago by Alex
Science_Robot (Gainesville, FL) wrote, 7.7 years ago:

You don't need to store the entire FASTQ file in memory. So you're limited mostly by disk I/O. In that case, it doesn't really matter which programming language you use.

All that matters is which language you're most comfortable with.

written 7.7 years ago by Science_Robot

While disks have very high latency, their throughput is perfectly reasonable.

written 7.6 years ago by User 5845
Aleksandr Levchuk (United States) wrote, 7.7 years ago:

Long Answer

Assuming that you are parsing to gather some statistics/information that will end up being not very big (i.e. the results will be close to constant size, independent of the input FASTQ file size), I would use Python, Perl, or Ruby. They are powerful at string processing, and the performance will be fine as long as you don't try to pull the entire file into memory. Make a simple while loop that processes the file line by line (see the sketch after the list below). I don't think there is a need to grab chunks of lines, because:

  1. Your parsing operations will probably not add significant overhead.
  2. The interpreter and/or the OS will buffer the input automatically, so processing overlaps with the I/O waits.
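A minimal sketch of such a loop in Python (the per-record work is a placeholder); it reads from a filename argument if given, or from stdin otherwise, so it works identically with both invocation styles discussed next:

import sys

# Process a FASTQ file line by line; input buffering is handled by the
# interpreter and the OS, so no manual chunking is needed.
fh = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
n_records = 0
for i, line in enumerate(fh):
    if i % 4 == 1:       # the sequence line of each 4-line FASTQ record
        n_records += 1   # ...gather whatever statistics you need here
print(n_records, "records")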

I usually prefer to use pipes (e.g. cat file.fastq | ./myprog), but with the more common way (reading the file name from the arguments and opening the file inside your program: ./myprog f1.fastq) the underlying mechanisms should be exactly the same, so I would not expect any performance difference.

Sed and Awk are great, but I have found that once the parsing gets a little sophisticated, I always need to switch to a general-purpose programming language.

Short Answer

No matter what language you choose, it's all pretty optimized automatically for you.

written 7.7 years ago by Aleksandr Levchuk

Bash would not work - it often does things like fork other processes

written 7.0 years ago by Aleksandr Levchuk
Istvan Albert (University Park, USA) wrote, 7.7 years ago:

I always prefer high-level languages, as my own time is more valuable than machine time: it would need to save a week of compute time for me to invest a week of development.

Moreover, depending on the operations that you wish to perform, you may not even be able to substantially improve on the execution speed of Perl/Python etc. Basic I/O, splitting, and tokenizing of the inputs may even be faster when performed in these languages. For more complex operations, time them and extrapolate, for example:

$ python -m timeit "'a a a a a a'.split()[0]"
1000000 loops, best of 3: 0.528 usec per loop

this clocks in at 0.5 microseconds(!) per call - at that rate, even 100 million lines would take under a minute to split - so you'd be hard pressed to write a better line splitter that returns a list from which you extract the first element

written 7.7 years ago by Istvan Albert
Samuel Lampa (Stockholm) wrote, 4.8 years ago:

I recently made a naive benchmark of a few popular languages for doing really simple bioinformatics processing of FASTA files (I chose to count the GC frequency, as a problem simple enough that I could figure out how to implement it in a number of different languages; a simplified Python version is sketched below).
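For reference, a minimal Python sketch of such a GC counter (not the exact benchmark code; the filename is a made-up example):

# Count the GC frequency over a FASTA file, skipping header lines.
gc = at = 0
with open("seqs.fasta") as fh:       # hypothetical example filename
    for line in fh:
        if line.startswith(">"):     # FASTA header line
            continue
        for c in line.strip():
            if c in "GCgc":
                gc += 1
            elif c in "ATat":
                at += 1
print("GC fraction: %.3f" % (gc / float(gc + at)))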

After publishing the initial benchmarks (between Python and the D language), I got a lot of feedback and suggestions for improvements from experts in the different languages, so the comparison grew quite a bit, and I finally got together a comparison of quite a few languages, sorted approximately from slowest to fastest (when parsing line by line).

The final results are available in the blog post.

My personal take-home lesson from this comparison is that by using a modern compiled and garbage-collected language like Go or D, you can get rather close to C/C++-like performance, without having to fear introducing catastrophic memory leaks (thanks to the garbage collector) and, foremost, while retaining much easier-to-read code. As an example, have a look at the code examples in the blog post.

Personally, I have to say I was highly impressed by the D version, due to its high readability even in the face of the serious optimizations. On the other hand, although the Go version is a bit harder to read, Go is an increasingly popular language with a decently sized community and has very easy-to-use features for writing threaded programs (although I could not get any speedup using threading for this simple example).

In terms of ready-to-use libraries, there is an up-to-date bioinformatics package in Go, called BioGo, that seems promising and high-quality. (For D, there is some very rudimentary support for working with bioinformatics-related data in the DScience library, although it does not seem to be actively updated.)

In the end, I would put my bet on Go for the time being, due to its much bigger community and the better library support (BioGo), although I like D a lot too.

If you really don't want to go down the route of learning a compiled language, trying to run your Python code with PyPy might be an option too, of course (although I have heard about some incompatibilities with the NumPy/SciPy packages etc., which sounds worrying to me).

modified 4.8 years ago • written 4.8 years ago by Samuel Lampa
Wow! FreePascal is only in 4th position? I expected it to place higher.

I looked over your Pascal code a bit: it uses ReadLn instead of the modern TFileStream. And then you are using:

(c = 'A') or (c = 'G') or (c = 'C') or (c = 'T')

which are 4 operations instead of one, like in:

c in ['A', 'C', 'G', 'T'] or case c of...

I recently wrote a program (Efficient SFF/FastQ processing (viewer, editor, cleaner, converter, MID demultiplexing) v1.7) in a Pascal-related language that reads FASTQ files (and SFF). After some optimizations like the ones explained above, I decreased the parsing time from 1 minute (see version 1.5) to 10 seconds (see v1.7).

If you are interested I could optimize your FPC code. I have FPC installed. I only need your FastQ test file.

written 4.3 years ago by BioApps

Any news about this optimization? It should bring a TREMENDOUS increase in speed.

written 4.0 years ago by BioApps

Many thanks for the addition, and sorry about the silence; I have had very busy times! I'm planning to do a new revision of the benchmarks, as I have gotten a lot of feedback on a bunch of the other languages as well, and even on new ones such as Julia and Elixir, so stay tuned!

The main problem is that I have to rerun all the benchmarks, since I don't have access to the same computer I used for the first ones. I should also create better organization for the benchmarks, so they can more easily be cloned and rerun at any time.

modified 3.9 years ago • written 3.9 years ago by Samuel Lampa

Perfect! I think we will see major improvements after changing that 'IF' to 'CASE'. At ASM level, CASE executes in only one micro instruction.

PS: Don't forget to turn on the 'compiler optimization' {$OPTIMIZATION ON} and to turn off the 'Assertions', 'Range checking' {$R-} and 'Buffer overflow checking'.

-------------

Can you run a Delphi test also? I could send the code and the exe, but you will need Windows, or Linux with Wine.

written 3.9 years ago by BioApps

Free Pascal and Lazarus have even more optimization settings, see http://www.freepascal.org/docs-html/3.0.0/prog/progse49.html and http://wiki.freepascal.org/Whole_Program_Optimization for details.

written 2.8 years ago by Johannes Dietrich
I should add that I'm not particularly fond of Java as a bioinformatics language, due to the 100+ ms startup latency of the JVM, the inflexible memory handling, and the sometimes complicated installation procedures. Also, the large amount of boilerplate code needed makes it more or less dependent on a full-blown IDE, which can be really impractical in cluster environments that only support command-line text editors, or where big resource hogs like Eclipse are impractical to run.

written 4.8 years ago by Samuel Lampa

No Perl? The workhorse of most bioinformaticians around the world. Please see comments on Perl being >10X faster than Python in regex parsing.

written 4.8 years ago by Anjan

+1 Nice summary. I wanted to mention that there is a BioD project, which seems to be more active than the other D library you mention. I cannot comment on the quality or completeness of either because I have not used them.

written 4.8 years ago by SES

That Nimrod language is pretty interesting - I will benchmark its hash tables and arrays - if these can handle tens of millions of elements with half-decent performance (like, say, Python) I think I might start using it.

http://nimrod-lang.org/documentation.html

written 3.9 years ago by Istvan Albert

Nimrod would be a lot more interesting if it had a large community.

written 3.9 years ago by brentp

One weird thing I noted: it does not seem to have output formatting for numbers in the language (a printf equivalent) - one needs to install an unofficial package for that - and that made me drop my investigation. IMHO that should be part of the core of any language one wants to do science with. It made me wonder what else is missing that I would only find out about later.

written 3.9 years ago by Istvan Albert

So many interesting languages. D, Rust and Go are all good. I quite like C#, too, but Mono is not as good as the MS runtime. Apple's new Swift also looks interesting. Life is too short to learn exciting programming languages...

written 3.9 years ago by lh3

I think Julia will be the one I look into next. You could say that life is too short not to learn interesting programming languages.

written 3.9 years ago by brentp
Ian (University of Manchester, UK) wrote, 7.7 years ago:

This may not be quite the answer you are looking for, but there is a useful QC tool that works with SOLiD and Illumina FASTQ files (not sure about others). It is called FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

It is a Java tool (which can be run from the command line) that provides a variety of QC reports. The most useful, I think, is a box plot of quality values per base. Other reports include detection of contaminant DNA, etc.

written 7.7 years ago by Ian

@Ian, yep, thank you. We already tried FastQC; it's exactly the kind of report we would like to produce, but we plan to include many more things. Thanks for the advice!

written 7.7 years ago by toni
Phis (CH) wrote, 7.7 years ago:

For speed/efficiency, I'd definitely go with C or C++ (or Delphi if you're working on Windows). Also, the preferred way to deal with large files is usually to use memory-mapped files, which are supported by all major OSes (as far as I know). You'd usually have only part of the file (a 'view') in memory at any one time.

If it is Windows, the main API calls you want to look at are CreateFileMapping, MapViewOfFile and UnmapViewOfFile.
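For a cross-platform flavor of the same idea, Python's mmap module wraps these OS facilities; a minimal sketch (the filename is a made-up example):

import mmap

# Map the file; the OS pages data in on demand, so only the bytes actually
# touched are read from disk. "huge.fastq" is a hypothetical filename.
with open("huge.fastq", "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    header_end = mm.find(b"\n")       # locate the end of the first line...
    print(mm[:header_end].decode())   # ...and read just that slice
    mm.close()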

written 7.7 years ago by Phis