Question

Tool: fastp v0.9 released: an all-in-one FASTQ preprocessor (QC, adapters, trimming, quality filtering / cutting, splitting output ... )

6

6.5 years ago

chen ★ 2.5k

fastp v0.9 is released (Nov 29, 2017), with a new feature added:

preprocess unique molecular identifier (UMI) enabled data, shifting the UMI to the sequence name. See Tutorial: Use fastp to preprocess FASTQ data with unique molecular identifier (UMI) integrated
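
As a rough illustration of what "shifting the UMI to the sequence name" means, here is a minimal Python sketch. It assumes the UMI is the first `umi_len` bases of the read. The function name `shift_umi_to_name` and the `:` separator are hypothetical, chosen for this example; the post does not specify the naming format fastp actually writes.

```python
def shift_umi_to_name(record, umi_len, sep=":"):
    """Move the first `umi_len` bases of a read into its name.

    `record` is a (name, sequence, quality) tuple. `sep` is an assumed
    separator for this sketch, not necessarily what fastp writes.
    """
    name, seq, qual = record
    umi = seq[:umi_len]
    # Insert the UMI after the first whitespace-delimited token so that
    # any trailing comment (e.g. "1:N:0:ATCACG") is preserved.
    head, space, comment = name.partition(" ")
    new_name = f"{head}{sep}{umi}{space}{comment}"
    # The UMI bases and their quality values are removed from the read.
    return new_name, seq[umi_len:], qual[umi_len:]


read = ("@SEQ_ID 1:N:0:ATCACG", "ACGTTTGGCCAA", "IIIIFFFFHHHH")
shifted = shift_umi_to_name(read, umi_len=4)
print(shifted)
```

Keeping the UMI in the read name lets downstream tools (for example, deduplication after alignment) recover it without the UMI bases interfering with mapping.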

This project is at: https://github.com/OpenGene/fastp

A list of features implemented:

filter out bad reads (too low quality, too short, or too many N...)
trim all reads at the front and tail
cut low quality bases from each read at its 5' and 3' ends by evaluating the mean quality in a sliding window (like Trimmomatic but faster).
cut adapters (adapters are detected automatically for both PE/SE data).
correct mismatched base pairs in overlapped regions of paired end reads, if one base has high quality while the other has ultra-low quality
preprocess unique molecular identifier (UMI) enabled data, shifting the UMI to the sequence name.
report JSON format result for further interpreting.
visualize quality control and filtering results on a single HTML page (like FastQC but faster and more informative).
split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing.
support long reads (data from PacBio / Nanopore devices).
...
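
The sliding-window quality cutting in the list above can be sketched as follows. This is one plausible reading of the idea, not fastp's actual implementation: a window is placed at each end of the read, and bases are dropped one at a time while the window's mean Phred quality stays below a threshold. The function names, the window size of 4, the threshold of 20, and the Phred+33 encoding are all assumptions made for this example.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integers."""
    return [ord(c) - offset for c in qual]


def cut_by_window(seq, qual, window=4, min_mean=20):
    """Trim low-quality bases from both ends using a sliding window."""
    scores = phred_scores(qual)
    start, end = 0, len(scores)

    # 5' end: drop the first base while the leading window is poor.
    while end - start >= window and \
            sum(scores[start:start + window]) / window < min_mean:
        start += 1

    # 3' end: drop the last base while the trailing window is poor.
    while end - start >= window and \
            sum(scores[end - window:end]) / window < min_mean:
        end -= 1

    return seq[start:end], qual[start:end]


trimmed = cut_by_window("ACGTACGTACG", "###IIIII###")
print(trimmed)
```

Because the decision is based on a window mean rather than single bases, a few low-quality bases can survive next to high-quality ones; that is a property of the window approach rather than a bug in the sketch.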

A list of features that will be implemented soon:

Over representation analysis
Sequencing analysis by lanes/tiles
Pair merge
...

The initial evaluation has shown that fastp is about 10X faster than AfterQC, and also much faster than FastQC, Trimmomatic, and Cutadapt, while providing most features from all of them, plus some novel and useful functions. I have deployed fastp on my clusters and my cloud system.

I am still calling for new requirements to make it more powerful and useful. If you have good ideas, please reply to this post or file an issue on the GitHub page.

Initial message calling for requirements:

Hi, I am a co-founder of a company owning 10 Illumina NovaSeq sequencers, so my cluster system has to process a very large amount of data per day.

Recently I found that FASTQ data preprocessing has become a bottleneck. Current tools (including tools I developed before) provide only part of the wanted functions (e.g. QC/filtering/adapter-cutting), and are usually too slow (written in scripting languages, without good threading).

So I’m planning to write a new all-in-one FASTQ preprocessing tool, which will implement most required functions, and must be written in C/C++ with good multithreading to provide competitive execution performance.

Now I am posting this thread to call for requirements for this tool. If you have any good suggestions, please comment here, or file an issue on the GitHub project ( https://github.com/OpenGene/fastp/issues/new ).

Your contributions are greatly appreciated.

fastq fastp open-source • 4.5k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 6.5 years ago by chen ★ 2.5k

3

To your point about multithreading, the operations you're proposing can be parallelized using the Unix paradigm of small programs that each do one thing well, and then:

Splitting files into chunks (we already do this routinely with FASTQ files)
Using pipes to operate on a data stream using multiple tools e.g. filter | cutadapt | ec | qc | sort | compress > out
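
The first point, splitting a FASTQ file into chunks, can be sketched in a few lines of Python. This is only an illustration of the idea: it assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name `fastq_chunks` is hypothetical.

```python
import io
from itertools import islice


def fastq_chunks(handle, records_per_chunk):
    """Yield lists of FASTQ lines, each holding up to `records_per_chunk`
    records. Assumes every record occupies exactly four lines."""
    lines_per_chunk = 4 * records_per_chunk
    while True:
        chunk = list(islice(handle, lines_per_chunk))
        if not chunk:
            break
        yield chunk


# Three small records stand in for a real FASTQ file.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nFFFF\n"
    "@r3\nTTAA\n+\nHHHH\n"
)
chunks = list(fastq_chunks(demo, records_per_chunk=2))
print([len(chunk) // 4 for chunk in chunks])
```

Each chunk can then be written to its own file or handed to a separate worker, which is the same idea as fastp's split output (0001.R1.gz, 0002.R1.gz...).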

I think you'll find that most of these operations can become I/O bound very quickly well before you run out of CPU cores. Have you done some benchmarking of current tools that shows you are CPU bottlenecked?

ADD REPLY • link 6.5 years ago by Matt Shirley 10k

0

This could potentially be integrated into bcl2fastq so the sequences can be processed before they ever hit a file.

ADD REPLY • link 6.5 years ago by GenoMax 141k

0

Good suggestion. For NovaSeq output (S4 chip, 6 Tb per run, 1.5 days), bcl2fastq is also very slow and needs to be accelerated.

ADD REPLY • link 6.5 years ago by chen ★ 2.5k

0

In my experience, bcl2fastq parallelizes over samples per lane. Before writing something new, it might make sense to first see if your multiplexing strategy can be tweaked to speed things up a bit.

ADD REPLY • link 6.5 years ago by Devon Ryan 104k

0

One NovaSeq flow cell has only 4 lanes, and you cannot feed different lanes with different libraries since there is only one library input. The reads of one sample are distributed across all these lanes, so your method doesn't work.

ADD REPLY • link 6.5 years ago by chen ★ 2.5k

0

Ah, right, I forgot that it was like a NextSeq in that regard. That'd make it optimally multiplexed anyway.

ADD REPLY • link 6.5 years ago by Devon Ryan 104k

0

You will soon be able to load individual lanes with different libraries, as long as you purchase the "NovaSeq Xp" upgrade (which involves some sort of manifold device, as I understand it). I assume an update to the control software would be included that will allow processing of individual lanes.

ADD REPLY • link 6.5 years ago by GenoMax 141k

0

Much of this can be done with BBTools (e.g., trimming and reordering for higher compression). Those are already multithreaded and fairly performant (they're in Java). I imagine that your time would be better spent adding on to that.

ADD REPLY • link 6.5 years ago by Devon Ryan 104k

0

With pigz (and BBMap) and an Isilon storage system capable of ~1.2 GB/s theoretical throughput, it is not uncommon to see 300-500 MB/s streams.

ADD REPLY • link 6.5 years ago by GenoMax 141k

0

I have experience with BBTools, cool and useful! I like them.
However, if you want to combine different processing steps (e.g. filtering + trimming + reordering + QC), you have to run different tools separately. That will definitely take more time.
So the basic idea of fastp is to integrate all these tools in a single program, with interfaces to flexibly configure the needed functions.
Another issue is QC. I prefer a JSON format QC result and an HTML report, since they can be easily integrated with web applications.

ADD REPLY • link 6.5 years ago by chen ★ 2.5k

0

There was a similar post a few months ago: Tool: Collaboration on an empirical QC tool. You could try to contact the people who showed interest in that thread.

"with good multithreading to provide competitive execution performance."

Don't forget in general these tasks are embarrassingly parallel, as you can process each fastq file independently, and GNU Parallel can be of great help here.
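
The comment recommends GNU Parallel for this; the same per-file independence can be sketched in Python with `concurrent.futures`. This is only an illustration: `count_reads` is a hypothetical stand-in for whatever per-file work is needed, the demo files are generated on the fly, and a thread pool is used for simplicity (for CPU-heavy pure-Python work, `ProcessPoolExecutor` would be the likelier choice).

```python
import gzip
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_reads(path):
    """Count records in one FASTQ file (assumes four lines per record)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def count_all(paths, workers=4):
    """Process each file independently; no state is shared between files."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(count_reads, paths)))


# Demo: write two tiny gzipped FASTQ files, then process them in parallel.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, n_reads in enumerate([2, 3]):
        path = Path(tmp) / f"sample{i}.fastq.gz"
        with gzip.open(path, "wt") as out:
            out.write("@r\nACGT\n+\nIIII\n" * n_reads)
        paths.append(path)
    name_counts = {p.name: n for p, n in count_all(paths).items()}

print(name_counts)
```

Because no file depends on another, throughput scales with the number of workers until storage I/O becomes the limit, which is the concern raised earlier in the thread.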

ADD REPLY • link 6.5 years ago by h.mon 35k

score 0 · Answer 1 · 2017-10-31

0

6.5 years ago

chen ★ 2.5k

Would anyone like support for better compression formats for I/O, such as zstd (https://github.com/facebook/zstd)?

ADD COMMENT • link 6.5 years ago by chen ★ 2.5k