Tool: fastp v0.7 released: an all-in-one FASTQ preprocessor (QC, adapters, trimming, quality filtering / cutting, splitting output ... )
Posted 24 days ago by chen (OpenGene):

fastp v0.7 has been released (Nov 23, 2017), with one new feature:

Correct mismatched base pairs in the overlapping regions of paired-end reads when one base has high quality and the other has ultra-low quality.
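As a rough illustration of this kind of correction, here is a minimal Python sketch. It is not fastp's actual code (fastp is written in C/C++). It assumes the overlap has already been found and aligned, so that position i in both segments refers to the same base, and the names `correct_overlap`, `high_q` and `low_q` and the thresholds 30 and 15 are assumptions made for the example. Copying the quality character along with the base is also just one possible design choice.

```python
def phred(ch, offset=33):
    """Convert one FASTQ quality character to its Phred score (Phred+33 assumed)."""
    return ord(ch) - offset


def correct_overlap(seq1, qual1, seq2, qual2, high_q=30, low_q=15):
    """Correct mismatches between two already-aligned overlap segments.

    seq1/qual1 come from read 1; seq2/qual2 come from read 2 after it has been
    reverse-complemented and aligned, so index i is the same base in both.
    high_q and low_q are hypothetical thresholds chosen for illustration.
    """
    if not (len(seq1) == len(qual1) == len(seq2) == len(qual2)):
        raise ValueError("overlap segments must have equal length")
    s1, q1, s2, q2 = list(seq1), list(qual1), list(seq2), list(qual2)
    corrected = 0
    for i in range(len(s1)):
        if s1[i] == s2[i]:
            continue  # bases agree, nothing to do
        p1, p2 = phred(q1[i]), phred(q2[i])
        if p1 >= high_q and p2 <= low_q:
            # trust read 1: overwrite the low-quality base in read 2
            s2[i], q2[i] = s1[i], q1[i]
            corrected += 1
        elif p2 >= high_q and p1 <= low_q:
            # trust read 2: overwrite the low-quality base in read 1
            s1[i], q1[i] = s2[i], q2[i]
            corrected += 1
        # otherwise the mismatch is ambiguous and is left untouched
    return "".join(s1), "".join(q1), "".join(s2), "".join(q2), corrected
```

Mismatches where both bases have high quality, or both have low quality, are deliberately left alone, since there is no clear evidence for either base.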

This project is at: https://github.com/OpenGene/fastp

Features implemented so far:

  • filter out bad reads (too low quality, too short, or too many N...)
  • trim all reads at the front and tail
  • cut low-quality bases from the 5' and 3' ends of each read by evaluating the mean quality in a sliding window (like Trimmomatic but faster).
  • cut adapters (adapters are detected automatically for both PE/SE data).
  • correct mismatched base pairs in the overlapping regions of paired-end reads when one base has high quality and the other has ultra-low quality
  • report results in JSON format for further interpretation.
  • visualize quality control and filtering results on a single HTML page (like FastQC but faster and more informative).
  • split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing.
  • support long reads (data from PacBio / Nanopore devices).
  • ...
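To make the sliding-window cutting in the list above more concrete, here is a minimal Python sketch. It is an illustration only, not fastp's or Trimmomatic's implementation. The function name `cut_by_quality`, the window size of 4 and the mean-quality threshold of 20 are assumptions, and Phred+33 encoding is assumed.

```python
def mean_quality(qual, start, end, offset=33):
    """Mean Phred score of qual[start:end] (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual[start:end]) / (end - start)


def cut_by_quality(seq, qual, window=4, min_mean_q=20):
    """Trim low-quality bases from both ends of a read.

    At each end, a window of `window` bases is evaluated; while its mean
    quality is below `min_mean_q`, the outermost base is dropped and the
    window slides inward by one. window and min_mean_q are illustrative
    defaults, not values taken from any particular tool.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    start, end = 0, len(seq)
    # 5' end: drop leading bases while the leading window is poor
    while end - start >= window and mean_quality(qual, start, start + window) < min_mean_q:
        start += 1
    # 3' end: drop trailing bases while the trailing window is poor
    while end - start >= window and mean_quality(qual, end - window, end) < min_mean_q:
        end -= 1
    return seq[start:end], qual[start:end]
```

Because the decision is based on a window mean rather than on single bases, a few low-quality bases next to good ones can survive; a read that ends up shorter than the window is returned as-is and would normally be handled by a separate length filter.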

Features to be implemented soon:

  • Overrepresentation analysis
  • Sequencing analysis by lanes/tiles
  • Pair merge
  • ...

Initial evaluation shows that fastp is about 10X faster than AfterQC and also much faster than FastQC, Trimmomatic, and Cutadapt, while providing most of their features plus some novel and useful functions. I have deployed fastp on my clusters and my cloud system.

I am still collecting new requirements to make it more powerful and useful. If you have good ideas, please reply to this post or file an issue on the GitHub page.


Initial message calling for requirements:

Hi, I am a co-founder of a company that owns 10 Illumina NovaSeq sequencers, so my cluster system has to process a very large amount of data per day.

Recently I found that FASTQ data preprocessing has become a bottleneck. Current tools (including tools I developed before) provide only some of the functions needed (e.g. QC, filtering, adapter cutting), and are usually too slow (written in scripting languages, without good threading).

So I’m planning to write a new all-in-one FASTQ preprocessing tool, which will implement most required functions, and must be written in C/C++ with good multithreading to provide competitive execution performance.

I am posting this thread to collect requirements for this tool. If you have any good suggestions, please comment here, or file an issue on the GitHub project ( https://github.com/OpenGene/fastp/issues/new ).

Your contributions are greatly appreciated.

Modified 1 day ago; written 24 days ago by chen

To your point about multithreading, the operations you're proposing can be parallelized using the Unix paradigm of small programs that each do one thing well, and then:

  1. Splitting files into chunks (we already do this routinely with FASTQ files)
  2. Using pipes to operate on a data stream using multiple tools e.g. filter | cutadapt | ec | qc | sort | compress > out
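The chunking idea in item 1 can be sketched in Python as follows. This is only an illustration under simple assumptions: every FASTQ record is exactly four lines (no wrapped sequences), and the helper names `read_fastq` and `chunk_records` are made up for this example rather than taken from any existing tool.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines.

    Assumes strict four-line records, which is the common case for
    sequencer output but not guaranteed by every FASTQ file.
    """
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return  # clean end of input
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield record


def chunk_records(records, chunk_size):
    """Group an iterable of records into lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each chunk can then be written to its own file or handed to a separate worker, and both functions stream their input, so memory use stays bounded by the chunk size.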

I think you'll find that most of these operations become I/O-bound very quickly, well before you run out of CPU cores. Have you done any benchmarking of current tools that shows you are CPU-bottlenecked?

Reply written 24 days ago by Matt Shirley

This could potentially be integrated into bcl2fastq so the sequences can be processed before they ever hit a file.

Reply written 24 days ago by genomax

Good suggestion. For NovaSeq output (S4 flow cell, 6 Tb per run, 1.5 days), bcl2fastq is also very slow and needs to be accelerated.

Reply written 24 days ago by chen

In my experience, bcl2fastq parallelizes over samples per lane. Before writing something new, it might make sense to first see if your multiplexing strategy can be tweaked to speed things up a bit.

Reply written 23 days ago by Devon Ryan

One NovaSeq flow cell has only 4 lanes, and you cannot feed different lanes with different libraries since there is only one library input. The reads of one sample are distributed across all of these lanes, so your method doesn't work.

Reply written 23 days ago by chen

Ah, right, I forgot that it was like a NextSeq in that regard. That'd make it optimally multiplexed anyway.

Reply written 23 days ago by Devon Ryan

You will soon be able to load individual lanes with different libraries, as long as you purchase the "NovaSeq Xp" upgrade (which involves some sort of manifold device, as I understand it). I assume it would include an update to the control software that allows processing of individual lanes.

Reply modified 23 days ago; written 23 days ago by genomax

Much of this can be done with BBTools (e.g., trimming and reordering for higher compression). Those are already multithreaded and fairly performant (they're in Java). I imagine that your time would be better spent adding on to that.

Reply written 24 days ago by Devon Ryan

With pigz (and BBMap) and an Isilon storage system (~1.2 GB/s theoretical throughput), it is not uncommon to see 300-500 MB/s streams.

Reply modified 24 days ago; written 24 days ago by genomax

I have experience with BBTools. They are cool and useful, and I like them.
However, if you want to combine different processing steps (e.g. filtering + trimming + reordering + QC), you have to run different tools separately, which will definitely take more time.
So the basic idea of fastp is to integrate all of these steps into a single program, with interfaces for flexibly configuring the functions needed.
Another issue is QC. I prefer QC results in JSON format plus an HTML report, since they can be easily integrated with web applications.

Reply modified 24 days ago; written 24 days ago by chen

There was a similar post a few months ago, "Tool: Collaboration on an empirical QC tool"; you could try contacting the people who showed interest in that thread.

Quoting the original post: "with good multithreading to provide competitive execution performance."

Don't forget that, in general, these tasks are embarrassingly parallel, since you can process each FASTQ file independently, and GNU Parallel can be of great help here.
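The same per-file independence can be sketched in Python instead of GNU Parallel. This is a minimal example under stated assumptions: the per-file job is just counting reads in four-line FASTQ records, and the names `count_reads` and `process_files` are hypothetical. Threads are used by default to keep the example simple; for CPU-heavy per-file work, `concurrent.futures.ProcessPoolExecutor` could be passed as `executor_cls` instead.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def count_reads(path):
    """Return (path, number of reads), assuming four lines per FASTQ record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return path, n_lines // 4


def process_files(paths, worker=count_reads, max_workers=4,
                  executor_cls=ThreadPoolExecutor):
    """Run `worker` on every file independently and collect the results.

    `worker` must return a (key, value) pair; results are gathered in a dict.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return dict(pool.map(worker, paths))
```

Whether this actually speeds things up depends on the point raised earlier in the thread: if storage throughput is the limit, adding workers will not help much.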

Reply written 24 days ago by h.mon
chen (OpenGene) wrote, 24 days ago:

Would anyone like fastp to support I/O in better compression formats such as zstd (https://github.com/facebook/zstd)?

Modified 24 days ago; written 24 days ago by chen