fastp v0.9 is released (Nov 29, 2017), with a new feature added:
preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name. See Tutorial: Use fastp to preprocess FASTQ data with unique molecular identifier (UMI) integrated
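To illustrate what "shift UMI to sequence name" means, here is a minimal Python sketch. It is not fastp's actual C++ code. The function name `shift_umi`, the assumption that the UMI sits at the start of the read, and the `:` separator in the read name are all illustrative assumptions.

```python
def shift_umi(name, seq, qual, umi_len, sep=":"):
    """Move the first `umi_len` bases of a read into its name.

    Assumptions (illustrative only): the UMI is at the start of the read,
    and it is appended to the read ID (the part of the name before the
    first space) using `sep`.
    """
    # Leave the record untouched if there is no UMI to extract.
    if umi_len <= 0 or len(seq) < umi_len:
        return name, seq, qual

    umi = seq[:umi_len]

    # Split "@id comment" into the ID and the optional comment.
    read_id, _, comment = name.partition(" ")
    new_name = read_id + sep + umi
    if comment:
        new_name += " " + comment

    # Drop the UMI bases and their quality values from the read.
    return new_name, seq[umi_len:], qual[umi_len:]


if __name__ == "__main__":
    print(shift_umi("@r1 1:N:0", "ACGTTTGGCC", "IIIIIIIIII", 4))
```

Keeping the UMI in the read name lets downstream tools group reads by UMI after alignment, while the aligner only sees the biological sequence.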
This project is at: https://github.com/OpenGene/fastp
A list of features implemented:
- filter out bad reads (too low quality, too short, or too many N...)
- trim all reads at the front and tail
- cut low-quality bases from the 5' and 3' ends of each read by evaluating the mean quality in a sliding window (like Trimmomatic but faster).
- cut adapters (adapters are detected automatically for both PE/SE data).
- correct mismatched base pairs in the overlapped regions of paired-end reads, if one base has high quality while the other has ultra-low quality
- preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
- report results in JSON format for further interpretation.
- visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
- split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing.
- support long reads (data from PacBio / Nanopore devices).
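The sliding-window quality cutting mentioned in the list above can be sketched roughly as follows. This is a simplified Python illustration of the general idea, not fastp's actual implementation. The function name `cut_by_window`, the default window size and threshold, and the Phred+33 quality encoding are assumptions made for the example.

```python
def cut_by_window(seq, qual, window=4, min_mean_q=20, offset=33):
    """Trim low-quality bases from the 5' and 3' ends of one read.

    A window slides inward from each end and stops at the first position
    where the mean quality reaches `min_mean_q`. Defaults are illustrative.
    """
    scores = [ord(c) - offset for c in qual]  # assumes Phred+33 encoding
    n = len(scores)

    # Reads shorter than one window are returned unchanged in this sketch.
    if n < window:
        return seq, qual

    # 5' end: move the window forward until its mean quality passes.
    start = 0
    while start + window <= n and sum(scores[start:start + window]) / window < min_mean_q:
        start += 1

    # No window passed: the whole read is considered low quality.
    if start + window > n:
        return "", ""

    # 3' end: move the window backward until its mean quality passes.
    # The loop always terminates, because the window at `start` passes.
    end = n
    while end - window >= start and sum(scores[end - window:end]) / window < min_mean_q:
        end -= 1

    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    print(cut_by_window("AAAACCCCGGGGTTTT", "####IIIIIIII####", window=4, min_mean_q=30))
```

Because the decision is based on a window mean rather than on single bases, one isolated low-quality base at the boundary can survive the cut. That is the usual trade-off of this approach.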
A list of features that will be implemented soon:
- Over representation analysis
- Sequencing analysis by lanes/tiles
- Pair merge
The initial evaluation has shown that fastp is about 10X faster than AfterQC, and also much faster than Cutadapt, while providing most features from all of them, plus some novel and useful functions. I have deployed fastp in my clusters and my cloud system.
I am still calling for new requirements to make it more powerful and useful. If you have good ideas, please reply to this post or file an issue on the GitHub page.
Initial message calling for requirements:
Hi, I am a co-founder of a company that owns 10 Illumina NovaSeq sequencers, so my cluster system has to process a very large amount of data per day.
Recently I found that FASTQ data preprocessing has become a bottleneck. Current tools (including tools I developed before) can only provide a part of the wanted functions (i.e. QC/filtering/adapter-cutting), and are usually too slow (written in scripting languages, without good threading).
So I’m planning to write a new all-in-one FASTQ preprocessing tool, which will implement most required functions, and must be written in C/C++ with good multithreading to provide competitive execution performance.
Now I am posting this thread to call for requirements for this tool. If you have any good suggestions, please comment here, or file an issue on the GitHub project ( https://github.com/OpenGene/fastp/issues/new ).
Your contributions are greatly appreciated.