Hi, this tool may save you time: it filters fastq data and runs QC on it automatically.
The following introduction is out of date, and the newer AfterQC is much more powerful; please check the GitHub page for updates.
AfterQC
project on github: https://github.com/OpenGene/AfterQC
sample report: http://opengene.org/AfterQC/report.html
Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
AfterQC simply goes through all fastq files in a folder and outputs three folders: good, bad and QC, which contain the good reads, the bad reads and the QC results of each fastq file/pair.
Currently it supports processing data from HiSeq 2000/2500/3000/4000, X10, X5, NextSeq 500/550, MiniSeq...
Features:
AfterQC does the following tasks automatically:
- Filters reads with too low quality, too short length or too many N
- Filters reads with abnormal PolyA/PolyT/PolyC/PolyG sequences
- Does per-base quality control and plots the figures
- Trims reads at front and tail, according to QC results
- For pair-end sequencing data, AfterQC automatically corrects low quality wrong bases in the overlapped area of read1/read2
- Detects and eliminates bubble artifacts caused by the sequencer due to fluid dynamics issues
- Single molecule barcode sequencing support: if all reads have a single molecule barcode (see duplex sequencing), AfterQC shifts the barcodes from the reads to the fastq query names
- Supports both single-end sequencing and pair-end sequencing data
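To make the first two filters in the list concrete, here is a minimal sketch of a per-read check for length, N count, low-quality bases and PolyX runs. It is not AfterQC's code: the function names and all threshold values are made up for illustration, and it assumes Phred+33 quality strings.

```python
# Illustrative sketch only; names and thresholds are hypothetical,
# not AfterQC's actual defaults or implementation.

def count_low_quality(qual, threshold=20, phred_offset=33):
    """Count bases whose Phred score is below the threshold (Phred+33 assumed)."""
    return sum(1 for ch in qual if ord(ch) - phred_offset < threshold)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base (PolyA, PolyT, ...)."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        prev = base
        best = max(best, run)
    return best


def is_good_read(seq, qual, min_len=35, max_n=5, max_low_qual=10,
                 max_poly=35, qual_threshold=20):
    """Return True if the read passes all four example filters."""
    if len(seq) < min_len:
        return False  # too short
    if seq.count("N") > max_n:
        return False  # too many N
    if count_low_quality(qual, qual_threshold) > max_low_qual:
        return False  # too many low-quality bases
    if longest_homopolymer(seq) >= max_poly:
        return False  # abnormal PolyX run
    return True
```

A read that fails any one check would go to the 'bad' folder in this simplified picture; the real tool may combine or order the checks differently.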
Dependency:
AfterQC uses the editdistance module; run the following before using AfterQC:
pip install editdistance
WARNING: If you haven't installed the editdistance module, AfterQC will fall back to a pure Python implementation of edit distance, which is extremely slow.
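The fallback behaviour described in the warning follows a common pattern, sketched below. This is an assumption about the general approach, not a copy of AfterQC's source: the wrapper name `edit_distance` is hypothetical, and the pure Python branch is a textbook Levenshtein implementation.

```python
# Sketch of an "optional fast dependency" pattern; edit_distance is a
# hypothetical wrapper name, not a function from AfterQC.
try:
    import editdistance  # fast C-backed module installed via pip

    def edit_distance(a, b):
        return editdistance.eval(a, b)

except ImportError:

    def edit_distance(a, b):
        """Pure Python Levenshtein distance, O(len(a) * len(b)); much slower."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]
```

Either branch gives the same answers; only the speed differs, which is why installing the module matters on large fastq files.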
Simple usage:
1, Prepare your fastq files in a folder
2, For single-end sequencing, the filenames in the folder should be *R1*
For pair-end sequencing, the filenames in the folder should be *R1* and *R2*
cd /path/to/fastq/folder
python path/to/AfterQC/after.py
Two folders will be generated automatically: a folder 'good' that stores the good reads and a folder 'bad' that stores the bad reads.
AfterQC will print some statistical information after it is done, such as how many good reads, how many bad reads, and how many reads were corrected.
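If you want to check in advance how the files in your folder would pair up under the *R1* / *R2* naming convention, a small helper like the one below can do it. This is only a sketch under the assumption that the mate's name is the read1 name with the first "R1" replaced by "R2"; `pair_fastq_files` is a hypothetical name and AfterQC's own matching logic may differ.

```python
# Hypothetical helper for previewing R1/R2 pairing; not part of AfterQC.

def pair_fastq_files(filenames):
    """Return (read1, read2_or_None) tuples for every name containing 'R1'."""
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        if "R1" not in name:
            continue  # R2 files and unrelated files are not pair starters
        mate = name.replace("R1", "R2", 1)
        pairs.append((name, mate if mate in names else None))
    return pairs
```

For example, passing `os.listdir(".")` from inside the fastq folder lists each read1 file with its mate, or with `None` if it would be treated as single-end in this simplified model.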
Quality Control only
If you only want to get quality control statistics, run:
python after.py --qc_only
Understand the report
AfterQC will generate a QC folder, which contains lots of figures.
- For pair-end sequencing data, both read1 and read2 figures will be in the same folder, which is named after read1's filename. R1 means read1, R2 means read2.
- For single-end sequencing data, it will still have R1.
- prefilter means before filtering, postfilter means after filtering
- For pair-end sequencing data, AfterQC will do an overlap analysis. read1 and read2 will be overlapped when read1_length + read2_length > DNA_template_length.
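The overlap condition read1_length + read2_length > DNA_template_length can be turned into a quick estimate of how many bases overlap. The sketch below is plain arithmetic, not AfterQC's overlap detection (which works on the actual sequences); the function name is hypothetical, and each read is capped at the template length on the assumption that bases sequenced past the end of the template are adapter, not template.

```python
# Back-of-the-envelope estimate; expected_overlap is a hypothetical name.

def expected_overlap(read1_length, read2_length, template_length):
    """Number of template bases covered by both reads (0 if they do not meet)."""
    covered1 = min(read1_length, template_length)
    covered2 = min(read2_length, template_length)
    return max(0, covered1 + covered2 - template_length)
```

With 2x150 reads, a 200 bp template gives 150 + 150 - 200 = 100 overlapping bases, while a 300 bp or longer template gives no overlap at all.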
Hello,
I've got a few questions about the calcs in AfterQC. In the AfterQC paper, you note that "AfterQC can detect the mismatches in the overlapping regions. For those reads with very long overlap (i.e. overlap_len>50)".
In the estimated seq error field in the html report, are only overlaps greater than 50bp considered? And are the errors in these overlaps the only component that goes into the seq error rate calculation?
If only overlaps greater than 50bp go into the calculation, could you please let me know where I should change the source to modify that number (my guess is complete_compare_require in util.py)?
Thanks very much for the software!
Please don't post new questions in the answer section. New Questions need to be asked separately. This post will be moved to a comment.