PacBio raw data trimming and cleaning
2
0
Entering edit mode
6.8 years ago
misterie ▴ 110

Hi,

I have done QC analysis using FastQC on my PacBio data. I am wondering whether I should clean my data by a quality criterion or by a minimum read length. The data quality is quite poor (PacBio). Do you know of any recommendations for cleaning these data? For Illumina I used to use Trimmomatic, Cutadapt and Trim Galore, but I have no idea how to pre-process PacBio data. I think I should remove reads shorter than 50 bp, but if you have any other recommendations for this purpose, let me know.

pacbio trimming cleaning qc • 7.3k views
ADD COMMENT
0
Entering edit mode

Things like this usually depend on the application you want to do after cleaning. Do you want (structural) variant calling, de novo assembly,...?

ADD REPLY
0
Entering edit mode

I want to do de novo assembly using different pipelines, but I think I should at least trim my data with a minimum length of 50 bp.

ADD REPLY
0
Entering edit mode

Since you are working with PacBio data, I personally think it's a bit silly to use a lower bound of only 50 nucleotides. Depending on your read-length distribution, I would go for at least 10-fold your 50 nt threshold.

On the other hand, most PacBio processing pipelines already apply an internal minimum-length filter (usually around a few kbp).

ADD REPLY
0
Entering edit mode

Thank you. What I mainly mean is that I have some samples demultiplexed using lima, where the standard (default) minimum-length threshold was set to 50 bp. I also have samples that did not require demultiplexing, so they contain reads down to 1 bp. I want to make these samples uniform. If it would be better to change the threshold to 500 bp, let me know which software would be appropriate.
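If you just want a quick, tool-agnostic way to apply one uniform minimum length across all samples, a few lines of scripting are enough. A minimal sketch, assuming plain (uncompressed) 4-line FASTQ records; function names and the 500 bp default are illustrative, not from any particular tool:

```python
# Minimal length filter for plain 4-line FASTQ records.
# min_len=500 mirrors the threshold discussed above; adjust as needed.

def parse_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from a list of FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield tuple(lines[i:i + 4])

def filter_by_length(records, min_len=500):
    """Keep only records whose sequence is at least min_len bases long."""
    return [rec for rec in records if len(rec[1]) >= min_len]
```

In practice you would stream from a (possibly gzipped) file rather than holding everything in memory; dedicated tools such as seqkit can also do length filtering.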

ADD REPLY
0
Entering edit mode

It all depends on what analyses you want to do with the data.

E.g. assembly: most assemblers will either do 'cleaning' themselves or do no quality cleaning at all.

ADD REPLY
0
Entering edit mode

I'm facing the same problem as you! Did you solve it eventually? Actually, I'm a complete newbie in this field and am trying to do de novo assembly on plants. I only have PacBio data and I don't know where to find the pipelines and the right tools to use...

ADD REPLY
0
Entering edit mode

Do you have PacBio CLR data or CCS (HiFi) data?

For quality trimming you can try fastplong.

ADD REPLY
0
Entering edit mode

I think it's CLR data, but I'm not 100% sure. And I got this creepy FastQC report and really don't know how to do quality trimming. The data was sequenced years ago; my mentor only gave it to me this month, and it's not even his field! I am the first and only one in the lab working on bioinformatics stuff, so I am just TOTALLY lost.

[four screenshots of the FastQC report omitted]

ADD REPLY
0
Entering edit mode

Don't use/rely on FastQC for long-read data. As noted in other comments, use a tool meant for long-read data.

ADD REPLY
0
Entering edit mode

I turned to NanoPlot, but it's still all mysterious to me...

ADD REPLY
0
Entering edit mode

what exactly is mysterious about the nanoplot output?

or do you mean how to interpret the results?

ADD REPLY
0
Entering edit mode

On top of what GenoMax says: don't let yourself be fooled by the binning in the FastQC graphs; it's a recurring thing to be misled by the binning issue!

ADD REPLY
0
Entering edit mode

Oh no, I didn't know that! But if all those tools are not to be trusted, how am I supposed to know how good or bad my data is, and what to do next to filter it? I asked ChatGPT, but I was afraid it could also be misleading...

ADD REPLY
0
Entering edit mode

Let's also not exaggerate :-) , those tools are to be trusted! It's just that you need some insight to interpret their graphs (as with any other tool's graphs)...

The binning issue only affects the output graphs (== what you see), not the computations behind them.

You can deactivate that behaviour using the --nogroup option of FastQC (beware though that your reports will become substantially bigger when doing so). I suggest you run it once like that and compare the graphs; you will see what the difference is.

ADD REPLY
0
Entering edit mode

OK, what have you tried so far?

A quick Google search (or similar) should already point you to some tools or procedures.

On read QC: for long reads, especially when the goal is assembly, it's often mainly length filtering (and some quality filtering, but that's not even crucial, at least much less so than it is for Illumina; the same applies to adapter trimming). In any case, you should first get the QC overviews to make the correct decision (something like FastQC or NanoPlot/chopper/...).
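To make that length-threshold decision concrete, a simple summary statistic such as the N50 of your read lengths is often all you need from the QC overview. A minimal sketch (the example lengths in the test are purely illustrative):

```python
def n50(lengths):
    """Return the N50: the read length L such that reads of length >= L
    together cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

A common rule of thumb is to set the minimum-length filter well below the N50 so that you discard only the short tail of the distribution, not a meaningful fraction of your coverage.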

ADD REPLY
0
Entering edit mode

PacBio has several tools available for assembly of their data: https://www.pacb.com/products-and-services/analytical-software/whole-genome-sequencing/

ADD REPLY
2
Entering edit mode
9 days ago
Kevin Blighe ★ 90k

ok, first things first: it is crucial to know whether your pacbio data is CLR (continuous long reads, which are longer but with higher error rates) or HiFi (high-fidelity, from circular consensus sequencing, shorter but very accurate). that distinction will guide the preprocessing and assembly choices.

assuming you are a newbie, start by running some QC to understand your data. fastQC is ok for a quick look, but for long reads like pacbio, tools such as NanoPlot (from the NanoPack suite) or LongQC are better suited as they handle the specifics of long-read distributions well. you can install NanoPack via pip or conda, and run NanoPlot --fastq your_reads.fastq to get plots on length, quality, and more. this will help you decide on filtering thresholds.

regarding preprocessing: pacbio data often needs minimal cleaning compared to illumina, especially if it is HiFi, since the reads are already high quality. however, you might want to filter out very short or low-quality reads to make your samples uniform. a minimum length of 500 bp could be a reasonable start, but check your read-length histogram from QC: for assembly, keeping reads above 1 kb or even higher (e.g., 5 kb for CLR) is common to improve contiguity. avoid aggressive quality trimming, as long-read assemblers are tolerant of errors.

for the actual trimming and filtering, fastp is a solid, updated choice that supports long reads (pacbio and nanopore). use it with options like --length_required 500 --qualified_quality_phred 10 to filter by length and quality. it is ultrafast and can handle adapter removal if needed (though pacbio adapters are usually handled during demultiplexing with lima). if your data still has barcodes or adapters, specify them with --adapter_sequence. you can get fastp from github (OpenGene/fastp) and it is actively maintained as of 2025.
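As a sketch of how those options fit together, here is a small (hypothetical) helper that assembles the command line; only the fastp flags themselves come from the paragraph above, and the file names are placeholders:

```python
import subprocess  # only needed if you actually run the command

def build_fastp_cmd(in_fq, out_fq, min_len=500, min_qual=10):
    """Assemble a fastp command line with the length and quality
    filters mentioned above. File names are placeholders."""
    return ["fastp", "-i", in_fq, "-o", out_fq,
            "--length_required", str(min_len),
            "--qualified_quality_phred", str(min_qual)]

# To actually run it (requires fastp on your PATH):
# subprocess.run(build_fastp_cmd("reads.fastq", "filtered.fastq"), check=True)
```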

once preprocessed, for de novo assembly on plants (which can be complex due to heterozygosity and repeats), here are some recommended pipelines:

  • if HiFi data: go with hifiasm, which is fast, haplotype-aware, and excellent for diploid genomes like plants. install via bioconda, and run with hifiasm -o assembly your_reads.fastq. it handles most of the heavy lifting internally without much extra preprocessing.

  • if CLR data: try flye, which is robust for error-prone long reads. run flye --pacbio-raw your_reads.fastq --out-dir assembly, and it includes built-in error correction. alternatively, canu is another option, though a bit slower.
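The CLR-vs-HiFi choice above can be sketched as a small dispatcher (a hypothetical helper; only the hifiasm and flye invocations come from the bullets above):

```python
def assembly_cmd(read_type, reads, outdir="assembly"):
    """Return the assembler command line for the given read type:
    'hifi' -> hifiasm, 'clr' -> flye, as recommended above."""
    if read_type == "hifi":
        return ["hifiasm", "-o", outdir, reads]
    if read_type == "clr":
        return ["flye", "--pacbio-raw", reads, "--out-dir", outdir]
    raise ValueError("read_type must be 'hifi' or 'clr'")
```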

pacbio provides their own tools via SMRT Link software, including the improved phased assembler (IPA) for HiFi, which is user-friendly with a GUI and optimized for their data. download it from the pacbio website, and it includes assembly workflows that can start from raw or demultiplexed reads.

a quick search on google or biostars for "pacbio hifi assembly pipeline plant" will give you tutorials and benchmarks; for example, recent papers from 2025 highlight hifiasm for high-quality plant genomes. if you share more details like your read type, coverage, or what you have tried, i can refine this further.

Kevin

ADD COMMENT
0
Entering edit mode

small addition: the developers advise not to use fastp for long reads but rather their fastplong version (more or less the same under the hood, but better tailored to long-read specifics).

ADD REPLY
0
Entering edit mode

Thanks for clarifying, lieven.sterck

ADD REPLY
0
Entering edit mode
6 days ago

When I used to work with PacBio CLR reads, my starting point was to plot the lengths and qualities of the reads; after checking the plots, I would set a liberal length and quality cut-off. If the amount of data was sufficient, I ran canu self-correction and trimming, then assembled the self-corrected and trimmed data using both canu and flye separately, and decided on further steps based on the assembly statistics.

ADD COMMENT
