3 months ago
eebloom
I have nanopore data which I have basecalled using dorado (--modified-bases) to produce unaligned BAM files. Note: this includes all raw reads, i.e. nothing has been filtered on Q-score.
This is human cancer data. I want to perform consensus SV calling.
Sadly, some of my samples have relatively low average read length (mean length ~2-3kb and median length ~700bp)! As such, I am cautious about over-filtering the data and losing more reads than I have to.
What would be the recommendation for min read length and quality to filter such data (if at all)?
Presumably I would need to convert to fastq first...
If you are aligning to a good reference you can try using all the data. If you feel that the alignments look less than optimal then you could start filtering.
I guess you mean median length 700bp. That's not long, but still better than Illumina.
I would recommend looking at chopper, seqtk or fastp to check your data. I like fastp since it is fast and creates informative graphics with percent Q20 and Q30 reads etc. This is starting to become very useful as ONT data increases gradually in quality.
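If you just want a quick look at the length distribution before picking any cutoff, something along these lines might do. It is only a rough sketch using the Python standard library, assuming a plain 4-lines-per-record FASTQ; the function names (`read_lengths`, `n50`, `summarize`) are made up for illustration and are not part of chopper, seqtk or fastp.

```python
import statistics


def read_lengths(fastq_lines):
    """Yield the sequence length of each record.

    Assumes a strict 4-lines-per-record FASTQ (no wrapped sequences).
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            yield len(line.strip())


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Return read count, mean, median and N50 for an iterable of lengths."""
    lengths = list(lengths)
    return {
        "reads": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy lengths only, not real data: a few long reads pull the mean
    # well above the median, as in the situation described in the question.
    print(summarize([100, 200, 700, 3000, 8000]))
```

For a real file you could pass an open file handle to `read_lengths` and feed the result to `summarize`. A mean far above the median, as in the original post, usually means many short reads plus a tail of long ones, so a length cutoff will remove a lot of reads but comparatively few bases.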
As GenoMax says - I would map everything and check the coverage of your samples before filtering. A Q value filter of Q7 or Q10 might remove the worst reads, but these - even though poor overall - might be the key reads supporting the breakpoints of your future discovery. It all depends.
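To make the Q7/Q10 idea concrete, here is a minimal sketch of a per-read quality and length check in plain Python. It assumes Phred+33 encoded quality strings and computes the mean Q by averaging error probabilities rather than the raw Q values, which is my understanding of the usual nanopore convention but is an assumption here. The names `mean_qscore` and `passes` and the thresholds are hypothetical, not taken from any tool.

```python
import math


def mean_qscore(quality_string, phred_offset=33):
    """Mean read quality: average the per-base error probabilities,
    then convert the average back to a Phred score."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(c) - phred_offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes(seq, qual, min_len=0, min_q=0.0):
    """True if the read meets both the length and mean-quality thresholds."""
    return len(seq) >= min_len and mean_qscore(qual) >= min_q


if __name__ == "__main__":
    # '+' encodes Q10 in Phred+33, so this toy read sits right at a Q10 cutoff.
    print(round(mean_qscore("++++"), 2))
    # One Q0 base ('!') next to a Q20 base ('5') drags the mean far below
    # the arithmetic mean of 10, because error probabilities are averaged.
    print(round(mean_qscore("5!"), 2))
    print(passes("ACGT", "++++", min_len=4, min_q=7))
```

Leaving `min_len` and `min_q` at their defaults keeps every read, which matches the "map everything first" advice. Tightening them afterwards shows how many reads each threshold would cost.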
Yeah I meant bp - have edited my original post, thanks.
Thanks both for the recommendations, this is very helpful