Question

Complexity curve for PE fastq files

6 months ago

noodle ▴ 580

Hi all,

Can someone recommend a one-liner or command-line tool to produce a complexity curve based on PE fastq files? Something like preseq c_curve but with the input being the fastq files instead of an alignment, with some flags for number of mismatches allowed, etc. TIA!

complexity fastq • 501 views

updated 6 months ago by Brian Bushnell 20k • written 6 months ago by noodle ▴ 580

score 1 · Answer 1 · 2023-10-24

6 months ago

Brian Bushnell 20k

BBTools has "bbcountunique.sh", which we use to calculate library complexity for R1, R2, and the combined pair. It takes unaligned SE or PE fastq as input and produces a histogram of uniqueness as the read count increases, reporting a point every X reads (default is 25000).
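To make the idea of a uniqueness histogram concrete, here is a minimal Python sketch. It is not how bbcountunique.sh is implemented; it only illustrates the general approach under some loud assumptions: the function names are hypothetical, a read is keyed by its first `k` bases only, matching is exact, and the FASTQ files use strict 4-line records. The `k=25` default is an arbitrary choice for this sketch, while `interval=25000` mirrors the reporting interval mentioned above.

```python
import gzip
from itertools import islice


def read_seqs(path):
    """Yield the sequence line of each FASTQ record (plain or gzipped).

    Assumes strict 4-line records, so the sequence is every 4th line
    starting from the 2nd line of the file.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def uniqueness_curve(r1_path, r2_path, k=25, interval=25000):
    """Return (reads_seen, %unique_R1, %unique_R2, %unique_pair) per interval.

    Hypothetical helper: a read counts as "unique" when its first k bases
    have not been seen before (exact match, no mismatches allowed).
    """
    seen_r1, seen_r2, seen_pair = set(), set(), set()
    new = [0, 0, 0]  # novel keys in the current interval: R1, R2, pair
    curve = []
    n = 0
    for s1, s2 in zip(read_seqs(r1_path), read_seqs(r2_path)):
        n += 1
        k1, k2 = s1[:k], s2[:k]
        keyed = ((k1, seen_r1), (k2, seen_r2), ((k1, k2), seen_pair))
        for i, (key, seen) in enumerate(keyed):
            if key not in seen:
                seen.add(key)
                new[i] += 1
        if n % interval == 0:
            # Percent of reads in this interval that were not seen earlier.
            curve.append((n, *(100.0 * c / interval for c in new)))
            new = [0, 0, 0]
    return curve
```

Calling `uniqueness_curve("R1.fastq.gz", "R2.fastq.gz")` (placeholder file names) returns one tuple per 25000 read pairs. A curve that falls toward zero suggests the library is approaching saturation. Reads of different lengths, such as 61 bp and 51 bp, are not a problem here as long as both are at least `k` bases long.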

Thanks, I know this tool, but it doesn't seem very flexible when it comes to kmer length and number of mismatches. Do you think it would be appropriate for PE data with R1=61bp and R2=51bp?

6 months ago by noodle ▴ 580

Yes, it will work fine with those read lengths, though whether the tool is appropriate depends on the exact question you wish to answer. For example, it will give you upward spikes in low-quality areas of the flow cell due to sequencing errors, because it requires an exact kmer match to consider sequences duplicates. It will also asymptote at a level slightly above zero, depending on the error rate. Determining whether read pairs are duplicates on the fly while allowing for an arbitrary number of mismatches is rather difficult. You could, of course, error-correct the data before measuring complexity; then you wouldn't need to worry about mismatch flags, since the spurious complexity would be eliminated (to the extent possible).
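As a rough back-of-the-envelope illustration of the non-zero asymptote, consider a simplified model (an assumption for this sketch, not a description of the tool's internals): each base is wrong independently with probability e, and any kmer containing at least one error looks novel. Even in a fully saturated library, the expected fraction of reads that still appear unique is then about 1 - (1 - e)^k. The function name and the choice of k = 25 below are hypothetical.

```python
def error_floor(per_base_error, k):
    """Expected fraction of kmers carrying at least one sequencing error.

    Simplified model: independent per-base errors, and every erroneous
    kmer is counted as never seen before under exact matching.
    """
    return 1.0 - (1.0 - per_base_error) ** k


# Compare a low and a high per-base error rate for an assumed k of 25.
for e in (0.001, 0.01):
    print(f"per-base error {e}: floor ~ {100 * error_floor(e, 25):.1f}% unique")
```

Under this model the floor is roughly 2.5% at a 0.1% per-base error rate and roughly 22% at 1%, which is also why low-quality regions show up as upward spikes, and why error-correcting first lowers the floor.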

6 months ago by Brian Bushnell 20k