Complexity curve for PE fastq files
6 months ago
noodle ▴ 580

Hi all,

Can someone recommend a one-liner or command-line tool to produce a complexity curve based on PE fastq files? Something like preseq c_curve but with the input being the fastq files instead of an alignment, with some flags for number of mismatches allowed, etc. TIA!

complexity fastq • 501 views
6 months ago

BBTools has "bbcountunique.sh" which we use for the purpose of calculating library complexity for R1, R2, and the combined pair. It uses unaligned SE or PE fastq as input and produces a histogram of uniqueness as the read count increases, reporting a point every X reads (default is 25000).
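To make the idea concrete, here is a minimal Python sketch of that kind of uniqueness histogram. It is not BBTools' implementation, only an illustration of the principle: it takes the first k bases of each read as a key, checks whether that key has been seen before, and reports the percentage of new keys in each interval for R1, R2, and the pair. The function names, the choice of the first k-mer as the key, and the value `k=25` are assumptions made for this example, and it assumes standard 4-line FASTQ records.

```python
import gzip
from itertools import islice


def read_seqs(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for seq in islice(handle, 1, None, 4):
            yield seq.strip().upper()


def uniqueness_curve(r1_path, r2_path, k=25, interval=25000):
    """Return [(pairs_seen, pct_new_r1, pct_new_r2, pct_new_pair), ...].

    Each percentage is the share of reads in that interval whose leading
    k-mer (or pair of leading k-mers) had not been seen earlier in the file.
    A trailing partial interval is not reported.
    """
    seen_r1, seen_r2, seen_pair = set(), set(), set()
    new = [0, 0, 0]  # new keys in the current interval: R1, R2, pair
    curve = []
    pairs = zip(read_seqs(r1_path), read_seqs(r2_path))
    for n, (s1, s2) in enumerate(pairs, 1):
        k1, k2 = s1[:k], s2[:k]
        keyed = ((k1, seen_r1), (k2, seen_r2), ((k1, k2), seen_pair))
        for i, (key, seen) in enumerate(keyed):
            if key not in seen:  # exact match only, no mismatches allowed
                seen.add(key)
                new[i] += 1
        if n % interval == 0:
            curve.append((n, *(100.0 * c / interval for c in new)))
            new = [0, 0, 0]
    return curve
```

Calling `uniqueness_curve("R1.fastq.gz", "R2.fastq.gz")` (hypothetical file names) returns one tuple per 25000 pairs; plotting the last column against the first gives a complexity curve for the pairs. Memory grows with the number of distinct keys, so this is only practical for modest inputs.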


Thanks, I know this tool but it doesn't seem very flexible when it comes to kmer length and number of mismatches. Do you think this would be appropriate for PE with R1=61bp and R2=51bp?


Yes, it will work fine with those read lengths, though whether the tool is appropriate depends on the exact question you wish to answer. It will, for example, give you upward spikes in low-quality areas on the flow cell due to sequencing errors, as it requires an exact kmer match to consider sequences duplicates. Also, the curve will asymptote at a level slightly above zero, depending on the error rate. Determining whether read pairs are duplicates on the fly while allowing for an arbitrary number of mismatches is rather difficult. Of course, you could error-correct the data prior to measuring complexity; then you wouldn't need to worry about mismatch flags, since the spurious complexity will be eliminated (to the extent possible).
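The point about the asymptote can be illustrated with a small toy simulation in Python. This is a sketch under stated assumptions, not a model of any real sequencer: reads are drawn from a fixed pool of random fragments, each base is substituted independently with probability `error_rate`, and a read counts as new only if its exact sequence has not been seen. The function name and all parameter values are hypothetical.

```python
import random


def simulate_uniqueness(n_fragments=200, n_reads=20000, k=20,
                        error_rate=0.0, interval=5000, seed=0):
    """Return the percentage of never-before-seen reads in each interval."""
    rng = random.Random(seed)
    # Toy library: a fixed pool of random k-length fragments.
    library = ["".join(rng.choice("ACGT") for _ in range(k))
               for _ in range(n_fragments)]
    seen = set()
    new_in_bin = 0
    pct = []
    for n in range(1, n_reads + 1):
        read = list(rng.choice(library))
        for i in range(k):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                read[i] = rng.choice("ACGT".replace(read[i], ""))
        key = "".join(read)
        if key not in seen:  # exact match required, as described above
            seen.add(key)
            new_in_bin += 1
        if n % interval == 0:
            pct.append(100.0 * new_in_bin / interval)
            new_in_bin = 0
    return pct


clean = simulate_uniqueness(error_rate=0.0)
noisy = simulate_uniqueness(error_rate=0.01)
print("no errors:", clean)
print("1% errors:", noisy)
```

Without errors, the last interval contains no new reads once every fragment has been sampled. With errors, erroneous reads keep appearing as novel sequences, so the curve levels off above zero; this is the spurious complexity that error correction would reduce.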
