Is there a tool to find the exact percentage of duplication in FastQ files?
4
0
Entering edit mode
11 months ago
▴ 180

Hello,

I would like to know if there are any tools available to find the exact duplication percentage in FastQ files.

Currently, I am using FastQC. However, FastQC only gives an estimate. From the FastQC manual:

To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file.

If you have any tools in mind, I would greatly appreciate it.

sequencing fastq • 519 views
1
Entering edit mode

You can try CD-HIT and use its summary statistics. Also try seqkit rmdup with the -D option.
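For example, a minimal sketch of the seqkit route (untested; file names are placeholders, -s compares reads by sequence, -D writes the copy numbers and IDs of duplicated records):

seqkit rmdup -s -D duplicates.txt -o reads_dedup.fastq.gz reads.fastq.gz   # reads.fastq.gz is a placeholder name
# duplicates.txt lists, for each duplicated sequence, its copy number and the read IDs;
# seqkit also reports how many duplicated records were removed, which divided by the
# total read count gives a duplication percentage.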

0
Entering edit mode

PicardTools? There is the MarkDuplicates tool, which marks duplicate reads; I am not sure, but it may well write out a summary of the number of duplicates found.

EDIT: not a valid approach here, as it works on aligned BAM files, as pointed out below.

0
Entering edit mode

Thanks for the answer, but MarkDuplicates takes BAM or SAM files as input, not FastQ.

0
Entering edit mode

It should be possible to convert the FastQ file into an unaligned SAM or BAM if the alignment information itself is not used by Picard.

0
Entering edit mode

Picard uses the alignment info rather than the sequence info to calculate duplication.

0
Entering edit mode

Yep, checked it as well... scratch that from the possible approaches in this post :)

3
Entering edit mode
11 months ago
GenoMax 102k

The answer here should be clumpify.sh, again from the BBMap suite. It will add a count of how many times a particular sequence is duplicated to the fastq header. It can also de-duplicate your data and do various other things starting from fastq data. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

0
Entering edit mode

Could you please give an example of the command line to produce the counts in the fastq header?

0
Entering edit mode
clumpify.sh -Xmx10g in=file.fq out=stdout.fq dedupe addcount=t

In the output file you will get this (other sequences removed):

@M12345:751:000000000-F345F:1:1101:15835:1359 1:N:0:GATCTATC+ATGAGGCT copies=4
CCTTGGGTGGTTCAGTCAAAGAGGTAAGACCTCCAGCTGGCTCACAAGAG
+
BBBBAFA3ADBAGGGGGGGGGGHGGFG4EGHHHGHCHHCHGHHHHHHHGH

2
Entering edit mode
11 months ago
ATpoint 50k

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

bbduk.sh can deduplicate fastq files based on k-mer matching. If you simply count the number of reads before and after deduplication, you should have your answer. It probably even prints a detailed report.

0
Entering edit mode

If I understood correctly, you mean to run bbduk.sh to generate the deduplicated FastQ_dedup_file, then count the number of reads in the original FastQ_file and compute something like (number of reads in FastQ_dedup_file / number of reads in FastQ_file) * 100. Is this correct?

1
Entering edit mode

Yeah, that sounds reasonable. I have never used that tool, but I see it being recommended for deduplication here at Biostars many times.

2
Entering edit mode
11 months ago

A brute-force method (word size = 8). Good luck with that if your fastq is big.

gunzip -c input.fq.gz | \
awk '(NR%4==2) {L=length($0);W=8;for(i=1;i+W-1<=L;i++) {print substr($0,i,W);}}' | \
LC_ALL=C sort -T . | uniq -c | \
sort -nr
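Picking up the before/after counting idea from the bbduk.sh exchange above, a minimal sketch of the arithmetic (untested; reads.fastq.gz and reads_dedup.fastq.gz are placeholder names for the files before and after deduplication). Note that unique/total is the fraction of reads kept, so the duplication percentage is one minus that, times 100:

total=$(( $(zcat reads.fastq.gz | wc -l) / 4 ))        # reads before deduplication
unique=$(( $(zcat reads_dedup.fastq.gz | wc -l) / 4 )) # reads after deduplication
awk -v t="$total" -v u="$unique" 'BEGIN{printf "duplication: %.2f%%\n", (1 - u/t) * 100}'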

1
Entering edit mode
11 months ago

I'm not aware of any specific tool right now, even though one might very well exist. One spontaneous idea that might work without using too much memory, and that allows for parallel processing, is to use jellyfish, a program that efficiently counts k-mers, and set the k-mer size to the read length. This only works if the read length is constant. Then keep and count the k-mers with occurrence > 1 from the jellyfish output.
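A rough sketch of that idea (untested; assumes jellyfish 2 with support for long k-mers, a fixed read length of 150, and an uncompressed fastq; file names are placeholders and the hash size -s may need adjusting):

jellyfish count -m 150 -s 2G -t 8 -o readmers.jf reads.fastq   # k-mer size set to the read length
jellyfish dump -c -L 2 readmers.jf > duplicated_reads.txt      # keep only sequences seen more than once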

0
Entering edit mode

What if the read length is variable? You know that demultiplexing does not generate equal read lengths and that sometimes shorter reads may exist. Normally, they should be taken into account even if the percentage of those reads is fairly low, no?

1
Entering edit mode

That's why I wrote 'this works only if the read length is constant'; however, I have very rarely seen unequal read lengths in Illumina sequencing. Even if around 1% of reads were shorter, they would contribute little to the duplication count (well, at most 1% if they were all duplicated, right?).

0
Entering edit mode

Yes, you are right, thanks for the information !

0
Entering edit mode

By the way, the advantage of this 'method' is that you can determine which sequences are duplicated and their distribution, whereas with duplicate removal you only get the number and proportion of removed sequences. That number could come from a single sequence duplicated millions of times or from many sequences that are each duplicated only a few times. If that is sufficient for you, the bbduk method might be just right.
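Continuing the hypothetical jellyfish sketch above, the distribution could be summarised roughly like this (untested; readmers.jf and duplicated_reads.txt are the placeholder outputs from that sketch):

jellyfish histo readmers.jf   # column 1: copies per distinct read, column 2: number of distinct reads with that copy number
awk '{dups += $2} END {print dups, "reads belong to duplicated sequences"}' duplicated_reads.txt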