Is there a tool to find the exact percentage of duplication in FastQ files?
4
0
Entering edit mode
3.8 years ago
▴ 210

Hello,

I would like to know if there are any tools available to compute the exact duplication percentage in FastQ files.

Currently, I am using FastQC. However, FastQC only gives an estimate. From the FastQC manual:

To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file.

If you have any tools in mind, I would greatly appreciate it.

Thanks in advance!

sequencing fastq
ADD COMMENT
1
Entering edit mode

You can try CD-HIT and use its summary statistics. Also try seqkit's rmdup command with the -D option.
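
A minimal sketch of the seqkit route (file names are placeholders; -s compares reads by sequence, -D writes the copy numbers and IDs of duplicated reads):

seqkit rmdup -s -D dup_counts.txt -o dedup.fq.gz input.fq.gz

The duplication percentage then follows from summing the copy numbers in dup_counts.txt against the total read count.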

ADD REPLY
0
Entering edit mode

PicardTools? There is the MarkDuplicates tool, which marks duplicate reads; I'm not sure, but it could very well write out a summary of the number of duplicates found.

EDIT: not a valid approach here, as it works on aligned BAM files, as pointed out below.

ADD REPLY
0
Entering edit mode

Thanks for the answer, but MarkDuplicates takes BAM or SAM files as input, not FastQ.

ADD REPLY
0
Entering edit mode

It should be possible to convert the FastQ file into an unaligned SAM or BAM if the alignment information itself is not used by Picard.
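
For what it's worth, the conversion itself is straightforward with Picard's FastqToSam (a sketch; file and sample names are placeholders):

java -jar picard.jar FastqToSam FASTQ=reads.fq OUTPUT=unaligned.bam SAMPLE_NAME=sample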

ADD REPLY
0
Entering edit mode

Picard uses the alignment info rather than the sequence info to calculate duplication.

ADD REPLY
0
Entering edit mode

Yep, I checked it as well ... scratch that from the list of possible approaches in this post :)

ADD REPLY
3
Entering edit mode
3.8 years ago
GenoMax 141k

The answer here should be clumpify.sh, again from the BBMap suite. It will write into the fastq header a count of how many times each particular sequence is duplicated. It can also de-duplicate your data and do various other things starting from fastq data. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

ADD COMMENT
0
Entering edit mode

Could you please give an example of the command line to produce the counts in the fastq header?

ADD REPLY
0
Entering edit mode
$ clumpify.sh -Xmx10g in=file.fq out=stdout.fq dedupe addcount=t

In the output file you will get this (other sequences removed):

@M12345:751:000000000-F345F:1:1101:15835:1359 1:N:0:GATCTATC+ATGAGGCT copies=4
CCTTGGGTGGTTCAGTCAAAGAGGTAAGACCTCCAGCTGGCTCACAAGAG
+
BBBBAFA3ADBAGGGGGGGGGGHGGFG4EGHHHGHCHHCHGHHHHHHHGH
ADD REPLY
2
Entering edit mode
3.8 years ago
ATpoint 81k

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

bbduk.sh can deduplicate fastq files based on k-mer matching. If you simply count the number of reads before and after deduplication, you should have your answer. It probably even prints a detailed report.
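
I have not checked bbduk's exact options for this; if it turns out not to expose deduplication directly, dedupe.sh from the same BBTools suite does whole-read deduplication (a sketch, with placeholder file names):

dedupe.sh in=input.fq.gz out=dedup.fq.gz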

ADD COMMENT
0
Entering edit mode

If I understood correctly, you mean to run bbduk.sh and generate the deduplicated FastQ_dedup_file. Then, count the number of reads in the original FastQ_file and compute something like (number of reads in FastQ_dedup_file / number of reads in FastQ_file) * 100. Is this correct?
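
A sketch of that arithmetic in shell, assuming uncompressed fastq files (4 lines per read); note that the formula above gives the percentage of unique reads, so the duplication level is 100 minus that:

total=$(( $(wc -l < FastQ_file) / 4 ))
unique=$(( $(wc -l < FastQ_dedup_file) / 4 ))
awk -v t="$total" -v u="$unique" 'BEGIN{printf "duplication: %.2f%%\n", (1 - u/t) * 100}'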

ADD REPLY
1
Entering edit mode

Yeah, that sounds reasonable. I have never used that tool myself, but I see it being recommended for deduplication here at Biostars many times.

ADD REPLY
2
Entering edit mode
3.8 years ago

A brute-force method (word size = 8) that lists the most frequent 8-mers. Good luck with that if your fastq is big.

gunzip -c input.fq.gz | \
awk '(NR%4==2) {L=length($0); W=8; for(i=1; i+W-1<=L; i++) {print substr($0,i,W);}}' |\
LC_ALL=C sort -T . | uniq -c |\
sort -nr |\
head
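
The same brute-force idea applied to whole reads rather than 8-mers would be (a sketch; the same memory caveat applies):

gunzip -c input.fq.gz |\
awk '(NR%4==2)' |\
LC_ALL=C sort -T . | uniq -c |\
sort -nr |\
head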
ADD COMMENT
1
Entering edit mode
3.8 years ago
Michael 54k

I'm not aware of any specific tool right now, even though one might very well exist. One spontaneous idea that might work without using too much memory, and that allows for parallel processing, is to use jellyfish, a program to efficiently count k-mers, and set the k-mer size to the read length. This works only if the read length is constant. Then keep and count the k-mers with occurrence > 1 from the jellyfish output.
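
A minimal sketch of that idea, assuming jellyfish 2 and a constant read length of 100 bp (file names and the hash size are placeholders):

# count k-mers with k set to the read length
jellyfish count -m 100 -s 1G -t 8 -o mers.jf reads.fq
# histogram of occurrence counts; every mer with count > 1 is a duplicated read
jellyfish histo mers.jf
# dump the duplicated sequences themselves, with their counts
jellyfish dump -c -L 2 mers.jf | head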

ADD COMMENT
0
Entering edit mode

What if the read length is variable? You know that demultiplexing does not generate equal read lengths and that shorter reads may sometimes exist. Normally, they should be taken into account even if the percentage of those reads is fairly low, no?

ADD REPLY
1
Entering edit mode

That's why I wrote 'this works only if the read length is constant'. However, I have very rarely seen unequal read lengths in Illumina sequencing. Even if around 1% of reads were affected, they would contribute little to the duplication count (well, at most 1% if they were all duplicated, right?).

ADD REPLY
0
Entering edit mode

Yes, you are right, thanks for the information!

ADD REPLY
0
Entering edit mode

By the way, the advantage of this 'method' is that you can determine which sequences are duplicated and their distribution; with duplicate removal you get just the number and proportion of removed sequences. That could come from a single sequence duplicated millions of times or from many sequences duplicated only a few times. If the latter is sufficient for you, the bbduk method might be just right.

ADD REPLY
