Is there a tool to find the exact percentage of duplication in FastQ files?
4
0
Entering edit mode
11 months ago
▴ 180

Hello,

I would like to know if there are any tools available to find the exact duplication percentage in FastQ files.

Currently, I am using FastQC. However, FastQC only gives an estimate. From the FastQC manual:

To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file.

If you have any tools in mind, I would greatly appreciate it.

sequencing fastq • 519 views
1
Entering edit mode

You can try CD-HIT and use its summary statistics. Also try seqkit rmdup with the -D option.
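For example, a minimal sketch of the seqkit route (untested; file names are placeholders, -s compares reads by sequence, -D writes the copy numbers and IDs of duplicated records):

seqkit rmdup -s -D duplicates.txt -o reads_dedup.fastq.gz reads.fastq.gz   # reads.fastq.gz is a placeholder name
# duplicates.txt lists, for each duplicated sequence, its copy number and the read IDs;
# seqkit also reports how many duplicated records were removed, which divided by the
# total read count gives a duplication percentage.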

0
Entering edit mode

PicardTools? There is the MarkDuplicates tool, which marks duplicate reads; I am not sure, but it may well write out a summary of the number of duplicates found.

EDIT: not a valid approach here, as it works on aligned BAM files, as pointed out below.

0
Entering edit mode

Thanks for the answer, but MarkDuplicates takes BAM or SAM files as input, not FastQ.

0
Entering edit mode

It should be possible to convert the FastQ file into an unaligned SAM or BAM if the alignment information itself is not used by Picard.

0
Entering edit mode

Picard uses the alignment info rather than the sequence info to calculate duplication.

0
Entering edit mode

Yep, checked it as well... scratch that from the possible approaches in this post :)

3
Entering edit mode
11 months ago
GenoMax 102k

The answer here should be clumpify.sh, again from the BBMap suite. It will add a count of how many times a particular sequence is duplicated to the fastq header. It can also de-duplicate your data and do various other things starting from fastq data. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

0
Entering edit mode

Could you please give an example of the command line to produce the counts in the fastq header?

0
Entering edit mode
clumpify.sh -Xmx10g in=file.fq out=stdout.fq dedupe addcount=t

In the output file you will get this (other sequences removed):

@M12345:751:000000000-F345F:1:1101:15835:1359 1:N:0:GATCTATC+ATGAGGCT copies=4
CCTTGGGTGGTTCAGTCAAAGAGGTAAGACCTCCAGCTGGCTCACAAGAG
+
BBBBAFA3ADBAGGGGGGGGGGHGGFG4EGHHHGHCHHCHGHHHHHHHGH

2
Entering edit mode
11 months ago
ATpoint 50k

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

bbduk.sh can deduplicate fastq files based on k-mer matching. If you simply count the number of reads before and after deduplication, you should have your answer. It probably even prints a detailed report.

0
Entering edit mode

If I understood correctly, you mean to run bbduk.sh to generate the deduplicated FastQ_dedup_file, then count the number of reads in the original FastQ_file and compute something like (number of reads in FastQ_dedup_file / number of reads in FastQ_file) * 100. Is this correct?

1
Entering edit mode

Yeah, that sounds reasonable. I have never used that tool, but I see it being recommended for deduplication here at Biostars many times.

2
Entering edit mode
11 months ago

A brute-force method (word size = 8). Good luck with that if your fastq is big.

gunzip -c input.fq.gz | \
awk '(NR%4==2) {L=length($0);W=8;for(i=1;i+W-1<=L;i++) {print substr($0,i,W);}}' | \
LC_ALL=C sort -T . | uniq -c | \
sort -nr
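Picking up the before/after counting idea from the bbduk.sh exchange above, a minimal sketch of the arithmetic (untested; reads.fastq.gz and reads_dedup.fastq.gz are placeholder names for the files before and after deduplication). Note that unique/total is the fraction of reads kept, so the duplication percentage is one minus that, times 100:

total=$(( $(zcat reads.fastq.gz | wc -l) / 4 ))        # reads before deduplication
unique=$(( $(zcat reads_dedup.fastq.gz | wc -l) / 4 )) # reads after deduplication
awk -v t="$total" -v u="$unique" 'BEGIN{printf "duplication: %.2f%%\n", (1 - u/t) * 100}'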

1
Entering edit mode
11 months ago

I'm not aware of any specific tool right now, even though one might very well exist. One spontaneous idea that might work without using too much memory, and that allows for parallel processing, is to use jellyfish, a program that efficiently counts k-mers, and set the k-mer size to the read length. This only works if the read length is constant. Then keep and count the k-mers with occurrence > 1 from the jellyfish output.
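A rough sketch of that idea (untested; assumes jellyfish 2 with support for long k-mers, a fixed read length of 150, and an uncompressed fastq; file names are placeholders and the hash size -s may need adjusting):

jellyfish count -m 150 -s 2G -t 8 -o readmers.jf reads.fastq   # k-mer size set to the read length
jellyfish dump -c -L 2 readmers.jf > duplicated_reads.txt      # keep only sequences seen more than once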

0
Entering edit mode

What if the read length is variable? You know that demultiplexing does not generate equal read lengths and that sometimes shorter reads may exist. Normally, they should be taken into account even if the percentage of those reads is fairly low, no?

1
Entering edit mode

That's why I wrote 'this works only if the read length is constant'; however, I have very rarely seen unequal read lengths in Illumina sequencing. Even if around 1% of reads were shorter, they would contribute little to the duplication count (well, at most 1% if they were all duplicated, right?).

0
Entering edit mode

Yes, you are right, thanks for the information !

0
Entering edit mode

By the way, the advantage of this 'method' is that you can determine which sequences are duplicated and their distribution, whereas with duplicate removal you only get the number and proportion of removed sequences. That number could come from a single sequence duplicated millions of times or from many sequences that are each duplicated only a few times. If that is sufficient for you, the bbduk method might be just right.
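Continuing the hypothetical jellyfish sketch above, the distribution could be summarised roughly like this (untested; readmers.jf and duplicated_reads.txt are the placeholder outputs from that sketch):

jellyfish histo readmers.jf   # column 1: copies per distinct read, column 2: number of distinct reads with that copy number
awk '{dups += $2} END {print dups, "reads belong to duplicated sequences"}' duplicated_reads.txt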