Question: Is there a tool to find the exact percentage of duplication in FastQ?
乙 wrote (7 weeks ago):

Hello,

I would like to know if there are any tools available to find the exact percentage of duplication in FastQ files.

Currently, I am using FastQC. However, FastQC only gives an estimate. From the FastQC manual:

To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file.

If you have any tools in mind, I would greatly appreciate it.

Thanks in advance!

Tags: sequencing, fastq

You can try CD-HIT and use its summary statistics. Also try seqkit's rmdup function with the -D option.
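
For example, a minimal seqkit sketch (filenames are placeholders; check your seqkit version for the exact flags):

# remove duplicates by sequence (-s) and save the counts/IDs of duplicated reads (-D)
seqkit rmdup -s -D dup_counts.txt reads.fq.gz -o reads.dedup.fq.gz
# compare read counts before and after to get the duplication percentage
seqkit stats reads.fq.gz reads.dedup.fq.gz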

— cpad0112

PicardTools? There is the MarkDuplicates tool, which marks duplicate reads; I'm not sure, but it may well write out a summary of the number of duplicates found.

EDIT: not a valid approach here, as it works on aligned BAM files, as pointed out below.

— lieven.sterck

Thanks for the answer, but MarkDuplicates takes BAM or SAM files as input, not FastQ.

— 乙

It should be possible to convert the FastQ file into an unaligned SAM or BAM if the alignment information itself is not used by Picard.

— Michael Dondrup

Picard uses the alignment info rather than the sequence info to calculate duplication.

— i.sudbery

Yep, I checked it as well ... scratch that from the possible approaches in this post :)

— lieven.sterck
genomax wrote (7 weeks ago):

The answer here should be clumpify.sh, again from the BBMap suite. It will add to the fastq header a count of how many times each particular sequence is duplicated. It can also de-duplicate your data and do various other things starting from fastq data. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files


Could you please give an example of the command line to produce the counts in the fastq header?

— 乙
$ clumpify.sh -Xmx10g in=file.fq out=stdout.fq dedupe addcount=t

In the output file you will get this (other sequences removed):

@M12345:751:000000000-F345F:1:1101:15835:1359 1:N:0:GATCTATC+ATGAGGCT copies=4
CCTTGGGTGGTTCAGTCAAAGAGGTAAGACCTCCAGCTGGCTCACAAGAG
+
BBBBAFA3ADBAGGGGGGGGGGHGGFG4EGHHHGHCHHCHGHHHHHHHGH
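
If you then want a single overall percentage from those copies= annotations, an awk pass over the clumpify output along these lines should work (a sketch; deduped.fq is a placeholder, and records without a copies= tag are assumed to be single copies):

# sum copies=N over all headers to recover the original read count,
# then report the fraction of reads that were duplicates
awk 'NR%4==1 {n=1; if (match($0, /copies=[0-9]+/)) n=substr($0, RSTART+7, RLENGTH-7); total+=n; uniq++}
END {printf "reads=%d unique=%d duplication=%.2f%%\n", total, uniq, 100*(total-uniq)/total}' deduped.fq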
— genomax
ATpoint wrote (7 weeks ago):

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

bbduk.sh can deduplicate fastq files based on k-mer matching. If you simply count the number of reads before and after deduplication, you should have your answer. It probably even prints a detailed report.
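
For example, assuming uncompressed FastQ files and placeholder filenames, the before/after counting could look like this:

# a FastQ record is 4 lines, so reads = lines / 4
before=$(( $(wc -l < reads.fq) / 4 ))
after=$(( $(wc -l < reads.dedup.fq) / 4 ))
# percentage of reads removed as duplicates
awk -v b="$before" -v a="$after" 'BEGIN {printf "duplication=%.2f%%\n", 100*(b-a)/b}'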


If I understood correctly, you mean to run bbduk.sh to generate a deduplicated FastQ_dedup_file, then count the number of reads in the original FastQ_file and do something like (number of reads in FastQ_dedup_file / number of reads in FastQ_file) * 100. Is this correct?

— 乙

Yeah, that sounds reasonable. I have never used that tool, but I have seen it recommended for deduplication here at Biostars many times.

— ATpoint
Pierre Lindenbaum wrote (7 weeks ago):

Brute-force method (word size = 8). Good luck with that if your fastq is big.

# emit every 8-mer of every read (sequence = line 2 of each 4-line record),
# then count them and print the most frequent ones
gunzip -c input.fq.gz | \
awk '(NR%4==2) {L=length($0);W=8;for(i=1;i+W<=L;i++) {print substr($0,i,W);}}' |\
LC_ALL=C sort -T . | uniq -c |\
sort -nr |\
head

Michael Dondrup wrote (7 weeks ago):

I'm not aware of any specific tool right now, even though one might very well exist. One spontaneous idea that might work without using too much memory, and allows for parallel processing, is to use jellyfish, a program to efficiently count k-mers, and set the k-mer size to the read length. This works only if the read length is constant. Then keep and count the k-mers with occurrence > 1 from the jellyfish output.
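
A sketch of that idea, assuming jellyfish 2, a fixed read length of 100, and a placeholder file reads.fq:

# with k = read length, every counted k-mer is a whole read
jellyfish count -m 100 -s 1G -t 4 -o read_counts.jf reads.fq
# dump only the reads that occur more than once, together with their counts
jellyfish dump -c -L 2 read_counts.jf | head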


What if the read length is variable? Demultiplexing does not generate equal read lengths, and sometimes shorter reads may exist. Normally, they should be taken into account even if the percentage of such reads is fairly low, no?

— 乙

That's why I wrote 'this works only if the read length is constant'; however, I have very rarely seen unequal read lengths in Illumina sequencing. Even if around 1% of reads were affected, they would contribute little to the duplication count (well, at most 1% if they were all duplicated, right?).

— Michael Dondrup

Yes, you are right, thanks for the information!

— 乙

By the way, the advantage of this 'method' is that you can determine which sequences are duplicated and their distribution; with duplicate removal you get just the number and proportion of removed sequences. That number could come from a single sequence duplicated millions of times or from many sequences duplicated only a few times. If that is sufficient for you, the bbduk method might be just right.
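
For instance, continuing the jellyfish sketch above, the copy-count distribution (and an overall percentage) could be pulled from the histogram output:

# histo prints: <copy count> <number of distinct reads with that count>
jellyfish histo read_counts.jf |\
awk '{distinct+=$2; reads+=$1*$2} END {printf "reads=%d distinct=%d duplication=%.2f%%\n", reads, distinct, 100*(reads-distinct)/reads}'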

— Michael Dondrup