FastUniq deduplication only working for forward reads
2.8 years ago
Egelbets ▴ 30

I am using FastUniq to deduplicate Illumina MiSeq paired-end data, and FastQC to compare quality control (QC) reports before and after deduplication. I figured out how to use FastUniq, but for some reason it only seems to be effective on the first read of each pair, and not nearly as much on the second (which is incredibly odd given that the same number of reads is filtered out of the forward and reverse files).

My FastUniq command:

fastuniq -i list.txt -o SAMPLE_R1_dedup_1.fastq -p SAMPLE_R2_dedup_2.fastq

Where list.txt is a plain-text file listing the forward and reverse FASTQ files, as FastUniq requires.


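(The screenshot of list.txt is missing; for reference, FastUniq expects a plain-text list with the R1 FASTQ path on one line and the matching R2 path on the next. A minimal sketch, with placeholder file names:)

```python
# Build the FastUniq input list: one FASTQ path per line, R1 first,
# then R2. "SAMPLE_R1.fastq"/"SAMPLE_R2.fastq" are placeholder names,
# not the actual files from this post.
with open("list.txt", "w") as fh:
    fh.write("SAMPLE_R1.fastq\n")
    fh.write("SAMPLE_R2.fastq\n")

print(open("list.txt").read())
```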
Then, when I compare the FastQC QC reports, I see the following:

SAMPLE_R1 with duplicates:


SAMPLE_R1 deduplicated:


As you can see, this gives some really good results. However, the reverse QC reports look like this:

SAMPLE_R2 with duplicates:


SAMPLE_R2 deduplicated:


This is significantly worse than the forward (R1) read, and this exact same thing happens for all my samples.

My question is: why am I getting these results? Is something going wrong with FastUniq, since it is quite an old tool? Is FastQC giving misleading reports for the reverse (R2) reads? Or is this output to be expected?

FastUniq FastQC

You may want to give Clumpify from the BBMap suite a try: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

If you want a simpler tool, then one from the same suite should work as well.


Thank you for your tool suggestion. I have tried it with the following command (via clumpify.sh):

clumpify.sh in=SAME_SAMPLE_R1 in2=SAME_SAMPLE_R2 out=out_R1 out2=out_R2 dedupe subs=0

This resulted in the following:

SAME_SAMPLE_R1 deduplicated:


SAME_SAMPLE_R2 deduplicated:

This is definitely a better result than FastUniq's (something I also see in all my other samples), but I guess the same issue remains? Maybe this is to be expected?

I have also tried PRINSEQ, but this tool performed the worst. It resulted in QC reports that are similar to FastUniq's R2 quality report, but then for both R1 and R2.

Something that might also be interesting to note is that all the tools filter out about the same number of duplicates, with differences of maybe a couple of thousand reads between them. I should probably also mention that I do quality trimming after the deduplication, and then create QC reports with FastQC from the deduplicated and trimmed read data.

I am also planning to test SAMtools and Picard to remove/mark duplicates, but those tools work on alignment data (BAM files). My goal is to call variants with this pipeline.


I should probably also mention that I do quality trimming after the deduplication,

That would certainly be one explanation for the difference you see. You are also requiring perfect matches for both reads with clumpify. Sequencing is not perfect. If you allow for one error per read (subs=1), that may account for the difference. Read 2 is likely to have more errors.
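To illustrate why subs matters, here is a toy sketch of substitution-tolerant duplicate matching (my own illustrative logic, not clumpify's actual algorithm): with subs=0 a single sequencing error hides a duplicate, while subs=1 tolerates it.

```python
def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def is_duplicate(read_a, read_b, subs=0):
    # Illustrative only: treat two reads as duplicates when they differ
    # by at most `subs` substitutions -- the idea behind clumpify's subs=.
    return len(read_a) == len(read_b) and hamming(read_a, read_b) <= subs

r1 = "ACGTACGT"
r2 = "ACGTACGA"  # same fragment with one sequencing error in the last base

print(is_duplicate(r1, r2, subs=0))  # False: exact match required
print(is_duplicate(r1, r2, subs=1))  # True: one mismatch tolerated
```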


I have tried comparing the QC duplication graphs of deduplication without trimming against deduplication with trimming, for both subs=0 and subs=1. However, the same issue remains. The FastQC reports for the untrimmed but deduplicated reads are worse overall and show a higher duplication rate than the trimmed ones, but again with the R1-R2 inconsistency. Changing the subs parameter also didn't really do anything, so I bumped it up to subs=5, but that doesn't really change anything either.

I have also checked what FastQC outputs when I just let it analyse the raw reads without any deduplication or trimming, and in that case FastQC shows the same duplication level for both R1 and R2, so I think it's safe to say that this R1-R2 inconsistency arises during deduplication (since trimming just improves what is already there). I still have no idea what is causing this, and I also find it odd that it happens with both FastUniq and clumpify.


AFAIK the FastQC duplication module looks at the first 8000 sequences and checks for them in the remaining file (you may want to email Dr. Simon Andrews, the author of FastQC, to confirm).

Ideally this duplication marking is done after alignment in SNP calling workflows. Are you doing this just for QC upfront?


AFAIK the FastQC duplication module looks at the first 8000 sequences and checks for them in the remaining file

I definitely forgot about that. It's actually the first 100,000 (link), so that might explain something.
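The sampling idea can be sketched roughly like this (illustrative only, not FastQC's actual implementation; the function name and the way the limit is handled are my own):

```python
from collections import Counter

def duplicated_fraction(reads, track_limit=100_000):
    # Rough sketch of the idea behind FastQC's duplication module:
    # remember the first `track_limit` distinct sequences, then count
    # every later occurrence of a tracked sequence, ignoring sequences
    # first seen after the limit is reached.
    tracked = Counter()
    for seq in reads:
        if seq in tracked or len(tracked) < track_limit:
            tracked[seq] += 1
    dup = sum(1 for count in tracked.values() if count > 1)
    return dup / len(tracked)  # fraction of tracked sequences seen more than once

# Two of the three tracked sequences recur, so the estimate is 2/3.
print(duplicated_fraction(["AAAA", "CCCC", "AAAA", "GGGG", "CCCC"]))
```

Because only the sequences sampled early in the file are tracked, the estimate can differ between R1 and R2 files whose duplicates are distributed differently, even when the same read pairs were removed from both.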

But yeah, this is my first time building a workflow like this (I'm still a student), and I was just trying to get the highest-quality read data before aligning to the reference (no particular reason why). But I now think it's better to mark duplicates after aligning and let the variant caller take the duplicate marks into account.

Thank you for your help and time!
