Question: Removing PCR duplicates from .fastq without .bam alignment
Adrian Pelin (Canada) wrote 4.4 years ago:

Hello,

I have an old dataset from 2010 of paired-end Illumina 54 bp reads with a lot of PCR duplicates. The duplicate pairs are very obvious: exactly the same forward and reverse read sequences appear several times under different read names.

I know how to get rid of them using a BAM alignment/mapping, but I am interested in methods that remove them without an alignment, since I want to analyze all reads, not just those that align to the genome.

What are some available approaches that take as input fastq and output fastq?

Thank you,

Adrian

pcr duplicates fastq illumina • 7.8k views
ADD COMMENTlink modified 2.2 years ago by Brian Bushnell16k • written 4.4 years ago by Adrian Pelin2.2k

Also, PRINSEQ

ADD REPLYlink written 4.4 years ago by komal.rathi3.4k

This worked:

perl prinseq-lite.pl -fastq ~/Encephalitozoon/Eromalae/100611_s_4_1_seq_GDR-7.fastq -fastq2 ~/Encephalitozoon/Eromalae/100611_s_4_2_seq_GDR-7.fastq -phred64 -derep 1

 

ADD REPLYlink written 4.4 years ago by Adrian Pelin2.2k

Check out FastUniq

ADD REPLYlink written 4.4 years ago by lkmklsmn870

perl prinseq-lite.pl -fastq ~/Encephalitozoon/Eromalae/100611_s_4_1_seq_GDR-7.fastq -fastq2 ~/Encephalitozoon/Eromalae/100611_s_4_2_seq_GDR-7.fastq -phred64 -derep 1

That's a bit odd that the max is 1000 pairs.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Adrian Pelin2.2k

Just for the record: FastUniq cannot account for sequencing errors (which can be a strong limitation). Here is a quote from the authors' article (Xu _et al._, 2012):

There were some differences in levels of duplicates identified by FastUniq and Picard Markduplicates that were caused by the different criteria in read pair comparisons (Figure 3A, Table 1). Of them, FastUniq compares read pairs on the basis of sequences only, and it is sensitive to SNPs caused by heterozygous or sequencing errors.

ADD REPLYlink modified 18 months ago • written 18 months ago by Charles Plessy2.6k
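
To make the limitation concrete, here is a minimal Python sketch. It is not taken from FastUniq or Picard; the function names and the `max_mismatches` parameter are hypothetical and only illustrate the idea. With exact sequence comparison, a single miscalled base is enough for a true duplicate pair to be kept as unique.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_duplicate(pair_a, pair_b, max_mismatches=0):
    """Compare two (forward, reverse) sequence pairs.

    max_mismatches=0 is exact matching on sequence only; a larger value
    tolerates that many mismatches summed over both mates
    (hypothetical parameter, for illustration).
    """
    mismatches = hamming(pair_a[0], pair_b[0]) + hamming(pair_a[1], pair_b[1])
    return mismatches <= max_mismatches


original = ("ACGTACGT", "TTGCAAGC")
with_error = ("ACGTACGA", "TTGCAAGC")  # last base of the forward read differs

print(is_duplicate(original, with_error))                    # False
print(is_duplicate(original, with_error, max_mismatches=1))  # True
```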

Hi, do you know of any tools with the same function written in Python?

ADD REPLYlink written 2.2 years ago by kaixian1100
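
For the exact-duplicate case described in the question (identical forward and reverse sequences under different read names), a plain-Python approach could look like the sketch below. It is not how PRINSEQ, FastUniq, or Clumpify are implemented, and the names `read_fastq` and `dedupe_pairs` are made up for this example. It assumes standard four-line FASTQ records, both files listing mates in the same order, and enough memory to hold one 16-byte digest per unique pair.

```python
import gzip
import hashlib
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, plus, quality) tuples from a four-line FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if not record:
                return  # end of file
            if len(record) < 4:
                raise ValueError(f"truncated FASTQ record in {path}")
            yield tuple(record)


def dedupe_pairs(in1, in2, out1, out2):
    """Keep the first occurrence of each exact (forward, reverse) sequence pair.

    Read names and qualities are ignored when comparing.
    Returns (pairs_read, pairs_kept).
    """
    seen = set()
    total = kept = 0
    with open(out1, "w") as o1, open(out2, "w") as o2:
        for r1, r2 in zip(read_fastq(in1), read_fastq(in2)):
            total += 1
            # Store a digest of the pair instead of the sequences to save memory.
            key = hashlib.md5((r1[1] + "\t" + r2[1]).encode()).digest()
            if key in seen:
                continue
            seen.add(key)
            kept += 1
            o1.write("\n".join(r1) + "\n")
            o2.write("\n".join(r2) + "\n")
    return total, kept
```

A call such as `dedupe_pairs("reads_1.fastq", "reads_2.fastq", "dedup_1.fastq", "dedup_2.fastq")` (placeholder file names) writes the deduplicated pairs and returns the counts. Like any sequence-only comparison, this does not collapse duplicates that differ by a sequencing error.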
Brian Bushnell (Walnut Creek, USA) wrote 2.2 years ago:

Clumpify can mark or remove duplicate reads very efficiently without alignment:

clumpify.sh in=reads.fq out=deduped.fq dedupe

ADD COMMENTlink written 2.2 years ago by Brian Bushnell16k