Question

Counting number of duplicated reads from fastq file

0

10.4 years ago

mike ▴ 90

Hello,

Which tools or methods are available for counting or extracting duplicated reads from FASTQ files with paired reads? I have checked various tools, but they can only remove duplicated reads.

Thanks

NGS • 9.0k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by mike ▴ 90

Ram · Answer 1 · 2015-02-02

2

10.4 years ago

Brian Bushnell 20k

Hi Mike,

If you want to deduplicate raw paired fastq files, I recommend trying dedupe.sh from the BBMap package. You can run it like this:

dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f

That will also print the exact number of duplicates removed.
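If you only need a count and want to sanity-check it without any extra tools, the idea can be sketched in a few lines of Python. This is not how dedupe.sh works internally; it is a minimal sketch that assumes four-line FASTQ records, mates in the same order in both files, and exact sequence matches only (no reverse-complement or mismatch handling). The function names are my own, chosen for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    return (line.rstrip("\n") for line in islice(lines, 1, None, 4))


def count_duplicate_pairs(r1_lines, r2_lines):
    """Return (total_pairs, duplicate_pairs) using exact sequence matches.

    A pair is a duplicate when both mates are identical to those of a pair
    seen earlier, so n identical copies contribute n - 1 duplicates.
    """
    counts = Counter(zip(sequences(r1_lines), sequences(r2_lines)))
    total = sum(counts.values())
    return total, total - len(counts)


# Usage with files (read1.fq / read2.fq are the file names from the command above):
# with open_fastq("read1.fq") as r1, open_fastq("read2.fq") as r2:
#     total, dups = count_duplicate_pairs(r1, r2)
```

All distinct pairs are held in memory, so for large runs a dedicated tool is still the practical choice.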

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 10.4 years ago by Brian Bushnell 20k

Ram · Answer 2 · 2015-02-02

1

10.4 years ago

iraun 6.2k

You can remove duplicates (using Picard, samtools, or whatever) and then count how many reads are missing from the deduplicated file, no?

ADD COMMENT • link 10.4 years ago by iraun 6.2k

0

If you don't want to remove duplicates, you can also use samtools flags:

samtools view -c -f 1024 file.bam
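For anyone wondering what the 1024 means: it is the "PCR or optical duplicate" bit (0x400) of the SAM FLAG field, and it is only set if a duplicate-marking tool has already been run on the file. A minimal sketch of the same check on SAM text lines, assuming tab-separated records such as those printed by samtools view (the function name is hypothetical):

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" bit in the SAM FLAG field


def count_flagged_duplicates(sam_lines):
    """Count alignment records whose FLAG field has the duplicate bit set."""
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # header lines, present when the header is included
            continue
        fields = line.split("\t")
        # Field 2 of every alignment record is the integer FLAG.
        if len(fields) > 1 and int(fields[1]) & DUPLICATE_FLAG:
            count += 1
    return count
```

A FLAG of 1123 (1024 + 99) is counted, while 99 alone is not.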

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.4 years ago by iraun 6.2k

0

These tools are for aligned bam files.

ADD REPLY • link 10.4 years ago by mike ▴ 90

0

Oh, sorry, my mistake. In that case, I would suggest mapping first; in my opinion, it is much better to map first and remove duplicates afterwards. But if you want to remove duplicates first, you could try fastx_collapser, remove the duplicates, and count how many reads you have lost.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.4 years ago by iraun 6.2k

2

It should be noted that fastx_collapser only works on single-end reads (this is pretty common for tools like this).

ADD REPLY • link 10.4 years ago by Devon Ryan 105k

0

Okay thanks I will give it a shot.

ADD REPLY • link 10.4 years ago by mike ▴ 90

0

The FASTX-Toolkit is not designed for paired-end reads, if I am not mistaken.

ADD REPLY • link 10.4 years ago by mike ▴ 90

0

Unless you want to use these for assembly, it's generally fast enough to just align and remove/mark duplicates from the resulting BAM file.

ADD REPLY • link 10.4 years ago by Devon Ryan 105k

score 0 · Answer 3 · 2017-01-13

0

8.5 years ago

kaixian110 • 0

Hi, have you found any solution for extracting duplicated reads from paired FASTQ files?

ADD COMMENT • link 8.5 years ago by kaixian110 • 0

0

BBMap's dedupe program has an "outd" flag that will capture duplicate reads:

dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f outd=dupes.fq

Alternatively, you can use Clumpify:

clumpify.sh in=reads.fq out=clumped.fq markduplicates allduplicates

This command assumes paired reads are interleaved in a single file, although the upcoming release supports paired reads in twin files. The "allduplicates" flag will mark all copies as duplicates; if you remove that, all but one copy will be marked as duplicates (which is probably better for most purposes). The "optical" flag will mark only optical duplicates (rather than, say, PCR duplicates). Anyway, "clumped.fq" will contain all of the reads, but the duplicates will be marked with " duplicate". So you can then separate them like this:

filterbyname.sh in=clumped.fq out=dupes.fq include=t names=duplicate substring
filterbyname.sh in=clumped.fq out=unique.fq include=f names=duplicate substring
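To make the separation step concrete, here is a rough Python sketch of the same idea: split records by whether the header line contains the " duplicate" marker. It is not filterbyname.sh's implementation; it assumes four-line FASTQ records, treats every record independently (so the two mates of an interleaved pair are not forced into the same output), and uses function names made up for this example.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as 4-tuples of lines (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_by_marker(lines, marker=" duplicate"):
    """Return (marked, unmarked) lists of records, judged by the header line."""
    marked, unmarked = [], []
    for record in fastq_records(lines):
        (marked if marker in record[0] else unmarked).append(record)
    return marked, unmarked
```

The number of duplicates is then simply the length of the first list. Both lists are kept in memory, which is fine for a quick check but not for a full sequencing run.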

ADD REPLY • link 8.5 years ago by Brian Bushnell 20k

1

One can easily get interleaved data files for clumpify.sh by using another tool from BBMap: reformat.sh in1=R1.fq.gz in2=R2.fq.gz out=int.fq.gz.
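In case "interleaved" is unclear: it just means record 1 of R1, then record 1 of R2, then record 2 of R1, and so on, in a single file. A minimal Python sketch of that layout, assuming four-line records and both files in the same order (this is only an illustration, not what reformat.sh does internally, and the function name is my own):

```python
from itertools import islice


def interleave(r1_lines, r2_lines):
    """Yield the lines of an interleaved FASTQ: R1 record, R2 record, R1 record, ..."""
    r1, r2 = iter(r1_lines), iter(r2_lines)
    while True:
        rec1 = list(islice(r1, 4))
        rec2 = list(islice(r2, 4))
        if not rec1 and not rec2:
            return
        if len(rec1) != 4 or len(rec2) != 4:
            raise ValueError("R1 and R2 differ in length or contain a truncated record")
        yield from rec1
        yield from rec2
```

Writing the result out is then a matter of passing the generator to `writelines` on an open output file.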

ADD REPLY • link 8.5 years ago by GenoMax 152k