Question: Counting number of duplicated reads from fastq file
mike (Germany) wrote 4.2 years ago:

Hello,

Which tool or method is available for counting or extracting the number of duplicated reads from fastq files with paired reads? I have checked various tools, but they can only remove duplicated reads.

Thanks,

 

ngs • 3.9k views
(modified 2.3 years ago by kaixian110 • written 4.2 years ago by mike)
Brian Bushnell (Walnut Creek, USA) wrote 4.2 years ago:

Hi Mike,

If you want to deduplicate raw paired fastq files, I recommend trying dedupe.sh from the BBMap package.  You can run it like this:

dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f

That will also print the exact number of duplicates removed.

(modified 3.9 years ago by Sukhdeep Singh • written 4.2 years ago by Brian Bushnell)
iraun (Norway) wrote 4.2 years ago:

You can remove duplicates (using picard, samtools or whatever) and then count how many reads are missing from the de-dupped file, no?
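The counting idea can be sketched with standard tools: tally the reads before and after deduplication and take the difference. A minimal sketch, assuming plain uncompressed fastq (exactly 4 lines per record); the toy files below stand in for real data:

```shell
# Toy stand-ins for a raw file and its deduplicated counterpart.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > before.fq  # 2 reads
printf '@r1\nACGT\n+\nIIII\n' > after.fq                       # 1 read left after dedup
# Each fastq record is 4 lines, so read count = line count / 4.
before=$(( $(wc -l < before.fq) / 4 ))
after=$(( $(wc -l < after.fq) / 4 ))
echo "duplicates removed: $(( before - after ))"   # prints: duplicates removed: 1
```

For gzipped input, replace `wc -l < file` with `zcat file | wc -l`.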

(written 4.2 years ago by iraun)

If you don't want to remove duplicates, you can also use samtools flags:
samtools view -c -f 1024 file.bam
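Note that `-f 1024` selects reads whose SAM flag has the 0x400 ("PCR or optical duplicate") bit set, so this only counts reads that an upstream tool (e.g. Picard MarkDuplicates) has already flagged. As a sketch of what is being selected, here is the same test done with portable awk on a toy SAM file (toy data, not part of the thread):

```shell
# Toy two-read SAM body (no header); flag 1024 = duplicate bit 0x400 set.
printf 'r1\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  > toy.sam
printf 'r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\n'    >> toy.sam
# POSIX awk has no bitwise ops, so test the 1024s bit arithmetically.
dup_count=$(awk '!/^@/ { if (int($2 / 1024) % 2 == 1) n++ } END { print n + 0 }' toy.sam)
echo "$dup_count"   # prints: 1
```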

(modified 4.2 years ago • written 4.2 years ago by iraun)

These tools are for aligned bam files.

(written 4.2 years ago by mike)

Oh, sorry, my mistake. In that case, I would suggest mapping first; in my opinion it is much better to map first and remove duplicates afterwards. But if you want to remove duplicates directly from the fastq files, you could try fastx_collapser (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_collapser_usage), remove the duplicates, and count how many reads you have lost.
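fastx_collapser collapses identical reads into one fasta record and, as far as I recall, encodes the copy count in the header (e.g. ">1-3" for three copies); if that holds for your version, the duplicate count can be read straight off the headers. A hedged sketch, with a toy file standing in for real collapsed output:

```shell
# Assumes fastx_collapser-style headers ">index-count" (verify against
# your version's output). Toy stand-in: one sequence seen 3 times, one once.
printf '>1-3\nACGT\n>2-1\nTTTT\n' > collapsed.fa
# Sum (count - 1) over all records = number of duplicate copies removed.
dups=$(awk -F'-' '/^>/ { total += $2 - 1 } END { print total + 0 }' collapsed.fa)
echo "$dups"   # prints: 2
```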

(written 4.2 years ago by iraun)

It should be noted that fastx_collapser only works on single-end reads (this is pretty common for tools like this).

(written 4.2 years ago by Devon Ryan)

Okay thanks I will give it a shot.

(written 4.2 years ago by mike)

The FASTX-Toolkit is not designed for paired-end reads, if I am not wrong.

(written 4.2 years ago by mike)

Unless you want to use these for assembly, it's generally fast enough to just align and remove/mark duplicates from the resulting BAM file.

(written 4.2 years ago by Devon Ryan)
kaixian110 wrote 2.3 years ago:

Hi, have you found any solution to extract duplicated reads from paired fastq files?

(written 2.3 years ago by kaixian110)

BBMap's dedupe program has an "outd" flag that will capture duplicate reads:

dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f outd=dupes.fq

Alternatively, you can use Clumpify:

clumpify.sh in=reads.fq out=clumped.fq markduplicates allduplicates

This command assumes paired reads are interleaved in a single file, although the upcoming release supports paired reads in twin files. The "allduplicates" flag will mark all copies as duplicates; if you remove that, all but one copy will be marked as duplicates (which is probably better for most purposes). The "optical" flag will mark only optical duplicates (rather than, say, PCR duplicates). Anyway, "clumped.fq" will contain all of the reads, but the duplicates will be marked with " duplicate". So you can then separate them like this:

filterbyname.sh in=clumped.fq out=dupes.fq include=t names=duplicate substring
filterbyname.sh in=clumped.fq out=unique.fq include=f names=duplicate substring
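Since clumpify.sh appends " duplicate" to the headers of marked reads, a plain grep also gives a quick count without splitting the file. A minimal sketch, with a toy file standing in for real clumpify output:

```shell
# Toy stand-in for clumpify.sh output: one of two reads marked as duplicate.
printf '@r1 duplicate\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > clumped.fq
# Only header lines can end in " duplicate", so a line count suffices.
n_dup=$(grep -c ' duplicate$' clumped.fq)
echo "$n_dup"   # prints: 1
```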
(written 2.3 years ago by Brian Bushnell)

One can easily get interleaved data files for clumpify.sh by using another tool from BBMap:

reformat.sh in1=R1.fq.gz in2=R2.fq.gz out=int.fq.gz
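For uncompressed input, the same interleaving can also be sketched with standard tools, assuming strict 4-line fastq records; shown here on toy one-read pairs:

```shell
# Toy paired files, one read each (4-line fastq records).
printf '@r1/1\nAAAA\n+\nIIII\n' > R1.fq
printf '@r1/2\nTTTT\n+\nIIII\n' > R2.fq
# Flatten each record to one tab-separated line, zip the two files
# record by record, then expand the tabs back into newlines.
paste - - - - < R1.fq > r1.flat
paste - - - - < R2.fq > r2.flat
paste -d '\n' r1.flat r2.flat | tr '\t' '\n' > int.fq
```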

(written 2.3 years ago by genomax)
Powered by Biostar version 2.3.0