Question: Duplicate Removal Of New Sequence Data?
6.4 years ago by Gabriel R. 2.6k
Center for Geogenetik, Københavns Universitet
Gabriel R. wrote:

I have reads from a eukaryotic genome, and there are duplicates due to sequencing. In a traditional environment, I would align them, then mark and remove duplicates, but here I have no reference.

I am wondering: is there any software that removes duplicates from raw sequence data? What is your experience with it?

Sorry in advance if the question is naive.

duplicates assembly • 1.9k views
modified 6.1 years ago by Wrf 210 • written 6.4 years ago by Gabriel R. 2.6k

fastx_collapser from the FASTX-Toolkit works on raw read data: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_collapser_usage However, I am not sure which kind of duplicates you want to remove. Completely identical reads?

modified 6.4 years ago • written 6.4 years ago by Biomonika (Noolean) 3.0k
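To make "completely identical reads" concrete, here is a minimal Python sketch of collapsing exact duplicates without a reference. It is not fastx_collapser's implementation; the function names and the rank-count FASTA header are conventions assumed for this illustration only, and the input is assumed to be plain 4-line FASTQ.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record (assumed layout)."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def collapse_exact(sequences: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (sequence, count) pairs for identical reads, most abundant first."""
    counts = Counter(sequences)
    # Sort by descending count, then by sequence, so the order is reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_fasta(collapsed: List[Tuple[str, int]]) -> str:
    """One FASTA record per unique sequence; header is rank-count (our convention)."""
    return "".join(
        f">{rank}-{count}\n{seq}\n"
        for rank, (seq, count) in enumerate(collapsed, start=1)
    )
```

Note that this keeps every unique sequence in memory, so for a full sequencing run the memory footprint grows with the number of distinct reads.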

Hmm, maybe tolerate 1 mismatch, but then a collapsed read would have to be a consensus of the two (or more) reads. How does this tool scale to, say, a full HiSeq run or more?

written 6.4 years ago by Gabriel R. 2.6k
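The "tolerate 1 mismatch and take a consensus" idea above can be sketched as follows. This is a hypothetical illustration, not a feature of any tool named in this thread. It assumes equal-length reads are compared by Hamming distance, and it relies on the observation that two reads differing at only one position must share an identical left or right half, so only reads found under one of those two keys need a full comparison. Clustering is greedy against the first read of each cluster, so two members of the same cluster may differ from each other at two positions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def hamming_at_most(a: str, b: str, limit: int) -> bool:
    """True if equal-length a and b differ at no more than `limit` positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True


def cluster_one_mismatch(reads: Iterable[str]) -> List[List[str]]:
    """Greedily group reads within one mismatch of a cluster's first read."""
    clusters: List[List[str]] = []
    # Maps (length, side, half-sequence) to ids of clusters whose first read has that half.
    index: Dict[Tuple[int, str, str], List[int]] = defaultdict(list)
    for read in reads:
        half = len(read) // 2
        keys = [(len(read), "L", read[:half]), (len(read), "R", read[half:])]
        hit: Optional[int] = None
        for key in keys:
            for cid in index.get(key, ()):
                if hamming_at_most(read, clusters[cid][0], 1):
                    hit = cid
                    break
            if hit is not None:
                break
        if hit is None:
            # No close representative found: start a new cluster and index it.
            clusters.append([read])
            for key in keys:
                index[key].append(len(clusters) - 1)
        else:
            clusters[hit].append(read)
    return clusters


def consensus(cluster: List[str]) -> str:
    """Majority base per column; ties go to the base seen first in that column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
```

The half-sequence index avoids all-against-all comparison, but memory still grows with the number of clusters, so whether this is practical for a full run would need to be measured.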

How are you going to process the data? If you assemble the reads, most assemblers will take care of the duplicates.

modified 6.4 years ago • written 6.4 years ago by lh3 31k

Hi Heng, we are using SOAPdenovo. The Panda Genome authors state: "The redundant reads were filtered at a threshold of euclid distance <= 3 and a mismatch rate of <= 0.1. We observed that the average rate of base-calling duplicates for each lane was about 0.83%, ranging from 0.00% to 8.52%." But they did that using an in-house pipeline, so I was wondering whether that is a procedure one should use.

written 6.4 years ago by Gabriel R. 2.6k
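The quoted filter is not explained further in this thread. One possible reading, and it is only an assumption here, is that "euclid distance" means the distance between the x/y cluster coordinates of two reads on the same lane and tile, and "mismatch rate" means the fraction of differing bases. Under that assumption, a minimal sketch could look like this; the `Read` record and all names are hypothetical, and this is not the in-house pipeline mentioned above.

```python
from math import hypot
from typing import List, NamedTuple


class Read(NamedTuple):
    """Hypothetical record: flow-cell position plus sequence."""
    lane: int
    tile: int
    x: float
    y: float
    seq: str


def mismatch_rate(a: str, b: str) -> float:
    """Fraction of differing positions over the shared length of a and b."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(1 for p, q in zip(a, b) if p != q) / n


def filter_redundant(
    reads: List[Read], max_dist: float = 3.0, max_mm: float = 0.1
) -> List[Read]:
    """Keep a read unless an already kept read on the same lane and tile is
    both physically close (Euclidean distance) and similar in sequence."""
    kept: List[Read] = []
    for r in reads:
        duplicate = any(
            k.lane == r.lane
            and k.tile == r.tile
            and hypot(k.x - r.x, k.y - r.y) <= max_dist
            and mismatch_rate(k.seq, r.seq) <= max_mm
            for k in kept
        )
        if not duplicate:
            kept.append(r)
    return kept
```

The comparison against all kept reads is quadratic and only meant to show the criterion; a real implementation would bin reads by tile and coordinate first.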

I do not know what that threshold is used for, probably for SNP calling. For de novo assembly, it does not matter too much whether the duplicate rate is high.

written 6.4 years ago by lh3 31k
6.4 years ago by Raygozak 1.3k
State College, PA, Penn State
Raygozak wrote:

I also recommend PRINSEQ lite. It is a very nice tool that generates statistics about your reads and has a mode to filter duplicates using four different criteria, trim bases, and remove reads below a given mean quality. It has more options, of course.

modified 6.4 years ago • written 6.4 years ago by Raygozak 1.3k
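The answer above does not spell out the duplicate criteria, so the following is only an illustrative Python sketch of two of the operations it mentions: dropping reads below a mean quality and filtering duplicates under more than one definition (here, exact duplicates and, optionally, reverse-complement duplicates). It is not PRINSEQ's code; the function names, the default threshold of 20, and the Phred+33 quality encoding are assumptions made for this example.

```python
from typing import Iterable, Iterator, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

Record = Tuple[str, str, str]  # (name, sequence, quality string)


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(
    records: Iterable[Record],
    min_mean_q: float = 20.0,
    revcomp_aware: bool = True,
) -> Iterator[Record]:
    """Drop low-quality reads, then keep the first read of each duplicate group."""
    seen: Set[str] = set()
    for name, seq, qual in records:
        if mean_quality(qual) < min_mean_q:
            continue
        key = seq.upper()
        if revcomp_aware:
            # A read and its reverse complement map to the same key.
            key = min(key, reverse_complement(key))
        if key in seen:
            continue
        seen.add(key)
        yield name, seq, qual
```

With `revcomp_aware=False` only exact duplicates are removed, which matches the stricter definition of a duplicate discussed earlier in the thread.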
6.1 years ago by Wrf 210
Wrf wrote:

Not quite sure why you would want to remove duplicates from the raw data, but you could try "sequniq" in the GenomeTools package. I never tried it on millions of raw reads, but for contigs it is quite fast.

written 6.1 years ago by Wrf 210