Question: Remove specific number of identical reads from fastq or bam files
4 months ago, florian.noack (20) wrote:

Hi, I am currently dealing with some ChIP-seq data generated from a very low number of cells. The data look good so far, but I noticed that some loci got heavily amplified during library preparation, which I guess is a consequence of working with low amounts of material. I am now looking for a tool to restrict the number of identical reads per locus to, for example, 3 (e.g. if I have 10 identical reads, 7 will be removed and 3 remain). As far as I have read, both Picard tools and samtools remove duplicates in an all-or-nothing manner. Does somebody have a handy solution for me (I am a biologist :p)?

Thanks, Flo

chip-seq duplicates • 165 views

I am not immediately aware of such a tool. What is special about the requirement of leaving three reads instead of just one? You could use Clumpify (see the post "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."), which has an option to add a count field to the fastq header after deduplicating the data. You could use that field to keep track of how many duplicates were there originally.

modified 4 months ago • written 4 months ago by genomax (64k)

Because it's ChIP-seq, and I would expect to have some duplicates simply because we greatly reduce genomic complexity, especially when using just a few cells (with an additional loss of complexity from losing some DNA fragments after shearing). I am not sure which exact number I will allow later; this is just to play around a bit, but removing all of them may be too harsh in my case.

written 4 months ago by florian.noack (20)

prinseq can remove duplicated sequences. If you have high levels of read duplication, you may consider removing them; if not, I think that using arbitrary filters may strongly bias the analysis.

written 4 months ago by Buffo (1.4k)
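
For anyone who wants to experiment with the cap described in the question (keep at most N reads per position instead of all or nothing), here is a minimal Python sketch. It is not taken from any of the tools mentioned above. The function name `cap_duplicates` and the `max_copies` parameter are made up for illustration, and it assumes single-end reads supplied as SAM text lines, treating reads with the same reference name, leftmost position and strand as "identical". That is a simplification: dedicated duplicate markers usually compare the 5' end of the read and handle pairs.

```python
from collections import Counter


def cap_duplicates(sam_lines, max_copies=3):
    """Yield SAM lines, keeping at most `max_copies` reads per locus.

    A locus is (reference name, leftmost position, strand). This is a
    simplified, hypothetical definition for single-end data.
    """
    seen = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines pass through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            # Unmapped reads have no locus, so they are kept as they are.
            yield line
            continue
        strand = "-" if flag & 0x10 else "+"
        key = (fields[2], int(fields[3]), strand)
        seen[key] += 1
        if seen[key] <= max_copies:
            yield line


# Small in-memory demo: 10 identical forward reads plus 1 reverse read.
demo = ["@HD\tVN:1.6\n"]
demo += [f"r{i}\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n" for i in range(10)]
demo += ["r10\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"]
kept = list(cap_duplicates(demo, max_copies=3))
print(len(kept))
```

The counter keeps one entry per distinct locus, so memory grows with the number of covered positions rather than the number of reads. Which of the duplicates survive is simply decided by input order; if you would rather keep the best-quality copies, the input would need to be sorted accordingly first.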