Question: Remove specific number of identical reads from fastq or bam files
13 months ago by
florian.noack20 wrote:

Hi, I am currently dealing with some ChIP-seq data generated from a very low number of cells. The data look good so far, but I noticed that some loci got heavily amplified during library preparation, which I guess is a consequence of working with low amounts of material. I am now looking for a tool to restrict the number of identical reads per locus to, for example, 3 (e.g. if I have 10 identical reads, 7 will be removed and 3 remain). As far as I have read, both Picard tools and samtools remove duplicates in an all-or-nothing manner. Does somebody have a handy solution for me (I am a biologist :p)?

Thanks, Flo

chip-seq duplicates • 269 views

I am not immediately aware of such a tool. What is special about the requirement of leaving three reads instead of just one? You could use Clumpify (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."), which has an option to add a count field to the FASTQ header after deduplicating the data. You could use that field to keep track of how many duplicates were there originally.
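If the cap has to be applied directly to the FASTQ, a minimal sketch in Python could look like the following. It treats reads as identical only when their sequences match exactly, and it keeps every distinct sequence in memory, so it is an illustration rather than a tested tool. The function name `cap_identical_reads` and the `max_copies` parameter are made up for this example.

```python
from collections import Counter
from itertools import islice


def cap_identical_reads(fastq_lines, max_copies=3):
    """Yield 4-line FASTQ records, keeping at most max_copies per identical sequence.

    Assumption: plain 4-line FASTQ records, single-end, exact sequence match.
    """
    seen = Counter()
    it = iter(fastq_lines)
    while True:
        record = list(islice(it, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break  # end of input (or a truncated trailing record)
        seq = record[1].strip()
        seen[seq] += 1
        if seen[seq] <= max_copies:
            yield record
```

The first `max_copies` occurrences of each sequence are kept in input order and later ones are dropped, so 10 identical reads become 3. The kept records can be written back out with something like `out.writelines(line for rec in cap_identical_reads(fh) for line in rec)`.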

modified 13 months ago • written 13 months ago by genomax75k

Because it's ChIP-seq, and I would expect to have some duplicates simply because we drastically reduce genomic complexity, especially when using just a few cells (with an additional loss of complexity from losing some DNA fragments after shearing). I am not sure which exact number I will allow later; it's just to play around a bit, but removing all of them may be too harsh in my case.
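For playing around with different limits on aligned data, one rough sketch is to cap reads per locus on SAM text (for example the output of `samtools view -h`). This assumes single-end reads and defines a "locus" as chromosome, leftmost mapping position, and strand. That is a simplification, since dedicated duplicate markers work from the 5' end and account for clipping. The function name and `max_copies` parameter are hypothetical.

```python
from collections import Counter


def cap_reads_per_locus(sam_lines, max_copies=3):
    """Yield SAM lines, keeping at most max_copies mapped reads per locus.

    Assumption: single-end reads; locus = (chrom, leftmost POS, strand).
    """
    seen = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:
            yield line  # unmapped reads have no locus, so they are kept
            continue
        strand = "-" if flag & 0x10 else "+"
        key = (chrom, pos, strand)
        seen[key] += 1
        if seen[key] <= max_copies:
            yield line
```

Changing `max_copies` makes it easy to compare, say, 1, 3, and 5 allowed copies on the same alignment. Note that which copies survive depends only on input order, not on base or mapping quality.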

written 13 months ago by florian.noack20

prinseq can remove duplicated sequences. If you have high levels of read duplication you may consider removing them; if not, I think that using arbitrary filters may lead to a heavily biased analysis.

written 13 months ago by Buffo1.7k