Question: Did you remove ChIP-seq duplicates
mikysyc2016 wrote:

Hi, when you analyze ChIP-seq data (FASTQ files), do you remove duplicates from the data? Which software and command do you use? Thanks!

written 15 months ago by mikysyc2016
i.sudbery wrote:

We always remove duplicates from ChIP-seq data. If your sequencing is paired-end, you'll want to do this in a paired-end-aware manner. Normally this is done after mapping. We use MarkDuplicates from Picard for ChIP-seq. samtools also has rmdup. We use Picard because back in the day MarkDuplicates was more intelligent than rmdup about how it detected duplicates, but I don't know if that is still true. If you are using MACS for your peak calling, you'll want to mark duplicates rather than remove them.

written 15 months ago by i.sudbery

As per Ian, for ChIP-seq, I have also always marked PCR / optical duplicates with Picard MarkDuplicates. You can then literally eliminate them from the BAM files with SAMtools:

# Identify and mark duplicates, then index the new BAM
# (this is the older one-jar-per-tool Picard syntax; newer releases use: java -jar picard.jar MarkDuplicates ...)
java -jar MarkDuplicates.jar INPUT=Aligned_Sorted.bam OUTPUT=Aligned_Sorted_PCRDupes.bam ASSUME_SORTED=true METRICS_FILE=Aligned_Sorted_PCRDupes.txt VALIDATION_STRINGENCY=SILENT
samtools index Aligned_Sorted_PCRDupes.bam

# Remove reads marked as duplicates (-F 0x400 excludes SAM flag 1024), then index the new BAM
samtools view -b -F 0x400 Aligned_Sorted_PCRDupes.bam > Aligned_Sorted_PCRDuped.bam
samtools index Aligned_Sorted_PCRDuped.bam
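
To make the -F 0x400 filter concrete: 0x400 is decimal 1024, the SAM flag bit that marks a read as a PCR or optical duplicate. Below is a minimal sketch of the bit test that the filter applies to each read's FLAG field. The helper name is_duplicate is hypothetical and not part of samtools.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of samtools): succeeds if the SAM FLAG value
# given as $1 has the duplicate bit (0x400 = 1024) set.
is_duplicate() {
    local flag=$1
    (( (flag & 0x400) != 0 ))
}

# FLAG 99 is a properly paired first-in-pair read (1 + 2 + 32 + 64).
# FLAG 1123 is the same read with the duplicate bit added (99 + 1024).
for flag in 99 1123; do
    if is_duplicate "$flag"; then
        echo "$flag: marked duplicate, dropped by -F 0x400"
    else
        echo "$flag: not marked, kept by -F 0x400"
    fi
done
```

On a real file, samtools view -c -f 0x400 Aligned_Sorted_PCRDupes.bam should report how many reads carry this bit, which is a quick way to see what the removal step would discard.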

As always, however, each experiment is unique and has its own intricacies. It may not, therefore, always be appropriate to eliminate reads that are identified as duplicates.

modified 11 months ago • written 15 months ago by Kevin Blighe

How do you distinguish PCR duplicates from "biological" duplicates? You could lose 96% of your reads, which is a really harsh filter. In a whole-genome analysis, OK, you can filter out duplicates because the probability of sequencing the same read twice is very low, but in amplicon sequencing or ChIP-seq this probability is very high.

written 15 months ago by Bastien Hervé

Amplicon sequencing is very different to ChIP-seq. In ChIP-seq one would expect a protein to bind to thousands of locations. Also, ChIP-seq doesn't return the precise location, so the binding site could be anywhere within a fragment. For a 300bp fragment, that gives 300 different fragments for a single site. Then account for the fact that fragments aren't a fixed size. Let's say your fragments are 250-300bp long. That gives you 15,000 possible read pairs for a single binding site (300 positions x 50 fragment lengths). Now realise that a ChIP-seq peak probably contains more than one binding site, so you could be talking about 30,000 possible read pairs per peak. Across thousands of peaks, let's say 10,000, that gives you 300 million possible read pairs. Now note that on average only around 10% of reads for ChIP-seq experiments fall into peaks. So there would be 3 billion possible unique read pairs in a ChIP-seq experiment for a factor with 10,000 binding clusters using 2x75bp reads with a 250-300bp fragment size.
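
The back-of-the-envelope estimate above can be reproduced with shell arithmetic. This is only a sketch: every input is one of the rough figures from the paragraph, and taking "more than one binding site" per peak as exactly two is an assumption made so that the numbers match.

```shell
#!/usr/bin/env bash
# All inputs are the rough figures from the estimate above, not measurements.
positions_per_site=300      # offsets at which a ~300bp fragment can cover one site
fragment_lengths=50         # distinct fragment lengths between 250 and 300bp
sites_per_peak=2            # assumption: "more than one binding site" taken as 2
peaks=10000                 # number of peaks in the example
percent_reads_in_peaks=10   # share of reads expected to fall into peaks

pairs_per_site=$(( positions_per_site * fragment_lengths ))
pairs_per_peak=$(( pairs_per_site * sites_per_peak ))
pairs_in_peaks=$(( pairs_per_peak * peaks ))
# Scale up so that the in-peak pairs make up only 10% of the total.
pairs_total=$(( pairs_in_peaks * 100 / percent_reads_in_peaks ))

echo "possible read pairs per site:  $pairs_per_site"
echo "possible read pairs per peak:  $pairs_per_peak"
echo "possible read pairs in peaks:  $pairs_in_peaks"
echo "possible read pairs in total:  $pairs_total"
```

Changing fragment_lengths or peaks shows how quickly the space of distinct read pairs shrinks for narrow size selections or factors with few binding sites.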

If your ChIP-seq experiment has a 96% duplication rate then there is something wrong with your data. ENCODE guidelines for ChIP-seq recommend only using samples where more than 80% of the read pairs are unique (i.e. less than 20% duplication rate).
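
As a rough illustration of that threshold, here is a sketch that checks a sample against the 20% limit quoted above. It assumes you already have the total and duplicate read-pair counts (for example from the MarkDuplicates metrics file or from samtools flagstat). The function names and the example counts are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the duplication rate as a whole percentage
# (integer division, so the result is truncated).
duplication_percent() {
    local total=$1 duplicates=$2
    echo $(( duplicates * 100 / total ))
}

# Hypothetical helper: succeeds if fewer than 20% of read pairs are duplicates.
passes_duplication_check() {
    local total=$1 duplicates=$2
    (( duplicates * 100 < total * 20 ))
}

# Made-up example counts: 40 million read pairs with 4 million or 38.4 million duplicates.
total=40000000
for duplicates in 4000000 38400000; do
    rate=$(duplication_percent "$total" "$duplicates")
    if passes_duplication_check "$total" "$duplicates"; then
        echo "${rate}% duplicates: within the limit"
    else
        echo "${rate}% duplicates: something is probably wrong with this library"
    fi
done
```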

There are experiments where biological duplicates are more likely, and distinguishing between those and PCR duplicates is more important. For example, contrast the above with amplicon sequencing: if you sequence 1000x500bp amplicons, there are probably only 1 million possible read pairs even if you fragment (and many of those fragments will be pretty unlikely due to fragmentation bias). For such experiments one must either not deduplicate or include UMIs in the experimental design.

BTW RNA-seq is a very common technique where deduplication is not appropriate.

written 15 months ago by i.sudbery

"BTW RNA-seq is a very common technique where deduplication is not appropriate."

Yes, and certain DNA-seq library preps.

written 15 months ago by Kevin Blighe

Thanks a lot for this very helpful comment. It took me around an hour, with drawings, to fully understand it.

Biologically, I did not know that proteins could have so many binding sites. In my mind, a protein would bind to a dozen sites, not 10,000.

Do you have any additional information about:

"Now note that on average only around 10% of reads for ChIP-seq experiments fall into peaks"

I did not understand this point.

I conclude that ChIP-seq is more of a genome scan than a genome panel (DNA-seq).

Thanks again for your time.

written 15 months ago by Bastien Hervé

In only 1 situation did I observe a duplication rate that high, and it was due to the fact that the wet-lab immunologist had PCR amplified the same sample multiple times.

modified 11 months ago • written 15 months ago by Kevin Blighe

Maybe this question is too simple. Can I just take the BAM file produced by the first command below and use it for peak calling, without running samtools index?

java -jar MarkDuplicates.jar INPUT=Aligned_Sorted.bam OUTPUT=Aligned_Sorted_PCRDupes.bam ASSUME_SORTED=true METRICS_FILE=Aligned_Sorted_PCRDupes.txt VALIDATION_STRINGENCY=SILENT

written 15 months ago by mikysyc2016

Won't removing duplicates in short single-end ChIP-seq experiments put an effective ceiling on your coverage in enriched regions? There's only room for so many unique 75-bp reads over a 200bp region.

written 11 months ago by eric.fournier

Yes. Don't do short read single-end ChIP-seq.
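
To put a rough number on that ceiling, here is a sketch under two simplifying assumptions: single-end duplicates are defined by start position and strand, and only reads lying entirely inside the region are counted. Reads that only partly overlap the region would raise the figure somewhat.

```shell
#!/usr/bin/env bash
# Figures from the question above; the counting model is a simplification.
region_length=200   # enriched region, in bp
read_length=75      # single-end read length, in bp
strands=2           # forward and reverse

# Number of start positions where a read still fits entirely inside the region.
start_positions=$(( region_length - read_length + 1 ))
max_unique_reads=$(( start_positions * strands ))

echo "start positions per strand: $start_positions"
echo "maximum distinct reads after deduplication: $max_unique_reads"
```

With paired-end data the second mate's position adds another dimension, which is why the number of distinguishable fragments is so much larger there.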

written 11 months ago by i.sudbery
Bastien Hervé wrote:

As suggested in this post, you should expect duplicates in ChIP-seq data because you have sequenced a very small part of the genome. It all depends on your coverage.

Try to find out what proportion of your reads are duplicates. If you have 98% duplicates, try the following:

A good way to catch PCR duplicates, from @harold.smith.tarheel's answer in the post above: "You can discriminate via genome browser of your non-deduplicated data. Bona fide peaks will have multiple overlapping reads with offsets, while samples with only PCR duplicates will stack up perfectly without offsets."

If you see reads that "stack up perfectly without offsets", that is a problem (or at least you will have to decide whether to keep the duplicates or not). Conversely, if you see "multiple overlapping reads with offsets", you can keep the duplicates.

modified 15 months ago • written 15 months ago by Bastien Hervé