Question

ChIP-Seq reads mapping and data normalization of samples with different PCR cycles, library size and duplicates level

0


4.0 years ago

S.Fajon • 0

Hello everybody,

I’m new to bioinformatics and am currently working on my first ChIP-Seq analysis. I have questions about read mapping and data normalization for samples containing a high number of duplicates.

During library preparation, some samples went through more PCR cycles than others. This results in a bigger library size and a much higher number of duplicates. When mapping my reads with Bowtie, I’d like to restrict my library size to 10,000,000 reads for every sample so that I can compare them afterwards. Given the high proportion of duplicates in some of my samples (up to 65%), once I remove them from my BAM file, won’t I end up with a much smaller mapped library? In that case, correct me if I’m wrong, this defeats the purpose of setting a fixed number of reads at the beginning. By doing so, am I not introducing a big bias, since removing 20% or 65% of the reads out of 10 M will not give the same result?

Wouldn’t it be better to redo the mapping with a set of “clean” reads with no duplicates?

If not, how can I remove the bias introduced by the different numbers of PCR cycles among my samples and between my samples and my input?

I looked at the Bowtie user manual and saw that you can set arguments for multimapping, but there is nothing to map only unique reads at that stage. I’m sure there is a good reason for that, but I missed the point… I hope my explanations are clear enough.

Thank you in advance for your help!

ChIP-Seq alignment duplicates PCR cycles • 1.1k views


0


Thank you for your detailed answer! I do use Bowtie for now because my reads are short (around 50 bp for my R1 files and around 35 bp for my R2 files), but I was planning to try Bowtie2 to see if it gives better results. I saw people recommending the MAnorm paper on other threads, so I guess I'll have a look at it and keep your advice in mind. Cheers!

4.0 years ago by S.Fajon • 0

0


MAnorm is a normalization method that is relatively similar to the ones that DESeq2 and edgeR use, at least in principle: it assumes that many regions do not have differential binding. It is old and not maintained anymore, so I would not bother with it. What did they recommend it for? What is the question you want to answer? If it is only the normalization, then put your count matrix into DESeq2 or edgeR and get normalized counts from that. The critical question is whether you expect global changes in the binding profile. If so, a bin-based normalization might be desirable. Check the csaw manual for a discussion of normalization strategies.

4.0 years ago by ATpoint 82k

score 1 · Answer 1 · 2020-04-21

Map the full sequencing results against your reference, then mark and remove duplicates with any of the standard tools, such as MarkDuplicates from Picard or markdup from samtools. Do not do any custom read sampling; it is arbitrary and not standard. PCR cycles should not be a factor you consider during analysis, since this is not something you can objectively model or account for. A sample with more cycles can still be of better quality, and one with fewer cycles can still be crappy. Take the reads that are present after mapping and deduplicating; this is what you have. Multimappers are also typically excluded. People usually do not use bowtie anymore unless the reads are very short. Use bowtie2, it is a more recent replacement.
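As a rough illustration of why duplicates are handled after mapping, here is a toy sketch of position-based duplicate removal. It is not how Picard or samtools are implemented (the real tools also take things like clipping, mate positions and read quality into account); the `Read` tuple, the `remove_duplicates` helper and the reads themselves are made up for the example.

```python
from collections import namedtuple

# Hypothetical minimal representation of a mapped single-end read.
Read = namedtuple("Read", ["name", "chrom", "start", "strand"])

def remove_duplicates(reads):
    """Keep the first read seen at each (chrom, start, strand) position."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # same position and strand as r1: duplicate
    Read("r3", "chr1", 100, "-"),  # same start but other strand: kept
    Read("r4", "chr2", 500, "+"),
    Read("r5", "chr2", 500, "+"),  # duplicate of r4
]
unique_reads = remove_duplicates(reads)
duplicate_rate = 1 - len(unique_reads) / len(reads)
```

Duplicates are defined by where reads land on the genome, so they can only be identified once the reads are mapped. This is also why subsampling to a fixed number of raw reads beforehand does not give equal library sizes after deduplication.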

The actual QC starts after mapping imho. Downstream of mapping, perform peak calling on each sample and then calculate the FRiP (fraction of reads in peaks), which is simply the percentage of reads overlapping called peaks in each sample. It depends strongly on the peak caller and on how you calculate it, but with macs2 as the peak caller, FRiP should be roughly > 5% for TFs and most histone marks. Also check the samples in a genome browser and see if you have clear separation between peaks and noise. Also consider performing principal component analysis to check whether you have odd samples that cluster away from the other replicates. This only makes sense if you actually have replicates per condition, which I strongly recommend.
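A minimal sketch of the FRiP calculation, assuming reads and peaks are already available as simple `(chrom, start, end)` half-open intervals. The `frip` helper and the toy intervals are hypothetical; in practice you would compute this from your deduplicated BAM file and the peak file using whatever overlap tool you prefer, and the exact value depends on how overlaps are counted.

```python
def frip(read_intervals, peaks):
    """Fraction of reads overlapping at least one peak.

    Both arguments are lists of (chrom, start, end) tuples with
    half-open coordinates. Hypothetical helper for illustration only.
    """
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))
    in_peaks = 0
    for chrom, start, end in read_intervals:
        chrom_peaks = peaks_by_chrom.get(chrom, [])
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < p_end and p_start < end for p_start, p_end in chrom_peaks):
            in_peaks += 1
    return in_peaks / len(read_intervals) if read_intervals else 0.0

toy_peaks = [("chr1", 100, 200), ("chr2", 1000, 1100)]
toy_reads = [
    ("chr1", 90, 140),     # overlaps the chr1 peak
    ("chr1", 150, 200),    # overlaps the chr1 peak
    ("chr1", 200, 250),    # starts exactly where the peak ends: no overlap
    ("chr2", 1050, 1100),  # overlaps the chr2 peak
    ("chr2", 5000, 5050),  # background
    ("chr3", 100, 150),    # chromosome without peaks
]
toy_frip = frip(toy_reads, toy_peaks)
```

The linear scan over peaks is fine for a toy example but far too slow for real data, where an interval tree or a dedicated tool would be used instead.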

Normalization comes into play during differential analysis, not before that. Do not downsample replicates to equal read numbers, as this is not informative. You would need to account for differences in data quality and (between conditions) different library compositions. Also, do not compare raw peak numbers, as these are also a function of sequencing depth and data quality. If you want to compare samples (between experimental conditions), then perform differential analysis, e.g. with edgeR, which requires replicates.

ChIP-seq is a tricky assay, as whether your library prep is successful depends on so many factors, most importantly antibody quality and specificity. If you have further questions, feel free to comment.