ChIP-Seq identification of peaks
Entering edit mode
6.7 years ago
TulsaPrim • 0

I have the results of ChIP-Seq experiments for histone modification - aligned reads, and want to develop a script that will identify "true" peaks (I know about MACS, SICER etc. but I need to get familiar with the pre-processing of the data using a custom script). No control data are available to assess the noise.

Which characteristics would form a set of features that will allow for a simple procedure yet yielding good enough results?

I think the ratio: no. reads in the reverse strand / no. reads in the forward strand, and the no. reads within a window should be included.

Thank you.

ChIP-Seq peaks histone modification • 2.3k views
Entering edit mode

To be quite frank, if you want to develop your own peak caller then you shouldn't be asking these questions. Read the literature. Look at what other tools are already doing, then try to come up with something sensible yourself.

Entering edit mode

I do read the literature but was looking for the extra input of experience. Thanks.

Entering edit mode
6.6 years ago

As with most things in bioinformatics, people develop programs to address a specific problem or part of a problem, or to improve on a previous solution based on newfound information/knowledge/research.

You need to ask yourself why you want to develop a new program and who would be the audience, i.e., why would anyone use your program instead of others?

In ChIP-seq, the main parameters that are important are:

  • Quality control of raw reads, including adaptor trimming, eliminating reads below a certain length (e.g. 50bp), and ensuring that each base in each read is > 20 Phred score

  • performing unique mapping to a reference genome and setting the mapping quality (MAPQ) high (e.g. >50 Phred score) to prevent false alignments

  • removing duplicate reads that result from PCR amplification step

  • strand-specific alignments (depends on the NGS lab prep protocol and also the bioinformatic aligner and its parameters used)

  • read depth at each position / number of reads mapping at each position

  • maximum number of reads mapping to each position

  • the depth of coverage 'profile' and where to define a cut-off for signal / no signal, and for peak region identification (e.g. instruct to cease looking further for extending a region if the depth of coverage falls below 10x).

  • window size for moving along the genome and detecting peak regions

  • minimum/maximum peak starting size

  • maximum peak region allowed

  • normalization of depth of coverage and type of normalization used (e.g. Reads Per Kilobase per Million mapped reads [RPKM], reads per genome coverage [RPGC], etc.)

  • P- and Q-values, and fold-changes, for identifying where a region ends

Off the top of my head, current programs include HOMER, SICER, MACS, and bedCoverage.

Good luck! Kevin

Entering edit mode

Hi Kevin, thanks for the answer. Just wanna ask if MACS2 can perform reads per genome coverage [RPGC] normalization?

Entering edit mode
6.6 years ago
mforde84 ★ 1.4k

Here's a pipeline developed by ENCODE for narrow TF peaks. However, you may need to modify it a little to work with broad peaks depending on what histone markers you're looking at.


Login before adding your answer.

Traffic: 1621 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6