I have downloaded several ChIP-seq data (histone marks) and I'm trying to plot their coverage over the TSS

Since every file comes from a different sequencing experiment the sequencing depth varies from one sample to another. Therefore, what I'm trying to do is scale all the samples by a specific value.

The deeptools package offers that via the bamCompare --scaleFactor function. Is it sensible to calculate the mean coverage for each sample and then apply a scaling factor to normalise all the bam files to each other?

Moreover, I'm having trouble parsing the matrix file from the computeMatrix function into R. If anyone is familiar with deeptools any explanations with regard to what the matrix files represent would be much appreciated

bamCompare is used when you have two bam files. For ChIP-seq there is normally the ChIP sample and the so called input, bamCompare is used to compute the log2ratio or the difference. If you have for all your histone marks the ChIP and input, you can use bamCompare which will produce normalized bigwig files

If you don't have an input, you can use bamCoverage which can normalize the values using two methods: normalizeTo1x and normalizeUsingRPKM. Either method will produce output that is comparable across samples.

You can write to the deepTools mailing list to get support on the matrix issue (deeptools@googlegroups.com)

Can someone please explain how the normalize To 1x and normalize using RPKM normalization is accomplished in bamCoverage ??
Please add a detail description, Thanks!

It's best to post things like this as a new question.

Anyway, these normalization methods basically come up with a number against which the signal is multiplied. For the "normalize to 1x" method, the total number of alignments that will be used in the file and their lengths are estimated. These values are then multiplied together and divided by the effective genome size that you input. That yields the average coverage that's expected in a particular sample. The signal in each bin is then divided by it (technically it's multiplied by 1/that number, but the results are the same).

For RPKM, the procedure is similar, with the number of alignments in the file estimated, the results divided by 1 million and multiplied by the bin size in kb). That scale factor is then used as above in each bin.

You can see the full code for this in getScaleFactor.py in the source code (in particular, see the get_scale_factor function).

It's best to post things like this as a new question.

Anyway, these normalization methods basically come up with a number against which the signal is multiplied. For the "normalize to 1x" method, the total number of alignments that will be used in the file and their lengths are estimated. These values are then multiplied together and divided by the effective genome size that you input. That yields the average coverage that's expected in a particular sample. The signal in each bin is then divided by it (technically it's multiplied by

`1/that number`

, but the results are the same).For RPKM, the procedure is similar, with the number of alignments in the file estimated, the results divided by 1 million and multiplied by the bin size in kb). That scale factor is then used as above in each bin.

You can see the full code for this in

`getScaleFactor.py`

in the source code (in particular, see the`get_scale_factor`

function).