deepTools bamCompare/bamCoverage merging bins?
5.8 years ago
gewa ▴ 20

Hi all, I am currently using deepTools to process some ChIP-seq data and need a sanity check on how to interpret and handle the output, as I am not sure after reading the docs. I have alignment files from multiple replicates of a ChIP-seq experiment targeting a histone mark. What I need to do is assign the reads to 100 bp bins covering the genome, and then get a normalized value for each bin representing the abundance of reads that map to it.

To do this, I am using deepTools bamCompare to take in the alignments of both replicates from a single experiment, compute the RPKM for each bin in each replicate, and then take the mean of these values to get a single RPKM value per bin.

The issue is that I don't get an output of contiguous 100 bp bins; rather, the bins seem to be merged. While I don't see it on the bamCompare docs page, I assume this is the same behaviour as in bamCoverage, where "If consecutive bins have the same number of reads overlapping they are merged." What I'm wondering is: could I simply modify the output bedGraph (using something like awk) to split the "merged" bins back into the size I need, giving them all the RPKM value from the original, merged/larger bin? Or am I misunderstanding the effect of the merging on the calculation?
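Concretely, something like this (untested) awk one-liner is what I have in mind; the file names are placeholders and it assumes a plain four-column chrom/start/end/value bedGraph:

## Split each merged interval into 100 bp sub-bins, copying the interval's value to every sub-bin
awk 'BEGIN{OFS="\t"} {for (s = $2; s < $3; s += 100) {e = (s + 100 < $3 ? s + 100 : $3); print $1, s, e, $4}}' merged.bedGraph > bins_100bp.bedGraph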

Alternatively, if there is a better way to accomplish what I'm trying to do, I'd welcome any advice - I'm very new to bioinformatics. I need the data in these bins as input to another model; I'm not doing any differential binding analysis or anything like that.

Thanks!

deeptools ChIP-Seq
5.8 years ago
ATpoint 81k

Why don't you split the genome into bins of 100 bp (bedtools makewindows) and then use featureCounts (or a similar tool) to get the counts per bin directly from the BAM files? Once you have the counts, you can use awk to normalize the bins to RPKM and get the mean with bedtools map.
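For the windows and counting steps, a minimal sketch could look like this (file names are placeholders; featureCounts wants the windows in SAF format, which is 1-based):

## Make contiguous 100 bp windows across the genome (genome.chrom.sizes: chromosome name and length per line):
bedtools makewindows -g genome.chrom.sizes -w 100 > windows_100bp.bed

## Convert the BED windows to SAF (GeneID, Chr, Start, End, Strand):
awk 'BEGIN{OFS="\t"; print "GeneID","Chr","Start","End","Strand"} {print $1"_"$2"_"$3, $1, $2+1, $3, "."}' windows_100bp.bed > windows_100bp.saf

## Count reads per window for one replicate:
featureCounts -a windows_100bp.saf -F SAF -o counts_rep1.txt rep1.bam

## Reshape the counts into a bedGraph-like file (chr, 0-based start, end, count);
## the first two lines of the featureCounts output are a comment and the column header:
tail -n +3 counts_rep1.txt | awk 'BEGIN{OFS="\t"} {print $2, $3-1, $4, $7}' > bedGraphLike_rep1.txt

Repeat the counting for the second replicate, then continue with the normalization below.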

## Calculate the RPKM scaling factor for one sample ($1 = indexed BAM file;
## idxstats column 3 holds the mapped reads per chromosome, 0.1 is the bin length in kb):
SCALE_FACTOR=$(bc <<< "scale=8;1000000/($(samtools idxstats $1 | awk '{SUM+=$3} END {print SUM}')*0.1)")

## Normalize the counts (assuming a bedGraph-like format: chr, start, end, counts):
mawk -v SF=${SCALE_FACTOR} 'BEGIN{OFS="\t"} {print $1, $2, $3, $4*SF}' bedGraphLike.txt > normalized_rpkm.txt

## Average the two replicates with bedtools map (both files must be sorted the same way
## and contain identical bins); map appends sample 2's value to each bin of sample 1,
## so take the mean of the last two columns:
bedtools map -a normalized_rpkm_sample1.txt -b normalized_rpkm_sample2.txt -c 4 -o mean | mawk 'BEGIN{OFS="\t"} {print $1, $2, $3, ($4+$5)/2}' > averaged.txt

I will try this and update with results, thanks! Also, could you please explain the calculation of the scale factor? I'm having trouble understanding where the 8 comes from, and why idxstats is used. Thanks so much again!


The 8 is bc's scale, i.e. the number of decimal places the calculated value is carried to. You can use any number you want, but do not set it too low, to avoid larger rounding errors. idxstats is simply used to quickly get the total number of mapped reads in the BAM file; that is much faster than flagstat.
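To make the numbers concrete, here is a toy example with made-up values (25 million mapped reads, 100 bp = 0.1 kb bins):

## RPKM scale factor for 25 million mapped reads and 0.1 kb bins:
bc <<< "scale=8; 1000000/(25000000*0.1)"    ## prints .40000000
## A bin with 12 reads then gets 12 * 0.4 = 4.8 RPKM, i.e. 12 / (25 * 0.1)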
