Question

How to make a read count matrix from multiple bed files generated by ROSE

1

Entering edit mode

4.5 years ago

Researcher ▴ 130

Hi All, I have 10 bed files generated by ROSE, each with distinct start-end cordinates for SuperEnhancers (SEs) from 10 different samples, 5 of them are from one condition and rest 5 are from other. In order to check the differential binding for these SEs between the two conditions, I have parsed these sample-wise bedfiles as chip-seq peaks to the DiffBind along with their bam file using a sample-sheet and given command.

2cond_K27Ac_SE.csv has the following info:

SampleID Condition bamReads ControlID bamControl Peaks PeakCaller

sampleA C1 sampleA.bam A_Input A_Input.bam sampleA_K27Ac_SE.bed bed

sampleB C1 sampleB.bam B_Input B_Input.bam sampleB_K27Ac_SE.bed bed

.....

sampleF C2 sampleF.bam F_Input F_Input.bam sampleF_K27Ac_SE.bed bed

sampleG C2 sampleG.bam G_Input G_Input.bam sampleG_K27Ac_SE.bed bed

samples <- read.csv("2cond_K27Ac_SE.csv")
DBdata <- dba(sampleSheet=samples)

DBdata_count <- dba.count(DBdata,score=DBA_SCORE_TMM_MINUS_FULL_CPM)
counts <- dba.peakset(DBdata_count,bRetrieve=TRUE,DataType=DBA_DATA_FRAME, consensus=TRUE)
write.csv(counts,"SE_cpm.csv")

I am not sure will it be a recommended approach to get the normalized read count to perform differential binding. I am looking for your suggestions, please share your thoughts.

Thanks

super-enhancers rose chipseq diffbind H3K27Ac • 2.1k views

ADD COMMENT • link updated 23 months ago by ksong ▴ 40 • written 4.5 years ago by Researcher ▴ 130

0

Entering edit mode

Hello my friend, I know this post is old. However I plan to do the exactly the same thing recently: generating normalized count matrix using diffbind for ROSE super enhancers. I want to use the resulting count matrix for Heatmap. Have you found an appropriate way to do it? Many thanks.

ADD REPLY • link 23 months ago by ksong ▴ 40

score 1 · Answer 1 · 2022-05-17

1

Entering edit mode

23 months ago

Rory Stark ★ 2.0k

I'm not 100% sure what the specific issue is here?

You can do a differential binding analysis (default DESeq2by default) very easily:

DBanalysis <- dba.analyze(DBdata_count)

You can also use dba.normalize() to customize the normalization method, and dba.,contrast() to control the model design and set up contrasts other than the default ones.

If you want to do the differential binding analysis outside of DiffBind, most methods start with non-normalized counts as normalization and modelling can be somewhat intertwined. To get the raw read counts:

DBdata_count <- dba.count(DBdata_count, peaks=NULL, score=DBA_SCORE_READS)
counts.raw   <- dba.peakset(DBdata_count, bRetrieve=TRUE)

ADD COMMENT • link 23 months ago by Rory Stark ★ 2.0k

0

Entering edit mode

Hi Rory, Thank you for your reply! I am learning DiffBind. I think I can start to understand the example code you are showing now. However I found another problem after looking at some comments from the experts in Super Enhancer field:
Looking for suggestions if ROSE can handle multiple samples together?
DiffBind to call differential binding of Super-enhancers from ROSE

It sounds like super enhancers usually have large windows so if we count reads using super enhancer intervals we will get lots of noisy/background reads. I don't know if this can affect count matrix as my final goal is to plot heatmap using normalized count matrix.

To overcome this problem, I am thinking just sum over the counts of individual peaks in each large super enhancer window, and use that sum as the score for each super enhancer.

Here is the code I tried

 #raw counts   
myDBA <- dba.count(h3k27ac, peaks=NULL, score = DBA_SCORE_READS)  
 counts <- dba.peakset(myDBA, bRetrieve=TRUE)

 rose_output_dir = '/rose/output'  
 filenames<-list.files(rose_output_dir, pattern = "H3K27ac.*", full.names = F)
se_collect=list()   
for (sample in filenames){   se = read.table(paste0(rose_output_dir, "/",sample, "/",sample,"_SuperEnhancers.table.txt"), skip = 5, sep = "\t", header = T)  
 se_collect[[sample]]= se }   
se_peakset = NULL    
for (se in names(se_collect)){   se_peakset = dba.peakset(se_peakset,peaks=se_collect[[se]][,c("CHROM", "START","STOP", "enhancerRank")]) }

#make consensus super enhancer intervals
 se.consensus <- dba.peakset(se_peakset, minOverlap=2, bRetrieve=TRUE)  
 mcols(se.consensus)<-NULL #remove metadata column  
 names(se.consensus)<-paste0(seqnames(se.consensus),"_", ranges(se.consensus)) #add rownames. 

#find member original peaks of each super enhancer
overlaps = findOverlaps(query = se.consensus, subject = counts)   
 mcols(counts)$se_name <- NA  
 mcols(counts)$se_name[subjectHits(overlaps)] <-  names(se.consensus)[queryHits(overlaps)] 

se_peak_merged = counts[which(mcols(counts)$se_name != "NA")] %>%
 mcols() %>% 
as.data.frame
se_agg_counts = aggregate(. ~ se_name, se_peak_merged, sum)  

 head(se_agg_counts)  
                  se_name            X10       X11            X12             X25  
1   chr1_10761204-10794150        115        109      77       105  
2 chr1_118097743-118118284      220        201       144      190  
3 chr1_120030935-120062842        34         27       16        42

X10 X11.. are different timepoints. Now how should I normalize these counts? what would be the best norm.method and lib.method? Can I use the same norm.factors I got from the initial dba object like this?

dba.normalize(h3k27ac, bRetrieve = T,normalize = DBA_NORM_RLE)

Do you have any comment on this workflow? I just want to make sure this approach does not generate too much bias.

I really appreciate your help

ADD REPLY • link 23 months ago by ksong ▴ 40