since three weeks i try to make sense out of my (h)medip-seq data (iam an absolute beginner in deep sequencing data analysis and bioinformatics in general) and i hope you guys can give me some tips.
Sample: in vivo material of three linage related cell types
Method: comperative (h) MeDip-seq ( Li Tan et al. 2013) with 300 ng of each of the three celltypes (in total 900ng DNA) using the Diagnode MeDip and hMeDip kits.
First of all I don’t know if my data look alright since I have really narrow peaks (reads are extended to 250 bp) which align quite often only to on strand (hemimethylated DNA or non-CpG methylation), ([IMG]http://i.imgur.com/1TiUPTQ.png[/IMG]). I tried already to download published raw medip-seq data to compare them with my data but without success. Most times the data were already processed and therefor i couldnt compare them with my data. Is this narrow peak pattern normal or should I see more broader peaks (MeDip QC and Seqeuncing QC seem all okay) ?
The second problem is to analyse the data and find differential methylated peaks/regions). I used already the MEDIPS package (see code below) but these gives me around 10 peaks without duplicates whereas I get around 500 peaks with duplicates (adj=T).
bcv <- 0.01
counts <- matrix( rnbinom(40,size=1/bcv^2,mu=10), 20,2)
y <- DGEList(counts=counts, group=1:2)
et <- exactTest(y, dispersion=bcv^2)
N_Output= MEDIPS.createSet(file = medipoN, BSgenome = BSgenome, extend = extend, shift = shift, uniq = uniq, window_size = ws, chr.select = chr.select)
R_Output= MEDIPS.createSet(file = medipoR, BSgenome = BSgenome, extend = extend, shift = shift, uniq = uniq, window_size = ws, chr.select = chr.select)
mr.edgeR.MeDip = MEDIPS.meth(MSet1 = N_Output, MSet2 = R_Output, p.adj = "hochberg",diff.method = "edgeR", prob.method = "poisson", MeDIP = F, CNV = F, type = "rpkm", minRowSum = 1)
mr.edgeR.Medip.s = MEDIPS.selectSig(results = mr.edgeR.MeDip, p.value = 0.2, adj = T, ratio = NULL, bg.counts = 30, CNV = F)
However if I visualize the called regions in seqmonk they don’t look really interesting whereas I find easily other regions which look more differential methylated. So I don’t really trust the package but may used some wrong settings ? Can i use adj=F (but i think then i will get to much false positiv peaks) ?
I proceed then with MACS2 to call broad peaks in all three samples, merged them and count the reads of the peaks using bedtools.
macs2 callpeak -t /home/crtd_c ... MeDip_Output.bam -c /home/crtd_c ... MeDip_Input.bam --broad -f BAM -g mm --bw 250 -n PP_MeDip_MACS2 -> for each of the three celltypes
merge all three with "cat" function
bedtools merge -i /home/crtd_csortedmacs2peaks.bed -n -scores sum -d 100 > mergedMACS2peaks.bed
bedtools multicov -bams /home/crtd_c...PP_MeDip_Output.bam /home/crtd_c...DP_MeDip_Output.bam /home/crtd_c...N_MeDip_Output.bam -bed /home/crtd_c...mergedMACS2peaks.bed > MeDippeaks.bed -D -q 30
I sorted all peaks out which have no more than 20 reads in one of the three samples (
sorted <-subset(mydata, ((V6>20)|(V7>20)|(V8>20))) )and calculated the log2 ratio between the three celltypes (so at the end I had a excelsheet with the genomic coordinates and the three log2 ratios). However now is the question can I do it like this and if yes, how can I define the significant changes of the log2 ratios (simply +1/-1 fold change ?).
However Iam happy about any other suggestions regarding the data analysis. May there is a tool out there which I didn’t try so far.
Thanks a lot !