Hi all, I have downloaded the wgEncodeBroadHistone files from the UCSC downloads and I want to create a binary score for each region: signal enrichment vs. no enrichment. There are two kinds of files that I could get my data from: 1) the bigWig files with a density score, but I'm not sure what threshold I could use to call the signal of a region enriched (or not), and I guess it's different for each marker (?). 2) the broadPeak files with the columns (correct me if I'm wrong): "chr", "start", "end", "name", "score", "strand", "signalValue", "pValue", "qValue", but I'm not sure what those values are and how I could use them to create my variable. Any help would be much appreciated! Thanks, Emma
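For option 2, a minimal Python sketch of one way to turn a broadPeak file into a binary variable: a region is called enriched if it overlaps at least one peak. It assumes the standard ENCODE broadPeak layout (0-based start, exclusive end, signalValue as overall enrichment, pValue and qValue as -log10 values with -1 meaning "not assigned"). The function names, the `min_signal` filter, and the example rows are hypothetical and only there for illustration.

```python
from typing import Iterable, List, NamedTuple


class BroadPeak(NamedTuple):
    chrom: str
    start: int           # 0-based start of the peak
    end: int             # end of the peak (exclusive)
    name: str
    score: int           # 0-1000 display score
    strand: str
    signal_value: float  # overall enrichment for the region
    p_value: float       # -log10 p-value, -1 if not assigned
    q_value: float       # -log10 q-value, -1 if not assigned


def parse_broadpeak(lines: Iterable[str]) -> List[BroadPeak]:
    """Parse broadPeak rows, skipping blank lines and track/browser headers."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        f = line.split()
        peaks.append(BroadPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]),
                               f[5], float(f[6]), float(f[7]), float(f[8])))
    return peaks


def is_enriched(chrom: str, start: int, end: int,
                peaks: List[BroadPeak], min_signal: float = 0.0) -> int:
    """Return 1 if the region overlaps any peak with signalValue >= min_signal."""
    return int(any(
        p.chrom == chrom and p.start < end and start < p.end
        and p.signal_value >= min_signal
        for p in peaks
    ))


# Made-up rows, not real ENCODE data.
example_lines = [
    "chr19\t1000\t2000\t.\t500\t.\t8.5\t12.3\t-1",
    "chr19\t5000\t5600\t.\t300\t.\t3.1\t4.0\t-1",
]
peaks = parse_broadpeak(example_lines)
print(is_enriched("chr19", 1500, 1800, peaks))
print(is_enriched("chr19", 2000, 3000, peaks))
```

Since the peaks have already been called against a background, overlap alone may be enough; the optional `min_signal` argument is just one possible way to be stricter.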
Can you add some key references related to your answer? Thanks!
Thanks Gjain, but I don't think that would do in my case. Firstly, because I'm not going to be looking genome-wide but at specific regions which are more likely to be regulatory, so I would like a more consistent cut-off. And secondly, because the distribution of the signal of marker H3K4me3, for example (for chr19), is:
Min: 0.040, 1st Qu.: 1.000, Median: 2.000, Mean: 4.254, 3rd Qu.: 3.000, Max: 3828.000
which means that the 25% cut-off is going to include lots of regions without a signal of enrichment.
What you can do then is run SnowsPenultimateNormalityTest and check whether the score distribution is normal or not. If it is normal, you can calculate z-scores and, for a particular p-value (say 0.005), get the z-score threshold to use as the cut-off.
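The z-score idea can be sketched roughly as follows in Python. This assumes a one-sided cut-off (only high scores count as enriched) and that the scores really are close to normal, which should be checked first. The function names and the example scores are made up for illustration.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Sequence


def zscore_cutoff(p_value: float) -> float:
    """One-sided z-score threshold for the given upper-tail p-value."""
    return NormalDist().inv_cdf(1.0 - p_value)


def binarize_by_zscore(scores: Sequence[float], p_value: float = 0.005) -> List[int]:
    """Return 1 for scores whose z-score exceeds the cut-off, else 0."""
    mu = mean(scores)
    sigma = stdev(scores)  # sample standard deviation
    z_cut = zscore_cutoff(p_value)
    return [int((s - mu) / sigma > z_cut) for s in scores]


# Hypothetical signal values for 12 regions, not real data.
example_scores = [1.0, 2.0, 2.0, 3.0, 1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 50.0]
flags = binarize_by_zscore(example_scores, p_value=0.005)
print(round(zscore_cutoff(0.005), 3))
print(flags)
```

With p = 0.005 the one-sided threshold is about z = 2.576, so only regions more than roughly 2.6 standard deviations above the mean are flagged.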