I have a file with with start and end peak position according to TSS. My file contains 4 coumns (gene, start_peak, end_peak, strand). I would create a distribution plot. On the x-axis is the distance from the TSS and On the y-axis is number of peak found at that distance. I want to bin the distances from the TSS, say every 500 bp.nIt is not read density but Chipseq peak density
How to plot ChIP-seq Density vs Distance from TSS using my file and R? How to use these start and end peak position, because when I plot this
ggplot(data, aes(x=start_peak)) + geom_density()
it is only according to start peak ! And why my data are not centered on the TSS ? Could you help me ? Thanks
If I understood correctly, you have peak position and I guess you need the distance from TSS.
Let's assume you have peak information in bed file format:
chromosome peakStart peakEnd, then you can annotate them using R package such as ChIPpeakAnno OR you can manually calculate the distance from TSS using gtf file (in case of a non-model organism).
Finally, you can use distance from TSS information from peak annotation and calculate the frequency.
Thanks for you answer. I have already calculate each peak position relative from the TSS, so the postitions I had were the start position relative to TSS and end position relative to TSS.
How do I calculate the frequency, The frequencey is the number of gene having a given peak? For example :
so the frequency (column 3) is
I am not sure how did you calculate the relative distance from TSS for the peaks. But I assume the second and third column is the distance of start and end from TSS, and then take each column and calculate the frequency separately.
dplyr's group_by + count are your friends: