Tutorial: Plot average methylation levels across TSS region
gravatar for TriS
3.5 years ago by
United States, Buffalo
TriS3.4k wrote:

I spent a good amount of time online looking for answers to this but I wasn't able to find anything really conclusive, therefore I decided to do it my own way. please feel free to comment/critique, this is my first tutorial! I probably missed a simpler way of doing it or an R package that works much better, if that's the case please reference it in a comment.

what you need:

I used the RRBS methylation data from ENCODE K562 that you can find from the Data Matrix.

you can unzip the file with gunzip *.gz

a good initial introduction on how to use the RRBS data from ENCODE is here, therefore I won't take any credit for what he already explains really well. 

I'll just remind that there are 11 columns of which the last two are (col 10) reads and (col 11) percentage methylation.

the TSS file is available for download as .bed from from the UCSC Genome Browser in the Table link  under the following conditions:

  • assembly (Feb2009/hg19), feel free to use the one you assembled your data on
  • group: Genes and Predictions
  • track: RefSeq Genes
  • table: RefSeq
  • output format: selected fields from primary and related tables
  • output file: "TSShg19" ...or whatever you wanna call it
  • --> get output
  • select "chrom" and "txStart"
  • --> get output

now we start with some fun...

1) prepare TSS file

I since the TSS is one nucleotide I copied the start position as end position using awk:

cat myTSSfile.bed | awk '{print $2, '\t', $2}' > TSS_new.bed

2) intersect methylated regions with TSS and calculate distance. this step is fundamental since it will tell you how far away your CpGs are from the TSS, we'll use bedtools closest function, also known as closestBed. we want to keep the distance of the CpG (file a) to the TSS (file b)

the command line looks like this:

closestBed -a wgEncodeHaibMethylRrbsK562HaibSitesRep1.bed -b TSS_new.bed -D "b" > dist_TSS.bed

this will add three columns at the end, which will be your TSS coordinate and the distance of that CpG from it.

3) extract only those data with > 10 reads. you want to filter the RRBS data so that you keep only those with a minimum number of reads. you can set this before intersecting if you want, you can also set it to your favorite limit, I used 10 reads. at the end I will only print out the 1) number of reads (col 10), 2) the levels of methylation (col 11) and 3) the distance to the TSS (col 15)

cat dist_TSS.bed | awk '{if($10 > 10) print $10, "\t", $11, "\t", $15}' > results_dist_meth.bed

3a) in case you want you can also filter reads within a maximum distance, I did this step in R, if you want to do it in UNIX I'll leave it as an exercise for the reader

4) now we switch to R and read in the files we just created + filter reads that are more than 2kbs away from the TSS

math <- read.table("results_dist_meth.bed")
m_dens <- math[which(math$V3 < 2000 & math$V3 > -2000),2:3]

5) now, since you will have lots of reads at the same distance from the TSS and you want to plot the average values across a region, we'll calculate the average values across the same positions:

m_dens <- m_dens[order(m_dens[,2]),]
results_dist <- apply(m_dens,2,function(x) ave(x, m_dens[,2], FUN=mean))
results_dist <- results_dist[!(duplicated(results_dist[,2])),]

6) almost there, now we can plot all the points, and fit a nice curve to it. briefly: we prepare an empty plot to accommodate the curve, fit a model to our data, plot the model. the model is polynomial and the 6 indicates the order, you can change it based on how it fits your data (apologies to statisticians for my poor explanation).

plot(x=results_dist[,2], y=results_dist[,1], type="n", ylim=c(0,100), xlab="Dist to TSS", ylab="Average Methylation")
fit <- lm(results_dist2[,1]~poly(results_dist[,2],6,raw=T))
lines(results_dist[,2],predict(fit,data.frame(x=results_dist[,1])), lwd=5, col="black")

the resulting graph should be the following:


(sorry I'm not sure how to display an image on here from google drive... :-/ )

7) enjoy :)

hopefully this was helpful, comments, critiques, corrections etc are always welcome!! thanks for reading!

ADD COMMENTlink modified 3.5 years ago by Gjain5.2k • written 3.5 years ago by TriS3.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1553 users visited in the last hour