Question

Tutorial:Plot average methylation levels across TSS region

12

Entering edit mode

9.2 years ago

TriS ★ 4.7k

I spent a good amount of time online looking for answers to this but I wasn't able to find anything really conclusive, therefore I decided to do it my own way. please feel free to comment/critique, this is my first tutorial! I probably missed a simpler way of doing it or an R package that works much better, if that's the case please reference it in a comment.

What you need:

methylation data
TSS file
bedtools, R and patience...coffee helps too.

I used the RRBS methylation data from ENCODE K562 that you can find from the Data Matrix.

You can unzip the file with gunzip *.gz

A good initial introduction on how to use the RRBS data from ENCODE is here, therefore I won't take any credit for what he already explains really well.

I'll just remind that there are 11 columns of which the last two are (col 10) reads and (col 11) percentage methylation.

The TSS file is available for download as .bed from from the UCSC Genome Browser in the Table link under the following conditions:

assembly (Feb2009/hg19), feel free to use the one you assembled your data on
group: Genes and Predictions
track: RefSeq Genes
table: RefSeq
output format: selected fields from primary and related tables
output file: "TSShg19" ...or whatever you wanna call it
--> get output
select "chrom" and "txStart"
--> get output

Now we start with some fun...

1) Prepare TSS file

I since the TSS is one nucleotide I copied the start position as end position using awk:

cat myTSSfile.bed | awk '{print $2, '\t', $2}' > TSS_new.bed

2) Intersect methylated regions with TSS and calculate distance. this step is fundamental since it will tell you how far away your CpGs are from the TSS, we'll use bedtools closest function, also known as closestBed. we want to keep the distance of the CpG (file a) to the TSS (file b)

The command line looks like this:

closestBed -a wgEncodeHaibMethylRrbsK562HaibSitesRep1.bed -b TSS_new.bed -D "b" > dist_TSS.bed

This will add three columns at the end, which will be your TSS coordinate and the distance of that CpG from it.

3) Extract only those data with > 10 reads. you want to filter the RRBS data so that you keep only those with a minimum number of reads. you can set this before intersecting if you want, you can also set it to your favorite limit, I used 10 reads. at the end I will only print out the 1) number of reads (col 10), 2) the levels of methylation (col 11) and 3) the distance to the TSS (col 15)

cat dist_TSS.bed | awk '{if($10 > 10) print $10, "\t", $11, "\t", $15}' > results_dist_meth.bed

3a) In case you want you can also filter reads within a maximum distance, I did this step in R, if you want to do it in UNIX I'll leave it as an exercise for the reader

4) Now we switch to R and read in the files we just created + filter reads that are more than 2kbs away from the TSS

math <- read.table("results_dist_meth.bed")
m_dens <- math[which(math$V3 < 2000 & math$V3 > -2000),2:3]

5) Now, since you will have lots of reads at the same distance from the TSS and you want to plot the average values across a region, we'll calculate the average values across the same positions:

m_dens <- m_dens[order(m_dens[,2]),]
results_dist <- apply(m_dens,2,function(x) ave(x, m_dens[,2], FUN=mean))
results_dist <- results_dist[!(duplicated(results_dist[,2])),]

6) Almost there, now we can plot all the points, and fit a nice curve to it. briefly: we prepare an empty plot to accommodate the curve, fit a model to our data, plot the model. the model is polynomial and the 6 indicates the order, you can change it based on how it fits your data (apologies to statisticians for my poor explanation).

plot(x=results_dist[,2], y=results_dist[,1], type="n", ylim=c(0,100), xlab="Dist to TSS", ylab="Average Methylation")
fit <- lm(results_dist2[,1]~poly(results_dist[,2],6,raw=T))
lines(results_dist[,2],predict(fit,data.frame(x=results_dist[,1])), lwd=5, col="black")

The resulting graph should be the following:

https://drive.google.com/file/d/0BxzhXZ5eBptDZkpXUXM3NUhqdlU/view?usp=sharing

< image not found >

(Sorry I'm not sure how to display an image on here from google drive... :-/ )

7) Enjoy :)

Hopefully this was helpful, comments, critiques, corrections etc are always welcome!! thanks for reading!

bedtools RRBS R methylation • 7.9k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by TriS ★ 4.7k