Question: Plot Coverage Vs. Gc Content
10
gravatar for Bfranks
8.2 years ago by
Bfranks100
Bfranks100 wrote:

I'm trying to visualize coverage vs. gc content across a handful of regions. My starting materials are my.bam and gc5Base.txt.gz. What's the easiest way to get a scatterplot of coverage vs. GC-content using free tools? y-axis is coverage, x-axis is %GC.

I got as far as this in R (EDIT: see complete solution below)

library(Rsamtools)
library(GenomicRanges)
x <- readGappedAlignments("my.bam")
cvg <- coverage(x)

but I'm struggling to figure out how to join the GC-content to the coverage.

BONUS - I'd also like to be able to easily adjust the smoothing window.

EDIT: solution modified from Michael Dondrup's answers ([IMAGE] - http://min.us/lmrKfU):

library(GenomicFeatures)
library(ShortRead)
library(rtracklayer)
library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg18")
x <- readGappedAlignments("sample1.bam")
cvg <- coverage(x)
chrZ <- DNAString(Hsapiens$chrZ)
window.size <- 1001 #BONUS
gcZ <- rowSums(letterFrequencyInSlidingView(chrZ, window.size, c("G","C")))/window.size
pad.right <- function (rle, value=0, len) {
    if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
        if (len > length(rle))
            c(rle, Rle(value, len-length(rle)))
    else rle
}
cvgZ <- cvg[["chrZ"]]
cvgZp <- pad.right(cvgZ, len=length(gcZ))
cvgZw <- as.numeric(runmean(x=cvgZp, k=window.size, endrule="constant"))
smoothScatter(x=gcZ, y=cvgZw)
ADD COMMENTlink modified 8.2 years ago by Yadav Lal Bhattarai0 • written 8.2 years ago by Bfranks100

Hi, I hope this works for you. check my answer

ADD REPLYlink written 8.2 years ago by Michael Dondrup46k

cool, I edited my question to respond.

ADD REPLYlink written 8.2 years ago by Bfranks100

Yes, we forgot to pad the coverage vector because the coverage vector will only be as long as the largest alignment and thereby possibly shorter, we have to add 0's at the 'right-hand' side. Se my updated code.

ADD REPLYlink written 8.2 years ago by Michael Dondrup46k

Sweet, after a cup of coffee or 10 I'll have a plot. Some level of alpha transparency will help the madness, or I can switch to a 2D density plot. Thank you. This exercise was most informative for me.

ADD REPLYlink written 8.2 years ago by Bfranks100

fine!, for completeness, the smoothScatter function in the graphics package will do that.

ADD REPLYlink written 8.2 years ago by Michael Dondrup46k

also very helpful!

ADD REPLYlink written 8.2 years ago by Bfranks100

also very helpful! I must have too much time on my hands because now I am interested in viewing this density strung out along a genomic coordinate 3rd dimension, but I guess the density would then be in a 4th dimension. Shoot.

ADD REPLYlink written 8.2 years ago by Bfranks100

Can you post the resulting graphics? And maybe make a new question for the new task, because this question would become too crowded with code blocks. The density is however not a function of the genomic position, but only on gc and coverage.

ADD REPLYlink written 8.2 years ago by Michael Dondrup46k

It's the least I could do. Here is the result for two samples - http://min.us/lmmXNM. Pretty cool! Sorry for the delayed response. As for the new question, I'll try cleaning this one up first.

ADD REPLYlink written 8.2 years ago by Bfranks100

This was very helpful, thanks for the discussion and sharing.

 

You do mention 'handful of regions' but I cant figure in the code where those select regions are used for the plot. In other words, assuming exome or targeted sequencing data, one would like to use a substring of the chrZ in order to make the final GC plot.

 

One thing am going to try is subset the original BAM file (capture data aligned to whole genome, so contains off-target stuff) to just the capture regions, and then load the subsetted BAM in this code snippet!

ADD REPLYlink written 4.0 years ago by Bioinfosm610
12
gravatar for Michael Dondrup
8.2 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Computing GC content and smoothed coverage in R

Edit: pre-processing the data to compute GC content and compute running mean with flexible window size. The genome sequence is required to compute the GC content. I assume it is present as DNAString object, use read.DNAStringSet to read in the genome from a fasta file or use the BSGenome package. You will have to experiment a bit how to best load the data, e.g. maybe process only one chromosme at a time will help save memory.

library(Biostrings)

window.size <- 1000 # configure the window size

# create some test yeast data, the coordinates must fit with the coverage ofc, 
# so this might not work directly:
data(yeastSEQCHR1)
chrom <- DNAString(yeastSEQCHR1)

# compute gc content in a sliding window (as a fraction, if you want % : *100):
gc <- rowSums(letterFrequencyInSlidingView(chrom, window.size, c("G","C")))/window.size


# the coverage vector can be shorter than the sequence. 
# The length is equal to the largest end position of an alignment in the data:
pad.right <- function (rle, value=0, len) {
if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
    if (len > length(rle)) 
      c(rle, Rle(value, len-length(rle))) 
    else rle
}
cvg <- pad.right(cvg, len=length(gc))


# we have to process the coverage vector to have the same window size:
cvgw <- runmean(x=cvg, k=window.size, endrule = c("constant")) 
# endrule is important to get a vector of same size, see ?runmean

Finally, do a scatter-plot and maybe add a lowess scatterplot smoothing curve to visualize the correlation:

plot(x=gc, y=cvgw)
lines(lowess(x=gc, y=cvgw), col=2)

should do the trick.


Plotting the GC content as a genome track

Once you are using R, the easiest way to plot coverage is possibly the GenomeGraphs package. Use the function makeBaseTrack to generate a coverage track. Given your coverage is in the cvg Rle object and gc is an Rle with the gc content, that looks like this:

library("GenomeGraphs")
baseCoverage <- as.numeric(cvg) # makeBaseTrack doesn't take Rle
gcContent <- as.numeric(gc)
 # contruct your gene models here: 
gene <- ...
pl <- list( 
      genes=gene, makeGenomeAxis(T,T,T),
      coverage=makeBaseTrack(base=1:length(baseCoverage), value=baseCoverage),
      gc=makeBaseTrack(base=1:length(gcContent), value=gcContent)
)


gdPlot(pl, min=1, max=1000) # restrict viewport

There are also multiple configurable functions in that package to plot the gene models, e.g.: makeAnnotationTrack or using biomaRt (from the package documetation):

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Next we can retrieve the gene structure of the gene of interest.
gene <- makeGene(id = "ENSG00000095203",
type = "ensembl_gene_id", biomart = mart)
ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Michael Dondrup46k

I'm not sure how to get the gc-content Rle object. Specifically, I would like the gc-content averaged across views identical to the coverage Rle.

Also, you prefaced your comment with, 'Once you are using R,". Would you recommend a different tool for the job?

ADD REPLYlink written 8.2 years ago by Bfranks100

Also, why do you say a scatterplot would not be helpful? A plot of both across the genome axis is nice for the eye, but I'm interested in determining the extent of GC-bias in the data, which I imagine is more accurately visualized in euclidean space across the whole dataset rather than one genomic range at a time. At least with the scatterplot I could get a numeric correlation.

ADD REPLYlink written 8.2 years ago by Bfranks100

I think R is a good tool to do this, I was just happy about that your use of R would provide for an easy and brief solution. You will see if the plot helps. I was just thinking that, should you try to plot the data for the whole human genome, the plotting might show a lot of noise, but you will see that better. I edit my question to reflect how to compute gc content.

ADD REPLYlink written 8.2 years ago by Michael Dondrup46k
1
gravatar for Ian
8.2 years ago by
Ian5.4k
University of Manchester, UK
Ian5.4k wrote:

Sorry if this is a silly suggestion, but have you looked at the data using the UCSC browser? It will easily present the two tracks for any given region. The browser can present BAM files (here) and there is the GC content track.

ADD COMMENTlink written 8.2 years ago by Ian5.4k

Thanks, I can do the track visualizations, but I'm interested in the scatterplot where coverage is on the y-axis and GC-content is on the x-axis.

ADD REPLYlink written 8.2 years ago by Bfranks100

Apologies - misread the question!

ADD REPLYlink written 8.2 years ago by Ian5.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 819 users visited in the last hour