Question: how to get GC content of multiple ranges in R
gravatar for pt.taklifi
6 weeks ago by
pt.taklifi60 wrote:

I have a list of genomic ranges mapped to hg19 . my data is in matrix format lets call it ranges which has 600,000 rows and 4 clumns here is few row of my data

  chr    start    end      strand
[1,] "chr1" "10025"  "10525"  "."   
[2,] "chr1" "13252"  "13752"  "."   
[3,] "chr1" "16019"  "16519"  "."   
[4,] "chr1" "96376"  "96876"  "."   
[5,] "chr1" "115440" "115940" "."   
[6,] "chr1" "235393" "235893" "."

Is there a function that gets sequences and calculates GC content for each row( each range) I would prefer that output be in a vector format I would really appreciate your help

sequence R • 272 views
ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by pt.taklifi60
gravatar for bernatgel
6 weeks ago by
Barcelona, Spain
bernatgel2.8k wrote:


You can use the getSeq and alphabetFrequency functions from Bioconductor's Biostrings package to obtain the sequence and compute the frequency of each nucleotide, and then, just divide the sum of G+C by the total number of nucleotides.

To use getSeq the easiest way is to create a GRanges object from your matrix. I use toGRanges from regioneR package, but there are other ways to build it.


ranges <- matrix(c("chr1", "10025",  "10525",  ".", "chr1", "13252",  "13752",  "."),ncol = 4, byrow = TRUE)
colnames(ranges) <- c("chr", "start", "end", "strand")   

#Build a GRanges from your matrix
ranges <- toGRanges(data.frame(ranges))

#Get the sequences and compute the GC content
freqs <- alphabetFrequency(getSeq(BSgenome.Hsapiens.UCSC.hg19, ranges))
gc <- (freqs[,'C'] + freqs[,'G'])/rowSums(freqs)

Here gc is a vector with the GC frequency of each range in the original matrix.

Hope this helps


ADD COMMENTlink written 6 weeks ago by bernatgel2.8k
gravatar for ATpoint
6 weeks ago by
ATpoint44k wrote:

A solution using GenomicRanges and BSgenome from Bioconductor. You will need to transform your data to a GRanges object for this.


gr <- GenomicRanges::GRanges(seqnames = c("chr1", "chr2"),
                             ranges = IRanges(start = c(123456,123490), 
                                              end = c(500020, 600020)))

GetGC <- function(bsgenome, gr){

  seqs <- BSgenome::getSeq(bsgenome, gr)
  return(as.numeric(Biostrings::letterFrequency(x = seqs, letters = "GC", as.prob = TRUE)))


> gr
GRanges object with 2 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 123456-500020      *
  [2]     chr2 123490-600020      *
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

> GetGC(bsgenome = BSgenome.Hsapiens.UCSC.hg19, gr = gr)
[1] 0.2823072 0.4355687
ADD COMMENTlink written 6 weeks ago by ATpoint44k
gravatar for pt.taklifi
6 weeks ago by
pt.taklifi60 wrote:

thank you all, these solutions are great. I also found a function that calculates GC content using BSgenome.Hsapiens.UCSC.hg19.

Granges<- makeGRangesFromDataFrame(data.frame(ranges))
gc<- gcContentCalc(Granges , organism=Hsapiens, verbose=TRUE)
ADD COMMENTlink written 6 weeks ago by pt.taklifi60
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1620 users visited in the last hour