Question: how to get GC content of multiple ranges in R
2
gravatar for pt.taklifi
6 weeks ago by
pt.taklifi60
pt.taklifi60 wrote:

I have a list of genomic ranges mapped to hg19 . my data is in matrix format lets call it ranges which has 600,000 rows and 4 clumns here is few row of my data

 head(ranges)
  chr    start    end      strand
[1,] "chr1" "10025"  "10525"  "."   
[2,] "chr1" "13252"  "13752"  "."   
[3,] "chr1" "16019"  "16519"  "."   
[4,] "chr1" "96376"  "96876"  "."   
[5,] "chr1" "115440" "115940" "."   
[6,] "chr1" "235393" "235893" "."

Is there a function that gets sequences and calculates GC content for each row( each range) I would prefer that output be in a vector format I would really appreciate your help

sequence R • 272 views
ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by pt.taklifi60
5
gravatar for bernatgel
6 weeks ago by
bernatgel2.8k
Barcelona, Spain
bernatgel2.8k wrote:

Hi,

You can use the getSeq and alphabetFrequency functions from Bioconductor's Biostrings package to obtain the sequence and compute the frequency of each nucleotide, and then, just divide the sum of G+C by the total number of nucleotides.

To use getSeq the easiest way is to create a GRanges object from your matrix. I use toGRanges from regioneR package, but there are other ways to build it.

library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg19)


ranges <- matrix(c("chr1", "10025",  "10525",  ".", "chr1", "13252",  "13752",  "."),ncol = 4, byrow = TRUE)
colnames(ranges) <- c("chr", "start", "end", "strand")   

#Build a GRanges from your matrix
ranges <- toGRanges(data.frame(ranges))

#Get the sequences and compute the GC content
freqs <- alphabetFrequency(getSeq(BSgenome.Hsapiens.UCSC.hg19, ranges))
gc <- (freqs[,'C'] + freqs[,'G'])/rowSums(freqs)

Here gc is a vector with the GC frequency of each range in the original matrix.

Hope this helps

Bernat

ADD COMMENTlink written 6 weeks ago by bernatgel2.8k
4
gravatar for ATpoint
6 weeks ago by
ATpoint44k
ATpoint44k wrote:

A solution using GenomicRanges and BSgenome from Bioconductor. You will need to transform your data to a GRanges object for this.

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(BSgenome)

gr <- GenomicRanges::GRanges(seqnames = c("chr1", "chr2"),
                             ranges = IRanges(start = c(123456,123490), 
                                              end = c(500020, 600020)))

GetGC <- function(bsgenome, gr){

  seqs <- BSgenome::getSeq(bsgenome, gr)
  return(as.numeric(Biostrings::letterFrequency(x = seqs, letters = "GC", as.prob = TRUE)))


}

> gr
GRanges object with 2 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 123456-500020      *
  [2]     chr2 123490-600020      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

> GetGC(bsgenome = BSgenome.Hsapiens.UCSC.hg19, gr = gr)
[1] 0.2823072 0.4355687
ADD COMMENTlink written 6 weeks ago by ATpoint44k
4
gravatar for pt.taklifi
6 weeks ago by
pt.taklifi60
pt.taklifi60 wrote:

thank you all, these solutions are great. I also found a function that calculates GC content using BSgenome.Hsapiens.UCSC.hg19.

require(BSgenome.Hsapiens.UCSC.hg19)
require(BSgenome)
library(Repitools)
Granges<- makeGRangesFromDataFrame(data.frame(ranges))
gc<- gcContentCalc(Granges , organism=Hsapiens, verbose=TRUE)
ADD COMMENTlink written 6 weeks ago by pt.taklifi60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1620 users visited in the last hour
_