how to get GC content of multiple ranges in R
3
2
Entering edit mode
3.4 years ago
pt.taklifi ▴ 60

I have a list of genomic ranges mapped to hg19 . my data is in matrix format lets call it ranges which has 600,000 rows and 4 clumns here is few row of my data

 head(ranges)
  chr    start    end      strand
[1,] "chr1" "10025"  "10525"  "."   
[2,] "chr1" "13252"  "13752"  "."   
[3,] "chr1" "16019"  "16519"  "."   
[4,] "chr1" "96376"  "96876"  "."   
[5,] "chr1" "115440" "115940" "."   
[6,] "chr1" "235393" "235893" "."

Is there a function that gets sequences and calculates GC content for each row( each range) I would prefer that output be in a vector format I would really appreciate your help

R sequence • 3.7k views
ADD COMMENT
8
Entering edit mode
3.4 years ago
bernatgel ★ 3.4k

Hi,

You can use the getSeq and alphabetFrequency functions from Bioconductor's Biostrings package to obtain the sequence and compute the frequency of each nucleotide, and then, just divide the sum of G+C by the total number of nucleotides.

To use getSeq the easiest way is to create a GRanges object from your matrix. I use toGRanges from regioneR package, but there are other ways to build it.

library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg19)


ranges <- matrix(c("chr1", "10025",  "10525",  ".", "chr1", "13252",  "13752",  "."),ncol = 4, byrow = TRUE)
colnames(ranges) <- c("chr", "start", "end", "strand")   

#Build a GRanges from your matrix
ranges <- toGRanges(data.frame(ranges))

#Get the sequences and compute the GC content
freqs <- alphabetFrequency(getSeq(BSgenome.Hsapiens.UCSC.hg19, ranges))
gc <- (freqs[,'C'] + freqs[,'G'])/rowSums(freqs)

Here gc is a vector with the GC frequency of each range in the original matrix.

Hope this helps

Bernat

ADD COMMENT
4
Entering edit mode
3.4 years ago
ATpoint 82k

A solution using GenomicRanges and BSgenome from Bioconductor. You will need to transform your data to a GRanges object for this.

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(BSgenome)

gr <- GenomicRanges::GRanges(seqnames = c("chr1", "chr2"),
                             ranges = IRanges(start = c(123456,123490), 
                                              end = c(500020, 600020)))

GetGC <- function(bsgenome, gr){

  seqs <- BSgenome::getSeq(bsgenome, gr)
  return(as.numeric(Biostrings::letterFrequency(x = seqs, letters = "GC", as.prob = TRUE)))


}

> gr
GRanges object with 2 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 123456-500020      *
  [2]     chr2 123490-600020      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

> GetGC(bsgenome = BSgenome.Hsapiens.UCSC.hg19, gr = gr)
[1] 0.2823072 0.4355687
ADD COMMENT
4
Entering edit mode
3.4 years ago
pt.taklifi ▴ 60

thank you all, these solutions are great. I also found a function that calculates GC content using BSgenome.Hsapiens.UCSC.hg19.

require(BSgenome.Hsapiens.UCSC.hg19)
require(BSgenome)
library(Repitools)
Granges<- makeGRangesFromDataFrame(data.frame(ranges))
gc<- gcContentCalc(Granges , organism=Hsapiens, verbose=TRUE)
ADD COMMENT

Login before adding your answer.

Traffic: 1950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6