how to get GC content of multiple ranges in R
3
2
Entering edit mode
2.8 years ago
pt.taklifi ▴ 60

I have a list of genomic ranges mapped to hg19 . my data is in matrix format lets call it ranges which has 600,000 rows and 4 clumns here is few row of my data

 head(ranges)
chr    start    end      strand
[1,] "chr1" "10025"  "10525"  "."
[2,] "chr1" "13252"  "13752"  "."
[3,] "chr1" "16019"  "16519"  "."
[4,] "chr1" "96376"  "96876"  "."
[5,] "chr1" "115440" "115940" "."
[6,] "chr1" "235393" "235893" "."


Is there a function that gets sequences and calculates GC content for each row( each range) I would prefer that output be in a vector format I would really appreciate your help

R sequence • 3.1k views
7
Entering edit mode
2.8 years ago
bernatgel ★ 3.4k

Hi,

You can use the getSeq and alphabetFrequency functions from Bioconductor's Biostrings package to obtain the sequence and compute the frequency of each nucleotide, and then, just divide the sum of G+C by the total number of nucleotides.

To use getSeq the easiest way is to create a GRanges object from your matrix. I use toGRanges from regioneR package, but there are other ways to build it.

library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg19)

ranges <- matrix(c("chr1", "10025",  "10525",  ".", "chr1", "13252",  "13752",  "."),ncol = 4, byrow = TRUE)
colnames(ranges) <- c("chr", "start", "end", "strand")

#Build a GRanges from your matrix
ranges <- toGRanges(data.frame(ranges))

#Get the sequences and compute the GC content
freqs <- alphabetFrequency(getSeq(BSgenome.Hsapiens.UCSC.hg19, ranges))
gc <- (freqs[,'C'] + freqs[,'G'])/rowSums(freqs)


Here gc is a vector with the GC frequency of each range in the original matrix.

Hope this helps

Bernat

4
Entering edit mode
2.8 years ago
ATpoint 77k

A solution using GenomicRanges and BSgenome from Bioconductor. You will need to transform your data to a GRanges object for this.

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(BSgenome)

gr <- GenomicRanges::GRanges(seqnames = c("chr1", "chr2"),
ranges = IRanges(start = c(123456,123490),
end = c(500020, 600020)))

GetGC <- function(bsgenome, gr){

seqs <- BSgenome::getSeq(bsgenome, gr)
return(as.numeric(Biostrings::letterFrequency(x = seqs, letters = "GC", as.prob = TRUE)))

}

> gr
GRanges object with 2 ranges and 0 metadata columns:
seqnames        ranges strand
<Rle>     <IRanges>  <Rle>
[1]     chr1 123456-500020      *
[2]     chr2 123490-600020      *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths

> GetGC(bsgenome = BSgenome.Hsapiens.UCSC.hg19, gr = gr)
[1] 0.2823072 0.4355687

4
Entering edit mode
2.8 years ago
pt.taklifi ▴ 60

thank you all, these solutions are great. I also found a function that calculates GC content using BSgenome.Hsapiens.UCSC.hg19.

require(BSgenome.Hsapiens.UCSC.hg19)
require(BSgenome)
library(Repitools)
Granges<- makeGRangesFromDataFrame(data.frame(ranges))
gc<- gcContentCalc(Granges , organism=Hsapiens, verbose=TRUE)