Question: Identifying long stretches of Ns
gravatar for anshupa.vssut
14 months ago by
anshupa.vssut40 wrote:

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

R genome • 586 views
ADD COMMENTlink modified 14 months ago by ATpoint12k • written 14 months ago by anshupa.vssut40

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLYlink written 14 months ago by genomax60k

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLYlink modified 14 months ago • written 14 months ago by anshupa.vssut40
gravatar for Nicolas Rosewick
14 months ago by
Belgium, Brussels
Nicolas Rosewick7.2k wrote:

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
ADD COMMENTlink modified 14 months ago • written 14 months ago by Nicolas Rosewick7.2k
gravatar for Pierre Lindenbaum
14 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

not R

my answer to

ADD COMMENTlink written 14 months ago by Pierre Lindenbaum116k
gravatar for ATpoint
14 months ago by
ATpoint12k wrote:

A crude approach using Biostrings and BSgenomes that I wrote when I got started with R (originally intended to check coordinates of primers) that allows you to screen individual chromosomes of any BSgenome for a given nucleotide query, resulting in a GRanges object that merges individual matches in case they are adjacent.


seq.check <- function(given.seq, chr, tmp.genome){

  # subset BSgenome to chromosome of interest:
  where <- which(tmp.genome@seqinfo@seqnames == chr)
  current.chr <- tmp.genome[[where]]

  # get all coordinates that match the pattern:
  tmp.match <- matchPattern(given.seq, getSeq(tmp.genome, chr, 1, current.chr@length))

  # transform to GRanges and merge adjacent ranges:
  tmp.granges <- GenomicRanges::reduce( GRanges(seqnames = chr, ranges = ranges(tmp.match)))

# example for poly-N (can also be any nucleotide sequence)
seq.check("NNNNNNNNNNNNNNNNNNNN", "chr13", BSgenome.Hsapiens.UCSC.hg38)

# in case a reverse complement is to be checked:
seq.check(reverseComplement(DNAString("TGCAGCCTCATTACTCAGACA")), "chr13", BSgenome.Hsapiens.UCSC.hg38)

EDIT (02/18): Search all chromosomes using lapply:

hg38 <- BSgenome.Hsapiens.UCSC.hg38

# Chromosom 1-22, X, Y:

polyN <-"c", lapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
            seq.check("NNNNNNNNNNNNNNNNN", x, hg38))

This will give you a GRanges with all "NNNNNNNNNNNNNNNNN" on every chromosome, without redundancy (=overlapping ranges merged into one interval).

ADD COMMENTlink modified 11 months ago • written 14 months ago by ATpoint12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1274 users visited in the last hour