Question: Identifying long stretches of Ns
gravatar for anshupa.vssut
9 months ago by
anshupa.vssut20 wrote:

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

R genome • 406 views
ADD COMMENTlink modified 9 months ago by ATpoint6.5k • written 9 months ago by anshupa.vssut20

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLYlink written 9 months ago by genomax54k

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLYlink modified 9 months ago • written 9 months ago by anshupa.vssut20
gravatar for Nicolas Rosewick
9 months ago by
Belgium, Brussels
Nicolas Rosewick6.5k wrote:

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
ADD COMMENTlink modified 9 months ago • written 9 months ago by Nicolas Rosewick6.5k
gravatar for Pierre Lindenbaum
9 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum111k wrote:

not R

my answer to

ADD COMMENTlink written 9 months ago by Pierre Lindenbaum111k
gravatar for ATpoint
9 months ago by
ATpoint6.5k wrote:

A crude approach using Biostrings and BSgenomes that I wrote when I got started with R (originally intended to check coordinates of primers) that allows you to screen individual chromosomes of any BSgenome for a given nucleotide query, resulting in a GRanges object that merges individual matches in case they are adjacent.


seq.check <- function(given.seq, chr, tmp.genome){

  # subset BSgenome to chromosome of interest:
  where <- which(tmp.genome@seqinfo@seqnames == chr)
  current.chr <- tmp.genome[[where]]

  # get all coordinates that match the pattern:
  tmp.match <- matchPattern(given.seq, getSeq(tmp.genome, chr, 1, current.chr@length))

  # transform to GRanges and merge adjacent ranges:
  tmp.granges <- GenomicRanges::reduce( GRanges(seqnames = chr, ranges = ranges(tmp.match)))

# example for poly-N (can also be any nucleotide sequence)
seq.check("NNNNNNNNNNNNNNNNNNNN", "chr13", BSgenome.Hsapiens.UCSC.hg38)

# in case a reverse complement is to be checked:
seq.check(reverseComplement(DNAString("TGCAGCCTCATTACTCAGACA")), "chr13", BSgenome.Hsapiens.UCSC.hg38)

EDIT (02/18): Search all chromosomes using lapply:

hg38 <- BSgenome.Hsapiens.UCSC.hg38

# Chromosom 1-22, X, Y:

polyN <-"c", lapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
            seq.check("NNNNNNNNNNNNNNNNN", x, hg38))

This will give you a GRanges with all "NNNNNNNNNNNNNNNNN" on every chromosome, without redundancy (=overlapping ranges merged into one interval).

ADD COMMENTlink modified 6 months ago • written 9 months ago by ATpoint6.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 660 users visited in the last hour