Question: Identifying long stretches of Ns
gravatar for anshupa.vssut
11 months ago by
anshupa.vssut20 wrote:

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

R genome • 485 views
ADD COMMENTlink modified 11 months ago by ATpoint8.2k • written 11 months ago by anshupa.vssut20

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLYlink written 11 months ago by genomax57k

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLYlink modified 11 months ago • written 11 months ago by anshupa.vssut20
gravatar for Nicolas Rosewick
11 months ago by
Belgium, Brussels
Nicolas Rosewick6.8k wrote:

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
ADD COMMENTlink modified 11 months ago • written 11 months ago by Nicolas Rosewick6.8k
gravatar for Pierre Lindenbaum
11 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum113k wrote:

not R

my answer to

ADD COMMENTlink written 11 months ago by Pierre Lindenbaum113k
gravatar for ATpoint
11 months ago by
ATpoint8.2k wrote:

A crude approach using Biostrings and BSgenomes that I wrote when I got started with R (originally intended to check coordinates of primers) that allows you to screen individual chromosomes of any BSgenome for a given nucleotide query, resulting in a GRanges object that merges individual matches in case they are adjacent.


seq.check <- function(given.seq, chr, tmp.genome){

  # subset BSgenome to chromosome of interest:
  where <- which(tmp.genome@seqinfo@seqnames == chr)
  current.chr <- tmp.genome[[where]]

  # get all coordinates that match the pattern:
  tmp.match <- matchPattern(given.seq, getSeq(tmp.genome, chr, 1, current.chr@length))

  # transform to GRanges and merge adjacent ranges:
  tmp.granges <- GenomicRanges::reduce( GRanges(seqnames = chr, ranges = ranges(tmp.match)))

# example for poly-N (can also be any nucleotide sequence)
seq.check("NNNNNNNNNNNNNNNNNNNN", "chr13", BSgenome.Hsapiens.UCSC.hg38)

# in case a reverse complement is to be checked:
seq.check(reverseComplement(DNAString("TGCAGCCTCATTACTCAGACA")), "chr13", BSgenome.Hsapiens.UCSC.hg38)

EDIT (02/18): Search all chromosomes using lapply:

hg38 <- BSgenome.Hsapiens.UCSC.hg38

# Chromosom 1-22, X, Y:

polyN <-"c", lapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
            seq.check("NNNNNNNNNNNNNNNNN", x, hg38))

This will give you a GRanges with all "NNNNNNNNNNNNNNNNN" on every chromosome, without redundancy (=overlapping ranges merged into one interval).

ADD COMMENTlink modified 8 months ago • written 11 months ago by ATpoint8.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1051 users visited in the last hour