Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?
Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?
Update 2023 -- Use Biostrings::vmatchPattern().
library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)
# Will take a few moments
vmp <- Biostrings::vmatchPattern(pattern="NNNNNNNNNNNNNNNNNNNN", subject=BSgenome.Hsapiens.UCSC.hg38)
vmp
GRanges object with 323184822 ranges and 0 metadata columns:
                          seqnames        ranges strand
                             <Rle>     <IRanges>  <Rle>
          [1]                 chr1          1-20      +
          [2]                 chr1          2-21      +
          [3]                 chr1          3-22      +
          [4]                 chr1          4-23      +
          [5]                 chr1          5-24      +
          ...                  ...           ...    ...
  [323184818] chr19_KV575249v1_alt 210330-210349      -
  [323184819] chr19_KV575249v1_alt 210331-210350      -
  [323184820] chr19_KV575249v1_alt 210332-210351      -
  [323184821] chr19_KV575249v1_alt 210333-210352      -
  [323184822] chr19_KV575249v1_alt 210334-210353      -
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
This returns every distinct interval as a separate entry. Use GenomicRanges::reduce() on the output to collapse adjacent and overlapping intervals into one.
You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence
chromosome <- "ATGTAGATATGAATGCCCNNNNNACGACGACGAGNNNNNNNNNNNNNNNNNNACGACGACGAGAGGGANNAAAAN"
# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(do.call(rbind,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column
colnames(n.pos.res)<-c("start","end","length")
Results :
  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1
You could also perform some thresold (here minimum gaps of 1000bp are keep.
n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
                    
                
                not R
my answer to
Who's got a command for me to output coordinates of FASTA scaffolds gaps as a BED file? Looking for a one-liner. Saw https://t.co/7Znt8tPLZQ
— Shaun Jackman (@sjackman) June 29, 2016
5.4 years later: use ScatterIntervalsByNs https://gatk.broadinstitute.org/hc/en-us/articles/360041416072-ScatterIntervalsByNs-Picard-
The GC selector (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz) has data for basically the entire genome and doesn't include the N positions. You can just subtract that from the entire hg19. Something like:
cat /references/hg19.fa.fai | awk -F "\t" '{print $1 "\t" 1 "\t" $2'} > hg19.bed
cut -f 2-4 gc5Base.txt | bedtools subtract -a hg19.bed -b - | awk -F "\t" '{print $1 "\t" $2+1 "\t" $3}' > hg19_Ns.bed
                    
                
                Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?
I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.