Question

Identifying long stretches of Ns

0

Entering edit mode

8.0 years ago

anshupa.vssut ▴ 50

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

genome R • 5.0k views

ADD COMMENT • link updated 22 months ago by André • 0 • written 8.0 years ago by anshupa.vssut ▴ 50

0

Entering edit mode

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLY • link 8.0 years ago by GenoMax 154k

0

Entering edit mode

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLY • link 8.0 years ago by anshupa.vssut ▴ 50

score 4 · Answer 1 · 2017-11-06

Update 2023 -- Use Biostrings::vmatchPattern().

library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)

# Will take a few moments
vmp <- Biostrings::vmatchPattern(pattern="NNNNNNNNNNNNNNNNNNNN", subject=BSgenome.Hsapiens.UCSC.hg38)
vmp

GRanges object with 323184822 ranges and 0 metadata columns:
                          seqnames        ranges strand
                             <Rle>     <IRanges>  <Rle>
          [1]                 chr1          1-20      +
          [2]                 chr1          2-21      +
          [3]                 chr1          3-22      +
          [4]                 chr1          4-23      +
          [5]                 chr1          5-24      +
          ...                  ...           ...    ...
  [323184818] chr19_KV575249v1_alt 210330-210349      -
  [323184819] chr19_KV575249v1_alt 210331-210350      -
  [323184820] chr19_KV575249v1_alt 210332-210351      -
  [323184821] chr19_KV575249v1_alt 210333-210352      -
  [323184822] chr19_KV575249v1_alt 210334-210353      -
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

This returns every distinct interval as a separate entry. Use GenomicRanges::reduce() on the output to collapse adjacent and overlapping intervals into one.

score 3 · Answer 2 · 2017-11-06

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

chromosome <- "ATGTAGATATGAATGCCCNNNNNACGACGACGAGNNNNNNNNNNNNNNNNNNACGACGACGAGAGGGANNAAAAN"
# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(do.call(rbind,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column
colnames(n.pos.res)<-c("start","end","length")

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]

score 3 · Answer 3 · 2017-11-06

3

Entering edit mode

8.0 years ago

Pierre Lindenbaum 166k

not R

my answer to

Who's got a command for me to output coordinates of FASTA scaffolds gaps as a BED file? Looking for a one-liner. Saw https://t.co/7Znt8tPLZQ
— Shaun Jackman (@sjackman) June 29, 2016

ADD COMMENT • link 8.0 years ago by Pierre Lindenbaum 166k

3

Entering edit mode

5.4 years later: use ScatterIntervalsByNs https://gatk.broadinstitute.org/hc/en-us/articles/360041416072-ScatterIntervalsByNs-Picard-

ADD REPLY • link 2.5 years ago by Pierre Lindenbaum 166k

score 0 · Answer 4 · 2023-12-22

The GC selector (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz) has data for basically the entire genome and doesn't include the N positions. You can just subtract that from the entire hg19. Something like:

cat /references/hg19.fa.fai | awk -F "\t" '{print $1 "\t" 1 "\t" $2'} > hg19.bed
cut -f 2-4 gc5Base.txt | bedtools subtract -a hg19.bed -b - | awk -F "\t" '{print $1 "\t" $2+1 "\t" $3}' > hg19_Ns.bed