Identify homopolymer stretches in a given genome
1
0
Entering edit mode
4.0 years ago
ATpoint 54k

Is anyone aware of an implementation that identifies or blacklists homopolymer stretches in a given genome/fasta, lets say with a length of 6 or greater?


Edit ((04/19): Solution in R using BSgenomes, GenomicRanges and Biostrings

## Find homopolymers in a given BSgenome

####################################################################################################################

require(Biostrings)
require(GenomicRanges)

####################################################################################################################

## Avoid floating point numbers for larger genomic coordinates:
options(scipen=999)

####################################################################################################################

## Helper function for pattern matching:
## Function searches a given chromosome (chr) of the BSgenome (tmp.genome) for pattern matches (given.seq)
## Example: seq.check("AAAAAA", "chr1", BSgenome.Hsapiens.UCSC.hg38)

seq.check <- function(given.seq, chr, tmp.genome){

  ## Subset BSgenome to chromosome of interest:
  current.chr <- tmp.genome[[which(tmp.genome@seqinfo@seqnames == chr)]]

  ## Get all (also redundant) coordinates of the exact pattern match:
  tmp.match <- matchPattern(given.seq, getSeq(tmp.genome, chr))

  ## If no matches are found, output empty GRanges, else output GRanges with unique coordinates:
  if (length(tmp.match@ranges) == 0){
    return(
      GRanges()
    )
  } else{ 
    return( 
      suppressWarnings(GenomicRanges::reduce(GRanges(seqnames = chr, ranges = ranges(tmp.match))))
    )
  }
}

## Function that uses seq.check() to output a granges with coordinates of all defined homopolymers in the genome:
find.Homopolymers <- function(Nucleotide, PolyXLength, Query.Genome, Cores = 1){

  return(
    suppressWarnings(
      do.call("c", 
              mclapply(as.character(Query.Genome@seqinfo@seqnames), function(x) seq.check(
                paste(replicate(PolyXLength, Nucleotide), collapse = ""), x, Query.Genome), 
                mc.cores = Cores)
      )
    )
  )
}

####################################################################################################################

## Example:
require(BSgenome.Ecoli.NCBI.20080805)

## 1: Find polyA's of at least 6bp length:
PolyA5_EColi <- find.Homopolymers(Nucleotide = "A", PolyXLength = 6, Query.Genome = BSgenome.Ecoli.NCBI.20080805)

## 2: Find all (polyATCG) of at least 6bp length:
PolyAll5_EColi  <- do.call("c", list(find.Homopolymers("A", 6, BSgenome.Ecoli.NCBI.20080805),
                                     find.Homopolymers("T", 6, BSgenome.Ecoli.NCBI.20080805),
                                     find.Homopolymers("C", 6, BSgenome.Ecoli.NCBI.20080805),
                                     find.Homopolymers("G", 6, BSgenome.Ecoli.NCBI.20080805)
)
)
Variant Calling Homopolymers SNP Indel • 1.6k views
ADD COMMENT

Login before adding your answer.

Traffic: 2205 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6