Question: Identify homopolymer stretches in a given genome
0
gravatar for ATpoint
23 months ago by
ATpoint23k
Germany
ATpoint23k wrote:

Is anyone aware of an implementation that identifies or blacklists homopolymer stretches in a given genome/fasta, lets say with a length of 6 or greater?


Edit ((04/19): Solution in R using BSgenomes, GenomicRanges and Biostrings

## Find homopolymers in a given BSgenome

####################################################################################################################

require(Biostrings)
require(GenomicRanges)

####################################################################################################################

## Avoid floating point numbers for larger genomic coordinates:
options(scipen=999)

####################################################################################################################

## Helper function for pattern matching:
## Function searches a given chromosome (chr) of the BSgenome (tmp.genome) for pattern matches (given.seq)
## Example: seq.check("AAAAAA", "chr1", BSgenome.Hsapiens.UCSC.hg38)

seq.check <- function(given.seq, chr, tmp.genome){

  ## Subset BSgenome to chromosome of interest:
  current.chr <- tmp.genome[[which(tmp.genome@seqinfo@seqnames == chr)]]

  ## Get all (also redundant) coordinates of the exact pattern match:
  tmp.match <- matchPattern(given.seq, getSeq(tmp.genome, chr))

  ## If no matches are found, output empty GRanges, else output GRanges with unique coordinates:
  if (length(tmp.match@ranges) == 0){
    return(
      GRanges()
    )
  } else{ 
    return( 
      suppressWarnings(GenomicRanges::reduce(GRanges(seqnames = chr, ranges = ranges(tmp.match))))
    )
  }
}

## Function that uses seq.check() to output a granges with coordinates of all defined homopolymers in the genome:
find.Homopolymers <- function(Nucleotide, PolyXLength, Query.Genome, Cores = 1){

  return(
    suppressWarnings(
      do.call("c", 
              mclapply(as.character(Query.Genome@seqinfo@seqnames), function(x) seq.check(
                paste(replicate(PolyXLength, Nucleotide), collapse = ""), x, Query.Genome), 
                mc.cores = Cores)
      )
    )
  )
}

####################################################################################################################

## Example:
require(BSgenome.Ecoli.NCBI.20080805)

## 1: Find polyA's of at least 6bp length:
PolyA5_EColi <- find.Homopolymers(Nucleotide = "A", PolyXLength = 6, Query.Genome = BSgenome.Ecoli.NCBI.20080805)

## 2: Find all (polyATCG) of at least 6bp length:
PolyAll5_EColi  <- do.call("c", list(find.Homopolymers("A", 6, BSgenome.Ecoli.NCBI.20080805),
                                     find.Homopolymers("T", 6, BSgenome.Ecoli.NCBI.20080805),
                                     find.Homopolymers("C", 6, BSgenome.Ecoli.NCBI.20080805),
                                     find.Homopolymers("G", 6, BSgenome.Ecoli.NCBI.20080805)
)
)
ADD COMMENTlink modified 5 months ago • written 23 months ago by ATpoint23k
4
gravatar for Pierre Lindenbaum
23 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

There is a module HomopolymerRun in GATK Variant Annotation: see https://software.broadinstitute.org/gatk/gatkdocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php and https://software.broadinstitute.org/gatk/gatkdocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_HomopolymerRun.php

I've also written a VCFPolyX: http://lindenb.github.io/jvarkit/VCFPolyX.html

ADD COMMENTlink written 23 months ago by Pierre Lindenbaum123k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 827 users visited in the last hour