Question: Identify homopolymer stretches in a given genome
0
gravatar for ATpoint
16 months ago by
ATpoint14k
Germany
ATpoint14k wrote:

Is anyone aware of an implementation that identifies or blacklists homopolymer stretches in a given genome/fasta, lets say with a length of 6 or greater?

EDIT (02/18): Simple R solution, using BSgenomes, GenomicRanges and Biostrings. Will produce a granges/BED suitable for BEDtools intersect.

options(scipen=999) # avoid floating points for large coordinates
require(BSgenome.Hsapiens.UCSC.hg38)
require(Biostrings)
require(GenomicRanges)
require(parallel)

# Select your genome
hg38 <- BSgenome.Hsapiens.UCSC.hg38

seq.check <- function(given.seq, chr, tmp.genome){

  # subset BSgenome to chromosome of interest:
  where <- which(tmp.genome@seqinfo@seqnames == chr)
  current.chr <- tmp.genome[[where]]

  # get all coordinates that match the pattern:
  tmp.match <- matchPattern(given.seq, getSeq(tmp.genome, chr, 1, current.chr@length))

  # transform to GRanges and merge adjacent ranges:
  tmp.granges <- GenomicRanges::reduce( GRanges(seqnames = chr, ranges = ranges(tmp.match)))
  return(tmp.granges)
}

# homoATCG of length 6 and larger:
homoA6 <- do.call("c", mclapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
            seq.check("AAAAAA", x, hg38), mc.cores=8))
homoC6 <- do.call("c", mclapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
  seq.check("CCCCCC", x, hg38), mc.cores=8))
homoT6 <- do.call("c", mclapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
  seq.check("TTTTTT", x, hg38), mc.cores=8))
homoG6 <- do.call("c", mclapply(paste("chr", c(seq(1,22), "X", "Y"), sep=""), function(x) 
  seq.check("GGGGGG", x, hg38), mc.cores=8))

homoPolymerHexa.gr <- do.call("c", list(homoA6, homoG6, homoC6, homoT6))

# Write 1-based granges to 0-based BED file on disk:
write.table(data.frame(
  seqnameshomoPolymerHexa.gr),
  starthomoPolymerHexa.gr)-1,
  endhomoPolymerHexa.gr)),
  quote = F, col.names = F, row.names = F, sep="\t", file="Homopolymers_6AndMore.bed")
ADD COMMENTlink modified 13 months ago • written 16 months ago by ATpoint14k
4
gravatar for Pierre Lindenbaum
16 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

There is a module HomopolymerRun in GATK Variant Annotation: see https://software.broadinstitute.org/gatk/gatkdocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php and https://software.broadinstitute.org/gatk/gatkdocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_HomopolymerRun.php

I've also written a VCFPolyX: http://lindenb.github.io/jvarkit/VCFPolyX.html

ADD COMMENTlink written 16 months ago by Pierre Lindenbaum118k

Thank you. Your tool gets it done smoothly!

ADD REPLYlink modified 16 months ago • written 16 months ago by ATpoint14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1776 users visited in the last hour