Question: Random shuffling of features leaving gene models intact
1
gravatar for Michael Dondrup
6.5 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

I am looking for a tool that can randomly shuffle gff features into intergenic regions, but leaving the gene-models 'intact', so that at least all features of a gene (transcript, exon, CDS, etc.) are placed on the same contig and related features are placed fully inside the interval of their parent region. Bedtools shuffle doesn't seem to do that, I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene-models, if I add -chrom, features are placed on the same contig, but not all features can be placed at all and the resulting gene-models are still not valid. Does anyone maybe have some R-code for this use-case? 

shuffle gff genome bedtools • 2.4k views
ADD COMMENTlink modified 4.6 years ago • written 6.5 years ago by Michael Dondrup48k

Thanks for pointing this out - it is a feature that we should certainly provide.

ADD REPLYlink written 6.5 years ago by Aaronquinlan11k
2
gravatar for Michael Dondrup
6.5 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

Here's my second attempt, this time in Perl. It does more or less what I want with good speed, maybe I will add a few more options and documentation in the future. This solution is really fast enough, I tried to shuffle 100,000 features thereof 30,000 top-level genes over 30,000 contigs in a few seconds, more or less the time it takes to do the parsing and IO. Feel free to comment on possible improvements bugs, horrific code...

http://mdondrup.github.io/shuffleGff/

Hint, to place genes only in intergenic regions use the same gff file as exclude and input.

ADD COMMENTlink modified 4.6 years ago • written 6.5 years ago by Michael Dondrup48k
0
gravatar for Michael Dondrup
6.5 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

Here, is my first naive attempt in R using a loop over a GRanges object. It works but is painfully slow as expected. Takes about 110 seconds for 1000 features on my computer. Afaik, one cannot use apply with GRanges. The way it is implemented now works almost without re-drawing, except for too small contigs. It might be faster to draw and assign everything in one go though using apply and then just re-draw features overlapping exclude. Any comment is welcome.

require(GenomicRanges)

shuffle <- function(genome, excludes=NULL, landmark.prob=TRUE,
                   keep.gene.models=TRUE, max.retries=10000) {
### make a lookup vector from the Parent character list
  if (! is.null(genome$Parent))
    lookup.parent <- unlist(sapply(genome$Parent, function(x)if (length(x) == 0) 'NA' else x ))
  rdata <- as(excludes, "RangedData")
### Foreach feature that doesn't have a parent
### calculate gaps for placing objects:
  gaplist <- gaps(ranges(rdata), start=1, end = seqlengths(genome))

### END preapations
### iterate only over the top-level
  myset <- if ( keep.gene.models && ! is.null(genome$Parent))
    which(lookup.parent == "NA")
  else
    1:length(genome)
  for (i in myset) { 
    f <- genome[i]
### select a contig or chromosome:
    landmark <- NULL
    for (c in 1:max.retries) {
      prob <- if (landmark.prob)
        (seqlengths(genome))/max((seqlengths(genome)))
      else NULL
      landmark <- sample(seqlevels(genome), 1,  prob=prob )
### redraw if landmark doesn't have enough free space (gene length > max gap length)
      if (max(width(gaplist[[landmark]])) >= width(f)) break;
    }
    if (is.null(landmark)) {
      warning ("couldn't place feature");
      next;
    }
### end choose chromosome       
### choose a random start position from landmark (at least gene-length away from any exclude region)
### flank position intervals on start position with - feature width will ensure that.
### we have already chosen landmark such that there is enough space.
    new.start <- as.numeric(sample(as.integer(flank(gaplist[[landmark]], -1*width(f))), 1))

  #new.start <- (sample(as.integer(gaplist[[landmark]]), 1))

    offset <- new.start-start(f)
### change feature coordinates
    seqnames(genome[i]) <- factor(landmark, levels=seqlevels(f)) 
    genome[i] <- shift(genome[i], offset)
### collect all dependent features and adjust their location to the new origin of the parent
    if (keep.gene.models & ! is.null(genome$Parent)) {
     children <- get.children.i(i, genome, lookup.parent)   
      if (!is.null(children)) {
        seqnames(genome[children]) <- factor(landmark, levels=seqlevels(genome))

        genome[children] <- shift(genome[children], offset)
      }
    ## NEXT feature
    }
   }
  return (genome)
}

## look for all children recursively
## this is slightly faster than using the objects
get.children.i <- function(i, allfeatures, lookup.parent) {
  m <- match(allfeatures[i]$ID, lookup.parent)
  if is.na(m))
    return(c())
  else {
    ch <- get.children.i(m, allfeatures, lookup.parent)
    return (c(m, ch))
  }
}
ADD COMMENTlink modified 10 months ago by RamRS30k • written 6.5 years ago by Michael Dondrup48k

If you are able to represent the gene models in BED12 format, a relatively simple change could be made to the bedtools `shuffle` command. Let me know if that is an option.

ADD REPLYlink written 6.5 years ago by Aaronquinlan11k

Hi Aaron, I think it is possible to convert from GFF3 to BED12. It might be nice to have this option (we will be 2 here in Bergen who would use it), but please only if it is not too much effort.

ADD REPLYlink written 6.5 years ago by Michael Dondrup48k

It is 90% of the way there already, so let me try to have a look in the next day or two. Will get back to you.

ADD REPLYlink written 6.5 years ago by Aaronquinlan11k

Hi Aaron, How could I use the bedtools shuffle to solve this problem?

ADD REPLYlink written 5.4 years ago by Richard Yang0

Hi, OP here, I haven't checked if it is working now in bedtools, but in the meantime, you can use my perl script from the other answer.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Michael Dondrup48k

Regarding running apply on GRanges objects, you just have to split() them first and then run lapply on the resulting list. Yes, this is kind of annoying and I'm pretty sure I'd seen multiple requests for direct support of GRanges objects by mapply or apply.

ADD REPLYlink written 6.5 years ago by Devon Ryan97k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1402 users visited in the last hour