Create a bed file bins from fasta?
2
0
Entering edit mode
22 months ago
simplitia ▴ 130

Hi, given a fasta file, I would like to create 1Mb bins until the end of the chromosome and store that in a bed file as such.

1   chr1         0   1000000
3   chr1   1000000   2000000
5   chr1   2000000   3000000
7   chr1   3000000   4000000
9   chr1   4000000   5000000
11  chr1   5000000   6000000

The input fasta is from human hg38 coordinates. Bonus would be great to do this in R but necessarily. thanks.

bed fasta • 916 views
ADD COMMENT
2
Entering edit mode
22 months ago
ATpoint 81k

Solution of rpolicastro is the easiest. Here a custom solution.

library(Biostrings)

#/ Some dummy data, here yeast genome
url <- "http://ftp.ensembl.org/pub/release-106/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz"
fasta <- readDNAStringSet(url)
names(fasta) <- gsub(" .*", "", names(fasta))

make_bins <- function(fasta, binsize){

  intervals <- 
    lapply(names(fasta), function(x){

      current <- fasta[[x]]
      div  <- length(current)/binsize
      chunks <- floor(div)
      rest <- length(current) - chunks*binsize

      if(chunks>0){

        from <- seq(0, chunks*binsize, binsize)
        to <- c(from[2:length(from)])

        if(rest==0){

          from <- from[1:(length(from)-1)]

        } else {
          to <- c(to, to[length(to)]+rest)
        }

      } else {

        from <- 0
        to <- length(current)

      }

      data.frame(chr=x, start=from, end=to)

    })

  intervals <- do.call(rbind, intervals)

  return(intervals) 

}



b <- make_bins(fasta, 50000)

head(b)

  chr  start    end
1   I      0  50000
2   I  50000 100000
3   I 100000 150000
4   I 150000 200000
5   I 200000 230218
6  II      0  50000
ADD COMMENT
1
Entering edit mode
22 months ago

Since you said you want to do this in R, you would first use the GenomicRanges::tileGenome function to generate a GRanges object with your genomic tiles. This can then be exported using rtracklayer::export.

As a generic example.

library("GenomicRanges")
library("rtracklayer")

chrm_sizes <- c(I=1e9, II=2e8)
genome_tiles <- tileGenome(chrm_sizes, tilewidth=1e6)

export(genome_tiles, "genome_tiles.bed", format="bed")
ADD COMMENT
1
Entering edit mode

+1

With data.frame(unlist(genome_tiles))[,1:3] to get a data.frame from that directly in R.

ADD REPLY

Login before adding your answer.

Traffic: 2370 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6