Question

How to generate start & end coordinates for each chromosome of human ref genome in bed. format (0-based index)

1

Entering edit mode

4.1 years ago

Alewa ▴ 150

Dear Biostars,

My aim is to generate start and end cordinates for each chromosome in human reference genome to use for other downstream analysis. so using R packages BSgenome and GenomicFeatures I generated the cordinates of hg38 genome and saved as .bed file. However, i think my cordinates (.bed) i generated are inaccurate because .bed files are 0-based index

Question; Any susggestions on how to generate a 0-based index .bed file for chromosomes of human reference genome?

codes used

################begingining of codes#################
library(BSgenome)
library(GenomicFeatures)
library(GenomeInfoDbData)


# download a local copy of BSgenome.Hsapiens.UCSC.hg38 using bioconductor
if (interactive()) {
  if (!require("BiocManager"))
    install.packages("BiocManager")
  BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
}

##load BSgenome.Hsapiens.UCSC.hg38 on current workspace
library(BSgenome.Hsapiens.UCSC.hg38)

##assign hg38 genone to `hg38` variable to save on typing the long name
hg38 <- getChromInfoFromUCSC("hg38")

##get cordinates of hg38 in GRaanges formart
hg38.cordinates <- GRanges(seqnames = hg38[,1],ranges=IRanges(end=hg38[,2],width = hg38[,2]))
ranges(hg38.cordinates)

###save as df and write out as a bed file
hg38.cordinates.dt <- cbind(hg38[,1],as.data.frame(ranges(hg38.cordinates)[,c(1,2)]))
dim(hg38.cordinates.dt)

write.table(file="/path/to/write_to/hg38.cordinates.bed",hg38.cordinates.dt,sep="\t",quote=F,row.names=F, col.names = c("chr","start","end","width"))

##########################end of codes################################

bed bedtools bioconductor genome • 4.7k views

ADD COMMENT • link updated 2.8 years ago by Jorge Amigo 14k • written 4.1 years ago by Alewa ▴ 150

1

Entering edit mode

0-based simply means that you have to subtract 1 from the start coordinates, so the start coordinate of a chromosome is 0 and the end is the length of the chromosome.

ADD REPLY • link 2.8 years ago by ATpoint 81k

score 4 · Accepted Answer · 2020-03-19

Instead of using R, I suggest using the UCSC Kent Utilities kit. It includes a binary called fetchChromSizes that is perfect for this task. The link will take you to folders you can pick from for your platform (Linux or OS X) containing pre-compiled binaries, so it is a quick download.

You give fetchChromSizes the name of your assembly, and it gives you a list of chromosomes and their sizes. This can be piped to other tools to make a BED file easily and quickly.

For instance, to make a sorted, 0-indexed BED file:

$ fetchChromSizes hg38 | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2; }' | sort-bed - > hg38.all.bed

If you want to remove non-nuclear chromosomes (partial chromosomes and pseudochromosomes), you can add a grep statement in between to filter them out:

$ fetchChromSizes hg38 | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2; }' | grep -v ".*_.*" | sort-bed - > hg38.nuclearOnly.bed

This also makes use of BEDOPS sort-bed, for creating a sorted BED file, which is more efficient for use with set operations.

score 3 · Accepted Answer · 2021-07-11

A simple and general way would be indexing your fasta file. It will generate a .fai file where the first column is the chromosome name, and the second column is the number of bases in it, therefore adding a middle '0' column is all you need to obtain your desired bed file:

samtools faidx hg38.fa
awk 'OFS="\t"{print $1, "0", $2}' hs38.fa.fai > hg38.bed