How to generate start & end coordinates for each chromosome of human ref genome in bed. format (0-based index)
2
1
Entering edit mode
19 months ago
ekwame ▴ 90

Dear Biostars,

My aim is to generate start and end cordinates for each chromosome in human reference genome to use for other downstream analysis. so using R packages BSgenome and GenomicFeatures I generated the cordinates of hg38 genome and saved as .bed file. However, i think my cordinates (.bed) i generated are inaccurate because .bed files are 0-based index

Question; Any susggestions on how to generate a 0-based index .bed file for chromosomes of human reference genome?

codes used

################begingining of codes#################
library(BSgenome)
library(GenomicFeatures)
library(GenomeInfoDbData)


# download a local copy of BSgenome.Hsapiens.UCSC.hg38 using bioconductor
if (interactive()) {
  if (!require("BiocManager"))
    install.packages("BiocManager")
  BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
}

##load BSgenome.Hsapiens.UCSC.hg38 on current workspace
library(BSgenome.Hsapiens.UCSC.hg38)

##assign hg38 genone to `hg38` variable to save on typing the long name
hg38 <- getChromInfoFromUCSC("hg38")

##get cordinates of hg38 in GRaanges formart
hg38.cordinates <- GRanges(seqnames = hg38[,1],ranges=IRanges(end=hg38[,2],width = hg38[,2]))
ranges(hg38.cordinates)

###save as df and write out as a bed file
hg38.cordinates.dt <- cbind(hg38[,1],as.data.frame(ranges(hg38.cordinates)[,c(1,2)]))
dim(hg38.cordinates.dt)

write.table(file="/path/to/write_to/hg38.cordinates.bed",hg38.cordinates.dt,sep="\t",quote=F,row.names=F, col.names = c("chr","start","end","width"))

##########################end of codes################################
bed bedtools bioconductor genome • 1.0k views
ADD COMMENT
0
Entering edit mode

0-based simply means that you have to subtract 1 from the start coordinates, so the start coordinate of a chromosome is 0 and the end is the length of the chromosome.

ADD REPLY
3
Entering edit mode
19 months ago

Instead of using R, I suggest using the UCSC Kent Utilities kit. It includes a binary called fetchChromSizes that is perfect for this task. The link will take you to folders you can pick from for your platform (Linux or OS X) containing pre-compiled binaries, so it is a quick download.

You give fetchChromSizes the name of your assembly, and it gives you a list of chromosomes and their sizes. This can be piped to other tools to make a BED file easily and quickly.

For instance, to make a sorted, 0-indexed BED file:

$ fetchChromSizes hg38 | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2; }' | sort-bed - > hg38.all.bed

If you want to remove non-nuclear chromosomes (partial chromosomes and pseudochromosomes), you can add a grep statement in between to filter them out:

$ fetchChromSizes hg38 | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2; }' | grep -v ".*_.*" | sort-bed - > hg38.nuclearOnly.bed

This also makes use of BEDOPS sort-bed, for creating a sorted BED file, which is more efficient for use with set operations.

ADD COMMENT
1
Entering edit mode

This is great! Thanks!

ADD REPLY
0
Entering edit mode

awesome thanks

ADD REPLY
0
Entering edit mode
3 months ago

A simple and general way would be indexing your fasta file. It will generate a .fai file where the first column is the chromosome name, and the second column is the number of bases in it, therefore adding a middle '0' column is all you need to obtain your desired bed file:

samtools faidx hg38.fa
awk 'OFS="\t"{print $1, "0", $2}' hs38.fa.fai > hg38.bed
ADD COMMENT

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6