Question

Transform a data.frame to GRranges in R

0

Entering edit mode

20 months ago

LDT ▴ 330

Dear all,

I have a data.frame that I want to convert to GRranges format. When I use the following command I am taking the following errors

suppressPackageStartupMessages(library(GenomicRanges))

bed <- data.frame(chrom=c(rep("Chr1",5)),
                  chromStart=c(18915152,24199229,73730,81430,89350),
                  chromEnd=c(18915034,24199347,74684,81550,89768), 
                  strand=c("-","+","+","+","+"))
bed
#>   chrom chromStart chromEnd strand
#> 1  Chr1   18915152 18915034      -
#> 2  Chr1   24199229 24199347      +
#> 3  Chr1      73730    74684      +
#> 4  Chr1      81430    81550      +
#> 5  Chr1      89350    89768      +
makeGRangesFromDataFrame(bed)
#> Error in .width_as_unnamed_integer(width, msg = "an end that is greater or equal to its start minus one"): each range must have an end that is greater or equal to its start minus
#>   one

^{Created on 2022-08-01 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)}

How do people deal with this? Do they flip the order of the " - " strand?

Any help or suggestion is welcome

GenomicRanges R data.frame bed • 3.7k views

ADD COMMENT • link updated 16 months ago by Mannu • 0 • written 20 months ago by LDT ▴ 330

0

Entering edit mode

I am using Genomation package within MethylKit and I am getting a similar error while using T2TCHM13v2.0/hs1 assembly. It worked well when I worked with NCBI RefSeq.bed file but showing the error "Error in .width_as_unnamed_integer(width, msg = "an end that is greater or equal to its start minus one") : each range must have an end that is greater or equal to its start minus one" when used with CAT/Liftoff Genes.bed file.

I also checked manually the start and end coordinates to confirm the start< end, which is perfectly fine. Is it something to do with the CAT/LiftOff Genes.bed file? I find this bed file useful as it has gene symbols rather than NCBI RefSeq IDs.

Below is the error:

gene.obj=readTranscriptFeatures(system.file("extdata","catLiftOffGenes_T2T_wholegene.bed", package = "methylKit")) Reading the table... Calculating intron coordinates...
Error in .width_as_unnamed_integer(width, msg = "an end that is greater or equal to its start minus one") : each range must have an end that is greater or equal to its start minus one

ADD REPLY • link 16 months ago by Mannu • 0

0

Entering edit mode

Please do not asking questions in existing threads, open a new question with the necessary details and a reproducible example.

ADD REPLY • link 16 months ago by ATpoint 81k

2

Entering edit mode

20 months ago

ATpoint 81k

This is not a valid BED file. Start must always be smaller than end., regardless of strand. Downstream tools are/must be smart enough to internally convert start to end in case of minus strand. You will need to sort that out by going through the file line by line and swapping start and end so start < end.

Something like this:

library(GenomicRanges)

bed <- data.frame(chrom=c(rep("Chr1",5)),
                  chromStart=c(18915152,24199229,73730,81430,89350),
                  chromEnd=c(18915034,24199347,74684,81550,89768), 
                  strand=c("-","+","+","+","+"))

bed
#>   chrom chromStart chromEnd strand
#> 1  Chr1   18915152 18915034      -
#> 2  Chr1   24199229 24199347      +
#> 3  Chr1      73730    74684      +
#> 4  Chr1      81430    81550      +
#> 5  Chr1      89350    89768      +

wrong <- which(bed$chromStart > bed$chromEnd)

new_end   <- bed[wrong,]$chromStart
new_start <- bed[wrong,]$chromEnd

bed[wrong,]$chromStart <- new_start
bed[wrong,]$chromEnd   <- new_end

makeGRangesFromDataFrame(bed)
#> GRanges object with 5 ranges and 0 metadata columns:
#>       seqnames            ranges strand
#>          <Rle>         <IRanges>  <Rle>
#>   [1]     Chr1 18915034-18915152      -
#>   [2]     Chr1 24199229-24199347      +
#>   [3]     Chr1       73730-74684      +
#>   [4]     Chr1       81430-81550      +
#>   [5]     Chr1       89350-89768      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
Created on 2022-08-01 by the reprex package (v2.0.1)

ADD COMMENT • link 20 months ago by ATpoint 81k

0

Entering edit mode

Its a data.frame called bed

ADD REPLY • link 20 months ago by LDT ▴ 330

0

Entering edit mode

You need to figure out why the ranges are incorrect. I suggested in your other post that the start and end positions might have been flipped for minus strand ranges, but that was speculation and needs to be confirmed. If this is the case you just need to switch them back before creating the GRanges object. Additionally, you need to specify what column corresponds to start, end, etc. in the makeGRangesFromDataFrame function. See the function documentation for more information ?makeGRangesFromDataFrame.

ADD REPLY • link 20 months ago by rpolicastro 13k

0

Entering edit mode

Fine, GenomicRanges still obeys to BED format and as such it is not a valid input for it.

ADD REPLY • link 20 months ago by ATpoint 81k

0

Entering edit mode

i added a code suggestion

ADD REPLY • link 20 months ago by ATpoint 81k

0

Entering edit mode

thank you ATpoint for your time and guidance. I think after a few days I figured it out. I appreciate the time you took

ADD REPLY • link 20 months ago by LDT ▴ 330

score 2 · Accepted Answer · 2022-08-02

Here the answer includes all the steps needed to go from a data.frame to a functional bed file in R and also how to transform a data.frame to to Genomic Ranges object

suppressPackageStartupMessages(library(GenomicRanges))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(tidyverse))

# data 
bed <- data.frame(chrom=c(rep("Chr1",5)),
                  chromStart=c(18915152,24199229,73730,81430,89350),
                  chromEnd=c(18915034,24199347,74684,81550,89768), 
                  strand=c("-","+","+","+","+"))

# transform such as always chromStart < chromEnd
bed2 <- bed |> 
transform(chromStart=ifelse(chromStart>chromEnd,chromEnd,chromStart),
          chromEnd= ifelse(chromEnd<chromStart,chromStart,chromEnd))

# Genomic Ranges 
bed3 <- GenomicRanges::makeGRangesFromDataFrame(bed2)
head(bed3)
#> GRanges object with 5 ranges and 0 metadata columns:
#>       seqnames            ranges strand
#>          <Rle>         <IRanges>  <Rle>
#>   [1]     Chr1 18915034-18915152      -
#>   [2]     Chr1 24199229-24199347      +
#>   [3]     Chr1       73730-74684      +
#>   [4]     Chr1       81430-81550      +
#>   [5]     Chr1       89350-89768      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# rtracklayer 
bed4 <- rtracklayer::export(bed3, format="bed", ignore.strand = FALSE)
bed4
#> [1] "Chr1\t18915033\t18915152\t.\t0\t-" "Chr1\t24199228\t24199347\t.\t0\t+"
#> [3] "Chr1\t73729\t74684\t.\t0\t+"       "Chr1\t81429\t81550\t.\t0\t+"      
#> [5] "Chr1\t89349\t89768\t.\t0\t+"

# write it as a bed file
# this is essential to make sure that this works properly 
write.table(bed4, "test.bed", sep="\t", col.names=FALSE, row.names = FALSE, append = TRUE, quote = FALSE)

^{Created on 2022-08-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)}