Transform genomic intervals to genomic positions in an R dataframe
4.1 years ago
jeni ▴ 90

Hi everyone!

I have a dataframe with some genomic intervals and their corresponding coverage in several samples:

            sample1  sample2   sample3
     1:1-3    30        NA      NA
     1:1-4    NA        40      35
     1:4-5    35        NA      NA
     1:5-7    NA        50      50
     1:6-7    60        NA      NA

I would like to obtain the same dataframe but for genomic positions:

            sample1    sample2     sample3
     1:1      30         40          35
     1:2      30         40          35
     1:3      30         40          35
     1:4      35         40          35
     1:5      35         50          50 
     1:6      60         50          50
     1:7      60         50          50

How could I get this?


The intervals can be obtained first from the row names. Then use strsplit to get the chromosome (first element) and the start and end of the range (second and third elements). You can either put these into a data frame and then use makeGRangesFromDataFrame, or call GRanges directly to construct a GRanges object. The coverages can be stored as elementMetadata in the resulting GRanges object. I suggest you try that out; it is good practice.
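For reference, the suggestion above might look roughly like this. It is only a sketch: the toy data frame is retyped from the question, the names parts, coords and df2 are made up for illustration, and it assumes chromosome names contain neither ":" nor "-". The GRanges step is skipped if GenomicRanges is not installed.

```r
# Toy data retyped from the question; row names hold "chr:start-end"
df <- data.frame(
  sample1 = c(30, NA, 35, NA, 60),
  sample2 = c(NA, 40, NA, 50, NA),
  sample3 = c(NA, 35, NA, 50, NA),
  row.names = c("1:1-3", "1:1-4", "1:4-5", "1:5-7", "1:6-7")
)

# Split every row name on ":" or "-" into chromosome, start and end
# (assumption: chromosome names themselves contain no ":" or "-")
parts <- do.call(rbind, strsplit(rownames(df), "[:-]"))
coords <- data.frame(
  chrm  = parts[, 1],
  start = as.integer(parts[, 2]),
  end   = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)

# Coordinates first, coverage columns after
df2 <- cbind(coords, df)

# Build the GRanges object only if the Bioconductor package is available;
# the coverage columns are kept as metadata columns
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  gr <- GenomicRanges::makeGRangesFromDataFrame(
    df2,
    seqnames.field = "chrm",
    start.field = "start",
    end.field = "end",
    keep.extra.columns = TRUE
  )
  print(gr)
}
```

Keeping the coverage columns numeric at this stage (rather than character) makes later per-position work easier.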


Okay, thanks! I have already done that.

But now how can I get genomic positions from each interval, indicating the coverage value of each sample for each position?


Can you show what you have done?


Sure! I've converted my dataframe into a GRanges object (I first split the genomic coordinates into this format: chr, start, end):

gr <- makeGRangesFromDataFrame(df, seqnames.field = 'chrm', start.field = 'start', end.field = 'end', keep.extra.columns = TRUE)

GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |     sample1     sample2     sample3
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]        1       1-3      * |          30          NA          NA
  [2]        1       1-4      * |          NA          40          35
  [3]        1       4-5      * |          35          NA          NA
  [4]        1       5-7      * |          NA          50          50
  [5]        1       6-7      * |          60          NA          NA
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, I have tried:

grd <- disjoin(gr)

and I get this:

GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-3      *
  [2]        1         4      *
  [3]        1         5      *
  [4]        1       6-7      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

In this example I don't get every single position, but in my real df I do, because it has many overlapping intervals. The problem now is that I don't know how to keep the metadata columns and adapt them to the new ranges. What I would like to obtain is this:

GRanges object with 4 ranges and 3 metadata columns:
      seqnames    ranges strand |     sample1     sample2     sample3
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]        1       1-3      * |          30          40          35
  [2]        1         4      * |          35          40          35
  [3]        1         5      * |          35          50          50
  [4]        1       6-7      * |          60          50          50
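One possible way around the lost metadata is to skip disjoin for this step and expand every interval to single positions in plain base R, then collapse rows that share a position. This is a different technique from the GRanges route discussed above, not a fix for it. The sketch below uses the toy data from the question; the names long, per_pos, result and first_non_na are made up for illustration. It assumes the coverage columns are numeric, that chromosome names contain neither ":" nor "-", and that overlapping intervals never give conflicting non-NA values for the same sample (if they do, the first one wins).

```r
# Toy data retyped from the question; row names hold "chr:start-end"
df <- data.frame(
  sample1 = c(30, NA, 35, NA, 60),
  sample2 = c(NA, 40, NA, 50, NA),
  sample3 = c(NA, 35, NA, 50, NA),
  row.names = c("1:1-3", "1:1-4", "1:4-5", "1:5-7", "1:6-7")
)

# Parse chromosome, start and end from the row names
parts <- do.call(rbind, strsplit(rownames(df), "[:-]"))
chrm  <- parts[, 1]
start <- as.integer(parts[, 2])
end   <- as.integer(parts[, 3])

# Expand each interval into one row per position
len <- end - start + 1L
idx <- rep(seq_len(nrow(df)), len)       # source interval of every expanded row
pos <- sequence(len) + start[idx] - 1L   # genomic position of every expanded row
long <- data.frame(chrm = chrm[idx], pos = pos, df[idx, , drop = FALSE],
                   row.names = NULL)

# Hypothetical helper: first non-NA value, or NA if there is none
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else NA
}

# Collapse rows that share a position, one value per sample
per_pos <- aggregate(long[names(df)], by = long[c("chrm", "pos")],
                     FUN = first_non_na)

# Sort by chromosome, then position (chromosomes sort as text here)
per_pos <- per_pos[order(per_pos$chrm, per_pos$pos), ]
rownames(per_pos) <- paste0(per_pos$chrm, ":", per_pos$pos)
result <- per_pos[names(df)]
print(result)
```

On the toy data this should give the per-position table from the original question. If you would rather stay with GRanges, disjoin(gr, with.revmap = TRUE) records which input ranges make up each disjoint range, which could be used to pull the non-NA values across in a similar way.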