Question

[Solved] Trying to apply mapToTranscripts to data.table by row, getting incorrect amount of values

0

Entering edit mode

7.3 years ago

anmartinez • 0

I'm kind of new to data.tables and I have a table containing DNA genomic coordinates like this:

   chrom   pause strand coverage
1:     1 3025794      +        1
2:     1 3102057      +        2
3:     1 3102058      +        2
4:     1 3102078      +        1
5:     1 3108840      -        1
6:     1 3133041      +        1

I wrote a custom function that I want to apply to each row of my circa 2 million rows table, it uses GenomicFeatures' mapToTranscripts to retrieve the transcript ID for the position and the CDS coordinate. I want to add them to my table in two new columns, like this:

   chrom   pause strand coverage       transcriptID CDS
1:     1 3025794      +        1 ENSMUST00000116652 196
2:     1 3102057      +        2 ENSMUST00000116652  35
3:     1 3102058      +        2 ENSMUST00000156816 888
4:     1 3102078      +        1 ENSMUST00000156816 883
5:     1 3108840      -        1 ENSMUST00000156816 882
6:     1 3133041      +        1 ENSMUST00000156816 880

The function is the following:

get_feature <- function(dt){

  coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
  hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
  tx_id <- tx_names[as.character(seqnames(hit))] 
  cds_coordinate <- sapply(ranges(hit), '[[', 1)

  if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
    out <- list('NaN', 0)
  } else {
    out <- list(tx_id, cds_coordinate)
  }

  return(out)
}

Then, I do:

counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]

And I get this error, indicating that the function is returning two lists of shorter length than the original table, instead of one new element per row:

    Warning messages:
1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
  Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
  Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

I assumed that using the .I operator would apply the function on a row basis and return one value per row. I also made sure the function was not returning empty values using the if statement.

Then I tried this mock version of the function:

get_feature <- function(dt) {

  return('I should be returned once for each row')

}

And called it like this:

new.table <- counts[, get_feature(.SD), by = .I]

It makes a 1 row data table, instead of one the original length. So I concluded that my function, or maybe the way I'm calling it, is collapsing the elements of the resulting vector somehow. What am I doing wrong?

Update (with solution): I was able to figure out after deeper research around StackOverflow. As is explained in this answer and in ?data.table, .I is only intended for use in j (as in DT[i,j,by=]). Therefore, by=.I is equivalent to by=NULL and the proper syntax is by=1:nrow(dt) in order to group by row number and apply the function row-wise.

Unfortunately, for my particular case, this is utterly inefficient and I calculated an execution time of 20 seconds for 100 rows. For my 36 million row dataset that takes 3 months to complete.

In my case, I had to give up and use the mapToTranscripts function on the entire table like this, which takes a couple of seconds and was obviously the intended use.

    get_features <- function(dt){
      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
      hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
      tx_hit <- as.character(seqnames(hits)) # get transcript number
      tx_id <- tx_names[tx_hit] # get transcript name from translation table

      return(data.table('transcriptID'= tx_id, 
                       'CDS_coordinate' =  start(hits))
    }

     density <- counts[, get_features(.SD)]

Then map back to the genome using mapFromTranscripts from GenomicFeatures package so I could use a data.tables join to retrieve information from the original table, which was the intended purpose of what I was trying to do.

data.tables R GenomicRanges Ribo-seq • 1.9k views

ADD COMMENT • link 7.3 years ago by anmartinez • 0