Question: [Solved] Trying to apply mapToTranscripts to data.table by row, getting incorrect amount of values
gravatar for anmartinez
3.4 years ago by
anmartinez0 wrote:

I'm kind of new to data.tables and I have a table containing DNA genomic coordinates like this:

   chrom   pause strand coverage
1:     1 3025794      +        1
2:     1 3102057      +        2
3:     1 3102058      +        2
4:     1 3102078      +        1
5:     1 3108840      -        1
6:     1 3133041      +        1

I wrote a custom function that I want to apply to each row of my circa 2 million rows table, it uses GenomicFeatures' mapToTranscripts to retrieve the transcript ID for the position and the CDS coordinate. I want to add them to my table in two new columns, like this:

   chrom   pause strand coverage       transcriptID CDS
1:     1 3025794      +        1 ENSMUST00000116652 196
2:     1 3102057      +        2 ENSMUST00000116652  35
3:     1 3102058      +        2 ENSMUST00000156816 888
4:     1 3102078      +        1 ENSMUST00000156816 883
5:     1 3108840      -        1 ENSMUST00000156816 882
6:     1 3133041      +        1 ENSMUST00000156816 880

The function is the following:

get_feature <- function(dt){

  coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
  hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
  tx_id <- tx_names[as.character(seqnames(hit))] 
  cds_coordinate <- sapply(ranges(hit), '[[', 1)

  if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
    out <- list('NaN', 0)
  } else {
    out <- list(tx_id, cds_coordinate)


Then, I do:

counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]

And I get this error, indicating that the function is returning two lists of shorter length than the original table, instead of one new element per row:

    Warning messages:
1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
  Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
  Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

I assumed that using the .I operator would apply the function on a row basis and return one value per row. I also made sure the function was not returning empty values using the if statement.

Then I tried this mock version of the function:

get_feature <- function(dt) {

  return('I should be returned once for each row')


And called it like this:

new.table <- counts[, get_feature(.SD), by = .I]

It makes a 1 row data table, instead of one the original length. So I concluded that my function, or maybe the way I'm calling it, is collapsing the elements of the resulting vector somehow. What am I doing wrong?

Update (with solution): I was able to figure out after deeper research around StackOverflow. As is explained in this answer and in ?data.table, .I is only intended for use in j (as in DT[i,j,by=]). Therefore, by=.I is equivalent to by=NULL and the proper syntax is by=1:nrow(dt) in order to group by row number and apply the function row-wise.

Unfortunately, for my particular case, this is utterly inefficient and I calculated an execution time of 20 seconds for 100 rows. For my 36 million row dataset that takes 3 months to complete.

In my case, I had to give up and use the mapToTranscripts function on the entire table like this, which takes a couple of seconds and was obviously the intended use.

    get_features <- function(dt){
      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
      hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
      tx_hit <- as.character(seqnames(hits)) # get transcript number
      tx_id <- tx_names[tx_hit] # get transcript name from translation table

      return(data.table('transcriptID'= tx_id, 
                       'CDS_coordinate' =  start(hits))

     density <- counts[, get_features(.SD)]

Then map back to the genome using mapFromTranscripts from GenomicFeatures package so I could use a data.tables join to retrieve information from the original table, which was the intended purpose of what I was trying to do.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by anmartinez0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1252 users visited in the last hour