Question: Why can't I use a TxDb object with the clusterMap function (as input or output)?
Asked 18 months ago by jhanks1981

I need to create several transcript database objects from different GTF files, so I would like to save time by running each txdb creation in parallel.

However, the txdb generated when I run the function in parallel doesn't behave like the one generated by a single call, even though the parallel run completes without errors. At first I thought it might have to do with the way I wrapped it in a function, but that does not seem to be the problem.

I do not understand why, in the minimal example below, "txdb" and "txdb2" are valid and "txdb3" is not. Does anyone have any ideas?

> require(GenomicFeatures)
> 
> txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf",format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: closing unused connection 4 (<-mordor-PC:11950) 
2: closing unused connection 3 (<-mordor-PC:11950) 
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular 
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end 
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end 
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id 
7: Named parameters not used in query: gene_id, internal_tx_id 
> 
> test <- function(gtffile="gtf_files/exons_final_sorted.gtf", ftype="gtf"){
+     nuTxDb <- makeTxDbFromGFF(file=gtffile, format = ftype)
+     return(nuTxDb)
+ }
> 
> txdb2 <- test(gtffile="gtf_files/exons_final_sorted.gtf", ftype="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular 
2: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end 
3: Named parameters not used in query: internal_id, name, chrom, strand, start, end 
4: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id 
5: Named parameters not used in query: gene_id, internal_tx_id 
> 
> 
> cluster <- makeCluster(2)
> dbeez <- clusterMap(cluster, makeTxDbFromGFF, file = c("gtf_files/exons_final_sorted.gtf", "gtf_files/exons_final_sorted.gtf"), format=c("gtf","gtf"))
> txdb3 <- dbeez[[1]]
> 
> typeof(txdb)
[1] "S4"
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf_files/exons_final_sorted.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 162738
# exon_nrow: 547144
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-30 15:20:30 -0500 (Sat, 30 Dec 2017)
# GenomicFeatures version at creation time: 1.26.4
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
> 
> typeof(txdb2)
[1] "S4"
> txdb2
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf_files/exons_final_sorted.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 162738
# exon_nrow: 547144
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-30 15:21:10 -0500 (Sat, 30 Dec 2017)
# GenomicFeatures version at creation time: 1.26.4
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
> 
> typeof(txdb3)
[1] "S4"
> txdb3
TxDb object:
Error in rsqlite_send_query(conn@ptr, statement) : 
  external pointer is not valid
>

Also, it seems that I cannot use a previously created and seemingly good txdb object in the clusterMap function, even though the exact same object and parameters work outside of it:

> require(GenomicFeatures)
> 
> 
> txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf",format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: closing unused connection 4 (<-mordor-PC:11299) 
2: closing unused connection 3 (<-mordor-PC:11299) 
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular 
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end 
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end 
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id 
7: Named parameters not used in query: gene_id, internal_tx_id 
> 
> list.per.geneA <- transcriptsBy(x=txdb, by="exon", use.names = TRUE)
Warning message:
In .set_group_names(grl, use.names, txdb, by) :
  some group names are NAs or duplicated
> list.per.geneB <- transcriptsBy(x=txdb, by="gene", use.names = FALSE)
> require(parallel)
> cluster <- makeCluster(2)
> list.per.gene <- clusterMap(cluster, transcriptsBy, x=c(txdb, txdb), by=c("exon", "gene"), use.names=c(TRUE,FALSE))
Error in checkForRemoteErrors(val) : 
  2 nodes produced errors; first error: invalid DB file

Most parallelization in R works via a fork(). I suspect that what you're seeing is that forked threads are getting invalid database handles.

written 17 months ago by Devon Ryan
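
One way to check the invalid-handle suspicion without any GTF files is to round-trip a plain SQLite connection through serialize()/unserialize(). This is only a sketch, and it rests on assumptions not stated in the thread: that makeCluster(2) is using its default socket (PSOCK) workers, which are separate R processes that send results back serialized rather than forked children; and that the DBI and RSQLite packages are installed. The in-memory database and the variable names are illustrative, not taken from the question.

```r
library(DBI)

# Open a throwaway in-memory SQLite database (illustrative, not a real TxDb).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
valid_before <- dbIsValid(con)

# Round-trip the connection object the way a result is shipped back from a
# socket worker: serialize it to raw bytes, then rebuild it from those bytes.
con_copy <- unserialize(serialize(con, NULL))
valid_after <- dbIsValid(con_copy)

# Compare the original handle with the round-tripped copy.
print(c(original = valid_before, round_tripped = valid_after))

dbDisconnect(con)
```

If the round-tripped copy reports as not valid, that would match the "external pointer is not valid" error for txdb3: the S4 wrapper survives the trip, but the pointer to the open database does not.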

Apologies if this is a naive question, but is that something I can do something about?

written 17 months ago by jhanks1981

No, that's not something you can do anything about. I suspect you simply can't use clusterMap() for this.

written 17 months ago by Devon Ryan
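
For anyone landing here later, a possible workaround (untested against the files in this thread) is to keep the TxDb objects inside the workers and pass only file paths across the cluster. The sketch below assumes AnnotationDbi::saveDb() and AnnotationDbi::loadDb() can be used to write a TxDb to an SQLite file and reopen it, that every worker sees the same file system, and that the output files do not already exist. The .sqlite output paths and the helper names build_and_save and transcripts_from_file are made up for illustration.

```r
library(parallel)

# GTF path from the question; add further GTF paths to this vector.
gtf_files <- c("gtf_files/exons_final_sorted.gtf")
# Hypothetical output locations, one SQLite file per GTF file.
sqlite_files <- sub("\\.gtf$", ".sqlite", gtf_files)

# Runs on a worker: build the TxDb, write it to disk, return only the path.
build_and_save <- function(gtf, out) {
  txdb <- GenomicFeatures::makeTxDbFromGFF(file = gtf, format = "gtf")
  AnnotationDbi::saveDb(txdb, file = out)
  out
}

# Runs on a worker: reopen the saved TxDb locally, return a plain GRangesList.
transcripts_from_file <- function(path, by) {
  txdb <- AnnotationDbi::loadDb(path)
  GenomicFeatures::transcriptsBy(txdb, by = by)
}

cl <- makeCluster(2)

# Build the databases in parallel; only character paths come back.
paths <- clusterMap(cl, build_and_save, gtf = gtf_files, out = sqlite_files)

# Reopen each database in the main session, where the handle is valid.
txdbs <- lapply(paths, AnnotationDbi::loadDb)

# Query in parallel by sending the path, not the TxDb object itself.
list.per.gene <- clusterMap(cl, transcripts_from_file,
                            path = rep(sqlite_files[1], 2),
                            by = c("exon", "gene"))

stopCluster(cl)
```

The idea is that character paths and GRangesList results are ordinary R data and should serialize cleanly, whereas the database handle inside a TxDb is tied to the process that opened it.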