Why can't I use a GRanges txDb with clusterMap function? (as input or output)
6.3 years ago
jhanks1981 ▴ 10

I need to create several transcript database objects from different GTF files, so I would like to save time by running each txdb creation in parallel.

However, the txdb generated in parallel (although the call completes without errors) doesn't behave like the one generated by a single call to the function. At first I thought it might have to do with the way I wrapped it in a function, but that does not seem to be the problem.

I do not understand why, in the minimal example below, "txdb" and "txdb2" are valid and "txdb3" is not. Does anyone have any ideas?

> require(GenomicFeatures)
> 
> txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf",format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: closing unused connection 4 (<-mordor-PC:11950) 
2: closing unused connection 3 (<-mordor-PC:11950) 
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular 
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end 
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end 
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id 
7: Named parameters not used in query: gene_id, internal_tx_id 
> 
> test <- function(gtffile="gtf_files/exons_final_sorted.gtf", ftype="gtf"){
+     nuTxDb <- makeTxDbFromGFF(file=gtffile, format = ftype)
+     return(nuTxDb)
+ }
> 
> txdb2 <- test(gtffile="gtf_files/exons_final_sorted.gtf", ftype="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular 
2: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end 
3: Named parameters not used in query: internal_id, name, chrom, strand, start, end 
4: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id 
5: Named parameters not used in query: gene_id, internal_tx_id 
> 
> 
> cluster <- makeCluster(2)
> dbeez <- clusterMap(cluster, makeTxDbFromGFF, file = c("gtf_files/exons_final_sorted.gtf", "gtf_files/exons_final_sorted.gtf"), format=c("gtf","gtf"))
> txdb3 <- dbeez[[1]]
> 
> typeof(txdb)
[1] "S4"
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf_files/exons_final_sorted.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 162738
# exon_nrow: 547144
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-30 15:20:30 -0500 (Sat, 30 Dec 2017)
# GenomicFeatures version at creation time: 1.26.4
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
> 
> typeof(txdb2)
[1] "S4"
> txdb2
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf_files/exons_final_sorted.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 162738
# exon_nrow: 547144
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-30 15:21:10 -0500 (Sat, 30 Dec 2017)
# GenomicFeatures version at creation time: 1.26.4
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
> 
> typeof(txdb3)
[1] "S4"
> txdb3
TxDb object:
Error in rsqlite_send_query(conn@ptr, statement) : 
  external pointer is not valid
>

Also, it seems that I cannot pass a previously created and seemingly good txdb object to the clusterMap function, even though the exact same object and parameters work outside of it:

> require(GenomicFeatures)
> 
> 
> txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf",format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: closing unused connection 4 (<-mordor-PC:11299) 
2: closing unused connection 3 (<-mordor-PC:11299) 
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular 
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end 
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end 
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id 
7: Named parameters not used in query: gene_id, internal_tx_id 
> 
> list.per.geneA <- transcriptsBy(x=txdb, by="exon", use.names = TRUE)
Warning message:
In .set_group_names(grl, use.names, txdb, by) :
  some group names are NAs or duplicated
> list.per.geneB <- transcriptsBy(x=txdb, by="gene", use.names = FALSE)
> require(parallel)
> cluster <- makeCluster(2)
> list.per.gene <- clusterMap(cluster, transcriptsBy, x=c(txdb, txdb), by=c("exon", "gene"), use.names=c(TRUE,FALSE))
Error in checkForRemoteErrors(val) : 
  2 nodes produced errors; first error: invalid DB file
R genomicFeatures parallel clusterMap

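Most parallelization in R works by running the function in separate processes (forked ones, or with makeCluster() separate R sessions connected over sockets by default). I suspect that what you're seeing is that the SQLite database handle inside the TxDb does not survive being passed between processes, so you end up with invalid database handles ("external pointer is not valid").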


Apologies if this is a naive question, but is that something I can do something about?


No, that's not something you can do anything about. I suspect you simply can't use clusterMap() for this.
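
A possible workaround, offered as a minimal untested sketch rather than a confirmed fix: the database handle itself cannot cross process boundaries, but a file path can. The idea is to have each worker write its TxDb to an SQLite file with saveDb() and return only the path, then reopen the file with loadDb() in the master session. The helper names (build_txdb_file, build_txdbs_parallel, transcripts_by_from_file) and the ".sqlite" output naming are hypothetical, and this assumes GenomicFeatures and AnnotationDbi are installed for the workers too. In recent Bioconductor releases makeTxDbFromGFF() may live in the txdbmaker package instead.

```r
library(parallel)

# Hypothetical helper, run on a worker: build the TxDb, write it to an
# SQLite file next to the GTF, and return only the file path. A plain
# character string survives the trip back to the master; the TxDb does not.
build_txdb_file <- function(gtffile, ftype = "gtf") {
    txdb <- GenomicFeatures::makeTxDbFromGFF(file = gtffile, format = ftype)
    outfile <- paste0(gtffile, ".sqlite")  # assumed naming scheme
    AnnotationDbi::saveDb(txdb, file = outfile)
    outfile
}

# Hypothetical wrapper, run on the master: build the files in parallel,
# then reopen each one so the connection is created in this session.
build_txdbs_parallel <- function(gtffiles, ftype = "gtf", ncores = 2) {
    cluster <- makeCluster(ncores)
    on.exit(stopCluster(cluster))
    dbfiles <- clusterMap(cluster, build_txdb_file, gtffile = gtffiles,
                          MoreArgs = list(ftype = ftype))
    lapply(dbfiles, AnnotationDbi::loadDb)
}

# Hypothetical helper for the opposite direction: instead of sending a TxDb
# to a worker, send the path of a saved database and open it on the worker.
transcripts_by_from_file <- function(dbfile, by, use.names) {
    txdb <- AnnotationDbi::loadDb(dbfile)
    GenomicFeatures::transcriptsBy(txdb, by = by, use.names = use.names)
}

# Usage sketch (paths taken from the question; not run here):
# txdbs <- build_txdbs_parallel(c("gtf_files/exons_final_sorted.gtf"))
# f <- "gtf_files/exons_final_sorted.gtf.sqlite"
# cluster <- makeCluster(2)
# res <- clusterMap(cluster, transcripts_by_from_file, dbfile = c(f, f),
#                   by = c("exon", "gene"), use.names = c(TRUE, FALSE))
# stopCluster(cluster)
```

Whether this actually saves time depends on how long writing and reopening the SQLite files takes relative to building the TxDb, so it is worth timing on one file first.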
