Question: getGEO and getGEOfile error
0
gravatar for pengchy
2.1 years ago by
pengchy410
China/Beijing
pengchy410 wrote:

I have tried the following two methods to download and read the GSE data, but failed at the end.

#first download the SOFT file to local, then read in.

> gse1 <- getGEOfile("GSE11724",destdir="F:/GSEfiles/",amount="brief")
File stored at: 
F://GSEfiles//GSE11724.soft
> gse1
[1] "F:/GSEfiles//GSE11724.soft"
> ?getGEO
> getGEO(filename="F:/GSEfiles/GSE11724.soft")
Parsing....

Error in readLines(con, lineCounts[1, 1] - 1) : invalid 'n' argument

 

#the second method is download and parsing:

> gse1 <- getGEO("GSE11724",GSEMatrix=TRUE,getGPL=FALSE,destdir="F:/GSEfiles/")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11724/matrix/
OK
Found 1 file(s)
GSE11724_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11724/matrix/GSE11724_series_matrix.txt.gz'
Content type 'application/x-gzip' length 125545 bytes (122 KB)
downloaded 122 KB

Error in value[[3L]](cond) : duplicate row.names: 
  AnnotatedDataFrame 'initialize' could not update varMetadata:
  perhaps pData and varMetadata are inconsistent?
In addition: Warning message:
closing unused connection 3 (F:/GSEfiles/GSE11724.soft)

I have checked the source code and found that getGEO will invoke parseGEO, which will invoke parseGSE, the source of this function listed below. However, I cann't test the function, because the filegrep function can not be found.

> parseGSE
function (fname, GSElimits = NULL) 
{
    gsmlist <- list()
    gpllist <- list()
    GSMcount <- 0
    writeLines("Parsing....")
    con <- fileOpen(fname)
    lineCounts <- filegrep(con, "\\^(SAMPLE|PLATFORM)", chunksize = 10000)
    message(sprintf("Found %d entities...", nrow(lineCounts)))
    close(con)
    con <- fileOpen(fname)
    a <- readLines(con, lineCounts[1, 1] - 1)
    header = parseGeoMeta(a)
    for (j in 1:nrow(lineCounts)) {
        tmp <- strsplit(as.character(lineCounts[j, 2]), " = ")[[1]]
        accession <- tmp[2]
        message(sprintf("%s (%d of %d entities)", accession, 
            j, nrow(lineCounts)))
        entityType <- tolower(sub("\\^", "", tmp[1]))
        nLinesToRead <- lineCounts[j + 1, 1] - lineCounts[j, 
            1] - 1
        if (j == nrow(lineCounts)) {
            nLinesToRead <- NULL
        }
        if (entityType == "sample") {
            GSMcount = GSMcount + 1
            if (is.null(GSElimits)) {
                gsmlist[[accession]] <- .parseGSMWithLimits(con, 
                  n = nLinesToRead)
            }
            else {
                if ((GSMcount >= GSElimits[1]) & (GSMcount <= 
                  GSElimits[2])) {
                  gsmlist[[accession]] <- .parseGSMWithLimits(con, 
                    n = nLinesToRead)
                }
                else {
                  if (!is.null(nLinesToRead)) {
                    readLines(con, n = nLinesToRead)
                  }
                }
            }
        }
        if (entityType == "platform") {
            gpllist[[accession]] <- .parseGPLWithLimits(con, 
                n = nLinesToRead)
        }
    }
    close(con)
    return(new("GSE", header = header, gsms = gsmlist, gpls = gpllist))
}
<environment: namespace:GEOquery>
getgeo bioconductor R geoquery • 1.3k views
ADD COMMENTlink modified 2.1 years ago by theobroma221.1k • written 2.1 years ago by pengchy410

Because I only want to extract the taxonomy information, I have resolved this problem by scan the downloaded files *.soft, and extract the data one by one. However, it seems that the parsing function from GEOquery need update.

ADD REPLYlink written 2.1 years ago by pengchy410
1

You should carefully read/look at/do the vignette before attempting to use it for your own good. If all you require is the taxid, please see my reply below.

ADD REPLYlink written 2.1 years ago by theobroma221.1k
2
gravatar for theobroma22
2.1 years ago by
theobroma221.1k
theobroma221.1k wrote:

Ok, I see. At first I thought you were trying to call a dataset from the GEOquery library. But this is incorrect. So, you can't use the code you provided because that is to call the data that comes with the package library, as it is stated in the vignette.

You should use:

gse1 = getGEO("GSE11724", GSEMatrix = FALSE)

This will take awhile to download all 26 files.

Then,

> show(gse1)

An object of class "GSE" Slot "header": $contact_address [1] "9 Cambridge Center"

$contact_city [1] "Cambridge"

$contact_country [1] "USA"

$contact_department [1] "Genome Technology Core"

$contact_email [1] "sgupta@wi.mit.edu"

$contact_institute [1] "Whitehead Institute"

$contact_name [1] "Whitehead,,Institute"

$contact_phone [1] "617-324-0339"

$contact_state [1] "MA"

$contact_zip/postal_code [1] "02141"

$contributor [1] "Richard,A,Young"

$email [1] "geo@ncbi.nlm.nih.gov"

$geo_accession [1] "GSE11724"

$institute [1] "NCBI NLM NIH"

$last_update_date [1] "Dec 07 2016"

$name [1] "Gene Expression Omnibus (GEO)"

$overall_design [1] "ChIP-seq in murine embryonic stem cells for Oct4, Sox2 (2 runs), Nanog (2 runs), Tcf3 (2 r...

$platform_id [1] "GPL9185"

$platform_taxid [1] "10090"

$pubmed_id [1] "18692474"

$relation [1] "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRP000712"
[2] "BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA106023"

$sample_id [1] "GSM307137" "GSM307138" "GSM307139" "GSM307140" "GSM307141" "GSM307142" "GSM307143" "GSM307144" "GSM307145" [10] "GSM307146" "GSM307147" "GSM307148" "GSM307149" "GSM307150" "GSM307151" "GSM307152" "GSM307153" "GSM307154" [19] "GSM307155" "GSM307156" "GSM307157" "GSM307158" "GSM307159" "GSM307160" "GSM307161"

$sample_taxid [1] "10090"

$status [1] "Public on Jul 30 2008"

$submission_date [1] "Jun 09 2008"

$summary [1] "MicroRNAs (miRNAs) are crucial for normal embryonic stem (ES) cell self-renewal and cellular differentiation, but how miRNA gene expression is controlled by the key transcriptional regulators of ES cells has not been established... [2] ""
[3] "Keywords: ChIP-seq analysis ...

$supplementary_file [1] "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP007/SRP000712" [2] "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000712" [3] "ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/series/GSE11724/GSE11724_RAW.tar" ......... ......... .........

It's unfortunate but the submitter may have submitted the pData and varMetaData with errors. I've found that this is not uncommon. So, you might have to do what you want to do outside of the GEOquery package and handle the data yourself.

Hope this helps.

ADD COMMENTlink written 2.1 years ago by theobroma221.1k
1
gravatar for theobroma22
2.1 years ago by
theobroma221.1k
theobroma221.1k wrote:

In the first section you seem to be missing a forward slash in the filename when you called the getGEO function. Did you try using filename=gse1? Is the file in that directory and did you set the working directory to that location?

ADD COMMENTlink written 2.1 years ago by theobroma221.1k

Thank you, I have tried as you suggested, the error remain.

ADD REPLYlink written 2.1 years ago by pengchy410
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1350 users visited in the last hour