Question: Extracting Information From GEO SOFT Files
0
7.9 years ago by
Layla50
Layla50 wrote:

Hi: I am working with SOFT files downloaded from GEO for specific diseases. Is there a specific field or tag inside a GEO SOFT file that indicates whether a given sample (a GSM, in this case) belongs to the control group or the disease group? Can I obtain that information using an external package such as GEOquery? Thanks

2
7.9 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

It sounds like you should be searching for the GSE (or GDS) rather than the GSMs directly. The experiment-level information is in the GSE (or GDS), while the sample-level information is in the GSM file. GEOquery can deal with all GEO data types, so I would recommend giving it a try before reinventing the wheel.
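
To make the series-level versus sample-level distinction concrete, here is a minimal sketch in Python (chosen only for illustration; it is not GEOquery and not part of this answer). It relies on the SOFT convention that lines starting with `^` open an entity (series, sample, platform) and lines starting with `!` carry `label = value` attributes. The function name `parse_soft` and the sample accession `GSM_EXAMPLE_1` are made up for the example; the series title and sample title are taken from GSE4183 as quoted elsewhere in this thread.

```python
from collections import defaultdict


def parse_soft(text):
    """Group SOFT attribute lines by the entity (^SERIES, ^SAMPLE, ...) they follow."""
    entities = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("^"):
            # Entity line, e.g. "^SAMPLE = GSM12345"
            kind, _, accession = line[1:].partition("=")
            current = defaultdict(list)
            entities[(kind.strip(), accession.strip())] = current
        elif line.startswith("!") and current is not None:
            # Attribute line, e.g. "!Sample_title = some title"
            label, sep, value = line[1:].partition("=")
            if sep:
                current[label.strip()].append(value.strip())
    return entities


# GSM_EXAMPLE_1 is a placeholder accession, not a real GEO identifier.
example_soft = """\
^SERIES = GSE4183
!Series_title = Inflammation, adenoma and cancer: objective classification of colon biopsy specimens with gene expression signature.
^SAMPLE = GSM_EXAMPLE_1
!Sample_title = colon_normal_1024
"""

entities = parse_soft(example_soft)
for (kind, accession), attributes in entities.items():
    print(kind, accession, dict(attributes))
```

Attributes are stored as lists because a label such as `!Sample_description` may appear more than once for the same sample.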

2
7.9 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

The short answer to this question is no. You cannot determine whether a sample is control/normal or diseased from the GEO database, because it does not use a controlled vocabulary: in other words, submitters can describe their samples any way they like, using arbitrary free text. This is, in my opinion, one of the great failings of GEO.

Take for example series GSE4183: "Inflammation, adenoma and cancer: objective classification of colon biopsy specimens with gene expression signature." In this study, a sample from normal colon is described like this:

!Sample_title = colon_normal_1024
!Sample_description = total RNA extracted from cells obtained using biopsy of the colon in a healthy control

A sample from a colon adenoma:

!Sample_title = colon_adenoma_1115
!Sample_description = total RNA extracted from cells obtained using biopsy in a patient having colon adenoma

This is actually quite a good example. In other series, titles and descriptions frequently do not follow any pattern and are not informative.

So in summary: there is no specific tag for "control" or "disease". The best you can do is parse sample title and/or description (using GEOquery or some other software) and hope that the text is informative.
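
As a rough illustration of "parse the text and hope", here is a minimal Python sketch of a keyword heuristic. The keyword lists and the function name `guess_group` are assumptions made up for this example, not anything GEO defines, and they would need tuning for each series. The two test samples are the GSE4183 entries quoted above.

```python
import re

# Hypothetical keyword lists; extend or replace them per series.
CONTROL_WORDS = {"control", "normal", "healthy"}
DISEASE_WORDS = {"adenoma", "cancer", "tumor", "tumour", "carcinoma"}


def guess_group(title, description=""):
    """Guess 'control', 'disease' or 'unknown' from free-text sample annotation."""
    # Split on anything that is not a letter, so "colon_normal_1024" yields "normal".
    words = set(re.findall(r"[a-z]+", f"{title} {description}".lower()))
    is_control = bool(words & CONTROL_WORDS)
    is_disease = bool(words & DISEASE_WORDS)
    if is_control and not is_disease:
        return "control"
    if is_disease and not is_control:
        return "disease"
    # No keyword found, or conflicting keywords: leave it for manual review.
    return "unknown"


normal = guess_group(
    "colon_normal_1024",
    "total RNA extracted from cells obtained using biopsy of the colon in a healthy control",
)
adenoma = guess_group(
    "colon_adenoma_1115",
    "total RNA extracted from cells obtained using biopsy in a patient having colon adenoma",
)
print(normal, adenoma)
```

Samples whose text matches both lists (for example "tumor-adjacent normal tissue") are deliberately returned as "unknown" rather than guessed.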

1
7.9 years ago by
Marseille, France
Julien Textoris430 wrote:

I don't know whether this will help, but we wrote these two scripts a few years ago: the first extracts the data and notes from series_matrix files, and the second converts the generated .notes file into a phenoData file.

Usage (you need to install the PerlIO::gzip module first):

extractData.pl GSExxxx_series_matrix.txt.gz output_dir

This writes a GSExxxx.data file and a GSExxxx.notes file into 'output_dir' (the directory must already exist). The GSExxxx.data file is tab-delimited and is easily imported into R with read.table(); its column names are the GSM accessions.
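
For readers who are not using R, the same tab-delimited .data file can be read with a few lines of Python. This is only a sketch: it assumes the first row is a header whose first cell names the probe column and whose remaining cells are GSM accessions, as the script produces. The function names and the tiny in-memory table (placeholder accessions, made-up values) are illustrative only.

```python
import csv
import io


def to_number(cell):
    """Convert a cell to float; return None for empty or non-numeric cells."""
    try:
        return float(cell)
    except ValueError:
        return None


def read_data_table(handle):
    """Return (sample_ids, {probe_id: [value per sample]}) from a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # first column is the probe identifier
    values = {}
    for row in reader:
        if row:
            values[row[0]] = [to_number(cell) for cell in row[1:]]
    return sample_ids, values


# Made-up table standing in for a real GSExxxx.data file.
example_data = io.StringIO(
    "ID_REF\tGSM_A\tGSM_B\n"
    "probe_1\t7.25\t8.5\n"
    "probe_2\t\t6.0\n"
)
sample_ids, values = read_data_table(example_data)
print(sample_ids, values)
```

With a real file, the handle would come from something like `open("GSExxxx.data", newline="")` instead of `io.StringIO`.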

To get a phenoData file from the .notes file, run:

notes2pData.pl GSExxxx.notes

Hope it still works!

Julien

extractData.pl:

#!/usr/bin/perl -w

use strict;
use warnings;
use PerlIO::gzip;

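@ARGV == 2 or die "Usage: $0 GSExxxx_series_matrix.txt.gz output_dir\n";

open( FILE, "<:gzip", $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";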

my $outputDir = $ARGV[1];

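# Escape the dots and stop if the file name does not look like a series matrix file
$ARGV[0] =~ /(GSE\d+)-?(GPL\d+)?_series_matrix\.txt\.gz$/
    or die "Unexpected file name: $ARGV[0]\n";
my $GSE = $1;
my $GPL = $2;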

my($prefix);
if(defined($GPL)) {
    $prefix = "$GSE-$GPL";
}
else {
    $prefix = "$GSE";
}

print "$prefix\n";

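open(MATRIX, ">", "$outputDir/$prefix.data") or die "Cannot write $outputDir/$prefix.data: $!\n";
open(NOTES, ">", "$outputDir/$prefix.notes") or die "Cannot write $outputDir/$prefix.notes: $!\n";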

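# Initialise the scalars so that missing header lines do not trigger
# "uninitialized value" warnings when the notes are printed
my ( $gseTitle, $gseDescription, $gsePMID ) = ( "", "", "" );
my ( @sampleIDs, @sampleTitles, @sampleDescriptions, @sampleSrcCh1,
    @samplePlatforms, @sampleOrganism, @platformIDs );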

my $table = 0;

while(my $line = <FILE>) {

    if($line =~ /^\![sS]eries_matrix_table_end/) {
        $table = 0;
    }
    if($table == 1) {
        $line =~ s/[\"\#]//g;
        print MATRIX $line;
    }


    $line =~ s/[\r\n]//g;

    if($line =~ /^\!Series_title[\s\t]+\"(.+)\"/) {
        $gseTitle = $1;
    }
    elsif($line =~ /^\!Series_summary[\s\t]+\"(.+)\"/) {
        $gseDescription .= $1;
    }
    elsif($line =~ /^\!Sample_geo_accession[\s\t]+\"(.+)\"/) {
        my $tt = $1;
        $tt =~ s/\"//g;
        @sampleIDs = split(/\t/, $tt);
    }
    elsif($line =~ /^\!Series_pubmed_id[\s\t]+\"(.+)\"/)  {
        $gsePMID = $1;
    }
    elsif($line =~ /^\!Series_platform_id[\s\t]+\"(.+)\"/) {
        push @platformIDs, $1;
    }
    elsif($line =~ /^\!Sample_title/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleTitles = @t;
    }
    elsif($line =~ /^\!Sample_source_name_ch1/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleSrcCh1 = @t;
    }
    elsif($line =~ /^\!Sample_organism_ch1/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleOrganism = @t;
    }
    elsif($line =~ /^\!Sample_description/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        for(my $i = 0; $i < scalar(@t); $i++) {
            $sampleDescriptions[$i] .= $t[$i]." ";
        }
    }
    elsif($line =~ /^\!Sample_platform_id/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @samplePlatforms = @t;
    } 
    elsif($line =~ /^\![sS]eries_matrix_table_begin/) {
        $table = 1;
    }
}

#( $gseTitle, $gseDescription, $gsePMID, @sampleIDs, @sampleTitles,
#   @sampleDescriptions, @sampleSrcCh1, @samplePlatforms, @sampleOrganism,
#   @platformIDs )

print NOTES "GSE_ID = $GSE\n", "GSE_TITLE = $gseTitle\n", "GSE_DESC = $gseDescription\n", "GSE_PMID = $gsePMID\n";
if(defined($GPL)) {
    print NOTES "PLATFORM = $GPL\n";
}
else {
    print NOTES "PLATFORM = $samplePlatforms[0]\n";
}
print NOTES     "NB_SAMPLES = ".scalar(@sampleIDs)."\n";
print NOTES "\n";
print NOTES "SAMPLE_IDS = ".join("\t",@sampleIDs)."\n",
            "SAMPLE_TITLES = ".join("\t",@sampleTitles)."\n",
            "SAMPLE_ORGANISMS = ".join("\t", @sampleOrganism)."\n",
            "SAMPLE_SRC_CH1 = ".join("\t", @sampleSrcCh1)."\n",
            "SAMPLE_DESC = ".join("\t", @sampleDescriptions)."\n";

close(MATRIX);
close(NOTES);
1
7.9 years ago by
Marseille, France
Julien Textoris430 wrote:

and notes2pData.pl:

#!/usr/bin/perl -w

use strict;
use warnings;

@ARGV == 1 or die "Usage: $0 GSExxxx.notes\n";

open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";

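# No leading ^ anchor, so a path such as output_dir/GSExxxx.notes is accepted too
$ARGV[0] =~ /(GSE\d+)-?(GPL\d+)?\.notes$/
    or die "Unexpected file name: $ARGV[0]\n";
my $GSE = $1;
my $GPL = $2;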

my($prefix);
if(defined($GPL)) {
        $prefix = "$GSE-$GPL";
}
else {
        $prefix = "$GSE";
}

print "$prefix\n";

my(@ids);
my(%hash);
my $go = 0;

while(my $line = <FILE>) {
    $line =~ s/[\r\n]//g;
    print "+";
    if($line =~ /^SAMPLE\_IDS\s=\s(.+)$/) {
        @ids = split(/\t/,$1);
        print join("\t",@ids)."\n";     
        $go = 1;
    }

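    # Only store lines that really have the "KEY = values" form; an unchecked
    # match would silently reuse $1 and $2 from the previous line
    if($go == 1 && $line =~ /^(.+?)\s=\s(.*)$/) {
        my ($key, $values) = ($1, $2);
        my @t = split(/\t/, $values);
        print join("\t",@t)."\n";
        $hash{$key} = \@t;
    }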
}

open(PDATA, ">", "$prefix.pData") or die "Cannot write $prefix.pData: $!\n";

# Sort the keys so that the column order is the same on every run
my @keys = sort keys(%hash);
print PDATA "ID\t".join("\t",@keys)."\n";
for(my $i=0;$i<scalar(@ids);$i++) {
    print PDATA $ids[$i];
    for(my $j=0;$j<scalar(@keys);$j++) {
        # Print an empty cell when a sample has no value for this key (needs Perl 5.10+)
        print PDATA "\t", $hash{$keys[$j]}[$i] // "";
    }
    print PDATA "\n";
}

close(PDATA);
close(FILE);
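
The transposition that notes2pData.pl performs (one `KEY = value1<TAB>value2...` line per annotation turned into one row per sample) can also be sketched in Python. This is an illustrative sketch rather than a port: unlike the Perl script it keeps the keys in file order and does not repeat SAMPLE_IDS as an extra column. The function name `notes_to_rows` and the accessions `GSM_A` and `GSM_B` are placeholders.

```python
def notes_to_rows(notes_text):
    """Turn the sample-level lines of a .notes file into (header, rows)."""
    columns = {}
    started = False
    for line in notes_text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # blank line or a line without "KEY = value"
        if key == "SAMPLE_IDS":
            started = True  # sample-level lines begin here
        if started:
            columns[key] = value.split("\t")
    ids = columns.get("SAMPLE_IDS", [])
    keys = [k for k in columns if k != "SAMPLE_IDS"]
    header = ["ID"] + keys
    rows = []
    for i, sample_id in enumerate(ids):
        # Use an empty cell when a key has fewer values than there are samples.
        cells = [columns[k][i] if i < len(columns[k]) else "" for k in keys]
        rows.append([sample_id] + cells)
    return header, rows


# Placeholder notes content in the layout written by extractData.pl.
example_notes = "\n".join([
    "GSE_ID = GSE4183",
    "NB_SAMPLES = 2",
    "",
    "SAMPLE_IDS = " + "\t".join(["GSM_A", "GSM_B"]),
    "SAMPLE_TITLES = " + "\t".join(["colon_normal_1024", "colon_adenoma_1115"]),
])

header, rows = notes_to_rows(example_notes)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```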