Extracting Information From GEO SOFT Files
11.8 years ago
Layla ▴ 50

Hi: I am working with SOFT files downloaded from GEO for specific diseases. Is there a specific tag or field inside the GEO SOFT files that indicates whether a given sample (a GSM, in this case) is a control sample or a disease sample? Can I obtain that information using an external package such as GEOquery? Thanks

11.8 years ago

It sounds like you should be searching for the GSE (or GDS) rather than the GSMs directly. The experiment-level information is in the GSE (or GDS), while the sample-level information is in the GSM records. GEOquery can handle all GEO data types, so I would recommend trying it before reinventing the wheel.

11.8 years ago
Neilfws 49k

The short answer to this question is no. You cannot determine whether a sample is control/normal or diseased from the GEO database, because it does not use a controlled vocabulary: in other words, submitters can describe their samples any way they like, using arbitrary free text. This is, in my opinion, one of the great failings of GEO.

Take for example series GSE4183: "Inflammation, adenoma and cancer: objective classification of colon biopsy specimens with gene expression signature." In this study, a sample from normal colon is described like this:

!Sample_title = colon_normal_1024
!Sample_description = total RNA extracted from cells obtained using biopsy of the colon in a healthy control

A sample from a colon adenoma:

!Sample_title = colon_adenoma_1115
!Sample_description = total RNA extracted from cells obtained using biopsy in a patient having colon adenoma

This is actually quite a good example. In other series, titles and descriptions frequently do not follow any pattern and are not informative.

So in summary: there is no specific tag for "control" or "disease". The best you can do is parse sample title and/or description (using GEOquery or some other software) and hope that the text is informative.
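As a rough illustration of that "parse and hope" approach, here is a minimal Python sketch. It assumes the usual SOFT layout of a `^SAMPLE = ...` line followed by `!Sample_... = ...` attribute lines. The keyword lists, the function names and the `GSM_EXAMPLE_*` identifiers are made up for this example; only the titles and descriptions come from GSE4183 as quoted above. The keywords would have to be chosen by hand for each series after reading its text.

```python
import re

# Assumed keyword lists for this one series; other series need their own.
CONTROL_WORDS = {"healthy", "normal", "control"}
DISEASE_WORDS = {"adenoma", "cancer", "carcinoma", "tumor"}


def parse_soft_samples(text):
    """Group '!Sample_*' attribute lines under the preceding '^SAMPLE' line."""
    samples = {}
    current = None
    for line in text.splitlines():
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()
            samples[current] = {}
        elif current is not None and line.startswith("!Sample_") and " = " in line:
            key, value = line[1:].split(" = ", 1)
            # Attributes can repeat; join repeated values with a space.
            previous = samples[current].get(key)
            samples[current][key] = value if previous is None else previous + " " + value
    return samples


def guess_group(sample):
    """Return 'control', 'disease' or 'unknown' from the title and description."""
    text = (sample.get("Sample_title", "") + " "
            + sample.get("Sample_description", "")).lower()
    # Split on anything that is not a letter, so 'colon_normal_1024' yields 'normal'
    # while 'abnormal' does not count as 'normal'.
    words = set(re.split(r"[^a-z]+", text))
    is_control = bool(words & CONTROL_WORDS)
    is_disease = bool(words & DISEASE_WORDS)
    if is_control and not is_disease:
        return "control"
    if is_disease and not is_control:
        return "disease"
    return "unknown"  # no keyword, or conflicting keywords


# Placeholder sample identifiers; the text is the GSE4183 text quoted above.
EXAMPLE_SOFT = """\
^SAMPLE = GSM_EXAMPLE_1
!Sample_title = colon_normal_1024
!Sample_description = total RNA extracted from cells obtained using biopsy of the colon in a healthy control
^SAMPLE = GSM_EXAMPLE_2
!Sample_title = colon_adenoma_1115
!Sample_description = total RNA extracted from cells obtained using biopsy in a patient having colon adenoma
"""

if __name__ == "__main__":
    for gsm, sample in parse_soft_samples(EXAMPLE_SOFT).items():
        print(gsm, guess_group(sample))
```

Anything that comes back as "unknown" still has to be checked by eye, which is exactly the limitation described above.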

11.8 years ago

I don't know if this will help, but we wrote these scripts a few years ago to (1) extract the data and notes from series_matrix files and (2) convert the generated .notes file into a phenoData file.

Usage (you need to install the PerlIO::gzip module first):

extractData.pl GSExxxx_series_matrix.txt.gz output_dir

This writes a GSExxxx.data file and a GSExxxx.notes file into 'output_dir'. The GSExxxx.data file is easily imported into R with read.table(); the column names are the GSM accessions.

To get a phenoData file from the .notes file, run:

notes2pData.pl GSExxxx.notes
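For anyone who would rather not use Perl, the same pivot (one row per field in the series matrix, turned into one row per sample) can be sketched in Python. This is only an illustration of the idea, not a port of the scripts: the function name `sample_table` and the `GSM_A`/`GSM_B` identifiers are invented for the example, and it assumes the `!Sample_*` header rows are tab-separated with double-quoted values.

```python
import csv


def sample_table(lines):
    """Pivot '!Sample_*' rows (one row per field) into one dict per sample."""
    columns = {}
    for line in lines:
        if not line.startswith("!Sample_"):
            continue
        # csv handles the tab separation and strips the surrounding double quotes.
        row = next(csv.reader([line.rstrip("\r\n")], delimiter="\t"))
        key, values = row[0][1:], row[1:]
        if key in columns:
            # Repeated fields (e.g. several description rows) are joined per sample.
            columns[key] = [(a + " " + b).strip() for a, b in zip(columns[key], values)]
        else:
            columns[key] = values
    ids = columns.pop("Sample_geo_accession", [])
    table = []
    for i, gsm in enumerate(ids):
        record = {"ID": gsm}
        for key, values in columns.items():
            record[key] = values[i] if i < len(values) else ""
        table.append(record)
    return table


# Placeholder accessions; the titles are the GSE4183 titles quoted earlier in the thread.
EXAMPLE_LINES = [
    '!Series_title\t"placeholder series title"',
    '!Sample_title\t"colon_normal_1024"\t"colon_adenoma_1115"',
    '!Sample_geo_accession\t"GSM_A"\t"GSM_B"',
    '!Sample_description\t"healthy control"\t"colon adenoma"',
    '!Sample_description\t"biopsy"\t"biopsy"',
]

if __name__ == "__main__":
    for record in sample_table(EXAMPLE_LINES):
        print(record)
```

For a real file, the lines could come from `gzip.open(path, "rt")`, and the resulting records could be written out with `csv.DictWriter`.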

I hope the scripts still work!

Julien

extractData.pl:

#!/usr/bin/perl -w

use strict;
use warnings;
use PerlIO::gzip;

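# Fail with a clear message if the compressed series matrix cannot be read.
open( FILE, "<:gzip", $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";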

my $outputDir = $ARGV[1];

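# Dots are escaped so they match literally; stop if the name is not recognised.
$ARGV[0] =~ /(GSE\d+)-?(GPL\d+)?_series_matrix\.txt\.gz$/
    or die "Unexpected file name: $ARGV[0]\n";
my $GSE = $1;
my $GPL = $2;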

my($prefix);
if(defined($GPL)) {
    $prefix = "$GSE-$GPL"
}
else {
    $prefix = "$GSE";
}

print "$prefix\n";

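open(MATRIX, ">", "$outputDir/$prefix.data") or die "Cannot write $outputDir/$prefix.data: $!\n";
open(NOTES, ">", "$outputDir/$prefix.notes") or die "Cannot write $outputDir/$prefix.notes: $!\n";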

my ( $gseTitle, $gseDescription, $gsePMID, @sampleIDs, @sampleTitles,
    @sampleDescriptions, @sampleSrcCh1, @samplePlatforms, @sampleOrganism,
    @platformIDs );    

my $table = 0;

while(my $line = <FILE>) {

    if($line =~ /^\![sS]eries_matrix_table_end/) {
        $table = 0;
    }
    if($table == 1) {
        $line =~ s/[\"\#]//g;
        print MATRIX $line;
    }


    $line =~ s/[\r\n]//g;

    if($line =~ /^\!Series_title[\s\t]+\"(.+)\"/) {
        $gseTitle = $1;
    }
    elsif($line =~ /^\!Series_summary[\s\t]+\"(.+)\"/) {
        $gseDescription .= $1;
    }
    elsif($line =~ /^\!Sample_geo_accession[\s\t]+\"(.+)\"/) {
        my $tt = $1;
        $tt =~ s/\"//g;
        @sampleIDs = split(/\t/, $tt);
    }
    elsif($line =~ /^\!Series_pubmed_id[\s\t]+\"(.+)\"/)  {
        $gsePMID = $1;
    }
    elsif($line =~ /^\!Series_platform_id[\s\t]+\"(.+)\"/) {
        push @platformIDs, $1;
    }
    elsif($line =~ /^\!Sample_title/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleTitles = @t;
    }
    elsif($line =~ /^\!Sample_source_name_ch1/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleSrcCh1 = @t;
    }
    elsif($line =~ /^\!Sample_organism_ch1/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleOrganism = @t;
    }
    elsif($line =~ /^\!Sample_description/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        for(my $i = 0; $i < scalar(@t); $i++) {
            $sampleDescriptions[$i] .= $t[$i]." ";
        }
    }
    elsif($line =~ /^\!Sample_platform_id/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @samplePlatforms = @t;
    } 
    elsif($line =~ /^\![sS]eries_matrix_table_begin/) {
        $table = 1;
    }
}

#( $gseTitle, $gseDescription, $gsePMID, @sampleIDs, @sampleTitles,
#   @sampleDescriptions, @sampleSrcCh1, @samplePlatforms, @sampleOrganism,
#   @platformIDs )

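# Some series lack a title, summary or PubMed ID line; default to empty strings
# so that printing does not trigger "uninitialized value" warnings.
for ($gseTitle, $gseDescription, $gsePMID) {
    $_ = "" unless defined $_;
}
print NOTES "GSE_ID = $GSE\n", "GSE_TITLE = $gseTitle\n", "GSE_DESC = $gseDescription\n", "GSE_PMID = $gsePMID\n";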
if(defined($GPL)) {
    print NOTES "PLATFORM = $GPL\n";
}
else {
    print NOTES "PLATFORM = $samplePlatforms[0]\n";
}
print NOTES     "NB_SAMPLES = ".scalar(@sampleIDs)."\n";
print NOTES "\n";
print NOTES "SAMPLE_IDS = ".join("\t",@sampleIDs)."\n",
            "SAMPLE_TITLES = ".join("\t",@sampleTitles)."\n",
            "SAMPLE_ORGANISMS = ".join("\t", @sampleOrganism)."\n",
            "SAMPLE_SRC_CH1 = ".join("\t", @sampleSrcCh1)."\n",
            "SAMPLE_DESC = ".join("\t", @sampleDescriptions)."\n";

close(MATRIX);
close(NOTES);
11.8 years ago

and notes2pData.pl:

#!/usr/bin/perl -w

use strict;
use warnings;

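open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";

# No leading ^ anchor, so a path such as output_dir/GSExxxx.notes also matches.
$ARGV[0] =~ /(GSE\d+)-?(GPL\d+)?\.notes$/
    or die "Unexpected file name: $ARGV[0]\n";
my $GSE = $1;
my $GPL = $2;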

my($prefix);
if(defined($GPL)) {
        $prefix = "$GSE-$GPL"
}
else {
        $prefix = "$GSE";
}

print "$prefix\n";

my(@ids);
my(%hash);
my $go = 0;

while(my $line = <FILE>) {
    $line =~ s/[\r\n]//g;
    print "+";
    if($line =~ /^SAMPLE\_IDS\s=\s(.+)$/) {
        @ids = split(/\t/,$1);
        print join("\t",@ids)."\n";     
        $go = 1;
    }

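    # Only handle "KEY = value" lines. Testing the match avoids reusing
    # stale captures when a line does not fit the pattern.
    if($go == 1 && $line =~ /^(\S+)\s=\s(.*)$/) {
        my ($key, $value) = ($1, $2);
        my @t = split(/\t/,$value);
        print join("\t",@t)."\n";
        $hash{$key} = \@t;
    }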
}

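open(PDATA, ">", "$prefix.pData") or die "Cannot write $prefix.pData: $!\n";

# Sort the keys so the column order is the same on every run.
my @keys = sort keys(%hash);
print PDATA "ID\t".join("\t",@keys)."\n";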
for(my $i=0;$i<scalar(@ids);$i++) {
    print PDATA $ids[$i];
    for(my $j=0;$j<scalar(@keys);$j++) {
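        # Fields with fewer values than samples are written as empty cells.
        my $value = $hash{$keys[$j]}[$i];
        print PDATA "\t", (defined($value) ? $value : "");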
    }
    print PDATA "\n";
}

close(PDATA);
close(FILE);