NCBI: Parsing data nicely?
2
0
Entering edit mode
7.2 years ago
Tom ▴ 40

I downloaded a set of summaries about gene expression data sets from NCBI (for example of file format, if you go to https://www.ncbi.nlm.nih.gov/gds and searched for GDS5879, and downloaded the result through Send to file -> Summary (text)).

When you open this in excel, you can see the result is like this:

  1. Pulmonary CDC11c+ cells from young and middle-age animals
    Analysis of pulmonary CDC11c+ cells from 6-8 week and 10-13 month old C57BL/6 animals. CDC11c+ cells are key modulators of the immune response in the lung. Results provide insight into molecular mechanisms underlying the decline in immune function associated with aging.
    Organism: Mus musculus
    Type: Expression profiling by array, count, 2 age sets
    Platform: GPL6885 Series: GSE71868 8 Samples
    FTP download: GEO ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS5nnn/GDS5879/
    DataSet Accession: GDS5879 ID: 5879

  2. Normal liver developmental stages: embryonic and postnatal
    Analysis of hepatoblasts, immature hepatocytes and hepatocytes from livers at different developmental timepoints (embryonic day 14, embryonic day 18, post-natal day 5, post-natal day 56). Results provide insight into molecular mechanisms underlying normal liver development.
    Organism: Mus musculus
    Type: Expression profiling by array, log2 ratio, 4 age, 3 cell type, 2 development stage sets Platform: GPL7202 Series: GSE65063 11 Samples
    FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS5nnn/GDS5818/
    DataSet Accession: GDS5818 ID: 5818

I wanted to parse this data so I have a table of Title + "\t" + Description + "\t" + Organism + "\t" + Type + "\t" + Platform + "\t" + Series + "\t" + NumberOfSamples + "\t" + DatasetAccession.

The file doesn't look badly laid out here, but when I open it in excel, there are problems. For example, when I say in my script that "Organism" is the line underneath the experiment description, sometimes the line underneath that again (Type:) is printed instead, even though the Organism line is there.

This is the code I tried:

import sys
import re
count = 1
file_open = open(sys.argv[1])

Dict1 = {}
for line in file_open:
    if str(count) + "." in line:
        Title = line.strip()
        Dict1[Title] = []
        Description = file_open.next().strip()
        Dict1[Title] = [Description]
        Organism = file_open.next().strip().split(":")
        Dict1[Title].append(Organism[1].strip())
        Type = file_open.next().strip()
        split_type = Type.strip().split(",")
        Dict1[Title].extend(split_type)
        Platform = file_open.next().strip()
        split_platform = Platform.strip().split()
        Dict1[Title].extend(split_platform)
        Download = file_open.next().strip()
        Dict1[Title].extend(split_platform)
        Dataset = file_open.next().strip().split()
        Dict1[Title].extend(Dataset)
        count +=1

for i in Dict1:
    print i + "\t" + "\t".join(Dict1[i])

I also tried to using csv and pandas to read in the file properly.

The problem is that the output is not uniform.

For example, if I just do this:

import sys
import re
count = 1
file_open = open(sys.argv[1])

Dict1 = {}
for line in file_open:
    if str(count) + "." in line:
        Title = line.strip()
        Dict1[Title] = []
        Description = file_open.next().strip()
        Dict1[Title] = [Description]
        Organism = file_open.next().strip().split(":")
        Dict1[Title].append(Organism[1].strip())
        Type = file_open.next().strip()
        split_type = Type.strip().split(",")
        print split_type

The output looks like this:

Type:       Expression profiling by array
Platform: GPL1261 Series: GSE32334 6 Samples
Type:       Expression profiling by array
Platform: GPL8321 Series: GSE42389 12 Samples
Platform: GPL570 Series: GSE19533 10 Samples
Type:       Expression profiling by array
Type:       Expression profiling by array
Type:       Expression profiling by array
Platform: GPL7202 Series: GSE22828 84 Samples
Type:       Expression profiling by array
Platform: GPL1261 Series: GSE6290 37 Samples
Platform: GPL1261 Series: GSE22616 24 Samples
Type:       Expression profiling by array
Platform: GPL1261 Series: GSE10113 12 Samples
Type:       Expression profiling by array
Platform: GPL339 Series: GSE9914 12 Samples
Platform: GPL1261 Series: GSE8091 16 Samples
Type:       Expression profiling by array
Type:       Expression profiling by array
Platform: GPL96 Series: GSE9714 8 Samples
Platform: GPL96 Series: GSE9713 18 Samples
Platform: GPL96 Series: GSE9712 12 Samples
Platform: GPL96 Series: GSE6011 37 Samples
Type:       Expression profiling by array
Type:       Expression profiling by array
Type:       Expression profiling by array

...even though the "Type:" line is there for the ones that are skipped. I've also tried to view as "Summary ->text" on NCBI, but that's even worse when i put it into excel.

Can anyone help me cleanly parse this data? Have other people had this problem?

parsing data table ncbi geo • 2.7k views
ADD COMMENT
3
Entering edit mode
7.2 years ago

I wrote the following script very quickly, but it should give you an idea of how to improve it.

import sys
import re

file_open = open(sys.argv[1])

d = {}
isdescnext = False
for line in file_open:
    line = line.strip()
    if re.search('^\d+\.', line):
        Title = ".".join(line.split('.')[1:]).strip()
        d[Title] = {
          'Organism': 'NA',
          'Type': 'NA',
          'Platform': 'NA',
          'FTP': 'NA',
          'Series': 'NA',
          'Description': [],
          'Samples': 'NA'

        }
        isdescnext = True
    elif line.startswith('Organism:'):
        d[Title]['Organism'] = line[10:].strip()
        isdescnext = False
    elif line.startswith('Type:'):
        d[Title]['Type'] = line[10:].strip()
    elif line.startswith('Platform:'):
        d[Title]['Platform'] = re.findall('Platform: ([^\s]+)', line)[0]
    elif line.startswith('FTP:'):
        d[Title]['FTP'] = line[9:].strip()
    elif isdescnext:
        d[Title]['Description'].append(line)
    if re.search('\d+ Samples', line):
        d[Title]['Samples'] = re.findall('(\d+) Samples', line)[0]
    if re.search('Series: [^\s]+', line):
        d[Title]['Series'] = re.findall('Series: ([^\s]+)', line)[0]


for title, v in d.items():
    desc =  " ".join(v["Description"]) if v["Description"] else 'NA'
    print '{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}'.format(title, desc, v["Organism"], v['Type'], v['Platform'], v['Series'], v['Samples'])

Output:

Middle-aged 2   NA  Mus musculus    NA  GPL6885 GSE71868    NA
Middle-aged 3   NA  Mus musculus    NA  GPL6885 GSE71868    NA
Pulmonary CDC11c+ cells from young and middle-age animals   Analysis of pulmonary CDC11c+ cells from 6-8 week and 10-13 month old C57BL/6 animals. CDC11c+ cells are key modulators of the immune response in the lung. Results provide insight into molecular mechanisms underlying the decline in immune function associated with aging.  Mus musculus    ression profiling by array, count, 2 age sets   GPL6885 GSE71868    8
Middle-aged 1   NA  Mus musculus    NA  GPL6885 GSE71868    NA
Middle-aged 4   NA  Mus musculus    NA  GPL6885 GSE71868    NA
Gene expression pattern of pulmonary CD11c+ cells from middle-aged and young mice.  (Submitter supplied) Analysis of function of CD11c+ cells from middle-aged and young mice at gene level. This experiment provided insight into the different genes that plays roles in inflammation, immune response and mainly arachidonic acid cascade that are differentiall expressed in CD11c+ cells from middle aged and young mice.  Mus musculus    ression profiling by array  NA  NA  8
Illumina MouseRef-8 v2.0 expression beadchip    (Submitter supplied) The MouseRef-8 v2.0 Expression BeadChip offer the most up-to-date content for whole-genome expression profiling in the mouse. Featuring content derived from the National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database (Build 36, Release 22).  Please use the GEO Data Submission Report Plug-in v1.0 for Gene Expression which may be downloaded from https://icom.illumina.com/icom/software.ilmn?id=234 to format the normalized and raw data. more...   Mus musculus    NA  NA  NA  13673
Young 4 NA  Mus musculus    NA  GPL6885 GSE71868    NA
Young 3 NA  Mus musculus    NA  GPL6885 GSE71868    NA
Young 2 NA  Mus musculus    NA  GPL6885 GSE71868    NA
Young 1 NA  Mus musculus    NA  GPL6885 GSE71868    NA
ADD COMMENT
1
Entering edit mode
7.2 years ago
Michael 54k

You should try NCBI e-utils for this, e.g.:

esearch -db GDS -query 'GDS5879' | efetch -format docsum

and use xtract to extract the data you want.

ADD COMMENT
0
Entering edit mode

Thanks. Yes I have been trying two other ways: One is a script:

import Bio
from Bio import Geo
from Bio import Entrez
Entrez.email = "xxx@gmail.com"
handle = Entrez.esearch(db="gds",term="GDS5879")
record = Entrez.read(handle)
print record

The output:

 {u'Count': '11', u'RetMax': '11', u'IdList': ['5879', '200071868', '100006885', '301847070', '301847069', '301847068', '301847067', '301847066', '301847065', '301847064', '301847063'], u'TranslationStack': [{u'Count': '11', u'Field': 'All Fields', u'Term': 'GDS5879[All Fields]', u'Explode': 'N'}, 'GROUP'], u'TranslationSet': [], u'RetStart': '0', u'QueryTranslation': 'GDS5879[All Fields]'}

Just a small question, I think the "IdList" in the above output has all the info I need, I'm just struggling to find a dictionary on NCBI that links each of these numbers to e.g. GDS5879 or GSMXXXXX or ExpressionProfilingByArray or GPLXXXX, would you know if I'm right in saying that that file should be somewhere/I should somehow to able to link those numbers to english? (e.g. I know that 301847070 is linked somewhere to sample GSM1847070).

and the other way, using the command:

esearch -db GDS -query 'GDS5879' | efetch -format docsum | xtract -pattern DocumentSummary -element Id Title summary Accession taxon gdsType

The output:

5879    Young 1 Young 2 Young 3 Young 4 Middle-aged 1   Middle-aged 2   Middle-aged 3   Middle-aged 4   Analysis of pulmonary CDC11c+ cells from 6-8 week and 10-13 month old C57BL/6 animals. CDC11c+ cells are key modulators of the immune response in the lung. Results provide insight into molecular mechanisms underlying the decline in immune function associated with aging.  GDS5879 GSM1847067  GSM1847068  GSM1847069  GSM1847070  GSM1847063  GSM1847064  GSM1847065  GSM1847066  Mus musculus    Expression profiling by array
200071868   Young 4 Young 1 Middle-aged 3   Young 2 Middle-aged 4   Young 3 Middle-aged 1   Middle-aged 2   Analysis of function of CD11c+ cells from middle-aged and young mice at gene level. This experiment provided insight into the different genes that plays roles in inflammation, immune response and mainly arachidonic acid cascade that are differentiall expressed in CD11c+ cells from middle aged and young mice.   GSE71868    GSM1847070  GSM1847067  GSM1847065  GSM1847068  GSM1847066  GSM1847069  GSM1847063  GSM1847064  Mus musculus    Expression profiling by array

Neither give me what I want (which would be the following info for GDS5879 in a tab delimited line):

  1. Title (In the XML file from efetch, this is in DocumentSummary -> title) = Analysis of pulmonary CDC11c+ cells from 6-8 week and 10-13 month old C57BL/6 animals. CDC11c+ cells are key modulators of the immune response in the lung. Results provide insight into molecular mechanisms underlying the decline in immune function associated with aging.

  2. Organism (DocumentSummary ->taxon) = Mus musculus

  3. Type (e.g. RNASeq or microarray) (DocumentSummary ->gdsType) = Expression profiling by array

  4. Platform (DocumentSummary ->GPL) = GPL6885

  5. Series (DocumentSummary ->GSE) = GSE78168

  6. NumberOfSamples = (DocumentSummary -> n_samples) = 8

  7. DatasetAccession (DocumentSummary ->GDS): = GDS5879

I'm just wondering if someone could put me on the right track as to what I'm doing wrong with either of these ways, because I can pretty much see the data I want in both the methods manually, but I'm just struggling to make the link to getting code that works?

ADD REPLY
0
Entering edit mode

I decided to try to be more adventurous and extract the sample characteristics for each dataset. However, eFetch doesn't seem to return all samples for a data set? For example, if you look at GDS5204, there are 41 samples in the data set.

When I run: esearch -db GDS -query GDS5204'[ACCN] AND GSM[ETYP]' | efetch -format docsum | xtract -pattern DocumentSummary -element Accession

The output is less than 41 samples:

GSM1303184 GSM1303183 GSM1303182 GSM1303181 GSM1303180 GSM1303179 GSM1303178 GSM1303177 GSM1303176 GSM1303175 GSM1303174 GSM1303173 GSM1303172 GSM1303171

If I put the number "5204" into a file called "test", and run: for i in cat test ;do esearch -db GDS -query GDS$i'[ACCN] AND GSM[ETYP]' | efetch -format docsum | xtract -pattern DocumentSummary -element GDS Accession summary title ;done

the output is:

5204 GSM1303184 106 years old Female 106 years old Female 5204 GSM1303183 105 years old Female 105 years old Female 5204 GSM1303182 104 years old Male 104 years old Male 5204 GSM1303181 103 years old Female 103 years old Female 5204 GSM1303180 94 years old Female 94 years old Female 5204 GSM1303179 93 years oldFemale 93 years oldFemale 5204 GSM1303178 92 years old Male 92 years old Male 5204 GSM1303177 91 years old Male 91 years old Male 5204 GSM1303176 91 years old Female 91 years old Female

When I run: esearch -db GDS -query GDS5204'[ACCN] AND GSM[ETYP]' | efetch -format docsum | xtract -pattern Samples -element Accession, there is no output at all.

Does this method/file not contain a complete list of samples for each data set or am I mis-understanding something?

ADD REPLY

Login before adding your answer.

Traffic: 1967 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6