Question: Importing PubMed MEDLINE Details into a Local RDBMS to Execute Data Mining Methods
6.7 years ago, bukowski.mark wrote:

Hi everyone,

I want to run data mining methods on a PubMed dataset (MEDLINE in XML). For this I found a 2004 paper, "Software to parse and load MEDLINE into a RDBMS", and tried to run its Java code (http://biotext.berkeley.edu/software.html). I can't get the MedlineParser to work; it is probably a problem with JAXP or other outdated libraries. I also can't find any recent solution for mining a PubMed dataset (XML files) directly or for first loading it into a local RDBMS.

Are there any working solutions? Maybe an XSLT stylesheet?

I would be very grateful if you could help me to find a solution.

Best regards, Mark

Tags: xml, database, pubmed

Note: Mark asked me this question by email, and I suggested that he post it on biostars.org to get answers from the community.

written 6.7 years ago by Pierre Lindenbaum

I just found the archived post on nodalpoint http://archive.nodalpoint.org/2006/06/07/medline_xml_to_database_parser

written 6.7 years ago by Pierre Lindenbaum

6.7 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

When http://nodalpoint.org/ was still alive (... ;-) ) I suggested using an XSLT stylesheet to import PubMed XML into a database. I quickly wrote an XSLT stylesheet that inserts some PubMed articles into a sqlite3 database. See https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/pubmed2sqlite.xsl . Here I only use 3 tables, but the schema could be far more complicated.

$ xsltproc --novalid  stylesheets/bio/ncbi/pubmed2sqlite.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=9891771,21378989&retmode=xml"

create table if not exists Journal
    (
    nlmUniqueID  TEXT UNIQUE NOT NULL,
    medlineTA TEXT
    );


create table if not exists PubmedArticle
    (
    pmid INT UNIQUE NOT NULL,
    title TEXT,
    abstract TEXT,
    nlmUniqueID TEXT,
    FOREIGN KEY(nlmUniqueID ) REFERENCES Journal(nlmUniqueID)
    );

create table if not exists Author
    (
    lastName TEXT,
    foreName TEXT,
    pmid INT NOT NULL,
    position INT,
    FOREIGN KEY(pmid ) REFERENCES PubmedArticle(pmid)
    );

create unique index if not exists Author2Article on Author(lastName,foreName,pmid);
begin transaction;
insert or ignore into Journal(nlmUniqueID,medlineTA) values ('7609767','Ann Chir Gynaecol');
insert or ignore into PubmedArticle(pmid,title,abstract,nlmUniqueID) values ('9891771','Prognosis and surveillance of gastrointestinal stromal/smooth muscle tumors.','','7609767');
insert or ignore into Author(lastName,foreName,pmid,position) values ('Emory','T S','9891771',1);
insert or ignore into Author(lastName,foreName,pmid,position) values ('O''Leary','T J','9891771',2);
insert or ignore into Journal(nlmUniqueID,medlineTA) values ('9216904','Nat Genet');
insert or ignore into PubmedArticle(pmid,title,abstract,nlmUniqueID) values ('21378989','Truncating mutations in the last exon of NOTCH2 cause a rare skeletal disorder with osteoporosis.','Hajdu-Cheney syndrome is a rare autosomal dominant skeletal disorder with facial anomalies, osteoporosis and acro-osteolysis. We sequenced the exomes of six unrelated individuals with this syndrome and identified heterozygous nonsense and frameshift mutations in NOTCH2 in five of them. All mutations cluster to the last coding exon of the gene, suggesting that the mutant mRNA products escape nonsense-mediated decay and that the resulting truncated NOTCH2 proteins act in a gain-of-function manner.','9216904');
insert or ignore into Author(lastName,foreName,pmid,position) values ('Isidor','Bertrand','21378989',1);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Lindenbaum','Pierre','21378989',2);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Pichon','Olivier','21378989',3);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Bézieau','Stéphane','21378989',4);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Dina','Christian','21378989',5);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Jacquemont','Sébastien','21378989',6);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Martin-Coignard','Dominique','21378989',7);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Thauvin-Robinet','Christel','21378989',8);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Le Merrer','Martine','21378989',9);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Mandel','Jean-Louis','21378989',10);
insert or ignore into Author(lastName,foreName,pmid,position) values ('David','Albert','21378989',11);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Faivre','Laurence','21378989',12);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Cormier-Daire','Valérie','21378989',13);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Redon','Richard','21378989',14);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Le Caignec','Cédric','21378989',15);

commit transaction;

Then pipe the generated SQL into sqlite3 and query the database:

$ xsltproc --novalid  stylesheets/bio/ncbi/pubmed2sqlite.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=9891771,21379889&retmode=xml"| sqlite3 test.db

$ sqlite3 test.db 'select * from Journal'
7609767|Ann Chir Gynaecol
101484507|Prov Med Surg J (1840)
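
For readers who would rather not write XSLT, the same three-table load can be sketched in Python using only the standard library. This is a rough sketch under assumptions, not a port of the stylesheet: the function name load_pubmed and the inline SAMPLE_XML are made up for illustration (the sample reuses the values shown above for PMID 9891771 and contains only the elements the code reads), and the element paths are the ones I would expect in the efetch XML, so check them against a real record.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for an efetch response (values from the output above).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>9891771</PMID>
      <Article>
        <ArticleTitle>Prognosis and surveillance of gastrointestinal stromal/smooth muscle tumors.</ArticleTitle>
        <AuthorList>
          <Author><LastName>Emory</LastName><ForeName>T S</ForeName></Author>
          <Author><LastName>O'Leary</LastName><ForeName>T J</ForeName></Author>
        </AuthorList>
      </Article>
      <MedlineJournalInfo>
        <MedlineTA>Ann Chir Gynaecol</MedlineTA>
        <NlmUniqueID>7609767</NlmUniqueID>
      </MedlineJournalInfo>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Same three tables and unique index as in the generated SQL above.
SCHEMA = """
create table if not exists Journal (nlmUniqueID TEXT UNIQUE NOT NULL, medlineTA TEXT);
create table if not exists PubmedArticle (
    pmid INT UNIQUE NOT NULL, title TEXT, abstract TEXT, nlmUniqueID TEXT,
    FOREIGN KEY(nlmUniqueID) REFERENCES Journal(nlmUniqueID));
create table if not exists Author (
    lastName TEXT, foreName TEXT, pmid INT NOT NULL, position INT,
    FOREIGN KEY(pmid) REFERENCES PubmedArticle(pmid));
create unique index if not exists Author2Article on Author(lastName,foreName,pmid);
"""


def load_pubmed(xml_text, conn):
    """Insert every PubmedArticle found in xml_text into the three tables."""
    conn.executescript(SCHEMA)
    root = ET.fromstring(xml_text)
    with conn:  # commit all inserts together, roll back on error
        for art in root.iter("PubmedArticle"):
            cit = art.find("MedlineCitation")
            pmid = int(cit.findtext("PMID"))
            nlm_id = cit.findtext("MedlineJournalInfo/NlmUniqueID")
            conn.execute("insert or ignore into Journal values (?,?)",
                         (nlm_id, cit.findtext("MedlineJournalInfo/MedlineTA")))
            title_el = cit.find("Article/ArticleTitle")
            title = "".join(title_el.itertext()) if title_el is not None else None
            # An abstract may be split into several AbstractText parts; join them.
            abstract = " ".join("".join(part.itertext())
                                for part in cit.iterfind("Article/Abstract/AbstractText"))
            conn.execute("insert or ignore into PubmedArticle values (?,?,?,?)",
                         (pmid, title, abstract, nlm_id))
            authors = cit.iterfind("Article/AuthorList/Author")
            for pos, au in enumerate(authors, start=1):
                # Placeholders take care of quoting (e.g. O'Leary).
                conn.execute("insert or ignore into Author values (?,?,?,?)",
                             (au.findtext("LastName"), au.findtext("ForeName"), pmid, pos))


conn = sqlite3.connect(":memory:")
load_pubmed(SAMPLE_XML, conn)
rows = conn.execute(
    "select a.lastName, a.position, j.medlineTA from Author a "
    "join PubmedArticle p on p.pmid = a.pmid "
    "join Journal j on j.nlmUniqueID = p.nlmUniqueID "
    "order by a.position").fetchall()
print(rows)
```

To load a downloaded file instead, read it and pass its text to load_pubmed together with sqlite3.connect("test.db"). For very large MEDLINE files, ET.iterparse would avoid holding the whole document in memory.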

Thank you for your answer! After getting familiar with XSLT I will try to use it for a more complex schema.

written 6.7 years ago by bukowski.mark

Just FYI, nodalpoint is archived :) EDIT: I see Pierre found the archive too.

modified 6.7 years ago • written 6.7 years ago by Neilfws

6.6 years ago, reachtoskumar wrote:

You can also give BioGyan a try (http://www.biogyan.com/). It is a comprehensive search tool designed for biologists that enables search, annotation and ranking of scientific literature from public databases. You can then export your results to Excel and import them into the RDBMS of your choice.
