Question

Help with analyzing NCBI tissue expression data - solr xml file

3.2 years ago

j.matt.franklin • 0

I'm trying to access the complete NCBI tissue expression dataset. When you look at any individual gene, NCBI provides an expression chart showing RNA-seq counts across tissues. For example, at https://www.ncbi.nlm.nih.gov/gene/6304 you can see higher expression in brain and lymph nodes.

I contacted NCBI, and they told me the data for all genes is available here: https://ftp.ncbi.nih.gov/gene/DATA/expression/

The data is a giant XML file formatted for an Apache Solr database. They provide a schema file to help read the data.

However, my first attempt at loading the data into solr totally failed. Has anyone set up scripts for loading and querying this data?

solr ncbi tissue expression xml RNA-Seq • 735 views


Wow that worked perfectly! Thanks.

ADD REPLY • link 3.2 years ago by j.matt.franklin • 0

To close the question, please validate my answer by clicking the green tick on the left.

ADD REPLY • link 3.2 years ago by Pierre Lindenbaum 161k

score 2 · Accepted Answer · 2021-02-23

The XML file is not well-formed: it has no single root element.

Download it: wget "https://ftp.ncbi.nih.gov/gene/DATA/expression/Mammalia/Homo_sapiens/PRJEB2445_GRCh38.p2_107_expression.xml.gz"

Fix the XML by wrapping the content in a root element:

(echo "<root>" && gunzip -c PRJEB2445_GRCh38.p2_107_expression.xml.gz  && echo "</root>" )  > tmp.xml

Process it with an XSLT stylesheet (biostar492866.xsl) to generate a table. This is slow and memory-consuming.

xsltproc  biostar492866.xsl tmp.xml

Output:

entropy exp_Mcount  exp_rpkm    exp_total   full_rpkm   gene    id  is_metadata is_sample   project_desc    sample_id   source_name sra_id  taxid   var 
    16177.9                 metadata_9606_SAMEA962332   true    true    PRJEB2445   SAMEA962332 thyroid ERS025090   9606    
    16970.1                 metadata_9606_SAMEA962333   true    true    PRJEB2445   SAMEA962333 testes  ERS025094   9606    
    17645.9                 metadata_9606_SAMEA962334   true    true    PRJEB2445   SAMEA962334 prostate    ERS025095   9606    
    15620.7                 metadata_9606_SAMEA962335   true    true    PRJEB2445   SAMEA962335 liver   ERS025096   9606    
    17816.5                 metadata_9606_SAMEA962336   true    true    PRJEB2445   SAMEA962336 white blood cells   ERS025091   9606    
    24649.8                 metadata_9606_SAMEA962337   true    true    PRJEB2445   SAMEA962337 16 tissues mixture  ERS025093   9606    
    17701.4                 metadata_9606_SAMEA962338   true    true    PRJEB2445   SAMEA962338 lung    ERS025099   9606    
    19777.8                 metadata_9606_SAMEA962339   true    true    PRJEB2445   SAMEA962339 adipose ERS025098   9606    
    18398.2                 metadata_9606_SAMEA962340   true    true    PRJEB2445   SAMEA962340 breast  ERS025088   9606
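
If the table is saved to a file, it can be read back by column name. The following is only a minimal sketch in Python (not part of the original answer), assuming the xsltproc output is tab-delimited with the header line shown above; the function name sample_tissues and the file name expression.tsv are hypothetical.

```python
import csv


def sample_tissues(tsv_lines):
    """Map sample_id -> source_name for rows flagged is_sample == 'true'.

    tsv_lines is any iterable of text lines (an open file works), assumed
    to be tab-delimited with a header naming the columns.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return {
        row["sample_id"]: row["source_name"]
        for row in reader
        if row.get("is_sample") == "true"
    }


# Small in-memory demo using a subset of the columns shown above.
demo = [
    "id\tis_sample\tsample_id\tsource_name",
    "metadata_9606_SAMEA962332\ttrue\tSAMEA962332\tthyroid",
    "metadata_9606_SAMEA962335\ttrue\tSAMEA962335\tliver",
]
print(sample_tissues(demo))

# With a real file (hypothetical name), assuming the table was saved with
#   xsltproc biostar492866.xsl tmp.xml > expression.tsv
# one would call:
#   with open("expression.tsv", newline="") as fh:
#       tissues = sample_tissues(fh)
```

Looking columns up by name rather than by position avoids depending on the exact column order of the stylesheet output.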

However, the best way to process such a big XML file is to use a StAX or SAX parser (see "A: Is There Any Tool To Extract Demanded Information From An Asn/Xml File?" and "Convert XML file to FASTA ...").
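
As an illustration of the streaming approach, here is a minimal sketch in Python (swapped in for a Java StAX/SAX parser) using the standard library's xml.etree.ElementTree.XMLPullParser. It assumes each record is a <doc> element holding <field name="...">value</field> children, as in the usual Solr update format; I have not checked this against the NCBI file. The function names iter_docs and _drain are hypothetical. The synthetic <root> wrapper is fed directly to the parser, so no tmp.xml is needed.

```python
import io
import xml.etree.ElementTree as ET


def _drain(parser):
    """Yield a dict for every <doc> element the parser has completed so far."""
    for _event, elem in parser.read_events():
        if elem.tag != "doc":
            continue  # <field> end events are handled via their parent <doc>
        # Repeated field names overwrite each other in this simple sketch.
        record = {f.get("name"): (f.text or "") for f in elem.iter("field")}
        elem.clear()  # drop the field children; an empty <doc> shell remains
        yield record


def iter_docs(stream, chunk_size=1 << 16):
    """Stream records from a root-less Solr XML *binary* stream."""
    parser = ET.XMLPullParser(events=("end",))
    parser.feed(b"<root>")  # synthetic root, same idea as the echo wrapper
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        yield from _drain(parser)
    parser.feed(b"</root>")
    parser.close()
    yield from _drain(parser)


# In-memory demo; the XML layout here is an assumption, not copied from NCBI.
sample = (
    b'<doc><field name="id">metadata_9606_SAMEA962332</field>'
    b'<field name="source_name">thyroid</field></doc>'
    b'<doc><field name="id">metadata_9606_SAMEA962335</field>'
    b'<field name="source_name">liver</field></doc>'
)
for rec in iter_docs(io.BytesIO(sample)):
    print(rec["id"], rec["source_name"])

# For the real file one would pass a gzip stream instead, e.g.
#   import gzip
#   with gzip.open("PRJEB2445_GRCh38.p2_107_expression.xml.gz", "rb") as fh:
#       for rec in iter_docs(fh): ...
```

Because each <doc> is cleared as soon as it has been converted to a dict, memory use stays far below what loading the whole tree for XSLT requires, although the empty element shells still accumulate slowly.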