9 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
The XML file is malformed: it has no single root element.
Download it: wget "https://ftp.ncbi.nih.gov/gene/DATA/expression/Mammalia/Homo_sapiens/PRJEB2445_GRCh38.p2_107_expression.xml.gz"
Fix the XML by adding a root element:
(echo "<root>" && gunzip -c PRJEB2445_GRCh38.p2_107_expression.xml.gz && echo "</root>" ) > tmp.xml
Process it with an XSLT stylesheet (biostar492866.xsl) to generate a table. This is slow and memory-consuming:
xsltproc biostar492866.xsl tmp.xml
Output:
entropy exp_Mcount exp_rpkm exp_total full_rpkm gene id is_metadata is_sample project_desc sample_id source_name sra_id taxid var
16177.9 metadata_9606_SAMEA962332 true true PRJEB2445 SAMEA962332 thyroid ERS025090 9606
16970.1 metadata_9606_SAMEA962333 true true PRJEB2445 SAMEA962333 testes ERS025094 9606
17645.9 metadata_9606_SAMEA962334 true true PRJEB2445 SAMEA962334 prostate ERS025095 9606
15620.7 metadata_9606_SAMEA962335 true true PRJEB2445 SAMEA962335 liver ERS025096 9606
17816.5 metadata_9606_SAMEA962336 true true PRJEB2445 SAMEA962336 white blood cells ERS025091 9606
24649.8 metadata_9606_SAMEA962337 true true PRJEB2445 SAMEA962337 16 tissues mixture ERS025093 9606
17701.4 metadata_9606_SAMEA962338 true true PRJEB2445 SAMEA962338 lung ERS025099 9606
19777.8 metadata_9606_SAMEA962339 true true PRJEB2445 SAMEA962339 adipose ERS025098 9606
18398.2 metadata_9606_SAMEA962340 true true PRJEB2445 SAMEA962340 breast ERS025088 9606
But the best way to process such a big XML file is to use a StAX or SAX parser, which reads the file as a stream instead of loading it all into memory. (See A: Is There Any Tool To Extract Demanded Information From An Asn/Xml File? and Convert XML file to FASTA ... )
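For illustration, here is a minimal sketch of the streaming approach using Python's built-in SAX parser (this is not the XSLT stylesheet used above, and Python is my choice here, not something the answer prescribes). It assumes each record is a direct child of the added root and that its fields are either attributes or simple child elements; I have not checked this against the actual layout of the NCBI file, so adapt the depth handling as needed. The names `RecordHandler`, `stream_records` and `stream_gz` are made up for this example. Feeding `<root>` and `</root>` to the parser by hand also removes the need for the intermediate tmp.xml.

```python
import gzip
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Turn each direct child of the root into a flat dict and hand it to a callback."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record
        self.depth = 0
        self.record = None
        self.field = None
        self.text = []

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2:
            # Assumption: a record is a direct child of the synthetic <root>.
            self.record = dict(attrs.items())
        elif self.depth == 3:
            # Assumption: child elements of a record are simple text fields.
            self.field = name
            self.text = []

    def characters(self, content):
        if self.depth == 3:
            self.text.append(content)

    def endElement(self, name):
        if self.depth == 3:
            self.record[self.field] = "".join(self.text).strip()
        elif self.depth == 2:
            # The record is complete: emit it and forget it, so memory stays flat.
            self.on_record(self.record)
            self.record = None
        self.depth -= 1


def stream_records(binary_stream, on_record, chunk_size=1 << 16):
    """Parse a root-less XML byte stream chunk by chunk."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(RecordHandler(on_record))
    parser.feed(b"<root>")  # supply the missing root element
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.feed(b"</root>")
    parser.close()


def stream_gz(path, on_record):
    """Convenience wrapper: read the gzipped file directly."""
    with gzip.open(path, "rb") as handle:
        stream_records(handle, on_record)
```

Something like `stream_gz("PRJEB2445_GRCh38.p2_107_expression.xml.gz", print)` would then print one dict per record as it is read; replace `print` with a function that writes only the columns you need.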
Wow that worked perfectly! Thanks.
To close the question, please validate my answer by clicking the green tick on the left.