Extracting specific metadata from SRA runs

8.2 years ago

Gon ▴ 10

Dear everyone,

I want to retrieve reads from NCBI that correspond to Staphylococcus aureus isolated from specific hosts (e.g. bovine, swine). I also want to extract metadata (such as collection place and/or date) for those isolates. Searching for “Staphylococcus aureus” in SRA (https://www.ncbi.nlm.nih.gov/sra/advanced) gives me approximately 45,000 entries.

However, the SRA Run Selector output (“RunInfo”, which focuses on the run rather than the sample) usually doesn’t include much metadata. After googling around, I learnt that there’s extra metadata at NCBI (ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata). I’ve downloaded the latest full metadata file, NCBI_SRA_Metadata_Full_20170501.tar.gz (1.4GB), and confirmed that many entries contain a file matching the pattern *.sample.xml that includes information such as host or collection date.

I’ve tried extracting the .sample.xml files corresponding to the 45k S. aureus IDs but it takes forever to look inside the NCBI_SRA_Metadata_Full_20170501.tar.gz file and extract those particular files, which I intended to use to parse the information I want.
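One way to avoid unpacking the whole archive is to read it as a stream in a single pass and parse only the members of interest as they go by. The following is a minimal sketch under stated assumptions, not a tested pipeline: it assumes member names end in `.sample.xml`, and that each file contains `SAMPLE` elements with a `SAMPLE_NAME/TAXON_ID` child and `SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE` entries holding `TAG` and `VALUE`. The helper name `iter_sample_attributes` is hypothetical; 1280 is the NCBI taxonomy ID for Staphylococcus aureus.

```python
import tarfile
import xml.etree.ElementTree as ET


def iter_sample_attributes(archive_path, taxon_id="1280"):
    """Yield (sample_accession, tag, value) for samples of one taxon.

    The tar.gz is read sequentially ("r|gz"), so nothing is written to
    disk and the archive is decompressed only once.
    """
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            # Assumed naming convention: <submission>.sample.xml
            if not (member.isfile() and member.name.endswith(".sample.xml")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            try:
                root = ET.parse(handle).getroot()
            except ET.ParseError:
                continue  # skip malformed XML rather than abort the scan
            for sample in root.iter("SAMPLE"):
                found = sample.findtext("SAMPLE_NAME/TAXON_ID", "").strip()
                if found != taxon_id:
                    continue
                accession = sample.get("accession", "")
                for attr in sample.iterfind("SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE"):
                    tag = (attr.findtext("TAG") or "").strip()
                    value = (attr.findtext("VALUE") or "").strip()
                    yield accession, tag, value
```

Looping over `iter_sample_attributes("NCBI_SRA_Metadata_Full_20170501.tar.gz")` and printing each tuple tab-separated would give a long-format table. Filtering on the taxon ID inside the XML sidesteps the need to know in advance which submission file holds which sample.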

Alternatively, I’ve read about the SRAdb package in Bioconductor, but I’m not sure how to query the database to filter by host, nor how standardised the stored data is.

Does anybody have experience doing something similar, or working with SRAdb, who could point me in the right direction?

Thanks a lot and sorry if I didn't explain myself very well.

Gon

SRA Metadata NCBI Genomics SRAdb • 5.2k views

ADD COMMENT • link updated 6.6 years ago by Rob • 0 • written 8.2 years ago by Gon ▴ 10

Quick comment: the SRAdb vignette is here: http://bioconductor.org/packages/release/bioc/vignettes/SRAdb/inst/doc/SRAdb.pdf

Could you give one example of what you are looking for, with a given accession?

ADD REPLY • link 8.2 years ago by Santosh Anand 5.8k

For example, for the accession ERS446847 I want to parse the sample attributes tagged "geographic location (country and/or sea)" and "collection date". From visual inspection of the file ERA305021.sample.xml (ERA305021 is the submission code for that sample), which I extracted from NCBI_SRA_Metadata_Full_20170501.tar.gz, these are Denmark and 2007-01-01 respectively. However, I know that not all samples store this information under the same tags but under variations of them, which makes everything more complicated!

I've seen the SRAdb vignette, but it didn't help me understand how to filter data from the SRA metadata.
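Tag variation of this kind is often handled with a small alias table that maps each spelling onto one canonical field name. The sketch below is only an illustration: the alias lists are assumptions that combine the tags mentioned in this thread with the common NCBI BioSample names `geo_loc_name` and `host`, and would need extending after inspecting the tags actually present. `canonical_tag` and `normalise_attributes` are hypothetical helper names.

```python
import re

# Illustrative alias table (an assumption); extend it after inspecting your data.
CANONICAL_TAGS = {
    "country": ["geographic location (country and/or sea)", "geo_loc_name", "country"],
    "collection_date": ["collection date", "collection_date"],
    "host": ["host", "host scientific name"],
}


def _squash(tag):
    """Lower-case a tag and collapse spaces, underscores and hyphens to one space."""
    return re.sub(r"[\s_\-]+", " ", tag.strip().lower())


# Reverse lookup: squashed alias -> canonical field name.
_LOOKUP = {
    _squash(alias): field
    for field, aliases in CANONICAL_TAGS.items()
    for alias in aliases
}


def canonical_tag(tag):
    """Return the canonical field for a raw tag, or None if it is not recognised."""
    return _LOOKUP.get(_squash(tag))


def normalise_attributes(pairs):
    """Build {field: value} from (tag, value) pairs, ignoring empty or 'missing' values.

    When several aliases of one field are present, the first usable value wins.
    """
    record = {}
    for tag, value in pairs:
        field = canonical_tag(tag)
        cleaned = value.strip()
        if field is not None and cleaned.lower() not in {"", "missing"}:
            record.setdefault(field, cleaned)
    return record
```

Tags that do not map to any field are dropped silently here; logging them instead would help discover new variants to add to the table.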

Thanks

ADD REPLY • link 8.2 years ago by Gon ▴ 10

8.2 years ago

Pierre Lindenbaum 166k

Using a loop, apply an XSLT stylesheet (transform.xsl) to each XML file:

find . -type f -name "*.sample.xml" | while IFS= read -r F; do xsltproc transform.xsl "${F}"; done

You'll get a tabular view of the metadata (the header line is printed once per file):

EXPERIMENT	SAMPLE	TAG	VALUE
ERX458901	ERS446847	Is the sequenced pathogen host associated?	Yes
ERX458901	ERS446847	collected_by	National Food Institute (DTU Food)
ERX458901	ERS446847	geographic location (latitude)	missing
ERX458901	ERS446847	host health state	diseased
ERX458901	ERS446847	collection_date	2007-01-01
ERX458901	ERS446847	environmental_sample	No
ERX458901	ERS446847	geographic location (country and/or sea)	Denmark
ERX458901	ERS446847	strain	9B
ERX458901	ERS446847	geographic location (longitude)	missing
ERX458901	ERS446847	isolate	DTU2013_1139
ERX458901	ERS446847	host scientific name	Sus scrofa domesticus
ERX458901	ERS446847	serovar	missing


<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Header row; &#9; is a tab and &#10; a newline. -->
    <xsl:text>EXPERIMENT&#9;SAMPLE&#9;TAG&#9;VALUE&#10;</xsl:text>

    <!-- Keep only samples whose taxon ID is 1280 (Staphylococcus aureus). -->
    <xsl:for-each select="//SAMPLE[SAMPLE_NAME/TAXON_ID/text() = 1280]">
      <!-- One output row per sample attribute. -->
      <xsl:for-each select="SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE">
        <!-- Empty when the file has no EXPERIMENT element as a sibling of SAMPLE. -->
        <xsl:value-of select="../../../EXPERIMENT/@accession"/>
        <xsl:text>&#9;</xsl:text>
        <xsl:value-of select="../../@accession"/>
        <xsl:text>&#9;</xsl:text>
        <xsl:value-of select="TAG"/>
        <xsl:text>&#9;</xsl:text>
        <xsl:value-of select="VALUE"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
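Once the metadata is in this long EXPERIMENT/SAMPLE/TAG/VALUE layout, filtering by host is mostly a matter of grouping rows per sample. The following is a rough sketch, assuming the loop's output was redirected to a tab-separated file and that the host is stored under the tag "host scientific name" as in the example rows; `load_samples` and `filter_by_host` are hypothetical helper names.

```python
import csv
from collections import defaultdict


def load_samples(lines):
    """Group long-format rows (EXPERIMENT, SAMPLE, TAG, VALUE) by sample accession."""
    samples = defaultdict(dict)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank or short lines, and the header that repeats once per input file.
        if len(row) != 4 or row[0] == "EXPERIMENT":
            continue
        _experiment, sample, tag, value = row
        samples[sample][tag] = value
    return dict(samples)


def filter_by_host(samples, host_terms, host_tag="host scientific name"):
    """Keep samples whose host value contains any of host_terms (case-insensitive)."""
    wanted = [term.lower() for term in host_terms]
    return {
        accession: attrs
        for accession, attrs in samples.items()
        if any(term in attrs.get(host_tag, "").lower() for term in wanted)
    }
```

For instance, with the output saved as output.tsv, `filter_by_host(load_samples(open("output.tsv", newline="")), ["sus scrofa", "bos taurus"])` would keep the swine and bovine isolates. Substring matching is a deliberate simplification: host names in these records are free text, so the search terms need checking against the values that actually occur.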



Thanks Pierre, this is really helpful. I still have to wait for all the sample.xml files to unzip, but in this format the data is much easier to work with and I'll be able to parse what I need from it.

ADD REPLY • link 8.2 years ago by Gon ▴ 10

6.6 years ago

Rob • 0

You should take a look at these posts, which show how to extract metadata using the SQLite databases provided by the Meltzer lab: https://edwards.sdsu.edu/research/sra-metadata/. There are other posts we've written on extracting metadata about SRA runs here: https://edwards.sdsu.edu/research/sra/
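As a rough idea of what such a query could look like from Python, here is a sketch using the standard `sqlite3` module. Everything about the schema is an assumption to verify against the real database (for example with `PRAGMA table_info(sample)`): a table named `sample` with columns `sample_accession`, `taxon_id` and `sample_attribute`, where the attribute column is a single string laid out as `tag: value || tag: value`. `samples_by_host` and `parse_attribute_string` are hypothetical helper names.

```python
import sqlite3


def samples_by_host(conn, taxon_id, host_pattern):
    """Return (sample_accession, sample_attribute) rows for one taxon whose
    attribute string contains host_pattern (SQL LIKE, case-insensitive for ASCII).

    Table and column names are assumptions about the schema, not confirmed.
    """
    query = (
        "SELECT sample_accession, sample_attribute "
        "FROM sample "
        "WHERE taxon_id = ? AND sample_attribute LIKE ?"
    )
    return conn.execute(query, (taxon_id, f"%{host_pattern}%")).fetchall()


def parse_attribute_string(text, separator=" || "):
    """Split an assumed 'tag: value || tag: value' string into a dict."""
    record = {}
    for chunk in (text or "").split(separator):
        tag, found, value = chunk.partition(": ")
        if found:
            record[tag.strip()] = value.strip()
    return record
```

With a connection opened via `sqlite3.connect(...)` on the downloaded database file, `samples_by_host(conn, 1280, "Sus scrofa")` would return candidate swine isolates, and `parse_attribute_string` could then pull out the collection date or country from each row. A `LIKE` match on the whole attribute string can also hit the pattern in unrelated tags, so the parsed host value is worth re-checking afterwards.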

