I want to retrieve reads from NCBI that correspond to Staphylococcus aureus isolated from specific hosts (e.g. bovine, swine). I also want to extract metadata (such as collection place and/or date) for those isolates. Searching “Staphylococcus aureus” on SRA (https://www.ncbi.nlm.nih.gov/sra/advanced) gives me approximately 45,000 entries.
However, the SRA Run Selector output (“RunInfo”, which focuses on the run rather than on the sample) usually doesn’t include much metadata. After googling around, I learnt that there’s extra metadata on the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata). I’ve downloaded the latest full file with all metadata, NCBI_SRA_Metadata_Full_20170501.tar.gz (1.4GB), and confirmed that many runs contain a file matching the pattern .sample.xml that includes info like host or collection date.
I’ve tried extracting the .sample.xml files corresponding to the 45k S. aureus IDs so that I can parse the information I want from them, but it takes forever to look inside NCBI_SRA_Metadata_Full_20170501.tar.gz and pull out those particular files.
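For reference, part of the slowness is that a gzipped tar has no index, so extracting members by name one at a time rescans the archive for each request. A minimal sketch of a single-pass alternative follows. It rests on several assumptions that are not confirmed here: that the part of each file name before .sample.xml is an accession matching my ID list, and that the XML holds SAMPLE elements with SAMPLE_ATTRIBUTE children carrying TAG and VALUE. The function name iter_sample_attributes is made up for illustration.

```python
import tarfile
import xml.etree.ElementTree as ET

SUFFIX = ".sample.xml"


def iter_sample_attributes(tar_path, wanted_ids):
    """Yield (file_accession, sample_accession, attributes) in one pass.

    Assumptions (hypothetical, check against the real archive):
    - the file name prefix before ".sample.xml" is an ID in wanted_ids;
    - attributes live in SAMPLE_ATTRIBUTE elements with TAG/VALUE children.
    """
    wanted_ids = set(wanted_ids)
    # "r|gz" opens the archive as a forward-only stream: every member is
    # visited exactly once and nothing is written to disk.
    with tarfile.open(tar_path, mode="r|gz") as archive:
        for member in archive:
            base_name = member.name.rsplit("/", 1)[-1]
            if not member.isfile() or not base_name.endswith(SUFFIX):
                continue
            file_accession = base_name[: -len(SUFFIX)]
            if file_accession not in wanted_ids:
                continue
            # In stream mode the member must be read before moving on.
            handle = archive.extractfile(member)
            root = ET.parse(handle).getroot()
            for sample in root.iter("SAMPLE"):
                attributes = {}
                for attribute in sample.iter("SAMPLE_ATTRIBUTE"):
                    tag = attribute.findtext("TAG")
                    if tag:
                        value = attribute.findtext("VALUE") or ""
                        # Lower-case tags so "Host" and "host" collapse.
                        attributes[tag.strip().lower()] = value.strip()
                yield file_accession, sample.get("accession"), attributes
```

With something like this, the host filter would be applied afterwards, for example by keeping only records whose attributes.get("host") matches the species of interest. Since submitters name their tags freely, the tag to look for may vary between submissions.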
Alternatively, I’ve also read about the SRAdb package in Bioconductor, but I’m not really sure how to query the database to filter by host. I’m also not sure how standardised this metadata is.
Does anybody have experience doing something similar, or working with SRAdb, who could point me in the right direction?
Thanks a lot and sorry if I didn't explain myself very well.