ffq offers metadata retrieval from the SRA.
See Gálvez-Merchán, Á., et al. (2023) 'Metadata retrieval from sequence databases with ffq'
Installing ffq gets you both a command-line tool and a Python module. (See the note below: on some systems the command-line tool isn't easy to access, so an equivalent is provided in the example.)
From the paper, it seems it uses NCBI Entrez programming utilities under the hood.
Specific example with ffq paralleling GenoMax's example
Let's assume you are doing this in Jupyter, which you can actually do without installing anything on your machine, or even logging in, by running the following in your temporary Jupyter session started from here by pressing 'launch binder':
%pip install ffq pyjq
We'll use pyjq later, but install it now.
Restart the kernel after that using Kernel > Restart Kernel... from the menu.
Run the following using ffq:
!ffq SRP144355 -o SRP144355.txt
(Leave off the exclamation to do that in a terminal.)
You'll get something like what is shown below, with ... used to truncate the display because, as GenoMax points out in his post, "(there are 143 samples showing two examples)".
For stderr:
[2025-01-31 20:12:29,901] INFO Parsing Study SRP144355
[2025-01-31 20:12:30,107] INFO Getting Sample for SRP144355
/srv/conda/envs/notebook/lib/python3.10/site-packages/ffq/utils.py:1082: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
return BeautifulSoup(
[2025-01-31 20:14:51,501] WARNING There are 143 samples for SRP144355
[2025-01-31 20:14:51,501] INFO Parsing sample SRS3242949
[2025-01-31 20:14:51,691] WARNING Failed to parse sample information from ENA XML. Falling back to ENA search...
[2025-01-31 20:14:51,821] INFO Getting Experiment for SRS3242949
[2025-01-31 20:14:51,821] INFO Parsing Experiment SRX4022459
...
And as the output, you'll get:
{
"SRP144355": {
"accession": "SRP144355",
"title": "Predicting age from the transcriptome of human dermal fibroblasts",
"abstract": "There is a marked heterogeneity in human lifespan and health outcomes for people of the same chronological age.
...
Next we'd need to read that in and parse it.
You have a couple of options for using ffq to get the JSON into a variable we'll call data here.
In fact, instead of reading the file back in, if you are using a Python kernel in Jupyter you can run this in a cell:
%%capture out
import sys
from ffq.main import main
sys.argv = ['ffq','SRP144355']
main()
(Note that this option will work in cloud environments like Anaconda Cloud, whereas the terminal, or using ffq with an exclamation point in a cell, may not work because pip doesn't easily install the command line interface everywhere.)
In subsequent cells you can access the output with out.stdout, such as:
data_str = out.stdout
(The cell magic %%capture that Jupyter provides is a nice convenience, but it may confuse those unfamiliar with it because it uses elements of shell and Python, so I am specifically pointing it out.)
Then parse the stdout string into a JSON object data with:
import json
data = json.loads(out.stdout)
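If you are working in a plain Python script rather than Jupyter, %%capture is not available. A minimal sketch of the same idea using only the standard library is below. The helper name capture_json is hypothetical (it is not part of ffq), and it assumes the entry point reads sys.argv and prints its JSON to stdout, which is what the %%capture approach above relies on.

```python
import contextlib
import io
import json
import sys


def capture_json(entry_point, argv):
    """Call a CLI-style entry point and parse what it prints as JSON.

    Hypothetical helper, not part of ffq. Assumes `entry_point` reads its
    arguments from sys.argv and writes JSON to stdout.
    """
    buffer = io.StringIO()
    saved_argv = sys.argv
    sys.argv = list(argv)
    try:
        # Everything printed to stdout inside this block lands in `buffer`.
        with contextlib.redirect_stdout(buffer):
            entry_point()
    finally:
        sys.argv = saved_argv  # always restore the real arguments
    return json.loads(buffer.getvalue())


# Assumed usage with ffq (needs network access):
# from ffq.main import main
# data = capture_json(main, ["ffq", "SRP144355"])
```

Log messages like the ones shown earlier go to stderr, so they are not captured and do not interfere with parsing.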
But what if you went with the command !ffq SRP144355 -o SRP144355.txt?
For reading the saved file made by that command, you can use:
import json
with open("SRP144355.txt") as f:
    data = json.load(f)
Either way, you should have data at the end of those steps. (Running type(data) will give you dict because the JSON is parsed into a Python dictionary.)
And you can use pyjq to parse or just plain Python. We'll take advantage of the json structure in this example and use keys.
We can extract all the SRS accessions and SRR accessions with the following:
import json
import pyjq

# Assuming your data is loaded as:
# with open("SRP144355.txt") as f:
#     data = json.load(f)

# Get all SRS accessions
srs_query = '.[] | .samples | keys[]'
# Alternative query that gets the accessions from the full objects:
# srs_query = '.[] | .samples | .[] | .accession'

# Get all SRR accessions
srr_query = '.[] | .samples | .[] | .experiments | .[] | .runs | keys[]'
# Alternative query that gets the accessions from the full objects:
# srr_query = '.[] | .samples | .[] | .experiments | .[] | .runs | .[] | .accession'

def extract_accessions(data):
    # Extract both types of accessions
    srs_accessions = pyjq.all(srs_query, data)
    srr_accessions = pyjq.all(srr_query, data)
    print("SRS Accessions:", srs_accessions)
    print("\nSRR Accessions:", srr_accessions)
    # Print counts
    print(f"\nFound {len(srs_accessions)} SRS accessions")
    print(f"Found {len(srr_accessions)} SRR accessions")
    return srs_accessions, srr_accessions

srs_accessions, srr_accessions = extract_accessions(data)
That will give you 143 of each.
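For the "just plain Python" route, here is a minimal sketch that walks the same study > samples > experiments > runs nesting the jq queries rely on. The function name collect_accessions and the toy dictionary are made up for illustration; the toy data only mimics the shape of the ffq output and is not real output.

```python
def collect_accessions(data):
    """Collect sample (SRS) and run (SRR) accessions from ffq-style JSON.

    Assumes the nesting used by the jq queries:
    study -> "samples" -> "experiments" -> "runs", each keyed by accession.
    """
    srs, srr = [], []
    for study in data.values():
        for srs_acc, sample in study.get("samples", {}).items():
            srs.append(srs_acc)
            for experiment in sample.get("experiments", {}).values():
                srr.extend(experiment.get("runs", {}).keys())
    return srs, srr


# Toy input with the same shape (illustrative only)
toy = {
    "SRP144355": {
        "samples": {
            "SRS3243030": {
                "experiments": {
                    "SRX4022539": {"runs": {"SRR7093892": {}}}
                }
            }
        }
    }
}
print(collect_accessions(toy))  # (['SRS3243030'], ['SRR7093892'])
```

One difference to be aware of: jq's keys returns the keys sorted, whereas this keeps them in the order they appear in the file.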
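Now, as GenoMax did for two example runs (SRR7093892 & SRR7093893), we can look up individual samples. The example below fetches the sample SRS3243030, which contains the run SRR7093892; add further SRS accessions to the set to fetch more:
import sys
from ffq.main import main  # already imported if you ran the %%capture cell above

wanted = {'SRS3243030'}  # add more SRS accessions here as needed
for acc in srs_accessions:
    if acc in wanted:
        sys.argv = ['ffq', acc]
        main()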
That gives the result that starts out like so:
{
"SRS3243030": {
"accession": "SRS3243030",
"title": "98_17yr_Male_Caucasian",
"organism": "Homo sapiens",
"attributes": {
"INSDC secondary accession": "SRS3243030",
"NCBI submission package": "Generic.1.0",
"disease": "Normal",
"ethnicity": "Caucasian",
"organism": "Homo sapiens",
"Sex": "male",
"cell id": "GM07753",
"age": "17",
"source_name": "Skin; Unspecified",
"BioSampleModel": "Generic",
"ENA-FIRST-PUBLIC": "2022-03-29",
"ENA-LAST-UPDATE": "2022-03-29"
},
"experiments": {
"SRX4022539": {
"accession": "SRX4022539",
"title": "GSM3124643: 98_17yr_Male_Caucasian; Homo sapiens; RNA-Seq",
"platform": "ILLUMINA",
"instrument": "NextSeq 500",
"runs": {
"SRR7093892": {
"accession": "SRR7093892",
"experiment": "SRX4022539",
...
If you wanted to do that for all the SRS accessions, just delete the conditional, like so:
for acc in srs_accessions:
    sys.argv = ['ffq', acc]
    main()
Adapt the code as you see fit using some Python.
The equivalent of the earlier loop with the conditional could also be run in a Jupyter cell with:
for acc in srs_accessions:
    if acc == 'SRS3243030':
        !ffq {acc}
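Once you have the per-sample JSON, you will probably want the attributes (age, sex, and so on) in a table. The following is a rough sketch of one way to flatten them into rows and CSV text. The helper names sample_rows and rows_to_csv are hypothetical, and the code assumes each sample object has the "title", "organism", and "attributes" keys seen in the SRS3243030 output above.

```python
import csv
import io


def sample_rows(samples, fields=("title", "organism")):
    """Flatten a dict of samples (keyed by SRS accession) into row dicts."""
    rows = []
    for accession, sample in samples.items():
        row = {"accession": accession}
        for field in fields:
            row[field] = sample.get(field, "")
        # Attribute names such as "age" or "Sex" become columns.
        row.update(sample.get("attributes", {}))
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Render rows as CSV text; attributes missing from a row stay blank."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Abbreviated from the SRS3243030 output shown above
example = {
    "SRS3243030": {
        "title": "98_17yr_Male_Caucasian",
        "organism": "Homo sapiens",
        "attributes": {"Sex": "male", "age": "17"},
    }
}
print(rows_to_csv(sample_rows(example)))
```

If the samples nested in the study-level JSON carry the same keys (an assumption worth checking on your own data), sample_rows(data["SRP144355"]["samples"]) would give one row per sample without any further ffq calls.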
Hi. When I run the command you mentioned above (command below), there is no content in the file.
And when I run the command on the web page, it shows "HTTP ERROR 400". Do you know how to solve it?