Question

Missing columns in meta table from SRA Selector

12 months ago

tnocs • 0

I'm trying to fetch metadata from the SRA Run Selector:

https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA253315&o=acc_s%3Aa

using the Linux command line and a project ID. I can do it with this line:

esearch -db sra -query PRJNA253315 | efetch -format runinfo > file.csv

But it doesn't give me all the columns; the "Antibody" and "TREATMENT" columns aren't there, for example. I know there is a way to specify exactly which columns I want, but I don't want to do that. I just want the exact table that I would get if I clicked the "Metadata" button on the website. How can I do this on the command line? Do esearch and efetch offer ways of doing this?

SRA esearch efetch • 743 views

ADD COMMENT • link updated 12 months ago by Istvan Albert 100k • written 12 months ago by tnocs • 0

Do esearch and efetch offer ways of doing this?

No, there does not seem to be. The information provided in SRA Run Selector is not identical to that provided by EntrezDirect.
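
To see exactly which columns differ between the two sources, one option is to compare the header rows of the two tables. The following is only a sketch: the two small tables are made up for illustration (real headers are much longer), and the helper names `header` and `missing_columns` are hypothetical.

```python
import csv
import io


def header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def missing_columns(reference_text, other_text):
    """Columns in the reference table that the other table lacks (case-insensitive)."""
    present = {name.strip().lower() for name in header(other_text)}
    return [name for name in header(reference_text)
            if name.strip().lower() not in present]


# Tiny made-up tables for illustration only; real headers are much longer.
run_selector = "Run,Antibody,Assay Type,TREATMENT\nSRR1448774,H3,ChIP-Seq,DMSO\n"
runinfo = "Run,spots,bases\nSRR1448774,59803030,3049954530\n"

print(missing_columns(run_selector, runinfo))
```

With real files, the same function could be fed the text of the table downloaded via the "Metadata" button and the text of `file.csv` from the efetch command in the question.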

ADD REPLY • link 12 months ago by GenoMax 141k

score 0 · Answer 1 · 2023-04-24

Unfortunately, there is no enforced standard for what metadata must make it into the SRA. It is very frustrating, actually, and makes reproducing any analysis needlessly complicated.

You can look at what fields EBI has; sometimes it provides more fields than SRA:

pip install bio

then look at the metadata that way:

bio search PRJNA253315 --all | more

prints things like:

[
    {
        "accession": "SAMN02870079",
        "altitude": "",
        "assembly_quality": "",
        "assembly_software": "",
        "base_count": "3049954530",
        "binning_software": "",
        "bio_material": "",
        "broker_name": "",
        "cell_line": "IMR90",
        "cell_type": "",
        "center_name": "GEO",
        "checklist": "",
        "collected_by": "",
        "collection_date": "",
        "collection_date_submitted": "",
        "completeness_score": "",
        "contamination_score": "",
        "country": "",
        "cram_index_aspera": "",
        "cram_index_ftp": "",
        "cram_index_galaxy": "",
        "cultivar": "",
        "culture_collection": "",
        "depth": "",
        "description": "Illumina HiSeq 2000 sequencing; GSM1418957: H3 ChIP (DMSO); Homo sapiens; ChIP-Seq",
        "dev_stage": "",
        "ecotype": "",
        "elevation": "",
        "environment_biome": "",
        "environment_feature": "",
        "environment_material": "",
        "environmental_package": "",
        "environmental_sample": "false",
        "experiment_accession": "SRX620734",
        "experiment_alias": "GSM1418957",
        "experiment_title": "Illumina HiSeq 2000 sequencing; GSM1418957: H3 ChIP (DMSO); Homo sapiens; ChIP-Seq",
        "experimental_factor": "",
        "fastq_aspera": "fasp.sra.ebi.ac.uk:/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz",
        "fastq_bytes": "2838144326",
        "fastq_galaxy": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz",
        "fastq_md5": "2ac617b0b8670c9d4a9bc15213f68c4f",
        "first_created": "2015-06-05",
        "first_public": "2015-06-05",
        "germline": "false",
        "host": "",
        "host_body_site": "",
        "host_genotype": "",
        "host_gravidity": "",
        "host_growth_conditions": "",
        "host_phenotype": "",
        "host_sex": "",
        "host_status": "",
        "host_tax_id": "",
        "identified_by": "",
        "instrument_model": "Illumina HiSeq 2000",
        "instrument_platform": "ILLUMINA",
        "investigation_type": "",
        "isolate": "",
        "isolation_source": "",
        "last_updated": "2019-11-16",
        "lat": "",
        "library_construction_protocol": "For ChIP-seq, cells were crosslinked with formaldehyde (1% final) for 10min at room temperature, and harvested for sonication.  Nuclei were extracted and chromatin was sheared to an average size of 200bp using a Diagenode Bioruptor. For RNA-seq, cells were harvested and PolyA+ RNA was isolated using the NEBNext Ultra RNA-seq Isolation Module. For ATAC-seq, cells were harvested, nuclei were prepped,and transposase was added for 30 minutes at 30C. Sequencing libraries for ChIP-seq were constructd using the NEBNext Ultra kit as per manufacturer's recommended instructions Sequencing libraries for ATAC-seq were constructed using custom Nextera-compatible primers, from Nextera-adapted DNA fragments",
        "library_layout": "SINGLE",
        "library_name": "",
        "library_selection": "ChIP",
        "library_source": "GENOMIC",
        "library_strategy": "ChIP-Seq",
        "location": "",
        "lon": "",
        "mating_type": "",
        "nominal_length": "",
        "nominal_sdev": "",
        "parent_study": "PRJNA9558",
        "ph": "",
        "project_name": "",
        "protocol_label": "",
        "read_count": "59803030",
        "run_accession": "SRR1448774",
        "run_alias": "GSM1418957_r1",
        "salinity": "",
        "sample_accession": "SAMN02870079",
        "sample_alias": "GSM1418957",
        "sample_capture_status": "",
        "sample_collection": "",
        "sample_description": "H3 ChIP (DMSO)",
        "sample_material": "",
        "sample_title": "H3 ChIP (DMSO)",
        "sampling_campaign": "",
        "sampling_platform": "",
        "sampling_site": "",
        "scientific_name": "Homo sapiens",
        "secondary_sample_accession": "SRS645140",
        "secondary_study_accession": "SRP043510",
        "sequencing_method": "",
        "serotype": "",
        "serovar": "",
        "sex": "",
        "specimen_voucher": "",
        "sra_aspera": "fasp.sra.ebi.ac.uk:/vol1/srr/SRR144/004/SRR1448774",
        "sra_bytes": "1994732885",
        "sra_ftp": "ftp.sra.ebi.ac.uk/vol1/srr/SRR144/004/SRR1448774",
        "sra_galaxy": "ftp.sra.ebi.ac.uk/vol1/srr/SRR144/004/SRR1448774",
        "sra_md5": "e3920e0a35006ada4a8738af2c7bfcf7",
        "strain": "",
        "study_accession": "PRJNA253315",
        "study_alias": "GSE58740",
        "study_title": "Chromatin dynamics of p53 binding sites in IMR90",
        "sub_species": "",
        "sub_strain": "",
        "submission_accession": "SRA172049",
        "submission_tool": "",
        "submitted_aspera": "",
        "submitted_bytes": "",
        "submitted_format": "",
        "submitted_ftp": "",
        "submitted_galaxy": "",
        "submitted_host_sex": "",
        "submitted_md5": "",
        "submitted_sex": "",
        "target_gene": "",
        "tax_id": "9606",
        "taxonomic_classification": "",
        "taxonomic_identity_marker": "",
        "temperature": "",
        "tissue_lib": "",
        "tissue_type": "",
        "variety": "",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz"
        ],
        "info": "3 GB file; 60 million reads; 3.0 billion sequenced bases"
    },
  [...]

It's pretty nuts actually: look at all the fields that are not filled in. Sometimes you can parse out various information from other fields.
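
As a rough sketch of that kind of parsing, the snippet below drops the empty fields from a record, tries to pull an antibody and a treatment out of `sample_title`, and writes the result as a CSV table. The record is trimmed from the output above. The title layout `<target> ChIP (<treatment>)` is an assumption that happens to fit this sample, not a standard, and the helper names and the `meta.json` file name are hypothetical.

```python
import csv
import io
import json
import re

# One record trimmed from the output above; a real run would load the full
# JSON, e.g. records = json.load(open("meta.json")) (hypothetical file name).
records = json.loads("""
[
    {
        "run_accession": "SRR1448774",
        "cell_line": "IMR90",
        "cell_type": "",
        "library_strategy": "ChIP-Seq",
        "sample_title": "H3 ChIP (DMSO)",
        "strain": ""
    }
]
""")


def drop_empty(record):
    """Keep only fields that have a non-empty value."""
    return {key: value for key, value in record.items() if value not in ("", [], None)}


# Assumed title layout: "<target> ChIP (<treatment>)". This matches the sample
# shown above but is not a standard; other submissions will differ.
TITLE_PATTERN = re.compile(r"^(?P<antibody>\S+) ChIP \((?P<treatment>[^)]+)\)$")


def parse_title(title):
    """Return antibody and treatment guessed from a sample title, or {} if no match."""
    match = TITLE_PATTERN.match(title)
    return match.groupdict() if match else {}


def to_csv(rows):
    """Write rows (dicts) as CSV text, using the union of keys as columns."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = []
for record in records:
    row = drop_empty(record)
    row.update(parse_title(record.get("sample_title", "")))
    rows.append(row)

table = to_csv(rows)
print(table)
```

Titles that do not fit the assumed pattern simply get no `antibody` or `treatment` value, so the guessed columns should be checked by eye before being used in an analysis.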

score 0 · Answer 2 · 2023-04-24

12 months ago

zhousun21 ▴ 40

For a lot of SRA submissions, there is no antibody or treatment data associated with the organism or experiment. So, nothing for the submitter to enter in those fields.
