Question

How can NIH SRA data be more useful?

1

2.3 years ago

datanerd ▴ 520

Hi all,

SRA holds a lot of useful, reusable data for researchers, but some of it may never get used at all. I would appreciate anyone's advice on how these data could be made more useful to researchers. Would it help to have the data QCed and released with quality metrics, or provided in some ready-to-use (i.e. preprocessed) form? Are some dataset types used more than others (RNA-seq vs. DNA vs. metagenomics, or disease-specific data)?

SRA resource genomics opendata • 2.1k views

ADD COMMENT • link updated 2.3 years ago by Jeremy Leipzig 22k • written 2.3 years ago by datanerd ▴ 520

0

Qiagen has done exactly this for ~100K public datasets from SRA in their Land Explorer/Analysis Match package. These are add-ons for Ingenuity Pathway Analysis (IPA), which in itself can be pricey. You can easily match the results of your own analyses against these public datasets.

ADD REPLY • link 2.3 years ago by GenoMax 141k

score 4 · Answer 1 · 2021-12-17

4

2.3 years ago

Jeremy Leipzig 22k

What comes to mind is that in all but the simplest experiments it can be difficult to align what is written in manuscripts with what is in the SRA metadata in terms of treatment and biological and technical replicates. So any tools that help you go from an SRA project to a correctly assembled workflow manifest (nf-core or otherwise) would be helpful. I should mention that PEP could be a useful standard here. But you would really have to read the paper to do this properly.
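To make the "SRA project to workflow manifest" idea concrete, here is a minimal Python sketch. It assumes the project's run table has been exported as CSV with columns named `Run`, `treatment`, and `replicate`. Those column names, the placeholder accessions, and the output fields (`sample`, `run_accession`, `condition`) are all hypothetical. Real run tables vary by submitter, and this is not the nf-core samplesheet format.

```python
import csv
import io

def run_table_to_manifest(run_table_csv, group_col="treatment", rep_col="replicate"):
    """Turn run-table rows into manifest rows, one readable sample name per run."""
    reader = csv.DictReader(io.StringIO(run_table_csv))
    manifest = []
    for row in reader:
        group = (row.get(group_col) or "").strip()
        rep = (row.get(rep_col) or "").strip()
        if not group or not rep:
            # Grouping info is missing: flag the run for manual curation
            # (i.e. reading the paper) instead of guessing.
            sample = "UNASSIGNED_" + row["Run"]
        else:
            sample = f"{group}_rep{rep}"
        manifest.append({
            "sample": sample,
            "run_accession": row["Run"],
            "condition": group or "unknown",
        })
    return manifest

# Placeholder accessions and metadata, for illustration only.
example = """Run,treatment,replicate
SRR0000001,control,1
SRR0000002,control,2
SRR0000003,drug,1
SRR0000004,,
"""

for entry in run_table_to_manifest(example):
    print(entry)
```

Runs without usable grouping metadata are marked `UNASSIGNED_...` rather than silently assigned, since those are the cases where the manuscript has to be consulted.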

I am excited about harmonization/normalization efforts (recount2/refine.bio) and also this interesting site https://pluto.bio where you can reanalyze a number of old GEO/SRA experiments.

ADD COMMENT • link 2.3 years ago by Jeremy Leipzig 22k

3

Tbh, just having a mandatory and proper table of metadata incl. lab processing dates and clinical features would already be a huge step forward. Too often I am guessing based on file names which sample belongs to which group. Correcting for confounders is often a challenge because we know nothing about the data.
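A mandatory metadata table could be enforced with a very simple completeness check at submission time. The Python sketch below is only an illustration. The required field names (`group`, `processing_date`, `sex`) and the sample records are assumptions made up for the example, not an actual SRA schema.

```python
REQUIRED_FIELDS = ("group", "processing_date", "sex")  # hypothetical field names

def missing_metadata(samples, required=REQUIRED_FIELDS):
    """Return {sample_id: [missing fields]} for samples lacking required metadata."""
    problems = {}
    for sample_id, fields in samples.items():
        # A field counts as missing if it is absent, None, or blank.
        missing = [f for f in required if not str(fields.get(f) or "").strip()]
        if missing:
            problems[sample_id] = missing
    return problems

# Made-up records for illustration.
samples = {
    "sample_A": {"group": "case", "processing_date": "2021-03-01", "sex": "F"},
    "sample_B": {"group": "control", "processing_date": "", "sex": "M"},
    "sample_C": {"group": "case"},
}

print(missing_metadata(samples))
```

A submission would be rejected, or at least flagged, whenever the returned dictionary is non-empty.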

ADD REPLY • link 2.3 years ago by ATpoint 81k

0

Thanks! I have had that problem too. There is no clear naming convention linking files to metadata, although I think it is getting better with more recent submissions. Agree on the confounders point. Many times the metadata will only contain the information used in the paper, and other associated metadata are not submitted at all.

ADD REPLY • link 2.3 years ago by datanerd ▴ 520

3

recount3 was actually just released, and it's pretty nifty/easy to get a variety of pre-processed counts from. Agree about the metadata though, it's a crapshoot as to whether it's even possible to determine the sample groupings in any associated publications.

ADD REPLY • link 2.3 years ago by jared.andrews07 ★ 16k

0

Thanks Jared, agree that metadata is a big challenge. With the last point, do you mean having a way to determine whether there is another dataset with similar groups based on the metadata?

ADD REPLY • link 2.3 years ago by datanerd ▴ 520

2

I mean just determining which samples are which with regard to results shown in a corresponding publication. Often, sample identifiers are not the same.

ADD REPLY • link 2.3 years ago by jared.andrews07 ★ 16k

0

Great suggestion regarding the 'SRA project to workflow manifest'. I really think that would be very useful, and I will look over what PEP offers. My biggest complaint about SRA has always been the metadata. Do you think it would be valuable to run QC (assay, platform, series, gender check, etc.) and provide the results with the dataset?

ADD REPLY • link 2.3 years ago by datanerd ▴ 520

1

SRA metadata varies in quality by project. Some submitters just treat the metadata like a headache to be ignored or avoided and others are more conscientious. I don't think SRA performs sex checks or duplicate file content checks (not sure about this) but dbGaP does and it can reveal serious lab and sample management errors.
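A duplicate file content check of the kind mentioned above can be approximated by hashing each file and grouping identical digests. The Python sketch below only illustrates the idea and makes no claim about how SRA or dbGaP implement their checks. The file names and contents are made up, and the files are represented as in-memory bytes to keep the example self-contained.

```python
import hashlib
from collections import defaultdict

def find_duplicate_contents(files):
    """Group file names whose byte content is identical. `files` maps name -> bytes."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Made-up miniature FASTQ contents for illustration.
files = {
    "patient1.fastq": b"@r1\nACGT\n+\nIIII\n",
    "patient2.fastq": b"@r1\nACGT\n+\nIIII\n",
    "patient3.fastq": b"@r1\nTTGA\n+\nIIII\n",
}

print(find_duplicate_contents(files))
```

For real sequencing files you would read and hash them in chunks rather than loading them into memory. Two "different" samples with identical content are exactly the kind of sample management error such a check is meant to catch.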

ADD REPLY • link 2.3 years ago by Jeremy Leipzig 22k

0

Thanks for all the input. I often think of SRA as the 'raw data' and GEO as the processed data. What are your thoughts on which one researchers use more? In my experience I have used GEO because it saves processing time, but there are disadvantages: the analysis might be very hypothesis-specific, and sometimes you have to trust the entire analysis pipeline.

ADD REPLY • link 2.3 years ago by datanerd ▴ 520