Question

Three questions about datasets

5 months ago

SHXVRR ▴ 20

Hello,

I have three questions about RNA-seq and datasets:

1. Is it fine to combine datasets? Suppose I am doing a project comparing control tongue epithelial tissue vs. tumor tongue epithelial tissue through DESeq2 analysis. I have 5 control SRA files from one experiment and 5 control SRA files from another. Then I have 5 tumor SRA files from a third experiment and 5 tumor SRA files from a fourth. Is that fine since it is 10 controls vs. 10 tumors, or will it produce swayed results based on how the files were made?
2. What is the recommended number of files to work with for RNA-seq? I have heard that 10 control vs. 10 tumor, or 30 files in total, is ideal, but what is most advisable, given that finding a dataset can be hard? I have also seen people working on GEO datasets with 200 files or more. Is it "the more the merrier" for better results?
3. This question doesn't really relate to the first two, but there are MANY GEO datasets without an SRA link, and I find it hard to find usable datasets among them. An example is GSE58911, which is perfect for what I'm looking for but does not have FASTQ files, which are pretty much necessary for a typical RNA-seq pipeline. Am I doing something wrong, or is there a way to use the .txt files in, say, a Linux pipeline?

Sorry for the number of questions, but I've searched long and hard for answers and nothing has helped me yet

Thanks

geo datasets sra • 638 views

score 0 · Answer 1 · 2023-11-19

5 months ago

dsull ★ 6.0k

1. You'll have technical artifacts across the 4 experiments. If each experiment contained a mix of control and tumor samples, you could regress out the technical artifact. However, since your controls come from different experiments than your tumors, a tumor vs. normal comparison will be "swayed": you can't confidently say whether a differentially expressed gene is due to something technical (experiment A vs. experiment B) or to biology (tumor vs. normal).
2. You can honestly use very few samples, e.g. 3 samples in each condition. Algorithms such as DESeq2 were designed specifically for such cases. Just use whatever is available; it's not really under your control anyway.
3. That example is microarray data. Microarray data isn't deposited on the SRA; what you see are the signal intensity values. In this case, use the signal intensities with a tool designed specifically for such data (e.g. limma). In other cases (like the TCGA dataset), no SRA or FASTQ files are publicly available because of patient privacy. There you have no choice but to use whatever count values you're given in your analysis (whether that analysis is DESeq2, limma-voom, linear regression, WGCNA, NMF, etc.). Edit: Or, as genomax has pointed out, you could apply for access via dbGaP. ;)
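
To make the confounding problem from the first point of this answer concrete, here is a minimal Python sketch (not part of DESeq2 or any other package) that checks a sample sheet before running any differential expression. The function names and the experiment labels `exp1` to `exp4` are made up for illustration. It assumes "fully confounded" means that no experiment contains more than one condition, in which case a batch term cannot be separated from the tumor vs. control effect.

```python
from collections import defaultdict


def conditions_per_batch(conditions, batches):
    """Map each batch (experiment) to the set of conditions it contains."""
    seen = defaultdict(set)
    for cond, batch in zip(conditions, batches):
        seen[batch].add(cond)
    return dict(seen)


def is_fully_confounded(conditions, batches):
    """True when every batch holds a single condition.

    In that case condition is a function of batch, so a batch effect
    cannot be estimated separately from the condition effect.
    """
    per_batch = conditions_per_batch(conditions, batches)
    return all(len(conds) == 1 for conds in per_batch.values())


# Layout described in the question: each experiment holds one condition only.
# (exp1..exp4 are hypothetical labels.)
conditions_q = ["control"] * 10 + ["tumor"] * 10
batches_q = ["exp1"] * 5 + ["exp2"] * 5 + ["exp3"] * 5 + ["exp4"] * 5

# Alternative layout: each experiment contains both controls and tumors.
conditions_m = (["control"] * 5 + ["tumor"] * 5) * 2
batches_m = ["exp1"] * 10 + ["exp2"] * 10

print(is_fully_confounded(conditions_q, batches_q))  # True
print(is_fully_confounded(conditions_m, batches_m))  # False
```

Only the second layout leaves room to include the experiment as a covariate and still estimate tumor vs. control.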

THIS MAKES SOOO MUCH SENSE but I have some questions about #3.

In response to #3, is there any way to get FASTQ files from GEO (not SRA)? If there is, what parameters would I use to narrow the search down? I'd rather not apply for access on dbGaP and would prefer to use public datasets. Lastly, is it even possible to use the .txt files for RNA-seq, or are they not in the proper format?

Reply • 5 months ago by SHXVRR ▴ 20

You'd have to get FASTQ files from SRA. One useful tool to do so is https://sra-explorer.info/ : just type in the GEO or SRA ID, check the samples you want to download and add them to the cart, then go to your cart and the download links will be right there :) Probably the easiest way to get FASTQ files from the SRA.
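
For anyone who would rather script this step than click through a web page, a similar list of links can be built in Python. This is a different route from sra-explorer, not a description of how that site works. The sketch assumes the runs are mirrored at ENA and uses ENA's `filereport` endpoint with an SRA study or run accession (not a GSE ID). The accession `SRP000000`, the run `SRR0000001`, and the `example.org` paths are placeholders, not real data.

```python
import csv
import io
from urllib.parse import urlencode

# ENA Portal API file report endpoint (assumed reachable; not called here).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the query URL that lists FASTQ paths for a study or run accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def parse_fastq_links(tsv_text):
    """Turn the TSV report into {run_accession: [ftp links]}.

    The fastq_ftp column holds ';'-separated paths without a scheme,
    so 'ftp://' is prepended to each one.
    """
    links = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        links[row["run_accession"]] = ["ftp://" + p for p in paths]
    return links


# Placeholder report text in the expected two-column layout (made-up values).
sample_report = (
    "run_accession\tfastq_ftp\n"
    "SRR0000001\texample.org/SRR0000001_1.fastq.gz;example.org/SRR0000001_2.fastq.gz\n"
)

print(filereport_url("SRP000000"))
print(parse_fastq_links(sample_report))
```

In real use, the text passed to `parse_fastq_links` would come from fetching `filereport_url(...)`, for example with `urllib.request.urlopen`, and each resulting link could then be downloaded with `curl` or `wget`.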

As for the .txt files, the reason I wouldn't use them is because different people use different tools to process RNA-seq (different aligners, different counting algorithms, different genome reference versions, different normalization methods, etc.). If you're analyzing a hundred RNA-seq datasets all of them should be processed the same way at the very least, which is why I prefer downloading the FASTQ files directly and running them through my pipeline. You don't want to introduce a "technical artifact" from different people processing their datasets differently.

Edit: That's why, even though the TCGA fastq files aren't available, they made sure they used a uniform processing pipeline to analyze the thousands of RNAseq samples that they have. Hence, I felt pretty comfortable using the ".txt" counts from TCGA data in my research. My advice is: If you can start from the FASTQ files, then please do so. If it's impossible (because of data access restrictions), you have no choice but to use the ".txt" counts (and hope that they were processed correctly).

Reply • 5 months ago by dsull ★ 6.0k

OMG. https://sra-explorer.info/ solves all my problems. Previously, I was just manually finding them in the SRA section of ncbi. Thank you so much!!!!

Reply • 5 months ago by SHXVRR ▴ 20

score 0 · Answer 2 · 2023-11-19

Is it fine to combine datasets?

While anything can be done, the question is whether it would be logical to do and whether such an analysis would produce logical/usable results.

will it produce swayed results based on how the files were made?

More than likely. You should do your own due diligence, but if the datasets used different kits/methods or were generated a few years apart, then there will be biases.

Is it more the merrier for better results or what?

See the discussion in this thread: "Am I crazy, or are most published RNA-seq studies vastly underpowered?"

I find it hard to find datasets if it doesn't contain a SRA link and GSE58911

This is not an RNA-seq dataset. It looks like these data came from Affymetrix gene chips, i.e. microarrays.

In general, you are not going to find patient FASTQ data in the publicly accessible part of SRA. You will need to apply for access to dbGaP, where access-controlled data resides. This is done for patient privacy reasons.