Hello,
I have three questions about RNA-seq and datasets:
Is it fine to combine datasets? Suppose I am doing a project comparing control tongue epithelial tissue vs. tumor tongue epithelial tissue with a DESeq2 analysis. I have 5 control SRA files from one experiment and 5 control SRA files from another. Then I have 5 tumor SRA files from another experiment and 5 tumor SRA files from yet another. Is that fine since it is 10 controls vs. 10 tumors, or will it produce skewed results depending on how the files were made?
My second question: what is the recommended number of files to work with for RNA-seq? I have heard that 10 control vs. 10 tumor, or 30 files in total, is ideal, but what is actually recommended, given that finding a dataset can be hard? I have also seen people working on GEO datasets with 200 files or more. Is it a case of the more the merrier for better results, or what?
This question kinda doesn't relate to the first two, but there are MANY GEO datasets without SRA. I find it hard to find usable datasets when they don't contain an SRA link. An example is GSE58911, which is perfect for what I'm looking for but does not have FASTQ files, which are pretty much necessary for a typical RNA-seq pipeline. Am I doing something wrong, or is there a way to use .txt files in, say, a Linux pipeline?
Sorry for the number of questions, but I've searched long and hard for answers and nothing has helped me yet.
Thanks
THIS MAKES SOOO MUCH SENSE but I have some questions about #3.
In response to #3, is there any way to even get FASTQ files from GEO (not SRA)? If there is a way, what parameters would I use to narrow it down? I'd rather not apply for access on dbGaP and would prefer to use public datasets. Lastly, is it even possible to use the .txt files for RNA-seq, or are they not in the proper format?
You'd have to get the FASTQ files from SRA. One useful tool to do so is https://sra-explorer.info/ -- just type in the GEO or SRA ID, check the samples you want to download and add them to the cart, then go to your cart and the download links will be right there :) Probably the easiest way to get FASTQ files from the SRA.
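If you'd rather script it than click through a cart, here is a rough Python sketch. It assumes the ENA Portal API `filereport` endpoint and its `fastq_ftp` field, and that you query with an SRA study or run accession (SRP/SRR style) rather than the GSE ID. I'm not claiming this is what sra-explorer does internally, and the function names and the accession in the usage comment are made up for illustration.

```python
import csv
import io
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint: the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession):
    """Build a file report query URL for a study or run accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def parse_fastq_links(tsv_text):
    """Map each run accession to its list of FASTQ URLs.

    fastq_ftp holds paths without a protocol, separated by ';'
    (two entries for paired-end runs, none if no FASTQ was generated).
    """
    links = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        links[row["run_accession"]] = ["ftp://" + p for p in paths]
    return links


def fetch_fastq_links(accession):
    """Query the endpoint over the network and parse the response."""
    with urllib.request.urlopen(build_filereport_url(accession)) as resp:
        return parse_fastq_links(resp.read().decode("utf-8"))


# Usage (hypothetical accession, needs network access):
# for run, urls in fetch_fastq_links("SRP000000").items():
#     print(run, urls)
```

Runs that come back with an empty list have no FASTQ files on that mirror, which is a quick way to spot datasets you can't reprocess from raw reads.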
As for the .txt files, the reason I wouldn't use them is that different people use different tools to process RNA-seq data (different aligners, different counting algorithms, different genome reference versions, different normalization methods, etc.). If you're analyzing a hundred RNA-seq datasets, all of them should at the very least be processed the same way, which is why I prefer downloading the FASTQ files directly and running them through my own pipeline. You don't want to introduce a "technical artifact" from different people processing their datasets differently.
Edit: That's why, even though the TCGA FASTQ files aren't publicly available, they made sure they used a uniform processing pipeline to analyze the thousands of RNA-seq samples that they have. Hence, I felt pretty comfortable using the ".txt" counts from TCGA data in my research. My advice is: if you can start from the FASTQ files, then please do so. If that's impossible (because of data access restrictions), you have no choice but to use the ".txt" counts (and hope that they were processed correctly).
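Related to the "technical artifact" point and the first question: before pooling samples from several experiments, a quick sanity check is whether any single experiment contains both conditions. If none does, experiment and condition are completely confounded, and no model can tell a tumor effect from a lab effect. A minimal sketch in plain Python, with made-up sample and study labels that mirror the layout described in the question:

```python
from collections import defaultdict


def conditions_per_study(samples):
    """Group condition labels by study.

    samples: iterable of (sample_id, study, condition) tuples.
    """
    seen = defaultdict(set)
    for _sample_id, study, condition in samples:
        seen[study].add(condition)
    return dict(seen)


def is_fully_confounded(samples):
    """True when no study contains more than one condition."""
    return all(len(c) < 2 for c in conditions_per_study(samples).values())


# Hypothetical layout from the question: controls from studies A and B,
# tumors from studies C and D, five samples each.
pooled = (
    [(f"ctrl_A{i}", "A", "control") for i in range(5)]
    + [(f"ctrl_B{i}", "B", "control") for i in range(5)]
    + [(f"tum_C{i}", "C", "tumor") for i in range(5)]
    + [(f"tum_D{i}", "D", "tumor") for i in range(5)]
)
print(is_fully_confounded(pooled))  # True

# A study that sequenced both tissues breaks the confounding.
mixed = pooled + [("tum_A0", "A", "tumor")]
print(is_fully_confounded(mixed))  # False
```

When the check returns False, the study label can go into the model as a covariate; when it returns True, uniform reprocessing from FASTQ still removes pipeline differences but cannot remove differences in how the samples were collected and sequenced.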
OMG. https://sra-explorer.info/ solves all my problems. Previously, I was just finding them manually in the SRA section of NCBI. Thank you so much!!!!