My work involves downloading RNA-seq data from NCBI SRA and analysing it to find DE genes. In that case, is it advisable to combine data from different sequencers, for example Illumina HiSeq 1500, 2000 and 2500? And what about data from the same sequencer but prepared with different library preparation methods? I was wondering whether we could pre-process, align and count each dataset separately and then go on to the DE analysis.
It depends on the question your study is asking. If it is a data-driven study that tries to account for sequencing batches, then Approach 1 (analysing all samples together and modelling the batches, as in the bullets below) is better suited. If it is more in line with a specific biological hypothesis, Approach 2 (a separate DE analysis per study, described after the bullets) is the better choice when Approach 1, even after correction, does not yield a meaningful answer to the biological question you are trying to address. A few suggestions:
- If you want to interrogate a specific study whose data come from different machines, were sequenced by different operators and used different library preparations, you risk batch effects. If you have a priori information about the batches in these data, you can model around them using ComBat; if not, you will need something like SVA or RUVSeq to estimate the hidden batch structure.
- To do the above, you will need to download the raw FASTQ files from SRA for the study you are interested in. Quantify all the samples together with the aligner or mapper of your choice, supplying the correct library type (libType) for each sample, since tools such as Salmon and Kallisto need that information.
- Prepare your metadata files with information about tissue type, operator, batch and libType. Once you have the full count table for all your data, you can normalize the counts to logCPM and draw a PCA biplot or an MDS plot to see whether the samples separate mainly by your biological groups or by batch. If the batches dominate, you will have to correct for them, or include them as covariates in your DE analysis. This is feasible, but keep in mind that if batch and libType are very strong confounders, the correction will not work well and there is a risk of overfitting.
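
As a rough numerical stand-in for eyeballing the PCA/MDS plot, here is a minimal sketch in plain Python. It assumes a simple pseudocount form of logCPM (edgeR and limma use their own prior-count schemes), and the sample names, counts and labels are made-up toy values. It compares how close samples are to each other within a batch versus within a biological group:

```python
import math
from itertools import combinations


def log_cpm(counts):
    """Convert raw counts to log2 counts-per-million, per sample.

    counts: dict mapping sample name -> list of raw gene counts
            (same gene order in every sample).
    The 0.5 / 1.0 offsets are an assumed pseudocount to avoid log(0).
    """
    out = {}
    for sample, values in counts.items():
        lib_size = sum(values)
        out[sample] = [
            math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in values
        ]
    return out


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_within_distance(expr, labels):
    """Mean distance over all sample pairs that share the same label."""
    dists = [
        euclidean(expr[s], expr[t])
        for s, t in combinations(sorted(expr), 2)
        if labels[s] == labels[t]
    ]
    return sum(dists) / len(dists) if dists else float("nan")


# Toy data (hypothetical): 4 samples x 4 genes.
counts = {
    "s1": [100, 200, 50, 400],
    "s2": [110, 190, 55, 420],
    "s3": [300, 80, 150, 200],
    "s4": [320, 90, 140, 210],
}
tissue = {"s1": "liver", "s2": "kidney", "s3": "liver", "s4": "kidney"}
batch = {"s1": "HiSeq2000", "s2": "HiSeq2000",
         "s3": "HiSeq2500", "s4": "HiSeq2500"}

expr = log_cpm(counts)
d_batch = mean_within_distance(expr, batch)
d_tissue = mean_within_distance(expr, tissue)

# If samples sit closer to their batch mates than to samples of the same
# tissue, the batch is probably driving the structure in the data.
batch_dominates = d_batch < d_tissue
print(f"within-batch: {d_batch:.2f}  within-tissue: {d_tissue:.2f}")
print("batch dominates" if batch_dominates else "biology dominates")
```

On real data you would still look at the actual PCA or MDS plot; this only puts a number on the same idea.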
Alternatively, you can run the DE analysis separately for each lab or study (provided each study has enough samples for a DE analysis), then compare the DEGs they have in common and reason about your biological question from that shared set. Keep in mind that the overlap may be low.
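The comparison step of this second approach can be sketched as below. This is only an illustration: the gene names and log2 fold changes are invented, and requiring the same direction of change is my own assumption about what a meaningful "common DEG" is.

```python
def concordant_degs(lfc_a, lfc_b):
    """Compare two DEG lists given as dicts of gene -> log2 fold change.

    Returns the shared genes that change in the same direction in both
    studies, plus the Jaccard index of the two gene sets (shared / union).
    """
    shared = set(lfc_a) & set(lfc_b)
    union = set(lfc_a) | set(lfc_b)
    concordant = sorted(
        g for g in shared if (lfc_a[g] > 0) == (lfc_b[g] > 0)
    )
    jaccard = len(shared) / len(union) if union else 0.0
    return concordant, jaccard


# Hypothetical significant DEGs from two independently analysed studies.
study1 = {"GENE_A": 1.8, "GENE_B": -2.1, "GENE_C": 1.2, "GENE_D": 2.5}
study2 = {"GENE_B": -1.4, "GENE_D": -1.9, "GENE_E": 3.0}

concordant, jaccard = concordant_degs(study1, study2)
print("concordant DEGs:", concordant)
print("Jaccard index:", jaccard)
```

A low Jaccard index here does not by itself mean the biology differs; the studies may simply have different power or different thresholds.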
It is a very broad question. For now I can suggest these two approaches, but until you interrogate the data and run a preliminary exploratory analysis, it is difficult to say which one fits. If the data are very homogeneous and batch effects do not mask the real biological differences, Approach 1 should work well for a meaningful hypothesis, and so should Approach 2.
If you mean you can compare group A with library prep 1 on HiSeq 1500 versus group B with library prep 2 on HiSeq 2000: no, the technical variability between sequencers (and definitely between kits) is too big. Better to keep everything the same and only compare within-run/within-experiment.
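A quick way to spot this situation in your metadata is to check whether any batch contains more than one biological group. The sketch below is an assumption-laden illustration (sample names and labels are made up): if every batch holds a single group, group and batch are fully confounded and no correction can separate them.

```python
from collections import defaultdict


def is_fully_confounded(group, batch):
    """Return True if no batch contains more than one biological group.

    group, batch: dicts mapping sample name -> label.
    When this is True, the group effect cannot be separated from the
    batch (sequencer / library prep) effect.
    """
    groups_per_batch = defaultdict(set)
    for sample, b in batch.items():
        groups_per_batch[b].add(group[sample])
    return all(len(g) == 1 for g in groups_per_batch.values())


group = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

# Group A only on prep 1 / HiSeq 1500, group B only on prep 2 / HiSeq 2000.
batch_confounded = {"a1": "prep1_HiSeq1500", "a2": "prep1_HiSeq1500",
                    "b1": "prep2_HiSeq2000", "b2": "prep2_HiSeq2000"}

# Both groups present in both batches.
batch_crossed = {"a1": "prep1_HiSeq1500", "a2": "prep2_HiSeq2000",
                 "b1": "prep1_HiSeq1500", "b2": "prep2_HiSeq2000"}

print(is_fully_confounded(group, batch_confounded))
print(is_fully_confounded(group, batch_crossed))
```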