My work involves downloading RNA-seq data from NCBI-SRA and analyzing it to find differentially expressed (DE) genes. In such a case, is it advisable to select data from different sequencers? For example, data sequenced on Illumina HiSeq 1500, 2000 and 2500.
Also, what if the data come from the same sequencer but different library preparation methods?
I was wondering if we could pre-process, align and count each data separately and then go for DE analysis.
It depends on the question your study is asking. If it is a data-driven study that tries to account for sequencing batches, then Approach 1 is better suited. If it is more in line with a biological hypothesis, Approach 2 is the better choice when Approach 1, even after correction, does not yield a meaningful answer to the biological question you are trying to address. Here are a few suggestions.
Approach 1: If you want to interrogate a specific study that has different
layers of data coming from different machines, sequenced by different operators with different library preparations, you risk batch effects. If you have a priori information about the batches in this data, you can model around them using ComBat; if not, you will need something like SVA or RUVSeq.
To do this you will need to download the raw fastq files from the
study in SRA that you are interested in. Quantify all the samples
together with the aligner or mapper of your choice, providing the
correct libType information (for example with Salmon/Kallisto).
Prepare your meta-information files with information about tissue
types, operators, batch info and libType. Once you have the total
count table for all your data, you can normalize the counts to logCPM
and make a PCA biplot or MDS plot to see whether the samples separate
by your biological groups or by batch. If batch dominates, you will
have to either correct for it or include the batch variables as
covariates in your DE analysis. This is possible, but keep in mind
that if your batch effects and libType are very strong confounders,
the correction will not work well and there is a risk of
overfitting.
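As a rough illustration of the exploratory step described above, here is a minimal Python sketch that converts a count table to logCPM and computes PCA scores. It is only a sketch under assumptions: the counts are simulated toy data, the function names `log_cpm` and `pca_scores` are made up for this example, and the prior count is a simplification rather than what edgeR or limma do internally. In practice most people would do this in R with edgeR/limma.

```python
import numpy as np

def log_cpm(counts, prior_count=0.5):
    """Convert a genes x samples count matrix to log2 counts per million.

    The prior count avoids log(0). This is a simplified prior, assumed
    for illustration; it is not identical to edgeR's cpm(log=TRUE).
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)
    cpm = (counts + prior_count) / (lib_sizes + 2 * prior_count) * 1e6
    return np.log2(cpm)

def pca_scores(log_expr, n_components=2):
    """Return sample scores and the fraction of variance per component."""
    x = log_expr.T                      # samples as rows
    x = x - x.mean(axis=0)              # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_explained = (s ** 2) / np.sum(s ** 2)
    return scores, var_explained[:n_components]

# Toy data (hypothetical): 200 genes, 6 samples, 2 batches.
rng = np.random.default_rng(0)
n_genes = 200
base = rng.gamma(shape=2.0, scale=50.0, size=n_genes)
batch = np.array([0, 0, 0, 1, 1, 1])
condition = np.array([0, 1, 0, 1, 0, 1])

# Simulate a strong batch effect: half of the genes are 4x higher in batch 1.
batch_factor = np.ones((n_genes, 6))
batch_factor[: n_genes // 2, batch == 1] = 4.0
counts = rng.poisson(base[:, None] * batch_factor)

logcpm = log_cpm(counts)
scores, var_explained = pca_scores(logcpm)

for i in range(6):
    print(f"sample {i}: batch={batch[i]} condition={condition[i]} "
          f"PC1={scores[i, 0]:.2f} PC2={scores[i, 1]:.2f}")
```

If the samples split along PC1 by batch rather than by condition, as they are simulated to do here, that is the signal that batch needs to be corrected for or modelled as a covariate.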
Approach 2: Alternatively, one can perform the DE analysis separately for each lab or study (provided each study has enough samples for DE analysis), then compare the DEGs that are in common and use them to reason about the biological question you want to address. Keep in mind that the overlap may be low.
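The comparison of DEG lists across studies could look something like the Python sketch below. The study names and gene IDs are placeholders invented for the example, and the helper `jaccard` is just one simple way to quantify overlap. In a real comparison you would probably also want to check that the direction of change agrees between studies.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard index of two gene sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical DEG lists, one per study, each from its own DE analysis.
deg_by_study = {
    "study_A": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "study_B": {"GENE2", "GENE3", "GENE5"},
    "study_C": {"GENE2", "GENE3", "GENE4", "GENE6"},
}

# Genes called DE in every study.
common = set.intersection(*deg_by_study.values())

# Genes called DE in at least two studies (a less strict criterion).
hits = Counter(g for degs in deg_by_study.values() for g in degs)
recurrent = sorted(g for g, n in hits.items() if n >= 2)

print("common to all:", sorted(common))
print("in >= 2 studies:", recurrent)
print("Jaccard A vs B:", jaccard(deg_by_study["study_A"], deg_by_study["study_B"]))
```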
It is a very broad question. For now I can suggest these 2 approaches, but unless you interrogate the data and perform a preliminary exploratory analysis, it is difficult to say more. If the data are very homogeneous and batch effects do not mask the real biological differences, Approach 1 should work well for testing a meaningful hypothesis, and for that matter so should Approach 2.
If you mean you can compare group A with library prep 1 on HiSeq 1500 versus group B with library prep 2 on HiSeq 2000: no, the technical variability between sequencers (and definitely between kits) is too big. Better to keep everything the same and only compare within-run/within-experiment.
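One way to see why that comparison cannot work: when every sample of a group sits in its own batch, the group and batch columns of the design matrix are identical, so no model can separate the two effects. Below is a small Python sketch of that check. The helper names `design_matrix` and `is_confounded` are made up for this example, and the sample labels are hypothetical.

```python
import numpy as np

def design_matrix(group, batch):
    """Intercept plus indicator columns for the non-reference levels."""
    cols = [np.ones(len(group))]
    for labels in (group, batch):
        labels = np.asarray(labels)
        levels = sorted(set(labels.tolist()))
        for level in levels[1:]:        # first level is the reference
            cols.append((labels == level).astype(float))
    return np.column_stack(cols)

def is_confounded(group, batch):
    """True if group and batch effects cannot be estimated separately."""
    x = design_matrix(group, batch)
    return bool(np.linalg.matrix_rank(x) < x.shape[1])

platform = ["HiSeq1500"] * 3 + ["HiSeq2000"] * 3

# Group A entirely on one platform, group B entirely on the other.
confounded = is_confounded(["A", "A", "A", "B", "B", "B"], platform)

# Both groups present on both platforms.
crossed = is_confounded(["A", "B", "A", "B", "A", "B"], platform)

print("A vs B split by platform:", confounded)
print("A and B on both platforms:", crossed)
```

In the first layout no batch correction method can rescue the comparison; in the second, batch can at least be included as a covariate.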