I have two datasets:
- A commercial targeted RNASeq (HTG EdgeSeq) dataset with ~1.5k coding genes. There is a single probe sequence (75 bp long) per RNA transcript. It contains the samples considered "treated".
- A public (whole) RNASeq with > 60k genes. It contains samples considered as "control".
I'd like to perform differential expression tests between them, but there are obviously several issues I'd have to deal with. I already have raw counts for each dataset, so these would be the starting point.
I thought I could do the following:
1. Subset the "big" dataset, selecting only the genes present in the targeted RNASeq.
2. Use the EDASeq package to correct for length effects with the function `withinLaneNormalization`, for each dataset independently. I assume this would normalize counts by length, after setting the gene lengths differently in each dataset (the whole RNASeq would use the real gene lengths, whereas the targeted dataset would use 75 bp as the gene length).
3. Create a `DGEList` object for each dataset from the data generated with EDASeq.
4. Combine the two `DGEList` objects into a single one.
5. Perform differential expression analysis with edgeR as usual.
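To make the first steps concrete, here is a minimal toy sketch in plain Python. It is not the EDASeq or edgeR API: the gene names, counts, and lengths are made up, and the per-kilobase scaling is only a stand-in assumption for a length correction, not a reimplementation of `withinLaneNormalization`. It illustrates one thing I am unsure about: with a constant 75 bp length, the scaling multiplies every gene in the targeted dataset by the same factor.

```python
# Made-up raw counts for one sample per dataset (hypothetical values).
targeted = {"TP53": 300, "EGFR": 150, "MYC": 50}                   # "treated"
whole = {"TP53": 1200, "EGFR": 400, "MYC": 400, "ACTB": 8000}      # "control"
whole_lengths = {"TP53": 2500, "EGFR": 5000, "MYC": 2000, "ACTB": 1800}
PROBE_LENGTH = 75  # one 75 bp probe per transcript in the targeted panel


def subset_to_panel(counts, panel_genes):
    """Keep only the genes that are present in the targeted panel."""
    return {g: counts[g] for g in panel_genes if g in counts}


def per_kb(counts, lengths):
    """Stand-in length correction: counts per kilobase of 'gene length'."""
    return {g: c * 1000 / lengths[g] for g, c in counts.items()}


# Subset the whole-transcriptome data to the shared genes.
shared = sorted(set(targeted) & set(whole))
whole_sub = subset_to_panel(whole, shared)

# Length-scale each dataset independently with its own lengths.
targeted_kb = per_kb(targeted, {g: PROBE_LENGTH for g in targeted})
whole_kb = per_kb(whole_sub, whole_lengths)

# Combine both datasets into one gene-by-group table.
combined = {g: {"treated": targeted_kb[g], "control": whole_kb[g]} for g in shared}

# With a constant probe length, every targeted gene is scaled by 1000 / 75.
factors = {g: targeted_kb[g] / targeted[g] for g in targeted}
```

In this toy case `factors` holds the same value for every gene, so the correction changes nothing between genes within the targeted dataset, while the whole RNASeq values are rescaled gene by gene.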
I have doubts about the validity of this procedure at several steps. For example, at step 1), wouldn't subsetting dramatically change the distribution of the data and affect the library sizes? At steps 3) and 4), is it possible to combine DGEList objects that received different gene-length corrections through EDASeq? And at step 5), would the results be trustworthy?
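To illustrate the library-size concern at step 1), here is a small plain-Python sketch with made-up counts (the gene names and numbers are hypothetical). It only shows the arithmetic: the same raw count gives a very different counts-per-million value depending on whether the library size is computed before or after subsetting to the panel genes.

```python
# Made-up raw counts for one whole-transcriptome sample (hypothetical values).
whole = {"TP53": 1200, "EGFR": 400, "MYC": 400, "ACTB": 8000}
panel = ["TP53", "EGFR", "MYC"]  # genes also present in the targeted panel


def cpm(count, lib_size):
    """Counts per million for a single gene given a library size."""
    return count * 1_000_000 / lib_size


# Library size from all genes vs. only from the panel genes.
full_lib = sum(whole.values())
sub_lib = sum(whole[g] for g in panel)

# Same raw count for TP53, two different library sizes.
cpm_full = cpm(whole["TP53"], full_lib)
cpm_sub = cpm(whole["TP53"], sub_lib)

print(full_lib, sub_lib)   # library size before and after subsetting
print(cpm_full, cpm_sub)   # TP53 CPM under each library size
```

Here a single highly expressed gene outside the panel (ACTB in this toy example) accounts for most of the original library, so dropping it shrinks the library size fivefold and inflates the CPM of the remaining genes by the same factor.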
I know, of course, that mixing different sequencing technologies won't yield the best results, but this is the data I have at the moment.
(NOT A DUPLICATE OF Combining two rnaseq platforms in one)