Question

Mapping expression profiles from a microarray control set to a collection of publicly available RNA-seq expression profiles of disease sets


7.7 years ago

becton ▴ 10

Hello, I'm really lost in the sense that there is no direct guide, but I knew that when I signed up for grad school ☺. My idea is very clear; I'm asking here to get a sense of the scope and challenges involved and to evaluate the work that needs to be done. I'm hoping that, with your help, I will be able to sketch a diagram of the major steps so I can start working out the details.

My initial data set is a 2-color microarray control set (250 subjects). Samples were taken from a specific tissue location, and the arrays cover thousands of genes; I have a good age range (5-75).

My condition of interest is not very well understood, but there are both microarray and RNA-seq datasets available online. The advantage of my initial data set is that it comes from a rare and known tissue source.

My list of genes of interest includes 15-25 possible candidate genes that I selected from published meta-analyses and reviews; I want to investigate this condition starting from those genes. Specifically, I want to map the expression values across the life span [from healthy into the same life stages in the disease state and, if possible, specific progression states of the disease].

1- To what extent can I utilize this initial dataset? I'm not exploring overall differential expression or general analysis; I'm interested in correlation analysis of these genes and in enrichment analysis to see which pathways are involved. This is an introductory analysis; any other ideas about appropriate analyses?

2- I guess in the second phase of this research I have to do a meta-analysis of RNA-seq data? This involves combining controls and patients from different experimental designs and separating patient samples into my main age groups [1-5, 5-15, 20-49, 50+]. I'm hoping to get a good number in each group, but I haven't given thought to how many samples I should have in each one. Can you refer me to a good guide on combining RNA-seq data from different datasets? Any advice about major issues that I should be aware of? Is there a way to deal with missing values across different datasets, or should I consider fixing them individually? I'm really lost here :)

2- The expression value of a 2-color microarray is an average of the two dye signals with the background noise subtracted (it took me a while to figure out how to appropriately clean, normalize, and then convert these values into absolute numbers to get the expression matrix ready for analysis in R), while RNA-seq data represent expression values as read counts, which are discrete values. How do I deal with this? [Mapping light intensity values to read counts? Is it even possible? Can it be meaningful?]

3- Some of these RNA-seq studies don't contain disease progression as a phenotype/variable; some record only the presence or absence of a few important symptoms. Any advice regarding this?
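To make the bookkeeping in point 2 concrete, here is a minimal sketch of assigning samples to the age groups listed above and finding the genes that every dataset reports. It uses the bin edges exactly as written in the question; treating each bin as lower-inclusive and upper-exclusive (so age 5 goes to "5-15") is an assumption, and ages 15-19 fall into no bin as the groups are currently defined. The function names and the dictionary layout are hypothetical.

```python
from collections import Counter

# Age bins as listed in the question: (lower inclusive, upper exclusive, label).
# Assumption: boundary ages go to the older bin; 15-19 is not covered.
AGE_BINS = [
    (1, 5, "1-5"),
    (5, 15, "5-15"),
    (20, 50, "20-49"),
    (50, float("inf"), "50+"),
]


def age_group(age):
    """Return the label of the bin containing `age`, or 'unassigned'."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    return "unassigned"


def group_sizes(ages):
    """Count how many samples fall into each age group."""
    return Counter(age_group(age) for age in ages)


def shared_genes(datasets):
    """Genes present in every dataset.

    `datasets` maps a study name to a dict of gene -> expression values
    (a hypothetical layout). Genes missing from any study are dropped
    rather than imputed.
    """
    gene_sets = [set(genes) for genes in datasets.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))
```

Counting samples per group before any analysis shows quickly whether a group is too small to be useful, and restricting to shared genes is one simple way to sidestep genes that are missing from some datasets.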

I'm a relative beginner at analyzing high-throughput data, so please don't assume I know every term. I have never worked on multiple datasets before, so before I start learning how to handle RNA-seq data, I want to know whether my plan is feasible and will constitute nice, solid graduate work.

RNA-Seq R next-gen genome • 2.1k views

updated 7.6 years ago by Kevin Blighe 89k • written 7.7 years ago by becton ▴ 10

score 1 · Answer 1 · 2017-12-26

Your question was bumped back to the top of the 'Open' questions list by the Biostars bot, so I thought it best to give a response.

As I understand, you have your own microarray data and you want to compare your results to those of published studies that have used RNA-seq data. I can only assume that your tutor/mentor is not knowledgeable in bioinformatics, nor does your department have any bioinformatics service (?).

There are some 'holes' in your overall methodological approach if you are only interested in a few dozen genes (you indicate "15-25"). If you are only interested in this number of genes, why did you not aim to do high-throughput real-time qPCR (like Fluidigm), NanoString, or even CyTOF? These are targeted approaches for when you already have a set of candidate markers/genes. Was a microarray the cheaper option? You will nevertheless have to at least normalise your microarray data in order to produce normalised expression values. There are plenty of tutorials online about how to process 2-colour microarrays.
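As a rough illustration of what normalised values from a 2-colour array look like, the sketch below computes the standard per-spot log-ratio M = log2(R/G) and average log-intensity A = 0.5 * log2(R * G) after background subtraction, then median-centres the log-ratios of one array. The simple subtraction with a floor of 1 and the median centring are simplifying assumptions for illustration; dedicated microarray packages use more careful background correction and intensity-dependent normalisation. All function names are hypothetical.

```python
import math
import statistics


def background_correct(foreground, background, floor=1.0):
    """Subtract background; floor the result so the log stays defined.

    Assumption: plain subtraction with a floor, chosen for simplicity.
    """
    return max(foreground - background, floor)


def ma_values(red_fg, red_bg, green_fg, green_bg):
    """Return (M, A) for one spot of a two-colour array."""
    r = background_correct(red_fg, red_bg)
    g = background_correct(green_fg, green_bg)
    m = math.log2(r) - math.log2(g)           # log-ratio of the two dyes
    a = 0.5 * (math.log2(r) + math.log2(g))   # average log-intensity
    return m, a


def median_center(m_values):
    """Shift the log-ratios of one array so that their median is zero."""
    med = statistics.median(m_values)
    return [m - med for m in m_values]
```

Note that M is a relative measure (one channel against the other), which is part of why these values cannot be mapped directly onto RNA-seq read counts.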

I would not merge the microarray data with the RNA-seq data unless you are really knowledgeable in how both data types are produced and normalised, and are also knowledgeable in statistical methodology, in particular data distributions. The published RNA-seq datasets should be treated as a meta-analysis in their own right.
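One way to picture "meta-analysis rather than merging" is sketched below: each RNA-seq study is transformed and analysed on its own, and only the per-study summary for a gene is combined. The example converts counts to log2 counts per million, computes a disease-minus-control difference with its variance within each study, and pools the studies with inverse-variance (fixed-effect) weights. The prior count of 0.5, the choice of a fixed-effect model (which assumes all studies estimate the same underlying effect; a random-effects model is often preferred for heterogeneous studies), and all function names are assumptions made for illustration.

```python
import math
import statistics


def log2_cpm(counts, library_size, prior=0.5):
    """Convert one sample's raw counts to log2 counts per million.

    `prior` is a small pseudo-count (an assumption) to avoid log2(0).
    """
    return [math.log2((c + prior) * 1e6 / library_size) for c in counts]


def study_effect(disease, control):
    """Difference in mean log2 expression for one gene in one study.

    Returns (effect, variance of the effect). Each group needs at
    least two samples for the variance to be defined.
    """
    effect = statistics.mean(disease) - statistics.mean(control)
    variance = (statistics.variance(disease) / len(disease)
                + statistics.variance(control) / len(control))
    return effect, variance


def fixed_effect(effects):
    """Inverse-variance weighted pooling of (effect, variance) pairs."""
    weights = [1.0 / var for _, var in effects]
    total = sum(weights)
    pooled = sum(w * eff for w, (eff, _) in zip(weights, effects)) / total
    return pooled, 1.0 / total
```

Because each study is only ever compared with its own controls, differences in platform, protocol, and library size between studies never enter the comparison directly.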

3- Some of these RNA-seq studies don't contain disease progression as a phenotype/variable; some record only the presence or absence of a few important symptoms. Any advice regarding this?

It is difficult to elaborate on this because you have not stated what you are researching.

Kevin