I have never worked with microRNA (miRNA) data before and want to know whether there is anything wrong with the methodology I've come up with for analyzing it:
- Download Level_3 TCGA miRNA data for all samples in the cancer type of interest (this data consists of a two-column file for each sample; one column lists miRNA names, the other lists normalized expression values)
- Use clinical data to define sub-populations based on Patient Barcode.
- Within each sub-population of interest, compute the mean and median for each miRNA across samples, then compute the standard deviation and variance for both mean and median.
- Use the descriptive statistics from step 3 to determine which miRNAs vary the least within the sub-population of interest, then compare their mean/median against those of other sub-populations of interest. The rationale for choosing the least-variable miRNAs is that if they differ from other sub-populations when compared, the difference is more likely to be significant. Whether to compare means or medians is decided by the prevalence and magnitude of outliers for a given miRNA within a sub-population.
- If any miRNAs turn out to be differentially expressed between sub-populations, calculate error bars for them
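
To make the steps above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the barcodes, group labels, and expression values are made up, the function names `summarize` and `least_variable` are hypothetical, and I am reading "standard deviation" as the spread of each miRNA across the samples of a sub-population. The standard error of the mean stands in for the error bars, which is only one possible choice.

```python
from math import sqrt
from statistics import mean, median, stdev

# Made-up data: expression[barcode][mirna] = normalized expression value.
expression = {
    "TCGA-XX-0001": {"hsa-mir-21": 10.0, "hsa-mir-155": 5.0},
    "TCGA-XX-0002": {"hsa-mir-21": 12.0, "hsa-mir-155": 5.2},
    "TCGA-XX-0003": {"hsa-mir-21": 11.0, "hsa-mir-155": 4.8},
    "TCGA-YY-0001": {"hsa-mir-21": 20.0, "hsa-mir-155": 5.1},
    "TCGA-YY-0002": {"hsa-mir-21": 22.0, "hsa-mir-155": 4.9},
    "TCGA-YY-0003": {"hsa-mir-21": 21.0, "hsa-mir-155": 5.0},
}

# Made-up clinical grouping: barcode -> sub-population label.
groups = {
    "TCGA-XX-0001": "A", "TCGA-XX-0002": "A", "TCGA-XX-0003": "A",
    "TCGA-YY-0001": "B", "TCGA-YY-0002": "B", "TCGA-YY-0003": "B",
}


def summarize(expression, groups):
    """Return stats[group][mirna] with mean, median, sd, sem and n."""
    # Collect the per-sample values of each miRNA within each sub-population.
    values = {}
    for barcode, profile in expression.items():
        group = groups[barcode]
        for mirna, value in profile.items():
            values.setdefault(group, {}).setdefault(mirna, []).append(value)

    stats = {}
    for group, per_mirna in values.items():
        stats[group] = {}
        for mirna, v in per_mirna.items():
            sd = stdev(v)  # sample standard deviation; needs at least 2 samples
            stats[group][mirna] = {
                "mean": mean(v),
                "median": median(v),
                "sd": sd,
                "sem": sd / sqrt(len(v)),  # standard error of the mean
                "n": len(v),
            }
    return stats


def least_variable(stats, group, top=1):
    """Return the `top` miRNAs with the smallest sd within one sub-population."""
    ranked = sorted(stats[group], key=lambda m: stats[group][m]["sd"])
    return ranked[:top]


stats = summarize(expression, groups)
stable = least_variable(stats, "A")

# Compare the group means of every miRNA between the two sub-populations.
mean_diff = {
    mirna: stats["B"][mirna]["mean"] - stats["A"][mirna]["mean"]
    for mirna in stats["A"]
}
```

With real data, `expression` would be filled from the per-sample two-column files and `groups` from the clinical table; the comparison itself would still need a proper statistical test rather than a raw difference of means.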
If there is anything glaringly wrong or problematic with this approach, please let me know. I'd also be happy to hear about anything more subtle.