I have never worked with microRNA data before and would like to know whether there is anything wrong with the methodology I've come up with for analyzing it:
1. Download Level_3 TCGA miRNA data for all samples in the cancer type of interest. (The data consist of a two-column file for each sample: the first column lists miRNA names, the second lists normalized expression values.)
2. Use clinical data to define sub-populations based on Patient Barcode.
3. Within each sub-population of interest, compute the mean and median for each miRNA across samples, then compute the standard deviation and variance for both the mean and the median.
4. Use the descriptive statistics from step 3 to determine which miRNAs vary the least within the sub-population of interest, then compare their mean/median against those of the other sub-populations of interest. The rationale for choosing the least-variable miRNAs is that if they differ between sub-populations, the difference is more likely to be significant. Whether the mean or the median is used for the comparison depends on the prevalence and magnitude of outliers for a given miRNA within a sub-population.
5. If any miRNAs turn out to be differentially expressed between sub-populations, calculate error bars for them.
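To make steps 3 and 4 concrete, here is a minimal Python sketch of what I have in mind, using only the standard library. The function names (`summarize`, `least_variable`), the sample labels, and the expression values are all made up for illustration; they are not real TCGA data. I am also assuming that "changes the least" means the smallest sample standard deviation across samples, which may not be the best choice.

```python
from statistics import mean, median, stdev


def summarize(samples):
    """Per-miRNA descriptive statistics across the samples of one sub-population.

    `samples` maps a sample label to {miRNA name: normalized expression}.
    Only miRNAs present in every sample are summarized.
    """
    shared = sorted(set.intersection(*(set(s) for s in samples.values())))
    stats = {}
    for name in shared:
        values = [s[name] for s in samples.values()]
        stats[name] = {
            "mean": mean(values),
            "median": median(values),
            "sd": stdev(values),  # sample standard deviation; needs >= 2 samples
        }
    return stats


def least_variable(stats, n):
    """Names of the n miRNAs with the smallest standard deviation (assumed criterion)."""
    return sorted(stats, key=lambda name: stats[name]["sd"])[:n]


# Hypothetical toy values for two sub-populations (not real TCGA data)
group_a = {
    "sample-01": {"hsa-mir-21": 10.0, "hsa-mir-155": 5.0},
    "sample-02": {"hsa-mir-21": 12.0, "hsa-mir-155": 5.5},
    "sample-03": {"hsa-mir-21": 11.0, "hsa-mir-155": 9.0},
}
group_b = {
    "sample-04": {"hsa-mir-21": 20.0, "hsa-mir-155": 6.0},
    "sample-05": {"hsa-mir-21": 21.0, "hsa-mir-155": 7.0},
    "sample-06": {"hsa-mir-21": 22.0, "hsa-mir-155": 8.0},
}

stats_a = summarize(group_a)
stats_b = summarize(group_b)

# Step 4: take the most stable miRNAs in group A and compare means with group B
for name in least_variable(stats_a, 1):
    difference = stats_b[name]["mean"] - stats_a[name]["mean"]
    print(name, "mean difference (B - A):", difference)
```

Swapping `"mean"` for `"median"` in the last loop gives the median-based comparison for miRNAs with heavy outliers.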
If there is anything glaringly wrong or problematic with this approach, please let me know. I'd also be happy to hear about anything more subtle.