Dear Biostars experts,
I want to analyze gene expression data from 6 adult brains belonging to the Allen adult human brain dataset.
Each brain has up to 500 tissue samples with gene expression data per hemisphere, spread across multiple brain structures. For each brain I have a CSV file that looks like the following (each row gives the MNI-space coordinate and the brain structure of one tissue sample).


Each sample was taken at a slightly different position within each brain structure, and not all brain structures were sampled the same number of times. Within each brain, I have ordered the samples in each structure by their MNI coordinates. I have also counted the number of samples per structure, per brain, like so:









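For concreteness, the counting step could be sketched roughly like this in Python with pandas. This is only an illustration: the column name `structure` and the toy tables are assumptions made for the example, not the actual layout of the Allen CSV files.

```python
import pandas as pd


def count_samples(brains):
    """Build a structure x brain table of sample counts.

    `brains` maps a brain label to a DataFrame with one row per tissue
    sample and a (hypothetical) `structure` column.
    """
    counts = {name: df["structure"].value_counts() for name, df in brains.items()}
    # Structures that a brain never sampled show up as NaN; report them as 0.
    return pd.DataFrame(counts).fillna(0).astype(int)


# Made-up toy data standing in for the per-brain CSV files.
brains = {
    "brain_1": pd.DataFrame(
        {"structure": ["putamen, left"] * 3 + ["middle temporal gyrus"] * 3}
    ),
    "brain_2": pd.DataFrame(
        {"structure": ["putamen, left"] * 2 + ["middle temporal gyrus"] * 3}
    ),
    "brain_3": pd.DataFrame({"structure": ["putamen, left"] * 3}),
}

counts = count_samples(brains)
print(counts)
```

With real data, each DataFrame would come from reading one brain's CSV file instead of being built by hand.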
With this data I want to calculate a mean coordinate for each sample across brains. However, because the number of sample points per brain structure differs between brains, some rows have fewer data points than others.
One approach would be to take the intersection (i.e. in the example below every brain region was sampled at least 2 times in every brain, so generate 2 mean coordinates per region). The problem is that this might lose a lot of data points. Another approach would be to allow 1 (or a certain proportion of) missing data points across brains, so in this example generate 3 mean coordinates for the putamen and 2 mean coordinates for the middle temporal gyrus.
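To make the two options concrete, here is a minimal sketch in Python with pandas. It assumes a long table with hypothetical columns `brain`, `structure`, `mni_x`, `mni_y`, `mni_z`, and it assumes that "ordered by MNI coordinates" means sorting by x, then y, then z. The k-th sample of a structure in one brain is matched with the k-th sample in the other brains, and `min_brains` sets how many brains must contribute to a row: 6 gives the intersection, 5 allows one missing data point. The coordinates are made up and only reproduce the sample counts of the example table.

```python
import pandas as pd

COORDS = ["mni_x", "mni_y", "mni_z"]


def mean_coordinates(samples, min_brains):
    """Average the k-th ordered sample of each structure across brains.

    Rows supported by fewer than `min_brains` brains are dropped.
    """
    ordered = samples.sort_values(["brain", "structure"] + COORDS)
    # Position of each sample within its (brain, structure) group: 0, 1, 2, ...
    ordered = ordered.assign(slot=ordered.groupby(["brain", "structure"]).cumcount())
    grouped = ordered.groupby(["structure", "slot"])
    means = grouped[COORDS].mean()
    means["n_brains"] = grouped["brain"].nunique()
    return means[means["n_brains"] >= min_brains].reset_index()


# Toy data: number of samples per brain (brain_1 .. brain_6) for each structure.
n_samples = {
    "putamen, left": [3, 3, 3, 3, 3, 2],
    "middle temporal gyrus": [3, 3, 3, 3, 2, 2],
}
rows = []
for structure, per_brain in n_samples.items():
    for b, n in enumerate(per_brain, start=1):
        for i in range(n):
            rows.append(
                {
                    "brain": f"brain_{b}",
                    "structure": structure,
                    "mni_x": 10.0 * i + b,  # made-up coordinates
                    "mni_y": 0.0,
                    "mni_z": 0.0,
                }
            )
samples = pd.DataFrame(rows)

strict = mean_coordinates(samples, min_brains=6)   # intersection only
relaxed = mean_coordinates(samples, min_brains=5)  # allow one missing brain
print(strict)
print(relaxed)
```

On this toy data the strict version keeps 2 mean coordinates per structure, while the relaxed version keeps 3 for the putamen and 2 for the middle temporal gyrus. The `n_brains` column records how many brains contributed to each mean.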
My question is whether you think this is a reasonable strategy, and whether you have any alternative suggestions. I want my final list of coordinates to be as representative as possible of the underlying data. Thanks very much!
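| structure | brain_1 | brain_2 | brain_3 | brain_4 | brain_5 | brain_6 | mean coord |
|---|---|---|---|---|---|---|---|
| putamen, left | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | … |
| | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | … |
| | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | | |
| middle temporal gyrus | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | … |
| | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | x,y,z | … |
| | x,y,z | x,y,z | x,y,z | x,y,z | | | |
| ... | | | | | | | |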