Dear Biostar experts,
I want to analyze gene expression data from 6 adult brains belonging to the Allen adult human brain dataset.
Each brain has up to 500 tissue samples with gene expression data per hemisphere across multiple brain structures. For each brain I have a CSV file that looks like the following (each row gives the MNI space coordinates and the brain structure of one tissue sample).
structure_id  structure_name                               polygon_id  mni_x  mni_y  mni_z
9137          abducens nucleus, left                       978024      -5.1   -44.6  -42.7
9137          abducens nucleus, left                       977815      -3.9   -40    -37.9
4329          amygdalohippocampal transition zone, left    73464       -22.1  -7.6   -9.6
4329          amygdalohippocampal transition zone, left    73236       -19.1  -13.6  -11.8
4114          angular gyrus, left, inferior bank of gyrus  27442       -43.9  -76.3  27.4
Each sample was taken at a slightly different position within each brain structure, and not all brain structures were sampled the same number of times. I have ordered the samples within each structure of each brain by their MNI coordinates, and I have counted the number of samples per structure, per brain, like so:
                       brain_1  brain_2  brain_3  brain_4  brain_5  brain_6
putamen, left          3        3        3        3        3        2
middle temporal gyrus  3        3        3        3        2        2
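For concreteness, here is a minimal sketch of how a count table like the one above could be built with the Python standard library. It assumes comma-separated files with the structure name quoted (because the names contain commas). The `BRAIN_CSVS` dict and the helper names `count_samples` and `build_count_table` are hypothetical. The inline rows only reuse the example rows above as toy input, so the brain_2 content is a placeholder and not real data.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for the per-brain CSV files (toy rows, not real data).
BRAIN_CSVS = {
    "brain_1": '''structure_id,structure_name,polygon_id,mni_x,mni_y,mni_z
9137,"abducens nucleus, left",978024,-5.1,-44.6,-42.7
9137,"abducens nucleus, left",977815,-3.9,-40,-37.9
4329,"amygdalohippocampal transition zone, left",73464,-22.1,-7.6,-9.6
''',
    "brain_2": '''structure_id,structure_name,polygon_id,mni_x,mni_y,mni_z
9137,"abducens nucleus, left",978024,-5.1,-44.6,-42.7
''',
}


def count_samples(csv_text):
    """Count the tissue samples per structure name in one brain's CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["structure_name"] for row in reader)


def build_count_table(brain_csvs):
    """Return {structure: {brain: n_samples}}, with 0 for unsampled structures."""
    per_brain = {brain: count_samples(text) for brain, text in brain_csvs.items()}
    structures = sorted(set().union(*per_brain.values()))
    return {
        structure: {brain: per_brain[brain][structure] for brain in per_brain}
        for structure in structures
    }


table = build_count_table(BRAIN_CSVS)
for structure, counts in table.items():
    print(structure, counts)
```

With real files, the inline strings would be replaced by the contents of each brain's CSV file.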
With this data I want to calculate a mean coordinate for each sample across brains. However, because the number of sample points per brain structure differs between brains, some rows have fewer data points than others.
One approach would be to take the intersection (i.e. in the example below every brain region was sampled at least 2 times in every brain, so generate 2 mean coordinates per region). The problem is that this might discard a lot of data points. Another approach would be to allow 1 missing data point (or a certain proportion of missing data points) across brains, so in this example generate 3 mean coordinates for the putamen and 2 mean coordinates for the middle temporal gyrus.
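To make the two options concrete, here is a minimal sketch, assuming the samples of one structure are already ordered within each brain and that the k-th sample in one brain is matched to the k-th sample in the others (this rank-based matching is itself an assumption). The function `mean_coords`, the `min_brains` parameter and the `toy_structure` helper are hypothetical names, and the coordinates are made-up toy values that only reproduce the sample counts from the table above.

```python
from statistics import fmean


def mean_coords(per_brain, min_brains):
    """Average the k-th sample across brains for one structure.

    per_brain maps a brain name to its ordered list of (x, y, z) tuples.
    A mean is produced for rank k only if at least min_brains brains
    have a k-th sample; brains without one are simply left out.
    """
    max_rank = max((len(coords) for coords in per_brain.values()), default=0)
    means = []
    for rank in range(max_rank):
        present = [coords[rank] for coords in per_brain.values() if len(coords) > rank]
        if len(present) < min_brains:
            continue
        # zip(*present) groups all x values, all y values and all z values.
        means.append(tuple(fmean(axis) for axis in zip(*present)))
    return means


def toy_structure(counts):
    """Toy data: sample k of brain i gets the made-up coordinate (k, i, 0)."""
    return {
        f"brain_{i}": [(float(k), float(i), 0.0) for k in range(n)]
        for i, n in enumerate(counts, start=1)
    }


putamen = toy_structure([3, 3, 3, 3, 3, 2])
mtg = toy_structure([3, 3, 3, 3, 2, 2])

strict_putamen = mean_coords(putamen, min_brains=6)    # intersection only
tolerant_putamen = mean_coords(putamen, min_brains=5)  # allow 1 missing brain
tolerant_mtg = mean_coords(mtg, min_brains=5)

print(len(strict_putamen), len(tolerant_putamen), len(tolerant_mtg))
```

Setting `min_brains` to the number of brains corresponds to the intersection approach, and lowering it by one corresponds to allowing a single missing data point per row.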
My question is whether you think this is a reasonable strategy, and whether you have any alternative suggestions. I want my final list of coordinates to be as representative as possible of the underlying data. Thanks very much!
                       brain_1  brain_2  brain_3  brain_4  brain_5  brain_6  mean coord
putamen, left          x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z    x,y,z
middle temporal gyrus  x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z                      …