post-mortem brain gene expression, comparing tissue sample points across brains
7.2 years ago
avari ▴ 110

Dear Biostar experts,

I want to analyze gene expression data from 6 adult brains belonging to the Allen adult human brain dataset.

Each brain has up to 500 tissue samples with gene expression data per hemisphere across multiple brain structures. For each brain I have a csv file that looks like the following (each row has the MNI space coordinate and the brain structure per tissue sample).

structure_id  structure_name                               polygon_id  mni_x  mni_y  mni_z
9137          abducens nucleus, left                       978024      -5.1   -44.6  -42.7
9137          abducens nucleus, left                       977815      -3.9   -40    -37.9
4329          amygdalohippocampal transition zone, left    73464       -22.1  -7.6   -9.6
4329          amygdalohippocampal transition zone, left    73236       -19.1  -13.6  -11.8
4114          angular gyrus, left, inferior bank of gyrus  27442       -43.9  -76.3  27.4


Each sample was taken at a slightly different position within each brain structure, and not all brain structures were sampled the same number of times. Within each brain, I have ordered the samples in each structure by their MNI coordinates. I have then counted the number of samples per structure, per brain, like so:

                       brain_1  brain_2  brain_3  brain_4  brain_5  brain_6
putamen, left          3        3        3        3        3        2
middle temporal gyrus  3        3        3        3        2        2
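
For reference, a count table like the one above could be built roughly as follows. This is only a sketch: it assumes each brain's csv has already been loaded into a pandas DataFrame with a `structure_name` column, and the function name `count_samples` and the toy data are made up for illustration.

```python
import pandas as pd

def count_samples(frames):
    """Count samples per structure for each brain.

    frames: dict mapping a brain label to a DataFrame that has a
    'structure_name' column (one row per tissue sample).
    """
    counts = {
        brain: df.groupby("structure_name").size()
        for brain, df in frames.items()
    }
    # Structures that were not sampled in a brain get a count of 0.
    return pd.DataFrame(counts).fillna(0).astype(int)

# Toy data standing in for two of the per-brain csv files (hypothetical).
frames = {
    "brain_1": pd.DataFrame(
        {"structure_name": ["putamen, left"] * 3 + ["middle temporal gyrus"] * 3}
    ),
    "brain_6": pd.DataFrame(
        {"structure_name": ["putamen, left"] * 2 + ["middle temporal gyrus"] * 2}
    ),
}
counts = count_samples(frames)
print(counts)
```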


With this data I want to calculate a mean coordinate for each sample across brains. However, because the number of sample points per brain structure differs between brains, some rows have fewer data points than others.

One approach would be to take the intersect (i.e. in the example below every brain region was sampled at least 2 times, so generate 2 mean coordinates per region). The problem is that this might result in losing a lot of data points. Another approach could be to allow one missing data point (or a certain proportion of missing data points) across brains, so in this example generate 3 mean coordinates for the putamen and 2 mean coordinates for the middle temporal gyrus.

My question is whether you think this is a reasonable strategy, and whether you have any alternative suggestions. I want my final list of coordinates to be as representative as possible of the underlying data. Thanks very much!

                       brain_1  brain_2  brain_3  brain_4  brain_5  brain_6  mean coord
putamen, left          x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z    x,y,z

middle temporal gyrus  x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    x,y,z    …
                       x,y,z    x,y,z    x,y,z    x,y,z                      …
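
To make the two options concrete, here is a minimal sketch of how the mean coordinates could be computed with a tunable threshold. It assumes all brains are stacked into one long DataFrame with columns `brain`, `structure_name`, `mni_x`, `mni_y`, `mni_z`, and that "ordering by MNI coordinates" means sorting by x, then y, then z within each structure of each brain (that ordering rule is my assumption). The names `mean_coords`, `sample_rank`, `n_brains` and `min_brains` are hypothetical.

```python
import pandas as pd

COORDS = ["mni_x", "mni_y", "mni_z"]

def mean_coords(samples, min_brains):
    """Average the k-th ordered sample of each structure across brains.

    A row (structure, k) is kept only if at least `min_brains` brains
    have a k-th sample in that structure.
    """
    # Assumed ordering: sort by x, then y, then z within brain and structure.
    ordered = samples.sort_values(["brain", "structure_name"] + COORDS)
    # Position of each sample within its (brain, structure) group: 0, 1, 2, ...
    ordered = ordered.assign(
        sample_rank=ordered.groupby(["brain", "structure_name"]).cumcount()
    )
    grouped = ordered.groupby(["structure_name", "sample_rank"])
    out = grouped[COORDS].mean()
    out["n_brains"] = grouped["brain"].nunique()
    return out[out["n_brains"] >= min_brains].reset_index()

# Toy data (made up): brain_3 has only one putamen sample.
samples = pd.DataFrame(
    [
        ("brain_1", "putamen, left", -24.0, 2.0, 4.0),
        ("brain_1", "putamen, left", -22.0, 6.0, 0.0),
        ("brain_2", "putamen, left", -26.0, 4.0, 2.0),
        ("brain_2", "putamen, left", -20.0, 8.0, 2.0),
        ("brain_3", "putamen, left", -25.0, 3.0, 3.0),
    ],
    columns=["brain", "structure_name"] + COORDS,
)

strict = mean_coords(samples, min_brains=3)   # intersect: every brain required
relaxed = mean_coords(samples, min_brains=2)  # allow one missing brain
print(strict)
print(relaxed)
```

With `min_brains` equal to the number of brains this corresponds to the intersect approach; lowering it by one corresponds to allowing one missing data point per row. The `n_brains` column shows how many brains contributed to each mean.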

coordinates gene-expression post-mortem-brains • 2.1k views