I have a dataset from which I am constructing a stacked barplot in R, and I want to know how to arrange the bars so that "similar" individuals cluster together. My dataset is a matrix of admixture proportions, Q, with d rows and n columns. In this toy dataset there are d = 10 ancestral populations and n = 5 individuals.
Here is the dataset:
> a
V1 V2 V3 V4 V5
1 0.534410243 0.009358740 0.011295181 0.2141751740 0.0030129254
2 0.026653603 0.372426720 0.447847534 0.0179177507 0.4072904477
3 0.193317915 0.003605024 0.003186611 0.4832114736 0.0007095471
4 0.111881585 0.000000000 0.000000000 0.2296213741 0.0119233461
5 0.089696570 0.591163629 0.509774416 0.0032542030 0.5535847030
6 0.007543558 0.000000000 0.000000000 0.0364907757 0.0013148362
7 0.004862942 0.000000000 0.002123909 0.0146682272 0.0004053690
8 0.009276195 0.011710457 0.014367894 0.0000000000 0.0000000000
9 0.006903171 0.004314528 0.011404455 0.0000000000 0.0126889937
10 0.015454219 0.007420903 0.000000000 0.0006610215 0.0090698319
I create a stacked barplot like so:
library(dplyr)
library(tidyr)
library(ggplot2)
# store the population index (row name) as a column, then reshape to long format
a <- a %>% mutate(pop = rownames(a))
a_long <- gather(a, key, value, -pop)
# trying to create a similarity index
a_long <- a_long %>% group_by(key) %>%
  mutate(mean = mean(value)) %>%
  arrange(desc(mean))
# looking at some of the expanded dataset
> a_long[1:20,]
# A tibble: 20 x 4
# Groups: key [2]
pop key value mean
<chr> <chr> <dbl> <dbl>
1 1 V2 0.00936 0.1
2 2 V2 0.372 0.1
3 3 V2 0.00361 0.1
4 4 V2 0 0.1
5 5 V2 0.591 0.1
6 6 V2 0 0.1
7 7 V2 0 0.1
8 8 V2 0.0117 0.1
9 9 V2 0.00431 0.1
10 10 V2 0.00742 0.1
11 1 V4 0.214 0.1
12 2 V4 0.0179 0.1
13 3 V4 0.483 0.1
14 4 V4 0.230 0.1
15 5 V4 0.00325 0.1
16 6 V4 0.0365 0.1
17 7 V4 0.0147 0.1
18 8 V4 0 0.1
19 9 V4 0 0.1
20 10 V4 0.000661 0.1
# colors
v_colors <- c("#440154FF", "#443B84FF", "#34618DFF", "#404588FF", "#1FA088FF", "#40BC72FF",
"#67CC5CFF", "#A9DB33FF", "#DDE318FF", "#FDE725FF")
plot <- ggplot(a_long, aes(x = key, y = value, fill = pop))
plot + geom_bar(position="stack", stat="identity") + scale_fill_manual(values = v_colors)
The output looks like this:
How can I make the output look neater, e.g. with the individuals that have a higher proportion of population 5 ancestry placed next to each other on the x-axis? So far, I have tried computing the mean of value for each individual, but that didn't work: each individual's proportions sum to 1, so the mean is 0.1 for everyone, which makes it useless for ordering. How can I create a similarity index that tells me how similar individual 1 is to individual 2, and how do I then order the individuals on the x-axis so that they look well clustered (e.g. like the barplots in this figure)?
Thanks!
In case you want to recreate the data frame a in the example above:
v1 = c(0.534410243, 0.026653603, 0.193317915, 0.111881585, 0.089696570, 0.007543558, 0.004862942, 0.009276195, 0.006903171, 0.015454219)
v2 = c(0.009358740, 0.372426720, 0.003605024, 0.000000000, 0.591163629, 0.000000000, 0.000000000, 0.011710457, 0.004314528, 0.007420903)
v3 = c(0.011295181, 0.447847534, 0.003186611, 0.000000000, 0.509774416, 0.000000000, 0.002123909, 0.014367894, 0.011404455, 0.000000000)
v4 = c(0.2141751740, 0.0179177507, 0.4832114736, 0.2296213741, 0.0032542030, 0.0364907757, 0.0146682272, 0.0000000000, 0.0000000000, 0.0006610215)
v5 = c(0.0030129254, 0.4072904477, 0.0007095471, 0.0119233461, 0.5535847030, 0.0013148362, 0.0004053690, 0.0000000000, 0.0126889937, 0.0090698319)
a = data.frame(V1 = v1, V2 = v2, V3 = v3, V4 = v4, V5 = v5)
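For reference, here is a minimal sketch of the kind of ordering I have in mind. It is one possible approach, not something I know to be the standard method. It assumes Euclidean distance between individuals' proportion vectors is a reasonable similarity measure and uses average-linkage hierarchical clustering. Both are arbitrary choices. The names d, hc, ind_order and p are only illustrative.

```r
library(ggplot2)

# Toy data from above: rows = ancestral populations, columns = individuals
a <- data.frame(
  V1 = c(0.534410243, 0.026653603, 0.193317915, 0.111881585, 0.089696570, 0.007543558, 0.004862942, 0.009276195, 0.006903171, 0.015454219),
  V2 = c(0.009358740, 0.372426720, 0.003605024, 0.000000000, 0.591163629, 0.000000000, 0.000000000, 0.011710457, 0.004314528, 0.007420903),
  V3 = c(0.011295181, 0.447847534, 0.003186611, 0.000000000, 0.509774416, 0.000000000, 0.002123909, 0.014367894, 0.011404455, 0.000000000),
  V4 = c(0.2141751740, 0.0179177507, 0.4832114736, 0.2296213741, 0.0032542030, 0.0364907757, 0.0146682272, 0.0000000000, 0.0000000000, 0.0006610215),
  V5 = c(0.0030129254, 0.4072904477, 0.0007095471, 0.0119233461, 0.5535847030, 0.0013148362, 0.0004053690, 0.0000000000, 0.0126889937, 0.0090698319)
)

# dist() compares rows, so transpose to get one row per individual.
# Assumption: Euclidean distance is an acceptable (dis)similarity here.
d <- dist(t(as.matrix(a)), method = "euclidean")

# Hierarchical clustering; the dendrogram's leaf order places
# similar individuals next to each other.
hc <- hclust(d, method = "average")
ind_order <- hc$labels[hc$order]

# Long format in base R. unlist() stacks the columns V1, V2, ... in order,
# so each individual's name is repeated once per population.
a_long <- data.frame(
  pop   = factor(rep(seq_len(nrow(a)), times = ncol(a))),
  key   = factor(rep(colnames(a), each = nrow(a)), levels = ind_order),
  value = unlist(a, use.names = FALSE)
)

# ggplot2 draws a discrete x-axis in factor-level order, so the bars
# follow the clustering order stored in the levels of key.
p <- ggplot(a_long, aes(x = key, y = value, fill = pop)) +
  geom_col(position = "stack")
print(p)
```

The key step is setting the levels of key to the clustering order. Any other ordering, for example sorting individuals by their population 5 proportion, could be plugged in the same way by changing ind_order.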