How to compare the similarity of a dataset against a larger dataset and plot a heat map?
4.2 years ago
nattzy94 ▴ 30

I have datasets containing mutations for 26 samples (so 26 different sets of data) and I want to compare how similar they are to one another. For example, I would like to compare Sample 1 against each of Samples 1 to 26, and do the same for every other sample. At the end, I hope to get something like a heat map where the rows and the columns are each labelled with the 26 samples.

Someone has suggested using the intersect() and union() functions in R to calculate the similarity, but that would be very laborious as I would have to run the functions 676 times (26*26).

Is there any program to do this quickly or is there a way that I could make this more efficient in R?

heat map R • 1.7k views

What is your data, i.e. how is each sample represented? What kind of similarity are you looking for, i.e. how do you define similarity between two samples? Note that running a function ~700 times is not a big deal (in R or any other language) unless each run takes days.

I guess you are looking for a correlation map, not a heat map. A correlation map compares all samples against all samples (26 x 26).

If the variables are continuous and the expected trend is linear, then you can use correlation. There are alternative distance measures that you can also use to get a better idea of how closely two variables are related to one another. You should be careful about which clustering methods you use, though, because they have statistical assumptions that need to be met in order to be used properly.

Thanks for the replies! My data contains information about types of SNPs in the format Chr1_Pos_A_T, for example. This is all stored in a single column in a text file.
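For readers following along, here is a minimal sketch of how data in that shape could be read into sets, which is a convenient form for the set-based comparisons discussed in this thread. It assumes one text file per sample with one SNP identifier per line; the function names `load_snp_set` and `load_samples`, the `.txt` extension, and the directory layout are hypothetical and not taken from the original post.

```python
from pathlib import Path


def load_snp_set(path):
    """Read one SNP identifier per line (e.g. Chr1_12345_A_T) into a set."""
    with open(path) as handle:
        # Strip whitespace and skip blank lines; the set drops duplicate entries.
        return {line.strip() for line in handle if line.strip()}


def load_samples(directory, pattern="*.txt"):
    """Map each file's stem (used here as the sample name) to its set of SNPs.

    Assumption: one file per sample, all in the same directory.
    """
    paths = sorted(Path(directory).glob(pattern))
    return {path.stem: load_snp_set(path) for path in paths}
```

Calling `load_samples("snps/")` on a hypothetical directory of 26 files would return a dictionary with 26 entries, one set of SNP strings per sample.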

So for each sample you have a list of SNP positions. If the similarity you're after is about the fraction of SNPs two samples share, then you could use the Jaccard index (the size of the intersection divided by the size of the union) or any other measure of similarity between sets. As you've been told, you can use the R intersect() and union() functions, or convert your data to binary vectors and use similarity functions from the proxy package. However, if you have very large numbers of SNPs, it is possible that the similarities become meaningless due to the distance concentration phenomenon.
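To make the Jaccard suggestion concrete, here is a small sketch of the all-against-all comparison. It is written in Python rather than R, and it assumes each sample is already available as a set of SNP strings; the helper names `jaccard` and `similarity_matrix` and the toy data are made up for illustration. The point is that the 26 x 26 comparisons are just a nested loop, not 676 manual calls.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets.

    Assumption: two empty sets are treated as identical (1.0).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def similarity_matrix(samples):
    """Return (names, matrix) with matrix[i][j] = Jaccard index of samples i and j."""
    names = list(samples)
    matrix = [[jaccard(samples[row], samples[col]) for col in names]
              for row in names]
    return names, matrix


# Toy data standing in for the 26 samples (made-up SNPs, for illustration only).
toy = {
    "Sample1": {"Chr1_100_A_T", "Chr1_200_G_C", "Chr2_50_T_C"},
    "Sample2": {"Chr1_100_A_T", "Chr2_50_T_C"},
    "Sample3": {"Chr3_7_C_A"},
}
names, matrix = similarity_matrix(toy)
```

The resulting matrix is symmetric with ones on the diagonal, so only the upper triangle strictly needs computing; the full loop is kept here because it is simpler and the cost for 26 samples is negligible.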

4.2 years ago
Bioaln ▴ 350
for dataset in datasets:
    for anotherDataset in datasets:
        compare the two somehow (the measure is unclear from the question)
        save the result to a data frame (e.g. pandas in Python)

finally:
    plot a heat map: pivot on the similarity value, with the datasets on the x and y axes


This way, the diagonal will show 100% identity, since each dataset is compared with itself.
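The pivot-and-plot step could look roughly like the sketch below. It is only one possible reading of the outline: it uses plain Python lists instead of a pandas data frame for the pivot, it assumes the comparison results are available as (dataset, other dataset, similarity) records with similarities between 0 and 1, and it assumes matplotlib is installed for the plot. The names `pivot`, `plot_heatmap`, and the output file name are hypothetical.

```python
def pivot(records):
    """Turn (row, column, value) records into sorted labels and a square matrix.

    Labels are sorted lexicographically; pairs with no record stay as NaN.
    """
    labels = sorted({row for row, _, _ in records} | {col for _, col, _ in records})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[float("nan")] * len(labels) for _ in labels]
    for row, col, value in records:
        matrix[index[row]][index[col]] = value
    return labels, matrix


def plot_heatmap(labels, matrix, out_file="similarity_heatmap.png"):
    """Save the matrix as a heat map image (requires matplotlib)."""
    # Imported here so that pivot() can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, vmin=0, vmax=1, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax, label="similarity")
    fig.tight_layout()
    fig.savefig(out_file)
```

A call such as `plot_heatmap(*pivot(records))` would then write the image, with the datasets on both axes and the brightest cells along the diagonal.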

Hope this helps.

Thanks for the reply! My data contains information about types of SNPs in the format Chr1_Pos_A_T, for example. This is all stored in a single column in a text file.