Question: Which is the best smoothing technique for replacing zeros in data?
Deepak Tanwar • 4.1k wrote:
I have read some presentations and papers regarding smoothing techniques:
Smoothing N-gram Language models
An Empirical Study of Smoothing Techniques for Language Modeling
Improved Smoothing for N-gram Language Models Based on Ordinary Counts
I want to apply smoothing to data containing zero values. Which one would be best?
This is just an example:
             | Pathway1 | Pathway2 | Pathway3 | Pathway4
Calcium ions |        0 |        3 |        1 |        0
ATP          |        2 |        1 |        0 |        7
Sorry Deepak, I don't really understand - smoothing in my mind is something you do to continuous data, like a time series or genomic data. Your example of pathways is categorical data, in that Pathway2 doesn't really come before Pathway3 or after Pathway1; they are just categories.
So what do you want this data to look like?
Ultimately, the best smoothing algorithm is the one that is well described/understood to anyone who has to look at the result :)
Although it would never stand up in any other aspect of science, too often when it comes to smoothing data or intersecting genomic coordinates you see "then we did [stuff we're not even going to detail in the appendix] - and the result was [bold claims of novel biology]".
John, it was just an example; I am not actually going to do anything with pathways, and I can't disclose what I want to do. I have already used the Good-Turing estimate and Witten-Bell smoothing. To explain further: suppose you have a list of 30 people and a list of 500 software packages. You create a table with people as columns and software as rows, and you fill each cell with the number of times that person used that software in the last 10 years.
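For reference, the core Good-Turing idea on a count table like that looks roughly like the sketch below. This is only a toy Python illustration, not the actual analysis: the person and software names are invented, and the fallback for missing frequency-of-frequency values is a simplification that a real simple Good-Turing implementation avoids by smoothing the N_r values first.

```python
from collections import Counter

# Toy counts of how often each (person, software) pair was seen.
# Names are invented for illustration; the zeros are the cells to smooth.
counts = {
    ("alice", "samtools"): 5,
    ("alice", "bwa"):      1,
    ("bob",   "samtools"): 2,
    ("bob",   "bwa"):      0,
    ("carol", "bwa"):      1,
}

N = sum(counts.values())                  # total number of observations
freq_of_freq = Counter(counts.values())   # N_r: number of cells with count r

def good_turing_adjusted(r):
    """Basic Good-Turing adjusted count: r* = (r + 1) * N_{r+1} / N_r.

    Falls back to the raw count when N_{r+1} or N_r is zero; real
    implementations (simple Good-Turing) smooth the N_r values first.
    """
    n_r, n_r_plus_1 = freq_of_freq[r], freq_of_freq[r + 1]
    if n_r == 0 or n_r_plus_1 == 0:
        return float(r)
    return (r + 1) * n_r_plus_1 / n_r

# Good-Turing reserves a total probability mass of N_1 / N for unseen cells,
# shared equally among the zero-count cells here.
zero_cells = [cell for cell, c in counts.items() if c == 0]
p_unseen_each = (freq_of_freq[1] / N) / len(zero_cells) if zero_cells else 0.0

for cell, r in counts.items():
    p = p_unseen_each if r == 0 else good_turing_adjusted(r) / N
    print(cell, r, round(p, 3))
# Note: with the fallback above the probabilities are not exactly normalised;
# the sketch is only meant to show where the zero counts get their mass from.
```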
I want to replace the 0 counts. One way is Laplace (add-one) smoothing, which adds 1 to each count.
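As a point of comparison, add-one smoothing on a table like the pathway example above is only a few lines. This is a minimal Python sketch; the choice of alpha, the per-row layout, and the normalisation into probabilities are illustrative assumptions, not a fixed recipe:

```python
# Add-one (Laplace) smoothing on a small count table.
# Row and column labels mirror the toy pathway example; swap in your own data.
table = {
    "Calcium ions": [0, 3, 1, 0],   # counts for Pathway1..Pathway4
    "ATP":          [2, 1, 0, 7],
}

alpha = 1  # add-one; alpha < 1 gives add-k (Lidstone) smoothing instead

smoothed = {}
for name, row in table.items():
    total = sum(row) + alpha * len(row)          # renormalise after adding alpha
    smoothed[name] = [(c + alpha) / total for c in row]

for name, probs in smoothed.items():
    print(name, [round(p, 3) for p in probs])
# If you only want smoothed pseudo-counts rather than probabilities,
# keep (c + alpha) and skip the division by total.
```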
I hope that makes it clear.