Almost all clustering algorithms rely on some kind of distance measure to calculate how close the instances are. For example, if John is 170 cm tall, Peter is 210 cm tall and Martin is 173 cm tall, then by height John is closer to Martin than to Peter, so John and Martin should potentially be placed in the same cluster.
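As a minimal sketch of that idea, the snippet below uses the absolute height difference as the distance measure. The function names `height_distance` and `nearest` are made up for illustration, and a real clustering algorithm would typically use several attributes and a richer measure.

```python
# Heights in centimetres, taken from the example above.
heights = {"John": 170, "Peter": 210, "Martin": 173}

def height_distance(a, b):
    """Absolute difference in height between two people, in cm."""
    return abs(heights[a] - heights[b])

def nearest(person):
    """Return the other person whose height is closest to `person`."""
    others = [p for p in heights if p != person]
    return min(others, key=lambda p: height_distance(person, p))

print(height_distance("John", "Martin"))  # 3
print(height_distance("John", "Peter"))   # 40
print(nearest("John"))                    # Martin
```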
But what happens if the data contain missing values? For example, suppose Mary’s height is undefined or was never recorded. How can we then measure the distance between Mary and the others according to height? Many researchers have addressed this problem over the years, and S. Conrad has written a nice survey evaluating the different methods proposed for clustering with missing values.
In fact, not all missing values are of the same nature. There are three different kinds:
- Missing completely at random (MCAR): The missingness does not depend on any data values, whether observed or missing, i.e. truly random missingness!
- Missing at random (MAR): The missingness depends on the observed data values and not on the missing ones. E.g. the value of “income” is missing for someone who is young, where age itself is observed.
- Not missing at random (NMAR): The missingness depends on the missing values themselves. E.g. the value of “income” is missing for someone who has a high salary!
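The three mechanisms can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the (age, income) records are randomly generated, the helper names `make_people` and `hide_income` are hypothetical, and the thresholds (30% chance, age under 25, income above 5000) are arbitrary. What matters is only what each hiding rule is allowed to look at.

```python
import random

def make_people(n, rng):
    """Synthetic (age, income) records; income loosely rises with age."""
    people = []
    for _ in range(n):
        age = rng.randint(18, 65)
        income = 1000 + 50 * age + rng.randint(0, 2000)
        people.append({"age": age, "income": income})
    return people

def hide_income(people, should_hide):
    """Copy the records, setting income to None where should_hide(person) is true."""
    return [
        {"age": p["age"], "income": None if should_hide(p) else p["income"]}
        for p in people
    ]

rng = random.Random(0)
people = make_people(1000, rng)

# MCAR: a coin flip that ignores every value in the record.
mcar = hide_income(people, lambda p: rng.random() < 0.3)

# MAR: depends only on age, which stays observed.
mar = hide_income(people, lambda p: p["age"] < 25)

# NMAR: depends on income, the very value being hidden.
nmar = hide_income(people, lambda p: p["income"] > 5000)
```

In the NMAR case every income that survives is at most 5000, so the observed incomes alone give a distorted picture of the population.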
Finally, what should we do with these missing values? The literature offers two answers:
- Imputation: Let’s estimate the missing values using the existing ones. For instance, for Mary’s height we insert the average height of all the people whose heights we know. Of course, imputation is not a reliable method, but it is a popular one!
- Marginalization: Let’s ignore the missing values. This method is much more reliable, because we don’t invent new data that may be far from reality!
There are two versions of marginalization: hard and soft. In the hard version, a preprocessing step removes every attribute that includes at least one missing value. So in our example we remove the “height” attribute from our calculations altogether. The problem here is that we lose many existing values, like the heights of John, Martin and Peter, because of a single missing value!
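To make the contrast concrete, here is a small sketch of mean imputation next to hard marginalization. The heights come from the example; the “weight” attribute and its values are invented purely so that something is left after the height column is dropped, and the function names `impute_mean` and `hard_marginalize` are hypothetical.

```python
# Heights from the example; the weights are invented for illustration only.
people = {
    "John":   {"height": 170,  "weight": 68},
    "Peter":  {"height": 210,  "weight": 95},
    "Martin": {"height": 173,  "weight": 72},
    "Mary":   {"height": None, "weight": 60},
}

def impute_mean(records, attribute):
    """Fill missing values of `attribute` with the mean of the observed ones."""
    observed = [r[attribute] for r in records.values() if r[attribute] is not None]
    mean = sum(observed) / len(observed)
    return {
        name: {**r, attribute: mean if r[attribute] is None else r[attribute]}
        for name, r in records.items()
    }

def hard_marginalize(records):
    """Drop every attribute that has at least one missing value."""
    attributes = next(iter(records.values())).keys()
    complete = [a for a in attributes
                if all(r[a] is not None for r in records.values())]
    return {name: {a: r[a] for a in complete} for name, r in records.items()}

imputed = impute_mean(people, "height")
print(imputed["Mary"]["height"])   # mean of 170, 210 and 173

reduced = hard_marginalize(people)
print(reduced["John"])             # only "weight" remains
```

After imputation Mary gets a height of roughly 184.3 cm, a value nobody measured. After hard marginalization nobody has a height at all, including the three people whose heights were known.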