Almost all kind of clustering algorithms exploit one kind of a distance measure to calculate how close the instances are. So for example if John is 170 cm long, Peter is 210 cm long and Martin is 173 cm long, according to their height John is closer to Martin than Peter and potentially they should be placed in a same cluster.

But what happens if there exist missing values in data? For example consider the height of Mary is not defined or missed. So how is it possible to measure the distance of Mary (according to her height) to others?! Many researchers have addressed this problem in previous years and S. Conrad has made a nice survey on evaluating different methods proposed for clustering with missing values.

In fact, all missing values are not in the same nature. There exist 3 different kinds of them:

  1. Missing values completely at random (MCAR): The missingness does not depend on the data values whether they are observed or missed, i.e. the real random missingness!
  2. Missing values at random (MAR): The missingness depends on the observed data values and not missing ones. E.g. the value of “income” for some one who is young is missed!
  3. Not missing at random (NMAR): The missingness depends on the missing values themselves. E.g. the value of “income” for some one who has a high salary is missed!

Finally, what should we do with these missing values? Here is the answer in the literature:

  • Imputation: Let’s estimate the missing values using existing ones. For instance for the height of Mary, we insert the average heights of all people that we have their heights. Of course, imputation is not a reliable method, but a popular one!
  • Marginalization: Let’s ignore missing values. This method is much more reliable, because we don’t invent new data which may be far from reality!

There exist two different versions of marginalization: hard and soft. In hard version, in a preprocessing step, we remove all attributes that include at least one missing value. So in our example we remove the “height” attribute from our calculations at all. The problem here is that we miss many existing values like heights of John, Martin and Peter only for one missing value!

The soft version of marginalization does not remove attributes with missing values, but tries to use the values from attributes with missing values whenever it is possible. So that means it uses the “height” values to compare John and Peter, but not Martin and Mary.
One implementation of soft marginalization is proposed by Wagstaff in 2004. In this method, we do classical clustering like k-means using attributes with no missing values, but whenever values from attributes with missing values were available, we also consider their effect in clustering. The figure below expresses the same algorithm as k-means but extended to the mechanism explained by Wagstaff.
Note “w” is a simple weight to consider the amount of effect of attributes with missing values. This method is indefeasible when missing values happens everywhere i.e. in all attributes.
Advertisements