Clustering is a powerful way to use the features of individuals to group similar individuals together. This enables businesses to understand and take action based on the individuals in each cluster. However, traditional clustering approaches suffer when features are not clean numeric values.
Generalized low rank models (GLRMs), a subject of my own PhD work (see Udell '16, of which I'm a co-author), provide a clustering framework that handles all types of data, even when entries are missing. Using GLRMs, we can generate clusters from datasets that would be considered too messy for other techniques.
What Makes Data Messy?
There are many ways the data itself can be difficult to work with if you intend to use it for clustering (or any machine learning model, for that matter). For example, the data may have:
- Missing values,
- Categorical or ratings-like features,
- Features with varying importance, or
- Outliers and anomalies.
Even worse, one or more of the data features may be a sequence or time series. We say a dataset with two or more of these symptoms is “messy.”
Imagine data organized in a table, where rows correspond to items and columns correspond to features. For example, below is a fictional table of data available to an eCommerce business that uses cookies to track how visitors navigate its website across multiple visits. Here the rows correspond to visitors and the tracking data correspond to features.
| | Age | Income | Zip Code | Device Type | Page Visits | Purchase |
|---|---|---|---|---|---|---|
| Visitor A | 19 | N/A | 90403 | [Mobile] | [Product, Checkout] | Yes |
| Visitor B | 42 | $51k | 91204 | [Mobile, Desktop] | [About us, Careers] | No |
| Visitor C | 35 | $84k | 51924 | [Desktop, N/A] | [Product, Product, FAQ] | No |
We say each feature has a “data type.” In the example above, Age and Income are numerical (but note that Income is orders of magnitude larger than Age). Zip Code and Purchase are categorical, and Device Type and Page Visits are sequential. (While Zip Code looks numeric, it’s actually categorical because the similarity in value is not meaningful.) Also, some values are not available, which is represented by N/A. In summary, this dataset is “messy” because each customer is described by a mix of data types where some values are missing.
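To make this concrete, here is a minimal sketch of how such a table might be represented in Python with pandas. The column names and values mirror the table above; storing the sequential features as Python lists is just one illustrative choice, not a prescribed format.

```python
import numpy as np
import pandas as pd

# A small, fictional version of the table above. Sequential features
# (Device Type, Page Visits) are stored as Python lists; missing values
# are represented with np.nan / None.
visitors = pd.DataFrame(
    {
        "Age": [19, 42, 35],
        "Income": [np.nan, 51_000, 84_000],        # dollars; Visitor A is missing
        "Zip Code": ["90403", "91204", "51924"],   # categorical despite looking numeric
        "Device Type": [["Mobile"], ["Mobile", "Desktop"], ["Desktop", None]],
        "Page Visits": [
            ["Product", "Checkout"],
            ["About us", "Careers"],
            ["Product", "Product", "FAQ"],
        ],
        "Purchase": ["Yes", "No", "No"],
    },
    index=["Visitor A", "Visitor B", "Visitor C"],
)

print(visitors)
```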
With this nomenclature established, we’re now prepared to ask ourselves: “How do I cluster messy data?” The answer depends on what you seek to accomplish by doing so.
Introducing the Clustering Problem
Clustering is a widely-known unsupervised learning problem, but it is nuanced and easily misunderstood. The challenge is to group items so that members can be represented by an archetype. A good clustering will create groups where the archetype is similar to its members and the archetypes are sufficiently different from each other.
Clustering can be seen as having two purposes: to be descriptive or to be predictive. The best descriptive clustering will lose as little information as possible, whereas predictive clustering will group things that likely share some unknown outcome.
To quantify this distinction, let’s establish a “distance” metric for evaluating how far an item is considered to be from its group archetype. The best clustering algorithms will choose groupings and corresponding archetypes that minimize the total error. How we choose to define “distance” affects not only how the users are grouped, but also the representative archetype of each cluster.
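One common way to write this objective is shown below, where x_i is the feature vector of item i, a_1, ..., a_K are the archetypes, c(i) assigns item i to one of the K clusters, and d is the chosen distance (the notation here is introduced for illustration):

```latex
\min_{c,\; a_1, \dots, a_K} \;\; \sum_{i=1}^{n} d\!\left(x_i,\, a_{c(i)}\right)
```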
For example, using total squared error over all features as the distance yields the K-means algorithm. Alternatively, a predictive metric such as the area under the curve (AUC) for predicting purchases can be optimized by fitting a decision tree, whose leaves serve as the clusters. The former is more likely to create groups of similar-looking users, while the latter will have more predictive power.
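As a minimal sketch of the first option, here is K-means with scikit-learn on purely numeric, fully observed data (the values are hypothetical). It implements exactly the squared-error objective above, and it already shows one symptom of messy data: without rescaling, the larger-valued feature dominates the distance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Purely numeric, fully observed features (hypothetical values):
# columns are [age, income].
X = np.array([
    [19, 32_000],
    [42, 51_000],
    [35, 84_000],
    [23, 41_000],
    [58, 77_000],
])

# K-means minimizes total squared distance from each row to its
# cluster centroid -- the centroids play the role of archetypes.
# Note that income, being orders of magnitude larger than age,
# dominates the squared error unless the features are rescaled.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster assignments:", kmeans.labels_)
print("archetypes (centroids):\n", kmeans.cluster_centers_)
```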
A Few Applications of GLRM
To illustrate how powerful clustering with GLRM can be, we'll walk through some potential applications. Note that both of the examples below still work well even when values are missing from some rows and some columns, and that we may choose as many clusters as we want.
First, you may want to cluster consumers using marketing data. You could use web page visits (time series), marketing channel (category), and total spend so far (number) to group consumers into three segments: The Reactive Buyer (visits pages after being targeted by a campaign), The Browser (regularly returns to browse pages while purchasing little), and The Stoic (doesn't alter behavior despite being targeted by marketing campaigns).
As another example, we could use product purchases (category), zip code (category), age (number), and device type (category) to group buyers into segments such as The College Student (university zip codes with ages 18-22), The New Tech Fan (buys the newest products using mobile), and The Traveler (purchases from a desktop across many zip codes).
In both of the examples above, traditional clustering algorithms would be difficult, if not impossible, to use. First, there is the issue of missing data. A common way to handle this problem is to replace each missing value with the most common value for that feature, or else to throw out the affected rows entirely.
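Here is a sketch of those two common workarounds, using scikit-learn's SimpleImputer for most-frequent-value imputation and pandas dropna for discarding rows; the small table is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {"age": [19, 42, 42, np.nan], "income": [np.nan, 51_000, 84_000, 51_000]}
)

# Workaround 1: fill each missing value with the most common value
# observed in that column (age -> 42, income -> 51_000).
imputed = pd.DataFrame(
    SimpleImputer(strategy="most_frequent").fit_transform(df),
    columns=df.columns,
)

# Workaround 2: throw away any row with a missing value.
dropped = df.dropna()

print(imputed)
print(dropped)
```

Both workarounds distort the data: imputation invents values that were never observed, and dropping rows can discard a large fraction of the dataset.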
Additionally, data is often categorical rather than numerical, especially when it describes human behavior. A common way to handle categorical data is one-hot encoding: the data table is appended with one new column per category, which is one if the corresponding category appears and zero otherwise. This approach not only explodes the dimensionality of the dataset, but also discards any notion that one category may be more similar to another than to a third.
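A minimal sketch with pandas (and a hypothetical device column) makes the problem visible: one categorical column becomes as many columns as there are categories, and every pair of distinct categories ends up equally far apart.

```python
import pandas as pd

df = pd.DataFrame({"device": ["Mobile", "Desktop", "Tablet", "Mobile"]})

# One column with k categories becomes k separate 0/1 columns.
encoded = pd.get_dummies(df, columns=["device"], dtype=int)
print(encoded)
# With thousands of zip codes or product IDs this blows up the number
# of columns, and "Mobile" is exactly as far from "Tablet" as it is
# from "Desktop" -- no notion of relative similarity survives.
```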
Finally, it is very difficult with current methods to cluster data where one or more features are time series. The best one can hope to do is replace the time series with numerical summary statistics that still discriminate between clusters. This adds a new layer of complexity in feature engineering.
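As a sketch of that workaround, here is hand-rolled feature engineering that collapses each visitor's (hypothetical) page-visit sequence into a few summary statistics before clustering. Which statistics to compute, and whether they preserve the structure that actually matters, is left entirely to the modeler.

```python
import pandas as pd

# Hypothetical per-visitor page-visit sequences.
sequences = {
    "Visitor A": ["Product", "Checkout"],
    "Visitor B": ["About us", "Careers"],
    "Visitor C": ["Product", "Product", "FAQ"],
}

# Replace each sequence with a few numeric summary statistics.
features = pd.DataFrame(
    {
        name: {
            "n_pages": len(seq),
            "n_unique_pages": len(set(seq)),
            "visited_product": int("Product" in seq),
            "visited_checkout": int("Checkout" in seq),
        }
        for name, seq in sequences.items()
    }
).T

print(features)  # these rows, not the raw sequences, get fed to the clusterer
```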
In conclusion, this powerful and novel approach to clustering using GLRM archetypes circumvents many of the limitations of traditional clustering approaches. It also does so in a manner that makes the results easier to interpret. If you are working on a clustering problem, it’s worth trying out GLRM.
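To close with a sketch of the core idea (a toy quadratic variant in NumPy, not the exact algorithm from Udell '16), a GLRM factors the data table into a tall matrix of per-item representations and a short matrix of archetypes, fitting only the entries that are observed; clusters then come from assigning each item to its dominant archetype.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy numeric data table with missing entries (NaN); rows are items.
A = rng.normal(size=(50, 6))
A[rng.random(A.shape) < 0.2] = np.nan          # roughly 20% of entries missing
mask = ~np.isnan(A)

k = 3                                          # number of archetypes / clusters
X = rng.normal(size=(A.shape[0], k))           # per-item representations
Y = rng.normal(size=(k, A.shape[1]))           # archetypes (rows of Y)

# Alternating least squares on observed entries only:
# minimize the sum over observed (i, j) of (A[i, j] - (X @ Y)[i, j])^2.
for _ in range(50):
    for i in range(A.shape[0]):                # update each row of X
        obs = mask[i]
        X[i] = np.linalg.lstsq(Y[:, obs].T, A[i, obs], rcond=None)[0]
    for j in range(A.shape[1]):                # update each column of Y
        obs = mask[:, j]
        Y[:, j] = np.linalg.lstsq(X[obs], A[obs, j], rcond=None)[0]

# One simple choice: cluster each item by its dominant archetype.
clusters = np.argmax(np.abs(X), axis=1)
print(np.bincount(clusters, minlength=k))
```

Real GLRMs replace the single squared loss with a loss matched to each column's data type (quadratic, categorical, ordinal, and so on) and add regularization, which is what lets the same factorization cover the messy, mixed-type tables discussed above.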