Clustering Messy Time Series

Clustering can be challenging when the data includes time series, especially when the series have differing lengths. As part of my PhD work at Stanford, my colleagues and I built a framework for finding structure in messy data tables where some features may be time series. In this blog, I will provide you with the code and theory to cluster your own messy time series. You can download the notebook here.

Time series data table

If you work in marketing, economics or commerce, your data may look something like the data in this user table:

In finance, your table might be an investment portfolio with features like market cap, department size, industry, and history of stock prices. In medicine, patients’ health records capture age, illness, and test results over time.

Experience and intuition tell you that unsupervised methods such as clustering are the right tool for making sense of this data, because interpretability, simplicity and a view of the whole picture matter more than predictive accuracy. However, the data is clearly messy and contains time series features, as you can see in the customers’ engagement data in the table above.

How can we extend our clustering toolbox to handle these messy time series tables?

Cluster Time Series Data

When using the k-means algorithm for clustering numerical data, data samples are treated as vectors and the goal is to sort them into k groups, or clusters. A cluster’s archetype is a hypothetical data point that best represents all data assigned to that cluster. Assuming we have an initial guess for the k archetypes, we:

  • Assign data samples to the nearest archetype
  • Update the k archetypes to be their samples’ center

and iterate this process until the archetypes converge.

For k-means, nearest and center are both defined by the Euclidean distance (the square root of the sum of squared differences). This is clear for the “assign” stage; for the “update” stage, taking the sample mean is equivalent to minimizing the total squared Euclidean distance from the archetype to its assigned samples.
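
As a minimal sketch of that loop (for illustration only; the helper name kmeans_sketch is mine, and the experiments below use scikit-learn's KMeans rather than this hand-rolled version), the assign/update iteration with Euclidean distance looks like this:

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    # X: (m, n) array of samples; k: number of archetypes (clusters).
    rng = np.random.default_rng(seed)
    # Initialize archetypes as k randomly chosen samples.
    archetypes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: each sample goes to its nearest archetype (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - archetypes[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each archetype becomes the mean of its assigned samples.
        updated = np.vstack([X[labels == i].mean(axis=0) if np.any(labels == i)
                             else archetypes[i] for i in range(k)])
        if np.allclose(updated, archetypes):
            break
        archetypes = updated
    return archetypes, labels

Swapping in a different notion of distance (and a matching notion of center) yields a different member of this family of algorithms.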

When we use absolute distance instead, with the coordinate-wise median as the center, we get the k-medians algorithm. As we think about how to cluster our time series data, a natural question arises: what distance metric is well-suited for handling time series?

Earth mover’s distance

One option is the Wasserstein (earth-mover’s) distance metric:
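
For two series x and y of common length T, the cumulative-sum form used throughout this post is (note the classical one-dimensional formula uses absolute differences where this variant uses squares under the root):

$$ d_{\mathrm{EM}}(x, y) \;=\; \big\lVert \operatorname{cumsum}(x) - \operatorname{cumsum}(y) \big\rVert_2 \;=\; \left( \sum_{t=1}^{T} \Big( \sum_{s \le t} x_s - \sum_{s \le t} y_s \Big)^{2} \right)^{1/2} $$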

For one-dimensional sequences, we can interpret earth mover’s distance to be the Euclidean distance between the integral (cumulative sum) of two series.

Our choice of earth mover’s distance is convenient both computationally and intuitively. Computationally, it requires only a single transformation (taking the integral, or cumulative sum) before applying the familiar k-means algorithm, followed by differencing the recovered archetypes to return to the original space. Intuitively, comparing integrals is a natural way to measure differences in the shape of time-aligned series.

As a sanity check, let’s run a toy experiment, as seen in the figure below. To begin, we construct four “true” archetypes that have easily distinguishable shapes (right lower subfigure).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def generate_truth(T):
    # Archetype 1: a triangular ramp up and back down over the first half.
    series1 = np.zeros(T)
    series1[:int(T/4)] = range(int(T/4))
    series1[int(T/4):int(T/2)-1] = range(int(T/4)-1)[::-1]
    series1 = series1/series1.max()

    # Archetype 2: a sinusoid.
    series2 = np.sin(np.arange(T)/4)
    series2 = series2/series2.max()

    # Archetype 3: a rectangular pulse covering the first half.
    series3 = np.zeros(T)
    series3[1:int(T/2)] = 1
    series3 = series3/series3.max()

    # Archetype 4: a Gaussian bump centered at the halfway point.
    series4 = np.exp(-(np.arange(T)-(T/2.0))**2/(T/4)**2)
    series4 = series4/series4.max()

    series = [series1, series2, series3, series4]

    f, axes = plt.subplots(nrows=2, ncols=2)
    for i in range(2):
        for j in range(2):
            axes[i,j].plot(series[2*i+j])
            axes[i,j].axis('off')
    f.savefig("figures/truth", dpi=600)
    return series
	
T = 60
m = 10000
series = generate_truth(T) # k=4 by design

We generate m noisy time series samples by taking one of the four shapes and adding Gaussian noise scaled by a noise level (left subfigure).

def generate_data(T, m, noise_level):
    # Each sample is one of the four true shapes (the global `series` from
    # generate_truth) plus Gaussian noise scaled by noise_level.
    A = np.zeros((m,T))
    labels = np.random.choice(range(4), size=m, replace=True)
    for i in range(m):
        noise = np.random.normal(size=T)*noise_level
        A[i,:] = series[labels[i]] + noise
    
    f, axes = plt.subplots(nrows=3, ncols=4)
    for i in range(3):
        for j in range(4):
            axes[i,j].plot(A[4*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/simple_glimpse", dpi=600)
    
    f, axes = plt.subplots(nrows=8, ncols=3, figsize=(6,10))
    for i in range(8):
        for j in range(3):
            axes[i,j].plot(A[3*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/simple_glimpse_big", dpi=600)
    return A
	
noise = 0.4 # level
A = generate_data(T, m, noise)

Finally, we use our variation on k-means using earth mover’s distance to recover the shapes (right upper subfigure).

def cluster_time_series(A):
    # Move to "earth mover" space by taking cumulative sums along time.
    B = A.cumsum(axis=1)

    # Ordinary k-means in the transformed space.
    model = KMeans(n_clusters=4, random_state=0)
    model.fit(B)
    C = model.cluster_centers_

    # Difference the centroids to map the archetypes back to the original space.
    C[:,1:] -= C[:,:-1].copy()
    
    f, axes = plt.subplots(nrows=2, ncols=2)
    for i in range(2):
        for j in range(2):
            axes[i,j].plot(C[2*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/simple_cluster", dpi=600)
    return C, model.labels_
	
C, labels = cluster_time_series(A)

To see the value in clustering these noisy series, let’s overlay each sample’s archetype on the raw data as a form of de-noising. As seen below, doing so makes it much easier to see what’s happening in the data.

def impute_simple(A, C, labels, random_seed=1):
    # Each sample's de-noised version is simply its assigned archetype.
    B = C[labels, :]
    
    f, axes = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True)
    for i in range(3):
        for j in range(4):
            axes[i,j].plot(A[4*i+j, :], alpha=0.3, color='k')
            axes[i,j].plot(B[4*i+j, :], alpha=1.0, color='b')
            axes[i,j].axis('off')
    f.savefig("figures/simple_glimpse_impute", dpi=600)
	
impute_simple(A, C, labels)

Earth mover’s distance is a good initial choice for a time series distance metric: It’s clear, simple and performs well. However, most data in the real world will never be this tidy. In particular, it’s rare to encounter a set of time series that all have the same length.

Handle Time-Varying Lengths

Typically, a set of time series samples will look like this:

def remove_entries(A, missing_samples, missing_series, sections=10, random_seed=1): 
    # remove (triangle-structured) missing entries (unknown futures)
    np.random.seed(random_seed)
    m, T = A.shape
    Z = A.copy()
    
    col_jump = int(T*missing_series/sections)
    row_jump = int(m*missing_samples/sections)
    to_remove_row = [m-(row_jump*(sections-i)) for i in range(sections)]
    to_remove_col = [T-(col_jump*(i+1)) for i in range(sections)]
    to_remove = list(zip(to_remove_row, to_remove_col))
    for i,j in to_remove:
        Z[i:,j:] = np.nan
        
    f, axes = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True)
    Y = Z.copy()
    np.random.shuffle(Y)
    for i in range(3):
        for j in range(4):
            axes[i,j].plot(Y[4*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/missing_glimpse", dpi=600) 
    
    f, axes = plt.subplots(nrows=8, ncols=3, figsize=(6,10), sharex=True, sharey=True)
    for i in range(8):
        for j in range(3):
            axes[i,j].plot(Y[3*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/missing_glimpse_big", dpi=60)
    return Z, to_remove
	
missing_samples = (m-5)/m
missing_entries = (T-20)/T
A_missing, to_remove = remove_entries(A, missing_samples, missing_entries, sections=4)

Returning to our original examples: some customers have accounts that are two years old, others one year or one month old. Stocks have been publicly traded for differing lengths of time. The lengths of patients’ illnesses and test histories also vary.

To circumvent the issue of differing lengths, let’s imagine that all of the time series actually have the same length and treat the unobserved future values as missing. The problem then becomes clustering data with missing values.
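
A minimal sketch of that reframing, assuming the raw series arrive as a Python list of 1-D arrays of differing lengths (the helper pad_to_common_length is illustrative, not part of the notebook):

import numpy as np

def pad_to_common_length(series_list):
    # Pad every series with NaN so that unobserved "future" values become
    # missing entries in a common-length array.
    T = max(len(s) for s in series_list)
    A = np.full((len(series_list), T), np.nan)
    for i, s in enumerate(series_list):
        A[i, :len(s)] = s
    return A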

There are two standard approaches for handling missing data; I’ll demonstrate both, followed by one of our own.

Approach 1: Drop incomplete samples

As a first attempt, let’s drop all but the longest time series so we don’t need to fit to data with any missing values.

def cluster_exclude_missing(A):
    m, T = A.shape
    A_exclude = A.copy()

    # Keep only the samples with no missing entries.
    A_exclude = A_exclude[np.isnan(A_exclude).sum(axis=1)==0, :]
    B_exclude = A_exclude.cumsum(axis=1)
    
    model = KMeans(n_clusters=4, random_state=0)
    model.fit(B_exclude)
    C_exclude = model.cluster_centers_
    
    C_exclude[:,1:] -= C_exclude[:,:-1].copy()
    
    f, axes = plt.subplots(nrows=2, ncols=2)
    for i in range(2):
        for j in range(2):
            axes[i,j].plot(C_exclude[2*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/missing_experiment_exclude", dpi=600)
    return C_exclude
	
C_exclude = cluster_exclude_missing(A_missing)

The figure below, in the same format as above, shows the result of our earth mover’s variation of k-means. This time, most of the data has been thrown out:

In this case, there are too few complete samples to make sense of the noisy patterns.

Approach 2: Substitute column means

Next, let’s compute the average value at each point in time across the samples where it is known, and substitute that column mean for each missing value.


def cluster_substitute_mean(A):
    m, T = A.shape
    A_replace = A.copy()
    for j in range(T):
        A_replace[np.isnan(A_replace[:,j]), j] = np.nanmean(A_replace[:,j])
    B_replace = A_replace.cumsum(axis=1)
            
    model = KMeans(n_clusters=4, random_state=0)
    model.fit(B_replace)
    C_replace = model.cluster_centers_
    
    C_replace[:,1:] -= C_replace[:,:-1].copy()

    f, axes = plt.subplots(nrows=2, ncols=2)
    for i in range(2):
        for j in range(2):
            axes[i,j].plot(C_replace[2*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/missing_experiment_replace", dpi=600)
    return C_replace

C_replace = cluster_substitute_mean(A_missing)

Glancing at the results, it’s clear that injecting fake data muddies the analysis. In particular, as the signals progress in time they become more and more similar, until they are indistinguishable and the recovered shapes are distorted. This happens because later entries share the same substituted values.

Approach 3: Our approach

Finally, we use our own method to estimate the archetypes. We initialize them as in the first approach, using only the complete samples, and then fold samples with fewer and fewer observed entries into the archetype estimates, without injecting any made-up data.

The results below show that our approach outperforms both standard approaches. It still struggles to pin down the final points in time, but that can be attributed to the extremely small number of samples observed that far into the series.

Once more, we can overlay the archetypes assigned to each sample to de-noise it and reveal general trends. In this setting, we can use trends to predict future outcomes for series that are missing future values.

def impute_missing(A, C, labels, random_seed=1):
    # Each sample's de-noised (and forward-filled) version is its archetype.
    B = C[labels, :]
    
    f, axes = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True)
    np.random.seed(random_seed)
    A_shuffle = A.copy()
    np.random.shuffle(A_shuffle)
    np.random.seed(random_seed)
    B_shuffle = B.copy()
    np.random.shuffle(B_shuffle)
    for i in range(3):
        for j in range(4):
            axes[i,j].plot(A_shuffle[4*i+j, :], alpha=0.3, color='k')
            axes[i,j].plot(B_shuffle[4*i+j, :], alpha=1.0, color='b')
            axes[i,j].axis('off')
    f.savefig("figures/missing_glimpse_impute", dpi=600)

# see below for definition of cluster_pivot

C_glrm, labels = cluster_pivot(A_missing, to_remove)
A_hat = impute_missing(A_missing, C_glrm, labels)

Clustering messy data tables

So far, we’ve limited our scope to tables with only one time series feature. With additional features, it becomes possible to guess what a time series would look like even when it isn’t observed at all: we can leverage the sample’s other features, and the patterns captured across the rest of the data, to make a prediction.

We refer to this as time series imputation. This approach helps us estimate how valuable a customer would be upon conversion. If we were looking at stocks, it could help us estimate how valuable a company would be after an IPO.

from sklearn.datasets import make_blobs
from mpl_toolkits.mplot3d import Axes3D

def generate_additional_features(A, labels, d):
    np.random.seed(0)
    m, T = A.shape
    A_sorted = A[np.argsort(labels), :]
    B, blob_labels = make_blobs(n_samples = [len(labels[labels==0]),
                    len(labels[labels==1]), 
                    len(labels[labels==2]), 
                    len(labels[labels==3])], n_features=d, random_state=1, center_box=(5,25))
    B_sorted = B[np.argsort(blob_labels), :]
    D = np.hstack((B_sorted, A_sorted))
    np.random.shuffle(D)
    
    f = plt.figure()
    ax = f.add_subplot(111, projection='3d')
    ax.scatter(B[:,0], B[:,1], B[:,2])
    ax.grid(False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_zticks([])
    f.savefig("figures/table_features", dpi=600) 
    
    f, axes = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True)
    for i in range(3):
        for j in range(4):
            # Only plot the time series if it was actually observed for this row.
            if not np.isnan(D[4*i+j, d:]).any():
                axes[i,j].plot(D[4*i+j, d:])
            axes[i,j].axis('off')
            axes[i,j].set_title("("+",".join("{0:0.0f}".format(D[4*i+j,l]) for l in range(d))+")")
    f.savefig("figures/table_glimpse", dpi=600) 
    return D
	
d = 3
A_missing_table = A.copy()
missing_series = 0.5
A_missing_table[np.random.choice(range(m), size=int(missing_series*m), replace=False), :] = np.nan
D = generate_additional_features(A_missing_table, labels, d)

def cluster_table(D, d):
    k = 4
    m = D.shape[0]
    # Rows whose time series is missing (series are either fully observed or
    # fully missing here, so checking a single time column suffices).
    to_remove = list(np.where(np.isnan(D[:,d+1]))[0])
    to_keep = list(set(range(m)).difference(to_remove))

    # Cumulative-sum only the time series columns; the first d features stay as-is.
    F = D.copy()
    F[:,d:] = D[:,d:].cumsum(axis=1)
    
    # cluster those for which values are provided
    model = KMeans(n_clusters=4, random_state=0)
    model.fit(F[to_keep,:])
    C = model.cluster_centers_
    C[:,d+1:] -= C[:,d:-1].copy()
    
    # use first d dimensions to assign messy samples to clusters
    data = np.tile(D[to_remove, :d], (k,1,1)).transpose((1,2,0))
    clusters = np.tile(C[:,:d], (len(to_remove),1,1)).transpose((0,2,1))
    labels = np.argmin(np.linalg.norm(data-clusters, axis=1), axis=1)
    
    all_labels = np.zeros(m)
    all_labels[to_remove] = labels
    all_labels[to_keep] = model.labels_
    
    f, axes = plt.subplots(nrows=2, ncols=2)
    for i in range(2):
        for j in range(2):
            axes[i,j].plot(C[2*i+j, d:])
            axes[i,j].axis('off')
    f.savefig("figures/table_experiment", dpi=600)
    return C, all_labels.astype(int)
	
C, table_labels = cluster_table(D, d)

In the figure above, each time series has a three-tuple of additional data features. By clustering, we can impute entire time series.

def impute_table(A, C, labels, random_seed=1):
    # Each sample's imputed time series is its cluster's archetype.
    B = C[labels, :]
    
    f, axes = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True)
    for i in range(3):
        for j in range(4):
            axes[i,j].plot(A[4*i+j, d:], alpha=0.3, color='k')
            axes[i,j].plot(B[4*i+j, d:], alpha=1.0, color='b')
            axes[i,j].set_title("("+",".join("{0:0.0f}".format(A[4*i+j,l]) for l in range(d))+")")
            axes[i,j].axis('off')
    f.savefig("figures/table_glimpse_impute", dpi=600)
	
impute_table(D, C, table_labels)

Algorithm and Approach

Here is our method for handling time series of varying lengths, i.e., series that are missing future values: entries to the right of each sample’s own present. It also works for larger tables whose additional features are numeric; for those features, we use the Euclidean metric.

Pivoted clustering algorithm:

  1. Drop samples where any entries are missing
  2. Cluster using a variation of k-means
  3. Collect samples that are missing last time entry
  4. Assign these samples to centroids while ignoring missing features
  5. Update centroids for a subset of features to reflect additional assignees
  6. Repeat steps 3-5 for the next missing entry until all samples are clustered

def cluster_pivot(A, to_remove):
    k = 4
    A_glrm = A.copy()
    B_glrm = A_glrm.cumsum(axis=1)
    
    # fit to complete columns
    cols_to_consider = to_remove[-1][1]
    F_glrm = B_glrm[:,:cols_to_consider]
    model = KMeans(n_clusters=4, random_state=0)
    model.fit(F_glrm)
    C_glrm = model.cluster_centers_
    C_glrm[:,1:] -= C_glrm[:,:-1].copy()
    labels = model.labels_
   
    for s in range(1,len(to_remove)):
        # update centroids
        cols_to_consider_next = to_remove[-(s+1)][1]
        cols_to_zeropad = cols_to_consider_next - cols_to_consider
        C_glrm = np.hstack((C_glrm, np.zeros((k,cols_to_zeropad))))
        for i in range(4):
            C_glrm[i, cols_to_consider:] = np.nanmean(
                A_glrm[labels==i, cols_to_consider:cols_to_consider_next], axis=0) 

        # assign
        rows_to_consider_next = to_remove[-s][0]
        data = np.tile(A_glrm[:rows_to_consider_next, :cols_to_consider_next], (k,1,1)).transpose((1,2,0))
        clusters = np.tile(C_glrm, (rows_to_consider_next,1,1)).transpose((0,2,1))
        labels[:rows_to_consider_next] = np.argmin(np.linalg.norm(data-clusters, axis=1), axis=1)

        cols_to_consider = cols_to_consider_next
    
    # final centroid update (extend the archetypes to the full series length)
    cols_to_consider_next = A_glrm.shape[1]
    cols_to_zeropad = cols_to_consider_next - cols_to_consider
    C_glrm = np.hstack((C_glrm, np.zeros((k,cols_to_zeropad))))
    for i in range(4):
        C_glrm[i, cols_to_consider:] = np.nanmean(
            A_glrm[labels==i, cols_to_consider:cols_to_consider_next], axis=0) 
    
    f, axes = plt.subplots(nrows=2, ncols=2, sharex=True)
    for i in range(2):
        for j in range(2):
            axes[i,j].plot(C_glrm[2*i+j, :])
            axes[i,j].axis('off')
    f.savefig("figures/missing_experiment_glrm", dpi=600)
    return C_glrm, labels

Even messier data

Beyond these toy examples, it’s important to note that not all features will be numeric and time series may not have missing values in a highly structured way. When this is the case, we can no longer use our pivoted clustering algorithm.

Fortunately, our algorithm is a special case of a more general framework called Generalized Low Rank Models (GLRMs). We use this framework to cluster messy data and for more general applications; I am also a contributor to the original GLRM whitepaper.
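
For context, a GLRM models a data table A with a low-rank factorization XY: each entry A_{ij} is fit through a loss L_{ij} chosen to match the feature's type (losses can also span blocks of columns, which is where the earth mover's loss for a whole time series fits in), with regularizers r_i and \tilde{r}_j on the factors. Only the observed entries \Omega enter the objective, which is how missing data is handled:

$$ \underset{X,\,Y}{\text{minimize}} \quad \sum_{(i,j) \in \Omega} L_{ij}\big(x_i y_j,\ A_{ij}\big) \;+\; \sum_{i=1}^{m} r_i(x_i) \;+\; \sum_{j=1}^{n} \tilde{r}_j(y_j) $$

Roughly speaking, our pivoted clustering is the special case with quadratic losses on cumulative sums and the rows of X constrained to indicate cluster membership.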

Conclusion

The key idea in this blog is our choice of earth mover’s distance as a loss function for time series (or any sequential) data within the GLRM framework. We also described a straightforward pivoted-clustering algorithm for a special type of missing data, and showed that it beats two standard methods for clustering messy time series.

If you’re trying to cluster messy time series at scale, or to handle missing data that occurs at random, you’ll find that entire teams are typically needed to tackle these kinds of problems. To make things easier, we at Retina are building tools for customer segmentation and other business applications of time series clustering. If you’d like to learn more, you can schedule a demo with us.

If you are a data scientist and this sounds like work you are interested in doing, Retina is hiring!