Clustering Stock Prices using Python and GMM
Analysing stock price time series comes with various challenges. One of them is the question of whether stock prices behave similarly within economic sectors and industries.
In this article I investigate whether we can employ a well-known mixture model, the Gaussian mixture model (GMM), to cluster stock prices along their sectors.
A motivation for this experiment is that parts of the literature indicate that a GMM can represent any data set that can be clustered into multiple Gaussian distributions, since the algorithm estimates the parameters of each cluster's Gaussian distribution [5].
The results using GMM are more promising than initially expected.
Method
GMM is a probabilistic latent-variable method that attempts to find a mixture of multidimensional Gaussian probability distributions that best model the input dataset [6, p. 776].
It differs from basic growth models in that it models unobserved sub-populations [3]. It is usually employed in health science, since it is “considered to be a person-centred method because it is predicated on the assumption that people are the agents that affect the outcomes of interest with predictor variables deemed to be properties of those people” [4].
It is important to note that GMM is not a clustering method in the strict sense, but rather an algorithm for density estimation. A concise description of the ‘mechanics’ of GMM can be found in this Wikipedia article. In short, the algorithm selects starting guesses for the location and shape of each component. Then, iteratively, it finds weights encoding the probability of membership of each point in each cluster (E-step) and updates the location, normalisation and shape of each cluster based on all data points (M-step) [6, p. 784].
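To make this concrete, here is a minimal, self-contained sketch on synthetic data (not the stock data used below) showing that scikit-learn’s GaussianMixture exposes exactly these soft membership probabilities; all names and parameter choices here are illustrative, not part of the article’s pipeline:
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic example: two 2-D Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)                    # runs the EM iterations internally
probs = gmm.predict_proba(X)  # E-step output: per-point membership probabilities
labels = gmm.predict(X)       # hard assignment = most probable component
print(probs[:3].round(3), labels[:3])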
You can access the dataset and code (as a Jupyter notebook) in this GitHub repository.
Data
Half-hourly share price data was retrieved through Financial Modelling Prep’s API. I provide these data in the repository’s “data.csv” file, which has the following structure:
- columns: 333 consecutive half-hourly stock price data points (usually between 9 am and 4 pm of the same day), ranging from 13 February 2023 to 17 March 2023. The column names are anonymised.
- observations (rows): 473 shares from 4 selected industries (Technology, Basic Materials, Industrials, Consumer Cyclical). The stock symbols (tickers) are anonymised.
- values: log returns of the open/close prices (each measured with a lag of one time step); a short sketch of this computation follows below.
The sectors the companies belong to are mapped in a separate csv file, “sectors.csv”.
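As a side note, here is a rough sketch of how such lagged log returns could be computed from raw half-hourly prices. The raw-price frame and its ‘open’/‘close’ column names are hypothetical and not part of the repository:
import numpy as np
import pandas as pd

def log_returns(prices: pd.Series) -> pd.Series:
    # Log return with a lag of one time step: ln(p_t / p_{t-1})
    return np.log(prices / prices.shift(1))

# Hypothetical raw half-hourly prices for a single ticker
raw = pd.DataFrame({'open': [10.0, 10.2, 10.1], 'close': [10.1, 10.15, 10.3]})
raw['log_return_open'] = log_returns(raw['open'])
raw['log_return_close'] = log_returns(raw['close'])
print(raw)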
We read the data from data.csv:
import numpy as np, pandas as pd
df_in = pd.read_csv('data.csv', index_col=0)
df_in.head()
Normalisation for further handling:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

def scale_df(df, symbols=None):
    # If no ticker series is passed, assume the first column of df holds the tickers
    if symbols is None:
        symbols = df['symbol']
        df = df.iloc[:, 1:]
    arr_scaled = scaler.fit_transform(df)
    df_scaled = pd.DataFrame(arr_scaled)
    df_scaled.insert(loc=0, column='symbol', value=symbols.values)  # insert ticker column
    return df_scaled

df_scaled = scale_df(df_in)
df_scaled.head()
Feature generation
The cesium library computes distributional statistics on the observations. As it applies a domain-agnostic approach to time series, I make use of it to generate a number of features:
from cesium import featurize as ft
times = []
log_return_values = []
for idx, row in df_scaled.iterrows():
    log_return_values.append(row.values[1:].astype(float))  # skip the 'symbol' column
    times.append(np.arange(len(row) - 1))
features_to_use = ['amplitude',
'percent_beyond_1_std',
'percent_close_to_median',
'weighted_average',
'maximum',
'minimum',
'median']
df_features = ft.featurize_time_series(
times = times,
values = log_return_values,
errors = None,
features_to_use = features_to_use,
scheduler = None)
# Remove the column MultiIndex (drop the 'channel' level), scale, and restore the headers
df_features = df_features.T.reset_index().drop(['channel'], axis=1).iloc[:,1:].T
df_features_scaled = scale_df(df_features, symbols=df_in.symbol)
df_features_scaled = df_features_scaled.rename(columns={0:'amplitude', 1:'percent_beyond_1_std', 2:'percent_close_to_median',
3:'weighted_average', 4:'max', 5:'min', 6:'median'})
df_features_scaled.head()
GMM
As explained above, GMM is an algorithm for density estimation, in contrast to clustering methods such as k-means. For the choice of appropriate hyperparameters this means that:
- the number of components does not need to match the number of expected clusters; the Gaussian components are rather placed around regions of density in the data, so a model with too many components is prone to overfitting. In the notebook I loop through 2–15 components.
- my experiments indicate that the covariance type is crucial. Of the 4 choices (tied, full, diag, spherical), ‘diag’ appears best for the data at hand; see the scikit-learn GMM documentation for the details on these parameters. This is illustrated using the BIC, following the scikit-learn documentation on model selection (for the full code see code.ipynb in my GitHub repository), with the following result using grid search:
from sklearn.model_selection import GridSearchCV
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
import seaborn as sns

def gmm_bic_score(estimator, X):
    """Callable to pass to GridSearchCV that will use the BIC score."""
    # GridSearchCV maximises the score, so return the negative BIC
    return -estimator.bic(X)

def gmm_aic_score(estimator, X):
    """Callable to pass to GridSearchCV that will use the AIC score."""
    return -estimator.aic(X)

def plot_model_grids(X, criterion_type, max_params=25):
    param_grid = {
        "n_components": range(1, max_params),
        "covariance_type": ["spherical", "tied", "diag", "full"],
    }
    score_func = gmm_aic_score if criterion_type == 'aic' else gmm_bic_score
    grid_search = GridSearchCV(
        GaussianMixture(), param_grid=param_grid, scoring=score_func
    )
    grid_search.fit(X)
    df = pd.DataFrame(grid_search.cv_results_)[
        ["param_n_components", "param_covariance_type", "mean_test_score"]
    ]
    df["mean_test_score"] = -df["mean_test_score"]  # back to the raw criterion value
    df = df.rename(
        columns={
            "param_n_components": "Number of components",
            "param_covariance_type": "Type of covariance",
            "mean_test_score": "score",
        }
    )
    print(df.sort_values(by="score").head())  # lowest-criterion configurations
    sns.catplot(
        data=df,
        kind="bar",
        x="Number of components",
        y="score",
        hue="Type of covariance",
    )
    plt.show()

plot_model_grids(df_features, 'bic')
Let’s compare with the AIC:
plot_model_grids(df_features, 'aic')
The ‘full’ covariance type quickly rises to (large) positive values, whereas ‘tied’ varies much less. This does not necessarily mean that the ‘tied’ models are robust.
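For completeness, here is a minimal sketch of the clustering step itself (the full version lives in the notebook). It assumes the ‘diag’ covariance type discussed above, an illustrative choice of 4 components, and that sectors.csv holds ‘symbol’ and ‘sector’ columns in the same row order as data.csv; these names and choices are assumptions:
from sklearn.mixture import GaussianMixture

# Fit a GMM on the cesium features; 4 components and 'diag' covariance are illustrative choices
gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=42)
cluster_labels = gmm.fit_predict(df_features)

# Compare the resulting clusters with the known sector mapping
sectors = pd.read_csv('sectors.csv')        # assumed columns: 'symbol', 'sector'
comparison = pd.DataFrame({
    'symbol': df_in['symbol'].values,
    'sector': sectors['sector'].values,     # assumes the same row order as data.csv
    'cluster': cluster_labels,
})
print(pd.crosstab(comparison['sector'], comparison['cluster']))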
Conclusion
The results look promising enough to make GMM a candidate for further investigation.
I hope this article is informative. Your feedback is warmly welcome!
Sources:
[1] Taushanov, Zh., Ghisletta, P.: The Use of a Hidden Mixture Transition Distribution Model in Clustering Few but Long Continuous Sequences: An Illustration with Cognitive Skills Data, research article, University of Geneva, September 2020
[2] Nielsen, A.: Practical Time Series Analysis, O’Reilly Media, 2020
[3] Ram, N.; Grimm, K. J.: Growth Mixture Modeling: A Method for Identifying Differences in Longitudinal Change Among Unobserved Groups, Int J Behav Dev. 2009; 33(6): 565–576
[4] Kwon, J. Y., Sawatzky, R., Baumbusch, J. et al.: Growth mixture models: a case example of the longitudinal analysis of patient‐reported outcomes data captured by a clinical registry, BMC Med Res Methodol 21, 79 (2021). https://doi.org/10.1186/s12874-021-01276-z
[5] Kumar, A.: Gaussian Mixture Models: What are they & when to use?, blog post, April 2022
[6] VanderPlas, J.: Python Data Science Handbook, second edition, O’Reilly Media, 2023