Clustering Stock Prices using Python and GMM
Analysing stock price time series comes with various challenges. One of them is the question of whether stock prices behave similarly within economic sectors and industries.
In this article I investigate whether we can employ a well-known mixture model, the Gaussian mixture model (GMM), to cluster stock prices along their sectors.
A motivation for this experiment is that parts of the literature indicate that a GMM can represent any data set that can be clustered into multiple Gaussian distributions, since the algorithm estimates the parameters of each cluster's Gaussian distribution [5].
The results using GMM are more promising than initially expected.
Method
GMM is a probabilistic latent-variable method that attempts to find a mixture of multidimensional Gaussian probability distributions that best model the input dataset [6, p. 776].
It differs from basic growth models in that it models unobserved sub-populations [3]. It is usually employed in health science, since it is “considered to be a person-centred method because it is predicated on the assumption that people are the agents that affect the outcomes of interest with predictor variables deemed to be properties of those people” [4].
It is important to note that GMM is not a clustering method in the strict sense, but rather an algorithm for density estimation. A concise description of the ‘mechanics’ of GMM can be found in this Wikipedia article. In short, the algorithm selects starting guesses for the location and shape of each component. Then, iteratively, it finds weights encoding the probability of membership of each point in each cluster (E-step) and updates the location, normalisation and shape of each cluster based on all data points (M-step) [6, p. 784].
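To make this concrete, here is a minimal, self-contained sketch on synthetic data (not the stock data used below) showing that scikit-learn’s GaussianMixture exposes exactly these soft membership probabilities; all names and parameter choices here are illustrative, not part of the article’s pipeline:
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic example: two 2-D Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)                    # runs the EM iterations internally
probs = gmm.predict_proba(X)  # E-step output: per-point membership probabilities
labels = gmm.predict(X)       # hard assignment = most probable component
print(probs[:3].round(3), labels[:3])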
You can access the dataset and code (as a Jupyter notebook) in this GitHub repository.
Data
Half-hourly share price data was retrieved through Financial Modelling Prep’s API. I provide these data in the repository’s “data.csv” file, which has the following structure:
- columns: 333 consecutive half-hourly stock price data points (usually between 9 am and 4 pm of the same day), ranging from 13 February 2023 to 17 March 2023. The column names are anonymised.
- observations (rows): 473 shares from 4 selected industries (Technology, Basic Materials, Industrials, Consumer Cyclical). The stock symbols (tickers) are anonymised.
- values: log returns of the open/close prices (each measured with a lag of one time step); a short sketch of this computation follows below.
The sectors the companies belong to are mapped in a separate csv file, “sectors.csv”.
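As a side note, here is a rough sketch of how such lagged log returns could be computed from raw half-hourly prices. The raw-price frame and its ‘open’/‘close’ column names are hypothetical and not part of the repository:
import numpy as np
import pandas as pd

def log_returns(prices: pd.Series) -> pd.Series:
    # Log return with a lag of one time step: ln(p_t / p_{t-1})
    return np.log(prices / prices.shift(1))

# Hypothetical raw half-hourly prices for a single ticker
raw = pd.DataFrame({'open': [10.0, 10.2, 10.1], 'close': [10.1, 10.15, 10.3]})
raw['log_return_open'] = log_returns(raw['open'])
raw['log_return_close'] = log_returns(raw['close'])
print(raw)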
We read the data from data.csv:
import numpy as np, pandas as pd
df_in = pd.read_csv('data.csv', index_col=0)
df_in.head()
Normalisation for further handling:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

def scale_df(df, symbols=None):
    # If no ticker series is passed, assume the first column of df holds the tickers
    if symbols is None:
        symbols = df['symbol']
        df = df.iloc[:, 1:]
    arr_scaled = scaler.fit_transform(df)
    df_scaled = pd.DataFrame(arr_scaled)
    df_scaled.insert(loc=0, column='symbol', value=symbols.values)  # insert ticker column
    return df_scaled

df_scaled = scale_df(df_in)
df_scaled.head()
Feature generation
The cesium library computes distributional statistics on the observations. As it applies a domain-agnostic approach to time series, I make use of it to generate a number of features:
from cesium import featurize as ft
times = []
log_return_values = []
for idx, row in df_scaled.iterrows():
    log_return_values.append(row.values[1:].astype(float))  # skip the 'symbol' column
    times.append(np.arange(len(row) - 1))
features_to_use = ['amplitude',
'percent_beyond_1_std',
'percent_close_to_median',
'weighted_average',
'maximum',
'minimum',
'median']
df_features = ft.featurize_time_series(
times = times,
values = log_return_values,
errors = None,
features_to_use = features_to_use,
scheduler = None)
# Remove the column MultiIndex (drop the 'channel' level), scale, and restore the headers
df_features = df_features.T.reset_index().drop(['channel'], axis=1).iloc[:,1:].T
df_features_scaled = scale_df(df_features, symbols=df_in.symbol)
df_features_scaled = df_features_scaled.rename(columns={0:'amplitude', 1:'percent_beyond_1_std', 2:'percent_close_to_median',
3:'weighted_average', 4:'max', 5:'min', 6:'median'})
df_features_scaled.head()
GMM
As explained above, GMM is an algorithm for density estimation, in contrast to clustering methods such as k-means. For the choice of appropriate hyperparameters this means that:
- the number of components does not need to match the number of expected clusters; the Gaussian components are rather placed around regions of density in the data, so a model with too many components is prone to overfitting. In the notebook I loop through 2–15 components.
- my experiments indicate that the covariance type is crucial. Of the 4 choices (tied, full, diag, spherical), ‘diag’ appears best for the data at hand; see the scikit-learn GMM documentation for the details on these parameters. This is illustrated using the BIC, following the scikit-learn documentation on model selection (for the full code see code.ipynb in my GitHub repository), with the following result using grid search:
from sklearn.model_selection import GridSearchCV
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
import seaborn as sns

def gmm_bic_score(estimator, X):
    """Callable to pass to GridSearchCV that will use the BIC score."""
    # GridSearchCV maximises the score, so return the negative BIC
    return -estimator.bic(X)

def gmm_aic_score(estimator, X):
    """Callable to pass to GridSearchCV that will use the AIC score."""
    return -estimator.aic(X)

def plot_model_grids(X, criterion_type, max_params=25):
    param_grid = {
        "n_components": range(1, max_params),
        "covariance_type": ["spherical", "tied", "diag", "full"],
    }
    score_func = gmm_aic_score if criterion_type == 'aic' else gmm_bic_score
    grid_search = GridSearchCV(
        GaussianMixture(), param_grid=param_grid, scoring=score_func
    )
    grid_search.fit(X)
    df = pd.DataFrame(grid_search.cv_results_)[
        ["param_n_components", "param_covariance_type", "mean_test_score"]
    ]
    df["mean_test_score"] = -df["mean_test_score"]  # back to the raw criterion value
    df = df.rename(
        columns={
            "param_n_components": "Number of components",
            "param_covariance_type": "Type of covariance",
            "mean_test_score": "score",
        }
    )
    print(df.sort_values(by="score").head())  # lowest-criterion configurations
    sns.catplot(
        data=df,
        kind="bar",
        x="Number of components",
        y="score",
        hue="Type of covariance",
    )
    plt.show()

plot_model_grids(df_features, 'bic')
Let’s compare with the AIC:
plot_model_grids(df_features, 'aic')
The ‘full’ covariance type quickly rises to (large) positive values, whereas ‘tied’ varies much less. This does not necessarily mean that the ‘tied’ models are robust.
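For completeness, here is a minimal sketch of the clustering step itself (the full version lives in the notebook). It assumes the ‘diag’ covariance type discussed above, an illustrative choice of 4 components, and that sectors.csv holds ‘symbol’ and ‘sector’ columns in the same row order as data.csv; these names and choices are assumptions:
from sklearn.mixture import GaussianMixture

# Fit a GMM on the cesium features; 4 components and 'diag' covariance are illustrative choices
gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=42)
cluster_labels = gmm.fit_predict(df_features)

# Compare the resulting clusters with the known sector mapping
sectors = pd.read_csv('sectors.csv')        # assumed columns: 'symbol', 'sector'
comparison = pd.DataFrame({
    'symbol': df_in['symbol'].values,
    'sector': sectors['sector'].values,     # assumes the same row order as data.csv
    'cluster': cluster_labels,
})
print(pd.crosstab(comparison['sector'], comparison['cluster']))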
Conclusion
The results look promising enough to make GMM a candidate for further investigation.
I hope this article is informative. Your feedback is warmly welcome!
Sources:
[1] Taushanov, Zh., Ghisletta, P.: The Use of a Hidden Mixture Transition Distribution Model in Clustering Few but Long Continuous Sequences: An Illustration with Cognitive Skills Data, research article, University of Geneva, September 2020
[2] Nielsen, A.: Practical Time Series Analysis, O’Reilly Media, 2020
[3] Ram, N.; Grimm, K. J.: Growth Mixture Modeling: A Method for Identifying Differences in Longitudinal Change Among Unobserved Groups, Int J Behav Dev. 2009; 33(6): 565–576
[4] Kwon, J. Y., Sawatzky, R., Baumbusch, J. et al.: Growth mixture models: a case example of the longitudinal analysis of patient‐reported outcomes data captured by a clinical registry, BMC Med Res Methodol 21, 79 (2021). https://doi.org/10.1186/s12874-021-01276-z
[5] Kumar, A.: Gaussian Mixture Models: What are they & when to use?, blog post, April 2022
[6] VanderPlas, J.: Python Data Science Handbook, second edition, O’Reilly Media, 2023