Enhance your forecast accuracy with time series clustering
Time series are sequences of data points that occur in successive order over some period of time. We often analyze these data points to make better business decisions or gain competitive advantages. An example is Shimamura Music, who used Amazon Forecast to improve shortage rates and increase business efficiency. Another great example is Arneg, who used Forecast to predict maintenance needs.
AWS offers various low-code/no-code services catered to time series data, which both machine learning (ML) and non-ML practitioners can use to build ML solutions. These include libraries and services like AutoGluon, Amazon SageMaker Canvas, Amazon SageMaker Data Wrangler, Amazon SageMaker Autopilot, and Amazon Forecast.
In this post, we seek to separate a time series dataset into individual clusters that exhibit a higher degree of similarity between their data points and reduce noise. The purpose is to improve accuracy by either training a global model that contains the cluster configuration or having local models specific to each cluster.
We explore how to extract characteristics, also called features, from time series data using the TSFresh library, a Python package for computing a large number of time series characteristics, and how to perform clustering with the K-Means algorithm implemented in the scikit-learn library.
We use the Time Series Clustering using TSFresh + KMeans notebook, which is available on our GitHub repo. We recommend running this notebook on Amazon SageMaker Studio, a web-based integrated development environment (IDE) for ML.
Solution overview
Clustering is an unsupervised ML technique that groups items together based on a distance metric. The Euclidean distance is most commonly used for non-sequential datasets. However, because a time series inherently has an ordering (the timestamp), the Euclidean distance doesn't work well when applied directly: it compares values timestamp by timestamp and can't align similar patterns that are shifted in time, so it effectively ignores the sequential structure of the data. For a more detailed explanation, refer to Time Series Classification and Clustering with Python. A better distance metric that works directly on time series is Dynamic Time Warping (DTW). For an example of clustering based on this metric, refer to Cluster time series data for use with Amazon Forecast.
In this post, we generate features from the time series dataset using the TSFresh Python library for feature extraction. TSFresh calculates a large number of time series characteristics, including the standard deviation, quantiles, and Fourier entropy, among others. This allows us to remove the time dimension of the dataset and apply common techniques that work on data in a flattened format. In addition to TSFresh, we also use `StandardScaler`, which standardizes features by removing the mean and scaling to unit variance, and principal component analysis (PCA) to perform dimensionality reduction. Scaling reduces the distance between data points, which in turn promotes stability in the model training process, and dimensionality reduction allows the model to learn from fewer features while retaining the major trends and patterns, thereby enabling more efficient training.
Data loading
For this example, we use the UCI Online Retail II Data Set and perform basic data cleansing and preparation steps as detailed in the Data Cleaning and Preparation notebook.
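The following is a condensed sketch of what that preparation can look like; the local file name and the exact filtering rules are assumptions based on the public dataset, and the notebook remains the complete reference.

```python
import pandas as pd

# Load the UCI Online Retail II dataset (distributed as an Excel file by the
# UCI Machine Learning Repository); the local file name is an assumption.
df = pd.read_excel("online_retail_II.xlsx")

# Basic cleansing: drop rows without a customer ID, remove cancelled orders
# (invoice numbers starting with "C"), and keep positive quantities only.
df = df.dropna(subset=["Customer ID"])
df = df[~df["Invoice"].astype(str).str.startswith("C")]
df = df[df["Quantity"] > 0]

# Aggregate to one daily demand series per product (StockCode).
timeseries = (
    df.groupby(["StockCode", pd.Grouper(key="InvoiceDate", freq="D")])["Quantity"]
    .sum()
    .reset_index()
)
```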
Feature extraction with TSFresh
Let's start by using TSFresh to extract features from our time series dataset.
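The following is a minimal sketch of the extraction call. It assumes the prepared data is in a DataFrame named `timeseries` with `StockCode`, `InvoiceDate`, and `Quantity` columns; the feature settings may differ in the notebook.

```python
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

# Compute one row of features per StockCode; TSFresh sorts each series by
# InvoiceDate and derives its characteristics from the Quantity values.
extracted_features = extract_features(
    timeseries,
    column_id="StockCode",
    column_sort="InvoiceDate",
    column_value="Quantity",
    default_fc_parameters=EfficientFCParameters(),
)
```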
Note that our data has been converted from a time series to a table comparing `StockCode` values vs. feature values.
Next, we drop all features with n/a values by utilizing the `dropna` method.
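Because TSFresh emits NaN for characteristics it can't compute on a given series, one way to do this (a minimal sketch, continuing from the extraction step) is to drop those feature columns entirely:

```python
# Remove every feature column that contains at least one missing value.
extracted_features = extracted_features.dropna(axis=1)
```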
Then we scale the features using `StandardScaler`. The values in the extracted features consist of both negative and positive values. Therefore, we use `StandardScaler` instead of `MinMaxScaler`.
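A sketch of the scaling step, applied to the feature table from above:

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(extracted_features)
```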
We use PCA to do dimensionality reduction.
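One way to set this up (a sketch; we first fit PCA without limiting the number of components, so we can inspect the explained variance in the next step):

```python
from sklearn.decomposition import PCA

# Fit PCA on the scaled features, keeping all components for now.
pca = PCA()
pca.fit(scaled_features)
```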
And we determine the optimal number of components for PCA.
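A sketch of that inspection, plotting the cumulative explained variance ratio against the number of components:

```python
import numpy as np
import matplotlib.pyplot as plt

# The elbow of the cumulative explained variance curve suggests how many
# components are worth keeping.
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()
```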
The explained variance ratio is the percentage of variance attributed to each of the selected components. Typically, you determine the number of components to include in your model by cumulatively adding the explained variance ratio of each component until you reach 0.8–0.9, to avoid overfitting. The optimal value usually occurs at the elbow of the curve.
As shown in the following chart, the elbow value is approximately 100. Therefore, we use 100 as the number of components for PCA.
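The reduction itself can then look like the following sketch, reusing the scaled features from the earlier step:

```python
from sklearn.decomposition import PCA

# Project the scaled features onto the first 100 principal components.
pca = PCA(n_components=100)
principal_components = pca.fit_transform(scaled_features)
```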
Clustering with K-Means
Now let's use K-Means with the Euclidean distance metric for clustering. In the following code snippet, we determine the optimal number of clusters. Adding more clusters decreases the inertia value, but it also decreases the information contained in each cluster. Additionally, more clusters means more local models to maintain. Therefore, we want a small number of clusters with a relatively low inertia value. The elbow heuristic works well for finding the optimal number of clusters.
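A sketch of that search (the range of cluster counts tried here is an assumption; the notebook may use different bounds):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for a range of cluster counts and record the inertia
# (within-cluster sum of squares, WCSS) for each.
inertia = []
cluster_range = range(1, 15)
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(principal_components)
    inertia.append(kmeans.inertia_)

plt.plot(list(cluster_range), inertia, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia (WCSS)")
plt.show()
```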
The following chart visualizes our findings.
Based on this chart, we have decided to use two clusters for K-Means. We made this decision because the within-cluster sum of squares (WCSS) decreases at the highest rate between one and two clusters. It's important to balance ease of maintenance against model performance and complexity, because although WCSS continues to decrease with more clusters, additional clusters increase the risk of overfitting. Furthermore, slight variations in the dataset can unexpectedly reduce accuracy.
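With that choice made, fitting the final model and attaching the labels back to the items can look like the following sketch (variable names carry over from the earlier snippets):

```python
from sklearn.cluster import KMeans

# Train the final two-cluster model and assign each StockCode a label that
# downstream forecasting models can use.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
extracted_features["cluster"] = kmeans.fit_predict(principal_components)
```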
It's important to note that both clustering methods, K-Means with Euclidean distance (discussed in this post) and the K-Means algorithm with DTW, have their strengths and weaknesses. The best approach depends on the nature of your data and the forecasting methods you're using. Therefore, we highly recommend experimenting with both approaches and comparing their performance to gain a more holistic understanding of your data.
Conclusion
In this post, we discussed the powerful techniques of feature extraction and clustering for time series data. Specifically, we showed how to use TSFresh, a popular Python library for feature extraction, to preprocess your time series data and obtain meaningful features.
When the clustering step is complete, you can either train multiple Forecast models, one per cluster, or use the cluster configuration as a feature. Refer to the Amazon Forecast Developer Guide for information about data ingestion, predictor training, and generating forecasts. If you have item metadata and related time series data, you can also include these as input datasets for training in Forecast. For more information, refer to Start your successful journey with time series forecasting with Amazon Forecast.
About the Authors
Aleksandr Patrushev is an AI/ML Specialist Solutions Architect at AWS, based in Luxembourg. He is passionate about the cloud and machine learning, and the way they could change the world. Outside work, he enjoys hiking, sports, and spending time with his family.
Chong En Lim is a Solutions Architect at AWS. He is always exploring ways to help customers innovate and improve their workflows. In his free time, he loves watching anime and listening to music.
Egor Miasnikov is a Solutions Architect at AWS based in Germany. He is passionate about the digital transformation of our lives, businesses, and the world itself, as well as the role of artificial intelligence in this transformation. Outside of work, he enjoys reading adventure books, hiking, and spending time with his family.