Integrating custom dependencies in Amazon SageMaker Canvas workflows

When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations might need to consider external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects might require specialized dependencies and libraries that aren't included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through each stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.
SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:
- Over 300 built-in transformation steps
- Feature engineering capabilities
- Data normalization and cleansing capabilities
- A custom code editor supporting Python, PySpark, and SparkSQL
In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler flow. Using this approach, you can run custom scripts that rely on modules not inherently supported by SageMaker Canvas.
Solution overview
To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.
The solution follows three main steps:
- Upload custom scripts and dependencies to Amazon S3
- Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code
- Train and export the model
The following diagram is the architecture for the solution.
In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various shipping metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical shipping patterns and characteristics.
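As context, the dataset join performed later in the flow can be sketched in pandas; in this sketch, only the ProductId join key reflects the actual sample datasets, and the other columns and values are illustrative placeholders.

```python
import pandas as pd

# Illustrative stand-ins for the two Canvas sample datasets; only the
# ProductId join key reflects the real schema.
shipping_logs = pd.DataFrame(
    {"ProductId": [1, 2, 3], "OnTimeDelivery": [1, 0, 1]}
)
product_descriptions = pd.DataFrame(
    {"ProductId": [1, 2], "ProductName": ["Screen A", "Screen B"]}
)

# Inner join: keep only rows whose ProductId appears in both datasets
joined = shipping_logs.merge(product_descriptions, on="ProductId", how="inner")
print(len(joined))  # 2 rows survive the inner join
```

Rows with a ProductId present in only one dataset (here, ProductId 3) are discarded by the inner join.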
Prerequisites
As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don't already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.
Create the data flow
To create the data flow, follow these steps:
- On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven't done so already.
- After your domain is created, choose Open Canvas.
- In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.
The initial data flow will open with one source and one data type.
- At the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv.
- Choose Next, as shown in the following screenshot. Then choose Import.
- After both datasets have been added, select the plus sign. From the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.
- To perform an inner join on the ProductID column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.
- After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.
The dataset contains XShippingDistance (long) and YShippingDistance (long) columns. For our purposes, we want to use a custom function that finds the total distance using the X and Y coordinates and then drops the individual coordinate columns. For this example, we find the total distance using a function that relies on the mpmath library.
- To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:
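The code block from the original post is not reproduced here; a minimal sketch of such a transform, assuming a Euclidean distance formula (the exact formula in the original is unknown), could look like the following.

```python
# Sketch of a custom transform that relies on mpmath; in Canvas, the
# joined dataset is provided to this editor as the pandas DataFrame `df`.
from mpmath import mp, sqrt  # fails in Canvas: mpmath isn't preinstalled

def calculate_total_distance(df):
    # Euclidean distance from the X and Y shipping coordinates
    df["TotalDistance"] = [
        float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))
        for x, y in zip(df["XShippingDistance"], df["YShippingDistance"])
    ]
    # Drop the individual coordinate columns
    return df.drop(columns=["XShippingDistance", "YShippingDistance"])

# In the Canvas editor, the transform is applied to the provided `df`:
# df = calculate_total_distance(df)
```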
Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.
This error occurs because mpmath isn't a module that's inherently supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the use of a custom function differently.
Zip the script and dependencies
To use a function that relies on a module that isn't natively supported in Canvas, the custom script must be zipped with the module(s) it relies on. For this example, we used our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.
The script.py file contains two functions: one that is compatible with the Python (Pandas) runtime (the calculate_total_distance function), and one that is compatible with the Python (PySpark) runtime (the udf_total_distance function).
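The original script is not reproduced in the post; under the assumption of a Euclidean distance calculation, script.py might be sketched as follows (the UDF shape for the PySpark runtime is also an assumption).

```python
# script.py -- hypothetical sketch of the packaged script
from mpmath import mp, sqrt

def total_distance(x, y):
    # Euclidean distance computed with mpmath
    return float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))

def calculate_total_distance(df):
    # Python (Pandas) runtime: transform the whole DataFrame at once
    df["TotalDistance"] = [
        total_distance(x, y)
        for x, y in zip(df["XShippingDistance"], df["YShippingDistance"])
    ]
    return df.drop(columns=["XShippingDistance", "YShippingDistance"])

def udf_total_distance():
    # Python (PySpark) runtime: return a Spark UDF wrapping the same logic.
    # pyspark is imported lazily so the Pandas path doesn't require it.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    return udf(total_distance, DoubleType())
```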
To make sure the script can run, install mpmath into the same directory as script.py by running pip install mpmath. Run zip -r my_project.zip to create a .zip file containing the script and the mpmath installation. The current directory now contains the .zip file, our Python script, and the installation our script depends on, as shown in the following screenshot.
Upload to Amazon S3
After creating the .zip file, upload it to an Amazon S3 bucket.
After the zip file has been uploaded to Amazon S3, it's accessible in SageMaker Canvas.
Run the custom script
Return to the data flow in SageMaker Canvas, replace the prior custom function code with the following code, and choose Update.
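The replacement code from the original post is not shown here; the sketch below illustrates the general mechanics under stated assumptions: the bucket name, key, and paths are placeholders, and the boto3 download from S3 appears only in comments.

```python
# Sketch of a transform that loads zipped dependencies at runtime.
import importlib
import sys
import zipfile

def load_zipped_module(zip_path, module_name, extract_dir=None):
    """Unzip a dependency bundle and import one module from it."""
    extract_dir = extract_dir or zip_path + "_extracted"
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    if extract_dir not in sys.path:
        # Put the extracted bundle (script.py plus mpmath) on the path
        sys.path.insert(0, extract_dir)
    return importlib.import_module(module_name)

# In Canvas, the bundle is first fetched from S3 (names are placeholders):
#   import boto3
#   boto3.client("s3").download_file(
#       "my-dependency-bucket", "my_project.zip", "/tmp/my_project.zip")
#   script = load_zipped_module("/tmp/my_project.zip", "script")
#   function_name = "calculate_total_distance"  # Python (Pandas) runtime
#   df = getattr(script, function_name)(df)
```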
This example code unzips the .zip file and adds the required dependencies to the local path so that they're available to the function at runtime. Because mpmath was added to the local path, you can now call a function that relies on this external library.
The preceding code runs using the Python (Pandas) runtime and the calculate_total_distance function. To use the Python (PySpark) runtime, update the function_name variable to call the udf_total_distance function instead.
Complete the data flow
As a final step, remove irrelevant columns before training the model. Follow these steps:
- On the SageMaker Canvas console, select + Add transform. From the dropdown menu, select Manage columns.
- Under Transform, choose Drop column. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.
The final dataset should contain 13 columns. The complete data flow is pictured in the following image.
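For reference, the pandas equivalent of this drop step is a single call; the placeholder frame below contains only the dropped columns plus one illustrative survivor.

```python
import pandas as pd

# Placeholder frame containing the columns dropped in this step
df = pd.DataFrame({"ProductId_0": [1], "ProductId_1": [1],
                   "OrderID": [100], "TotalDistance": [5.0]})
# Drop the duplicate join keys and the order identifier
df = df.drop(columns=["ProductId_0", "ProductId_1", "OrderID"])
print(list(df.columns))  # ['TotalDistance']
```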
Train the model
To train the model, follow these steps:
- At the top right of the page, select Create model and name your dataset and model.
- Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the following screenshot.
When building the model, you can choose to run a Quick build or a Standard build. A Quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A Standard build prioritizes accuracy over latency, but the model takes longer to train.
Results
After the model build is complete, you can view the model's accuracy, along with metrics like F1, precision, and recall. In the case of a Standard build, the model achieved 94.5% accuracy.
After the model training is complete, there are four ways you can use your model:
- Deploy the model directly from SageMaker Canvas to an endpoint
- Add the model to the SageMaker Model Registry
- Export your model to a Jupyter Notebook
- Send your model to Amazon QuickSight to be used in dashboard visualizations
Clean up
To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you're done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.
If you created an S3 bucket specifically for this example, you might also want to empty and delete your bucket.
Summary
In this post, we demonstrated how you can upload custom dependencies to Amazon S3 and integrate them into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:
- Package custom code and dependencies into a .zip file
- Store and access these dependencies from Amazon S3
- Implement custom data transformations in SageMaker Data Wrangler
- Train a predictive model using the transformed data
This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 included functions.
To try custom transforms yourself, refer to the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring these related posts:
About the Author
Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new places.