Integrating custom dependencies in Amazon SageMaker Canvas workflows


When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations might need to consider external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects might require specialized dependencies and libraries that aren't included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.

Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through each stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.

SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:

  • Over 300 built-in transformation steps
  • Feature engineering capabilities
  • Data normalization and cleansing capabilities
  • A custom code editor supporting Python, PySpark, and SparkSQL

In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler flow. Using this approach, you can run custom scripts that rely on modules not inherently supported by SageMaker Canvas.

Solution overview

To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.

The solution follows three main steps:

  1. Upload custom scripts and dependencies to Amazon S3
  2. Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code
  3. Train and export the model

The following diagram is the architecture for the solution.

Solution architecture

In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various shipping metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical shipping patterns and characteristics.

Prerequisites

As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don't already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.

Create the data flow

To create the data flow, follow these steps:

  1. On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven't done so already.
  2. After your domain is created, choose Open Canvas.

SageMaker Canvas homepage

  3. In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.

Data Flow creation

The initial data flow will open with one source and one data type.

  4. At the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv.
  5. Choose Next, as shown in the following screenshot. Then choose Import.

Dataset selection

  6. After both datasets have been added, select the plus sign. From the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.

Join datasets

  7. To perform an inner join on the ProductId column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.

Join datasets

  8. After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.

The dataset contains XShippingDistance (long) and YShippingDistance (long) columns. For our purposes, we want to use a custom function that finds the total distance using the X and Y coordinates and then drops the individual coordinate columns. For this example, we find the total distance using a function that relies on the mpmath library.

  9. To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:
from mpmath import sqrt  # Import sqrt from mpmath

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

df = calculate_total_distance(df)

Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.

mpmath module error

This error occurs because mpmath isn't a module that's inherently supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the use of a custom function differently.
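If you want to confirm whether a given module is available in the Canvas runtime before relying on it, a quick standard-library check like the following (run from the same custom transform editor) is enough:

import importlib.util

# find_spec returns None when the module can't be imported in this runtime
print(importlib.util.find_spec("mpmath"))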

Zip the script and dependencies

To use a function that relies on a module that isn't natively supported in Canvas, the custom script must be zipped with the module(s) it relies on. For this example, we used our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.

The script.py file contains two functions: one function that's compatible with the Python (Pandas) runtime (function calculate_total_distance), and one that's compatible with the Python (PySpark) runtime (function udf_total_distance).

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from mpmath import sqrt  # Import sqrt from mpmath

    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

def udf_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    spark = SparkSession.builder \
        .master("local") \
        .appName("DistanceCalculation") \
        .getOrCreate()

    def calculate_distance(x, y):
        import sys

        # Add the path to mpmath
        mpmath_path = "/tmp/maths"
        if mpmath_path not in sys.path:
            sys.path.insert(0, mpmath_path)

        from mpmath import sqrt
        return float(sqrt(x**2 + y**2))

    # Register and apply the UDF
    distance_udf = udf(calculate_distance, FloatType())
    df = df.withColumn(new_col, distance_udf(df[x_col], df[y_col]))
    df = df.drop(x_col, y_col)

    return df

To make sure the script can run, install mpmath into the same directory as script.py by running pip install mpmath --target . from that directory.

Run zip -r my_project.zip . to create a .zip file containing the function and the mpmath installation. The current directory now contains a .zip file, our Python script, and the installation our script relies on, as shown in the following screenshot.

directory with zip file
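If you prefer to script the packaging step, here's a minimal sketch in Python under the same assumptions (script.py in the current directory; the staging folder name is arbitrary), producing a zip whose root contains both the script and the mpmath package:

import pathlib
import shutil
import subprocess

# Hypothetical staging folder; its contents end up at the root of the zip
staging = pathlib.Path("staging")
staging.mkdir(exist_ok=True)
shutil.copy("script.py", staging / "script.py")

# Install mpmath into the staging folder so it ships alongside the script
subprocess.run(["pip", "install", "mpmath", "--target", str(staging)], check=True)

# Create my_project.zip from the staging folder's contents
shutil.make_archive("my_project", "zip", root_dir=staging)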

Upload to Amazon S3

After creating the .zip file, upload it to an Amazon S3 bucket.

upload zip file to S3

After the zip file has been uploaded to Amazon S3, it’s accessible in SageMaker Canvas.
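Alternatively, you can upload the file programmatically; the following sketch uses boto3, with the bucket name and key matching the values used in the next section (adjust these to your own bucket):

import boto3

s3_client = boto3.client("s3")

# Upload the packaged script and dependencies; the key is reused in the data flow below
s3_client.upload_file(
    Filename="my_project.zip",
    Bucket="canvasdatabuckett",
    Key="functions/my_project.zip",
)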

Run the custom script

Return to the data flow in SageMaker Canvas, replace the prior custom function code with the following code, and choose Update.

import zipfile
import boto3
import sys
from pathlib import Path
import shutil
import importlib.util


def load_script_and_dependencies(bucket_name, zip_key, extract_to):
    """
    Downloads a zip file from S3, unzips it, and ensures dependencies are available.

    Args:
        bucket_name (str): Name of the S3 bucket.
        zip_key (str): Key for the .zip file in the bucket.
        extract_to (str): Directory to extract files to.

    Returns:
        str: Path to the extracted folder containing the script and dependencies.
    """

    s3_client = boto3.client("s3")

    # Local path for the zip file
    zip_local_path = "/tmp/dependencies.zip"

    # Download the .zip file from S3
    s3_client.download_file(bucket_name, zip_key, zip_local_path)
    print(f"Downloaded zip file from S3: {zip_key}")

    # Unzip the file
    try:
        with zipfile.ZipFile(zip_local_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Extracted files to {extract_to}")
    except Exception as e:
        raise RuntimeError(f"Failed to extract zip file: {e}")

    # Add the extracted folder to the Python path
    if extract_to not in sys.path:
        sys.path.insert(0, extract_to)

    return extract_to


def call_function_from_script(script_path, function_name, df):
    """
    Dynamically loads a function from a Python script using importlib.
    """
    try:
        # Get the script name from the path
        module_name = script_path.split('/')[-1].replace('.py', '')

        # Load the module specification
        spec = importlib.util.spec_from_file_location(module_name, script_path)
        if spec is None:
            raise ImportError(f"Could not load specification for module {module_name}")

        # Create the module
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module

        # Execute the module
        spec.loader.exec_module(module)

        # Get the function from the module
        if not hasattr(module, function_name):
            raise AttributeError(f"Function '{function_name}' not found in the script.")

        loaded_function = getattr(module, function_name)

        # Clean up: remove the module from sys.modules after execution
        del sys.modules[module_name]

        # Call the function
        return loaded_function(df)

    except Exception as e:
        raise RuntimeError(f"Error loading or executing function: {e}")


bucket_name = "canvasdatabuckett"  # S3 bucket name
zip_key = 'functions/my_project.zip'  # S3 path to the zip file with our custom dependencies
script_name = "script.py"  # Name of the script in the zip file
function_name = "calculate_total_distance"  # Name of the function to call from our script
extract_to = '/tmp/maths'  # Local path for our custom script and dependencies

# Step 1: Load the script and dependencies
extracted_path = load_script_and_dependencies(bucket_name, zip_key, extract_to)

# Step 2: Call the function from the script
script_path = f"{extracted_path}/{script_name}"
df = call_function_from_script(script_path, function_name, df)

This example code unzips the .zip file and adds the required dependencies to the local path so that they're available to the function at runtime. Because mpmath was added to the local path, you can now call a function that relies on this external library.

The preceding code runs using the Python (Pandas) runtime and the calculate_total_distance function. To use the Python (PySpark) runtime, update the function_name variable to call the udf_total_distance function instead.
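In other words, only the function_name assignment needs to change between runtimes:

# Python (Pandas) runtime
function_name = "calculate_total_distance"

# Python (PySpark) runtime
function_name = "udf_total_distance"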

Complete the data flow

As a final step, remove irrelevant columns before training the model. Follow these steps:

  1. On the SageMaker Canvas console, select + Add transform. From the dropdown menu, select Manage columns.
  2. Under Transform, choose Drop column. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.

columns to drop

The final dataset should contain 13 columns. The complete data flow is pictured in the following image.

complete data Flow

Train the model

To train the model, follow these steps:

  1. At the top right of the page, select Create model and name your dataset and model.
  2. Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the screenshot below.

model creation page

When building the model, you can choose to run a Quick build or a Standard build. A Quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A Standard build prioritizes accuracy over latency, but the model takes longer to train.

Results

After the model build is complete, you can view the model's accuracy, along with metrics like F1, precision, and recall. In the case of a Standard build, the model achieved 94.5% accuracy.

model accuracy page

After the model training is complete, there are four ways you can use your model:

  1. Deploy the model directly from SageMaker Canvas to an endpoint
  2. Add the model to the SageMaker Model Registry
  3. Export your model to a Jupyter Notebook
  4. Send your model to Amazon QuickSight to be used in dashboard visualizations
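As an illustration of the first option, a deployed endpoint can be invoked with the SageMaker runtime API. The following is a minimal sketch, assuming a hypothetical endpoint name and a single CSV-formatted row of features:

import boto3

runtime = boto3.client("sagemaker-runtime")

# "canvas-shipping-model" is a hypothetical name; use the one Canvas shows after deployment
response = runtime.invoke_endpoint(
    EndpointName="canvas-shipping-model",
    ContentType="text/csv",
    Body="...",  # one row of features, in the same column order as the training data
)
print(response["Body"].read().decode("utf-8"))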

Clean up

To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you're done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.

If you created an S3 bucket specifically for this example, you might also want to empty and delete your bucket.

log out of Canvas

Summary

In this post, we demonstrated how you can upload custom dependencies to Amazon S3 and integrate them into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:

  1. Package custom code and dependencies into a .zip file
  2. Store and access these dependencies from Amazon S3
  3. Implement custom data transformations in SageMaker Data Wrangler
  4. Train a predictive model using the transformed data

This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 included functions.

To try custom transforms yourself, refer to the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring related posts on the AWS Machine Learning Blog.


About the Author

author photo

Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new places.
