Fast ML experimentation for enterprises with Amazon SageMaker AI and Comet


This post was written with Sarah Ostermeier from Comet.

As enterprise organizations scale their machine learning (ML) initiatives from proof of concept to production, the complexity of managing experiments, tracking model lineage, and maintaining reproducibility grows exponentially. This is primarily because data scientists and ML engineers continually explore different combinations of hyperparameters, model architectures, and dataset versions, producing vast amounts of metadata that must be tracked for reproducibility and compliance. As ML model development scales across multiple teams and regulatory requirements intensify, tracking experiments becomes even more complex. With increasing AI regulations, especially in the EU, organizations now require detailed audit trails of model training data, performance expectations, and development processes, making experiment tracking a business necessity and not just a best practice.

Amazon SageMaker AI provides the managed infrastructure enterprises need to scale ML workloads, handling compute provisioning, distributed training, and deployment without infrastructure overhead. However, teams still need robust experiment tracking, model comparison, and collaboration capabilities that go beyond basic logging.

Comet is a comprehensive ML experiment management platform that automatically tracks, compares, and optimizes ML experiments across the entire model lifecycle. It provides data scientists and ML engineers with powerful tools for experiment tracking, model monitoring, hyperparameter optimization, and collaborative model development. It also offers Opik, Comet’s open source platform for LLM observability and development.

Comet is available in SageMaker AI as a Partner AI App: a fully managed experiment management capability with enterprise-grade security, seamless workflow integration, and a straightforward procurement process through AWS Marketplace.

The combination addresses the needs of an enterprise ML workflow end-to-end, where SageMaker AI handles infrastructure and compute, and Comet provides the experiment management, model registry, and production monitoring capabilities that teams require for regulatory compliance and operational efficiency. In this post, we demonstrate a complete fraud detection workflow using SageMaker AI with Comet, showcasing the reproducibility and audit-ready logging that enterprises need today.

Enterprise-ready Comet on SageMaker AI

Before proceeding to the setup instructions, organizations must identify their operating model and, based on that, decide how Comet is going to be set up. We recommend implementing Comet using a federated operating model. In this architecture, Comet is centrally managed and hosted in a shared services account, and each data science team maintains a fully autonomous environment. Each operating model comes with its own set of benefits and limitations. For more information, refer to SageMaker Studio Administration Best Practices.

Let’s dive into the setup of Comet in SageMaker AI. Large enterprises often have the following personas:

  • Administrators – Responsible for setting up the common infrastructure services and environment for the use case teams
  • Users – ML practitioners from use case teams who use the environments set up by the platform team to solve their business problems

In the following sections, we go through each persona’s journey.

Comet works well with both SageMaker AI and Amazon SageMaker. SageMaker AI provides the Amazon SageMaker Studio integrated development environment (IDE), and SageMaker provides the Amazon SageMaker Unified Studio IDE. For this post, we use SageMaker Studio.

Administrator journey

In this scenario, the administrator receives a request from a team working on a fraud detection use case to provision an ML environment with a fully managed training and experimentation setup. The administrator’s journey consists of the following steps:

  1. Follow the prerequisites to install Partner AI Apps. This sets up permissions for administrators, allowing Comet to assume a SageMaker AI execution role on behalf of the users, plus additional privileges for managing the Comet subscription through AWS Marketplace.
  2. On the SageMaker AI console, under Applications and IDEs in the navigation pane, choose Partner AI Apps, then choose View details for Comet.

The details are shown, including the contract pricing model for Comet and the estimated infrastructure tier costs.

Comet provides different subscription options ranging from a 1-month to a 36-month contract. With this contract, users can access Comet in SageMaker. Based on the number of users, the admin can review and choose the appropriate instance size for the Comet dashboard server. Comet supports 5–500 users running more than 100 experiment jobs.

  3. Choose Go to Marketplace to subscribe and be redirected to the Comet listing on AWS Marketplace.
  4. Choose View purchase options.

  5. In the subscription form, provide the required details.

When the subscription is complete, the admin can start configuring Comet.

  6. While deploying Comet, add the project lead of the fraud detection use case team as an admin to handle the admin operations for the Comet dashboard.

It takes a few minutes for the Comet server to be deployed. For more details on this step, refer to Partner AI App provisioning.

  7. Set up a SageMaker AI domain following the steps in Use custom setup for Amazon SageMaker AI. As a best practice, provide a pre-signed domain URL for the use case team members to directly access the Comet UI without logging in to the SageMaker console.
  8. Add the team members to this domain and enable access to Comet while configuring the domain.

Now the SageMaker AI domain is ready for users to log in to and start working on the fraud detection use case.

User journey

Now let’s explore the journey of an ML practitioner from the fraud detection use case team. The user completes the following steps:

  1. Log in to the SageMaker AI domain through the pre-signed URL.

You will be redirected to the SageMaker Studio IDE. Your user name and AWS Identity and Access Management (IAM) execution role are preconfigured by the admin.

  2. Create a JupyterLab Space following the JupyterLab user guide.
  3. Start working on the fraud detection use case by spinning up a Jupyter notebook.

The admin has also set up the required access to the data through an Amazon Simple Storage Service (Amazon S3) bucket.

  4. To access the Comet APIs, install the comet_ml library and configure the required environment variables as described in Set up the Amazon SageMaker Partner AI Apps SDKs.
  5. To access the Comet UI, choose Partner AI Apps in the SageMaker Studio navigation pane and choose Open for Comet.
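As a quick sketch, the install-and-configure step might look like the following shell commands. The placeholder values are illustrative; the actual ARN comes from your admin and the API key from the Comet UI:

```shell
# Install the Comet SDK alongside the SageMaker SDK
pip install comet_ml sagemaker

# Required for the Partner AI App integration (replace the placeholders)
export AWS_PARTNER_APP_AUTH=true
export AWS_PARTNER_APP_ARN=<Your_AWS_PARTNER_APP_ARN>
export COMET_API_KEY=<Your_Comet_API_Key>
```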

Now, let’s walk through the use case implementation.

Solution overview

This use case highlights common enterprise challenges: working with imbalanced datasets (in this example, only 0.17% of transactions are fraudulent), requiring multiple experiment iterations, and maintaining complete reproducibility for regulatory compliance. To follow along, refer to the Comet documentation and Quickstart guide for additional setup and API details.

For this use case, we use the Credit Card Fraud Detection dataset. The dataset contains credit card transactions with binary labels representing fraudulent (1) or legitimate (0) transactions. In the following sections, we walk through some of the important parts of the implementation. The entire implementation code is available in the GitHub repository.

Prerequisites

As a prerequisite, configure the required imports and environment variables for the Comet and SageMaker integration:

# Comet ML for experiment tracking
import comet_ml
from comet_ml import Experiment, API, Artifact
from comet_ml.integration.sagemaker import log_sagemaker_training_job_v1

# Partner AI App environment variables
import os
os.environ['AWS_PARTNER_APP_AUTH'] = 'true'
os.environ['AWS_PARTNER_APP_ARN'] = '<Your_AWS_PARTNER_APP_ARN>'
# From the details page, choose Open Comet. In the top-right corner,
# choose your user profile, then API Key.
os.environ['COMET_API_KEY'] = '<Your_Comet_API_Key>'

# Comet ML configuration
COMET_WORKSPACE = '<your-comet-workspace-name>'
COMET_PROJECT_NAME = '<your-comet-project-name>'

Prepare the dataset

One of Comet’s key enterprise features is automated dataset versioning and lineage tracking. This capability provides complete auditability of the data used to train each model, which is critical for regulatory compliance and reproducibility. Start by loading the dataset:

# Create a Comet Artifact to track our raw dataset
dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["raw"]
)
# Add the raw dataset file to the artifact
dataset_artifact.add_remote(s3_data_path, metadata={
    "dataset_stage": "raw",
    "dataset_split": "not_split",
    "preprocessing": "none"
})

Start a Comet experiment

With the dataset artifact created, you can now start tracking the ML workflow. Creating a Comet experiment automatically begins capturing code, installed libraries, system metadata, and other contextual information in the background. You can log the dataset artifact created earlier to the experiment. See the following code:

# Create a new Comet experiment
experiment_1 = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)
# Log the dataset artifact to this experiment for lineage tracking
experiment_1.log_artifact(dataset_artifact)

Preprocess the data

The next steps are standard preprocessing steps, including removing duplicates, dropping unneeded columns, splitting into train/validation/test sets, and standardizing features using scikit-learn’s StandardScaler. We wrap the processing code in preprocess.py and run it as a SageMaker Processing job. See the following code:

import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Run SageMaker processing job
processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.t3.medium'
)
processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(source=s3_data_path, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=f's3://{bucket_name}/{processed_data_prefix}')]
)

After you submit the processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the specified S3 bucket.
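For reference, here is a minimal sketch of what preprocess.py might contain, assuming the public Credit Card Fraud Detection dataset schema (Time, V1–V28, Amount, Class) and the standard SageMaker Processing container paths; the actual script lives in the GitHub repository:

```python
# preprocess.py -- illustrative sketch, not the exact repository script
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess(df):
    """Deduplicate, drop the Time column, split 70/15/15, and standardize."""
    df = df.drop_duplicates().drop(columns=["Time"])
    X, y = df.drop(columns=["Class"]), df["Class"]
    # Stratify so the rare fraud class appears in every split
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
    # Fit the scaler on the training set only to avoid leakage
    scaler = StandardScaler().fit(X_train)
    return (scaler.transform(X_train), y_train,
            scaler.transform(X_val), y_val,
            scaler.transform(X_test), y_test)

if __name__ == "__main__" and os.path.isdir("/opt/ml/processing/input"):
    df = pd.read_csv("/opt/ml/processing/input/creditcard.csv")
    splits = preprocess(df)
    # ... write each split to /opt/ml/processing/output for upload to S3
```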

Next, create a new version of the dataset artifact to track the processed data. Comet automatically versions artifacts with the same name, maintaining complete lineage from raw to processed data.

# Create an updated version of the 'fraud-dataset' Artifact for the preprocessed data
preprocessed_dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["preprocessed"],
    metadata={
        "description": "Credit card fraud detection dataset",
        "fraud_percentage": f"{fraud_percentage:.3f}%",
        "dataset_stage": "preprocessed",
        "preprocessing": "StandardScaler + train/val/test split",
    }
)
# Add the train, validation, and test dataset files as remote assets
preprocessed_dataset_artifact.add_remote(
    uri=f's3://{bucket_name}/{processed_data_prefix}',
    logical_path="split_data"
)
# Log the updated dataset to the experiment to track the changes
experiment_1.log_artifact(preprocessed_dataset_artifact)

The Comet and SageMaker AI experiment workflow

Data scientists prefer rapid experimentation; therefore, we organized the workflow into reusable utility functions that can be called multiple times with different hyperparameters while maintaining consistent logging and evaluation across all runs. In this section, we showcase the utility functions along with a brief snippet of the code inside each function:

  • train() – Creates the SageMaker estimator and launches the training job:

    # Create SageMaker estimator
    estimator = Estimator(
        image_uri=xgboost_image,
        role=execution_role,
        instance_count=1,
        instance_type='ml.m5.large',
        output_path=model_output_path,
        sagemaker_session=sagemaker_session_obj,
        hyperparameters=hyperparameters_dict,
        max_run=1800  # Maximum training time in seconds
    )
    # Start training
    estimator.fit({
        'train': train_channel,
        'validation': val_channel
    })

  • log_training_job() – Captures the training job metadata and metrics and links them to the experiment for full traceability:

    # Log the SageMaker training job to Comet
    log_sagemaker_training_job_v1(
        estimator=training_estimator,
        experiment=api_experiment
    )

  • log_model_to_comet() – Links model artifacts to Comet, captures the training metadata, and links the model asset to the experiment for full traceability:

    experiment.log_remote_model(
        model_name=model_name,
        uri=model_artifact_path,
        metadata=metadata
    )

  • deploy_and_evaluate_model() – Performs model deployment, evaluation, and metric logging:

    # Deploy to an endpoint
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type='ml.m5.xlarge')
    # Log metrics and visualizations to Comet
    experiment.log_metrics(metrics)
    experiment.log_confusion_matrix(matrix=cm, labels=['Normal', 'Fraud'])
    # Log the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob_as_np_array)
    experiment.log_curve("roc_curve", x=fpr, y=tpr)

The complete prediction and evaluation code is available in the GitHub repository.
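To illustrate what the evaluation step computes before logging to Comet, here is a hedged sketch; the function name and default threshold are illustrative, not the repository’s exact code:

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, f1_score, roc_curve

def evaluate_fraud_model(y_test, y_pred_prob, threshold=0.5):
    """Turn predicted fraud probabilities into the metrics logged to Comet."""
    y_pred = (np.asarray(y_pred_prob) >= threshold).astype(int)
    cm = confusion_matrix(y_test, y_pred)          # rows: actual, cols: predicted
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)   # points for the ROC curve
    metrics = {"auc": auc(fpr, tpr), "f1": f1_score(y_test, y_pred)}
    return metrics, cm, (fpr, tpr)
```

These are the kinds of values passed to experiment.log_metrics(), experiment.log_confusion_matrix(), and experiment.log_curve().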

Run the experiments

Now you can run multiple experiments by calling the utility functions with different configurations, and compare the experiments to find the optimal settings for the fraud detection use case.

For the first experiment, we establish a baseline using standard XGBoost hyperparameters:

# Define hyperparameters for the first experiment
hyperparameters_v1 = {
    'objective': 'binary:logistic',     # Binary classification
    'num_round': 100,                   # Number of boosting rounds
    'eval_metric': 'auc',               # Evaluation metric
    'learning_rate': 0.15,              # Learning rate
    'booster': 'gbtree'                 # Booster algorithm
}
# Train the model
estimator_1 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/1",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v1,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)
# Log the training job and model artifact
log_training_job(experiment_key=experiment_1.get_key(), training_estimator=estimator_1)
log_model_to_comet(experiment=experiment_1,
                   model_name="fraud-detection-xgb-v1",
                   model_artifact_path=estimator_1.model_data,
                   metadata=metadata)
# Deploy and evaluate
deploy_and_evaluate_model(experiment=experiment_1,
                          estimator=estimator_1,
                          X_test_scaled=X_test_scaled,
                          y_test=y_test
                          )

When running a Comet experiment from a Jupyter notebook, we need to end the experiment to make sure everything is captured and persisted to the Comet server. See the following code: experiment_1.end()

When the baseline experiment is complete, you can run additional experiments with different hyperparameters. Check out the notebook to see the details of both experiments.
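A second run reuses the same utility functions with a modified hyperparameter dictionary. The values below are illustrative only; refer to the notebook for the actual configuration:

```python
# Illustrative second configuration -- not the notebook's exact values
hyperparameters_v2 = {
    'objective': 'binary:logistic',
    'num_round': 200,          # more boosting rounds than the baseline
    'eval_metric': 'auc',
    'learning_rate': 0.05,     # slower learning rate
    'max_depth': 6,            # explicit tree depth
    'booster': 'gbtree'
}
# The rest of the flow mirrors experiment 1: create a new comet_ml.Experiment,
# call the training utility with hyperparameters_v2, log the training job and
# model, deploy and evaluate, and then end() the experiment.
```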

When the second experiment is complete, navigate to the Comet UI to compare the two experiment runs.

View Comet experiments in the UI

To access the UI, you can locate the URL in the SageMaker Studio IDE or execute the code provided in the notebook: experiment_2.url

The following screenshot shows the Comet experiments UI. The experiment details are for illustration purposes only and don’t represent a real-world fraud detection experiment.

This concludes the fraud detection experiment.

Clean up

For the experimentation part, the SageMaker processing and training infrastructure is ephemeral and shuts down automatically when each job is complete. However, you may still need to manually clean up a few resources to avoid unnecessary costs:

  1. Shut down the SageMaker JupyterLab Space after use. For instructions, refer to Idle shutdown.
  2. The Comet subscription renews based on the chosen contract. Cancel the contract when there is no further need to renew the Comet subscription.

Benefits of SageMaker and Comet integration

Having demonstrated the technical workflow, let’s examine the broader advantages this integration provides.

Streamlined model development

The Comet and SageMaker combination reduces the manual overhead of running ML experiments. While SageMaker handles infrastructure provisioning and scaling, Comet’s automated logging captures hyperparameters, metrics, code, installed libraries, and system performance from your training jobs without additional configuration. This helps teams focus on model development rather than experiment bookkeeping.

Comet’s visualization capabilities extend beyond basic metric plots. Built-in charts enable quick experiment comparison, and custom Python panels support domain-specific analysis tools for debugging model behavior, optimizing hyperparameters, or creating specialized visualizations that standard tools can’t provide.

Enterprise collaboration and governance

For enterprise teams, the combination creates a mature platform for scaling ML initiatives across regulated environments. SageMaker provides consistent, secure ML environments, and Comet enables seamless collaboration with complete artifact and model lineage tracking. This helps avoid the costly mistakes that occur when teams can’t recreate previous results.

Full ML lifecycle integration

Unlike point solutions that only address training or monitoring, Comet paired with SageMaker supports your full ML lifecycle. Models can be registered in Comet’s model registry with complete version tracking and governance. SageMaker handles model deployment, and Comet maintains the lineage and approval workflows for model promotion. Comet’s production monitoring capabilities track model performance and data drift after deployment, creating a closed loop where production insights inform your next round of SageMaker experiments.

Conclusion

In this post, we showed how to use SageMaker and Comet together to spin up fully managed ML environments with reproducibility and experiment tracking capabilities.

To enhance your SageMaker workflows with comprehensive experiment management, deploy Comet directly in your SageMaker environment through AWS Marketplace, and share your feedback in the comments.

For more information about the services and features discussed in this post, refer to the following resources:


About the authors

Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book "Generative AI for financial services." He has more than 15 years of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between soccer and rugby.

Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy, and migrate ML workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. Outside of work, he enjoys ultra-endurance running and cycling.

Sarah Ostermeier is a Technical Product Marketing Manager at Comet. She specializes in bringing Comet’s GenAI and ML developer products to the engineers who need them through technical content, educational resources, and product messaging. She has previously worked as an ML engineer, data scientist, and customer success manager, helping customers implement and scale AI solutions. Outside of work, she enjoys traveling off the beaten path, writing about AI, and reading science fiction.
