Optimize equipment performance with historical data, Ray, and Amazon SageMaker

Efficient control policies enable industrial companies to increase their profitability by maximizing productivity while reducing unscheduled downtime and energy consumption. Finding optimal control policies is a complex task because physical systems, such as chemical reactors and wind turbines, are often hard to model and because drift in process dynamics can cause performance to deteriorate over time. Offline reinforcement learning is a control strategy that allows industrial companies to build control policies entirely from historical data, without the need for an explicit process model. This approach does not require interaction with the process directly in an exploration stage, which removes one of the barriers to the adoption of reinforcement learning in safety-critical applications. In this post, we will build an end-to-end solution to find optimal control policies using only historical data on Amazon SageMaker using Ray's RLlib library. To learn more about reinforcement learning, see Use Reinforcement Learning with Amazon SageMaker.

Use cases

Industrial control involves the management of complex systems, such as manufacturing lines, energy grids, and chemical plants, to ensure efficient and reliable operation. Many traditional control strategies are based on predefined rules and models, which often require manual optimization. It is standard practice in some industries to monitor performance and adjust the control policy when, for example, equipment starts to degrade or environmental conditions change. Retuning can take weeks and may require injecting external excitations into the system to record its response in a trial-and-error approach.

Reinforcement learning has emerged as a new paradigm in process control to learn optimal control policies through interacting with the environment. This process requires breaking down data into three categories: 1) measurements available from the physical system, 2) the set of actions that can be taken upon the system, and 3) a numerical metric (reward) of equipment performance. A policy is trained to find the action, at a given observation, that is likely to produce the highest future rewards.

In offline reinforcement learning, one can train a policy on historical data before deploying it into production. The algorithm trained in this blog post is called "Conservative Q Learning" (CQL). CQL contains an "actor" model and a "critic" model and is designed to conservatively predict its own performance after taking a recommended action. In this post, the process is demonstrated with an illustrative cart-pole control problem. The goal is to train an agent to balance a pole on a cart while simultaneously moving the cart toward a designated goal location. The training procedure uses the offline data, allowing the agent to learn from preexisting information. This cart-pole case study demonstrates the training process and its effectiveness in potential real-world applications.
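The intuition behind CQL's conservatism can be sketched in a few lines. On top of the standard temporal difference loss, CQL adds a penalty that pushes down Q values for actions not present in the dataset while pushing the Q value of the logged action back up. The snippet below is an illustrative sketch of that regularizer for a single state with discrete actions, not the actual RLlib implementation:

```python
import math

def cql_regularizer(q_values, logged_action):
    """Illustrative CQL penalty for one state with discrete actions.

    q_values: list of Q(s, a) estimates, one per action.
    logged_action: index of the action recorded in the dataset.

    The penalty is logsumexp over all actions (pushes all Q values down)
    minus the Q value of the logged action (pushes it back up).
    """
    logsumexp = math.log(sum(math.exp(q) for q in q_values))
    return logsumexp - q_values[logged_action]

# If the critic assigns a high Q value to an action absent from the data,
# the penalty grows, discouraging overestimation of unseen actions.
penalty_seen = cql_regularizer([1.0, 0.0], logged_action=0)
penalty_unseen = cql_regularizer([0.0, 1.0], logged_action=0)
assert penalty_unseen > penalty_seen
```

Minimizing this term alongside the temporal difference loss is what makes CQL's performance predictions conservative rather than optimistic.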

Solution overview

The solution presented in this post automates the deployment of an end-to-end workflow for offline reinforcement learning with historical data. The following diagram describes the architecture used in this workflow. Measurement data is produced at the edge by a piece of industrial equipment (here simulated by an AWS Lambda function). The data is put into an Amazon Kinesis Data Firehose, which stores it in Amazon Simple Storage Service (Amazon S3). Amazon S3 is a durable, performant, and low-cost storage solution that allows you to serve large volumes of data to a machine learning training process.

AWS Glue catalogs the data and makes it queryable using Amazon Athena. Athena transforms the measurement data into a form that a reinforcement learning algorithm can ingest and then unloads it back into Amazon S3. Amazon SageMaker loads this data into a training job and produces a trained model. SageMaker then serves that model in a SageMaker endpoint. The industrial equipment can then query that endpoint to receive action recommendations.

Figure 1: Architecture diagram showing the end-to-end reinforcement learning workflow.


In this post, we will break down the workflow into the following steps:

  1. Formulate the problem. Decide which actions can be taken, which measurements to base recommendations on, and determine numerically how well each action performed.
  2. Prepare the data. Transform the measurements table into a format the machine learning algorithm can consume.
  3. Train the algorithm on that data.
  4. Select the best training run based on training metrics.
  5. Deploy the model to a SageMaker endpoint.
  6. Evaluate the performance of the model in production.


To complete this walkthrough, you need to have an AWS account and a command line interface with AWS SAM installed. Follow these steps to deploy the AWS SAM template to run this workflow and generate training data:

  1. Download the code repository with the command
    git clone https://github.com/aws-samples/sagemaker-offline-reinforcement-learning-ray-cql

  2. Change directory to the repo:
    cd sagemaker-offline-reinforcement-learning-ray-cql

  3. Build the repo:
    sam build --use-container

  4. Deploy the repo:
    sam deploy --guided --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND

  5. Use the following commands to call a bash script, which generates mock data using an AWS Lambda function.
    1. sudo yum install jq
    2. cd utils
    3. sh generate_mock_data.sh

Solution walkthrough

Formulate the problem

Our system in this blog post is a cart with a pole balanced on top. The system performs well when the pole is upright and the cart position is close to the goal position. In the prerequisite step, we generated historical data from this system.

The following table shows historical data gathered from the system.

Cart position  Cart velocity  Pole angle  Pole angular velocity  Goal position  External force  Reward  Time
0.53 -0.79 -0.08 0.16 0.50 -0.04 11.5 5:37:54 PM
0.51 -0.82 -0.07 0.17 0.50 -0.04 11.9 5:37:55 PM
0.50 -0.84 -0.07 0.18 0.50 -0.03 12.2 5:37:56 PM
0.48 -0.85 -0.07 0.18 0.50 -0.03 10.5 5:37:57 PM
0.46 -0.87 -0.06 0.19 0.50 -0.03 10.3 5:37:58 PM

You can query historical system data using Amazon Athena with the following query:

SELECT *
FROM "<AWS CloudFormation Stack Name>_glue_db"."measurements_table"
ORDER BY episode_id, epoch_time ASC
LIMIT 10;

The state of this system is defined by the cart position, cart velocity, pole angle, pole angular velocity, and goal position. The action taken at each time step is the external force applied to the cart. The simulated environment outputs a reward value that is higher when the cart is closer to the goal position and the pole is more upright.
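A reward of this shape can be written as a simple function of the state. The exact reward used by the simulator lives in the repository; the version below is a hypothetical sketch (the weights and maximum are made-up illustration values) that penalizes distance from the goal and pole tilt:

```python
def sketch_reward(cart_position, pole_angle, goal_position,
                  max_reward=15.0, distance_weight=10.0, angle_weight=20.0):
    """Hypothetical reward: highest when the cart sits at the goal
    with the pole perfectly upright (angle of zero radians)."""
    distance_penalty = distance_weight * abs(cart_position - goal_position)
    angle_penalty = angle_weight * abs(pole_angle)
    return max_reward - distance_penalty - angle_penalty

# A state near the goal with a small tilt scores higher than one far away.
near = sketch_reward(cart_position=0.51, pole_angle=-0.07, goal_position=0.5)
far = sketch_reward(cart_position=0.10, pole_angle=-0.30, goal_position=0.5)
assert near > far
```

Any function with this general shape works; what matters to the algorithm is that the reward consistently ranks good states above bad ones.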

Prepare the data

To present the system data to the reinforcement learning model, transform it into JSON objects with keys that categorize values into the state (also called observation), action, and reward categories. Store these objects in Amazon S3. Here's an example of JSON objects produced from time steps in the previous table.

{"obs":[[0.53,-0.79,-0.08,0.16,0.5]], "action":[[-0.04]], "reward":[11.5], "next_obs":[[0.51,-0.82,-0.07,0.17,0.5]]}

{"obs":[[0.51,-0.82,-0.07,0.17,0.5]], "action":[[-0.04]], "reward":[11.9], "next_obs":[[0.50,-0.84,-0.07,0.18,0.5]]}

{"obs":[[0.50,-0.84,-0.07,0.18,0.5]], "action":[[-0.03]], "reward":[12.2], "next_obs":[[0.48,-0.85,-0.07,0.18,0.5]]}

The AWS CloudFormation stack contains an output called AthenaQueryToCreateJsonFormatedData. Run this query in Amazon Athena to perform the transformation and store the JSON objects in Amazon S3. The reinforcement learning algorithm uses the structure of these JSON objects to understand which values to base recommendations on and the outcome of taking actions in the historical data.
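The transformation the Athena query performs can be sketched in plain Python: pair each measurement row with the next row of the same episode, labeling the state columns as the observation, the external force as the action, and the following row's state as next_obs. The field names below match the JSON shown above; the dictionary-per-row layout is an assumption for illustration:

```python
import json

STATE_COLS = ["cart_position", "cart_velocity", "pole_angle",
              "pole_angular_velocity", "goal_position"]

def rows_to_samples(rows):
    """Convert consecutive measurement rows into offline RL samples."""
    samples = []
    for current, following in zip(rows, rows[1:]):
        samples.append({
            "obs": [[current[c] for c in STATE_COLS]],
            "action": [[current["external_force"]]],
            "reward": [current["reward"]],
            "next_obs": [[following[c] for c in STATE_COLS]],
        })
    return samples

rows = [
    {"cart_position": 0.53, "cart_velocity": -0.79, "pole_angle": -0.08,
     "pole_angular_velocity": 0.16, "goal_position": 0.5,
     "external_force": -0.04, "reward": 11.5},
    {"cart_position": 0.51, "cart_velocity": -0.82, "pole_angle": -0.07,
     "pole_angular_velocity": 0.17, "goal_position": 0.5,
     "external_force": -0.04, "reward": 11.9},
]
print(json.dumps(rows_to_samples(rows)[0]))
```

Running this on the first two table rows reproduces the first JSON object shown earlier. Athena does the same pairing at scale with a window function over epoch_time.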

Train the agent

Now we can start a training job to produce a trained action recommendation model. Amazon SageMaker lets you quickly launch multiple training jobs to see how various configurations affect the resulting trained model. Call the Lambda function named TuningJobLauncherFunction to start a hyperparameter tuning job that experiments with four different sets of hyperparameters when training the algorithm.

Select the best training run

To find which of the training jobs produced the best model, examine the loss curves produced during training. CQL's critic model estimates the actor's performance (called a Q value) after taking a recommended action. Part of the critic's loss function includes the temporal difference error. This metric measures the critic's Q value accuracy. Look for training runs with a high mean Q value and a low temporal difference error. The paper A Workflow for Offline Model-Free Robotic Reinforcement Learning details how to select the best training run. The code repository has a file, /utils/investigate_training.py, that creates a plotly html figure describing the latest training job. Run this file and use the output to pick the best training run.

We can use the mean Q value to predict the performance of the trained model. The Q values are trained to conservatively predict the sum of discounted future reward values. For long-running processes, we can convert this number to an exponentially weighted average by multiplying the Q value by (1 - discount rate). The best training run in this set achieved a mean Q value of 539. Our discount rate is 0.99, so the model is predicting at least 5.39 average reward per time step. You can compare this value to historical system performance for an indication of whether the new model will outperform the historical control policy. In this experiment, the historical data's average reward per time step was 4.3, so the CQL model is predicting 25 percent better performance than the system achieved historically.
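The conversion follows from the geometric series: a constant reward r received forever with discount rate γ sums to r/(1 - γ), so multiplying a Q value by (1 - γ) recovers the implied per-step reward. Applying the figures from this training run:

```python
def q_to_avg_reward(mean_q, discount_rate):
    """Convert a discounted sum of future rewards into an
    exponentially weighted average reward per time step."""
    return mean_q * (1.0 - discount_rate)

predicted = q_to_avg_reward(mean_q=539, discount_rate=0.99)  # 539 * 0.01 = 5.39
historical = 4.3
improvement = predicted / historical - 1.0  # about 0.25, i.e. 25 percent
```

Because CQL trains Q values conservatively, this per-step figure is a lower bound on expected performance rather than a best-case estimate.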

Deploy the model

Amazon SageMaker endpoints let you serve machine learning models in several different ways to meet a variety of use cases. In this post, we'll use the serverless endpoint type so that our endpoint automatically scales with demand, and we only pay for compute usage when the endpoint is generating an inference. To deploy a serverless endpoint, include a ProductionVariantServerlessConfig in the production variant of the SageMaker endpoint configuration. The following code snippet shows how the serverless endpoint in this example is deployed using the Amazon SageMaker software development kit for Python. Find the sample code used to deploy the model at sagemaker-offline-reinforcement-learning-ray-cql.

from sagemaker.serverless import ServerlessInferenceConfig

# Deploy the trained model to a serverless endpoint.
# See the repository for the full argument list used in this example.
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(),
)

The trained model files are located in the S3 model artifacts for each training run. To deploy the machine learning model, locate the model files of the best training run, and call the Lambda function named "ModelDeployerFunction" with an event that contains this model data. The Lambda function will launch a SageMaker serverless endpoint to serve the trained model. Sample event to use when calling the "ModelDeployerFunction":

{ "DescribeTrainingJob": 
    { "ModelArtifacts": 
	    { "S3ModelArtifacts": "s3://your-bucket/training/my-training-job/output/model.tar.gz"} 
    } 
}
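An event of this shape can also be built and sent programmatically. The helper below only constructs the payload; the commented-out lines sketch how, with AWS credentials configured and a real training job name in place of the hypothetical one shown, the artifact location would come straight from SageMaker and the event would invoke the Lambda function:

```python
import json

def build_deploy_event(s3_model_artifacts):
    """Build the event shape expected by ModelDeployerFunction."""
    return {
        "DescribeTrainingJob": {
            "ModelArtifacts": {"S3ModelArtifacts": s3_model_artifacts}
        }
    }

event = build_deploy_event(
    "s3://your-bucket/training/my-training-job/output/model.tar.gz"
)
payload = json.dumps(event)

# With credentials configured (requires boto3 and AWS access):
# import boto3
# job = boto3.client("sagemaker").describe_training_job(
#     TrainingJobName="my-training-job")  # hypothetical job name
# event = build_deploy_event(job["ModelArtifacts"]["S3ModelArtifacts"])
# boto3.client("lambda").invoke(FunctionName="ModelDeployerFunction",
#                               Payload=json.dumps(event))
```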

Evaluate trained model performance

It's time to see how our trained model is doing in production! To check the performance of the new model, call the Lambda function named "RunPhysicsSimulationFunction" with the SageMaker endpoint name in the event. This will run the simulation using the actions recommended by the endpoint. Sample event to use when calling the RunPhysicsSimulationFunction:

{"random_action_fraction": 0.0, "inference_endpoint_name": "sagemaker-endpoint-name"}
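For completeness, this is roughly how a client could query the serverless endpoint for an action recommendation outside the simulation. The payload layout mirrors the observation format used in training, but the exact request contract is defined by the repository's inference handler, so treat the shape below as an assumption:

```python
import json

def build_inference_payload(cart_position, cart_velocity, pole_angle,
                            pole_angular_velocity, goal_position):
    """Package one observation the way the training data was shaped."""
    return json.dumps({"obs": [[cart_position, cart_velocity, pole_angle,
                                pole_angular_velocity, goal_position]]})

payload = build_inference_payload(0.53, -0.79, -0.08, 0.16, 0.5)

# With credentials configured, the payload is posted to the endpoint:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="sagemaker-endpoint-name",
#     ContentType="application/json",
#     Body=payload)
# recommended_force = json.loads(response["Body"].read())
```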

Use the following Athena query to compare the performance of the trained model with historical system performance.

WITH sum_reward_by_episode AS (
    SELECT SUM(reward) AS sum_reward, m_temp.action_source
    FROM "<AWS CloudFormation Stack Name>_glue_db"."measurements_table" m_temp
    GROUP BY m_temp.episode_id, m_temp.action_source
)
SELECT sre.action_source, AVG(sre.sum_reward) AS avg_total_reward_per_episode
FROM sum_reward_by_episode sre
GROUP BY sre.action_source
ORDER BY avg_total_reward_per_episode DESC

Here is an example results table. We see that the trained model achieved 2.5x more reward than the historical data! Additionally, the true performance of the model was 2x better than the conservative performance prediction.

Action source  Average reward per time step
trained_model 10.8
historic_data 4.3

The following animations show the difference between a sample episode from the training data and an episode where the trained model was used to pick which action to take. In the animations, the blue box is the cart, the blue line is the pole, and the green rectangle is the goal location. The red arrow shows the force applied to the cart at each time step. The red arrow in the training data jumps back and forth quite a bit because the data was generated using 50 percent expert actions and 50 percent random actions. The trained model learned a control policy that moves the cart quickly to the goal position, while maintaining stability, entirely from observing nonexpert demonstrations.

Clean up

To delete resources used in this workflow, navigate to the resources section of the AWS CloudFormation stack and delete the S3 buckets and IAM roles. Then delete the CloudFormation stack itself.


Conclusion

Offline reinforcement learning can help industrial companies automate the search for optimal policies without compromising safety by using historical data. To implement this approach in your operations, start by identifying the measurements that make up a state-determined system, the actions you can control, and the metrics that indicate desired performance. Then, access this GitHub repository for the implementation of an automatic end-to-end solution using Ray and Amazon SageMaker.

This post just scratches the surface of what you can do with Amazon SageMaker RL. Give it a try, and please send us feedback, either in the Amazon SageMaker discussion forum or through your usual AWS contacts.

About the Authors

Walt Mayfield is a Solutions Architect at AWS and helps energy companies operate more safely and efficiently. Before joining AWS, Walt worked as an Operations Engineer for Hilcorp Energy Company. He likes to garden and fly fish in his spare time.

Felipe Lopez is a Senior Solutions Architect at AWS with a focus in Oil & Gas Production Operations. Prior to joining AWS, Felipe worked with GE Digital and Schlumberger, where he focused on modeling and optimization products for industrial applications.

Yingwei Yu is an Applied Scientist at the Generative AI Incubator, AWS. He has experience working with several organizations across industries on various proofs of concept in machine learning, including natural language processing, time series analysis, and predictive maintenance. In his spare time, he enjoys swimming, painting, hiking, and spending time with family and friends.

Haozhu Wang is a research scientist at Amazon Bedrock specializing in building Amazon's Titan foundation models. Previously he worked in the Amazon ML Solutions Lab as a co-lead of the Reinforcement Learning Vertical and helped customers build advanced ML solutions with the latest research on reinforcement learning, natural language processing, and graph learning. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.
