How Carrier predicts HVAC faults using AWS Glue and Amazon SageMaker

In their own words, “In 1902, Willis Carrier solved one of mankind’s most elusive challenges of controlling the indoor environment through modern air conditioning. Today, Carrier products create comfortable environments, safeguard the global food supply, and enable safe transport of vital medical supplies under exacting conditions.”

At Carrier, the foundation of our success is making products our customers can trust to keep them comfortable and safe year-round. High reliability and low equipment downtime are increasingly important as extreme temperatures become more common due to climate change. We’ve historically relied on threshold-based systems that alert us to abnormal equipment behavior, using parameters defined by our engineering team. Although such systems are effective, they’re intended to identify and diagnose equipment issues rather than predict them. Predicting faults before they occur allows our HVAC dealers to proactively address issues and improve the customer experience.

To improve our equipment reliability, we partnered with the Amazon Machine Learning Solutions Lab to develop a custom machine learning (ML) model capable of predicting equipment issues prior to failure. Our teams developed a framework for processing over 50 TB of historical sensor data and predicting faults with 91% precision. We can now notify dealers of impending equipment failure so that they can schedule inspections and minimize unit downtime. The solution framework is scalable as more equipment is installed and can be reused for a variety of downstream modeling tasks.

In this post, we show how the Carrier and AWS teams applied ML to predict faults across large fleets of equipment using a single model. We first highlight how we use AWS Glue for highly parallel data processing. We then discuss how Amazon SageMaker helps us with feature engineering and building a scalable supervised deep learning model.

Overview of use case, goals, and risks

The main goal of this project is to reduce downtime by predicting impending equipment failures and notifying dealers. This allows dealers to schedule maintenance proactively and provide exceptional customer service. We faced three primary challenges when working on this solution:

  • Data scalability – Data processing and feature extraction need to scale across large, growing volumes of historical sensor data
  • Model scalability – The modeling approach needs to be capable of scaling across over 10,000 units
  • Model precision – Low false positive rates are needed to avoid unnecessary maintenance inspections

Scalability, both from a data and modeling perspective, is a key requirement for this solution. We have over 50 TB of historical equipment data and expect this data to grow quickly as more HVAC units are connected to the cloud. Data processing and model inference need to scale as our data grows. For our modeling approach to scale across over 10,000 units, we need a model that can learn from a fleet of equipment rather than relying on anomalous readings for a single unit. This allows for generalization across units and reduces the cost of inference by hosting a single model.

The other concern for this use case is triggering false alarms. A false alarm means that a dealer or technician goes on-site to inspect the customer’s equipment and finds everything to be operating appropriately. The solution requires a high-precision model to ensure that when a dealer is alerted, the equipment is likely to fail. This helps earn the trust of dealers, technicians, and homeowners alike, and reduces the costs associated with unnecessary on-site inspections.

We partnered with the AI/ML experts at the Amazon ML Solutions Lab for a 14-week development effort. In the end, our solution consists of two primary components. The first is a data processing module built with AWS Glue that summarizes equipment behavior and reduces the size of our training data for efficient downstream processing. The second is a model training interface managed through SageMaker, which allows us to train, tune, and evaluate our model before it is deployed to a production endpoint.

Data processing

Each HVAC unit we install generates data from 90 different sensors with readings for RPMs, temperature, and pressures throughout the system. This amounts to roughly 8 million data points generated per unit per day, with tens of thousands of units installed. As more HVAC systems are connected to the cloud, we anticipate the volume of data to grow quickly, making it critical for us to manage its size and complexity for use in downstream tasks. The length of the sensor data history also presents a modeling challenge. A unit may start displaying signs of impending failure months before a fault is actually triggered. This creates a significant lag between the predictive signal and the actual failure. A method for compressing the length of the input data becomes critical for ML modeling.

To address the size and complexity of the sensor data, we compress it into cycle features as shown in Figure 1. This dramatically reduces the size of data while capturing features that characterize the equipment’s behavior.

Figure 1: Sample of HVAC sensor data

AWS Glue is a serverless data integration service for processing large quantities of data at scale. AWS Glue allowed us to easily run parallel data preprocessing and feature extraction. We used AWS Glue to detect cycles and summarize unit behavior using key features identified by our engineering team. This dramatically reduced the size of our dataset from over 8 million data points per day per unit down to roughly 1,200. Crucially, this approach preserves predictive information about unit behavior with a much smaller data footprint.
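As a rough illustration of the cycle summarization idea, the sketch below splits one unit's time-ordered readings into run cycles and reduces each cycle to a few summary statistics. It is plain Python rather than an AWS Glue job, and the on/off rule (RPM above zero), the field names `t`, `rpm`, and `temp`, and the chosen statistics are all assumptions made for illustration. The features actually used were selected by our engineering team and are not listed in this post.

```python
from statistics import mean


def summarize_cycles(readings, rpm_key="rpm"):
    """Split time-ordered readings into run cycles and summarize each one.

    Assumption (hypothetical rule): a cycle is a maximal run of consecutive
    readings whose RPM is above zero.
    """
    cycles, current = [], []
    for row in readings:
        if row[rpm_key] > 0:
            current.append(row)
        elif current:
            cycles.append(current)
            current = []
    if current:  # the unit was still running when the data ended
        cycles.append(current)

    return [
        {
            "start": c[0]["t"],
            # Assumes one reading per time step, so duration counts steps.
            "duration": c[-1]["t"] - c[0]["t"] + 1,
            "mean_rpm": mean(r[rpm_key] for r in c),
            "max_temp": max(r["temp"] for r in c),
        }
        for c in cycles
    ]


# Toy data: two run cycles separated by an idle reading.
readings = [
    {"t": 0, "rpm": 0, "temp": 70.0},
    {"t": 1, "rpm": 1200, "temp": 71.0},
    {"t": 2, "rpm": 1300, "temp": 73.0},
    {"t": 3, "rpm": 0, "temp": 72.0},
    {"t": 4, "rpm": 900, "temp": 74.0},
]
cycle_features = summarize_cycles(readings)
```

Each raw reading contributes to exactly one cycle summary, which is how millions of daily data points can shrink to a much smaller set of per-cycle rows.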

The output of the AWS Glue job is a summary of unit behavior for each cycle. We then use an Amazon SageMaker Processing job to calculate features across cycles and label our data. We formulate the ML problem as a binary classification task with a goal of predicting equipment faults in the next 60 days. This allows our dealer network to address potential equipment failures in a timely manner. It’s important to note that not all units fail within 60 days. A unit experiencing slow performance degradation could take more time to fail. We address this during the model evaluation step. We focused our modeling on summertime because those months are when most HVAC systems in the US are in consistent operation and under more extreme conditions.
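The labeling rule can be sketched as follows. This is a minimal example, not the SageMaker Processing job itself: the function name and the exact boundary handling (a fault strictly after the cycle date and up to 60 days later counts as positive) are assumptions for illustration.

```python
from datetime import date, timedelta

HORIZON_DAYS = 60  # prediction window stated in this post


def label_cycle(cycle_date, fault_dates, horizon_days=HORIZON_DAYS):
    """Return 1 if any fault falls within horizon_days after cycle_date, else 0.

    Assumption: the window is (cycle_date, cycle_date + horizon_days].
    """
    window_end = cycle_date + timedelta(days=horizon_days)
    return int(any(cycle_date < f <= window_end for f in fault_dates))


# Hypothetical unit with a single fault on 2022-08-15.
faults = [date(2022, 8, 15)]
cycle_dates = [date(2022, 6, 1), date(2022, 7, 1), date(2022, 8, 20)]
labels = [label_cycle(d, faults) for d in cycle_dates]
```

In this toy case the June 1 cycle is more than 60 days before the fault, so it is labeled negative even though the unit does fail later, which is exactly the situation handled during model evaluation.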


Transformer architectures have become the state-of-the-art approach for handling temporal data. They can use long sequences of historical data at each time step without suffering from vanishing gradients. The input to our model at a given point in time consists of the features for the previous 128 equipment cycles, which is roughly one week of unit operation. This is processed by a three-layer encoder whose output is averaged and fed into a multi-layered perceptron (MLP) classifier. The MLP classifier consists of three linear layers with ReLU activation functions and a final layer with LogSoftMax activation. We use weighted negative log-likelihood loss with a different weight on the positive class for our loss function. This biases our model towards high precision and avoids costly false alarms. It also incorporates our business objectives directly into the model training process. Figure 2 illustrates the transformer architecture.
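To make the loss function concrete, here is a small standard-library sketch of a weighted negative log-likelihood computed on log-softmax outputs. It is not our training code. The class weights shown are placeholders, because the weight used in the project is not stated in this post, and the normalization by the sum of weights is one common convention that we assume here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def weighted_nll(batch_logits, targets, class_weights):
    """Weighted mean of -log p(target), normalized by the sum of weights."""
    num, den = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        w = class_weights[y]
        num += -w * log_softmax(logits)[y]
        den += w
    return num / den


# Toy batch: class 0 = no fault, class 1 = fault within 60 days.
batch_logits = [[2.0, -1.0], [0.5, 1.5], [1.0, 1.0]]
targets = [0, 1, 1]
# Placeholder weights (assumption): positive class down-weighted.
loss = weighted_nll(batch_logits, targets, class_weights=[1.0, 0.5])
```

Giving the positive class a smaller weight than the negative class makes missed faults cheaper than false alarms during training, which tends to push a classifier toward fewer, more confident alerts. Deep learning frameworks generally provide an equivalent built-in; for example, PyTorch's `NLLLoss` accepts a per-class `weight`.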

Figure 2: Temporal transformer architecture


One challenge when training this temporal learning model is data imbalance. Some units have a longer operational history than others and therefore have more cycles in our dataset. Because they are overrepresented in the dataset, these units will have more influence on our model. We solve this by randomly sampling 100 cycles in a unit’s history where we assess the probability of a failure at that time. This ensures that each unit is equally represented during the training process. While removing the imbalanced data problem, this approach has the added benefit of replicating a batch processing approach that will be used in production. This sampling approach was applied to the training, validation, and test sets.
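The per-unit sampling step can be sketched like this. The function name, the fixed seed, and the handling of units with fewer than 100 cycles (keep them all) are assumptions for illustration. The sketch also ignores the requirement that each sampled point needs enough preceding cycles to fill the model's input window.

```python
import random

SAMPLES_PER_UNIT = 100  # sample size stated in this post


def sample_assessment_points(cycles_by_unit, n=SAMPLES_PER_UNIT, seed=0):
    """Pick up to n random cycle indices per unit, without replacement."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    sampled = {}
    for unit_id, cycles in cycles_by_unit.items():
        # Assumption: units with fewer than n cycles keep every cycle.
        k = min(n, len(cycles))
        sampled[unit_id] = sorted(rng.sample(range(len(cycles)), k))
    return sampled


# Hypothetical fleet with very different history lengths.
cycles_by_unit = {
    "unit_a": list(range(5000)),
    "unit_b": list(range(300)),
    "unit_c": list(range(40)),
}
sampled = sample_assessment_points(cycles_by_unit)
```

After sampling, a unit with 5,000 cycles and a unit with 300 cycles both contribute 100 assessment points, so neither dominates training.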

Training was performed using a GPU-accelerated instance on SageMaker. Monitoring the loss shows that it achieves the best results after 180 training epochs, as shown in Figure 3. Figure 4 shows that the area under the ROC curve for the resulting temporal classification model is 81%.

Figure 3: Training loss over epochs

Figure 4: ROC-AUC for 60-day lockout


While our model is trained at the cycle level, evaluation needs to take place at the unit level. In this way, one unit with multiple true positive detections is still only counted as a single true positive at the unit level. To do this, we analyze the overlap between the predicted outcomes and the 60-day window preceding a fault. This is illustrated in the following figures, which show four cases of prediction outcomes:

  • True negative – All of the prediction results are negative (purple) (Figure 5.1)
  • False positive – The positive predictions are false alarms (Figure 5.2)
  • False negative – Although the predictions are all negative, the actual labels could be positive (green) (Figure 5.3)
  • True positive – Some of the predictions could be negative (green), and at least one prediction is positive (yellow) (Figure 5.4)

Figure 5.1: True negative case

Figure 5.2: False positive case

Figure 5.3: False negative case

Figure 5.4: True positive case
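The four unit-level cases can be expressed as a small rule that collapses a unit's cycle-level predictions into a single outcome. This is a sketch under stated assumptions, not our evaluation code: the function name is hypothetical, and we assume that a unit whose positive predictions all fall outside the 60-day window is counted as a false positive even if it also has an undetected fault.

```python
def unit_outcome(predictions, labels):
    """Collapse one unit's cycle-level results into TP, FP, FN, or TN.

    predictions, labels: equal-length lists of 0/1 per assessed cycle.
    A label of 1 means the cycle lies in the 60-day window before a fault.
    """
    overlap = any(p == 1 and y == 1 for p, y in zip(predictions, labels))
    if overlap:
        return "TP"  # at least one alert inside the pre-fault window
    if any(p == 1 for p in predictions):
        return "FP"  # alerts raised, none inside a pre-fault window
    if any(y == 1 for y in labels):
        return "FN"  # a fault window existed but no alert was raised
    return "TN"


# One toy unit per case: (predictions, labels).
units = [
    ([0, 0, 0], [0, 0, 0]),
    ([0, 1, 0], [0, 0, 0]),
    ([0, 0, 0], [0, 1, 1]),
    ([0, 0, 1], [0, 1, 1]),
]
outcomes = [unit_outcome(p, y) for p, y in units]
tp, fp = outcomes.count("TP"), outcomes.count("FP")
unit_precision = tp / (tp + fp)
```

Counting outcomes per unit rather than per cycle means that several correct alerts on the same unit contribute only one true positive to the precision figure.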

After training, we use the evaluation set to tune the threshold for sending an alert. Setting the model confidence threshold at 0.99 yields a precision of roughly 81%. This falls short of our initial 90% criterion for success. However, we found that a good portion of units failed just outside the 60-day evaluation window. This makes sense, because a unit may actively display faulty behavior but take longer than 60 days to fail. To handle this, we defined a metric called effective precision, which is a combination of the true positive precision (81%) with the added precision of lockouts that occurred in the 30 days beyond our target 60-day window.

For an HVAC dealer, what is most important is that an onsite inspection helps prevent future HVAC issues for the customer. Using this model, we estimate that 81.2% of the time the inspection will prevent a lockout from occurring in the next 60 days. Additionally, 10.4% of the time the lockout would have occurred within 90 days of inspection. The remaining 8.4% will be a false alarm. The effective precision of the trained model is 91.6%.
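The arithmetic behind effective precision is simple enough to write out. The function and argument names below are hypothetical; the three shares are the ones reported above.

```python
def effective_precision(within_window, within_grace, false_alarms):
    """Share of alerts followed by a lockout in the window or grace period."""
    total = within_window + within_grace + false_alarms
    return (within_window + within_grace) / total


# Shares of alerts reported in this post, as percentages:
# lockout within 60 days, lockout in days 61-90, and false alarm.
ep = effective_precision(81.2, 10.4, 8.4)
print(f"Effective precision: {ep:.1%}")
```

The result is approximately 0.916, matching the 91.6% figure: alerts that precede a lockout in the 30 days just beyond the 60-day window are counted as useful rather than as false alarms.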


In this post, we showed how our team used AWS Glue and SageMaker to create a scalable supervised learning solution for predictive maintenance. Our model is capable of capturing trends across long-term histories of sensor data and accurately detecting hundreds of equipment failures weeks in advance. Predicting faults in advance will reduce curb-to-curb time, allowing our dealers to provide more timely technical assistance and improving the overall customer experience. The impacts of this approach will grow over time as more cloud-connected HVAC units are installed every year.

Our next step is to integrate these insights into the upcoming release of Carrier’s Connected Dealer Portal. The portal combines these predictive alerts with other insights we derive from our AWS-based data lake to give our dealers more clarity into equipment health across their entire client base. We will continue to improve our model by integrating data from additional sources and extracting more advanced features from our sensor data. The methods employed in this project provide a strong foundation for our team to start answering other key questions that can help us reduce warranty claims and improve equipment efficiency in the field.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab. To learn more about the services used in this project, refer to the AWS Glue Developer Guide and the Amazon SageMaker Developer Guide.

About the Authors

Ravi Patankar is a technical leader for IoT related analytics at Carrier’s Residential HVAC Unit. He formulates analytics problems related to diagnostics and prognostics and provides direction for ML/deep learning-based analytics solutions and architecture.

Dan Volk is a Data Scientist at the AWS Generative AI Innovation Center. He has ten years of experience in machine learning, deep learning, and time-series analysis and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Yingwei Yu is an Applied Scientist at the AWS Generative AI Innovation Center. He has experience working with several organizations across industries on various proof-of-concepts in machine learning, including NLP, time-series analysis, and generative AI technologies. Yingwei received his PhD in computer science from Texas A&M University.

Yanxiang Yu is an Applied Scientist at Amazon Web Services, working on the Generative AI Innovation Center. With over 8 years of experience building AI and machine learning models for industrial applications, he specializes in generative AI, computer vision, and time series modeling. His work focuses on finding innovative ways to apply advanced generative techniques to real-world problems.

Diego Socolinsky is a Senior Applied Science Manager with the AWS Generative AI Innovation Center, where he leads the delivery team for the Eastern US and Latin America regions. He has over twenty years of experience in machine learning and computer vision, and holds a PhD degree in mathematics from The Johns Hopkins University.

Kexin Ding is a fifth-year Ph.D. candidate in computer science at UNC-Charlotte. Her research focuses on applying deep learning methods for analyzing multi-modal data, including medical image and genomics sequencing data.
