How Palo Alto Networks enhanced device security infra log analysis with Amazon Bedrock
This post is co-written with Fan Zhang, Sr Principal Engineer/Architect from Palo Alto Networks.
Palo Alto Networks’ Device Security team needed to detect early warning signs of potential production issues to give subject matter experts (SMEs) more time to react to these emerging problems. The primary challenge they faced was that reactively processing over 200 million daily service and application log entries resulted in delayed response times to these critical issues, leaving them at risk of potential service degradation.
To address this challenge, they partnered with the AWS Generative AI Innovation Center (GenAIIC) to develop an automated log classification pipeline powered by Amazon Bedrock. The solution achieved 95% precision in detecting production issues while reducing incident response times by 83%.
In this post, we explore how to build a scalable and cost-effective log analysis system using Amazon Bedrock to transform reactive log monitoring into proactive issue detection. We discuss how Amazon Bedrock, through Anthropic’s Claude Haiku model, and Amazon Titan Text Embeddings work together to automatically classify and analyze log data. We explore how this automated pipeline detects critical issues, examine the solution architecture, and share implementation insights that have delivered measurable operational improvements.
Palo Alto Networks offers Cloud-Delivered Security Services (CDSS) to address device security risks. Their solution uses machine learning and automated discovery to provide visibility into connected devices, implementing Zero Trust principles. Teams facing similar log analysis challenges can find practical insights in this implementation.
Solution overview
Palo Alto Networks’ automated log classification system helps their Device Security team detect and respond to potential service failures ahead of time. The solution processes over 200 million service and application logs daily, automatically identifying critical issues before they escalate into service outages that impact customers.
The system uses Amazon Bedrock with Anthropic’s Claude Haiku model to understand log patterns and classify severity levels, and Amazon Titan Text Embeddings enables intelligent similarity matching. Amazon Aurora provides a caching layer that makes processing massive log volumes feasible in real time. The solution integrates seamlessly with Palo Alto Networks’ existing infrastructure, helping the Device Security team focus on preventing outages instead of managing complex log analysis processes.
Palo Alto Networks and the AWS GenAIIC collaborated to build a solution with the following capabilities:
- Intelligent deduplication and caching – The system scales by intelligently identifying duplicate log entries for the same code event. Rather than using a large language model (LLM) to classify every log individually, the system first identifies duplicates through exact matching, then uses overlap similarity, and finally employs semantic similarity only if no earlier match is found. This approach cost-effectively reduces the 200 million daily logs by over 99%, to logs representing only unique events. The caching layer enables real-time processing by reducing the need for redundant LLM invocations.
- Context retrieval for unique logs – For unique logs, Anthropic’s Claude Haiku model on Amazon Bedrock classifies each log’s severity. The model processes the incoming log together with relevant labeled historical examples. The examples are dynamically retrieved at inference time through vector similarity search. Over time, labeled examples are added to provide rich context to the LLM for classification. This context-aware approach improves accuracy for Palo Alto Networks’ internal logs and systems and for evolving log patterns that traditional rule-based systems struggle to handle.
- Classification with Amazon Bedrock – The solution provides structured predictions, including severity classification (Priority 1 (P1), Priority 2 (P2), Priority 3 (P3)) and detailed reasoning for each decision. This comprehensive output helps Palo Alto Networks’ SMEs quickly prioritize responses and take preventive action before potential outages occur.
- Integration with existing pipelines for action – Results integrate with their existing FluentD and Kafka pipeline, with data flowing to Amazon Simple Storage Service (Amazon S3) and Amazon Redshift for further analysis and reporting.
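The tiered deduplication idea above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the normalization rules, cache layout, and 0.8 overlap threshold are hypothetical, not Palo Alto Networks’ actual implementation, and the semantic-embedding and LLM tiers are only indicated by comments.

```python
import hashlib
import re

def normalize(log: str) -> str:
    """Mask volatile fields (timestamp, log level) so duplicate events compare equal."""
    log = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(\.\d+)?", "<TS>", log)
    log = re.sub(r"\[(debug|info|warn|error)\]", "<LEVEL>", log, flags=re.IGNORECASE)
    return log.strip().lower()

def overlap_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0

def insert_log(log: str, label: str, cache: dict) -> None:
    """Cache a classified log keyed by the hash of its normalized form."""
    norm = normalize(log)
    cache[hashlib.sha256(norm.encode()).hexdigest()] = (norm, label)

def lookup(log: str, cache: dict, threshold: float = 0.8):
    """Tier 1: exact match on the normalized hash. Tier 2: token-overlap similarity.
    A real deployment would then fall back to embedding similarity, and finally the LLM."""
    norm = normalize(log)
    key = hashlib.sha256(norm.encode()).hexdigest()
    if key in cache:
        return cache[key][1]
    for cached_norm, label in cache.values():
        if overlap_similarity(norm, cached_norm) >= threshold:
            return label
    return None  # would proceed to semantic matching, then LLM classification
```

Because timestamps and log levels are masked before hashing, two entries that describe the same event at different times resolve to the same cache key without any model invocation.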
The following diagram (Figure 1) illustrates how the three-stage pipeline processes Palo Alto Networks’ 200 million daily log volume while balancing scale, accuracy, and cost-efficiency. The architecture consists of the following key components:
- Data ingestion layer – FluentD and Kafka pipeline and incoming logs
- Processing pipeline – Consisting of the following stages:
- Stage 1: Smart caching and deduplication – Aurora for exact matching and Amazon Titan Text Embeddings for semantic matching
- Stage 2: Context retrieval – Amazon Titan Text Embeddings and vector similarity search over historical labeled examples
- Stage 3: Classification – Anthropic’s Claude Haiku model for severity classification (P1/P2/P3)
- Output layer – Aurora, Amazon S3, Amazon Redshift, and SME review interface
The processing workflow moves through the following stages:
- Stage 1: Smart caching and deduplication – Incoming logs from Palo Alto Networks’ FluentD and Kafka pipeline are immediately processed through an Aurora-based caching layer. The system first applies exact matching, then falls back to overlap similarity, and finally uses semantic similarity through Amazon Titan Text Embeddings if no earlier match is found. During testing, this approach identified that more than 99% of logs corresponded to duplicate events, even though they contained different timestamps, log levels, and phrasing. The caching system reduced response times for cached results and cut unnecessary LLM processing.
- Stage 2: Context retrieval for unique logs – The remaining less than 1% of truly unique logs require classification. For these entries, the system uses Amazon Titan Text Embeddings to identify the most relevant historical examples from Palo Alto Networks’ labeled dataset. Rather than using static examples, this dynamic retrieval makes sure each log receives contextually appropriate guidance for classification.
- Stage 3: Classification with Amazon Bedrock – Unique logs and their selected examples are processed by Amazon Bedrock using Anthropic’s Claude Haiku model. The model analyzes the log content alongside relevant historical examples to produce severity classifications (P1, P2, P3) and detailed explanations. Results are stored in Aurora and the cache and integrated into Palo Alto Networks’ existing data pipeline for SME review and action.
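The dynamic retrieval in Stage 2 is essentially a nearest-neighbor search over embedding vectors. The sketch below uses plain cosine similarity and tiny hypothetical vectors in place of real (high-dimensional) Amazon Titan Text Embeddings output; the function and field names are our own illustrations, not the production code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_examples(query_vec, labeled_examples, k=2):
    """Return the k labeled examples whose embeddings are closest to the incoming log's embedding."""
    ranked = sorted(labeled_examples, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]
```

At inference time, the incoming log would be embedded once, the top-k matches retrieved, and those matches formatted into the prompt’s few-shot section.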
This architecture enables cost-effective processing of massive log volumes while maintaining 95% precision for critical P1 severity detection. The system uses carefully crafted prompts that combine domain expertise with dynamically selected examples:
system_prompt = """
<Task>
You are an expert log analysis system responsible for classifying production system logs based on severity. Your analysis helps engineering teams prioritize their response to system issues and maintain service reliability.
</Task>
<Severity_Definitions>
P1 (Critical): Requires immediate action - system-wide outages, repeated application crashes
P2 (High): Warrants attention during business hours - performance issues, partial service disruption
P3 (Low): Can be addressed when resources are available - minor bugs, authorization failures, intermittent network issues
</Severity_Definitions>
<Examples>
<log_snippet>
2024-08-17 01:15:00.00 [warn] failed (104: Connection reset by peer) while reading response header from upstream
</log_snippet>
severity: P3
category: Class A
<log_snippet>
2024-08-18 17:40:00.00 <warn> Error: Request failed with status code 500 at settle
</log_snippet>
severity: P2
category: Class B
</Examples>
<Target_Log>
Log: {incoming_log_snippet}
Location: {system_location}
</Target_Log>
Provide severity classification (P1/P2/P3) and detailed reasoning."""
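To show how dynamically retrieved examples might be spliced into a prompt of this shape, here is a hypothetical assembly step. The `{examples}` placeholder and the `build_prompt` helper are our own additions for illustration; the assembled string would then be sent to Anthropic’s Claude Haiku model through the Amazon Bedrock runtime API.

```python
# A condensed template variant with an {examples} slot (illustrative, not the production prompt)
TEMPLATE = """<Task>
Classify the production log's severity and explain your reasoning.
</Task>
<Examples>
{examples}
</Examples>
<Target_Log>
Log: {incoming_log_snippet}
Location: {system_location}
</Target_Log>
Provide severity classification (P1/P2/P3) and detailed reasoning."""

def build_prompt(retrieved, log_snippet, location):
    """Format retrieved few-shot examples and fill the target-log placeholders."""
    examples = "\n".join(
        f"<log_snippet>\n{ex['log']}\n</log_snippet>\nseverity: {ex['severity']}"
        for ex in retrieved
    )
    return TEMPLATE.format(
        examples=examples,
        incoming_log_snippet=log_snippet,
        system_location=location,
    )
```

Because the few-shot section is rebuilt per request, newly labeled examples improve future prompts without any template or code change.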
Implementation insights
The core value of Palo Alto Networks’ solution lies in making an insurmountable challenge manageable: AI helps their team analyze 200 million daily logs efficiently, while the system’s dynamic adaptability makes it possible to extend the solution into the future by adding more labeled examples. Palo Alto Networks’ successful implementation of their automated log classification system yielded key insights that can help organizations building production-scale AI solutions:
- Continuous learning systems deliver compounding value – Palo Alto Networks designed their system to improve automatically as SMEs validate classifications and label new examples. Each validated classification becomes part of the dynamic few-shot retrieval dataset, improving accuracy for similar future logs while increasing cache hit rates. This approach creates a cycle where operational use enhances system performance and reduces costs.
- Intelligent caching enables AI at production scale – The multi-layered caching architecture processes more than 99% of logs through cache hits, transforming expensive per-log LLM operations into a cost-effective system capable of handling 200 million daily logs. This foundation makes AI processing economically viable at enterprise scale while maintaining response times.
- Adaptive systems handle evolving requirements without code changes – The solution accommodates new log categories and patterns without requiring system modifications. When performance needs improvement for novel log types, SMEs can label additional examples, and the dynamic few-shot retrieval automatically incorporates this knowledge into future classifications. This adaptability allows the system to scale with business needs.
- Explainable classifications drive operational confidence – SMEs responding to critical alerts require confidence in AI recommendations, particularly for P1 severity classifications. By providing detailed reasoning alongside each classification, Palo Alto Networks enables SMEs to quickly validate decisions and take appropriate action. Clear explanations transform AI outputs from predictions into actionable intelligence.
These insights demonstrate how AI systems designed for continuous learning and explainability become increasingly valuable operational assets.
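The continuous-learning loop described in the first insight can be reduced to a very small sketch: every SME-validated label is stored alongside an embedding so it becomes retrievable as a future few-shot example. The class, method names, and toy embedding function below are our own illustrative assumptions, not Palo Alto Networks’ interfaces.

```python
def embed_stub(text: str):
    """Hypothetical stand-in for a real embedding call (e.g., Amazon Titan Text Embeddings)."""
    return [float(len(text)), float(text.count(" "))]

class ExampleStore:
    """Minimal sketch of the SME feedback loop: each validated classification
    becomes a retrievable few-shot example for future logs."""

    def __init__(self):
        self._by_log = {}

    def add_validated(self, log, severity, embed_fn=embed_stub):
        # Re-validating the same log overwrites the old label instead of duplicating it
        self._by_log[log] = {"log": log, "severity": severity, "vec": embed_fn(log)}

    def __len__(self):
        return len(self._by_log)
```

Keying by the log text keeps the dataset deduplicated, so repeated validations refine labels rather than inflating the retrieval index.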
Conclusion
Palo Alto Networks’ automated log classification system demonstrates how generative AI powered by AWS helps operational teams manage massive volumes in real time. In this post, we explored how an architecture combining Amazon Bedrock, Amazon Titan Text Embeddings, and Aurora processes 200 million daily logs through intelligent caching and dynamic few-shot learning, enabling proactive detection of critical issues with 95% precision. Palo Alto Networks’ automated log classification system delivered concrete operational improvements:
- 95% precision, 90% recall for P1 severity logs – Critical alerts are accurate and actionable, minimizing false alarms while catching 9 out of 10 urgent issues, leaving the remaining alerts to be captured by existing monitoring systems
- 83% reduction in debugging time – SMEs spend less time on routine log analysis and more time on strategic improvements
- Over 99% cache hit rate – The intelligent caching layer processes the 200 million daily log volume cost-effectively through subsecond responses
- Proactive issue detection – The system identifies potential problems before they impact customers, preventing the multi-week outages that previously disrupted service
- Continuous improvement – Each SME validation automatically improves future classifications and increases cache efficiency, resulting in reduced costs
For organizations evaluating AI initiatives for log analysis and operational monitoring, Palo Alto Networks’ implementation offers a blueprint for building production-scale systems that deliver measurable improvements in operational efficiency and cost reduction. To build your own generative AI solutions, explore Amazon Bedrock for managed access to foundation models. For additional guidance, check out the AWS Machine Learning resources and browse implementation examples in the AWS Artificial Intelligence Blog.
The collaboration between Palo Alto Networks and the AWS GenAIIC demonstrates how thoughtful AI implementation can transform reactive operations into proactive, scalable systems that deliver sustained business value.
To get started with Amazon Bedrock, see Build generative AI solutions with Amazon Bedrock.
About the authors