How Beekeeper optimized user personalization with Amazon Bedrock


This post is cowritten with Mike Koźmiński from Beekeeper.

Large language models (LLMs) are evolving rapidly, making it difficult for organizations to select the best model for each specific use case, optimize prompts for quality and cost, adapt to changing model capabilities, and personalize responses for different users.

Choosing the "right" LLM and prompt isn't a one-time decision; it shifts as models, prices, and requirements change. System prompts are becoming larger (for example, Anthropic's published system prompts) and more complex, and many mid-sized companies don't have the resources to evaluate and improve them quickly. To address this, Beekeeper built an Amazon Bedrock-powered system that continuously evaluates model and prompt candidates, ranks them on a live leaderboard, and routes each request to the current best option for that use case.

Beekeeper: Connecting and empowering the frontline workforce

Beekeeper offers a comprehensive digital workplace system specifically designed for frontline workforce operations. The company provides a mobile-first communication and productivity solution that connects non-desk workers with one another and with headquarters, enabling organizations to streamline operations, boost employee engagement, and manage tasks efficiently. Their system features robust integration capabilities with existing enterprise systems (human resources, scheduling, payroll), while focusing on industries with large deskless workforces such as hospitality, manufacturing, retail, healthcare, and transportation. At its core, Beekeeper addresses the traditional disconnect between frontline workers and their organizations by providing accessible digital tools that improve communication, operational efficiency, and workforce retention, all delivered through a cloud-based SaaS system with mobile apps, administrative dashboards, and enterprise-grade security features.

Beekeeper's solution: A dynamic evaluation system

Beekeeper solved this challenge with an automated system that continuously tests different model and prompt combinations, ranks options based on quality, cost, and speed, incorporates user feedback to personalize responses, and automatically routes requests to the current best option. Quality is scored with a small synthetic test set and validated in production with user feedback (thumbs up/down and comments). By incorporating prompt mutation, Beekeeper created an organic system that evolves over time. The result is a continuously optimizing setup that balances quality, latency, and cost, and adapts automatically when the landscape changes.

Real-world example: Chat summarization

Beekeeper's Frontline Success Platform unifies communication for deskless workers across industries. One practical application of their LLM system is chat summarization. When a user returns to a shift, they might find a chat with many unread messages; instead of reading everything, they can request a summary. The system generates a concise overview with action items tailored to the user's needs. Users can then provide feedback to improve future summaries. This seemingly simple feature relies on sophisticated technology behind the scenes: the system must understand conversation context, identify important points, recognize action items, and present information concisely, all while adapting to user preferences.

Solution overview

Beekeeper's solution consists of two main phases: building a baseline leaderboard and personalizing with user feedback.

The system uses several AWS components, including Amazon EventBridge for scheduling, Amazon Elastic Kubernetes Service (Amazon EKS) for orchestration, AWS Lambda for evaluation functions, Amazon Relational Database Service (Amazon RDS) for data storage, and Amazon Mechanical Turk for manual validation.

The workflow begins with a synthetic rank creator that establishes baseline performance. A scheduler triggers the coordinator, which fetches test data and sends it to evaluators. These evaluators test each model/prompt pair and return results, with a portion sent for manual validation. The system mutates promising prompts to create variations, evaluates these again, and saves the best performers. When user feedback arrives, the system incorporates it through a second phase: the coordinator fetches ranked model/prompt pairs and sends them with user feedback to a mutator, which returns personalized prompts. A drift detector makes sure these personalized variations don't stray too far from quality standards, and validated prompts are saved for specific users.
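The two phases can be sketched as a simple orchestration loop. This is a minimal illustration under stated assumptions: the helper names (`evaluate_pair`, `mutate_prompt`, `passes_drift_check`, and so on) are hypothetical stand-ins, not Beekeeper's actual API.

```python
# Minimal sketch of the two-phase workflow; all helper names are hypothetical.

def build_baseline(pairs, test_data, evaluate_pair, mutate_prompt, top_k=2):
    """Phase 1: score every model/prompt pair, then mutate the best ones."""
    leaderboard = sorted(
        ((evaluate_pair(pair, test_data), pair) for pair in pairs),
        reverse=True,
    )
    # Mutate the most promising prompts and evaluate the variants as well
    for _, pair in leaderboard[:top_k]:
        variant = mutate_prompt(pair)
        leaderboard.append((evaluate_pair(variant, test_data), variant))
    return sorted(leaderboard, reverse=True)

def personalize(leaderboard, feedback, mutate_with_feedback, passes_drift_check):
    """Phase 2: fold user feedback in, keeping only prompts that pass drift checks."""
    personalized = []
    for _score, pair in leaderboard:
        candidate = mutate_with_feedback(pair, feedback)
        if passes_drift_check(candidate):  # don't stray from quality standards
            personalized.append(candidate)
    return personalized
```

The real pipeline distributes these steps across EventBridge, EKS, and Lambda; the loop above only shows the control flow.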

Building the baseline leaderboard

To kick-start the optimization journey, Beekeeper engineers selected various models and provided them with domain-specific human-written prompts. The tech team tested these prompts using LLM-generated examples to confirm they were error-free. A solid baseline is crucial here: this foundation helps them refine their approach when incorporating feedback from real users.

In the following sections, we dive into their success metrics, which guide their refinement of prompts and help create an optimal user experience.

Evaluation criteria for the baseline

The quality of summaries generated by model/prompt pairs is measured using both quantitative and qualitative metrics, including the following:

  • Compression ratio – Measures summary length relative to the original text, rewarding adherence to target lengths and penalizing excessive length.
  • Presence of action items – Makes sure user-specific action items are clearly identified.
  • Lack of hallucinations – Validates factual accuracy and consistency.
  • Vector comparison – Assesses semantic similarity to human-generated ideal results.
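The post doesn't specify how the four metric scores are combined into a single leaderboard score, but conceptually it is a weighted aggregation. The sketch below uses illustrative equal weights; the weight values and function names are assumptions, not Beekeeper's actual configuration.

```python
# Hypothetical aggregation of the four metric scores (each 0-100) into one
# leaderboard score; the equal weights here are illustrative only.
DEFAULT_WEIGHTS = {
    "compression": 0.25,
    "action_items": 0.25,
    "hallucination": 0.25,
    "vector_similarity": 0.25,
}

def leaderboard_score(metric_scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of the metric scores for one model/prompt pair."""
    total = sum(weights[name] * metric_scores[name] for name in weights)
    return round(total, 2)
```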

In the following sections, we walk through each of the evaluation criteria and how they are implemented.

Compression ratio

The compression ratio evaluates the length of the summarized text compared to the original one and its adherence to a target length (it rewards compression ratios close to the target and penalizes texts that deviate from the target length). The corresponding score, between 0 and 100, is computed programmatically with the following Python code:

def calculate_compression_score(original_text, compressed_text):
    max_length = 650
    target_ratio = 1 / 5
    margin = 0.05
    max_penalty_points = 100  # Maximum penalty if the summary is too long

    original_length = len(original_text)
    compressed_length = len(compressed_text)

    # Calculate penalty for the summary exceeding the maximum length
    excess_length = max(0, compressed_length - max_length)
    penalty = (excess_length / max_length) * max_penalty_points

    # Calculate the actual compression ratio
    actual_ratio = compressed_length / original_length
    lower_bound = target_ratio * (1 - margin)
    upper_bound = target_ratio * (1 + margin)

    # Calculate the base score based on the compression ratio
    if actual_ratio < lower_bound:
        base_score = 100 * (actual_ratio / lower_bound)
    elif actual_ratio > upper_bound:
        base_score = 100 * (upper_bound / actual_ratio)
    else:
        base_score = 100

    # Apply the penalty to the base score
    score = base_score - penalty

    # Make sure the score doesn't go below 0
    score = max(0, score)

    return round(score, 2)

Presence of action items related to the user

To check whether the summary contains all the action items related to the user, Beekeeper relies on comparison to the ground truth. For the ground truth comparison, the expected output format requires a section labeled "Action items:" followed by bullet points, and regular expressions are used to extract the action item list, as in the following Python code:

import re

def extract_action_items(text):
    action_section = re.search(r'Action items:(.*?)(?=\n\n|\Z)', text, re.DOTALL)

    if action_section:
        action_content = action_section.group(1).strip()
        action_items = re.findall(r'^\s*-\s*(.+)$', action_content, re.MULTILINE)
        return action_items
    else:
        return []

They include this additional extraction step to make sure the data is formatted in a way that the LLM can easily process. The extracted list is sent to an LLM with the request to check whether it is correct or not. A +1 score is assigned for each correctly identified action item, and a -1 is applied in case of a false positive. After that, scores are normalized so as not to penalize or reward summaries with more or fewer action items.
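The +1/-1 scoring with normalization can be sketched as follows. This is a minimal interpretation: in the real pipeline an LLM judges whether each extracted item is correct, which is stubbed out here as a plain set membership check.

```python
def score_action_items(extracted_items, ground_truth_items):
    """+1 per correctly identified action item, -1 per false positive,
    normalized so summaries aren't rewarded just for listing more items.
    Returns a value in [-1, 1]."""
    truth = set(ground_truth_items)
    raw = sum(1 if item in truth else -1 for item in extracted_items)
    total = max(len(extracted_items), len(truth))
    if total == 0:
        return 0.0
    return raw / total
```

The normalization denominator (the larger of the two list lengths) is an assumption; the post only states that scores are normalized.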

Lack of hallucinations

To evaluate hallucinations, Beekeeper uses two approaches: cross-LLM evaluation and manual validation.

In the cross-LLM evaluation, a summary created by LLM A (for example, Mistral Large) is passed to the evaluator component, together with the prompt and the initial input. The evaluator submits this text to LLM B (for example, Anthropic's Claude), asking if the facts from the summary match the raw context. An LLM from a different family is used for this evaluation. Amazon Bedrock makes this exercise particularly simple through the Converse API; users can select different LLMs by changing the model identifier string.
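With the Converse API, swapping the evaluator between model families really is just a change of model ID string. The sketch below builds the request payload in a pure function; the model IDs and prompt wording are illustrative assumptions, and the actual boto3 call (commented out) requires AWS credentials and Bedrock model access.

```python
def build_fact_check_request(model_id, context, summary):
    """Build a Converse API request asking LLM B to verify LLM A's summary."""
    question = (
        "Do the facts in the following summary match the raw context? "
        "Answer yes or no.\n\n"
        f"Context:\n{context}\n\nSummary:\n{summary}"
    )
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": question}]}],
    }

# Swapping evaluator families is only a change of identifier string:
request = build_fact_check_request(
    "anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    "...raw chat transcript...",
    "...summary produced by LLM A...",
)
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**request)
# verdict = response["output"]["message"]["content"][0]["text"]
```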

Another important point is the presence of manual verification on a small set of evaluations at Beekeeper, to avoid cases of double hallucination. They assign a score of 1 if no hallucination was detected and -1 if any is detected. For the whole pipeline, they use the same heuristic of 7% manual evaluation (details discussed further along in this post).

Vector comparison

As an additional evaluation technique, semantic similarity is used for data with available ground truth information. The embedding models are chosen from the MTEB Leaderboard (a multi-task and multi-language comparison of embedding models), favoring large vector dimensionality to maximize the amount of information stored inside the vector. Beekeeper uses Qwen3 as its baseline, a model providing 4096-dimensional vectors and supporting 16-bit quantization for fast computation. Further embedding models are also used directly from Amazon Bedrock. After computing the embedding vectors for both the ground truth answer and the one generated by a given model/prompt pair, cosine similarity is used to compute the similarity, as shown in the following Python code:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(synthetic_summary_embed, generated_summary_embed)
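The snippet above assumes the two embedding vectors already exist. A self-contained illustration of the same computation, with toy vectors standing in for real Qwen3 embeddings:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the ground-truth and generated summary embeddings
synthetic_summary_embed = [0.1, 0.9, 0.3]
generated_summary_embed = [0.1, 0.8, 0.4]
similarity = cosine_sim(synthetic_summary_embed, generated_summary_embed)
```

A score near 1 means the generated summary is semantically close to the ideal one; orthogonal vectors score 0.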

Evaluation baseline

The evaluation baseline of each model/prompt pair is established by collecting the generated output for a set of fixed, predefined queries that are manually annotated with ground truth outputs containing the "true answers" (in this case, the ideal summaries from in-house and public datasets). As mentioned before, this set is created from a public dataset as well as hand-crafted examples better representing a customer's domain. The scores are evaluated automatically based on the metrics described earlier (compression, lack of hallucinations, presence of action items, and vector comparison) to build a baseline version of the leaderboard.

Manual evaluations

For additional validation, Beekeeper manually reviews a scientifically determined sample of evaluations using Amazon Mechanical Turk. This sample size is calculated using Cochran's formula to support statistical significance.
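Cochran's formula gives the required sample size from a desired confidence level and margin of error, with a finite population correction for smaller pools. The parameter values below (95% confidence, 5% margin, p = 0.5) are illustrative defaults, not Beekeeper's actual settings.

```python
import math

def cochran_sample_size(population, p=0.5, margin_of_error=0.05, z=1.96):
    """Cochran's formula with finite population correction.

    p is the estimated proportion, z the z-score for the confidence
    level (1.96 for 95%)."""
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                  # finite correction
    return math.ceil(n)
```

For example, out of 10,000 automated evaluations, about 370 need manual review at 95% confidence and a 5% margin of error.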

Amazon Mechanical Turk enables businesses to harness human intelligence for tasks computers can't perform effectively. This crowdsourcing marketplace connects users with a global, on-demand workforce to complete microtasks like data labeling, content moderation, and evaluation validation, helping to scale operations without sacrificing quality or increasing overhead.

As mentioned earlier, Beekeeper employs human feedback to verify that the automated LLM-based rating system is working correctly. Based on their prior assumptions, they know what percentage of responses should be classified as containing hallucinations. If the number detected by human verification diverges by more than two percentage points from their estimations, they know that the automated process isn't working properly and needs revision.

Now that Beekeeper has established their baseline, they can provide the best results to their customers. By constantly updating their models, they can deliver new value in an automated fashion. Whenever their engineers have ideas for new prompt optimizations, they can let the pipeline evaluate them against previous ones using baseline results. Beekeeper can take it further and embed user feedback, allowing for more customizable results. However, they don't want user feedback to completely change the behavior of their model through prompt injection in feedback. In the following section, we examine the organic part of Beekeeper's pipeline that embeds user preferences into responses without affecting other users.

Evaluation of user feedback

Now that Beekeeper has established their baseline using the ground truth set, they can start incorporating human feedback. This works according to the same principles as the previously described hallucination detection process. User feedback is pulled together with the input and the LLM response. They pass questions to the LLM in the following format:

You are given a task to determine if the hypothesis is in agreement with the context
below. You will only use the contents of the context and not rely on external knowledge.
Answer with yes/no. "context": {{input}} "summary": {{output}} "hypothesis": {{ statement }} "agreement":

They use this to check whether the feedback provided is still applicable after the prompt-model pair was updated. This works as a baseline for incorporating user feedback, and they are then ready to start mutating the prompt. This check avoids feedback being applied multiple times: if a model change or mutation already solved the problem, there is no need to apply it again.

The mutation process consists of re-evaluating the user-generated dataset after prompt mutation until the output incorporates the user feedback; the baseline is then used to understand differences and discard changes in case they undermine the model's work.

The four best-performing model/prompt pairs selected in the baseline evaluation (for mutated prompts) are further processed through a prompt mutation process, to check for residual improvement of the results. This is essential in an environment where even small modifications to a prompt can lead to dramatically different results when used in conjunction with user feedback.

The initial prompt is enriched with a prompt mutation, the received user feedback, a thinking style (a specific cognitive approach like "Make it creative" or "Think in steps" that guides how the LLM approaches the mutation task), and the user context, and is sent to the LLM to produce a mutated prompt. The mutated prompts are added to the list, evaluated, and the corresponding scores are incorporated into the leaderboard. Mutation prompts can also include user feedback when it is present.

Examples of generated mutation prompts include:

"Add hints which could help the LLM solve this problem:"

"Modify the instructions to be simpler:"

"Repeat that instruction in another way:"

"What additional instructions would you give someone to include this feedback {feedback}
 into those instructions:"
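Putting these pieces together, the enrichment step can be sketched as simple template assembly. This is a hypothetical illustration of the structure described above (base prompt, mutation instruction, thinking style, user context, optional feedback), not Beekeeper's actual template.

```python
def build_mutation_request(base_prompt, mutation_instruction,
                           thinking_style, user_context, feedback=None):
    """Assemble the meta-prompt that asks the LLM to produce a mutated prompt."""
    parts = [
        f"Thinking style: {thinking_style}",
        f"User context: {user_context}",
        mutation_instruction,
        f"Original prompt:\n{base_prompt}",
    ]
    if feedback:  # mutation prompts include user feedback when present
        parts.append(f"User feedback to incorporate:\n{feedback}")
    return "\n\n".join(parts)
```

The resulting text is what gets sent to the LLM, whose answer becomes a new candidate prompt for evaluation.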

Solution example

The baseline evaluation process begins with eight pairs of prompts and associated models (Amazon Nova, Anthropic Claude 4 Sonnet, Meta Llama 3, and Mixtral 8x7B). Beekeeper usually starts with four base prompts and two models. These prompts are used across all the models, but results are considered in prompt-model pairs. Models are automatically updated as newer versions become available through Amazon Bedrock.

Beekeeper starts by evaluating the eight existing pairs:

  • Each evaluation requires generating 20 summaries per pair (8 x 20 = 160)
  • Each summary is checked by three static checks and two LLM checks (160 x 2 = 320)

In total, this creates 480 LLM calls. Scores are compared to create a leaderboard, and two prompt-model pairs are selected. These two prompts are mutated using user feedback, creating 10 new prompts, which are again evaluated, creating 600 calls to the LLM (10 x 20 + 10 x 20 x 2 = 600).
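The arithmetic behind these counts: one generation call per summary plus two LLM-check calls per summary (the three static checks don't hit an LLM). As a quick sanity check:

```python
# Baseline round: 8 pairs, 20 summaries each, 2 LLM checks per summary
pairs, summaries_per_pair, llm_checks = 8, 20, 2
generation_calls = pairs * summaries_per_pair       # 8 x 20 = 160
check_calls = generation_calls * llm_checks         # 160 x 2 = 320
baseline_calls = generation_calls + check_calls     # 480 total

# Mutation round: 10 new prompts evaluated the same way
mutated = 10
mutation_calls = mutated * summaries_per_pair * (1 + llm_checks)  # 600 total
```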

This process can be run n times to perform more creative mutations; Beekeeper usually performs two cycles.

In total, this exercise performs checks on (8 + 10 + 10) x 2 model/prompt pairs. The whole process on average requires around 8,352,000 input tokens and around 1,620,000 output tokens, costing around $48.

Newly selected model/prompt pairs are used in production with ratios 1st: 50%, 2nd: 30%, and 3rd: 20%.

After deploying the new model/prompt pairs, Beekeeper gathers feedback from the users. This feedback is fed to the mutator to create three new prompts. These prompts are sent for drift detection, which compares them to the baseline. In total, they create four LLM calls, costing around 4,800 input tokens and 500 output tokens.
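The 50/30/20 production split can be implemented with weighted random choice. A minimal sketch (function and argument names are illustrative, not Beekeeper's code):

```python
import random

def pick_pair(ranked_pairs, weights=(0.5, 0.3, 0.2), rng=random):
    """Route a request to one of the top-ranked model/prompt pairs,
    favoring the leaderboard winner (1st: 50%, 2nd: 30%, 3rd: 20%)."""
    top = ranked_pairs[: len(weights)]
    return rng.choices(top, weights=weights[: len(top)], k=1)[0]
```

Serving several candidates at once keeps production traffic flowing to proven pairs while still collecting feedback on the runners-up.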

Benefits

The key benefit of Beekeeper's solution is its ability to rapidly evolve and adapt to user needs. With this approach, they can make initial estimations of which model/prompt pairs will be optimal candidates for each task, while controlling both cost and the quality of results. By combining the benefits of synthetic data with user feedback, the solution is suitable even for smaller engineering teams.

Instead of focusing on generic prompts, Beekeeper prioritizes tailoring the prompt improvement process to meet the unique needs of each tenant. By doing so, they can refine prompts to be highly relevant and user-friendly. This approach allows users to develop their own style, which in turn enhances their experience as they provide feedback and see its impact. One of the side effects they observed is that certain groups of people prefer different styles of communication; by mapping these results to customer interactions, they aim to present a more tailored experience. This makes sure that feedback given by one user doesn't impact another. Their initial results suggest 13–24% better scores on responses when aggregated per tenant.

In summary, the proposed solution offers several notable benefits. It reduces manual labor by automating the LLM and prompt selection process, shortens the feedback cycle, enables the creation of user- or tenant-specific improvements, and provides the capacity to seamlessly integrate and estimate the performance of new models in the same manner as the previous ones.

Conclusion

Beekeeper's automated leaderboard approach and human feedback loop for dynamic LLM and prompt pair selection address the key challenges organizations face in navigating the rapidly evolving landscape of language models. By continuously evaluating and optimizing quality, size, speed, and cost, the solution helps customers use the best-performing model/prompt combinations for their specific use cases.

Looking ahead, Beekeeper plans to further refine and expand the capabilities of this system, incorporating more advanced techniques for prompt engineering and evaluation. Additionally, the team is exploring ways to empower users to develop their own customized prompts, fostering a more personalized and engaging experience.

If your organization is exploring ways to optimize LLM selection and prompt engineering, there's no need to start from scratch. Using AWS services like Amazon Bedrock for model access, AWS Lambda for lightweight evaluation, Amazon EKS for orchestration, and Amazon Mechanical Turk for human validation, you can build a pipeline that automatically evaluates, ranks, and evolves your prompts. Instead of manually updating prompts or re-benchmarking models, focus on creating a feedback-driven system that continuously improves results for your users. Start with a small set of models and prompts, define your evaluation metrics, and let the system scale as new models and use cases emerge.


About the authors

Mike (Michał) Koźmiński is a Zürich-based Principal Engineer at Beekeeper by LumApps, where he builds the foundations that make AI a first-class part of the product. With 10+ years spanning startups and enterprises, he focuses on translating new technology into reliable systems and real customer impact.

Magdalena Gargas is a Solutions Architect passionate about technology and solving customer challenges. At AWS, she works mostly with software companies, helping them innovate in the cloud.

Luca Perrozzi is a Solutions Architect at Amazon Web Services (AWS), based in Switzerland. He focuses on innovation topics at AWS, especially in the area of artificial intelligence. Luca holds a PhD in particle physics and has 15 years of hands-on experience as a research scientist and software engineer.

Simone Pomata is a Principal Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.
