Unlock model insights with log probability support for Amazon Bedrock Custom Model Import
You can use Amazon Bedrock Custom Model Import to seamlessly integrate your customized models—such as Llama, Mistral, and Qwen—that you've fine-tuned elsewhere into Amazon Bedrock. The experience is fully serverless, minimizing infrastructure management while providing your imported models with the same unified API access as native Amazon Bedrock models. Your custom models benefit from automatic scaling, enterprise-grade security, and native integration with Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
Understanding how confident a model is in its predictions is essential for building reliable AI applications, particularly when working with specialized custom models that may encounter domain-specific queries.
With log probability support now added to Custom Model Import, you can access information about your models' confidence in their predictions at the token level. This enhancement provides greater visibility into model behavior and enables new capabilities for model evaluation, confidence scoring, and advanced filtering techniques.
In this post, we explore how log probabilities work with imported models in Amazon Bedrock. You'll learn what log probabilities are, how to enable them in your API calls, and how to interpret the returned data. We also highlight practical applications—from detecting potential hallucinations to optimizing RAG systems and evaluating fine-tuned models—that demonstrate how these insights can improve your AI applications, helping you build more trustworthy solutions with your custom models.
Understanding log probabilities
In language models, a log probability represents the logarithm of the probability that the model assigns to a token in a sequence. These values indicate how confident the model is about each token it generates or processes. Log probabilities are expressed as negative numbers, with values closer to zero indicating higher confidence. For example, a log probability of -0.1 corresponds to roughly 90% confidence, while a value of -3.0 corresponds to about 5% confidence. By analyzing these values, you can identify when a model is highly certain versus when it's making less confident predictions. Log probabilities provide a quantitative measure of how likely the model considered each generated token, offering valuable insight into the confidence of its output. By analyzing them, you can:
- Gauge confidence across a response: Assess how confident the model was in different sections of its output, helping you identify where it was certain versus uncertain.
- Score and compare outputs: Compare overall sequence likelihood (by summing or averaging log probabilities) to rank or filter multiple model outputs.
- Detect potential hallucinations: Identify sudden drops in token-level confidence, which can flag segments that might require verification or review.
- Reduce RAG costs with early pruning: Run short, low-cost draft generations based on retrieved contexts, compute log probabilities for those drafts, and discard low-scoring candidates early, avoiding unnecessary full-length generations or expensive reranking while keeping only the most promising contexts in the pipeline.
- Build confidence-aware applications: Adapt system behavior based on certainty levels—for example, trigger clarifying prompts, provide fallback responses, or flag content for human review.
Overall, log probabilities are a powerful tool for interpreting and debugging model responses with measurable certainty—particularly valuable for applications where understanding why a model responded in a certain way can be as important as the response itself.
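As a quick illustration of the arithmetic above, the following minimal snippet converts log probabilities into intuitive confidence percentages with the exponential function:

```python
import math

# Convert example log probabilities into confidence percentages
for logprob in (-0.1, -3.0):
    print(f"log probability {logprob} -> {math.exp(logprob):.1%} confidence")

# log probability -0.1 -> 90.5% confidence
# log probability -3.0 -> 5.0% confidence
```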
Prerequisites
To use log probability support with Custom Model Import in Amazon Bedrock, you need:
- An active AWS account with access to Amazon Bedrock
- A custom model created in Amazon Bedrock using the Custom Model Import feature after July 31, 2025, when log probability support was launched
- Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime
Introducing log probability support in Amazon Bedrock
With this launch, Amazon Bedrock now allows models imported using the Custom Model Import feature to return token-level log probabilities as part of the inference response.
When invoking a model through the Amazon Bedrock InvokeModel API, you can access token log probabilities by setting "return_logprobs": true in the JSON request body. With this flag enabled, the model's response will include additional fields providing log probabilities for both the prompt tokens and the generated tokens, so that customers can analyze the model's confidence in its predictions. These log probabilities let you quantitatively assess how confident your custom models are when processing inputs and generating responses. The granular metrics allow for better evaluation of response quality, troubleshooting of unexpected outputs, and optimization of prompts or model configurations.
Let's walk through an example of invoking a custom model on Amazon Bedrock with log probabilities enabled and examine the output format. Suppose you have already imported a custom model (for instance, a fine-tuned Llama 3.2 1B model) into Amazon Bedrock and have its model Amazon Resource Name (ARN). You can invoke this model using the Amazon Bedrock Runtime SDK (Boto3 for Python in this example) as shown in the following example:
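A minimal sketch of such an invocation, assuming a Llama-style request format; the ARN is a placeholder, and parameter names other than return_logprobs (such as max_gen_len and stop) depend on the model family you imported:

```python
import json
import boto3

# Bedrock Runtime client; use the Region where your model was imported
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN for your imported model
model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/your-model-id"

request_body = {
    "prompt": "The quick brown fox jumps",
    "max_gen_len": 50,         # maximum generation length
    "temperature": 0.5,        # moderate randomness
    "stop": [".", "\n"],       # stop at a period or a newline
    "return_logprobs": True,   # ask Bedrock to return token-level log probabilities
}

response = client.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_body),
)

result = json.loads(response["body"].read())
print(json.dumps(result, indent=2))
```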
In the preceding code, we send a prompt—"The quick brown fox jumps"—to our custom imported model. We configure standard inference parameters: a maximum generation length of 50 tokens, a moderate temperature of 0.5 for moderate randomness, and a stop condition (either a period or a newline). The "return_logprobs": True parameter tells Amazon Bedrock to return log probabilities in the response.
The InvokeModel API returns a JSON response containing three main components: the standard generated text output, metadata about the generation process, and now log probabilities for both prompt and generated tokens. These values reveal the model's internal confidence for each token prediction, so you can understand not just what text was produced, but how certain the model was at each step of the process. The following is an example response from the "quick brown fox jumps" prompt, showing log probabilities (appearing as negative numbers):
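The exact payload depends on your model, but an illustrative sketch of the response shape (with placeholder token IDs, and values chosen to match the percentages discussed below) looks roughly like this:

```json
{
  "generation": " over the lazy dog.",
  "prompt_token_count": 6,
  "generation_token_count": 5,
  "stop_reason": "stop",
  "prompt_logprobs": [
    null,
    {"791": -3.61, "14924": -1.18},
    {"4062": -9.21},
    {"14198": -2.30},
    {"39935": -0.05},
    {"35308": -0.21}
  ],
  "logprobs": [
    {"927": -0.09},
    {"279": -0.05},
    {"16053": -0.11},
    {"5679": -0.07},
    {"13": -1.19}
  ]
}
```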
The raw API response provides token IDs paired with their log probabilities. To make this data interpretable, we need to first decode the token IDs using the appropriate tokenizer (in this case, the Llama 3.2 1B tokenizer), which maps each ID back to its actual text token. Then we convert log probabilities to probabilities by applying the exponential function, translating these values into more intuitive probabilities between 0 and 1. We've implemented these transformations using custom code (not shown here) to produce a human-readable format where each token appears alongside its probability, making the model's confidence in its predictions immediately clear.
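A minimal sketch of that kind of post-processing, assuming the matching Hugging Face tokenizer is available locally and reusing the result dictionary from the earlier invocation sketch (the helper name is illustrative):

```python
import math
from transformers import AutoTokenizer

# Assumes you have local access to the tokenizer that matches your imported model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def readable_logprobs(entries):
    """Decode token IDs to text and convert log probabilities to percentages."""
    readable = []
    for entry in entries:
        if entry is None:  # the first prompt token has no preceding context
            readable.append(None)
            continue
        readable.append({
            tokenizer.decode([int(token_id)]): f"{math.exp(logprob):.1%}"
            for token_id, logprob in entry.items()
        })
    return readable

print(readable_logprobs(result["prompt_logprobs"]))
print(readable_logprobs(result["logprobs"]))
```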
Let's break down what this tells us about the model's internal processing:
- generation: This is the actual text generated by the model (in our example, it's a continuation of the prompt we sent to the model). This is the same field you'd normally get from any model invocation.
- prompt_token_count and generation_token_count: These indicate the number of tokens in the input prompt and in the output, respectively. In our example, the prompt was tokenized into six tokens, and the model generated five tokens in its completion.
- stop_reason: The reason the generation stopped ("stop" means the model naturally stopped at a stop sequence or end-of-text, "length" means it hit the max token limit, and so on). In our case it shows "stop", indicating the model stopped on its own or because of the stop condition we provided.
- prompt_logprobs: This array provides log probabilities for each token in the prompt. As the model processes your input, it continuously predicts what should come next based on what it has seen so far. These values measure which tokens in your prompt were expected or surprising to the model.
  - The first entry is None because the very first token has no preceding context. The model cannot predict anything without prior information. Each subsequent entry contains token IDs mapped to their log probabilities. We've converted these IDs to readable text and transformed the log probabilities into percentages for easier understanding.
  - You can observe the model's increasing confidence as it processes familiar sequences. For example, after seeing The quick brown, the model predicted fox with 95.1% confidence. After seeing the full context up to fox, it predicted jumps with 81.1% confidence.
  - Many positions show multiple tokens with their probabilities, revealing alternatives the model considered. For instance, at the second position, the model evaluated both The (2.7%) and Question (30.6%), which suggests the model considered both tokens viable at that position. This added visibility helps you understand where the model weighed alternatives and can reveal when it was more uncertain or had difficulty choosing among multiple options.
  - Notably low probabilities appear for some tokens—quick received just 0.01%—indicating the model found these words unexpected in their context.
  - The overall pattern tells a clear story: individual words initially received low probabilities, but as the complete quick brown fox jumps phrase emerged, the model's confidence increased dramatically, showing it recognized this as a familiar expression.
  - When multiple tokens in your prompt consistently receive low probabilities, your phrasing might be unusual for the model. This uncertainty can affect the quality of completions. Using these insights, you can reformulate prompts to better align with patterns the model encountered in its training data.
- logprobs: This array contains log probabilities for each token in the model's generated output. The format is similar: a dictionary mapping token IDs to their corresponding log probabilities.
  - After decoding these values, we can see that the tokens over, the, lazy, and dog all have high probabilities. This demonstrates the model recognized it was completing the well-known phrase the quick brown fox jumps over the lazy dog—a common pangram the model appears to have strong familiarity with.
  - In contrast, the final period (newline) token has a much lower probability (30.3%), revealing the model's uncertainty about how to conclude the sentence. This makes sense because the model had several valid options: ending the sentence with a period, continuing with additional content, or choosing another punctuation mark altogether.
Practical use cases of log probabilities
Token-level log probabilities from the Custom Model Import feature provide valuable insights into your model's decision-making process. These metrics transform how you interact with your custom models by revealing their confidence levels for each generated token. Here are impactful ways to use these insights:
Ranking multiple completions
You can use log probabilities to quantitatively rank multiple generated outputs for the same prompt. When your application needs to choose between different possible completions—whether for summarization, translation, or creative writing—you can calculate each completion's overall likelihood by averaging or summing the log probabilities across all its tokens.
Example:
Prompt: Translate the phrase "Battre le fer pendant qu'il est chaud"
- Completion A: "Strike while the iron is hot" (Average log probability: -0.39)
- Completion B: "Beat the iron while it's hot." (Average log probability: -0.46)
In this example, Completion A receives a higher log probability score (closer to zero), indicating the model found this idiomatic translation more natural than the more literal Completion B. This numerical approach enables your application to automatically select the most probable output or present multiple candidates ranked by the model's confidence level.
This ranking capability extends beyond translation to many scenarios where multiple valid outputs exist—including content generation, code completion, and creative writing—providing an objective quality metric based on the model's confidence rather than relying solely on subjective human judgment.
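A minimal sketch of this ranking logic, reusing the logprobs field returned for each completion (the helper names and candidate data are illustrative):

```python
def average_logprob(generation_logprobs):
    """Average log probability across a completion's generated tokens.

    Assumes each entry maps the sampled token's ID to its log probability;
    if alternatives are also present, the highest value is used as a proxy.
    """
    values = [max(entry.values()) for entry in generation_logprobs if entry]
    return sum(values) / len(values)

def rank_completions(candidates):
    """Sort candidate completions from most to least confident."""
    return sorted(candidates, key=lambda c: average_logprob(c["logprobs"]), reverse=True)

# Illustrative candidates with pre-computed token log probabilities
candidates = [
    {"generation": "Strike while the iron is hot", "logprobs": [{"1": -0.39}]},
    {"generation": "Beat the iron while it's hot.", "logprobs": [{"2": -0.46}]},
]
print(rank_completions(candidates)[0]["generation"])  # Strike while the iron is hot
```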
Detecting hallucinations and low-confidence answers
Models can produce hallucinations—plausible-sounding but factually incorrect statements—when handling ambiguous prompts, complex queries, or topics outside their expertise. Log probabilities provide a practical way to detect these instances by revealing the model's internal uncertainty, helping you identify potentially inaccurate information even when the output appears confident.
By analyzing token-level log probabilities, you can identify which parts of a response the model was likely uncertain about, even when the text appears confident on the surface. This capability is especially valuable in retrieval-augmented generation (RAG) systems, where responses should be grounded in retrieved context. When a model has relevant information available, it typically generates answers with higher confidence. Conversely, low confidence across multiple tokens suggests the model might be generating content without sufficient supporting information.
Example:
- Prompt:
- Model output:
In this example, we intentionally asked about a fictional metric—Portfolio Synergy Quotient (PSQ)—to demonstrate how log probabilities reveal uncertainty in model responses. Despite generating a professional-sounding definition for this non-existent financial concept, the token-level confidence scores tell a revealing story. The confidence scores shown below are derived by applying the exponential function to the log probabilities returned by the model.
- PSQ shows medium confidence (63.8%), indicating that the model recognized the acronym format but wasn't highly certain about this specific term.
- Common finance terminology like classes (98.2%) and portfolio (92.8%) exhibit high confidence, likely because these are standard concepts widely used in financial contexts.
- Critical connecting concepts show notably low confidence: measure (14.0%) and diversification (31.8%), revealing the model's uncertainty when attempting to explain what PSQ means or does.
- Functional words like is (45.9%) and of (56.6%) hover in the medium confidence ranges, suggesting uncertainty about the overall structure of the explanation.
By identifying these low-confidence segments, you can implement targeted safeguards in your applications—such as flagging content for verification, retrieving additional context, generating clarifying questions, or applying confidence thresholds for sensitive information. This approach helps create more reliable AI systems that can distinguish between high-confidence knowledge and uncertain responses.
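One way to apply such a safeguard is sketched below, with an illustrative confidence threshold and decoded (token, log probability) pairs that mirror the PSQ example:

```python
import math

CONFIDENCE_THRESHOLD = 0.35  # illustrative cutoff; tune for your application

def flag_low_confidence(token_logprobs, threshold=CONFIDENCE_THRESHOLD):
    """Return (token, probability) pairs whose confidence falls below the threshold."""
    flagged = []
    for token, logprob in token_logprobs:
        probability = math.exp(logprob)
        if probability < threshold:
            flagged.append((token, probability))
    return flagged

# Illustrative decoded tokens and log probabilities from the PSQ answer
answer_tokens = [
    ("PSQ", math.log(0.638)), ("is", math.log(0.459)), ("a", math.log(0.70)),
    ("measure", math.log(0.14)), ("of", math.log(0.566)), ("diversification", math.log(0.318)),
]

for token, probability in flag_low_confidence(answer_tokens):
    print(f"Review needed: '{token}' generated with only {probability:.1%} confidence")
```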
Monitoring prompt quality
When engineering prompts for your application, log probabilities reveal how well the model understands your instructions. If the first few generated tokens show unusually low probabilities, it often signals that the model struggled to interpret what you're asking.
By monitoring the average log probability of the initial tokens—typically the first 5–10 generated tokens—you can quantitatively measure prompt clarity. Well-structured prompts with clear context typically produce higher probabilities because the model immediately knows what to do. Vague or underspecified prompts often yield lower initial token likelihoods as the model hesitates or searches for direction.
Example:
Prompt comparison for customer service responses:
- Basic prompt:
- Average log probability of first 5 tokens: -1.215 (lower confidence)
- Optimized prompt:
- Average log probability of first 5 tokens: -0.333 (higher confidence)
The optimized prompt generates higher log probabilities, demonstrating that precise instructions and clear context reduce the model's uncertainty. Rather than making absolute judgments about prompt quality, this approach lets you measure relative improvement between versions. You can directly observe how specific elements—role definitions, contextual details, and explicit expectations—increase model confidence. By systematically measuring these confidence scores across different prompt iterations, you build a quantitative framework for prompt engineering that reveals exactly when and how your instructions become unclear to the model, enabling continuous data-driven refinement.
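A minimal sketch of this measurement over the first five generated tokens; the logprobs values below are made up to mirror the comparison above:

```python
def initial_confidence(logprobs, first_n=5):
    """Average log probability of the first N generated tokens."""
    head = [max(entry.values()) for entry in logprobs[:first_n] if entry]
    return sum(head) / len(head)

# Illustrative logprobs arrays for two prompt variants
basic_logprobs = [{"1": -1.6}, {"2": -1.4}, {"3": -1.1}, {"4": -0.9}, {"5": -1.1}]
optimized_logprobs = [{"1": -0.4}, {"2": -0.3}, {"3": -0.3}, {"4": -0.35}, {"5": -0.3}]

print(initial_confidence(basic_logprobs))      # about -1.22, lower confidence
print(initial_confidence(optimized_logprobs))  # about -0.33, higher confidence
```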
Reducing RAG costs with early pruning
In traditional RAG implementations, systems retrieve 5–20 documents and generate full responses using these retrieved contexts. This approach drives up inference costs because every retrieved context consumes tokens regardless of its actual usefulness.
Log probabilities enable a more cost-effective alternative through early pruning. Instead of immediately processing the retrieved documents in full:
- Generate draft responses based on each retrieved context
- Calculate the average log probability across these short drafts
- Rank contexts by their average log probability scores
- Discard low-scoring contexts that fall below a confidence threshold
- Generate the complete response using only the highest-confidence contexts
This approach works because contexts that contain relevant information produce higher log probabilities in the draft generation phase. When the model encounters helpful context, it generates text with greater confidence, reflected in log probabilities closer to zero. Conversely, irrelevant or tangential contexts produce more uncertain outputs with lower log probabilities.
By filtering contexts before full generation, you can reduce token consumption while maintaining or even improving answer quality. This shifts the process from a brute-force approach to a targeted pipeline that directs full generation only toward contexts where the model demonstrates genuine confidence in the source material.
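A minimal sketch of that pipeline; generate_draft and generate_full_answer are hypothetical wrappers around your own model calls (for example, the invoke_model sketch shown earlier with a small and a large max_gen_len):

```python
def average_logprob(logprobs):
    """Average log probability of the tokens in a draft generation."""
    values = [max(entry.values()) for entry in logprobs if entry]
    return sum(values) / len(values)

def prune_contexts(question, contexts, generate_draft, score_threshold=-1.0, keep_top=2):
    """Score each retrieved context with a short draft and keep only the most promising ones."""
    scored = []
    for context in contexts:
        draft = generate_draft(question, context)       # short, low-cost generation
        score = average_logprob(draft["logprobs"])       # confidence of the draft
        if score >= score_threshold:                     # discard clearly weak contexts early
            scored.append((score, context))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [context for _, context in scored[:keep_top]]

# Usage with hypothetical helpers:
# best_contexts = prune_contexts(question, retrieved_docs, generate_draft)
# final_answer = generate_full_answer(question, best_contexts)
```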
Fine-tuning evaluation
When you have fine-tuned a model for your specific domain, log probabilities offer a quantitative way to assess the effectiveness of your training. By analyzing confidence patterns in responses, you can determine whether your model has developed proper calibration—showing high confidence for correct domain-specific answers and appropriate uncertainty elsewhere.
A well-calibrated fine-tuned model should assign higher probabilities to accurate information within its specialized area while maintaining lower confidence when operating outside its training domain. Problems with calibration appear in two main forms. Overconfidence occurs when the model assigns high probabilities to incorrect responses, suggesting it hasn't properly learned the boundaries of its knowledge. Underconfidence manifests as consistently low probabilities despite producing accurate answers, indicating that training might not have sufficiently reinforced correct patterns.
By systematically testing your model across different scenarios and analyzing the log probabilities, you can identify areas needing additional training or detect potential biases in your current approach. This creates a data-driven feedback loop for iterative improvements, making sure your model performs reliably within its intended scope while maintaining appropriate boundaries around its expertise.
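A lightweight way to run such a check is sketched below; the domain buckets are illustrative, and score_answer stands in for a call that invokes your model and returns its logprobs array:

```python
def average_logprob(logprobs):
    """Average log probability of the generated tokens in one answer."""
    values = [max(entry.values()) for entry in logprobs if entry]
    return sum(values) / len(values)

def calibration_report(test_prompts, score_answer):
    """Compare average model confidence on in-domain versus out-of-domain prompts."""
    report = {}
    for domain, prompts in test_prompts.items():
        scores = [average_logprob(score_answer(prompt)["logprobs"]) for prompt in prompts]
        report[domain] = sum(scores) / len(scores)
    return report

# For a well-calibrated fine-tuned model, the in-domain average should sit
# noticeably closer to zero than the out-of-domain average, for example:
# report = calibration_report({"in_domain": [...], "out_of_domain": [...]}, score_answer)
```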
Getting started
Here's how to start using log probabilities with models imported through the Amazon Bedrock Custom Model Import feature:
- Enable log probabilities in your API calls: Add "return_logprobs": true to your request payload when invoking your custom imported model. This parameter works with both the InvokeModel and InvokeModelWithResponseStream APIs. Start with familiar prompts to observe which tokens your model predicts with high confidence compared to which it finds surprising.
- Analyze confidence patterns in your custom models: Examine how your fine-tuned or domain-adapted models respond to different inputs. The log probabilities reveal whether your model is appropriately calibrated for your specific domain—showing high confidence where it should be certain.
- Develop confidence-aware applications: Implement practical use cases such as hallucination detection, response ranking, and content verification to make your applications more robust. For example, you can flag low-confidence sections of responses for human review or select the highest-confidence response from multiple generations.
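If you use the streaming API, the flag is set the same way; the following minimal sketch assumes the streamed chunks carry the generated text and log probability fields described in this post, and the exact chunk contents may vary by model family:

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

request_body = {
    "prompt": "The quick brown fox jumps",
    "return_logprobs": True,  # same flag as with InvokeModel
}

response = client.invoke_model_with_response_stream(
    modelId="arn:aws:bedrock:us-east-1:123456789012:imported-model/your-model-id",  # placeholder ARN
    body=json.dumps(request_body),
)

# Each streamed event contains a JSON chunk with partial output
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    print(chunk)
```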
Conclusion
Log probability support for Amazon Bedrock Custom Model Import offers enhanced visibility into model decision-making. This feature transforms previously opaque model behavior into quantifiable confidence metrics that developers can analyze and use.
Throughout this post, we have demonstrated how to enable log probabilities in your API calls, interpret the returned data, and use these insights for practical applications. From detecting potential hallucinations and ranking multiple completions to optimizing RAG systems and evaluating fine-tuning quality, log probabilities offer tangible benefits across diverse use cases.
For customers working with customized foundation models like Llama, Mistral, or Qwen, these insights address a fundamental challenge: understanding not just what a model generates, but how confident it is in its output. This distinction becomes critical when deploying AI in domains requiring high reliability—such as finance, healthcare, or enterprise applications—where incorrect outputs can have significant consequences.
By revealing confidence patterns across different types of queries, log probabilities help you assess how well your model customizations have affected calibration, highlighting where your model excels and where it might need refinement. Whether you're evaluating fine-tuning effectiveness, debugging unexpected responses, or building systems that adapt to varying confidence levels, this capability represents an important advancement in bringing greater transparency and control to generative AI development on Amazon Bedrock.
We look forward to seeing how you use log probabilities to build more intelligent and trustworthy applications with your custom imported models. This capability demonstrates the commitment from Amazon Bedrock to provide developers with tools that enable confident innovation while delivering the scalability, security, and ease of a fully managed service.
About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by robust MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on improving efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional interests, Revendra enjoys staying active by playing tennis and hiking.