Evaluate models with the Amazon Nova evaluation container using Amazon SageMaker AI


This blog post introduces the new Amazon Nova model evaluation features in Amazon SageMaker AI. This launch adds custom metrics support, LLM-based preference testing, log probability capture, metadata analysis, and multi-node scaling for large evaluations.

The new features include:

  • Custom metrics use bring your own metrics (BYOM) functions to tailor evaluation criteria to your use case.
  • Nova LLM-as-a-Judge handles subjective evaluations through pairwise A/B comparisons, reporting win/tie/loss ratios and Bradley-Terry scores with explanations for each judgment.
  • Token-level log probabilities reveal model confidence, useful for calibration and routing decisions.
  • Metadata passthrough retains per-row fields for analysis by customer segment, domain, difficulty, or priority level without extra processing.
  • Multi-node execution distributes workloads while maintaining stable aggregation, scaling evaluation datasets from thousands to millions of examples.

In SageMaker AI, teams can define model evaluations using JSONL files in Amazon Simple Storage Service (Amazon S3), then execute them as SageMaker training jobs with control over pre- and post-processing workflows, with results delivered as structured JSONL containing per-example and aggregated metrics and detailed metadata. Teams can then integrate results with analytics tools like Amazon Athena and AWS Glue, or route them directly into existing observability stacks, with consistent results.

The rest of this post introduces the new features and then demonstrates step by step how to set up evaluations, run judge experiments, capture and analyze log probabilities, use metadata for analysis, and configure multi-node runs in an IT support ticket classification example.

Features for model evaluation using Amazon SageMaker AI

When choosing which models to bring into production, proper evaluation methodology requires testing multiple models, including customized versions in SageMaker AI. To do so effectively, teams need identical test conditions, passing the same prompts, metrics, and evaluation logic to different models. This makes sure ranking differences reflect model performance, not evaluation methods.

Amazon Nova models that are customized in SageMaker AI now inherit the same evaluation infrastructure as base models, making for a fair comparison. Results land as structured JSONL in Amazon S3, ready for Athena queries or routing to your observability stack. Let's discuss some of the new features available for model evaluation.

Bring your own metrics (BYOM)

Standard metrics might not always fit your specific requirements. The custom metrics feature uses AWS Lambda functions to handle data preprocessing, output post-processing, and metric calculation. For example, a customer service bot needs empathy and brand consistency metrics; a medical assistant might require clinical accuracy measures. With custom metrics, you can test what matters for your domain.

With this feature, pre- and post-processor functions are encapsulated in a Lambda function that processes data before inference to normalize formats or inject context, and then calculates your custom metrics in a post-processing function after the model responds. Finally, the results are aggregated using your choice of min, max, average, or sum, offering greater flexibility when different test examples carry varying importance.
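As a quick illustration of the aggregation step (assuming the post-processor has already produced one score per example), the four built-in options reduce to simple reductions:

```python
# Hypothetical per-example scores, as a custom post-processor might return them
scores = [0.25, 0.5, 0.75, 1.0]

# The four built-in aggregation options map to simple reductions
aggregations = {
    "min": min(scores),
    "max": max(scores),
    "average": sum(scores) / len(scores),
    "sum": sum(scores),
}
print(aggregations)  # {'min': 0.25, 'max': 1.0, 'average': 0.625, 'sum': 2.5}
```

Choosing `sum` over `average` matters when examples are weighted unevenly; `min` is useful as a worst-case gate.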

Multimodal LLM-as-a-judge evaluation

LLM-as-a-judge automates preference testing for text as well as multimodal tasks, using Amazon Nova LLM-as-a-Judge models for response comparison. The system implements pairwise evaluation: for each prompt, it compares baseline and challenger responses, running the comparison in both forward and backward passes to detect positional bias. The output includes Bradley-Terry probabilities (the likelihood that one response is preferred over another) with bootstrap-sampled confidence intervals, giving statistical confidence in preference results.

Nova LLM-as-a-Judge models are purposefully customized for judging-related evaluation tasks. Each judgment includes a natural language rationale explaining why the judge preferred one response over the other, supporting targeted improvements rather than blind optimization. Nova LLM-as-a-Judge evaluates complex reasoning tasks like support ticket classification, where nuanced understanding matters more than simple keyword matching.

Tie detection is equally valuable, identifying where models have reached parity. Combined with standard error metrics, you can determine whether performance differences are statistically significant or within noise margins; this is critical when deciding whether a model update justifies deployment.
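To make the statistics concrete, here is a minimal sketch of the two quantities mentioned above: the Bradley-Terry preference probability and a bootstrap confidence interval over judge outcomes. This illustrates the idea only; it is not the container's implementation.

```python
import random

def bradley_terry_probability(strength_a: float, strength_b: float) -> float:
    """P(A preferred over B) under the Bradley-Terry model."""
    return strength_a / (strength_a + strength_b)

def bootstrap_preference_ci(outcomes, n_resamples=1000, seed=0):
    """Bootstrap a 95% confidence interval for the preference rate.

    outcomes: list of 1 (challenger preferred) / 0 (baseline preferred),
    pooled across forward and backward judge passes.
    """
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    return rates[int(0.025 * n_resamples)], rates[int(0.975 * n_resamples)]

# Example: a challenger with twice the strength is preferred about 2/3 of the time
print(bradley_terry_probability(2.0, 1.0))  # ≈ 0.667
```

A wide interval from `bootstrap_preference_ci` is exactly the "within noise margins" case above, where a deployment decision should wait for more data.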

Use log probabilities for model evaluation

Log probabilities provide model confidence for each generated token, revealing insights into model uncertainty and prediction quality. They support calibration studies, confidence routing, and hallucination detection beyond basic accuracy. Token-level confidence helps identify uncertain predictions, enabling more reliable systems.

The Nova evaluation container with SageMaker AI model evaluation now captures token-level log probabilities during inference for uncertainty-aware evaluation workflows. The feature integrates with evaluation pipelines and provides the foundation for advanced diagnostic capabilities. You can correlate model confidence with actual performance, implement quality gates based on uncertainty thresholds, and detect potential issues before they affect production systems. Add log probability capture by adding the top_logprobs parameter to your evaluation configuration:

# Add log probability capture to your evaluation configuration
inference:
  max_new_tokens: 2048
  temperature: 0
  top_logprobs: 10  # Captures the top 10 log probabilities per token

When combined with the metadata passthrough feature discussed in the next section, log probabilities enable stratified confidence analysis across different data segments and use cases. This combination provides detailed insight into model behavior, so teams can understand not just where models fail, but why they fail and how confident they are in their predictions, giving them more control over calibration.

Pass metadata when using model evaluation

Custom datasets now support metadata fields when preparing the evaluation dataset. Metadata helps compare results across different models and datasets. The metadata field accepts any string for tagging and analysis alongside the input data and evaluation results. With the addition of the metadata field, the overall schema per data point in the JSONL file becomes the following:

{
  "query": "(Required) String containing the question or instruction that needs an answer",
  "response": "(Required) String containing the expected model output",
  "system": "(Optional) String containing the system prompt that sets the behavior, role, or persona of the AI model before it processes the query",
  "metadata": "(Optional) String containing metadata associated with the entry for tagging purposes."
}
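A minimal validator for this schema might look like the following sketch (illustrative only; the field names come from the schema above, and the helper is not part of any SDK):

```python
import json

REQUIRED_FIELDS = {"query", "response"}
OPTIONAL_FIELDS = {"system", "metadata"}

def validate_eval_record(line: str) -> list:
    """Return a list of schema problems for one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, value in record.items():
        if field not in REQUIRED_FIELDS | OPTIONAL_FIELDS:
            problems.append(f"unexpected field: {field}")
        elif not isinstance(value, str):
            problems.append(f"field {field} must be a string")
    return problems

print(validate_eval_record('{"query": "q", "response": "r", "metadata": "{}"}'))  # []
```

Running a check like this before launching a job catches malformed rows early, rather than mid-evaluation.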

Enable multi-node evaluation

The evaluation container supports multi-node evaluation for faster processing. To enable it, set the replicas parameter to a value greater than one.
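For example, a recipe fragment might look like the following. Treat this as an illustrative sketch: the exact placement of the replicas key depends on your recipe.

```yaml
# Illustrative recipe fragment: request four nodes for the evaluation job
run:
  replicas: 4  # values greater than 1 enable multi-node evaluation
```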

Case study: IT support ticket classification assistant

The following case study demonstrates several of these new features using IT support ticket classification. In this use case, models classify tickets as hardware, software, network, or access issues while explaining their reasoning. This tests both accuracy and explanation quality, and shows custom metrics, metadata passthrough, log probability analysis, and multi-node scaling in practice.

Dataset overview

The support ticket classification dataset contains IT support tickets spanning different priority levels and technical domains, each with structured metadata for detailed analysis. Each evaluation example includes a support ticket query, the system context, and a structured response containing the predicted category, the reasoning based on ticket content, and a natural language description. Amazon SageMaker Ground Truth responses include thoughtful explanations like "Based on the error message mentioning network timeout and the user's description of intermittent connectivity, this appears to be a network infrastructure issue requiring escalation to the network team." The dataset includes metadata tags for difficulty level (easy/medium/hard based on technical complexity), priority (low/medium/high), and domain category, demonstrating how metadata passthrough works for stratified analysis without post-processing joins.

Prerequisites

Before you run the notebook, make sure the provisioned environment has the following:

  1. An AWS account
  2. AWS Identity and Access Management (IAM) permissions to create a Lambda function, the ability to run SageMaker training jobs within the AWS account from the previous step, and read and write permissions to an S3 bucket
  3. A development environment with the SageMaker Python SDK and the Nova custom evaluation SDK (nova_custom_evaluation_sdk)

Step 1: Prepare the prompt

For our support ticket classification task, we need to assess not only whether the model identifies the correct category, but also whether it provides coherent reasoning and adheres to structured output formats, giving the complete picture required in production systems. For crafting the prompt, we follow Nova prompting best practices.

System prompt design: Starting with the system prompt, we establish the model's role and expected behavior through a focused system prompt:

You are an IT support specialist. Analyze the support ticket and classify it accurately based on the issue description and technical details.

This prompt sets clear expectations: the model should act as a domain expert, base decisions on the evidence in the ticket, and prioritize accuracy. By framing the task as expert analysis rather than casual commentary, we encourage more thoughtful, detailed responses.

Query structure: The query template requests both classification and justification:

# Query template requests classification and reasoning
"What category does this support ticket belong to? Provide your reasoning based on the issue description."

The explicit request for reasoning is critical: it forces the model to articulate its decision-making process, enabling evaluation of explanation quality alongside classification accuracy. This mirrors real-world requirements, where model decisions often need to be interpretable for stakeholders or regulatory compliance.

Structured response format: We define the expected output as JSON with three components:

{
  "category": "network_connectivity",
  "thought": "Based on the error message mentioning network timeout and intermittent connectivity issues, this appears to be a network infrastructure problem",
  "description": "Network connectivity issue requiring infrastructure team escalation"
}

This structure supports the three-dimensional evaluation strategy we'll discuss later in this post:

  • category field – Classification accuracy metrics (precision, recall, F1)
  • thought field – Reasoning coherence evaluation
  • description field – Natural language quality assessment

By defining the response as parseable JSON, we enable automated metric calculation through our custom Lambda functions while maintaining human-readable explanations for model decisions. This prompt architecture transforms evaluation from simple right/wrong classification into a complete assessment of model capabilities. Production AI systems must be accurate, explainable, and reliable in their output formatting, and our prompt design explicitly tests all three dimensions. The structured format also facilitates the metadata-driven stratified analysis we'll use in later steps, where we can correlate reasoning quality with confidence scores and difficulty levels across different ticket categories.
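As a sketch of how the three fields can be separated for scoring (a hypothetical helper, not part of the evaluation SDK), a parse failure itself becomes a measurable signal:

```python
import json

def split_response_fields(response_text: str) -> dict:
    """Parse the structured JSON response into its three evaluation dimensions."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        # A parse failure is itself a signal: the model broke the output contract
        return {"category": None, "thought": None, "description": None, "valid_json": False}
    return {
        "category": parsed.get("category"),
        "thought": parsed.get("thought"),
        "description": parsed.get("description"),
        "valid_json": True,
    }

example = '{"category": "network_connectivity", "thought": "timeout errors", "description": "network issue"}'
print(split_response_fields(example)["category"])  # network_connectivity
```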

Step 2: Prepare the dataset for evaluation with metadata

In this step, we'll prepare our support ticket dataset with metadata support to enable stratified analysis across different categories and difficulty levels. The metadata passthrough feature keeps custom fields intact for detailed performance analysis without post-hoc joins. Let's review an example dataset.

Dataset schema with metadata

For our support ticket classification evaluation, we'll use the enhanced gen_qa format with structured metadata:

{
  "system": "You are a support ticket classification expert. Analyze the ticket content and classify it accurately based on the issue type, urgency, and department that should handle it.",
  "query": "My laptop won't turn on at all. I've tried holding the power button and checking the charger connection, but there's no response. The LED indicators aren't lighting up either.",
  "response": "{\"category\": \"Hardware_Critical\", \"thought\": \"Complete device failure with no power response indicates a critical hardware issue\", \"description\": \"Critical hardware failure requiring urgent IT support\"}",
  "metadata": "{\"category\": \"support_ticket_classification\", \"difficulty\": \"easy\", \"domain\": \"IT_support\", \"priority\": \"critical\"}"
}

Let's examine this further: how can we automatically generate structured metadata for each evaluation example? This metadata enrichment process analyzes the content to classify task types, assess difficulty levels, and identify domains, creating the foundation for stratified analysis in later steps. By embedding this contextual information directly into our dataset, we help the Nova evaluation pipeline keep these insights intact, so we can understand model performance across different segments without requiring complex post-processing joins.

import json
from typing import List, Dict

def load_sample_dataset(file_path: str = "sample_text_eval_dataset.jsonl") -> List[Dict]:
    """Load the sample dataset with metadata"""
    dataset = []
    with open(file_path, 'r') as f:
        for line in f:
            dataset.append(json.loads(line))
    return dataset
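The enrichment itself can be as simple as keyword heuristics over the ticket text. The following sketch is purely illustrative: the rules and field values are hypothetical, and only the schema field names come from the dataset format above.

```python
import json

def enrich_with_metadata(example: dict) -> dict:
    """Attach a metadata string classifying domain and priority from the ticket text."""
    text = example["query"].lower()
    # Hypothetical keyword rules; a real pipeline would use richer heuristics or a model
    priority = "critical" if any(w in text for w in ("won't turn on", "outage", "down")) else "medium"
    domain = "network" if ("network" in text or "connectivity" in text) else "general"
    example["metadata"] = json.dumps({
        "category": "support_ticket_classification",
        "domain": domain,
        "priority": priority,
    })
    return example

ticket = {"query": "My laptop won't turn on at all.", "response": "{}"}
print(json.loads(enrich_with_metadata(ticket)["metadata"])["priority"])  # critical
```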

Once our dataset is enriched with metadata, we need to export it in the JSONL format required by the Nova evaluation container.

The following export function formats our prepared examples with embedded metadata so that they're ready for the evaluation pipeline, maintaining the exact schema structure needed for the Amazon SageMaker processing workflow:

# Export for SageMaker evaluation
def export_to_sagemaker_format(dataset: List[Dict], output_path: str = "gen_qa.jsonl"):
    """Export dataset in a format compatible with SageMaker"""
    
    with open(output_path, 'w', encoding='utf-8') as f:
        for item in dataset:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    
    print(f"Dataset exported: {len(dataset)} examples to {output_path}")

# Usage example
dataset = load_sample_dataset('sample_text_eval_dataset.jsonl')
export_to_sagemaker_format(dataset)

Step 3: Prepare custom metrics to evaluate custom models

After preparing your data and verifying it adheres to the required schema, the next critical step is to develop evaluation metrics code to assess your custom model's performance. Use the Nova evaluation container and the bring your own metrics (BYOM) workflow to control your model evaluation pipeline with custom metrics and data workflows.

Introduction to the BYOM workflow

With the BYOM feature, you can tailor your model evaluation workflow to your specific needs with fully customizable pre-processing, post-processing, and metrics capabilities. BYOM gives you control over the evaluation process, helping you fine-tune and improve your model's performance metrics according to your project's unique requirements.

Key tasks for this classification problem

  1. Define tasks and metrics: In this use case, model evaluation requires three tasks:
    1. Category prediction accuracy: This assesses how accurately the model predicts the correct category for given inputs. For this we'll use standard metrics such as accuracy, precision, recall, and F1 score to quantify performance.
    2. Schema adherence: Next, we also want to make sure the model's outputs conform to the required schema. This step is critical for maintaining consistency and compatibility with downstream applications. For this we'll use validation techniques to verify that the output format matches the required schema.
    3. Thought process coherence: Finally, we also want to evaluate the coherence and reasoning behind the model's decisions. This involves analyzing the model's thought process to help validate that predictions are logically sound. Techniques such as attention mechanisms, interpretability tools, and model explanations can provide insight into the model's decision-making process.
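These three checks can be sketched as a single per-example scoring helper. This is a hypothetical illustration only; in the BYOM workflow the real logic lives in the Lambda post-processor, and score_prediction is not part of the SDK. The length check on the thought field is a deliberately crude stand-in for a real coherence measure.

```python
import json

def score_prediction(inference_output: str, gold: str) -> dict:
    """Score one prediction on the three dimensions: accuracy, schema, reasoning."""
    scores = {"category_correct": 0.0, "schema_valid": 0.0, "has_reasoning": 0.0}
    try:
        pred = json.loads(inference_output)
        expected = json.loads(gold)
    except json.JSONDecodeError:
        return scores  # malformed output fails all three checks
    # 2. Schema adherence: all three required fields must be present
    if {"category", "thought", "description"} <= pred.keys():
        scores["schema_valid"] = 1.0
    # 1. Category prediction accuracy: exact match against the gold label
    if pred.get("category") == expected.get("category"):
        scores["category_correct"] = 1.0
    # 3. Crude proxy for reasoning coherence: a non-trivial thought field
    if len(str(pred.get("thought", ""))) > 20:
        scores["has_reasoning"] = 1.0
    return scores
```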

The BYOM feature for evaluating custom models requires building a Lambda function.

  1. Configure a custom layer for your Lambda function. In the GitHub release, find and download the pre-built nova-custom-eval-layer.zip file.
  2. Use the following command to publish the custom Lambda layer:
# Publish the custom Lambda layer for Nova evaluation
aws lambda publish-layer-version \
    --layer-name nova-custom-eval-layer \
    --zip-file fileb://nova-custom-eval-layer.zip \
    --compatible-runtimes python3.12 python3.11 python3.10 python3.9

  3. Add the published layer and AWSLambdaPowertoolsPythonV3-python312-arm64 (or a similar AWS layer matching your Python version and runtime) to your Lambda function to make sure all necessary dependencies are installed.
  4. In the Lambda function code, import two key dependencies: one that provides the preprocessor and postprocessor decorators and one to build the lambda_handler:
# Import required dependencies for Nova evaluation
from nova_custom_evaluation_sdk.processors.decorators import preprocess, postprocess
from nova_custom_evaluation_sdk.lambda_handler import build_lambda_handler

  5. Add the preprocessor and postprocessor logic.
    1. Preprocessor logic: Implement functions that manipulate the data before it's passed to the inference server. This can include prompt manipulations or other data preprocessing steps. The pre-processor expects an event dictionary (dict), a series of key-value pairs, as input:
      event = {
          "process_type": "preprocess",
          "data": {
              "system": <passed-in system prompt>,
              "prompt": <passed-in user prompt>,
              "gold": <expected_answer>
          }
      }

      Example:

      @preprocess
      def preprocessor(event: dict, context) -> dict:
          data = event.get('data', {})
          return {
              "statusCode": 200,
              "body": {
                  "system": data.get("system"),
                  "prompt": data.get("prompt", ""),
                  "gold": data.get("gold", "")
              }
          }

    2. Postprocessor logic: Implement functions that process the inference results. This can involve parsing fields, adding custom validations, or calculating specific metrics. The postprocessor expects an event dict as input in this format:
      event = {
      "process_type": "postprocess",
      "data": {
          "prompt": <the input prompt>,
          "inference_output": <the output from inference on the custom or base model>,
          "gold": <the expected output>
          }
      }

  6. Define the Lambda handler, where you attach the pre-processor and post-processor logic, applied before and after inference respectively.
    # Build the Lambda handler
    lambda_handler = build_lambda_handler(
        preprocessor=preprocessor,
        postprocessor=postprocessor
    )

Step 4: Launch the evaluation job with custom metrics

Now that you’ve constructed your {custom} processors and encoded your analysis metrics, you’ll be able to select a recipe and make mandatory changes to ensure the earlier BYOM logic will get executed. For this, first select bring your own data recipes from the general public repo, and ensure the next code adjustments are made.

  1. Make sure the processor key is added to the recipe with the correct details:
    # Processor configuration in recipe
    processor:
      lambda_arn: arn:aws:lambda:<region>:<account_id>:function:<name>
      preprocessing:
        enabled: true
      postprocessing:
        enabled: true
      aggregation: average

    • lambda_arn: The Amazon Resource Name (ARN) of the customer Lambda function that handles pre-processing and post-processing
    • preprocessing: Whether to apply custom pre-processing operations
    • postprocessing: Whether to apply custom post-processing operations
    • aggregation: Built-in aggregation function to choose from:

      min, max, average, or sum
  2. Launch a training job with the evaluation container:
    # SageMaker training job configuration
    from sagemaker.inputs import TrainingInput
    from sagemaker.pytorch import PyTorch

    input_s3_uri = "s3://amzn-s3-demo-input-bucket/<jsonl_data_file>"  # replace with your S3 URI and <jsonl_data_file> with your actual file name
    output_s3_uri = "s3://amzn-s3-demo-input-bucket"  # Output results location
    instance_type = "ml.g5.12xlarge"  # Instance type for evaluation
    job_name = "nova-lite-support-ticket-class"
    recipe_path = "./class_pred_with_byom.yaml"  # Recipe configuration file
    image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest"  # Latest as of this post

    # Configure training input
    evalInput = TrainingInput(
        s3_data=input_s3_uri,
        distribution='FullyReplicated',
        s3_data_type="S3Prefix"
    )

    # Create and configure the estimator
    estimator = PyTorch(
        output_path=output_s3_uri,
        base_job_name=job_name,
        role=role,
        instance_type=instance_type,
        training_recipe=recipe_path,
        sagemaker_session=sagemaker_session,
        image_uri=image_uri
    )

    # Launch the evaluation job
    estimator.fit(inputs={"train": evalInput})
Step 5: Use metadata and log probabilities to calibrate accuracy

You can also include log probability as an inference config variable to support logit-based evaluations. For this, we can pass top_logprobs under inference in the recipe.

top_logprobs indicates the number of most likely tokens to return at each token position, each with an associated log probability. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and the log probability of each output token returned in the content of the message.

# Log probability configuration in inference settings
inference:
  max_new_tokens: 2048
  top_k: -1
  top_p: 1.0
  temperature: 0
  top_logprobs: 10  # Number of most likely tokens to return (0-20)

Once the job runs successfully and you have the results, you can find the log probabilities under the field pred_logprobs. This field contains the considered output tokens and the log probability of each output token returned in the content of the message. You can now use the logits produced to calibrate your classification task: adjust the predictions and treat these probabilities as confidence scores.

Step 6: Failure analysis on low-confidence predictions

After calibrating our model using metadata and log probabilities, we can identify and analyze failure patterns in low-confidence predictions. This analysis helps us understand where our model struggles and guides targeted improvements.

Loading results with log probabilities

Now, let's examine in detail how we combine the inference outputs with detailed log probability data from the Amazon Nova evaluation pipeline. This lets us perform confidence-aware failure analysis by merging the prediction results with token-level uncertainty information.

import pandas as pd
import json
import numpy as np
from sklearn.metrics import classification_report

def load_evaluation_results_with_logprobs(inference_output_path: str, details_parquet_path: str) -> pd.DataFrame:
    """Load inference results and merge with log probabilities"""
    
    # Load inference outputs
    results = []
    with open(inference_output_path, 'r') as f:
        for line in f:
            results.append(json.loads(line))
    
    # Load log probabilities from parquet
    details_df = pd.read_parquet(details_parquet_path)
    
    # Create the results dataframe
    results_df = pd.DataFrame(results)
    
    # Add log probability data and calculate confidence
    if 'pred_logprobs' in details_df.columns:
        results_df['pred_logprobs'] = details_df['pred_logprobs'].tolist()
        results_df['confidence'] = results_df['pred_logprobs'].apply(calculate_confidence_score)
    
    # Calculate correctness and parse metadata
    results_df['is_correct'] = results_df.apply(calculate_correctness, axis=1)
    results_df['metadata_parsed'] = results_df['metadata'].apply(safe_parse_metadata)
    
    return results_df

Generate a confidence score from the log probabilities by converting the logprobs to probabilities and using the score of the first token in the classification response. We use only the first token because we know subsequent tokens in the classification would align with the class label. This step creates downstream quality gates: we can route low-confidence predictions to human review, gain a view into model uncertainty to check whether the model is "guessing," prevent hallucinations from reaching users, and later enable stratified analysis.

import ast
import json
import re

import numpy as np

def calculate_confidence_score(pred_logprobs_str, prediction_text="") -> tuple:
    """
    Calculate a confidence score from log probabilities for the predicted category.
    
    Args:
        pred_logprobs_str (str): String representation of prediction logprobs (updated field name as of 10/23/25)
        prediction_text (str): The actual prediction text to extract the category from
        
    Returns:
        tuple: (confidence, predicted_token, filtered_keys) - 
               confidence: The first-token probability of the predicted category
               predicted_token: The predicted category name
               filtered_keys: List of tokens in the category name
    
    Note: Tokens may include SentencePiece prefix characters (▁) as of the container update on 10/23/25
    """
    if not pred_logprobs_str or pred_logprobs_str == '[]':
        return 0.0, "", []
    
    try:
        # Extract the predicted category from the prediction text
        predicted_class = ""
        if prediction_text:
            try:
                parsed = json.loads(prediction_text)
                predicted_class = str(parsed.get("category", "")).strip()
            except Exception:
                predicted_class = ""
        
        if not predicted_class:
            return 0.0, "", []
        
        logprobs = ast.literal_eval(pred_logprobs_str)
        if not logprobs or len(logprobs) == 0:
            return 0.0, predicted_class, []
        
        # Build a dictionary of all tokens and their probabilities
        token_probs = {}
        for token_prob_dict in logprobs[0]:
            if isinstance(token_prob_dict, dict) and token_prob_dict:
                max_key = max(token_prob_dict.items(), key=lambda x: x[1])[0]
                max_logprob = token_prob_dict[max_key]
                token_probs[max_key] = np.exp(max_logprob)
        
        # Find tokens that match the predicted category
        class_tokens = []
        class_probs = []
        
        # Split the category name into potential tokens
        class_parts = re.split(r'[_\s]+', predicted_class)
        
        for token, prob in token_probs.items():
            # Strip the SentencePiece prefix (▁) and other special characters for matching
            clean_token = re.sub(r'[^a-zA-Z0-9]', '', token)
            if clean_token:
                # Check whether this token matches any part of the category name
                for part in class_parts:
                    if part and (clean_token.lower() == part.lower() or 
                                part.lower().startswith(clean_token.lower()) or
                                clean_token.lower().startswith(part.lower())):
                        class_tokens.append(clean_token)
                        class_probs.append(prob)
                        break
        
        # Use first-token confidence (OpenAI's classification approach)
        if class_probs:
            confidence = float(class_probs[0])
        else:
            confidence = 0.0
        
        return confidence, predicted_class, class_tokens
    except Exception as e:
        print(f"Error processing log probs: {e}")
        return 0.0, "", []

Initial analysis

Next, we perform stratified failure analysis, which combines confidence scores with metadata categories to identify specific failure patterns. This multi-dimensional analysis reveals failure modes across different task types, difficulty levels, and domains. Stratified failure analysis systematically examines low-confidence predictions to identify specific patterns and root causes. It first filters predictions below the confidence threshold, then conducts multi-dimensional analysis across metadata categories to pinpoint where the model struggles most. We also analyze content patterns in failed predictions, looking for uncertainty language and categorizing error types (JSON format issues, length problems, or content errors) before generating insights that tell teams exactly what to fix.

import ast
import json

import numpy as np
import pandas as pd

def analyze_low_confidence_failures(results_df: pd.DataFrame, confidence_threshold: float = 0.7, quality_threshold: float = 0.3) -> dict:
    """Perform comprehensive failure analysis"""
    
    # Make a copy to avoid modifying the original
    df = results_df.copy()
    
    # Helper to extract prediction text
    def get_prediction_text(row):
        pred = row['predictions']
        
        # Handle string representation of an array: "['text']"
        if isinstance(pred, str):
            try:
                pred_list = ast.literal_eval(pred)
                if isinstance(pred_list, list) and len(pred_list) > 0:
                    return str(pred_list[0]).strip()
            except (ValueError, SyntaxError):
                pass
        
        # Handle an actual array
        if isinstance(pred, (list, np.ndarray)) and len(pred) > 0:
            return str(pred[0]).strip()
        
        return str(pred).strip()
    
    # Apply the function with both logprobs and prediction text
    confidence_results = df.apply(
        lambda row: calculate_confidence_score(row['pred_logprobs'], get_prediction_text(row)), 
        axis=1
    )
    df[['confidence', 'predicted_token', 'filtered_keys']] = pd.DataFrame(
        list(confidence_results), 
        index=df.index,
        columns=['confidence', 'predicted_token', 'filtered_keys']
    )
    
    df['quality_score'] = df['metrics'].apply(
        lambda x: ast.literal_eval(x).get('f1', 0) if x else 0
    )
    df['is_correct'] = df['quality_score'] >= quality_threshold
    
    # Analyze low-confidence predictions
    low_conf_df = df[df['confidence'] < confidence_threshold]
    
    # Also identify high confidence but low quality (overconfident errors)
    overconfident_df = df[(df['confidence'] >= confidence_threshold) & 
                          (df['quality_score'] < quality_threshold)]
    
    analysis_results = {
        'summary': {
            'total_predictions': len(df),
            'low_confidence_count': len(low_conf_df),
            'low_confidence_rate': len(low_conf_df) / len(df) if len(df) > 0 else 0,
            'overconfident_errors': len(overconfident_df),
            'overconfident_rate': len(overconfident_df) / len(df) if len(df) > 0 else 0,
            'avg_confidence': df['confidence'].mean(),
            'avg_quality': df['quality_score'].mean(),
            'overall_accuracy': df['is_correct'].mean()
        },
        'low_confidence_examples': [],
        'overconfident_examples': []
    }
    
    # Low-confidence examples
    if len(low_conf_df) > 0:
        for idx, row in low_conf_df.iterrows():
            try:
                specifics = ast.literal_eval(row['specifics'])
                metadata = json.loads(specifics.get('metadata', '{}'))
                
                analysis_results['low_confidence_examples'].append({
                    'example': row['example'][:100] + '...' if len(row['example']) > 100 else row['example'],
                    'confidence': row['confidence'],
                    'quality_score': row['quality_score'],
                    'category': metadata.get('category', 'unknown'),
                    'difficulty': metadata.get('difficulty', 'unknown'),
                    'domain': metadata.get('domain', 'unknown')
                })
            except (ValueError, SyntaxError):
                pass
    
    # Overconfident errors
    if len(overconfident_df) > 0:
        for idx, row in overconfident_df.iterrows():
            try:
                specifics = ast.literal_eval(row['specifics'])
                metadata = json.loads(specifics.get('metadata', '{}'))
                
                analysis_results['overconfident_examples'].append({
                    'example': row['example'][:100] + '...' if len(row['example']) > 100 else row['example'],
                    'confidence': row['confidence'],
                    'quality_score': row['quality_score'],
                    'category': metadata.get('category', 'unknown'),
                    'difficulty': metadata.get('difficulty', 'unknown'),
                    'domain': metadata.get('domain', 'unknown')
                })
            except (ValueError, SyntaxError):
                pass
    
    return analysis_results

Preview initial results

Now let's review the initial results to see what was parsed out.

# Run the analysis (assumes results_df was loaded in the earlier step)
analysis_results = analyze_low_confidence_failures(results_df)

print("=" * 60)
print("FAILURE ANALYSIS RESULTS")
print("=" * 60)
print(f"\nTotal predictions: {analysis_results['summary']['total_predictions']}")
print(f"Average confidence: {analysis_results['summary']['avg_confidence']:.3f}")
print(f"Average F1 quality: {analysis_results['summary']['avg_quality']:.3f}")
print(f"Overall accuracy (F1>0.3): {analysis_results['summary']['overall_accuracy']:.1%}")

print(f"\n{'=' * 60}")
print("LOW CONFIDENCE PREDICTIONS (confidence < 0.7)")
print("=" * 60)
print(f"Count: {analysis_results['summary']['low_confidence_count']} ({analysis_results['summary']['low_confidence_rate']:.1%})")
for i, example in enumerate(analysis_results['low_confidence_examples'][:5], 1):
    print(f"\n{i}. Confidence: {example['confidence']:.3f} | F1: {example['quality_score']:.3f}")
    print(f"   {example['difficulty']} | {example['domain']}")
    print(f"   {example['example']}")

print(f"\n{'=' * 60}")
print("OVERCONFIDENT ERRORS (confidence >= 0.7 but F1 < 0.3)")
print("=" * 60)
print(f"Count: {analysis_results['summary']['overconfident_errors']} ({analysis_results['summary']['overconfident_rate']:.1%})")
for i, example in enumerate(analysis_results['overconfident_examples'][:5], 1):
    print(f"\n{i}. Confidence: {example['confidence']:.3f} | F1: {example['quality_score']:.3f}")
    print(f"   {example['difficulty']} | {example['domain']}")
    print(f"   {example['example']}")
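The thresholds surfaced by this analysis can feed directly into a confidence-based quality gate. The sketch below is a hypothetical routing helper, not a container API; the function name is invented, and the 0.7 cutoff simply mirrors the threshold used above.

```python
def route_prediction(confidence: float, threshold: float = 0.7) -> str:
    """Send low-confidence predictions to human review; auto-accept the rest."""
    return "auto_accept" if confidence >= threshold else "human_review"

# Example: tag each flagged example with a routing decision
# decisions = [route_prediction(ex["confidence"])
#              for ex in analysis_results["low_confidence_examples"]]
```

In production, the threshold would be tuned against the overconfident-error rate reported above rather than fixed at 0.7.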

Step 7: Scale evaluations with multi-node prediction

After identifying failure patterns, we need to scale our evaluation to larger datasets for testing. Nova evaluation containers now support multi-node evaluation to improve throughput and speed; you configure the number of replicas needed in the recipe.

The Nova evaluation container handles multi-node scaling automatically when you specify more than one replica in your evaluation recipe. Multi-node scaling distributes the workload across multiple nodes while maintaining the same evaluation quality and metadata passthrough capabilities.

# Multi-node scaling configuration - just change one line!
run:
  name: support-ticket-eval-multinode
  model_name_or_path: amazon.nova-lite-v1:0:300k
  replicas: 4  # Now scales to 4 nodes automatically

evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all

inference:
  max_new_tokens: 2048
  temperature: 0
  top_logprobs: 10

Result aggregation and performance analysis

The Nova evaluation container automatically handles result aggregation from multiple replicas, but we can still analyze scaling effectiveness and apply metadata-based analysis to the distributed evaluation.

Multi-node evaluation uses the Nova evaluation container's built-in capabilities through the replicas parameter, distributing workloads automatically and aggregating results while preserving all metadata-based stratified analysis capabilities. The container handles the complexity of distributed processing, helping teams scale from thousands to millions of examples by increasing the replica count.
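If you download the per-replica output shards locally, a minimal sketch of combining them for your own scaling analysis might look like the following. The shard filename pattern, the `replica_file` column, and the `f1` field are assumptions for illustration, not the container's output contract.

```python
import glob
import json

import pandas as pd

def aggregate_replica_outputs(pattern: str) -> pd.DataFrame:
    """Concatenate per-replica JSONL result shards into one DataFrame."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            rows = [json.loads(line) for line in f if line.strip()]
        shard = pd.DataFrame(rows)
        shard["replica_file"] = path  # keep provenance for per-replica checks
        frames.append(shard)
    return pd.concat(frames, ignore_index=True)

# Example: merge shards downloaded from the job's S3 output prefix
# merged = aggregate_replica_outputs("results/replica_*.jsonl")
# print(merged.groupby("replica_file")["f1"].mean())
```

Grouping by the provenance column is a quick way to confirm that per-replica metric distributions are consistent before trusting the aggregate.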

Conclusion

This example demonstrated Nova model evaluation fundamentals, showing the capabilities of the newly released features of the Nova evaluation container. We showed how custom metrics (BYOM) with domain-specific assessments can drive deep insights. We then explained how to extract and use log probabilities to reveal model uncertainty, easing the implementation of quality gates and confidence-based routing. Next, we showed how the metadata passthrough capability supports downstream stratified analysis, pinpointing where models struggle and where to focus improvements. Finally, we demonstrated a simple way to scale these techniques with multi-node evaluation. Including these features in your evaluation pipeline can help you make informed decisions about which models to adopt and where customization should be applied.

Get started now with the Nova evaluation demo notebook, which has detailed executable code for every step above, from dataset preparation through failure analysis, giving you a baseline to modify so you can evaluate your own use case.

Check out the Amazon Nova Samples repository for complete code examples across a variety of use cases.


About the authors

Tony Santiago is a Worldwide Partner Solutions Architect at AWS, dedicated to scaling generative AI adoption across Global Systems Integrators. He specializes in solution building, technical go-to-market alignment, and capability development, enabling tens of thousands of builders at GSI partners to deliver AI-powered solutions for their customers. Drawing on more than 20 years of global technology experience and a decade with AWS, Tony champions practical technologies that drive measurable business outcomes. Outside of work, he is passionate about learning new things and spending time with family.

Akhil Ramaswamy is a Worldwide Specialist Solutions Architect at AWS, specializing in advanced model customization and inference on SageMaker AI. He partners with global enterprises across various industries to solve complex business problems using the AWS generative AI stack. With expertise in building production-grade agentic systems, Akhil focuses on developing scalable go-to-market solutions that help enterprises drive innovation while maximizing ROI. Outside of work, you can find him traveling, working out, or enjoying a good book.

Anupam Dewan is a Senior Solutions Architect working on the Amazon Nova team with a passion for generative AI and its real-world applications. He focuses on building, enabling, and benchmarking AI applications for generative AI customers at Amazon. With a background in AI/ML, data science, and analytics, Anupam helps customers learn about Amazon Nova and make it work for their generative AI use cases to deliver business outcomes. Outside of work, you can find him hiking or enjoying nature.
