Build an agentic multimodal AI assistant with Amazon Nova and Amazon Bedrock Data Automation


Modern enterprises are rich in data that spans multiple modalities, from text documents and PDFs to presentation slides, images, audio recordings, and more. Imagine asking an AI assistant about your company’s quarterly earnings call: the assistant should not only read the transcript but also “see” the charts in the presentation slides and “hear” the CEO’s remarks. Gartner predicts that by 2027, 40% of generative AI solutions will be multimodal (text, image, audio, video), up from only 1% in 2023. This shift underlines how vital multimodal understanding is becoming for business applications. Achieving this requires a multimodal generative AI assistant, one that can understand and combine text, visuals, and other data types. It also requires an agentic architecture so the AI assistant can actively retrieve information, plan tasks, and make decisions on tool calling, rather than just responding passively to prompts.

In this post, we explore a solution that does exactly that. It uses Amazon Nova Pro, a multimodal large language model (LLM) from AWS, as the central orchestrator, along with powerful new Amazon Bedrock features like Amazon Bedrock Data Automation for processing multimodal data. We demonstrate how agentic workflow patterns such as Retrieval Augmented Generation (RAG), multi-tool orchestration, and conditional routing with LangGraph enable end-to-end solutions that artificial intelligence and machine learning (AI/ML) developers and enterprise architects can adopt and extend. We walk through an example of a financial management AI assistant that can provide quantitative research and grounded financial advice by analyzing both the earnings call (audio) and the presentation slides (images), along with relevant financial data feeds. We also highlight how you can apply this pattern in industries like finance, healthcare, and manufacturing.

Overview of the agentic workflow

The core of the agentic sample consists of the next phases:

  • Reason – The agent (often an LLM) examines the user’s request and the current context or state. It decides what the next step should be, whether that’s providing a direct answer or invoking a tool or sub-task to get more information.
  • Act – The agent executes that step. This could mean calling a tool or function, such as a search query, a database lookup, or a document analysis using Amazon Bedrock Data Automation.
  • Observe – The agent observes the result of the action. For instance, it reads the retrieved text or data that came back from the tool.
  • Loop – With new information in hand, the agent reasons again, deciding if the task is complete or if another step is needed. This loop continues until the agent determines it can produce a final answer for the user.

This iterative decision-making enables the agent to handle complex requests that are impossible to fulfill with a single prompt. However, implementing agentic systems can be challenging. They introduce more complexity in the control flow, and naive agents can be inefficient (making too many tool calls or looping unnecessarily) or hard to manage as they scale. This is where structured frameworks like LangGraph come in. LangGraph makes it possible to define a directed graph (or state machine) of potential actions with well-defined nodes (actions like “Report Writer” or “Query Knowledge Base”) and edges (allowable transitions). Although the agent’s internal reasoning still decides which path to take, LangGraph makes sure the process remains manageable and transparent. This controlled flexibility means the assistant has enough autonomy to handle diverse tasks while making sure the overall workflow is stable and predictable.
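The reason, act, observe, loop cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the solution: the tools (`lookup_ticker`, `get_price`), the company name, and the hard-coded `reason` policy are all hypothetical stand-ins for what an LLM and real tools would do. The step limit shows one simple way to guard against the unnecessary looping mentioned above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools: stand-ins for a search call or a data lookup.
def lookup_ticker(company: str) -> str:
    return {"Example Corp": "EXMP"}.get(company, "UNKNOWN")

def get_price(ticker: str) -> str:
    return {"EXMP": "42.00"}.get(ticker, "n/a")

TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_ticker": lookup_ticker,
    "get_price": get_price,
}

def reason(question: str, observations: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Stand-in for the LLM's reasoning step: returns (action, argument).

    A real agent would ask the model what to do next; here the policy
    is hard-coded so the example runs without any external service.
    """
    seen = dict(observations)
    if "lookup_ticker" not in seen:
        return "lookup_ticker", question
    if "get_price" not in seen:
        return "get_price", seen["lookup_ticker"]
    return "final", f"{seen['lookup_ticker']} trades at {seen['get_price']}"

def run_agent(question: str, max_steps: int = 5) -> str:
    observations: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action, arg = reason(question, observations)  # Reason
        if action == "final":
            return arg                                # task complete
        result = TOOLS[action](arg)                   # Act
        observations.append((action, result))         # Observe
    return "Stopped: step limit reached"              # guard against endless loops

answer = run_agent("Example Corp")
```

With the stub data above, the loop calls both tools once and then returns a final answer; lowering `max_steps` to 1 makes it stop early instead.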

Solution overview

This solution is a financial management AI assistant designed to help analysts query portfolios, analyze companies, and generate reports. At its core is Amazon Nova, an LLM that serves as the inference and orchestration engine. Amazon Nova processes text, images, or documents (like earnings call slides), and dynamically decides which tools to use to fulfill requests. Amazon Nova is optimized for enterprise tasks and supports function calling, so the model can plan actions and call tools in a structured way. With a large context window (up to 300,000 tokens in Amazon Nova Lite and Amazon Nova Pro), it can manage long documents or conversation history when reasoning.

The workflow consists of the next key parts:

  • Knowledge base retrieval – Both the earnings call audio file and PowerPoint file are processed by Amazon Bedrock Data Automation, a managed service that extracts text, transcribes audio and video, and prepares data for analysis. If the user uploads a PowerPoint file, the system converts each slide into an image (PNG) for efficient search and analysis, a technique inspired by generative AI applications like Manus. Amazon Bedrock Data Automation is effectively a multimodal AI pipeline out of the box. In our architecture, Amazon Bedrock Data Automation acts as a bridge between raw data and the agentic workflow. Then Amazon Bedrock Knowledge Bases converts the chunks extracted by Amazon Bedrock Data Automation into vector embeddings using Amazon Titan Text Embeddings V2, and stores these vectors in an Amazon OpenSearch Serverless database.
  • Router agent – When a user asks a question (for example, “Summarize the key risks in this Q3 earnings report”), Amazon Nova first determines whether the task requires retrieving data, processing a file, or generating a response. It maintains memory of the dialogue, interprets the user’s request, and plans which actions to take to fulfill it. The “Memory & Planning” module in the solution diagram indicates that the router agent can use conversation history and chain-of-thought (CoT) prompting to determine next steps. Crucially, the router agent determines if the query can be answered with internal company data or if it requires external information and tools.
  • Multimodal RAG agent – For queries related to audio and video information, Amazon Bedrock Data Automation uses a unified API call to extract insights from such multimedia data, and stores the extracted insights in Amazon Bedrock Knowledge Bases. Amazon Nova uses Amazon Bedrock Knowledge Bases to retrieve factual answers using semantic search. This makes sure responses are grounded in real data, minimizing hallucination. If Amazon Nova generates an answer, a secondary hallucination check cross-references the response against trusted sources to catch unsupported claims.
  • Hallucination check (quality gate) – To further verify reliability, the workflow can include a postprocessing step using a different foundation model (FM) outside of the Amazon Nova family, such as Anthropic’s Claude, Mistral, or Meta’s Llama, to grade the answer’s faithfulness. For example, after Amazon Nova generates a response, a hallucination detector model or function can compare the answer against the retrieved sources or known facts. If a potential hallucination is detected (the answer isn’t supported by the reference data), the agent can choose to do more retrieval, adjust the answer, or escalate to a human.
  • Multi-tool collaboration – Multi-tool collaboration allows the AI to not only find information but also take actions before formulating a final answer. The supervisor agent might spawn or coordinate multiple tool-specific agents (for example, a web search agent to do a general web search, a stock search agent to get market data, or other specialized agents for company financial metrics or industry news). Each agent performs a focused task (one might call an API or perform a query on the internet) and returns findings to the supervisor agent. Amazon Nova Pro features a strong reasoning ability that allows the supervisor agent to merge these findings. This multi-agent approach follows the principle of dividing complex tasks among specialist agents, improving efficiency and reliability for complex queries.
  • Report creation agent – Another notable aspect of the architecture is the use of Amazon Nova Canvas for output generation. Amazon Nova Canvas is a specialized image-generation model in the Amazon Nova family, but in this context, we use the concept of a “canvas” more figuratively to mean a structured template or format for generated content. For instance, we could define a template for an “investor report” that the assistant fills out: Section 1: Key Highlights (bullet points), Section 2: Financial Summary (table of figures), Section 3: Notable Quotes, and so on. The agent can guide Amazon Nova to populate such a template by providing it with a system prompt containing the desired format (this is similar to few-shot prompting, where the layout is given). The result is that the assistant not only answers ad-hoc questions, but can also produce comprehensive generated reports that look as if a human analyst prepared them, combining text, images, and references to visuals.
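To make the hallucination check (quality gate) more concrete, the following is a minimal sketch of its control flow. It is an assumption-laden stand-in: the post describes grading faithfulness with a second foundation model, whereas this example swaps in a simple word-overlap heuristic so it can run without any model call. The function names, the 0.6 threshold, and the retry limit are all hypothetical.

```python
import re
from typing import List, Set

def _tokens(text: str) -> Set[str]:
    # Lowercased alphanumeric word tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_sentences(answer: str, sources: List[str], threshold: float = 0.6) -> List[str]:
    """Return the sentences whose words are mostly absent from the sources.

    Word overlap is a crude proxy for faithfulness; a real gate would ask a
    separate grader model to compare the answer with the retrieved passages.
    """
    source_tokens: Set[str] = set()
    for passage in sources:
        source_tokens |= _tokens(passage)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        support = len(words & source_tokens) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged

def quality_gate(answer: str, sources: List[str], attempts: int = 0, max_attempts: int = 2) -> str:
    """Decide the next step: accept, retrieve more context, or escalate."""
    if not unsupported_sentences(answer, sources):
        return "accept"
    if attempts < max_attempts:
        return "retrieve_more"
    return "escalate_to_human"

sources = ["Revenue grew 12 percent in the third quarter."]
grounded = quality_gate("Revenue grew 12 percent.", sources)
ungrounded = quality_gate("Revenue grew 12 percent. The CEO plans to resign next month.", sources)
```

The three return values mirror the options named above: accept the answer, do more retrieval, or hand the case to a human once the retry budget is spent.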

These components are orchestrated in an agentic workflow. Instead of a fixed script, the solution uses a dynamic decision graph (implemented with the open source LangGraph library in the notebook solution) to route between steps. The result is an assistant that feels less like a chatbot and more like a collaborative analyst, one that can parse an earnings call audio recording, critique a slide deck, or draft an investor memo with minimal human intervention.
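The idea of a decision graph with conditional routing can be illustrated without any framework. The sketch below is plain Python, not the LangGraph API and not the notebook's code: the node names, the keyword-based `router`, and the `EDGES` table are hypothetical. Each node returns the name of the next node, and the runner rejects any transition that the graph does not allow.

```python
from typing import Callable, Dict, List, Set

State = Dict[str, object]
END = "END"

def router(state: State) -> str:
    # Stand-in for the LLM router: keyword rules instead of model reasoning.
    question = str(state["question"]).lower()
    if "report" in question:
        return "report_writer"
    if "stock" in question:
        return "external_tools"
    return "query_knowledge_base"

def query_knowledge_base(state: State) -> str:
    state["context"] = "retrieved internal passages"
    return "answer"

def external_tools(state: State) -> str:
    state["context"] = "market data from tools"
    return "answer"

def report_writer(state: State) -> str:
    state["output"] = "templated report"
    return END

def answer(state: State) -> str:
    state["output"] = f"answer grounded in {state['context']}"
    return END

NODES: Dict[str, Callable[[State], str]] = {
    "router": router,
    "query_knowledge_base": query_knowledge_base,
    "external_tools": external_tools,
    "report_writer": report_writer,
    "answer": answer,
}

# Allowed transitions: the edges of the directed graph.
EDGES: Dict[str, Set[str]] = {
    "router": {"query_knowledge_base", "external_tools", "report_writer"},
    "query_knowledge_base": {"answer"},
    "external_tools": {"answer"},
    "report_writer": {END},
    "answer": {END},
}

def run_graph(question: str, entry: str = "router") -> State:
    state: State = {"question": question}
    path: List[str] = []
    node = entry
    while node != END:
        path.append(node)
        next_node = NODES[node](state)
        if next_node not in EDGES[node]:
            raise ValueError(f"transition {node} -> {next_node} is not allowed")
        node = next_node
    state["path"] = path
    return state

result = run_graph("How did the stock do this year?")
```

Keeping the allowed transitions in one table is what makes the workflow easy to inspect: the router still chooses the path, but only among edges that were declared up front.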

The following diagram shows the high-level architecture of the agentic AI workflow. Amazon Nova orchestrates various tools, including Amazon Bedrock Data Automation for document and image processing and a knowledge base for retrieval, to fulfill complex user requests. For brevity, we don’t list all the code here; the GitHub repo includes a full working example. Developers can run that to see the agent in action and extend it with their own data.

Example of the multi-tool collaboration workflow

To demonstrate the multi-tool collaboration agent workflow, we explore an example of how a question-answer interaction might flow through our deployed system:

  • User prompt – In the chat UI, the end-user asks a question, such as “What’s XXX’s stock performance this year, and how does it compare to its rideshare-industry peers?”
  • Agent initial response – The agent (Amazon Nova FM orchestrator) receives the question and responds with:
    Received your question. Routing to the reasoning engine…

  • Planning and tool selection – The agent determines that it needs the following:
    • The ticker symbol for the company (XXX)
    • Real-time stock price and YTD changes
    • Key financial metrics (revenue, net income, price-earnings ratio)
    • Industry benchmarks (peer YTD performance, average revenue growth)
  • Planning execution using tool calls – The agent calls tools to perform the following actions:
    • Look up ticker symbol:
      Agent → WebSearchTool.lookupTicker("XXX Inc")
      WebSearchTool → Agent: returns "XXX"

    • Fetch real-time stock performance using the retrieved ticker symbol:
      Agent → StockAnalysisTool.getPerformance(
       symbol="XXX",
       interval="YTD"
       )
      StockAnalysisTool → Agent:
       {
       currentPrice: 
       ytdChange: 
       52wkRange: 
       volume: 
       }

    • Retrieve company financial metrics using the retrieved ticker symbol:
      Agent → CompanyFinancialAnalysisTool.getMetrics("XXX")
      CompanyFinancialAnalysisTool → Agent:
       {
       revenueQ4_2024: xxx B,
       netIncomeQ4_2024: xxx M,
       peRatio: xxx
       }

    • Gather industry benchmark data using the retrieved ticker symbol:
      Agent → IndustryAnalysisTool.comparePeers(
       symbol="XXX",
       sector="Rideshare"
       )
      IndustryAnalysisTool → Agent:
       {
       avgPeerYTD:
       avgRevenueGrowth: 
       }

    • Validation loop – The agent runs a validation loop:
      Agent: validate()
       ↳ Are all four data points present?
       • Ticker ✓
       • Stock performance ✓
       • Financial metrics ✓
       • Industry benchmark ✓
       ↳ All set; no retry needed.

If anything is missing or a tool encountered an error, the FM orchestrator triggers the error handler (up to three retries), then resumes the plan at the failed step.
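The retry behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the error handler from the solution: `run_plan`, the step names, and the simulated flaky tool are hypothetical. Each step is retried in place up to three times, so completed steps are not rerun and the plan resumes at the step that failed.

```python
from typing import Callable, Dict, List, Tuple

def run_plan(steps: List[Tuple[str, Callable[[], object]]], max_retries: int = 3) -> Dict[str, object]:
    """Run steps in order; retry a failing step up to max_retries times."""
    results: Dict[str, object] = {}
    for name, step in steps:
        retries = 0
        while True:
            try:
                results[name] = step()
                break  # step succeeded, move on to the next one
            except Exception as exc:
                retries += 1
                if retries > max_retries:
                    raise RuntimeError(
                        f"step '{name}' failed after {max_retries} retries"
                    ) from exc
    return results

# Hypothetical flaky tool: fails twice with a simulated timeout, then succeeds.
calls = {"count": 0}

def flaky_benchmark() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated tool timeout")
    return {"status": "ok"}

results = run_plan([
    ("ticker", lambda: "XXX"),
    ("benchmark", flaky_benchmark),
])

# Validation: confirm every required data point is present before synthesis.
required = ["ticker", "benchmark"]
missing = [name for name in required if name not in results]
```

In a production workflow you would likely add a delay between retries and catch only the error types that are worth retrying; both are omitted here to keep the sketch short.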

  • Synthesis and final answer – The agent uses Amazon Nova Pro to synthesize the data points and generate the final answer based on them.

The following figure shows a flow diagram of this multi-tool collaboration agent.

Benefits of using Amazon Bedrock for scalable generative AI agent workflows

This solution is built on Amazon Bedrock because AWS provides an integrated ecosystem for building such sophisticated solutions at scale:

  • Amazon Bedrock delivers top-tier FMs like Amazon Nova with managed infrastructure, so there is no need to provision GPU servers or handle scaling complexities.
  • Amazon Bedrock Data Automation offers an out-of-the-box solution to process documents, images, audio, and video into actionable data. Amazon Bedrock Data Automation can convert presentation slides to images, convert audio to text, perform OCR, and generate textual summaries or captions that are then indexed in an Amazon Bedrock knowledge base.
  • Amazon Bedrock Knowledge Bases can store embeddings from unstructured data and support retrieval operations using similarity search.
  • In addition to LangGraph (as shown in this solution), you can also use Amazon Bedrock Agents to develop agentic workflows. Amazon Bedrock Agents simplifies the configuration of tool flows and action groups, so you can declaratively manage your agentic workflows.
  • Applications developed with open source frameworks like LangGraph (an extension of LangChain) can also run and scale on AWS infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2) or Amazon SageMaker instances, so you can define directed graphs for agent orchestration, making it straightforward to manage multi-step reasoning and tool chaining.

You don’t need to assemble a dozen disparate systems; AWS provides an integrated network for generative AI workflows.

Considerations and customizations

The architecture demonstrates exceptional flexibility through its modular design principles. At its core, the system uses Amazon Nova FMs, which can be selected based on task complexity. Amazon Nova Micro handles straightforward tasks like classification with minimal latency. Amazon Nova Lite manages moderately complex operations with balanced performance, and Amazon Nova Pro excels at sophisticated tasks requiring advanced reasoning or generating comprehensive responses.
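Selecting a model by task complexity can be as simple as a lookup table. The sketch below is illustrative only: the task categories and the `select_model` helper are hypothetical, and the model identifiers are assumptions that you should verify against the model IDs available in your AWS Region and account before use.

```python
from typing import Dict

# Assumed Amazon Bedrock model identifiers; confirm them in the Bedrock console.
MODEL_BY_TIER: Dict[str, str] = {
    "simple": "amazon.nova-micro-v1:0",    # low-latency tasks such as classification
    "moderate": "amazon.nova-lite-v1:0",   # balanced cost and performance
    "complex": "amazon.nova-pro-v1:0",     # advanced reasoning, long responses
}

# Hypothetical mapping from task type to complexity tier.
TIER_BY_TASK: Dict[str, str] = {
    "classification": "simple",
    "routing": "simple",
    "summarization": "moderate",
    "extraction": "moderate",
    "multi_step_reasoning": "complex",
    "report_generation": "complex",
}

def select_model(task_type: str) -> str:
    """Return a model ID for the task; unknown tasks fall back to the most capable tier."""
    tier = TIER_BY_TASK.get(task_type, "complex")
    return MODEL_BY_TIER[tier]

router_model = select_model("classification")
report_model = select_model("report_generation")
```

Falling back to the most capable tier for unknown tasks favors answer quality over cost; reversing that default is a reasonable choice when latency or budget matters more.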

The modular nature of the solution (Amazon Nova, tools, knowledge base, and Amazon Bedrock Data Automation) means each piece can be swapped or adjusted without overhauling the whole system. Solution architects can use this reference architecture as a foundation, implementing customizations as needed. You can integrate new capabilities through AWS Lambda functions for specialized operations, and the LangGraph orchestration enables dynamic model selection and complex routing logic. This architectural approach makes sure the system can evolve organically while maintaining operational efficiency and cost-effectiveness.

Bringing it to production requires thoughtful design, but AWS offers scalability, security, and reliability. For instance, you can secure the knowledge base content with encryption and access control, integrate the agent with AWS Identity and Access Management (IAM) to make sure it only performs allowed actions (for example, if an agent can access sensitive financial data, verify that it checks user permissions), and monitor the costs (you can track Amazon Bedrock pricing and tool usage; you might use Provisioned Throughput for consistent high-volume usage). Additionally, with AWS, you can scale from an experiment in a notebook to a full production deployment when you’re ready, using the same building blocks (integrated with proper AWS infrastructure like Amazon API Gateway or Lambda, if deploying as a service).

Vertical industries that can benefit from this solution

The architecture we described is quite general. Let’s briefly look at how this multimodal agentic workflow can drive value in different industries:

  • Financial services – In the financial sector, the solution integrates multimedia RAG to unify earnings call transcripts, presentation slides (converted to searchable images), and real-time market feeds into a single analytical framework. Multi-agent collaboration enables Amazon Nova to orchestrate tools like Amazon Bedrock Data Automation for slide text extraction, semantic search for regulatory filings, and live data APIs for trend detection. This allows the system to generate actionable insights, such as identifying portfolio risks or recommending sector rebalancing, while automating content creation for investor reports or trade approvals (with human oversight). By mimicking an analyst’s ability to cross-reference data types, the AI assistant transforms fragmented inputs into cohesive strategies.
  • Healthcare – Healthcare workflows use multimedia RAG to process clinical notes, lab PDFs, and X-rays, grounding responses in peer-reviewed literature and patient audio interviews. Multi-agent collaboration excels in scenarios like triage: Amazon Nova interprets symptom descriptions, Amazon Bedrock Data Automation extracts text from scanned documents, and integrated APIs check for drug interactions, all while validating outputs against trusted sources. Content creation ranges from succinct patient summaries (“Severe pneumonia, treated with levofloxacin”) to evidence-based answers for complex queries, such as summarizing diabetes guidelines. The architecture’s strict hallucination checks and source citations support reliability, which is essential for maintaining trust in medical decision-making.
  • Manufacturing – Industrial teams use multimedia RAG to index equipment manuals, sensor logs, worker audio conversations, and schematic diagrams, enabling rapid troubleshooting. Multi-agent collaboration enables Amazon Nova to correlate sensor anomalies with manual excerpts, and Amazon Bedrock Data Automation highlights faulty parts in technical drawings. The system generates repair guides (for example, “Replace valve Part 4 in schematic”) or contextualizes historical maintenance records, bridging the gap between veteran expertise and new technicians. By unifying text, images, and time series data into actionable content, the assistant reduces downtime and preserves institutional knowledge, showing that even in hardware-centric fields, AI-driven insights can drive efficiency.

These examples highlight a common pattern: the synergy of data automation, powerful multimodal models, and agentic orchestration leads to solutions that closely mimic a human expert’s assistance. The financial AI assistant cross-checks figures and explanations like an analyst would, the medical AI assistant correlates images and notes like a diligent doctor, and the industrial AI assistant recalls diagrams and logs like a veteran engineer. All of this is made possible by the underlying architecture we’ve built.

Conclusion

The era of siloed AI models that only handle one type of input is drawing to a close. As we’ve discussed, combining multimodal AI with an agentic workflow unlocks a new level of capability for enterprise applications. In this post, we demonstrated how to construct such a workflow using AWS services: we used Amazon Nova as the core AI orchestrator with its multimodal, agent-friendly capabilities, Amazon Bedrock Data Automation to automate the ingestion and indexing of complex data (documents, slides, audio) into Amazon Bedrock Knowledge Bases, and an agentic workflow graph with conditional routing (using LangChain or LangGraph) to orchestrate multi-step reasoning and tool usage. The end result is an AI assistant that operates much like a diligent analyst: researching, cross-checking multiple sources, and delivering insights, but at machine speed and scale.

The solution demonstrates that building a sophisticated agentic AI system is no longer an academic dream; it’s practical and achievable with today’s AWS technologies. By using Amazon Nova as a powerful multimodal LLM and Amazon Bedrock Data Automation for multimodal data processing, along with frameworks for tool orchestration like LangGraph (or Amazon Bedrock Agents), developers get a head start. Many challenges (like OCR, document parsing, or conversational orchestration) are handled by these managed services or libraries, so you can focus on the business logic and domain-specific needs.

The solution presented in the BDA_nova_agentic sample notebook is a great starting point to experiment with these ideas. We encourage you to try it out, extend it, and tailor it to your organization’s needs. We’re excited to see what you’ll build. The techniques discussed here represent only a small portion of what’s possible when you combine modalities and intelligent agents.


About the authors

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services, currently focused on the Amazon Bedrock team. Her core expertise lies in agentic AI, where she explores the capabilities of foundation models and AI agents to drive productivity in generative AI applications. With a background in generative AI, applied data science, and IoT architecture, she partners with customers, from startups to large enterprises, to design and deploy impactful AI solutions.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.

Jessie-Lee Fry is a Product and Go-to-Market (GTM) Strategy executive specializing in generative AI and machine learning, with over 15 years of global leadership experience in strategy, product, customer success, business development, business transformation, and strategic partnerships. Jessie has defined and delivered a broad range of products and cross-industry go-to-market strategies driving business growth, while maneuvering market complexities and C-suite customer groups. In her current role, Jessie and her team focus on helping AWS customers adopt Amazon Bedrock at scale through enterprise use cases and adoption frameworks, meeting customers where they are in their generative AI journey.
