Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import


Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.

Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Amazon Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.

In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.

DeepSeek-R1 distilled versions

From the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta's Llama and Qwen architectures, ranging from 1.5–70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher, essentially transferring the knowledge and capabilities of the 671 billion parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain performance closer to the original while still offering meaningful efficiency gains.

Solution overview

In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repository, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.

In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.

You can use the Amazon Bedrock console to deploy through the graphical interface by following the instructions in this post, or alternatively use the following notebook to deploy programmatically with the Amazon Bedrock SDK.
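If you prefer the programmatic route, the following minimal sketch shows how an import job can be started with Boto3. The bucket path, IAM role ARN, and model names are placeholders you would replace with your own values.

```python
import boto3

# Client for the Amazon Bedrock control-plane API
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Start a Custom Model Import job; the S3 URI, role ARN, and names
# below are placeholders for your own resources
response = bedrock.create_model_import_job(
    jobName="deepseek-r1-distill-llama-8b-import-v1",
    importedModelName="deepseek-r1-distill-llama-8b-v1",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {
            "s3Uri": "s3://your-bucket/folder-with-model-artifacts/"
        }
    },
)

# Keep the job ARN so you can poll the job status later
print(response["jobArn"])
```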

Prerequisites

You should have the following prerequisites:

Prepare the model package

Complete the following steps to prepare the model package:

  1. Download the DeepSeek-R1-Distill-Llama model artifacts from Hugging Face, from one of the following links, depending on the model you want to deploy:
    1. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tree/main
    2. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/tree/main

For more information, you can follow Hugging Face's Downloading models or Download files from the hub instructions.

You typically need the following files:

    • Model configuration file: config.json
    • Tokenizer files: tokenizer.json, tokenizer_config.json, and tokenizer.model
    • Model weight files in .safetensors format
  2. Upload these files to a folder in your S3 bucket, in the same AWS Region where you plan to use Amazon Bedrock, as shown in the sketch following this list. Take note of the S3 path you're using.
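The upload can be scripted with Boto3; this is a minimal sketch in which the bucket name, prefix, and local directory are placeholders for your own values.

```python
import os
import boto3

s3 = boto3.client("s3")

# Placeholders: replace with your bucket, key prefix, and local artifact folder
bucket = "your-bucket"
prefix = "folder-with-model-artifacts"
local_dir = "DeepSeek-R1-Distill-Llama-8B"

# Upload every model artifact (config, tokenizer files, .safetensors shards)
for file_name in os.listdir(local_dir):
    s3.upload_file(
        os.path.join(local_dir, file_name),
        bucket,
        f"{prefix}/{file_name}",
    )
    print(f"Uploaded {file_name} to s3://{bucket}/{prefix}/{file_name}")
```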

Import the model

Complete the following steps to import the model:

  1. On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.
  2. Choose Import model.
  3. For Model name, enter a name for your model (it's recommended to use a versioning scheme in your name, for tracking your imported model).
  4. For Import job name, enter a name for your import job.
  5. For Model import settings, select Amazon S3 bucket as your import source, and enter the S3 path you noted earlier (provide the full path in the form s3://<your-bucket>/folder-with-model-artifacts/).
  6. For Encryption, optionally choose to customize your encryption settings.
  7. For Service access role, choose to either create a new IAM role or provide your own.
  8. Choose Import model.

Importing the model will take several minutes depending on the model being imported (for example, the Distill-Llama-8B model could take 5–20 minutes to complete).
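If you started the job programmatically, you can poll for completion with a loop like the following sketch; the job ARN shown is a placeholder and would come from the create_model_import_job response shown earlier.

```python
import time
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder: use the jobArn returned by create_model_import_job
job_arn = "arn:aws:bedrock:us-east-1:111122223333:model-import-job/example"

# Poll until the import job finishes; an 8B import typically
# takes 5-20 minutes
while True:
    job = bedrock.get_model_import_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Import job status: {status}")
    if status in ("Completed", "Failed"):
        break
    time.sleep(60)
```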

Watch this video demo for a step-by-step guide.

Test the imported model

After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:

  1. On the Amazon Bedrock console, choose Chat / Text under Playgrounds in the navigation pane.
  2. From the model selector, choose your imported model name.
  3. Adjust the inference parameters as needed and write your test prompt. For example:
    <|begin▁of▁sentence|><|User|>Given the following financial data: - Company A's revenue grew from $10M to $15M in 2023 - Operating costs increased by 20% - Initial operating costs were $7M Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>

Because we're using an imported model in the playground, we must include the "begin_of_sentence" and "user/assistant" tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and provide more accurate responses (see the invocation sketch after these steps). If you're following the programmatic approach in the following notebook, then this is automatically taken care of by configuring the model.

  4. Review the model response and metrics provided.
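To invoke the model through the API instead, you can call the Amazon Bedrock runtime directly. The following is a minimal sketch: the model ARN is a placeholder, and the request body fields (prompt, max_gen_len, temperature) are assumed here to follow the Llama-style schema, so verify them against your imported model's documentation.

```python
import json
import boto3

# bedrock-runtime is the data-plane client used for inference
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN of your imported model (shown on the model's detail page)
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"

# Wrap the question in the DeepSeek chat template, as in the playground example
prompt = (
    "<|begin▁of▁sentence|><|User|>"
    "Calculate the company's operating margin for 2023, reasoning step by step."
    "<|Assistant|>"
)

response = runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps({
        "prompt": prompt,
        "max_gen_len": 2048,   # assumed Llama-style request fields
        "temperature": 0.6,
    }),
)

print(json.loads(response["body"].read()))
```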

Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The recovery time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
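For example, the following sketch raises the retry budget so a first invocation can ride out a cold start while the model copy is being provisioned; the exact values are illustrative.

```python
import boto3
from botocore.config import Config

# Increase retry attempts so first-time invocations survive the
# cold start while the model copy is provisioned
retry_config = Config(
    retries={
        "max_attempts": 10,
        "mode": "adaptive",  # client-side rate limiting with backoff
    }
)

runtime = boto3.client("bedrock-runtime", config=retry_config)
```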

Once you are ready to import the model, use this step-by-step video demo to help you get started.

Pricing

Custom Model Import lets you use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn't charge for model import; you are charged for inference based on two factors: the number of active model copies and their duration of activity.

Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The price per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The Custom Model Units required for hosting depend on the model's architecture, parameter count, and context length, with examples ranging from 2 Units for a Llama 3.1 8B 128K model to 8 Units for a Llama 3.1 70B 128K model.

Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this can involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.

Consider the following pricing example: An application developer imports a customized Llama 3.1 type model that is 8B parameters in size with a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units. So, the price per minute will be $0.1570, and the model storage costs will be $3.90 for the month.
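The arithmetic behind this example can be worked out as follows. Note that the per-unit rates are implied by the example above, not authoritative values; check the current Amazon Bedrock pricing page before relying on them.

```python
# Worked version of the pricing example above
custom_model_units = 2                 # Llama 3.1 8B, 128K context
price_per_unit_per_minute = 0.0785     # implied: $0.1570 / 2 Units
storage_per_unit_per_month = 1.95      # implied: $3.90 / 2 Units

price_per_minute = custom_model_units * price_per_unit_per_minute    # $0.157
storage_per_month = custom_model_units * storage_per_unit_per_month  # $3.90

# Inference cost accrues only while copies are active; for example,
# one copy active for 10 hours over the month:
active_minutes = 10 * 60
inference_cost = active_minutes * price_per_minute   # $94.20
total = inference_cost + storage_per_month           # $98.10
print(f"Inference: ${inference_cost:.2f}, storage: ${storage_per_month:.2f}, "
      f"total: ${total:.2f}")
```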

For more information, see Amazon Bedrock pricing.

Benchmarks

DeepSeek has published benchmarks comparing their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80–90% of the original model's reasoning capabilities, while the 8B version achieves between 59–92% performance with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models on specific reasoning tasks.

Other considerations

When deploying DeepSeek models in Amazon Bedrock, consider the following aspects:

  • Model versioning is essential. Because Custom Model Import creates unique models for each import, implement a clear versioning strategy in your model names to track different versions and variations.
  • The currently supported model formats focus on Llama-based architectures. Although DeepSeek-R1 distilled versions offer excellent performance, the AI ecosystem continues evolving rapidly. Keep an eye on the Amazon Bedrock model catalog as new architectures and larger models become available through the platform.
  • Evaluate your use case requirements carefully. Although larger models like DeepSeek-R1-Distill-Llama-70B provide better performance, the 8B version might offer sufficient capability for many applications at a lower cost.
  • Consider implementing monitoring and observability. Amazon CloudWatch provides metrics for your imported models, helping you track usage patterns and performance (see the sketch after this list). You can monitor costs with AWS Cost Explorer.
  • Start with a lower concurrency quota and scale up based on actual usage patterns. The default limit of three concurrent model copies per account is suitable for most initial deployments.
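As a starting point for monitoring, the following sketch pulls hourly invocation counts for an imported model. The AWS/Bedrock namespace, Invocations metric, and ModelId dimension reflect Bedrock's standard runtime metrics, but verify them against your account; the model ARN is a placeholder.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

# Hourly invocation counts for the imported model over the past day
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{
        "Name": "ModelId",
        "Value": "arn:aws:bedrock:us-east-1:111122223333:imported-model/example",
    }],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```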

Conclusion

Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like the DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek's innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.

The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are crucial. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI systems that can adapt as their requirements evolve.

For more information, refer to the Amazon Bedrock User Guide.


About the Authors

Raj Pathak is a Principal Solutions Architect and Technical Advisor to Fortune 50 and mid-sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Morgan Rankey is a Solutions Architect based in New York City, specializing in Hedge Funds. He excels in assisting customers to build resilient workloads within the AWS ecosystem. Prior to joining AWS, Morgan led the Sales Engineering team at Riskified through its IPO. He started his career by focusing on AI/ML solutions for machine asset management, serving some of the largest automotive companies globally.

Harsh Patel is an AWS Solutions Architect supporting 200+ SMB customers across the United States to drive digital transformation through cloud-native solutions. As an AI&ML Specialist, he focuses on Generative AI, Computer Vision, Reinforcement Learning, and Anomaly Detection. Outside the tech world, he recharges by hitting the golf course and embarking on scenic hikes with his dog.
