How Patsnap used GPT-2 inference on Amazon SageMaker with low latency and cost

This blog post was co-authored, and includes an introduction, by Zilong Bai, senior natural language processing engineer at Patsnap.

You’re likely familiar with the autocomplete suggestion feature when you search for something on Google or Amazon. Although the search terms in these scenarios are fairly common keywords or expressions that we use in daily life, in some cases search terms are very specific to the scenario. Patent search is one of them. Recently, the AWS Generative AI Innovation Center collaborated with Patsnap to implement a feature that automatically suggests search keywords, as an innovation exploration to improve user experiences on their platform.

Patsnap provides a global one-stop platform for patent search, analysis, and management. They use big data (such as a history of past search queries) to provide many powerful yet easy-to-use patent tools. These tools have enabled Patsnap’s global customers to have a better understanding of patents, track recent technological advances, identify innovation trends, and analyze competitors in real time.

At the same time, Patsnap is embracing the power of machine learning (ML) to develop features that can continuously improve user experiences on the platform. A recent initiative is to simplify the difficulty of constructing search expressions by autofilling patent search queries using state-of-the-art text generation models. Patsnap had trained a customized GPT-2 model for this purpose. Because no such feature exists in a patent search engine (to the best of their knowledge), Patsnap believes adding this feature will increase end-user stickiness.

However, in their recent experiments, the inference latency and queries per second (QPS) of a PyTorch-based GPT-2 model couldn’t meet certain thresholds that would justify its business value. To tackle this challenge, AWS Generative AI Innovation Center scientists explored a variety of solutions to optimize GPT-2 inference performance, reducing the model latency by 50% on average and improving the QPS by 200%.

Large language model inference challenges and optimization approaches

In general, applying such a large model in a real-world production environment is non-trivial. The prohibitive computation cost and latency of PyTorch-based GPT-2 made it difficult to adopt widely from a business operation perspective. In this project, our goal is to significantly improve the latency with reasonable computation costs. Specifically, Patsnap requires the following:

The average latency of model inference for generating search expressions needs to be controlled within 600 milliseconds in real-time search scenarios
The model requires high throughput and QPS to handle a large number of searches per second during peak business hours

In this post, we discuss our findings using Amazon Elastic Compute Cloud (Amazon EC2) instances, featuring GPU-based instances using NVIDIA TensorRT.

In short, we use NVIDIA TensorRT to optimize the latency of GPT-2 and deploy it to an Amazon SageMaker endpoint for model serving, which reduces the average latency from 1,172 milliseconds to 531 milliseconds at a request concurrency of 4.

In the following sections, we go over the technical details of the proposed solutions with key code snippets and show comparisons with the customer’s status quo based on key metrics.

GPT-2 model overview

OpenAI’s GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on the WebText dataset, which contains 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and let it generate a lengthy continuation. In this situation, we exploit it to generate search queries. As GPT models keep growing larger, inference costs are continuously rising, which increases the need to deploy these models with acceptable cost.
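To make the "predict the next word, then continue" idea concrete, the following is a toy sketch of greedy autoregressive generation. It is not GPT-2: the probability table is made up for illustration, and it conditions only on the last token, whereas GPT-2 conditions on all previous tokens. The names NEXT_TOKEN_PROBS and greedy_complete are hypothetical.

```python
# Toy illustration of autoregressive generation: repeatedly pick the most
# likely next token given what has been produced so far. The table below is
# invented for illustration and is not taken from any real model.
NEXT_TOKEN_PROBS = {
    "lithium": {"battery": 0.7, "ion": 0.3},
    "battery": {"separator": 0.6, "<eos>": 0.4},
    "separator": {"<eos>": 1.0},
    "ion": {"battery": 1.0},
}


def greedy_complete(prefix, max_new_tokens=5):
    """Extend `prefix` one token at a time, stopping at <eos> or the length cap."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        # Simplification: look only at the last token (GPT-2 uses the full context).
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


completion = greedy_complete(["lithium"])
print(" ".join(completion))
```

In the search-query setting, the prefix plays the role of what the user has typed so far, and the continuation is the suggested query.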

Achieve low latency on GPU instances via TensorRT

TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators, supporting major deep learning frameworks such as PyTorch and TensorFlow. Previous studies have shown great performance improvement in terms of model latency. Therefore, it’s an ideal choice for us to reduce the latency of the target model on NVIDIA GPUs.

We are able to achieve a significant reduction in GPT-2 model inference latency with a TensorRT-based model on NVIDIA GPUs. The TensorRT-based model is deployed via SageMaker for performance tests. In this post, we show the steps to convert the original PyTorch-based GPT-2 model to a TensorRT-based model.

Converting the PyTorch-based GPT-2 to the TensorRT-based model isn’t difficult with the official tool provided by NVIDIA. In addition, with such straightforward conversions, no obvious model accuracy degradation has been observed. In general, there are three steps to follow:

Analyze your GPT-2. As of this writing, NVIDIA’s conversion tool only supports Hugging Face’s version of the GPT-2 model. If the current GPT-2 model isn’t the original version, you need to modify it accordingly. It’s recommended to strip out custom code from the original GPT-2 implementation of Hugging Face, which is very helpful for the conversion.
Install the required Python packages. The conversion process first converts the PyTorch-based model to the ONNX model and then converts the ONNX-based model to the TensorRT-based model. The following Python packages are needed for this two-step conversion:

tabulate
toml
torch
sentencepiece==0.1.95
onnx==1.9.0
onnx_graphsurgeon
polygraphy
transformers

Convert your model. The following code contains the functions for the two-step conversion:

# NetworkMetadata, Precision, GPT2Metadata, GPT2TorchFile, GPT2ONNXFile,
# GPT2TRTDecoder, and Profile are provided by NVIDIA's conversion tool and its
# dependencies; model, config, and GPT2_VARIANT must be defined beforehand.

def torch2onnx():
    # Step 1: export the PyTorch-based GPT-2 to ONNX (FP16, kv_cache disabled).
    metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=True), other=GPT2Metadata(kv_cache=False))
    gpt2 = GPT2TorchFile(model.to('cpu'), metadata)
    onnx_path = 'Your own path to save ONNX-based model'  # e.g., ./model_fp16.onnx
    gpt2.as_onnx_model(onnx_path, force_overwrite=False)
    return onnx_path, metadata

def onnx2trt(onnx_path, metadata):
    # Step 2: build a TensorRT engine from the ONNX model.
    trt_path = 'Your own path to save TensorRT-based model'  # e.g., ./model_fp16.onnx.engine
    batch_size = 10
    max_sequence_length = 42
    # Dynamic-shape profile for input_ids: (batch, sequence length).
    profiles = [Profile().add(
        "input_ids",
        min=(1, 1),
        opt=(batch_size, max_sequence_length // 2),
        max=(batch_size, max_sequence_length),
    )]
    gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=trt_path, profiles=profiles)
    gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config, max_sequence_length=max_sequence_length, batch_size=batch_size)
    return gpt2_trt
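The profile in onnx2trt declares the minimum, optimal, and maximum shapes of input_ids. A TensorRT engine built with such a profile accepts only inputs whose dimensions fall between the minimum and maximum, so it can help to validate request shapes before inference. The following is a minimal sketch in plain Python, assuming the same bounds as above; fits_profile is a hypothetical helper and not part of NVIDIA's tool.

```python
# Shape bounds mirroring the optimization profile used in onnx2trt:
# (batch, sequence_length) for the "input_ids" tensor.
BATCH_SIZE = 10
MAX_SEQUENCE_LENGTH = 42
PROFILE_MIN = (1, 1)
PROFILE_MAX = (BATCH_SIZE, MAX_SEQUENCE_LENGTH)


def fits_profile(shape, lo=PROFILE_MIN, hi=PROFILE_MAX):
    """Return True if every dimension of `shape` lies within [lo, hi]."""
    return len(shape) == len(lo) and all(
        low <= dim <= high for dim, low, high in zip(shape, lo, hi)
    )


print(fits_profile((4, 21)))   # within the bounds
print(fits_profile((11, 21)))  # batch dimension exceeds the profile maximum
```

Requests that fall outside the bounds would need to be truncated, split into smaller batches, or rejected before they reach the engine.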

Latency comparison: PyTorch vs. TensorRT

JMeter is used for performance benchmarking in this project. JMeter is an Apache project that can be used as a load testing tool for analyzing and measuring the performance of a variety of services. We record the QPS and latency of the original PyTorch-based model and our converted TensorRT-based GPT-2 model on an AWS p3.2xlarge instance. As we show later in this post, thanks to the powerful acceleration ability of TensorRT, the latency of GPT-2 is significantly reduced. When the request concurrency is 1, the average latency is reduced by 274 milliseconds (2.9 times faster). QPS increases from 2.4 to 7, which is around a 2.9 times boost compared to the original PyTorch-based model. Moreover, as the concurrency increases, QPS keeps increasing. This implies lower costs with an acceptable latency increase (but still much faster than the original model).

The following table compares latency:

Model	Concurrency	QPS	Maximum Latency (ms)	Minimum Latency (ms)	Average Latency (ms)
Customer PyTorch version (on p3.2xlarge)	1	2.4	632	105	417
	2	3.1	919	168	636
	3	3.4	1911	222	890
	4	3.4	2458	277	1172
AWS TensorRT version (on p3.2xlarge)	1	7 (+4.6)	275	22	143 (-274 ms)
	2	7.2 (+4.1)	274	51	361 (-275 ms)
	3	7.3 (+3.9)	548	49	404 (-486 ms)
	4	7.5 (+4.1)	765	62	531 (-641 ms)
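The deltas and the 2.9 times figure can be re-derived from the table. The following sketch copies the QPS and average latency columns and checks each row against the 600 millisecond target stated earlier; the function and field names are illustrative only.

```python
# (concurrency, QPS, average latency in ms), copied from the table above.
pytorch = [(1, 2.4, 417), (2, 3.1, 636), (3, 3.4, 890), (4, 3.4, 1172)]
tensorrt = [(1, 7.0, 143), (2, 7.2, 361), (3, 7.3, 404), (4, 7.5, 531)]

TARGET_MS = 600  # average latency requirement for real-time search


def compare(baseline, optimized, target_ms=TARGET_MS):
    """Pair rows by concurrency and compute latency and QPS improvements."""
    rows = []
    for (concurrency, qps_b, lat_b), (_, qps_o, lat_o) in zip(baseline, optimized):
        rows.append({
            "concurrency": concurrency,
            "latency_saved_ms": lat_b - lat_o,
            "latency_speedup": round(lat_b / lat_o, 2),
            "qps_gain": round(qps_o / qps_b, 2),
            "baseline_within_target": lat_b <= target_ms,
            "optimized_within_target": lat_o <= target_ms,
        })
    return rows


for row in compare(pytorch, tensorrt):
    print(row)
```

At a concurrency of 1, both ratios come out at roughly 2.9. The TensorRT averages stay under 600 milliseconds at every tested concurrency, while the PyTorch averages exceed it from a concurrency of 2 upward.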

Deploy TensorRT-based GPT-2 with SageMaker and a custom container

TensorRT-based GPT-2 requires a relatively recent TensorRT version, so we choose the bring your own container (BYOC) mode of SageMaker to deploy our model. BYOC mode provides a flexible way to deploy the model, and you can build customized environments in your own Docker container. In this section, we show how to build your own container, deploy your own GPT-2 model, and test with the SageMaker endpoint API.

Build your own container

The container’s file directory is presented in the following code. Specifically, Dockerfile and build.sh are used to build the Docker container. gpt2 and predictor.py implement the model and the inference API. serve, nginx.conf, and wsgi.py provide the configuration for the NGINX web server.

container
├── Dockerfile    # build our Docker image based on this file
├── build.sh      # create our own image and push it to Amazon ECR
├── gpt2          # model directory
├── predictor.py  # backend function for invoking the model
├── serve         # web server setting file
├── nginx.conf    # web server setting file
└── wsgi.py       # web server setting file

You can run sh ./build.sh to build the container.
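The post does not show predictor.py, so the following is only a rough sketch of what such a backend could look like. SageMaker hosting sends health checks to GET /ping and inference requests to POST /invocations on the container, and the sketch serves both as a plain WSGI callable. The generate_queries stub, the "input" request field, and the "suggestions" response field are assumptions for illustration, not the actual Patsnap implementation.

```python
import json


def generate_queries(prefix):
    """Hypothetical stand-in for the call into the TensorRT-based GPT-2."""
    return [prefix + " patent", prefix + " search"]


def app(environ, start_response):
    """Minimal WSGI app exposing the two routes SageMaker calls."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")
    json_header = [("Content-Type", "application/json")]

    if method == "GET" and path == "/ping":
        # Health check: an HTTP 200 tells SageMaker the container is ready.
        start_response("200 OK", json_header)
        return [b"{}"]

    if method == "POST" and path == "/invocations":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)
        try:
            prefix = json.loads(body)["input"]
        except (ValueError, KeyError, TypeError):
            start_response("400 Bad Request", json_header)
            return [json.dumps({"error": "expected JSON with an 'input' field"}).encode()]
        payload = json.dumps({"suggestions": generate_queries(prefix)}).encode()
        start_response("200 OK", json_header)
        return [payload]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

In a layout like the one above, wsgi.py would typically import this callable, and NGINX would forward incoming requests to the WSGI server that hosts it.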

Deploy to a SageMaker endpoint

After you have built a container to run the TensorRT-based GPT-2, you can enable real-time inference via a SageMaker endpoint. Use the following code snippets to create the endpoint and deploy the model to the endpoint using the corresponding SageMaker APIs:

import boto3
from time import gmtime, strftime
from sagemaker import get_execution_role

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client(service_name="sagemaker-runtime")
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
s3_bucket = "${Your s3 bucket}"
role = get_execution_role()
model_name = "${Your Model Name}"
# You need to push your container image to Amazon ECR first
image_uri = "${Your Image Path}"
instance_type = "ml.p3.2xlarge"
container = {
    'Image': image_uri
}
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    Containers=[container])

# Endpoint Setting
endpoint_config_name = "${Your Endpoint Config Name}"
print('Endpoint config name: ' + endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])
print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

# Deploy Model
endpoint_name = "${Your Endpoint Name}"
print('Endpoint name: ' + endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)
print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)
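The post does not cover teardown, but an endpoint keeps its instance running until it is deleted. As a hedged sketch, the resources created above could be removed in reverse order with the corresponding SageMaker client calls. The helper name delete_endpoint_resources is hypothetical; it expects the same sm_client and names used in the deployment snippet.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its endpoint config, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


# Example call, reusing the variables from the deployment code:
# delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)
```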

Test the deployed model

After the model is successfully deployed, you can test the endpoint via the SageMaker notebook instance with the following code:

import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
endpoint_name = "${Your Endpoint Name}"
request_body = {"input": "amazon"}
payload = json.dumps(request_body)
content_type = "application/json"
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload,  # Replace with your own data.
)
result = json.loads(response['Body'].read().decode())
print(result)
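The benchmarks in this post were run with JMeter. For a quick sanity check from a notebook, a rough equivalent can be sketched in Python: call the endpoint from several threads and summarize QPS and latency at a chosen concurrency. The helper run_load_test is hypothetical, and send_request is any zero-argument callable, for example a function that wraps the invoke_endpoint call above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, concurrency, requests_per_worker):
    """Call `send_request` from `concurrency` threads and summarize timings."""

    def worker(_worker_index):
        latencies_ms = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            send_request()
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
        return latencies_ms

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        per_worker = list(pool.map(worker, range(concurrency)))
    wall_seconds = time.perf_counter() - wall_start

    latencies = [ms for chunk in per_worker for ms in chunk]
    return {
        "requests": len(latencies),
        "qps": len(latencies) / wall_seconds,
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "avg_ms": sum(latencies) / len(latencies),
    }


# Stand-in request that just sleeps; swap in a real endpoint call to measure it.
stats = run_load_test(lambda: time.sleep(0.01), concurrency=2, requests_per_worker=3)
print(stats)
```

Client-side timings like these include network overhead, so they are not directly comparable to the JMeter numbers in the latency table.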

Conclusion

In this post, we described how to enable low-latency GPT-2 inference on SageMaker to create business value. Specifically, with the support of NVIDIA TensorRT, we can achieve 2.9 times acceleration on the NVIDIA GPU instances with SageMaker for a customized GPT-2 model.

If you want help with accelerating the use of GenAI models in your products and services, please contact the AWS Generative AI Innovation Center. The AWS Generative AI Innovation Center can help you make your ideas a reality faster and more effectively. To get started with the Generative AI Innovation Center, visit here.

About the Authors

Hao Huang is an applied scientist at the AWS Generative AI Innovation Center. He specializes in Computer Vision (CV) and Visual-Language Model (VLM). Recently, he has developed a strong interest in generative AI technologies and has already collaborated with customers to apply these cutting-edge technologies to their business. He is also a reviewer for AI conferences such as ICCV and AAAI.

Zilong Bai is a senior natural language processing engineer at Patsnap. He is passionate about research and proof-of-concept work on cutting-edge techniques for generative language models.

Yuanjun Xiao is a Solution Architect at AWS. He is responsible for AWS architecture consulting and design. He is also passionate about building AI and analytic solutions.

Xuefei Zhang is an applied scientist at the AWS Generative AI Innovation Center, who works in NLP and AGI areas to solve industry problems with customers.

Guang Yang is a senior applied scientist at the AWS Generative AI Innovation Center, where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.