Inference AudioCraft MusicGen models using Amazon SageMaker


Music generation models have emerged as powerful tools that transform natural language text into musical compositions. Originating from advancements in artificial intelligence (AI) and deep learning, these models are designed to understand and translate descriptive text into coherent, aesthetically pleasing music. Their ability to democratize music production allows individuals without formal training to create high-quality music by simply describing their desired outcomes.

Generative AI models are revolutionizing music creation and consumption. Companies can take advantage of this technology to develop new products, streamline processes, and explore untapped potential, yielding significant business impact. Such music generation models enable diverse applications, from personalized soundtracks for multimedia and gaming to educational resources for students exploring musical styles and structures. They also assist artists and composers by providing new ideas and compositions, fostering creativity and collaboration.

One prominent example of a music generation model is AudioCraft MusicGen by Meta. The MusicGen code is released under MIT, and the model weights are released under CC-BY-NC 4.0. MusicGen can create music based on text or melody inputs, giving you greater control over the output. The following diagram shows how MusicGen, a single-stage autoregressive Transformer model, can generate high-quality music based on text descriptions or audio prompts.

Music Generation Models - MusicGen Input Output flow

MusicGen uses cutting-edge AI technology to generate diverse musical styles and genres, catering to various creative needs. Unlike traditional methods that involve cascading multiple models, such as hierarchically or via upsampling, MusicGen operates as a single language model over several streams of compressed discrete music representation (tokens). This streamlined approach empowers users with precise control over generating high-quality mono and stereo samples tailored to their preferences, revolutionizing AI-driven music composition.

MusicGen models can be used across education, content creation, and music composition. They enable students to experiment with various musical styles, generate custom soundtracks for multimedia projects, and create personalized music compositions. Additionally, MusicGen can assist musicians and composers, fostering creativity and innovation.

This post demonstrates how to deploy MusicGen, a music generation model, on Amazon SageMaker using asynchronous inference. We specifically focus on text-conditioned generation of music samples using MusicGen models.

Solution overview

Generative AI models that produce audio, music, or video can be computationally intensive and time-consuming, so they are well suited to asynchronous inference, which queues incoming requests and processes them asynchronously. Our solution involves deploying the AudioCraft MusicGen model on SageMaker using SageMaker endpoints for asynchronous inference. This entails deploying AudioCraft MusicGen models sourced from the Hugging Face Model Hub onto SageMaker infrastructure.

The following solution architecture diagram shows how a user can generate music using natural language text as an input prompt with AudioCraft MusicGen models deployed on SageMaker.

MusicGen on Amazon SageMaker Asynchronous Inference

The following steps detail the sequence in the workflow, from the moment the user enters the input to the point where music is generated as output:

  1. The user invokes the SageMaker asynchronous endpoint using an Amazon SageMaker Studio notebook.
  2. The input payload is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket for inference. The payload includes both the prompt and the music generation parameters. The generated music can be downloaded from the S3 bucket.
  3. The facebook/musicgen-large model is deployed to a SageMaker asynchronous endpoint. This endpoint is used to run inference for music generation.
  4. The Hugging Face Inference Containers image is used as a base image. We use an image that supports PyTorch 2.1.0 with the Hugging Face Transformers framework.
  5. The SageMaker HuggingFaceModel is deployed to a SageMaker asynchronous endpoint.
  6. The Hugging Face model (facebook/musicgen-large) is uploaded to Amazon S3 during deployment. Also, during inference, the generated outputs are uploaded to Amazon S3.
  7. We use Amazon Simple Notification Service (Amazon SNS) topics to notify success and failure, as defined in the SageMaker asynchronous inference configuration.

Prerequisites

Make sure you have the following prerequisites in place:

  1. Confirm you have access to the AWS Management Console to create and manage resources in SageMaker, AWS Identity and Access Management (IAM), and other AWS services.
  2. If you're using SageMaker Studio for the first time, create a SageMaker domain. Refer to Quick setup to Amazon SageMaker to create a SageMaker domain with default settings.
  3. Obtain the AWS Deep Learning Containers for Large Model Inference from the pre-built Hugging Face Inference Containers.

Deploy the solution

To deploy the AudioCraft MusicGen model to a SageMaker asynchronous inference endpoint, complete the following steps:

  1. Create a model serving package for MusicGen.
  2. Create a Hugging Face model.
  3. Define the asynchronous inference configuration.
  4. Deploy the model on SageMaker.

We detail each of these steps and show how to deploy the MusicGen model on SageMaker. For the sake of brevity, only important code snippets are included. The full source code for deploying the MusicGen model is available in the GitHub repo.

Create a model serving package for MusicGen

To deploy MusicGen, we first create a model serving package. The model package contains a requirements.txt file that lists the necessary Python packages to be installed to serve the MusicGen model. The model package also contains an inference.py script that holds the logic for serving the MusicGen model.
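
The exact contents are defined in the GitHub repo; as an illustrative sketch, the archive typically follows the layout the Hugging Face inference container expects, with serving code under a code/ directory (the dependency listed here is an assumption, not the repo's actual pin list):

model.tar.gz
└── code/
    ├── inference.py       # model_fn and predict_fn for serving MusicGen
    └── requirements.txt   # extra dependencies, e.g., scipy for writing .wav files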

Let's look at the key functions used in serving the MusicGen model for inference on SageMaker:

from transformers import MusicgenForConditionalGeneration

def model_fn(model_dir):
    '''Loads the pre-trained MusicGen model'''
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
    return model

The model_fn function loads the MusicGen model facebook/musicgen-large from the Hugging Face Model Hub. We rely on the MusicgenForConditionalGeneration Transformers module to load the pre-trained MusicGen model.

You can also refer to musicgen-large-load-from-s3/deploy-musicgen-large-from-s3.ipynb, which demonstrates the best practice of downloading the model from the Hugging Face Hub to Amazon S3 and reusing the model artifacts for future deployments. Instead of downloading the model from Hugging Face every time we deploy or whenever scaling happens, we download the model to Amazon S3 once and reuse it for deployment and during scaling activities. Doing so can improve download speed, especially for large models, and helps avoid downloads over the internet from a website outside of AWS. This best practice also maintains consistency, meaning the same model from Amazon S3 can be deployed across various staging and production environments.
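
A minimal sketch of this download-once-and-reuse pattern, assuming the huggingface_hub package and the sagemaker_session_bucket used in this post, could look like the following; the notebook in the repo remains the authoritative version:

import os
import boto3
from huggingface_hub import snapshot_download

# Download the model artifacts from the Hugging Face Hub once
local_dir = snapshot_download(repo_id="facebook/musicgen-large")

# Upload each artifact file to Amazon S3 for reuse across deployments
s3 = boto3.client("s3")
for root, _, files in os.walk(local_dir):
    for file_name in files:
        local_path = os.path.join(root, file_name)
        s3_key = f"musicgen_large/hub/{os.path.relpath(local_path, local_dir)}"
        s3.upload_file(local_path, sagemaker_session_bucket, s3_key)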

The predict_fn function uses the data provided in the inference request and the model loaded through model_fn:

from transformers import AutoProcessor

texts, generation_params = _process_input(data)
processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
inputs = processor(
    text=texts,
    padding=True,
    return_tensors="pt",
)

Using the information available in the data dictionary, we process the input to obtain the prompt and the generation parameters used to generate the music. We discuss the generation parameters in more detail later in this post.
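
The _process_input helper is part of the inference script and isn't shown here; a minimal sketch of what it might do, given the payload format shown later in this post, is:

def _process_input(data):
    # Illustrative sketch: pull the prompts and generation parameters
    # out of the request payload (see the payload format later in this post)
    texts = data.get("texts", [])
    generation_params = data.get("generation_params", {})
    return texts, generation_params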

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
audio_values = model.generate(**inputs.to(device),
                              **generation_params)

We load the model to the device and then send the inputs and generation parameters to the model. This process generates the music in the form of a three-dimensional Torch tensor of shape (batch_size, num_channels, sequence_length).

sampling_rate = model.config.audio_encoder.sampling_rate
disk_wav_locations = _write_wavs_to_disk(sampling_rate, audio_values)
# Upload the wavs to S3
result_dict["generated_outputs_s3"] = _upload_wav_files(disk_wav_locations, bucket_name)
# Clean up the disk
for wav_on_disk in disk_wav_locations:
    _delete_file_on_disk(wav_on_disk)

We then use the tensor to generate .wav audio files, upload these files to Amazon S3, and clean up the .wav files stored on disk. Finally, we obtain the S3 URIs of the .wav files and include these locations in the response.
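
The _write_wavs_to_disk helper is also part of the inference script; a sketch of how it might persist each generated sample, assuming scipy is available (the repo may use a different audio library), is:

import scipy.io.wavfile

def _write_wavs_to_disk(sampling_rate, audio_values):
    # Write each sample in the (batch_size, num_channels, sequence_length)
    # tensor to a mono .wav file on disk and return the file paths
    wav_paths = []
    for idx, audio in enumerate(audio_values):
        wav_path = f"/tmp/musicgen_{idx}.wav"
        scipy.io.wavfile.write(wav_path, rate=sampling_rate, data=audio[0].cpu().numpy())
        wav_paths.append(wav_path)
    return wav_paths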

We now create an archive of the inference scripts and upload it to the S3 bucket:

musicgen_prefix = 'musicgen_large'
s3_model_key = f'{musicgen_prefix}/model/model.tar.gz'
s3_model_location = f"s3://{sagemaker_session_bucket}/{s3_model_key}"
s3 = boto3.resource("s3")
s3.Bucket(sagemaker_session_bucket).upload_file("model.tar.gz", s3_model_key)

The S3 URI of this uploaded object will later be used to create the Hugging Face model.

Create the Hugging Face model

Now we initialize HuggingFaceModel with the required arguments. During deployment, the model serving artifacts stored in s3_model_location will be deployed. Before model serving starts, the MusicGen model will be downloaded from Hugging Face according to the logic in model_fn.

huggingface_model = HuggingFaceModel(
    name=async_endpoint_name,
    model_data=s3_model_location,  # path to your model artifacts
    role=role,  # IAM role with permissions to create an endpoint
    env={
           'TS_MAX_REQUEST_SIZE': '100000000',
           'TS_MAX_RESPONSE_SIZE': '100000000',
           'TS_DEFAULT_RESPONSE_TIMEOUT': '3600'
       },
    transformers_version="4.37",  # Transformers version used
    pytorch_version="2.1",  # PyTorch version used
    py_version="py310",  # Python version used
)

The env argument accepts a dictionary of parameters such as TS_MAX_REQUEST_SIZE and TS_MAX_RESPONSE_SIZE, which define the byte size limits for request and response payloads to the asynchronous inference endpoint. The TS_DEFAULT_RESPONSE_TIMEOUT key in the env dictionary represents the timeout in seconds after which the asynchronous inference endpoint stops responding.

You can run MusicGen with the Hugging Face Transformers library from version 4.31.0 onwards; here we set transformers_version to 4.37. MusicGen requires at least PyTorch version 2.1, so we have set pytorch_version to 2.1.

Define the asynchronous inference configuration

Music generation using a text prompt as input can be both computationally intensive and time-consuming. Asynchronous inference in SageMaker is designed to handle these demands. When working with music generation models, it's important to note that the process can often take more than 60 seconds to complete.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making it ideal for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near real-time latency requirements. This capability efficiently handles the extended processing times inherent in music generation tasks. Moreover, asynchronous inference enables seamless auto scaling, making sure that resources are allocated only when needed, leading to cost savings.

Before we proceed with the asynchronous inference configuration, we create SNS topics for success and failure that can be used to perform downstream tasks:

from utils.sns_client import SnsClient
import time
sns_client = SnsClient(boto3.client("sns"))
timestamp = time.time_ns()
topic_names = [f"musicgen-large-topic-SuccessTopic-{timestamp}", f"musicgen-large-topic-ErrorTopic-{timestamp}"]

topic_arns = []
for topic_name in topic_names:
    print(f"Creating topic {topic_name}.")
    response = sns_client.create_topic(topic_name)
    topic_arns.append(response.get('TopicArn'))

We now create the asynchronous inference endpoint configuration by specifying an AsyncInferenceConfig object:

# Create the async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "musicgen_large/async_inference/output"
    ),  # Where our results will be stored
    # Add notification SNS topics if needed
    notification_config={
        "SuccessTopic": topic_arns[0],
        "ErrorTopic": topic_arns[1],
    },  # Notification configuration
)

The arguments to AsyncInferenceConfig are as follows:

  • output_path – The location where the output of the asynchronous inference endpoint will be stored. The files in this location will have an .out extension and will contain the details of the asynchronous inference performed by the MusicGen model.
  • notification_config – Optionally, you can associate success and error SNS topics. Dependent workflows can poll these topics to make informed decisions based on the inference outcomes.

Deploy the model on SageMaker

With the asynchronous inference configuration defined, we can deploy the Hugging Face model, setting initial_instance_count to 1:

# deploy the endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
)

After successfully deploying the model, you can optionally configure automatic scaling for the asynchronous endpoint. With asynchronous inference, you can also scale your asynchronous endpoint's instances down to zero.
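
As a sketch of what scale-to-zero could look like with Application Auto Scaling (the variant name, capacities, and target value here are illustrative assumptions; refer to the SageMaker documentation for recommended settings):

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{async_endpoint_name}/variant/AllTraffic"

# Allow the asynchronous endpoint to scale between 0 and 1 instances
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)

# Track the backlog of queued requests to decide when to scale
autoscaling.put_scaling_policy(
    PolicyName="musicgen-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": async_endpoint_name}],
            "Statistic": "Average",
        },
    },
)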

We now dive into invoking the asynchronous endpoint for music generation.

Inference

In this section, we show how to perform inference using an asynchronous inference endpoint with the MusicGen model. For the sake of brevity, only important code snippets are included. The full source code for running inference with the MusicGen model is available in the GitHub repo. The following diagram shows the sequence of steps to invoke the asynchronous inference endpoint.

MusicGen - Amazon SageMaker Async Inference Sequence Diagram

We detail the steps to invoke the SageMaker asynchronous inference endpoint for MusicGen by prompting a desired mood in natural language using English. We then demonstrate how to download and play the .wav files generated from the user prompt. Finally, we cover the process of cleaning up the resources created as part of this deployment.

Prepare the prompt and instructions

For controlled music generation using MusicGen models, it's important to understand the various generation parameters:

generation_params = { 
    'guidance_scale': 3,
    'max_new_tokens': 1200, 
    'do_sample': True, 
    'temperature': 1 
}

From the preceding code, let's understand the generation parameters:

  • guidance_scale – The guidance_scale is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (predicted from the text prompts) and the unconditional logits (predicted from an unconditional or 'null' prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting guidance_scale > 1. For best results, use guidance_scale = 3. Our deployment defaults to 3.
  • max_new_tokens – The max_new_tokens parameter specifies the number of new tokens to generate. Generation is limited by the sinusoidal positional embeddings to 30-second inputs, meaning MusicGen can't generate more than 30 seconds of audio (1,503 tokens). Our deployment defaults to 256; see the duration helper sketch after this list.
  • do_sample – The model can generate an audio sample conditioned on a text prompt through use of the MusicgenProcessor to preprocess the inputs. The preprocessed inputs can then be passed to the .generate method to generate text-conditional audio samples. Our deployment defaults to True.
  • temperature – This is the softmax temperature parameter. A higher temperature increases the randomness of the output, making it more diverse. Our deployment defaults to 1.
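
Because the 30-second cap corresponds to 1,503 tokens, you can approximate a token budget for a target duration with a rough rule of thumb of about 50 tokens per second of audio (an illustrative helper, not part of the repo):

MAX_TOKENS = 1503                     # hard cap (~30 seconds of audio)
TOKENS_PER_SECOND = MAX_TOKENS / 30   # roughly 50 tokens per second

def seconds_to_max_new_tokens(seconds):
    # Approximate token budget for the requested duration, capped at 30 s
    return min(int(seconds * TOKENS_PER_SECOND), MAX_TOKENS)

generation_params['max_new_tokens'] = seconds_to_max_new_tokens(24)  # ~1200 tokens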

Let's look at how to build a prompt to invoke the MusicGen model:

data = {
    "texts": [
        "Warm and vibrant weather on a sunny day, feeling the vibes of hip hop and synth",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params
}

The preceding code is the payload, which will be saved as a JSON file and uploaded to an S3 bucket. We then provide the URI of the input payload during the asynchronous inference endpoint invocation, along with other arguments, as follows.

The texts key accepts an array of texts that can convey the mood you want to reflect in your generated music. You can also include musical instruments in the text prompt to have the MusicGen model generate music featuring those instruments.
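
Saving and uploading the payload can be as simple as the following sketch (the object key is an illustrative assumption), which also yields the input_s3_location used in the invocation that follows:

import json
import boto3

# Serialize the payload to JSON and upload it to S3 for the async invocation
input_s3_key = "musicgen_large/async_inference/input/payload.json"
input_s3_location = f"s3://{sagemaker_session_bucket}/{input_s3_key}"
boto3.client("s3").put_object(
    Bucket=sagemaker_session_bucket,
    Key=input_s3_key,
    Body=json.dumps(data),
)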

The response from invoke_endpoint_async is a dictionary of various parameters:

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_location,
    ContentType="application/json",
    InvocationTimeoutSeconds=3600
)

OutputLocation in the response metadata represents the Amazon S3 URI where the inference response payload will be stored.

Asynchronous music generation

As soon as the response metadata is sent to the client, the asynchronous inference begins the music generation. The music generation happens on the instance chosen during the deployment of the MusicGen model on the SageMaker asynchronous inference endpoint, as detailed in the deployment section.

Continuous polling and obtaining music files

While the music generation is in progress, we continuously poll for the response metadata parameter OutputLocation:

from utils.inference_utils import get_output
output = get_output(sm_session, response.get('OutputLocation'))

The get_output function keeps polling for the presence of OutputLocation and returns the S3 URIs of the .wav music files.
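
The actual helper lives in utils/inference_utils.py in the repo; a minimal sketch of such a polling loop, assuming sm_session is a sagemaker.Session, could look like this:

import json
import time
import urllib.parse

def get_output(sm_session, output_location, poll_seconds=15):
    # Poll S3 until the async inference result object exists, then parse it
    parsed = urllib.parse.urlparse(output_location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    s3 = sm_session.boto_session.client("s3")
    while True:
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            return json.loads(body)
        except s3.exceptions.NoSuchKey:
            time.sleep(poll_seconds)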

Audio output

Finally, we download the files from Amazon S3 and play the output using the following logic:

from utils.inference_utils import play_output_audios
music_files = []
for s3_url in output.get('generated_outputs_s3'):
    if s3_url is not None:
        music_files.append(download_from_s3(s3_url))
play_output_audios(music_files, data.get('texts'))

You now have access to the .wav files and can try changing the generation parameters to experiment with various text prompts.

The following is another music sample based on these generation parameters:

generation_params = { 'guidance_scale': 5, 'max_new_tokens': 1503, 'do_sample': True, 'temperature': 0.9 }
data = {
    "texts": [
        "Catchy funky beats with drums and bass, synthesized pop for an upbeat pop game",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params
}

Clean up

To avoid incurring unnecessary charges, clean up using the following code:

import boto3
sagemaker_runtime = boto3.client('sagemaker-runtime')

cleanup = False # <- Set this to True to clean up resources.
endpoint_name = "<Endpoint_Name>"  # Replace with your MusicGen endpoint name

sm_client = boto3.client('sagemaker')
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_config_name = endpoint['EndpointConfigName']
endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = endpoint_config['ProductionVariants'][0]['ModelName']
notification_config = endpoint_config['AsyncInferenceConfig']['OutputConfig'].get('NotificationConfig', None)
print(f"""
About to delete the following SageMaker resources:
Endpoint: {endpoint_name}
Endpoint Config: {endpoint_config_name}
Model: {model_name}
""")
for k, v in (notification_config or {}).items():
    print(f'About to delete SNS topic for {k} with ARN: {v}')

if cleanup:
    # delete endpoint
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    # delete endpoint config
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    # delete model
    sm_client.delete_model(ModelName=model_name)
    # delete the SNS topics announced above
    sns = boto3.client('sns')
    for topic_arn in (notification_config or {}).values():
        sns.delete_topic(TopicArn=topic_arn)
    print('Deleted model, endpoint config, endpoint, and SNS topics')

This cleanup routine deletes the SageMaker endpoint, endpoint configuration, and model associated with the MusicGen model so that you avoid incurring unnecessary charges. Make sure to set the cleanup variable to True, and replace <Endpoint_Name> with the actual endpoint name of the MusicGen model deployed on SageMaker. Alternatively, you can use the console to delete the endpoints and related resources that were created while running the code in this post.

Conclusion

In this post, we learned how to use SageMaker asynchronous inference to deploy the AudioCraft MusicGen model. We started by exploring how MusicGen models work and covered various use cases for deploying them. We also explored how you can benefit from capabilities such as auto scaling and the integration of asynchronous endpoints with Amazon SNS to power downstream tasks. We then took a deep dive into the deployment and inference workflow of MusicGen models on SageMaker, using the AWS Deep Learning Containers for Hugging Face inference and the MusicGen model sourced from the Hugging Face Hub.

Get started with generating music from your creative prompts by signing up for AWS. The full source code is available in the official GitHub repository.

About the Authors

Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on the AWS platform. He specializes in architecting AI/ML and generative AI services at AWS. Pavan is a published author of the book "Getting Started with V Programming." In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.

David John Chakram is a Principal Solutions Architect at AWS. He specializes in building data platforms and architecting seamless data ecosystems. With a profound passion for databases, data analytics, and machine learning, he excels at transforming complex data challenges into innovative solutions and driving businesses forward with data-driven insights.

Sudhanshu Hate is a principal AI/ML specialist with AWS and works with clients to advise them on their MLOps and generative AI journey. In his previous role before Amazon, he conceptualized, created, and led teams to build ground-up open source-based AI and gamification platforms, and successfully commercialized them with over 100 clients. Sudhanshu has a few patents to his credit, has written two books and several papers and blogs, and has presented his points of view in various technical forums. He has been a thought leader and speaker, and has been in the industry for nearly 25 years. He has worked with Fortune 1000 clients across the globe and most recently with digital native clients in India.

Rupesh Bajaj is a Solutions Architect at Amazon Web Services, where he collaborates with ISVs in India to help them leverage AWS for innovation. He specializes in providing guidance on cloud adoption through well-architected solutions and holds seven AWS certifications. With 5 years of AWS experience, Rupesh is also a Gen AI Ambassador. In his free time, he enjoys playing chess.
