Implement backup and recovery using an event-driven serverless architecture with Amazon SageMaker Studio


Amazon SageMaker Studio is the first fully integrated development environment (IDE) for ML. It provides a single, web-based visual interface where you can perform all machine learning (ML) development steps required to build, train, tune, debug, deploy, and monitor models. It gives data scientists all the tools needed to take ML models from experimentation to production without leaving the IDE. Moreover, as of November 2022, Studio supports shared spaces to accelerate real-time collaboration and multiple Amazon SageMaker domains in a single AWS Region per account.

There are two prevailing use cases for Studio domain backup and recovery. The first use case involves a customer business unit and project wanting the ability to replicate data scientists' artifacts and data files to any target domains and profiles at will. The second use case involves replication only when the domain and profile are deleted due to conditions such as a change from a customer managed key to an AWS managed key or a change in onboarding from AWS Identity and Access Management (IAM) authentication (see Onboard to Amazon SageMaker Domain Using IAM) to AWS IAM Identity Center (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

This post primarily covers the second use case by presenting how to back up and recover users' work when the user and space profiles are deleted and recreated, but we also provide the Python script to support the first use case.

When user and space profiles are recreated in an existing Studio domain, a new profile directory ID is created within the Studio Amazon Elastic File System (Amazon EFS) volume. As a result, Studio users could lose access to the model artifacts and data files stored in their previous profile directories if those profiles are deleted. Additionally, Studio domains don't currently support mounting custom or additional EFS volumes. We recommend keeping the previous Studio EFS volume as a backup by using RetentionPolicy in Studio.
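The retention behavior is specified at deletion time rather than on the domain itself. As a reference point, the following minimal boto3 sketch deletes a domain while retaining its home EFS volume (the domain ID is a placeholder; default credentials and Region are assumed):

    # Minimal sketch: delete a Studio domain but retain its home EFS volume.
    # The domain ID is a placeholder; default credentials/Region are assumed.
    import boto3

    sagemaker = boto3.client("sagemaker")

    sagemaker.delete_domain(
        DomainId="d-xxxxxxxxxxxx",  # placeholder
        RetentionPolicy={"HomeEfsFileSystem": "Retain"},  # keep the volume as a backup
    )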

Therefore, a proper recovery solution needs to be in place to access the data from the previous directory in case of profile deletion, or to recover files from a detached volume in case of domain deletion. Data scientists can minimize the potential impact of deleting the domain and profiles if they frequently commit their code to a repository and use external storage for data access. However, the capability to back up and recover the data scientist's workspace is another layer to ensure continuity of work, which can improve productivity. Moreover, if you have tens or hundreds of Studio users, consider how to automate the recovery process to avoid mistakes and save costs and time. To solve this problem, we provide a solution to supplement Studio domain recovery.

This post explains the backup and recovery module and one approach to automating the process using an event-driven architecture. First, we demonstrate how to perform backup and recovery when you create a new Studio domain, user, and space profiles using AWS CloudFormation templates. Next, we explain the steps required to test our recovery solution using existing domains and profiles without using our CloudFormation templates (you can use your own templates). Although this post focuses on a single-domain setting, our solution works for multiple Studio domains as well. Lastly, we have automated the provisioning of all resources using the AWS Serverless Application Model (AWS SAM), an open-source framework for building serverless applications.

Solution overview

The following diagram illustrates the high-level workflow of Studio domain backup and recovery with an event-driven architecture.

technical architecture

The event-driven app includes the following steps:

  1. An Amazon CloudWatch Events rule uses AWS CloudTrail to track the CreateUserProfile and CreateSpace API calls, trigger the rule, and invoke the AWS Lambda function.
  2. The function updates the user table and appends items in the history table in Amazon DynamoDB. In addition, the database layer keeps track of the domain and profile name and file system mapping.

The following image shows the DynamoDB table structure. The partition key and sort key in the studioUser table consist of the profile and domain name. The replication column holds the replication flag, with true as the default value. In addition, the bytes_written, bytes_file_transferred, total_duration_ms, and replication_status fields are populated when the replication completes successfully.

table schema
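To make the schema concrete, the following hedged boto3 sketch writes one item in the shape described above. The exact key attribute names used by the repo aren't shown in this post, so profile_name and domain_name are assumptions:

    # Hedged sketch of one studioUser item. The key attribute names
    # (profile_name, domain_name) are assumptions; the replication flag and
    # DataSync metrics fields are the ones described in this post.
    import boto3

    table = boto3.resource("dynamodb").Table("studioUser")

    table.put_item(
        Item={
            "profile_name": "user1",       # partition key (assumed name)
            "domain_name": "demo-domain",  # sort key (assumed name)
            "replication": True,           # replicate on recreation by default
            # Populated by the workflow after a successful DataSync run:
            # bytes_written, bytes_file_transferred, total_duration_ms,
            # replication_status
        }
    )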

The database layer could be replaced with other services, such as Amazon Relational Database Service (Amazon RDS) or Amazon Simple Storage Service (Amazon S3). However, we chose DynamoDB because of the Amazon DynamoDB Streams feature.

  3. DynamoDB Streams is enabled on the user table, and the Lambda function is set as a trigger and synchronously invoked when new stream records are available.
  4. Another Lambda function triggers the process to restore the files using the user and space file restore tools.
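Conceptually, the stream-triggered function filters for newly inserted profile records and starts the recovery workflow. The following Python sketch illustrates that wiring; the STATE_MACHINE_ARN environment variable, the key attribute names, and the execution input shape are assumptions for illustration, not the repo's exact contract:

    # Illustrative sketch of the DynamoDB Streams-triggered Lambda function.
    # STATE_MACHINE_ARN, the key attribute names, and the input shape are
    # assumptions, not the repo's exact contract.
    import json
    import os

    import boto3

    sfn = boto3.client("stepfunctions")

    def handler(event, context):
        for record in event["Records"]:
            if record["eventName"] != "INSERT":  # react only to new profiles
                continue
            keys = record["dynamodb"]["Keys"]
            sfn.start_execution(
                stateMachineArn=os.environ["STATE_MACHINE_ARN"],
                input=json.dumps({
                    "profile_name": keys["profile_name"]["S"],
                    "domain_name": keys["domain_name"]["S"],
                }),
            )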

The backup and recovery workflow consists of the following steps:

  1. The backup and recovery workflow consists of AWS Step Functions, integrated with other AWS services, including AWS DataSync, to orchestrate the recovery of the user and space files from the previous directory to a new directory, either within the same Studio domain EFS volume (profile recreation) or to a new domain EFS volume (domain recreation). With Step Functions Workflow Studio, the workflow can be implemented with no code (as in this case) or low code for a more customized solution. The Step Functions state machine is invoked when the event-driven app detects the profile creation event. For each profile, the state machine runs a DataSync task to copy all files from the previous directory to the new directory.

The following image is the actual graph of the Step Functions state machine. Note that the ListApp* step ensures the profile directories are populated in the Studio EFS volume before proceeding. Also, we implemented retries with exponential backoff to handle API throttling for the DataSync CreateLocationEfs and CreateTask API calls.

step functions diagram

  2. When users open their Studio, all the files from the previous directory are available in their respective directories so they can continue their work. In our experiment, the DataSync task replicating 1 GB of data took approximately 1 minute.
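For context on the throttling mitigation mentioned earlier: the state machine expresses retry with exponential backoff declaratively, but the same behavior looks roughly like the following in Python. This is an illustrative sketch only; the location ARNs, retry parameters, and error code handling are placeholders:

    # Illustrative retry-with-exponential-backoff around the throttled
    # DataSync CreateTask call. ARNs and retry parameters are placeholders.
    import time

    import boto3
    from botocore.exceptions import ClientError

    datasync = boto3.client("datasync")

    def create_task_with_backoff(src_location_arn, dst_location_arn, retries=5):
        delay = 1.0
        for _ in range(retries):
            try:
                return datasync.create_task(
                    SourceLocationArn=src_location_arn,
                    DestinationLocationArn=dst_location_arn,
                )["TaskArn"]
            except ClientError as err:
                if err.response["Error"]["Code"] != "ThrottlingException":
                    raise
                time.sleep(delay)  # back off before the next attempt
                delay *= 2         # exponential backoff
        raise RuntimeError("CreateTask still throttled after retries")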

The following services are used as part of the solution:

  • Amazon SageMaker Studio
  • AWS Lambda
  • Amazon DynamoDB
  • AWS Step Functions
  • AWS DataSync
  • Amazon EFS
  • AWS CloudTrail
  • Amazon CloudWatch
  • AWS CloudFormation and AWS SAM

Prerequisites

To implement this solution, you must have the following prerequisites:

  • An AWS account if you don't already have one. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
  • The AWS SAM CLI installed and configured.
  • Your AWS credentials set up.
  • Git installed.
  • Python 3.9.
  • A Studio profile and domain name combination that is unique across all Studio domains within a Region and account.
  • An existing Amazon VPC and S3 bucket to follow the deployment steps.
  • Awareness of the service quota for the maximum number of DataSync tasks per account per Region (default is 100). You can request a quota increase to meet the number of replication tasks for your use case.

Refer to the AWS Regional Services List for service availability by Region. Additionally, review Amazon SageMaker endpoints and quotas.

Set up the Studio profile recovery infrastructure

The following diagram shows the logical steps for a SageMaker administrator to set up the Studio user and space recovery infrastructure, which can be completed with a single command using our automated solution.

logical flow 1

To set up the environment, clone the GitHub repo in the terminal:

git clone https://github.com/aws-samples/sagemaker-studio-efs-recovery-serverless.git && cd sagemaker-studio-efs-recovery-serverless

The following code shows the deployment script usage:

bash deploy.sh -h

Usage: deploy.sh [-n <stack_name>] [-v <vpc_id>] [-s <subnet_id>] [-b <s3_bucket>] [-r <aws_region>] [-d]

Options:
  -n: specify stack name
  -v: specify your vpc id
  -s: specify subnet
  -b: specify s3 bucket name to store artifacts
  -r: specify aws region
  -d: whether to skip creation of a new SageMaker Studio Domain (default: no)

To create a new Amazon SageMaker domain, run the following command. You need to specify which Amazon VPC and subnet you want to use. We use VPC only mode for the Studio deployment. If you don't have any preference, you can use the default VPC and subnet. Also, specify any stack name, AWS Region, and S3 bucket name for AWS SAM to deploy the Lambda function:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

If you want to use an existing Studio domain, run the following command instead. Option -d yes skips the creation of a new Studio domain:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

For existing domains, the SageMaker administrator must also update the source and target Studio EFS security groups to allow the connection to the user and space file restore tool. For example, to run the following command, you need to specify HomeEfsFileSystemId, the EFS file system ID, and SecurityGroupId, used by the user and space file restore tool (we discuss this in more detail later in this post):

python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>
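In essence, the update opens NFS (port 2049) on the Studio EFS security group to the security group used by the restore tool. A minimal boto3 sketch of that ingress rule, assuming the EFS mount target's security group ID has already been resolved (the repo script handles that lookup), looks like this:

    # Minimal sketch: allow inbound NFS (port 2049) on the Studio EFS security
    # group from the restore tool's security group. Resolving the EFS mount
    # target's security group ID is assumed to have happened already;
    # src/add-security-group.py is the supported script.
    import boto3

    ec2 = boto3.client("ec2")

    def allow_nfs(efs_sg_id: str, tool_sg_id: str) -> None:
        ec2.authorize_security_group_ingress(
            GroupId=efs_sg_id,
            IpPermissions=[{
                "IpProtocol": "tcp",
                "FromPort": 2049,
                "ToPort": 2049,
                "UserIdGroupPairs": [{"GroupId": tool_sg_id}],
            }],
        )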

User and space recovery logical flow

The following diagram shows the logical user and space recovery flow for a SageMaker administrator to understand how the solution works; no additional setup is required. If the profile (user or space) and domain are accidentally deleted, the EFS volume is detached but not deleted. A possible scenario is that we may want to revert the deletion by recreating a new domain and profiles. If the same profiles are being onboarded again, they may wish to access the files from their respective workspaces in the detached volume. The recovery process is almost fully automated; the only action required of the SageMaker administrator is to recreate the Studio domain and profiles using the same CloudFormation template. The rest of the steps are automated.

logical flow 2

Optionally, if the SageMaker admin wants control over replication, run the following command to turn off replication for specific domains and profiles. This script updates the replication field for the given domain and profile name in the table. Note that you need to run the script each time the same user is recreated:

python3 src/update-replication-flag.py --profile-name <profile_name> --domain-name <domain_name> --region <aws_region> --no-replication
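Under the hood, this amounts to a DynamoDB update on the profile's item. A hedged sketch follows (the key attribute names are the same assumptions as before; the script is the supported interface):

    # Hedged sketch of turning replication off for one profile. Key attribute
    # names are assumptions; use src/update-replication-flag.py in practice.
    import boto3

    table = boto3.resource("dynamodb").Table("studioUser")

    table.update_item(
        Key={"profile_name": "user1", "domain_name": "demo-domain"},
        UpdateExpression="SET replication = :flag",
        ExpressionAttributeValues={":flag": False},
    )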

The following optional step provides the solution for the first use case, allowing replication between a specified source file system and any target domain and profile name. If the SageMaker admin wants to replicate a particular profile's data to a different domain and a profile that doesn't exist yet, run the following command. The script inserts the new domain and profile name with the specified source file system information. The subsequent profile creation will trigger the replication task. Note that you need to run add-security-group.py from the previous step to allow the connection to the file restore tool:

python3 src/add-replication-target.py --src-profile-name <profile_name> --src-domain-name <domain_name> --target-profile-name <profile_name> --target-domain-name <domain_name> --region <aws_region>
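Conceptually, the script seeds an item for the not-yet-created target profile that points back at the source profile, so the next profile creation event finds it and replicates. A hedged sketch with assumed attribute names:

    # Hedged sketch of seeding a replication target. All attribute names are
    # assumptions; src/add-replication-target.py is the supported script.
    import boto3

    table = boto3.resource("dynamodb").Table("studioUser")

    table.put_item(
        Item={
            "profile_name": "user2",         # target profile (assumed name)
            "domain_name": "target-domain",  # target domain (assumed name)
            "replication": True,
            "src_profile_name": "user1",     # source mapping (assumed names)
            "src_domain_name": "demo-domain",
        }
    )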

In the following sections, we test two scenarios to confirm that the solution works as expected.

Create a new Studio domain

Our first test scenario assumes you're starting from scratch and want to create a new Studio domain and profiles in your environment using our templates. Then we deploy the Studio domain, user and space, backup and recovery workflow, and event app. The purpose of the first scenario is to confirm that the profile file is recovered in the new home directory automatically when the profile is deleted and recreated within the same Studio domain.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*
    4. <stack_name>-StudioDomain-*
    5. <stack_name>-StudioUser1-*
    6. <stack_name>-StudioSpace-*

cloud formation console

If the deployment failed in any stack, check the error and resolve the issues. Then proceed to the next step only if the problems are resolved.

  3. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  4. Select studioUser and choose Explore table items to confirm that items for user1 and space1 are populated in the table.
  5. On the SageMaker console, choose Domains in the navigation pane.
  6. Choose demo-myapp-dev-studio-domain.
  7. On the User profiles tab, select user1 and choose Launch, then choose Studio to open Studio for the user.

Note that Studio may take 10–15 minutes to load for the first time.

  8. On the File menu, choose Terminal to launch a new terminal inside Studio.
  9. Run the following command in the terminal to create a file for testing:
    echo "i don't want to lose access to this file" > user1.txt

  10. Repeat these steps for space1 (choose Spaces in Step 7). Feel free to create a file of your choice.
  11. Delete the Studio user user1 and space1 by removing the nested stacks <stack_name>-StudioUser1-* and <stack_name>-StudioSpace-* from the parent. Delete the stacks by commenting out the following code blocks in the AWS SAM template file, template.yaml. Make sure to save the file after the edit:
    StudioUser1:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-user.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioUserProfileName: !Ref StudioUserProfileName1
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName
    StudioSpace:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-space.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioSpaceName: !Ref StudioSpaceName
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName

  12. Run the following command to deploy the stack with this change:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  13. Recreate the Studio profiles by adding the stacks back to the parent. Uncomment the code blocks from the previous step, save the file, and run the same command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

After a successful deployment, you can check the results.

  14. On the AWS CloudFormation console, choose the stack <stack_name>-StepFunction-*.
  15. In the stack, choose the value for Physical ID of StepFunction in the Resources section.
  16. Choose the latest run and confirm its status in Graph view.

It should look like the following screenshot for the user profile replication. You can also check the other run to confirm the same for the space profile.

step functions complete

  17. If you completed Steps 5–10, open the Studio domain for user1 and confirm that the user1.txt file is copied to the newly created directory.

It shouldn't be visible in the space1 directory, and the same file ownership is preserved.

  18. Repeat this step for space1.
  19. On the DataSync console, choose the latest task ID.
  20. Choose History and the latest run ID.

This is another way to check the configuration and run status of the DataSync task. For example, the following screenshot shows the task result for the user1 directory replication.

datasync complete

We only covered profile recreation in this scenario. However, our solution works the same way for Studio domain recreation, and it can be tested by deleting and recreating the domain.

Use an existing Studio domain

Our second test scenario assumes you want to use an existing SageMaker domain and profiles in the environment. Therefore, we only deploy the backup and recovery workflow and the event app. Again, you can use your own Studio CloudFormation template or create the domain and profiles through the AWS CloudFormation console to follow along. Because we're using an existing Studio domain, the solution will list the existing users and spaces for all domains within the Region, which we call seeding.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*

If the deployment failed in any stack, check the error and resolve the issues. Then proceed to the next step only if the problems are resolved.

  3. Verify that the initial data seed has completed.
  4. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  5. Select studioUser and choose Explore table items to confirm that items for the existing Studio domain are populated in the table.

Proceed to the next step only if the seed has completed successfully. If the tables aren't populated, check the CloudWatch logs of the corresponding Lambda function. On the AWS CloudFormation console, choose the stack <stack_name>-EventApp-*, and choose the physical ID of DDBSeedLambda in the Resources section. Under Monitor, choose View CloudWatch Logs and check the logs of the latest run to troubleshoot.

  6. To update the EFS security group, first get the SecurityGroupId. We use the security group created by the CloudFormation template, which allows all traffic in the outbound connection. Run the following command:
    echo "SecurityGroupId:" $(aws ssm get-parameter --name /network/vpc/sagemaker/securitygroups --region <aws_region> --query 'Parameter.Value')

  7. Get the HomeEfsFileSystemId, which is the ID of the Studio home EFS volume. Run the following command:
    echo "HomeEfsFileSystemId:" $(aws sagemaker describe-domain --domain-id <domain_id> --region <aws_region> --query 'HomeEfsFileSystemId')

  8. Finally, update the EFS security group by allowing inbound connections on port 2049 from the security group shared with the DataSync task. Run the following command:
    python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

  9. Delete and recreate the Studio profiles of your choice, using the same profile names.
  10. Confirm the run status of the Step Functions state machine and the recovery of the Studio profile directory by following the steps from the first scenario.

You can also test the Step Functions workflow manually with your choice of source and target inputs for replication (more details are in the README.md in the GitHub repository).

Clean up

Run the following command to clean up your resources:

sam delete --region <aws_region> --no-prompts --stack-name <stack_name>

Manually delete the SageMakerSecurityGroup after 20 minutes or so. Deletion of the elastic network interface (ENI) can make the stack show as DELETE_IN_PROGRESS for some time, so we intentionally set the security group to be retained. Also, you need to disassociate that security group from the security group managed by SageMaker before you can delete it.
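If you prefer to script that manual cleanup, the following hedged boto3 sketch shows the disassociation step: it revokes inbound rules on the SageMaker-managed security group that reference the retained group, then deletes the retained group. Both group IDs are placeholders you must look up first:

    # Hedged cleanup sketch: drop rules on the SageMaker-managed security group
    # that reference the retained SageMakerSecurityGroup, then delete it.
    # Both group IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2")

    def delete_retained_sg(retained_sg_id: str, managed_sg_id: str) -> None:
        rules = ec2.describe_security_group_rules(
            Filters=[{"Name": "group-id", "Values": [managed_sg_id]}]
        )["SecurityGroupRules"]
        rule_ids = [
            r["SecurityGroupRuleId"]
            for r in rules
            if not r["IsEgress"]
            and r.get("ReferencedGroupInfo", {}).get("GroupId") == retained_sg_id
        ]
        if rule_ids:  # disassociate before deletion
            ec2.revoke_security_group_ingress(
                GroupId=managed_sg_id, SecurityGroupRuleIds=rule_ids
            )
        ec2.delete_security_group(GroupId=retained_sg_id)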

Conclusion

Studio is a powerful IDE that allows data scientists to quickly develop, train, test, and deploy models. This post discusses how to back up and recover the files stored in a data scientist's home and shared space directories. We also demonstrated how an event-driven architecture can help automate the recovery process.

Our solution can help improve the resiliency of data scientists' artifacts within Studio, leading to operational efficiency on the AWS Cloud. Also, the solution is modular, so you can use the necessary components and update them for your usage. For instance, one enhancement to this solution might be cross-account replication. We hope that what we demonstrated in this post will be a helpful resource to support those ideas.

To get started with Studio, check out Amazon SageMaker for Data Scientists. Please send us feedback on the AWS forum for SageMaker or through your AWS Support contacts. You can find other Studio examples in our GitHub repository.


About the Authors

Kenny Sato is a Machine Learning Engineer at AWS, guiding customers in architecting and implementing machine learning solutions. He received his master's in Computer Engineering from Virginia Tech and is pursuing a PhD in Computer Science. In his spare time, you can find him in his backyard or out somewhere playing with his lovely daughters.

Gautam Nambiar is a DevOps Consultant with AWS. He is particularly interested in architecting and building automated solutions, MLOps pipelines, and creating reusable and secure DevOps best practice patterns. In his spare time, he likes playing and watching soccer.
