Host the Spark UI on Amazon SageMaker Studio


Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.

You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs on a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.

Alternatively, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option lets you select several types of instances (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, thereby enabling greater flexibility for data processing and model training, as the sketch that follows illustrates.
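For instance, the following is a minimal sketch of such a batch job using the SageMaker Python SDK's PySparkProcessor; the IAM role, script name, and Spark properties are placeholders for illustration, not values from this post:

from sagemaker.spark.processing import PySparkProcessor

# Build a Spark cluster of three memory-optimized nodes for the job.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.r5.xlarge",   # memory optimized; use ml.c5.* for compute optimized
    instance_count=3,               # number of nodes in the cluster
)

# Tune the cluster configuration with EMR-style classifications.
spark_processor.run(
    submit_app="preprocess.py",     # placeholder PySpark script
    configuration=[
        {
            "Classification": "spark-defaults",
            "Properties": {"spark.executor.memory": "18g"},
        }
    ],
)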

Finally, you can run Spark applications by connecting Studio notebooks to Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).

All these options allow you to generate and store Spark event logs and analyze them through the web-based user interface commonly known as the Spark UI, which runs a Spark History Server to monitor the progress of Spark applications, track resource usage, and debug errors.

In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, to analyze Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.

Solution overview

The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:

  • Accessing logs generated by SageMaker Processing Spark jobs
  • Accessing logs generated by AWS Glue Spark applications
  • Accessing logs generated by self-managed Spark clusters and Amazon EMR

A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli enables managing Spark History Server without leaving SageMaker Studio.

The solution consists of shell scripts that perform the following actions:

  • Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
  • Install the sm-spark-cli for a user profile or shared space

Install the Spark UI manually in a SageMaker Studio domain

To host the Spark UI on SageMaker Studio, complete the following steps:

  1. Choose System terminal from the SageMaker Studio launcher.

  2. Run the following commands in the system terminal:
curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh

The commands will take a few seconds to complete.

  3. When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:

sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>

The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.

For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set the Spark event log location directly from the notebook by using the sparkmagic kernel.

The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic (%spark, %sql) commands to run Spark code, perform SQL queries, and configure Spark settings such as executor memory and cores.
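For example, with the sparkmagic kernel you can set the event log location before the session starts by using the %%configure magic in the first notebook cell. The following is a minimal sketch with a placeholder bucket and prefix; the exact session parameters can differ between the sparkmagic and AWS Glue interactive sessions kernels:

%%configure -f
{
    "conf": {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": "s3://DOC-EXAMPLE-BUCKET/sparkmagic/spark-event-logs/"
    }
}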

For SageMaker Processing jobs, you can configure the Spark event log location directly from the SageMaker Python SDK.
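The relevant parameter is spark_event_logs_s3_uri on the processor's run method. The following is a minimal sketch; the IAM role, application script, and S3 prefix are placeholders:

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.m5.xlarge",
    instance_count=2,
)

# spark_event_logs_s3_uri sets the S3 prefix where the job writes Spark event
# logs; pass the same prefix to sm-spark-cli start to browse them in the Spark UI.
spark_processor.run(
    submit_app="preprocess.py",  # placeholder application script
    spark_event_logs_s3_uri="s3://DOC-EXAMPLE-BUCKET/processing/spark-event-logs",
)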

Refer to the AWS documentation for more information.

You can choose the generated URL to access the Spark UI.

The following screenshot shows an example of the Spark UI.

You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio system terminal.

You can also stop the Spark History Server when needed.

Automate the Spark UI installation for users in a SageMaker Studio domain

As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.

You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation runs for all the user profiles in the domain.

From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

LCC_CONTENT=`openssl base64 -A -in install-history-server.sh`

aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
    --studio-lifecycle-config-content $LCC_CONTENT \
    --studio-lifecycle-config-app-type JupyterServer \
    --query 'StudioLifecycleConfigArn'

aws sagemaker update-domain \
    --region {YOUR_AWS_REGION} \
    --domain-id {YOUR_STUDIO_DOMAIN_ID} \
    --default-user-settings \
    '{
      "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
          "LifecycleConfigArn": "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver",
          "InstanceType": "system"
        },
        "LifecycleConfigArns": [
          "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
        ]
      }
    }'

After the Jupyter Server app restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.

Clean up

In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.

Manually uninstall the Spark UI

To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:

  1. Choose System terminal in the SageMaker Studio launcher.

  2. Run the following commands in the system terminal:
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

chmod +x uninstall-history-server.sh
./uninstall-history-server.sh

Uninstall the Spark UI automatically for all SageMaker Studio user profiles

To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.

  2. On the domain details page, navigate to the Environment tab.
  3. Select the lifecycle configuration for the Spark UI on SageMaker Studio.
  4. Choose Detach.

  5. Delete and restart the Jupyter Server apps for the SageMaker Studio user profiles.

Conclusion

In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid a proliferation of custom development environments for ML projects.

All the code shown as part of this post is available in the GitHub repository.


About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in various domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertise includes machine learning end to end, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.
