Access Snowflake data using OAuth-based authentication in Amazon SageMaker Data Wrangler


In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions from data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.

Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP), such as Okta, Azure AD, or Ping Federate, to connect to Snowflake via Data Wrangler.

Solution overview

In the following sections, we provide the steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. For the admin:
    • A Snowflake user with permissions to create storage integrations and security integrations in Snowflake.
    • An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
    • Access and permissions to configure the IdP to register the Data Wrangler application and set up the authorization server or API.
  2. For the data scientist: access to the SageMaker Studio domain that the admin configures in the following sections.

Administrator setup

Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.

The following steps are involved in enabling Data Wrangler OAuth access to Snowflake:

  1. Configure the IdP.
  2. Configure Snowflake.
  3. Configure SageMaker Studio.

Configure the IdP

To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.

Register the Data Wrangler application within the IdP

Refer to the documentation for your IdP; Data Wrangler supports Okta, Azure AD, and Ping Federate.

Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to use that documentation correctly.

Specific customizations in addition to the steps in the respective guides are called out in the subsections.

  1. Choose the configuration that starts the process of registering Data Wrangler as an application.
  2. Provide the users within the IdP access to Data Wrangler.
  3. Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
  4. Specify a redirect URL using the following format: https://domain-ID.studio.AWS-Region.sagemaker.aws/jupyter/default/lab.

You’re specifying the SageMaker domain ID and AWS Region that you’re using to run Data Wrangler. You must register a URL for each domain and Region where you run Data Wrangler. Users from a domain and Region that don’t have a redirect URL set up for them won’t be able to authenticate with the IdP to access the Snowflake connection.
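
For example, a domain with the hypothetical ID d-abc123def456 running in us-east-1 would register the following redirect URL (both values here are placeholders; substitute your own domain ID and Region):

    https://d-abc123def456.studio.us-east-1.sagemaker.aws/jupyter/default/lab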

  5. Make sure the authorization code and refresh token grant types are allowed for your Data Wrangler application.

Set up the authorization server or API within the IdP

Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.

Snowflake uses the concept of roles, which are distinct from the IAM roles used in AWS. You must configure the IdP to use ANY Role so that the connection uses the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.

Use the following procedure to set up the authorization server or API within your IdP:

  1. From your IdP, begin the process of setting up the server or API.
  2. Configure the authorization server to use the authorization code and refresh token grant types.
  3. Specify the lifetime of the access token.
  4. Set the refresh token idle timeout.

The idle timeout is the time after which the refresh token expires if it isn’t used. If you’re scheduling jobs in Data Wrangler, we recommend making the idle timeout greater than the frequency of the processing job; for example, if a processing job runs once a day, set the idle timeout to more than 24 hours. Otherwise, some processing jobs might fail because the refresh token expired before they could run.

Note that Data Wrangler doesn’t support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.

If the refresh token expires, your users must re-authenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.

  5. Specify session:role-any as the new scope.

For Azure AD, you must also specify a unique identifier for the scope.

After you’ve set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:

    • Token URL – The URL of the token endpoint that the IdP uses to send tokens to Data Wrangler
    • Authorization URL – The URL of the authorization server of the IdP
    • Client ID – The client ID that the IdP assigned to the Data Wrangler application
    • Client secret – The secret that only the authorization server or API recognizes
    • OAuth scope – This is for Azure AD only

Configure Snowflake

To configure Snowflake, complete the instructions in Import data from Snowflake.

Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section, Register the Data Wrangler application within the IdP, for more information on how to set up an external OAuth integration.

When you’re setting up the security integration in Snowflake, make sure you turn on external_oauth_any_role_mode.
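
The exact integration parameters depend on your IdP, so treat the following as a minimal sketch rather than a definitive setup. It assumes an Okta authorization server and is run through the SnowSQL CLI; the integration name, issuer, JWS keys URL, and audience are placeholders, and the Snowflake documentation for your IdP is the authoritative reference.

    # Sketch only: replace the account, user, issuer, keys URL, and audience placeholders.
    snowsql -a <account_identifier> -u <admin_user> -q "
    CREATE SECURITY INTEGRATION datawrangler_external_oauth
      TYPE = EXTERNAL_OAUTH
      ENABLED = TRUE
      EXTERNAL_OAUTH_TYPE = OKTA
      EXTERNAL_OAUTH_ISSUER = 'https://<okta-domain>/oauth2/<auth-server-id>'
      EXTERNAL_OAUTH_JWS_KEYS_URL = 'https://<okta-domain>/oauth2/<auth-server-id>/v1/keys'
      EXTERNAL_OAUTH_AUDIENCE_LIST = ('https://<account_identifier>.snowflakecomputing.com')
      EXTERNAL_OAUTH_TOKEN_USER_MAPPING_CLAIM = 'sub'
      EXTERNAL_OAUTH_SNOWFLAKE_USER_MAPPING_ATTRIBUTE = 'EMAIL_ADDRESS'
      EXTERNAL_OAUTH_ANY_ROLE_MODE = 'ENABLE';"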

Configure SageMaker Studio

You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you’re using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs in to Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.

Create a secret for Snowflake credentials

To create your secret for Snowflake credentials, complete the following steps:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Specify the details of your secret as key-value pairs.

Key names must be lowercase because they are case sensitive; Data Wrangler shows a warning if you enter any of them incorrectly. Enter the secret values as key-value pairs under Key/value if you’d like, or use the Plaintext option.

The following is the format of the secret used for Okta. If you’re using Azure AD, set the identity provider value to AZURE_AD and add the datasource_oauth_scope field.

  4. Update the preceding values with your choice of IdP and the information gathered after application registration.
  5. Choose Next.
  6. For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
  7. In the Tags section, add a tag with the key SageMaker and value true.
  8. Choose Next.
  9. The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.

After you store the secret, you’re returned to the Secrets Manager console.

  10. Choose the secret you just created, then retrieve the secret ARN.
  11. Store this in your preferred text editor for use later when you create the Data Wrangler data source.
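
If you prefer the AWS CLI, the secret can also be created and its ARN printed in one step. The following sketch assumes the placeholder JSON shown earlier has been saved locally as snowflake_oauth_secret.json (a hypothetical file name):

    # Create the secret with the required AmazonSageMaker prefix and the SageMaker=true tag,
    # then print the secret ARN for use when creating the Data Wrangler data source.
    aws secretsmanager create-secret \
      --name AmazonSageMaker-DataWranglerSnowflakeCreds \
      --secret-string file://snowflake_oauth_secret.json \
      --tags Key=SageMaker,Value=true \
      --query ARN --output text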

Create a Studio Lifecycle Configuration

To create a Lifecycle Configuration in Studio, complete the following steps:

  1. On the SageMaker console, choose Lifecycle configurations in the navigation pane.
  2. Choose Create configuration.
  3. Choose Jupyter server app.
  4. Create a new lifecycle configuration or append an existing one with the following content:
    #!/bin/bash
    set -eux
    ## Script Body
    cat > ~/.snowflake_identity_provider_oauth_config <<EOL
    {
        "secret_arn": "<secret_arn>"
    }
    EOL

The configuration creates a file named .snowflake_identity_provider_oauth_config, containing the secret ARN, in the user’s home folder.

  5. Choose Create Configuration.
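
The same configuration can also be registered from the AWS CLI. The following is a sketch that assumes the script above is saved locally as lifecycle-config.sh and uses a hypothetical configuration name:

    # The Lifecycle Configuration content must be base64-encoded (on macOS, use `base64 -i`).
    aws sagemaker create-studio-lifecycle-config \
      --studio-lifecycle-config-name snowflake-oauth-config \
      --studio-lifecycle-config-app-type JupyterServer \
      --studio-lifecycle-config-content "$(base64 -w0 lifecycle-config.sh)"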

Set the default Lifecycle Configuration

Complete the following steps to set the Lifecycle Configuration you just created as the default:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the Studio domain you’ll be using for this example.
  3. On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
  4. For Source, select Existing configuration.
  5. Select the configuration you just made, then choose Attach to domain.
  6. Select the new configuration and choose Set as default, then choose Set as default again in the pop-up message.

Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.

  7. Shut down the Studio app and relaunch it for the changes to take effect.
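
If you manage your domain programmatically, the attach-and-set-default step can also be scripted. The following AWS CLI sketch uses placeholder values for the domain ID and the Lifecycle Configuration ARN:

    # Attach the Lifecycle Configuration to the domain and make it the default
    # for the Jupyter server app (domain ID and ARN are placeholders).
    aws sagemaker update-domain \
      --domain-id d-xxxxxxxxxxxx \
      --default-user-settings '{
        "JupyterServerAppSettings": {
          "DefaultResourceSpec": {
            "LifecycleConfigArn": "arn:aws:sagemaker:<region>:<account-id>:studio-lifecycle-config/snowflake-oauth-config"
          },
          "LifecycleConfigArns": [
            "arn:aws:sagemaker:<region>:<account-id>:studio-lifecycle-config/snowflake-oauth-config"
          ]
        }
      }'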

Data scientist experience

In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.

Create a new data flow

To create your data flow, complete the following steps:

  1. On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
  2. Choose Open Studio.
  3. On the Studio Home page, choose Import & prepare data visually. Alternatively, on the File drop-down menu, choose New, then choose SageMaker Data Wrangler Flow.

Creating a new flow can take a few minutes.


  4. On the Import data page, choose Create connection.
  5. Choose Snowflake from the list of data sources.
  6. For Authentication method, choose OAuth.

If you don’t see OAuth, verify the preceding Lifecycle Configuration steps.
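
One quick check is to confirm, from a Studio system terminal, that the Lifecycle Configuration actually wrote the configuration file and that it contains your secret ARN:

    # The file is created by the Lifecycle Configuration at login.
    cat ~/.snowflake_identity_provider_oauth_config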

  7. Enter details for Snowflake account name and Storage integration.
  8. Enter a connection name and choose Connect.

You’re redirected to an IdP authentication page. For this example, we’re using Okta.

  9. Enter your user name and password, then choose Sign In.

After the authentication is successful, you’re redirected to the Studio data flow page.

  10. On the Import data from Snowflake page, browse the database objects, or run a query for the targeted data.
  11. In the query editor, enter a query and preview the results.

In the following example, we load Loan Data and retrieve all columns from 5,000 rows (for instance, a query of the form SELECT * FROM <table> LIMIT 5000).

  12. Choose Import.
  13. Enter a dataset name (for this post, we use snowflake_loan_dataset) and choose Add.

You’re redirected to the Prepare page, where you can add transformations and analyses to the data.

Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of the data preparation capabilities of Data Wrangler in this post; you can also use Data Wrangler for more advanced data analysis such as feature importance, target leakage, and model explainability through an easy and intuitive user interface.

Analyze data quality

Use the Data Quality and Insights Report to perform an analysis of the data that you’ve imported into Data Wrangler. Data Wrangler creates the report from the sampled data.

  1. On the Data Wrangler flow page, choose the plus sign next to Data types, then choose Get data insights.
  2. Choose Data Quality And Insights Report for Analysis type.
  3. For Target column, choose your target column.
  4. For Problem type, select Classification.
  5. Choose Create.

The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.


Add transformations to the data

Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.

  1. On the Data Wrangler flow page, choose the plus sign, then choose Add transform.

If you’re following the steps in this post, you’re directed here automatically after adding your dataset.


  2. Verify and modify the data type of the columns.

Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than a string.


  3. After applying the changes and adding the step, you can verify that the column data type has changed to float.


Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in this use case, so we can drop them.

  4. Choose Add Step, then choose Manage columns.
  5. For Transform, choose Drop column.
  6. For Columns to drop, specify EMP_TITLE, URL, DESCRIPTION, and TITLE.
  7. Choose Preview and Add.


Next, we want to look for categorical data in our dataset. Data Wrangler has built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.

  8. Add another step and choose Encode categorical.
  9. For Transform, choose One-hot encode.
  10. For Input column, choose TERM.
  11. For Output style, choose Columns.
  12. Leave all other settings as default, then choose Preview and Add.

The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and OTHER.

  13. Repeat the preceding steps to apply a one-hot encoding technique to this column.

Finally, the PURPOSE column has several possible values. For this data, we use a one-hot encoding technique as well, but we set the output to a vector rather than columns.

  14. For Transform, choose One-hot encode.
  15. For Input column, choose PURPOSE.
  16. For Output style, choose Vector.
  17. For Output column, we call this column PURPOSE_VCTR.

This keeps the original PURPOSE column, in case we decide to use it later.

  18. Leave all other settings as default, then choose Preview and Add.

Export the data flow

Finally, we export this entire data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.

  1. On the data flow page, choose the plus sign and Export to.
  2. Choose where to export to. For our use case, we choose SageMaker Feature Store.

The exported notebook is now ready to run.


Export data and train a model with Autopilot

Now we can train a model using Amazon SageMaker Autopilot.

  1. On the data flow page, choose the Training tab.
  2. For Amazon S3 location, enter a location for the data to be saved.
  3. Choose Export and train.
  4. Specify the settings in the Target and features, Training method, Deployment and advanced settings, and Review and create sections.
  5. Choose Create experiment to find the best model for your problem.

Clean up

When your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional charges.

Conclusion

In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting it to the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.


About the authors

Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while relaxing, experiencing the outdoors, and spending time with his loved ones.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.

Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles, both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, medical, and finance, as well as advising in consultative roles.

Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audio books, gardening, hiking, and spending time with her family and friends.
