How to Scale Your Data Pipelines and Data Products with Contract Testing and dbt

First, we need to add two new dbt packages, dbt-expectations and dbt-utils, which will allow us to make assertions on the schema of our sources and on their accepted values.

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

  - package: calogica/dbt_expectations
    version: 0.8.5

Testing the data sources

Let’s start by defining a contract test for our first source. We pull data from raw_height, a table that contains height information from the users of the gym app.

We agree with our data producers that we will receive the height measurement, the units for the measurement, and the user ID. We also agree on the data types and that only ‘cm’ and ‘inches’ are supported as units. With all this, we can define our first contract in the dbt source YAML file.
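A minimal sketch of how such a contract might be written follows. The file path, the source name gym_app, the column names (height, units, user_id), and the column types are assumptions made for illustration; the type names in particular depend on your warehouse.

```yaml
# models/staging/gym_app/_gym_app__sources.yml  (hypothetical path)
version: 2

sources:
  - name: gym_app                # assumed source name
    tables:
      - name: raw_height
        columns:
          - name: height         # assumed column name
            tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: numeric      # assumed type; adjust to your warehouse
                  config:
                    tags: ["contract-test-source"]
              - dbt_utils.accepted_range:
                  min_value: 0              # a height can never be negative
                  config:
                    tags: ["contract-test-source"]
          - name: units          # assumed column name
            tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: text         # assumed type
                  config:
                    tags: ["contract-test-source"]
              - accepted_values:
                  values: ["cm", "inches"]  # the only units agreed with the producers
                  config:
                    tags: ["contract-test-source"]
          - name: user_id        # assumed column name
            tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer      # assumed type
                  config:
                    tags: ["contract-test-source"]
              - not_null:
                  config:
                    tags: ["contract-test-source"]
```

Every test carries the same tag, so the whole contract can be selected and run as one unit.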

The building blocks

Looking at the previous test, we can see several generic tests from dbt, dbt-utils, and dbt-expectations in use:

  • dbt_expectations.expect_column_values_to_be_of_type: This assertion allows us to define the expected column data type.
  • accepted_values: This assertion allows us to define a list of the accepted values for a specific column.
  • dbt_utils.accepted_range: This assertion allows us to define a numerical range for a given column. In the example, we expect the column’s value to be no less than 0.
  • not_null: Finally, built-in assertions like not_null allow us to define column constraints.

Using these building blocks, we added several tests to define the contract expectations described above. Notice also how we have tagged the tests as “contract-test-source”. This tag allows us to run all contract tests in isolation, both locally and, as we will see later, in the CI/CD pipeline:

dbt test --select tag:contract-test-source

We have seen how quickly we can create contract tests for the sources of our dbt app, but what about the public interfaces of our data pipeline or data product?

As data producers, we want to make sure that we are producing data according to the expectations of our data consumers, so that we can fulfil the contract we have with them and make our data pipeline or data product trustworthy and reliable.

A simple way to ensure that we are meeting our obligations to our data consumers is to add contract testing for our public interfaces.

dbt 1.5 introduced a feature for SQL models, model contracts, that lets us define the contract for a dbt model. While building your model, dbt verifies that the model’s transformation produces a dataset matching its contract; if it does not, the build fails.

Let’s see it in action. Our mart, body_mass_indexes, produces a BMI metric from the weight and height measurement data we get from our sources. The contract for this mart establishes the following:

  • Data types for each column.
  • User IDs cannot be null.
  • User IDs are always greater than 0.

Let’s define the contract of the body_mass_indexes model using dbt model contracts:
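A minimal sketch of what such a model specification might look like is shown here. The file path, the column names other than user_id, and all data types are assumptions made for illustration, and whether a given constraint is actually enforced depends on the data platform.

```yaml
# models/marts/_marts__models.yml  (hypothetical path)
version: 2

models:
  - name: body_mass_indexes
    config:
      contract:
        enforced: true             # dbt checks column names and types at build time
    columns:
      - name: user_id
        data_type: integer         # assumed type
        constraints:
          - type: not_null         # user IDs cannot be null
          - type: check
            expression: "user_id > 0"   # user IDs are always greater than 0
      - name: created_date         # assumed column name
        data_type: date
      - name: weight               # assumed column name
        data_type: numeric
      - name: height               # assumed column name
        data_type: numeric
      - name: bmi                  # assumed column name
        data_type: numeric
```

With an enforced contract, every column the model returns has to be declared with a name and a data_type, so the YAML file doubles as documentation of the public interface.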

The building blocks

Looking at the previous model specification file, we can see several pieces of metadata that allow us to define the contract.

  • contract.enforced: This configuration tells dbt that we want to enforce the contract every time the model is run.
  • data_type: This property allows us to define the column type we expect to produce once the model runs.
  • constraints: Finally, the constraints block gives us the chance to define useful constraints, such as requiring that a column cannot be null, setting primary keys, and adding custom expressions. In the example above we defined a constraint to tell dbt that the user_id must always be greater than 0. The dbt documentation lists all the available constraints.

One difference between the contract tests we defined for our sources and the ones defined for our marts or output ports is when the contracts are verified and enforced.

Model contracts are enforced when the model is being built by dbt run, whereas contracts based on dbt tests are enforced when the dbt tests run.

If one of the model contracts is not satisfied, you will see an error with specific details on the failure when you execute ‘dbt run’. You can see an example in the following dbt run console output.

1 of 4 START sql table model dbt_testing_example.stg_gym_app__height ........... [RUN]
2 of 4 START sql table model dbt_testing_example.stg_gym_app__weight ........... [RUN]
2 of 4 OK created sql table model dbt_testing_example.stg_gym_app__weight ...... [SELECT 4 in 0.88s]
1 of 4 OK created sql table model dbt_testing_example.stg_gym_app__height ...... [SELECT 4 in 0.92s]
3 of 4 START sql table model dbt_testing_example.int_weight_measurements_with_latest_height [RUN]
3 of 4 OK created sql table model dbt_testing_example.int_weight_measurements_with_latest_height [SELECT 4 in 0.96s]
4 of 4 START sql table model dbt_testing_example.body_mass_indexes ............. [RUN]
4 of 4 ERROR creating sql table model dbt_testing_example.body_mass_indexes .... [ERROR in 0.77s]

Finished running 4 table models in 0 hours 0 minutes and 6.28 seconds (6.28s).

Completed with 1 error and 0 warnings:

Database Error in model body_mass_indexes (models/marts/body_mass_indexes.sql)
new row for relation "body_mass_indexes__dbt_tmp" violates check constraint
DETAIL: Failing row contains (1, 2009-07-01, 82.5, null, null).
compiled Code at target/run/dbt_testing_example/models/marts/body_mass_indexes.sql

We now have a suite of powerful contract tests, but how and when do we run them?

We can run contract tests in two types of pipelines.

  • CI/CD pipelines
  • Data pipelines

For example, you can execute the source contract tests on a schedule in a CI/CD pipeline targeting the data sources available in lower environments like test or staging. You can set the pipeline to fail every time the contract is not met.

These failures provide valuable information about contract-breaking changes introduced by other teams before those changes reach production.
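As one possible illustration, a scheduled run could be sketched as a GitHub Actions workflow. The article does not name a CI system, so the choice of GitHub Actions, the Postgres adapter, the staging target name, the cron schedule, and the secret name are all assumptions.

```yaml
# .github/workflows/source-contract-tests.yml  (hypothetical file)
name: source-contract-tests

on:
  schedule:
    - cron: "0 6 * * *"        # assumed cadence: every day at 06:00 UTC
  workflow_dispatch: {}        # also allow manual runs

jobs:
  contract-tests:
    runs-on: ubuntu-latest
    env:
      DBT_PROFILES_DIR: .      # assumes profiles.yml lives in the repository root
      DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}   # hypothetical secret read by profiles.yml
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-postgres             # assumed adapter
      - name: Install dbt packages
        run: dbt deps
      - name: Run source contract tests against staging
        # dbt exits with a non-zero code when a test fails, which fails the job
        run: dbt test --select tag:contract-test-source --target staging
```

Because the job only selects tests tagged contract-test-source, it stays fast and its failures point directly at a broken source contract rather than at unrelated model logic.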
