How to Build Simple ETL Pipelines With GitHub Actions


Photo by Roman Synkevych 🇺🇦 on Unsplash

If you're into software development, you'd know what GitHub Actions is. It's a utility by GitHub to automate dev tasks. Or, in plain language, a DevOps tool.

But people hardly use it for building ETL pipelines.

The first thing that comes to mind when discussing ETLs is Airflow, Prefect, or similar tools. They are, without doubt, the best in the market for task orchestration. But many ETLs we build are simple, and hosting a separate tool for them is often overkill.

You can use GitHub Actions instead.

This article focuses on GitHub Actions. But if you're on Bitbucket or GitLab, you could use their respective alternatives too.

We can run our Python, R, or Julia scripts on GitHub Actions. So as a data scientist, you don't have to learn a new language or tool for this. You can even get email notifications when any of your ETL tasks fail.

You still get 2,000 minutes of computation monthly if you're on a free account. You can try GitHub Actions if you can estimate your ETL workload within this range.

How do we start building ETLs on GitHub Actions?

Getting started with GitHub Actions is straightforward. You could follow the official docs, or take the simple steps below.

In your repository, create a directory at .github/workflows. Then create the YAML config file actions.yaml inside it with the following content.

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM every day

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Extract data
        run: python extract.py

      - name: Transform data
        run: python transform.py

      - name: Load data
        run: python load.py

The above YAML automates an ETL (Extract, Transform, Load) pipeline. The workflow is triggered every day at 12:00 AM UTC, and it consists of a single job that runs on the ubuntu-latest environment (whatever version is available at the time).

The steps of this configuration are simple.

The job has five steps: the first two check out the code and set up the Python environment, respectively, while the next three execute the extract.py, transform.py, and load.py scripts sequentially.

This workflow provides an automated and efficient way of extracting, transforming, and loading data every day using GitHub Actions.
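A small tip while testing: you may not want to wait for the schedule to fire. Adding a workflow_dispatch trigger (a standard GitHub Actions feature, though not part of the config above) lets you start the pipeline manually from the Actions tab:

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM every day
  workflow_dispatch: # adds a "Run workflow" button in the Actions tab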

The Python scripts may differ depending on the scenario. Here's one of many ways to write them.

# extract.py
# --------------------------------
import requests

response = requests.get("https://api.example.com/data")
with open("data.json", "w") as f:
    f.write(response.text)

# transform.py
# --------------------------------
import json

with open("data.json", "r") as f:
    data = json.load(f)

# Perform transformation
transformed_data = [item for item in data if item["key"] == "value"]

# Save transformed data
with open("transformed_data.json", "w") as f:
    json.dump(transformed_data, f)

# load.py
# --------------------------------
import json
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to database
engine = create_engine("postgresql://myuser:mypassword@localhost:5432/mydatabase")

# Create metadata object
metadata = MetaData()

# Define table schema
mytable = Table(
    "mytable",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("column1", String),
    Column("column2", String),
)

# Create the table if it doesn't already exist
metadata.create_all(engine)

# Read transformed data from file
with open("transformed_data.json", "r") as f:
    data = json.load(f)

# Load data into database (engine.begin() commits the transaction on success)
with engine.begin() as conn:
    for item in data:
        conn.execute(
            mytable.insert().values(column1=item["column1"], column2=item["column2"])
        )

The above scripts read data from a dummy API and push it to a Postgres database.

Things to consider when deploying ETL pipelines to GitHub Actions

1. Security: Keep your secrets secure by using GitHub's secret store, and avoid hardcoding secrets into your workflows.

Have you noticed that the sample code I've given above has database credentials in it? That's not right for a production system.

We have better ways to securely embed secrets, such as database credentials.

If you don't encrypt your secrets in GitHub Actions, they will be visible to anyone who has access to the repository's source code. This means that if an attacker gains access to the repository, or the repository's source code is leaked, the attacker will be able to see your secret values.

To protect your secrets, GitHub provides a feature called encrypted secrets, which allows you to store your secret values securely in the repository settings. Encrypted secrets are only accessible to authorized users and are never exposed in plaintext in your GitHub Actions workflows.

Here's how it works.

In the repository settings sidebar, you can find Secrets and variables for Actions. You can create your variables here.

Screenshot by the author.

Secrets created here aren't visible to anyone. They're encrypted and can be used in the workflow. Even you can't read them; you can only update them with a new value.
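If you prefer the terminal to the web UI, the GitHub CLI can create or update the same secrets. A minimal sketch, assuming the gh CLI is installed and authenticated (the secret names simply match the ones used later in this post):

gh secret set DB_USER --body "myuser"
gh secret set DB_PASS --body "mypassword"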

Once you've created the secrets, you can pass them to the workflow as environment variables in the GitHub Actions configuration. Here's how it works:

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM every day

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      ...

      - name: Load data
        env: # Or as an environment variable
          DB_USER: ${{ secrets.DB_USER }}
          DB_PASS: ${{ secrets.DB_PASS }}
        run: python load.py

Now we can modify the Python scripts to read credentials from environment variables.

# load.py
# --------------------------------
import json
import os
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to database using credentials from the environment
engine = create_engine(
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASS']}@localhost:5432/mydatabase"
)
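One optional hardening step, which is my addition and not part of the original script: fail fast with a readable error when a secret isn't set, instead of letting a bare KeyError surface mid-run.

# Hypothetical helper, not in the original script: fail fast on missing secrets
import os

def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

db_user = require_env("DB_USER")
db_pass = require_env("DB_PASS")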

2. Dependencies: Make sure to use the correct versions of dependencies to avoid any issues.

Your Python project may already have a requirements.txt file that specifies dependencies along with their versions. Or, for more sophisticated projects, you may be using modern dependency management tools like Poetry.

You should have a step to set up your environment before you run the other pieces of your ETL. You can do this by specifying the following in your YAML configuration.

- name: Install dependencies
  run: pip install -r requirements.txt
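If you manage dependencies with Poetry instead, the equivalent step might look like this sketch (assuming a pyproject.toml at the repository root):

- name: Install dependencies
  run: |
    pip install poetry
    poetry install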

3. Timezone settings: GitHub Actions uses the UTC timezone, and as of writing this post, you can't change it.

Thus you must make sure you're using the correct timezone. You can use an online converter or manually convert your local time to UTC before configuring the schedule.
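For instance, suppose you want the pipeline to run at 8:00 AM IST (UTC+5:30); the timezone here is purely illustrative. Subtracting the offset gives 2:30 AM UTC, so the schedule becomes:

on:
  schedule:
    - cron: '30 2 * * *' # 2:30 AM UTC is 8:00 AM IST (UTC+5:30)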

The biggest caveat of GitHub Actions scheduling is the uncertainty in execution time. Even though you've configured it to run at a specific point in time, if demand is high at that moment, your job will be queued. Thus, there may be a short delay before the job actually starts.

If your job depends on the exact execution time, GitHub Actions scheduling is probably not a good option. Using a self-hosted runner in GitHub Actions may help, as sketched below.
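If you do register a self-hosted runner (GitHub's docs cover the setup), pointing the job at it is a one-line change in the workflow:

jobs:
  etl:
    runs-on: self-hosted # instead of ubuntu-latest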

4. Resource utilization: Avoid overloading the resources provided by GitHub.

Although GitHub Actions comes with 2,000 free minutes per month even on a free account, the rules change a bit if you use an OS other than Linux.

If you're using a Windows runner, minutes count double, so you effectively get only half of that allowance. On a macOS runner, minutes count ten times, so you get only one-tenth of it.

Conclusion

GitHub Actions is a DevOps tool. But we can use it to run any scheduled task. In this post, we've discussed how to create an ETL pipeline that periodically fetches data from an API and pushes it to a database.

For simple ETLs, this approach is easy to develop and deploy.

But scheduled jobs in GitHub Actions aren't guaranteed to run at the exact scheduled time. Hence, this isn't suitable for time-critical tasks.
