Docker Tutorial for Data Scientists
Image by Author
Python and its suite of data analysis and machine learning libraries, such as pandas and scikit-learn, help you develop data science applications with ease. However, dependency management in Python is a challenge. When working on a data science project, you'll spend substantial time installing the various libraries and keeping track of the library versions you're using, among other things.
What if other developers want to run your code and contribute to the project? Other developers who want to replicate your data science application must first set up the project environment on their machine before they can go ahead and run the code. Even small differences, such as differing library versions, can introduce breaking changes to the code. Docker to the rescue. Docker simplifies the development process and facilitates seamless collaboration.
This guide will introduce you to the basics of Docker and teach you how to containerize data science applications with Docker.
Docker is a containerization tool that lets you build and share applications as portable artifacts called images.
Aside from source code, your application will have a set of dependencies, required configuration, system tools, and more. For example, in a data science project, you'll install all the required libraries in your development environment (preferably inside a virtual environment). You'll also ensure that you're using an up-to-date version of Python that the libraries support.
However, you may still run into problems when trying to run your application on another machine. These problems often arise from mismatched configuration and library versions between the development environments on the two machines.
With Docker, you can package your application together with its dependencies and configuration. So you can define an isolated, reproducible, and consistent environment for your applications across a range of host machines.
Let's go over a few concepts and terms:
Docker Image
A Docker image is the portable artifact of your application.
Docker Container
When you run an image, you're essentially getting the application running inside the container environment. So a running instance of an image is a container.
Docker Registry
A Docker registry is a system for storing and distributing Docker images. After containerizing an application into a Docker image, you can make it available to the developer community by pushing it to an image registry. DockerHub is the largest public registry, and all images are pulled from DockerHub by default.
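To make the "pulled from DockerHub by default" rule concrete, here is a minimal Python sketch of how an image name can be split into a registry, a repository, and a tag. The function name parse_image_reference is made up for this illustration, and the logic is a simplified approximation of Docker's naming rules (it assumes docker.io as the default registry, library/ as the namespace for official images, and it ignores digests such as name@sha256:...).

```python
def parse_image_reference(reference: str) -> tuple[str, str, str]:
    """Split an image reference into (registry, repository, tag).

    Simplified sketch: digests (name@sha256:...) are not handled.
    """
    # The first path component is treated as a registry host only if it
    # looks like one (contains a dot or a port, or is "localhost").
    first, slash, rest = reference.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", reference

    # A tag is whatever follows a colon in the last path component.
    last_slash = remainder.rfind("/")
    colon = remainder.rfind(":")
    if colon > last_slash:
        repository, tag = remainder[:colon], remainder[colon + 1:]
    else:
        repository, tag = remainder, "latest"

    # Official images on DockerHub live under the "library/" namespace.
    if registry == "docker.io" and "/" not in repository:
        repository = "library/" + repository
    return registry, repository, tag


for name in ["python:3.9-slim", "ml-app", "localhost:5000/ml-app"]:
    print(name, "->", parse_image_reference(name))
```

The third name uses a hypothetical local registry on port 5000 to show that a registry host in the name overrides the DockerHub default.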
Because containers provide an isolated environment for your applications, other developers now only need to have Docker set up on their machine. They can pull the Docker image and start containers using a single command, without having to worry about complex installations.
When developing an application, it is also common to build and test multiple versions of the same app. If you use Docker, you can have multiple versions of the same app running inside different containers, without any conflicts, in the same environment.
In addition to simplifying development, Docker also simplifies deployment and helps the development and operations teams collaborate effectively. On the server side, the operations team doesn't have to spend time resolving complex version and dependency conflicts. They only need to have a Docker runtime set up.
Let's quickly go over some basic Docker commands, most of which we'll use in this tutorial. For a more detailed overview, read: 12 Docker Commands Every Data Scientist Should Know.
Command | Function |
docker ps | Lists all running containers |
docker pull image-name | Pulls image-name from DockerHub by default |
docker images | Lists all the available images |
docker run image-name | Starts a container from an image |
docker start container-id | Restarts a stopped container |
docker stop container-id | Stops a running container |
docker build path | Builds an image at the path using instructions in the Dockerfile |
Note: Prefix all the commands with sudo if you haven't created the docker group and added your user to it.
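If you want to call these commands from a Python script, a small wrapper around subprocess is one way to do it. The sketch below is only an illustration: docker_argv and run_docker are hypothetical helper names, and the use_sudo flag mirrors the note above about prefixing sudo.

```python
import shutil
import subprocess


def docker_argv(*args: str, use_sudo: bool = False) -> list[str]:
    """Build the argument list for a docker command, optionally via sudo."""
    prefix = ["sudo"] if use_sudo else []
    return prefix + ["docker", *args]


def run_docker(*args: str, use_sudo: bool = False) -> str:
    """Run a docker command and return its standard output as text."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker CLI not found on PATH")
    result = subprocess.run(
        docker_argv(*args, use_sudo=use_sudo),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Show the command lines without executing anything
print(docker_argv("ps"))
print(docker_argv("images", use_sudo=True))
```

With Docker installed and running, run_docker("ps") would return the same listing you get from typing docker ps in a terminal.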
We've learned the basics of Docker, and it's time to apply what we've learned. In this section, we'll containerize a simple data science application using Docker.
House Price Prediction Model
Let's take the following linear regression model that predicts the target value, the median house price, based on the input features. The model is built using the California housing dataset:
# house_price_prediction.py
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
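If the two metrics printed at the end of the script are unfamiliar, the following sketch computes them by hand using only the standard library. The function names and the four sample values are made up for illustration and are not taken from the housing dataset; scikit-learn's own functions remain what the script actually uses.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """One minus the ratio of the residual to the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(f"Mean Squared Error: {mean_squared_error_manual(y_true, y_pred):.2f}")
print(f"R-squared Score: {r2_score_manual(y_true, y_pred):.2f}")
```

A lower mean squared error and an R-squared score closer to 1 both indicate predictions that track the targets more closely.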
We know that scikit-learn is a required dependency. If you go through the code, you'll see that we set as_frame equal to True when loading the dataset, so we also need pandas. The requirements.txt file looks like so:
pandas==2.0
scikit-learn==1.2.2
Create the Dockerfile
So far, we have the source code file house_price_prediction.py and the requirements.txt file. We should now define how to build an image from our application. The Dockerfile holds this definition of how to build an image from the application's source code files.
So what is a Dockerfile? It is a text document that contains step-by-step instructions to build the Docker image.
Here's the Dockerfile for our example:
# Use the official Python image as the base image
FROM python:3.9-slim
# Set the working directory in the container
WORKDIR /app
# Copy the requirements.txt file to the container
COPY requirements.txt .
# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy the script file to the container
COPY house_price_prediction.py .
# Set the command to run your Python script
CMD ["python", "house_price_prediction.py"]
Let's break down the contents of the Dockerfile:
- All Dockerfiles start with a FROM instruction specifying the base image. The base image is the image on which your image is based. Here we use an available image for Python 3.9. The FROM instruction tells Docker to build the current image from the specified base image.
- The WORKDIR instruction sets the working directory for all the following instructions (/app in this example).
- We then copy the requirements.txt file to the container's file system.
- The RUN instruction executes the specified command, in a shell, inside the container. Here we install all the required dependencies using pip.
- We then copy the source code file, the Python script house_price_prediction.py, to the container's file system.
- Finally, CMD specifies the command to be executed when the container starts. Here we need to run the house_price_prediction.py script. The Dockerfile should contain only one CMD instruction.
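The structure described in the list above can be checked with a few lines of Python. This is a rough sketch, not a real Dockerfile parser: parse_instructions and lint are hypothetical helpers, line continuations with a trailing backslash are not handled, and the "starts with FROM" check ignores less common cases such as a leading ARG instruction.

```python
DOCKERFILE = """\
# Use the official Python image as the base image
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY house_price_prediction.py .
CMD ["python", "house_price_prediction.py"]
"""


def parse_instructions(text: str) -> list[tuple[str, str]]:
    """Return (INSTRUCTION, arguments) pairs, skipping blanks and comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, arguments = line.partition(" ")
        pairs.append((keyword.upper(), arguments.strip()))
    return pairs


def lint(pairs: list[tuple[str, str]]) -> list[str]:
    """Check the two rules from the walkthrough: FROM first, exactly one CMD."""
    problems = []
    if not pairs or pairs[0][0] != "FROM":
        problems.append("first instruction should be FROM")
    cmd_count = sum(1 for keyword, _ in pairs if keyword == "CMD")
    if cmd_count != 1:
        problems.append(f"expected exactly one CMD, found {cmd_count}")
    return problems


steps = parse_instructions(DOCKERFILE)
for number, (keyword, arguments) in enumerate(steps, start=1):
    print(f"Step {number}/{len(steps)} : {keyword} {arguments}")
print(lint(steps) or "no problems found")
```

Each instruction becomes one numbered step, which is also how docker build reports its progress.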
Build the Image
Now that we've defined the Dockerfile, we can build the Docker image by running the docker build command from the project directory:
docker build -t ml-app .
The -t option allows us to specify a name and tag for the image in the name:tag format. The default tag is latest.
The build process takes a couple of minutes:
Sending build context to Docker daemon 4.608kB
Step 1/6 : FROM python:3.9-slim
3.9-slim: Pulling from library/python
5b5fe70539cd: Pull complete
f4b0e4004dc0: Pull complete
ec1650096fae: Pull complete
2ee3c5a347ae: Pull complete
d854e82593a7: Pull complete
Digest: sha256:0074c6241f2ff175532c72fb0fb37264e8a1ac68f9790f9ee6da7e9fdfb67a0e
Status: Downloaded newer image for python:3.9-slim
---> 326a3a036ed2
Step 2/6 : WORKDIR /app
...
...
...
Step 6/6 : CMD ["python", "house_price_prediction.py"]
---> Running in 7fcef6a2ab2c
Removing intermediate container 7fcef6a2ab2c
---> 2607aa43c61a
Successfully built 2607aa43c61a
Successfully tagged ml-app:latest
After the Docker image has been built, run the docker images command. You should see the ml-app image listed, too.
You can run the Docker image ml-app using the docker run command:
docker run ml-app
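If you'd rather drive the build and run steps from a Python script, the sketch below shows one possible approach with subprocess. The build_and_run function and its runner parameter are made up for this illustration (the parameter exists only so the function can be exercised without Docker), and the --rm flag, which removes the container once it exits, is an addition that the tutorial itself does not use.

```python
import subprocess


def build_and_run(tag: str, context: str = ".", runner=subprocess.run) -> str:
    """Build an image from `context`, run it once, and return its output."""
    # Equivalent to: docker build -t <tag> <context>
    runner(["docker", "build", "-t", tag, context], check=True)
    # Equivalent to: docker run --rm <tag>
    result = runner(
        ["docker", "run", "--rm", tag],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling build_and_run("ml-app") from the project directory would, assuming Docker is installed and running, return the two metric lines that the script prints.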
Congratulations! You've just dockerized your first data science application. By creating a DockerHub account, you can push the image to it (or to a private repository within your organization).
Hope you found this introductory Docker tutorial helpful. You can find the code used in this tutorial in this GitHub repository. As a next step, set up Docker on your machine and try this example. Or dockerize an application of your choice.
The easiest way to install Docker on your machine is using Docker Desktop: you get both the Docker CLI client and a GUI to manage your containers easily. So set up Docker and get coding right away!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.