Docker Tutorial for Data Scientists



Python and its suite of data analysis and machine learning libraries, such as pandas and scikit-learn, help you develop data science applications with ease. However, dependency management in Python is a challenge. When working on a data science project, you’ll have to spend substantial time installing the various libraries and keeping track of the versions you’re using, among other things.

What if other developers want to run your code and contribute to the project? Other developers who want to replicate your data science application must first set up the project environment on their machine before they can go ahead and run the code. Even small differences, such as differing library versions, can introduce breaking changes to the code. Docker to the rescue. Docker simplifies the development process and facilitates seamless collaboration.

This guide will introduce you to the basics of Docker and teach you how to containerize data science applications with Docker.





Docker is a containerization tool that lets you build and share applications as portable artifacts called images.

Aside from source code, your application will have a set of dependencies, required configuration, system tools, and more. For example, in a data science project, you’ll install all the required libraries in your development environment (ideally inside a virtual environment). You’ll also ensure that you’re using an up-to-date version of Python that the libraries support.

However, you may still run into problems when trying to run your application on another machine. These problems often arise from mismatched configuration and library versions between the development environments of the two machines.
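To make the version-mismatch problem concrete, here is a minimal sketch that compares the pinned libraries of two machines. The helper names and the version pins are hypothetical and chosen only for illustration; the sketch assumes simple name==version lines and is not a full requirements parser.

```python
def parse_requirements(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip() or None
    return pins


def find_mismatches(pins_a, pins_b):
    """Return {package: (version_a, version_b)} for packages whose versions differ."""
    mismatches = {}
    for name in sorted(set(pins_a) | set(pins_b)):
        version_a, version_b = pins_a.get(name), pins_b.get(name)
        if version_a != version_b:
            mismatches[name] = (version_a, version_b)
    return mismatches


# Hypothetical version pins, made up for illustration only.
machine_a = "pandas==2.0.0\nscikit-learn==1.2.2\n"
machine_b = "pandas==1.5.3\nscikit-learn==1.2.2\n"

mismatches = find_mismatches(parse_requirements(machine_a), parse_requirements(machine_b))
for name, (a, b) in mismatches.items():
    print(f"{name}: machine A has {a}, machine B has {b}")
```

A check like this only detects the drift. Packaging the environment into an image is what removes it.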

With Docker, you can package your application together with its dependencies and configuration. So you can define an isolated, reproducible, and consistent environment for your applications across a range of host machines.



Let’s go over a few concepts and terms:


Docker Image


A Docker image is the portable artifact of your application.


Docker Container


When you run an image, you’re essentially getting the application running inside the container environment. So a running instance of an image is a container.


Docker Registry


A Docker registry is a system for storing and distributing Docker images. After containerizing an application into a Docker image, you can make it available to the developer community by pushing it to an image registry. DockerHub is the largest public registry, and images are pulled from DockerHub by default unless you specify another registry.
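The "DockerHub by default" behavior can be illustrated with a simplified sketch of how a short image name expands to a full reference. It assumes Docker's usual defaults (the docker.io registry, the library namespace for official images, and the latest tag) and ignores digests. The function name is hypothetical, and this is not Docker's actual parser.

```python
def normalize_image_reference(ref: str) -> str:
    """Expand a short image name into registry/repository:tag form (simplified)."""
    # The first path component names a registry only if it looks like a host.
    first, sep, rest = ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref

    # The tag, if any, follows a colon in the last path component.
    path, _, last = remainder.rpartition("/")
    name, colon, tag = last.partition(":")
    if not colon:
        tag = "latest"

    repository = f"{path}/{name}" if path else name
    # Official images on DockerHub live under the "library" namespace.
    if registry == "docker.io" and "/" not in repository:
        repository = f"library/{repository}"
    return f"{registry}/{repository}:{tag}"


print(normalize_image_reference("python:3.9-slim"))
print(normalize_image_reference("localhost:5000/team/ml-app:1.0"))
```

The first call shows why pulling python:3.9-slim is reported as pulling from library/python.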



Because containers provide an isolated environment for your applications, other developers now only need to have Docker set up on their machine. They can pull the Docker image and start containers using a single command, without having to worry about complex installations, even in remote environments.

When developing an application, it is also common to build and test multiple versions of the same app. If you use Docker, you can have multiple versions of the same app running inside different containers, without any conflicts, in the same environment.

In addition to simplifying development, Docker also simplifies deployment and helps the development and operations teams collaborate effectively. On the server side, the operations team doesn’t have to spend time resolving complex version and dependency conflicts. They only need to have a Docker runtime set up.



Let’s quickly go over some basic Docker commands, most of which we’ll use in this tutorial. For a more detailed overview, read: 12 Docker Commands Every Data Scientist Should Know.

Command                      Function
docker ps                    Lists all running containers
docker pull image-name       Pulls image-name from DockerHub by default
docker images                Lists all the available images
docker run image-name        Starts a container from an image
docker start container-id    Restarts a stopped container
docker stop container-id     Stops a running container
docker build path            Builds an image at the path using instructions in the Dockerfile


Note: Prefix all the commands with sudo if you haven’t created the docker group and added your user to it.
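If you prefer to drive these commands from Python, a thin wrapper around the CLI might look like the sketch below. The helper names and the use_sudo flag are hypothetical. The sketch only prints the commands, and run_docker is defined but not called, so nothing is executed unless you call it yourself on a machine with Docker installed.

```python
import shlex
import shutil
import subprocess


def docker_command(*args, use_sudo=False):
    """Build the argument list for a Docker CLI call, optionally prefixed with sudo."""
    prefix = ["sudo"] if use_sudo else []
    return prefix + ["docker", *args]


def run_docker(*args, use_sudo=False):
    """Run a Docker CLI command and return its standard output as text."""
    if shutil.which("docker") is None:
        raise RuntimeError("The docker CLI was not found on PATH")
    result = subprocess.run(
        docker_command(*args, use_sudo=use_sudo),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Print a few commands from the table without executing them.
for args in (["ps"], ["images"], ["pull", "python:3.9-slim"]):
    print(shlex.join(docker_command(*args)))

# The same command with the sudo prefix described in the note.
print(shlex.join(docker_command("ps", use_sudo=True)))
```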



We’ve learned the basics of Docker, and it’s time to apply what we’ve learned. In this section, we’ll containerize a simple data science application using Docker.


House Price Prediction Model


Let’s take the following linear regression model, which predicts the target value (the median house value) based on the input features. The model is built using the California housing dataset:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")


We know that scikit-learn is a required dependency. If you go through the code, you’ll see that we set as_frame to True when loading the dataset, so we also need pandas. The requirements.txt file therefore lists both libraries: pandas and scikit-learn.
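As an illustration, the sketch below generates the content of such a file for the two libraries. It pins each library to the version installed in the current environment, or leaves the name unpinned if the library isn't installed. The helper names are hypothetical, and the versions printed depend on your environment, not on the tutorial.

```python
from importlib.metadata import PackageNotFoundError, version


def requirement_line(package):
    """Return 'name==version' if the package is installed, else just the name."""
    try:
        return f"{package}=={version(package)}"
    except PackageNotFoundError:
        return package


def render_requirements(packages):
    """Render requirements.txt content, one package per line."""
    return "\n".join(requirement_line(package) for package in packages) + "\n"


content = render_requirements(["pandas", "scikit-learn"])
print(content, end="")
```

Pinning versions is a design choice: it makes image builds reproducible, at the cost of having to update the pins deliberately.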





Create the Dockerfile


So far, we have the source code file and the requirements.txt file. We should now define how to build an image from our application. The Dockerfile holds this definition: it describes how to build an image from the application source code files.

So what is a Dockerfile? It’s a text document that contains step-by-step instructions to build the Docker image.




Here’s the Dockerfile for our example:

# Use the official Python image as the base image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements.txt file to the container
COPY requirements.txt .

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the script file to the container

# Set the command to run your Python script
CMD ["python", ""]


Let’s break down the contents of the Dockerfile:

  • All Dockerfiles start with a FROM instruction specifying the base image. The base image is the image on which your image is based. Here we use an available image for Python 3.9. The FROM instruction tells Docker to build the current image from the specified base image.
  • The WORKDIR instruction sets the working directory for all the following instructions (/app in this example).
  • We then copy the requirements.txt file to the container’s file system.
  • The RUN instruction executes the specified command, in a shell, inside the container. Here we install all the required dependencies using pip.
  • We then copy the source code file, the Python script, to the container’s file system.
  • Finally, CMD specifies the command to be executed when the container starts. Here we need to run the script. The Dockerfile should contain only one CMD instruction.
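The structure described in this breakdown can be checked with a small sketch that splits a Dockerfile into its instructions. The script name main.py is a hypothetical placeholder, not the tutorial's actual file name, and the parser is deliberately naive: it assumes one instruction per line and ignores line continuations.

```python
# The script name main.py is a hypothetical placeholder for illustration.
DOCKERFILE = """\
# Use the official Python image as the base image
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["python", "main.py"]
"""


def parse_instructions(text):
    """Return (INSTRUCTION, arguments) pairs, skipping blank lines and comments."""
    instructions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, arguments = line.partition(" ")
        instructions.append((keyword.upper(), arguments.strip()))
    return instructions


instructions = parse_instructions(DOCKERFILE)
keywords = [keyword for keyword, _ in instructions]

# Mirror the "Step n/total" numbering that docker build prints.
for step, (keyword, arguments) in enumerate(instructions, start=1):
    print(f"Step {step}/{len(instructions)} : {keyword} {arguments}")

print("Starts with FROM:", keywords[0] == "FROM")
print("Number of CMD instructions:", keywords.count("CMD"))
```

Copying requirements.txt and installing dependencies before copying the script lets Docker reuse the cached dependency layer when only the script changes.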


Build the Image


Now that we’ve defined the Dockerfile, we can build the Docker image by running the docker build command, passing the -t option and the path to the directory that contains the Dockerfile.


The -t option lets us specify a name and tag for the image in the name:tag format. The default tag is latest.
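As a rough sketch of how the name:tag format and the default tag fit together, the helpers below assemble a docker build command. They assume the build context is the current directory (the "." argument), which the tutorial does not state explicitly. The function names and the v1 tag are hypothetical. The commands are printed, not executed.

```python
import shlex


def image_reference(name, tag=None):
    """Format an image reference as name:tag, using 'latest' when no tag is given."""
    return f"{name}:{tag or 'latest'}"


def build_command(name, tag=None, context="."):
    """Assemble the docker build argument list for the given image name and tag."""
    return ["docker", "build", "-t", image_reference(name, tag), context]


# Untagged build: the image ends up as ml-app:latest.
print(shlex.join(build_command("ml-app")))

# Explicit tag (hypothetical) for keeping several versions side by side.
print(shlex.join(build_command("ml-app", tag="v1")))
```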

The build process takes a couple of minutes:

Sending build context to Docker daemon  4.608kB
Step 1/6 : FROM python:3.9-slim
3.9-slim: Pulling from library/python
5b5fe70539cd: Pull complete 
f4b0e4004dc0: Pull complete 
ec1650096fae: Pull complete 
2ee3c5a347ae: Pull complete 
d854e82593a7: Pull complete 
Digest: sha256:0074c6241f2ff175532c72fb0fb37264e8a1ac68f9790f9ee6da7e9fdfb67a0e
Status: Downloaded newer image for python:3.9-slim
 ---> 326a3a036ed2
Step 2/6 : WORKDIR /app
...
Step 6/6 : CMD ["python", ""]
 ---> Running in 7fcef6a2ab2c
Removing intermediate container 7fcef6a2ab2c
 ---> 2607aa43c61a
Successfully built 2607aa43c61a
Successfully tagged ml-app:latest


After the Docker image has been built, run the docker images command. You should see the ml-app image listed, too.



You can run the Docker image ml-app using the docker run command, passing the image name as the argument.




Congratulations! You’ve just dockerized your first data science application. By creating a DockerHub account, you can push the image to it (or to a private repository within your organization).



Hope you found this introductory Docker tutorial helpful. You can find the code used in this tutorial in this GitHub repository. As a next step, set up Docker on your machine and try this example. Or dockerize an application of your choice.

The easiest way to install Docker on your machine is to use Docker Desktop: you get both the Docker CLI client and a GUI to manage your containers easily. So set up Docker and get coding right away!

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.
