Posit AI Blog: pins 0.4: Versioning
A new version of pins is available on CRAN today, which adds support for versioning your datasets and DigitalOcean Spaces boards!
As a quick recap, the pins package allows you to cache, discover and share resources. You can use pins in a wide range of situations, from downloading a dataset from a URL to creating complex automation workflows (learn more at pins.rstudio.com). You can also use pins in combination with TensorFlow and Keras; for instance, use cloudml to train models in cloud GPUs, but rather than manually copying files into the GPU instance, you can store them as pins directly from R.
To install this new version of pins from CRAN, simply run:
install.packages("pins")
You can find a detailed list of improvements in the pins NEWS file.
To illustrate the new versioning functionality, let’s start by downloading and caching a remote dataset with pins. For this example, we will download the weather in London; it happens to be in JSON format and requires jsonlite to be parsed:
library(pins)
weather_url <- "https://samples.openweathermap.org/data/2.5/weather?q=London,uk&appid=b6907d289e10d714a6e88b30761fae22"
pin(weather_url, "weather") %>%
  jsonlite::read_json() %>%
  as.data.frame()
coord.lon coord.lat weather.id weather.main weather.description weather.icon
1 -0.13 51.51 300 Drizzle light intensity drizzle 09d
One advantage of using pins is that, even if the URL or your internet connection becomes unavailable, the above code will still work.
But back to pins 0.4! The new signature parameter in pin_info() allows you to retrieve the “version” of this dataset:
pin_info("weather", signature = TRUE)
# Source: local<weather> [files]
# Signature: 624cca260666c6f090b93c37fd76878e3a12a79b
# Properties:
# - path: weather
You can then validate the remote dataset has not changed by specifying its signature:
pin(weather_url, "weather", signature = "624cca260666c6f090b93c37fd76878e3a12a79b") %>%
  jsonlite::read_json()
If the remote dataset changes, pin() will fail and you can take the appropriate steps to accept the changes by updating the signature or properly updating your code.
The previous example is useful as a way of detecting version changes, but we might also want to retrieve specific versions even when the dataset changes.
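To make the idea of a signature check concrete, here is a minimal sketch in base R. It does not reproduce how pins computes its signatures: the helpers `file_signature()` and `check_signature()` are hypothetical, and MD5 from the `tools` package is used only as a convenient stand-in for whatever hash pins applies. A local temporary file plays the role of the downloaded dataset.

```r
# Stand-in for a downloaded dataset: a small JSON file in a temporary location.
path <- tempfile(fileext = ".json")
writeLines('{"city": "London", "temp": 280.3}', path)

# Hypothetical helper: hash the file contents (MD5 via base R's tools package).
file_signature <- function(path) unname(tools::md5sum(path))

# Hypothetical helper: stop if the file no longer matches the expected signature.
check_signature <- function(path, expected) {
  actual <- file_signature(path)
  if (!identical(actual, expected)) {
    stop("signature mismatch: expected ", expected, ", got ", actual)
  }
  invisible(path)
}

# Record the signature once, then verify it before using the data.
expected <- file_signature(path)
check_signature(path, expected)

# Simulate the remote dataset changing and confirm the check now fails.
writeLines('{"city": "London", "temp": 281.9}', path)
changed <- tryCatch(
  {
    check_signature(path, expected)
    FALSE
  },
  error = function(e) TRUE
)
changed
```

Failing loudly on a mismatch, rather than silently accepting new content, is what lets you decide whether to update the recorded signature or adapt your code.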
pins 0.4 allows you to display and retrieve versions from services like GitHub, Kaggle and RStudio Connect. Even in boards that don’t support versioning natively, you can opt in by registering a board with versions = TRUE.
To keep this simple, let’s focus on GitHub first. We will register a GitHub board and pin a dataset to it. Notice that you can also specify the commit parameter in GitHub boards as the commit message for this change.
board_register_github(repo = "javierluraschi/datasets", branch = "datasets")
pin(iris, name = "versioned", board = "github", commit = "use iris as the main dataset")
Now suppose that a colleague comes along and updates this dataset as well:
pin(mtcars, name = "versioned", board = "github", commit = "slight preference to mtcars")
From now on, your code could be broken or, even worse, produce incorrect results!
However, since GitHub was designed as a version control system and pins 0.4 adds support for pin_versions(), we can now explore particular versions of this dataset:
pin_versions("versioned", board = "github")
# A tibble: 2 x 4
version created author message
<chr> <chr> <chr> <chr>
1 6e6c320 2020-04-02T21:28:07Z javierluraschi slight preference to mtcars
2 01f8ddf 2020-04-02T21:27:59Z javierluraschi use iris as the main dataset
You can then retrieve the version you are interested in as follows:
pin_get("versioned", version = "01f8ddf", board = "github")
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
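If you would rather select a version programmatically than copy an identifier by hand, a version listing like the one shown earlier can be treated as an ordinary data frame. The sketch below rebuilds that listing manually so it runs without a GitHub board; with a registered board you would presumably start from the result of pin_versions() instead. Picking the oldest entry by its creation timestamp is just one illustrative selection rule.

```r
# Listing copied from the pin_versions() output above, rebuilt as a plain data frame.
versions <- data.frame(
  version = c("6e6c320", "01f8ddf"),
  created = c("2020-04-02T21:28:07Z", "2020-04-02T21:27:59Z"),
  author  = c("javierluraschi", "javierluraschi"),
  message = c("slight preference to mtcars", "use iris as the main dataset"),
  stringsAsFactors = FALSE
)

# Parse the ISO 8601 timestamps so versions can be compared by time.
created <- as.POSIXct(versions$created, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Select the identifier of the earliest version.
oldest <- versions$version[which.min(created)]
oldest
```

The resulting identifier can then be passed as the version argument of pin_get(), as in the call above.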
You can follow similar steps for RStudio Connect and Kaggle boards, even for existing pins! Other boards like Amazon S3, Google Cloud, DigitalOcean and Microsoft Azure require you to explicitly enable versioning when registering your boards.
To try out the new DigitalOcean Spaces board, first you will have to register this board and enable versioning by setting versions to TRUE:
library(pins)
board_register_dospace(space = "pinstest",
                       key = "AAAAAAAAAAAAAAAAAAAA",
                       secret = "ABCABCABCABCABCABCABCABCABCABCABCABCABCA==",
                       datacenter = "sfo2",
                       versions = TRUE)
You can then use all the functionality pins provides, including versioning:
# create pin and replace content in digitalocean
pin(iris, name = "versioned", board = "pinstest")
pin(mtcars, name = "versioned", board = "pinstest")
# retrieve versions from digitalocean
pin_versions(name = "versioned", board = "pinstest")
# A tibble: 2 x 1
version
<chr>
1 c35da04
2 d9034cd
Notice that enabling versions in cloud services requires additional storage space for each version of the dataset being stored.
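The storage cost can be illustrated with a small local stand-in for a versioned board. This is only a sketch under stated assumptions: the `save_version()` helper, the folder-per-version layout and the short MD5-based identifiers are all hypothetical and are not how pins or DigitalOcean Spaces organise data. The point is simply that each saved version keeps its own copy on disk.

```r
# Hypothetical local stand-in for a versioned board: one folder per saved version.
store <- tempfile("versioned-")
dir.create(store)

# Hypothetical helper: save data under a folder named after a short content hash.
save_version <- function(data, store) {
  tmp <- tempfile(fileext = ".rds")
  saveRDS(data, tmp)
  version <- substr(unname(tools::md5sum(tmp)), 1, 7)
  dir.create(file.path(store, version), showWarnings = FALSE)
  file.copy(tmp, file.path(store, version, "data.rds"), overwrite = TRUE)
  version
}

# Save two versions under the same logical name, as in the example above.
v1 <- save_version(iris, store)
v2 <- save_version(mtcars, store)

# Every version occupies its own file, so total storage is the sum of all versions.
files <- list.files(store, recursive = TRUE, full.names = TRUE)
sizes <- file.size(files)
length(files)
sum(sizes)

# The earlier version remains retrievable after the newer one was saved.
old <- readRDS(file.path(store, v1, "data.rds"))
identical(old, iris)
```

For large datasets that change often, this trade-off between history and storage is worth weighing before enabling versions on a cloud board.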
To learn more, visit the Versioning and DigitalOcean articles. To catch up with previous releases:
Thanks for reading along!