Posit AI Blog: Hugging Face Integrations
We are happy to announce that the first releases of hfhub and tok are now on CRAN.
hfhub is an R interface to Hugging Face Hub, allowing users to download and cache files
from Hugging Face Hub, while tok implements R bindings for the Hugging Face tokenizers
library.
Hugging Face rapidly became the platform to build, share and collaborate on
deep learning applications, and we hope these integrations will help R users
get started with Hugging Face tools as well as build novel applications.
We have also previously announced the safetensors
package, which allows reading and writing files in the safetensors format.
hfhub
hfhub is an R interface to the Hugging Face Hub. hfhub currently implements a single
piece of functionality: downloading files from Hub repositories. Model Hub repositories are
mainly used to store pre-trained model weights together with any other metadata
necessary to load the model, such as the hyperparameter configurations and the
tokenizer vocabulary.
Downloaded files are cached using the same layout as the Python library, so cached
files can be shared between the R and Python implementations, for easier and quicker
switching between languages.
We already use hfhub in the minhub package and
in the ‘GPT-2 from scratch with torch’ blog post to
download pre-trained weights from Hugging Face Hub.
You can use hub_download()
to download any file from a Hugging Face Hub repository
by specifying the repository id and the path to the file that you want to download.
If the file is already in the cache, the function returns the file path immediately;
otherwise the file is downloaded, cached, and then the access path is returned.
path <- hfhub::hub_download("gpt2", "model.safetensors")
path
#> /Users/dfalbel/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/model.safetensors
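The caching behaviour described above can be checked with a short sketch. It assumes hfhub is installed, the Hub is reachable, and that the gpt2 repository contains a small `config.json` file (used here only to keep the download light); the variable names `first` and `second` are illustrative.

```r
# A minimal sketch, assuming hfhub is installed and the Hub is reachable.
library(hfhub)

# config.json is assumed to be a small file in the gpt2 repository,
# so the first call downloads very little.
first <- hub_download("gpt2", "config.json")

# The second call should find the file in the local cache
# and return the same path without downloading again.
second <- hub_download("gpt2", "config.json")

identical(first, second)
file.exists(second)
```

If both expressions evaluate to `TRUE`, the second call was served from the cache location that the first call populated.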
tok
Tokenizers are responsible for converting raw text into the sequence of integers that
is typically used as the input for NLP models, making them a critical component of
NLP pipelines. If you want a higher-level overview of NLP pipelines, you might want to read
our previous blog post ‘What are Large Language Models? What are they not?’.
When using a pre-trained model (whether for inference or for fine-tuning) it’s very
important that you use the exact same tokenization process that was used during
training, and the Hugging Face team has done an amazing job making sure that its algorithms
match the tokenization strategies used by most LLMs.
tok provides R bindings to the 🤗 tokenizers library. The tokenizers library is itself
implemented in Rust for performance, and our bindings use the extendr project
to help with interfacing with R. Using tok we can tokenize text the exact same way most
NLP models do, making it easier to load pre-trained models in R as well as to share
our models with the broader NLP community.
tok can be installed from CRAN, and currently its usage is restricted to loading
tokenizer vocabularies from files. For example, you can load the tokenizer for the GPT2
model with:
tokenizer <- tok::tokenizer$from_pretrained("gpt2")
ids <- tokenizer$encode("Hello world! You can use tokenizers from R")$ids
ids
#> [1] 15496 995 0 921 460 779 11241 11341 422 371
tokenizer$decode(ids)
#> [1] "Hello world! You can use tokenizers from R"
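Since tok loads tokenizers from files, the two packages can also be combined explicitly. The sketch below is one possible way to do it, not necessarily how `from_pretrained()` works internally. It assumes that the gpt2 repository ships a serialized tokenizer named `tokenizer.json` and that `tok::tokenizer$from_file()` accepts a local path to such a file.

```r
# A minimal sketch, assuming hfhub and tok are installed and the Hub is reachable.
library(hfhub)
library(tok)

# Assumption: the gpt2 repository ships its tokenizer as tokenizer.json.
path <- hub_download("gpt2", "tokenizer.json")

# Load the tokenizer from the cached local file rather than by repository id.
tokenizer <- tok::tokenizer$from_file(path)

text <- "Hello world! You can use tokenizers from R"
ids <- tokenizer$encode(text)$ids

# Decoding the ids should give back the original text for this input.
identical(tokenizer$decode(ids), text)
```

Loading from an explicit path can be handy when the tokenizer file lives in a repository or revision that you want to control through `hub_download()`.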
Spaces
Remember that you can already host
Shiny (for R and Python) on Hugging Face Spaces. As an example, we have built a Shiny
app that uses:
- torch to implement GPT-NeoX (the neural network architecture of StableLM, the model used for chatting)
- hfhub to download and cache pre-trained weights from the StableLM repository
- tok to tokenize and pre-process text as input for the torch model. tok also uses hfhub to download the tokenizer’s vocabulary.
The app is hosted in this Space.
It currently runs on CPU, but you can easily switch the Docker image if you want
to run it on a GPU for faster inference.
The app source code is also open-source and can be found in the Space’s Files tab.
Looking forward
These are the very early days of hfhub and tok, and there’s still a lot of work to do
and functionality to implement. We hope to get community help to prioritize the work,
so if there’s a feature that you are missing, please open an issue in the
GitHub repositories.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution, please cite this work as
Falbel (2023, July 12). Posit AI Blog: Hugging Face Integrations. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/
BibTeX citation
@misc{hugging-face-integrations, author = {Falbel, Daniel}, title = {Posit AI Blog: Hugging Face Integrations}, url = {https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/}, year = {2023} }