How to Implement Named Entity Recognition with Hugging Face Transformers
Image by Author | Ideogram
Named entity recognition (NER) is a fundamental natural language processing (NLP) task, involving the identification and classification of named entities within text into predefined categories. These categories could be person names, organizations, locations, dates, and more. It is a useful capability in a variety of real-world scenarios, from information extraction, to text summarization, to Q&A, and beyond. In fact, for any situation in which understanding and categorizing specific elements of text is a goal, NER can help.
Let’s take a look at how we can perform NER using that Swiss Army knife of NLP and LLM libraries, Hugging Face’s Transformers.
Preparing Our Environment
We’ll assume you already have a Python environment set up. We can then begin by installing the required libraries, which in our case are Transformers and PyTorch. We can use pip to do so:
pip install transformers torch
Now, let’s do our imports:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoConfig
Next we need a model. Hugging Face offers a range of pre-trained models suitable for NER, and for this tutorial we’ll use the dbmdz/bert-large-cased-finetuned-conll03-english
model, which has been fine-tuned on the CoNLL-03 dataset for English NER tasks. You can load the model, along with its configuration (which holds the ID-to-label mapping we’ll need later), with the code below:
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
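If you want to confirm what the model can actually tag, you can inspect the configuration’s id2label mapping. As a general point of reference, CoNLL-03 checkpoints typically use O plus B-/I- tags for PER, ORG, LOC, and MISC, but the exact mapping is whatever this checkpoint ships with, so it’s worth printing:

# Inspect the label set this checkpoint was trained with; for CoNLL-03
# models this is typically O plus B-/I- tags for PER, ORG, LOC, and MISC
print(config.id2label)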
Here is some text we can use for our testing:
KDnuggets is a leading site and online community focused on data science, machine learning, artificial intelligence, and analytics. Founded in 1997 near Boston, Massachusetts by Gregory Piatetsky-Shapiro, it has become one of the most prominent resources for professionals and enthusiasts in these fields. The site features articles, tutorials, news, and educational content contributed by industry experts and practitioners. The website’s name originated from “Knowledge Discovery Nuggets,” and began life as an email summarizing the proceedings of the knowledge discovery (data mining) industry’s original conference, the KDD Conference, reflecting its mission to share valuable bits of knowledge in the field of data mining and analytics. It has played a significant role in building and supporting the data mining and data science communities over the past decades.
Before we can actually get to recognizing named entities, we first need to tokenize our text, given that the resulting tokens will become the model input. You can do so with the following:
text = """KDnuggets is a leading site and online community focused on data science, machine learning, artificial intelligence, and analytics. Founded in 1997 near Boston, Massachusetts by Gregory Piatetsky-Shapiro, it has become one of the most prominent resources for professionals and enthusiasts in these fields. The site features articles, tutorials, news, and educational content contributed by industry experts and practitioners. The website's name originated from "Knowledge Discovery Nuggets," a spin-off email summarizing the proceedings of the knowledge discovery industry's original conference, reflecting its mission to share valuable bits of knowledge in the field of data mining and analytics. It has played a significant role in building and supporting the data mining and data science communities over the past decades."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
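It can also be instructive to peek at how the tokenizer splits words into WordPiece subwords, since we’ll need to stitch those pieces back together later. A quick sketch (the exact split shown in the comment is illustrative and depends on this checkpoint’s vocabulary):

# Words absent from the vocabulary are split into subword pieces;
# every piece after the first is prefixed with "##"
print(tokenizer.tokenize("KDnuggets"))
# e.g. ['KD', '##nu', '##gg', '##ets']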
Performing Named Entity Recognition
With the text tokenized, we can now run our model to perform NER. The model will output logits, which are the raw predictions for each token. Here’s the code to obtain the logits:
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
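As a sanity check, the logits tensor has shape (batch_size, sequence_length, num_labels), which is why the argmax is taken over dim=2; a quick look:

# One row of scores per token, one column per label; argmax over the
# last dimension (dim=2) picks the highest-scoring label for each token
print(outputs.logits.shape)  # torch.Size([1, <sequence_length>, <num_labels>])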
The tokens need to be mapped to their corresponding entity labels for meaningful output. We can do this by converting the token IDs back to tokens and mapping the predictions to human-readable labels:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
token_labels = [config.id2label[p.item()] for p in predictions[0]]
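Printing a few raw token/label pairs at this point makes it clear why the subword handling below is needed, since each piece of a split word carries its own label:

# Raw per-token view before any merging; subword pieces of a single
# entity each carry their own predicted label here
for token, label in list(zip(tokens, token_labels))[:10]:
    print(f"{token}\t{label}")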
Next, we’ll process the tokens and labels to handle subwords and format them into readable output:
# Process results, handling special tokens and subwords
results = []
current_entity = []
current_label = None

for token, label in zip(tokens[1:-1], token_labels[1:-1]):  # Skip [CLS] and [SEP]
    # Handle subwords
    if token.startswith("##"):
        if current_entity:
            current_entity[-1] += token[2:]
        continue

    # Handle entity continuation
    if label.startswith("B-") or label == "O":
        if current_entity:
            results.append((" ".join(current_entity), current_label))
            current_entity = []
        if label != "O":
            current_entity = [token]
            current_label = label[2:]  # Remove B- prefix
    elif label.startswith("I-"):
        if not current_entity:
            current_entity = [token]
            current_label = label[2:]
        else:
            current_entity.append(token)

# Add final entity if it exists
if current_entity:
    results.append((" ".join(current_entity), current_label))

# Print results
for entity, label in results:
    if label:  # Only print actual entities
        print(f"{entity}: {label}")
This code handles subword tokens (like “##ing”), tracks entity boundaries using B- (beginning) and I- (inside) prefixes, and prints only the recognized named entities with their labels.
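As an aside, if you don’t need this manual control, the Transformers library’s built-in token classification pipeline performs the same subword merging for you. Here is a minimal sketch using the same checkpoint (the groupings and scores may differ slightly from our manual approach):

from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- subword pieces into whole
# entities and attaches an aggregate confidence score to each one
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

for entity in ner(text):
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")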
Full Implementation
Finally, let’s put it all together:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoConfig

def perform_ner(text, model_name="dbmdz/bert-large-cased-finetuned-conll03-english"):
    try:
        # Load model, tokenizer, and config
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForTokenClassification.from_pretrained(model_name)
        config = AutoConfig.from_pretrained(model_name)

        # Tokenize input
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)

        # Get tokens and their predicted labels
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        token_labels = [config.id2label[p.item()] for p in predictions[0]]

        # Process results, handling special tokens and subwords
        results = []
        current_entity = []
        current_label = None

        # Skip [CLS] and [SEP]
        for token, label in zip(tokens[1:-1], token_labels[1:-1]):
            # Handle subwords
            if token.startswith("##"):
                if current_entity:
                    current_entity[-1] += token[2:]
                continue

            # Handle entity continuation
            if label.startswith("B-") or label == "O":
                if current_entity:
                    results.append((" ".join(current_entity), current_label))
                    current_entity = []
                # Remove B- prefix
                if label != "O":
                    current_entity = [token]
                    current_label = label[2:]
            elif label.startswith("I-"):
                if not current_entity:
                    current_entity = [token]
                    current_label = label[2:]
                else:
                    current_entity.append(token)

        # Add final entity if it exists
        if current_entity:
            results.append((" ".join(current_entity), current_label))

        return results

    except Exception as e:
        print(f"Error performing NER: {str(e)}")
        return []

text = """KDnuggets is a leading site and online community focused on data science, machine learning, artificial intelligence, and analytics. Founded in 1997 near Boston, Massachusetts by Gregory Piatetsky-Shapiro, it has become one of the most prominent resources for professionals and enthusiasts in these fields. The site features articles, tutorials, news, and educational content contributed by industry experts and practitioners. The website's name originated from "Knowledge Discovery Nuggets," a spin-off email summarizing the proceedings of the knowledge discovery industry's original conference, reflecting its mission to share valuable bits of knowledge in the field of data mining and analytics. It has played a significant role in building and supporting the data mining and data science communities over the past decades."""

results = perform_ner(text)
print(results)
And here are the results:
[('KDnuggets', 'ORG'), ('Boston', 'LOC'), ('Massachusetts', 'LOC'), ('Gregory Piatetsky - Shapiro', 'PER'), ('Knowledge Discovery Nuggets', 'ORG')]
There you have it, the named entities we would have expected!
Wrapping Up
Implementing NER using Hugging Face Transformers is both powerful and straightforward. By leveraging pre-trained models and tokenizers, you can quickly set up and perform NER tasks on a variety of text data. Experimenting with different models and datasets can further enhance your NER capabilities and provide valuable insights for your data science projects.
This tutorial has provided a step-by-step guide to setting up your environment, processing text, performing NER, and interpreting the outputs. Feel free to explore and adapt this approach to fit your specific needs and applications.
Matthew Mayo (@mattmayo13) holds a master’s degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.