GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs


GraphStorm is a low-code enterprise graph machine learning (GML) framework to build, train, and deploy graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly take into account the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search/retrieval problems.

Today, we are launching GraphStorm 0.3, adding native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets on different nodes and edges within a single training loop. In addition, GraphStorm 0.3 adds new APIs to customize GraphStorm pipelines: you now only need 12 lines of code to implement a custom node classification training loop. To help you get started with the new API, we have published two Jupyter notebook examples: one for node classification, and one for a link prediction task. We also released a comprehensive study of co-training language models (LM) and graph neural networks (GNN) for large graphs with rich text features using the Microsoft Academic Graph (MAG) dataset from our KDD 2024 paper. The study showcases the performance and scalability of GraphStorm on text-rich graphs and the best practices for configuring GML training loops for better performance and efficiency.

Native support for multi-task learning on graphs

Many enterprise applications have graph data associated with multiple tasks on different nodes and edges. For example, retail organizations want to conduct fraud detection on both sellers and buyers. Scientific publishers want to find more related works to cite in their papers and want to select the right subject for their publications to be discoverable. To better model such applications, customers have asked us to support multi-task learning on graphs.

GraphStorm 0.3 supports multi-task learning on graphs with the six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You can specify the training targets through a YAML configuration file. For example, a scientific publisher can use the following YAML configuration to simultaneously define a paper subject classification task on paper nodes and a link prediction task on paper-citing-paper edges:

version: 1.0
    gsf:
        basic: # basic settings of the backbone GNN model
            ...
        ...
        multi_task_learning:
            - node_classification:         # define a node classification task for paper subject prediction.
                target_ntype: "paper"      # the paper nodes are the training targets.
                label_field: "label_class" # the node feature "label_class" contains the training labels.
                mask_fields:
                    - "train_mask_class"   # the train mask is named train_mask_class.
                    - "val_mask_class"     # the validation mask is named val_mask_class.
                    - "test_mask_class"    # the test mask is named test_mask_class.
                num_classes: 10            # There are 10 different classes (subjects) to predict in total.
                task_weight: 1.0           # The task weight is 1.0.

            - link_prediction:                # define a link prediction task for paper citation recommendation.
                num_negative_edges: 4         # Sample 4 negative edges for each positive edge during training.
                num_negative_edges_eval: 100  # Sample 100 negative edges for each positive edge during evaluation.
                train_negative_sampler: joint # Share the negative edges between positive edges (to speed up training).
                train_etype:
                    - "paper,citing,paper"    # The target edge type for link prediction training is (paper, citing, paper).
                mask_fields:
                    - "train_mask_lp"         # the train mask is named train_mask_lp.
                    - "val_mask_lp"           # the validation mask is named val_mask_lp.
                    - "test_mask_lp"          # the test mask is named test_mask_lp.
                task_weight: 0.5              # The task weight is 0.5.

For more details about how to run graph multi-task learning with GraphStorm, refer to Multi-task Learning in GraphStorm in our documentation.
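
The task_weight fields in the YAML above control how much each task contributes to the overall training objective. Conceptually, multi-task training optimizes a weighted sum of the per-task losses; the following minimal sketch illustrates the idea under that assumption (it is an illustration of the configuration above, not GraphStorm's internal code):

import torch

def combine_task_losses(task_losses, task_weights):
    """Weighted sum of per-task losses, as configured by task_weight."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())

# With the YAML above: node classification weight 1.0, link prediction weight 0.5.
losses = {"node_classification": torch.tensor(0.8), "link_prediction": torch.tensor(1.2)}
weights = {"node_classification": 1.0, "link_prediction": 0.5}
total_loss = combine_task_losses(losses, weights)  # 1.0*0.8 + 0.5*1.2 = 1.4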

New APIs to customize GraphStorm pipelines and components

Since GraphStorm's release in early 2023, customers have primarily used its command line interface (CLI), which abstracts away the complexity of the graph ML pipeline so that you can quickly build, train, and deploy models using common recipes. However, customers are telling us that they want an interface that allows them to customize the training and inference pipeline of GraphStorm to their specific requirements more easily. Based on customer feedback on the experimental APIs we released in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML pipeline APIs. With the new APIs, you only need 12 lines of code to define a custom node classification training pipeline, as illustrated by the following example:

import graphstorm as gs
gs.initialize()

acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

train_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']), fanout=[20, 20], batch_size=64)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)
test_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_test_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)

model = RgcnNCModel(g=acm_data.g, num_hid_layers=2, hid_size=128, num_classes=14)  # user-defined RGCN model; see the notebook examples
evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)

trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(evaluator)

trainer.fit(train_dataloader, val_dataloader, test_dataloader, num_epochs=5)

To help you get started with the new APIs, we have also released new Jupyter notebook examples on our Documentation and Tutorials page.
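
The same API style extends beyond training. As a follow-up to the 12-line example above, the sketch below restores the best checkpoint found during fit() and runs offline inference on the test set. The inferrer class and its argument names are assumptions based on the GraphStorm API reference; verify them against the notebooks for your installed version:

from graphstorm.inference import GSgnnNodePredictionInferrer  # assumed import path

# Reload the best model weights selected by the evaluator during training.
best_model_path = trainer.get_best_model_path()
model.restore_model(best_model_path)

# Reuse the test dataloader defined above to produce and save predictions.
inferrer = GSgnnNodePredictionInferrer(model)
inferrer.infer(test_dataloader,
               save_embed_path='./embeddings',       # hypothetical output locations
               save_prediction_path='./predictions')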

Comprehensive study of LM+GNN for large graphs with rich text features

Many enterprise applications have graphs with text features. In retail search applications, for example, shopping log data provides insights on how text-rich product descriptions, search queries, and customer behavior are related. Foundational large language models (LLMs) alone are not suitable to model such data because the underlying data distributions and relationships don't correspond to what LLMs learn from their pre-training data corpuses. GML, on the other hand, is great for modeling related data (graphs), but until now, GML practitioners had to manually combine their GML models with LLMs to model text features and get the best performance for their use cases. Especially when the underlying graph dataset was large, this manual work was challenging and time-consuming.

In GraphStorm 0.2, GraphStorm introduced built-in techniques to train language models (LMs) and GNN models together efficiently at scale on massive text-rich graphs. Since then, customers have been asking us for guidance on how GraphStorm's LM+GNN techniques should be employed to optimize performance. To address this, with GraphStorm 0.3, we released an LM+GNN benchmark using the large graph dataset Microsoft Academic Graph (MAG) on two standard graph ML tasks: node classification and link prediction. The graph dataset is a heterogeneous graph, contains hundreds of millions of nodes and billions of edges, and the majority of its nodes are attributed with rich text features. The detailed statistics of the dataset are shown in the following table.

| Dataset | Num. of nodes | Num. of edges | Num. of node/edge types | Num. of nodes in NC training set | Num. of edges in LP training set | Num. of nodes with text features |
|---------|---------------|---------------|-------------------------|----------------------------------|----------------------------------|----------------------------------|
| MAG | 484,511,504 | 7,520,311,838 | 4/4 | 28,679,392 | 1,313,781,772 | 240,955,156 |

We benchmark two main LM-GNN methods in GraphStorm: pre-trained BERT+GNN, a baseline method that is widely adopted, and fine-tuned BERT+GNN, introduced by GraphStorm developers in 2022. With the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features and then train a GNN model for prediction. With the fine-tuned BERT+GNN method, we initially fine-tune the BERT models on the graph data and use the resulting fine-tuned BERT model to compute embeddings that are then used to train a GNN model for prediction. GraphStorm provides different ways to fine-tune the BERT models, depending on the task type. For node classification, we fine-tune the BERT model on the training set with the node classification task; for link prediction, we fine-tune the BERT model with the link prediction task. In the experiment, we use 8 r5.24xlarge instances for data processing and 4 g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach achieves up to 40% better performance (link prediction on MAG) compared to pre-trained BERT+GNN.
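
The pre-trained BERT+GNN baseline boils down to encoding each node's text once with a frozen BERT model and storing the resulting vectors as node features for GNN training. The following minimal sketch shows that first step with Hugging Face Transformers (the bert-base-uncased checkpoint and [CLS] pooling are illustrative assumptions; GraphStorm performs this embedding computation for you at scale):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

@torch.no_grad()
def embed_texts(texts, max_length=128):
    """Encode raw node text into fixed-size vectors to use as GNN node features."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    out = bert(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] embedding for each input text

titles = ["Graph neural networks at scale", "Multi-task learning on graphs"]
features = embed_texts(titles)  # shape (2, 768); stored as features of 'paper' nodes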

The following table shows the model performance of the two methods and the overall computation time of the whole pipeline, starting from data processing and graph construction. NC means node classification and LP means link prediction. LM Time Cost means the time spent on computing BERT embeddings (for pre-trained BERT+GNN) and the time spent on fine-tuning the BERT models (for fine-tuned BERT+GNN), respectively.

| Dataset | Task | Data processing time | Target | LM time cost (pre-trained) | One epoch time (pre-trained) | Metric (pre-trained) | LM time cost (fine-tuned) | One epoch time (fine-tuned) | Metric (fine-tuned) |
|---------|------|----------------------|--------|----------------------------|------------------------------|----------------------|---------------------------|------------------------------|----------------------|
| MAG | NC | 553 min | paper subject | 206 min | 135 min | Acc: 0.572 | 1423 min | 137 min | Acc: 0.633 |
| MAG | LP | 553 min | cite | 198 min | 2195 min | MRR: 0.487 | 4508 min | 2172 min | MRR: 0.684 |

We also benchmark GraphStorm on large synthetic graphs to showcase its scalability. We generate three synthetic graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time of graph preprocessing, graph partition, and model training. Overall, GraphStorm enables graph construction and model training on graphs at the 100-billion-edge scale within hours!

| Graph size | # instances | Data pre-processing time | Graph partition time | Model training time |
|------------|-------------|--------------------------|----------------------|---------------------|
| 1B | 4 | 19 min | 8 min | 1.5 min |
| 10B | 8 | 31 min | 41 min | 8 min |
| 100B | 16 | 61 min | 416 min | 50 min |

More benchmark details and results are available in our KDD 2024 paper.

Conclusion

GraphStorm 0.3 is published under the Apache-2.0 license to help you tackle your large-scale graph ML challenges, and now offers native support for multi-task learning and new APIs to customize pipelines and other components of GraphStorm. Refer to the GraphStorm GitHub repository and documentation to get started.


About the Authors

Xiang Song is a senior applied scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in the graph database. He is now leading the development of GraphStorm, an open-source graph machine learning framework for enterprise use cases. He received his Ph.D. in computer systems and architecture from Fudan University, Shanghai, in 2014.

Jian Zhang is a senior applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, the USA, and Singapore. As an enlightener of AWS's graph capabilities, Zhang has given many public presentations about GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large-scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist, a field in which he holds a PhD.
