GraphGen Engine Refactor with Ray Data #115

ChenZiHong-Gavin (Maintainer) · 2025-12-16T08:35:24Z

We're thrilled to announce a groundbreaking refactor of GraphGen's pipeline engine, now powered by Ray Data for truly distributed, scalable data processing! This architectural evolution transforms GraphGen from a multi-threaded local engine into a production-ready distributed system that unifies heterogeneous resource management and enables high-performance streaming pipelines.

✨ What's New

1. Ray Data-Powered Execution Engine

True Distributed Processing: Leverage Ray's actor model and dataset abstractions for seamless multi-node scaling. The engine now performs topological sorting of your YAML-defined DAG and translates it into a streaming Ray Data flow, where each operator becomes a distributed transformation.
Elastic Resource Management: Automatic GPU/CPU allocation with ActorPoolStrategy - no more manual worker management. Declare resources per operator (replicas, num_gpus, batch_size) and Ray handles the rest.
Streaming Dataflow with Lazy Execution: Efficient pipeline execution with built-in batching, partitioning, and backpressure handling. The entire DAG is constructed lazily and only materializes when you consume final results, minimizing intermediate data serialization.
Fault Tolerance: Ray's native failure recovery ensures robust long-running jobs, with per-operator retry logic and lineage reconstruction.
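
To make the lazy, batched execution model concrete, here is a minimal stdlib-only sketch. It does not use Ray and is not GraphGen's engine code: `batched`, `build_pipeline`, and the two toy stages are hypothetical names chosen for illustration. It shows the same idea in miniature: composing stages runs nothing, and batches flow through every stage one at a time once the result is consumed.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[str]


def batched(items: Iterable[str], batch_size: int) -> Iterator[Batch]:
    """Yield lists of at most batch_size items, one batch at a time."""
    batch: Batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_pipeline(
    source: Iterable[str],
    stages: List[Callable[[Batch], Batch]],
    batch_size: int,
) -> Iterator[Batch]:
    """Chain stages lazily; map() does no work until the result is iterated."""
    stream: Iterator[Batch] = batched(source, batch_size)
    for stage in stages:
        stream = map(stage, stream)
    return stream


calls: List[str] = []  # records the order in which stages actually run


def chunk(batch: Batch) -> Batch:
    calls.append("chunk")
    return [word for doc in batch for word in doc.split()]


def upper(batch: Batch) -> Batch:
    calls.append("upper")
    return [word.upper() for word in batch]


pipeline = build_pipeline(["a b", "c d", "e"], [chunk, upper], batch_size=2)
ran_before_consume = list(calls)  # still empty: the pipeline is only a plan
results = list(pipeline)          # consuming the output triggers execution
```

After consumption, `calls` alternates between the two stages, which shows that each batch passes through the whole chain before the next batch is read, rather than each stage materializing its full output first.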

2. New Operator Framework

BaseOperator Class: All operators (read, chunk, build_kg, generate, etc.) now inherit from a unified Ray-native base class with integrated logging and lifecycle management.
Per-Actor Logging: Individual log files per operator instance (cache/logs/OperatorName_workerID.log) for easier debugging in distributed environments.
Flexible Execution Models: Native support for map, filter, flatmap, map_batch, and aggregate patterns, with batch size control and automatic pandas DataFrame serialization.
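
As a rough illustration of the base-class idea, the sketch below shows one way a shared base class could give every operator instance its own log file following the `OperatorName_workerID.log` pattern. This is an assumption-laden stand-in, not the real `BaseOperator`: the constructor arguments, the `process` method name, and the toy `ChunkOperator` body are all hypothetical, and the example writes to a temporary directory instead of `cache/logs`.

```python
import logging
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class BaseOperator(ABC):
    """Hypothetical base class: sets up one log file per operator instance."""

    def __init__(self, worker_id: int, log_dir: str = "cache/logs") -> None:
        self.name = type(self).__name__
        self.log_path = Path(log_dir) / f"{self.name}_{worker_id}.log"
        self.log_path.parent.mkdir(parents=True, exist_ok=True)

        # A dedicated logger per (operator, worker) pair keeps files separate.
        self.logger = logging.getLogger(f"{self.name}.{worker_id}")
        self.logger.setLevel(logging.INFO)
        self.logger.propagate = False
        if not self.logger.handlers:
            handler = logging.FileHandler(self.log_path, encoding="utf-8")
            self.logger.addHandler(handler)

    @abstractmethod
    def process(self, batch: List[str]) -> List[str]:
        """Transform one batch; subclasses implement the actual logic."""

    def __call__(self, batch: List[str]) -> List[str]:
        self.logger.info("processing %d items", len(batch))
        return self.process(batch)


class ChunkOperator(BaseOperator):
    """Toy subclass: splits each text on sentence boundaries."""

    def process(self, batch: List[str]) -> List[str]:
        return [piece for text in batch for piece in text.split(". ")]


log_dir = tempfile.mkdtemp()  # stand-in for cache/logs
op = ChunkOperator(worker_id=0, log_dir=log_dir)
out = op(["One. Two", "Three"])
```

Making the instance callable is a design choice for this sketch: a callable class is a convenient shape for a stateful per-batch transformation, since setup happens once in `__init__` and each batch only pays for `process`.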

3. Distributed Storage Layer with Actor Isolation

Ray Actor-Backed Storage: Both Graph (Kuzu/NetworkX) and KV (RocksDB/JSON) stores are now managed by persistent Ray actors, so workers call into a single shared store instead of serializing store objects into each task, which also removes write races.
Thread-Safe Operations: Actor-based storage provides atomic upsert/get operations, because a Ray actor handles one method call at a time by default. This prevents corruption when multiple workers simultaneously write to the knowledge graph.
Lifecycle Management: Detached actors survive driver exit - perfect for iterative development and avoiding costly re-initialization of graph databases.
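
The reason a single owning actor avoids write races can be shown without Ray. The following stdlib-only model is a sketch, not GraphGen's storage code: `StorageActor`, `upsert_count`, and the mailbox design are hypothetical. One thread owns the data and applies requests from a queue strictly one at a time, so a read-modify-write such as an upsert cannot interleave with another, no matter how many callers submit concurrently.

```python
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class StorageActor:
    """Toy actor: a single thread owns the dict and serializes all requests."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._mailbox: "queue.Queue" = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        # Requests are applied in arrival order, never concurrently.
        while True:
            fn, args, fut = self._mailbox.get()
            try:
                fut.set_result(fn(*args))
            except Exception as exc:  # report failures to the caller
                fut.set_exception(exc)

    def _submit(self, fn: Callable[..., Any], *args: Any) -> Future:
        fut: Future = Future()
        self._mailbox.put((fn, args, fut))
        return fut

    def _upsert_count(self, node_id: str) -> int:
        # Read-modify-write; safe only because one thread ever runs it.
        self._data[node_id] = self._data.get(node_id, 0) + 1
        return self._data[node_id]

    def upsert_count(self, node_id: str) -> Future:
        return self._submit(self._upsert_count, node_id)

    def get(self, node_id: str) -> Future:
        return self._submit(self._data.get, node_id)


store = StorageActor()
with ThreadPoolExecutor(max_workers=8) as pool:
    # Eight worker threads submit 1000 upserts for the same key.
    futures = list(pool.map(lambda _: store.upsert_count("node-a"), range(1000)))
for fut in futures:
    fut.result()
total = store.get("node-a").result()
```

Callers only ever hold futures, never the dictionary itself, which is the property that keeps the store consistent in this model.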

4. LLM Serviceization for Resource Reuse

Dedicated LLM Actors: Each LLM backend (sglang, vLLM, OpenAI, etc.) runs in isolated actors with proper GPU allocation. Models stay loaded in memory throughout the pipeline execution.
LLMServiceProxy Pattern: Lightweight proxies in each operator worker communicate with central LLM actors via Ray handles, enabling efficient model sharing without pickling model weights.
Dynamic GPU Scaling: SYNTHESIZER_NUM_GPUS and TRAINEE_NUM_GPUS environment variables control resource assignment, with automatic actor placement on appropriate nodes.
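
The proxy idea can be sketched in plain Python. Everything below is illustrative and assumed: `EchoLLMService` stands in for a central LLM actor, the `LLMServiceProxy` body is a guess at the pattern and not the project's class, and `gpus_from_env` with its default of `0` is a hypothetical helper for reading the two environment variables. The point is that each worker holds only a small handle, while the expensive model object exists once.

```python
import os


def gpus_from_env(var: str, default: float = 0.0) -> float:
    """Read a GPU count from the environment; the default is an assumption."""
    raw = os.environ.get(var, "").strip()
    return float(raw) if raw else default


class EchoLLMService:
    """Stand-in for a central LLM actor; the 'model' lives only here."""

    def __init__(self, num_gpus: float) -> None:
        self.num_gpus = num_gpus
        self.calls = 0

    def generate(self, prompt: str) -> str:
        self.calls += 1
        return f"echo: {prompt}"


class LLMServiceProxy:
    """Lightweight handle held by a worker; it forwards calls, nothing more."""

    def __init__(self, service: EchoLLMService) -> None:
        self._service = service

    def generate(self, prompt: str) -> str:
        return self._service.generate(prompt)


# One service instance, sized from the environment, shared by three workers.
service = EchoLLMService(num_gpus=gpus_from_env("SYNTHESIZER_NUM_GPUS"))
proxies = [LLMServiceProxy(service) for _ in range(3)]
answers = [proxy.generate(f"q{i}") for i, proxy in enumerate(proxies)]
```

In a Ray deployment the proxy would hold an actor handle and forward calls remotely, but the division of responsibility is the same as in this local model.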

🏗️ Architecture Deep Dive

We've restructured GraphGen into a modular, maintainable architecture:

bases/ (Core Abstractions)

Defines unified interfaces: BaseOperator, BaseLLMWrapper, BaseStorage
Standardizes data types: Chunk, Node, Config, Community

models/ (Atomic Capabilities)

Self-contained algorithm implementations (readers, splitters, storage backends, LLM wrappers)
Each model can be independently tested and developed

operators/ (Ray-Ready Tasks)

Business logic nodes that wrap models into Ray-schedulable operators
Examples: ReadOperator, ChunkOperator, BuildKGOperator, GenerateOperator

common/ (Global Services)

Singleton factories: init_llm() and init_storage() manage actor lifecycles
Ensures one LLM actor and one storage actor per namespace

engine.py (DAG Orchestrator)

Parses YAML configs, builds computation graph, performs topological sorting
Translates graph nodes into Ray Data transformations with proper dependency handling
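
The topological-sorting step can be sketched with Python's standard library. The node layout below (a list of dicts with `id` and `dependencies` keys) is an assumed shape for a parsed config, not GraphGen's actual YAML schema, and `execution_order` is a hypothetical helper. The node names reuse the operators mentioned earlier in this post.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def execution_order(nodes: List[Dict]) -> List[str]:
    """Return node ids so that every node comes after its dependencies."""
    graph = {node["id"]: node.get("dependencies", []) for node in nodes}
    return list(TopologicalSorter(graph).static_order())


# Nodes listed out of order on purpose; the sort recovers a valid schedule.
parsed_config = [
    {"id": "generate", "dependencies": ["build_kg"]},
    {"id": "chunk", "dependencies": ["read"]},
    {"id": "read"},
    {"id": "build_kg", "dependencies": ["chunk"]},
]
order = execution_order(parsed_config)

# A dependency cycle is rejected instead of producing a bogus schedule.
cyclic_config = [
    {"id": "a", "dependencies": ["b"]},
    {"id": "b", "dependencies": ["a"]},
]
try:
    execution_order(cyclic_config)
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

With an order in hand, an engine can walk the list and attach each operator to the output of its dependencies; rejecting cycles up front means a misconfigured pipeline fails before any work is scheduled.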
