Suhwan Choi*, Jaeyoon Jung*, Haebin Seong*, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu‡, Yunsung Lee‡
- [2026/01/15] We release the Generalist-IDM demo on Hugging Face Spaces (lastdefiance20/Generalist-IDM), the model weights (open-world-agents/Generalist-IDM-1B), and the inference code (inference.py).
- [2025/12/18] We release the FHD/QHD versions of the dataset on Hugging Face (open-world-agents/D2E-Original) for training world models and video generation models. We also fix issues in the 480p dataset (open-world-agents/D2E-480p).
- [2025/12/01] We release the 480p version of the dataset on Hugging Face (open-world-agents/D2E-480p): 267 hours of synchronized video, audio, and input events from 29 PC games across diverse genres (FPS, open-world, sandbox, and more), for training vision-action models and game agents.
- [2025/10/21] We release part of our source code; the rest is coming soon! The ocap and owa toolkits are already open-sourced, so have a look at these first:
  - https://github.com/open-world-agents/ocap: ocap (Omnimodal CAPture) captures all essential desktop signals in a synchronized format. It records screen video, audio, keyboard/mouse input, and window events.
  - https://github.com/open-world-agents/open-world-agents: A versatile and efficient monorepo that embraces and grows multiple projects, containing all the essential building blocks for agent development.
  - https://worv-ai.github.io/d2e/: D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI. Code coming soon!
We provide 267 hours of synchronized video, audio, and input events from 29 PC games across diverse genres (FPS, open-world, sandbox, and more).
| Dataset | Resolution | Use Case |
|---|---|---|
| open-world-agents/D2E-480p | 480p 60fps | Vision-action model training |
| open-world-agents/D2E-Original | FHD/QHD | World models, video generation |
- Video + Audio: H.264 encoded at 60fps with game audio
- Input Events: Keyboard (press/release), mouse (clicks, coordinates, raw HID deltas)—all with nanosecond timestamps
- OWAMcap Format: Built on MCAP, indexed for fast random access
```bash
pip install mcap-owa-support owa-msgs huggingface_hub
```

```python
from huggingface_hub import hf_hub_download
from mcap_owa.highlevel import OWAMcapReader

# Download a sample recording
mcap_file = hf_hub_download(
    repo_id="open-world-agents/D2E-480p",
    filename="Apex_Legends/0805_01.mcap",
    repo_type="dataset"
)

with OWAMcapReader(mcap_file) as reader:
    # Load a video frame
    for msg in reader.iter_messages(topics=["screen"]):
        screen = msg.decoded
        screen.resolve_relative_path(mcap_file)
        frame = screen.load_frame_array()  # numpy array (H, W, 3)
        break

    # Read keyboard events
    for msg in reader.iter_messages(topics=["keyboard"]):
        print(msg.decoded)  # KeyboardEvent(event_type='press', vk=87)
        break

    # Read raw mouse events
    for msg in reader.iter_messages(topics=["mouse/raw"]):
        print(msg.decoded)  # RawMouseEvent(last_x=12, last_y=-3, button_flags=0)
        break
```

Explore recordings in your browser with a synchronized keyboard/mouse overlay: Open in Dataset Visualizer
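The same reader calls can be folded into a small preprocessing sketch. The snippet below is a hypothetical example rather than part of the released tooling: it reuses `mcap_file` from the snippet above, stacks the first few frames into a numpy clip, and collects pressed virtual-key codes, assuming the decoded `KeyboardEvent` exposes the `event_type` and `vk` fields shown in its repr above.

```python
import numpy as np
from mcap_owa.highlevel import OWAMcapReader

NUM_FRAMES = 16  # hypothetical clip length; adjust as needed

frames = []
pressed_keys = []

# Reuses `mcap_file` downloaded in the previous snippet.
with OWAMcapReader(mcap_file) as reader:
    # Stack the first few video frames into one (T, H, W, 3) array
    for msg in reader.iter_messages(topics=["screen"]):
        screen = msg.decoded
        screen.resolve_relative_path(mcap_file)
        frames.append(screen.load_frame_array())
        if len(frames) >= NUM_FRAMES:
            break

    # Collect virtual-key codes of key presses across the whole recording
    for msg in reader.iter_messages(topics=["keyboard"]):
        event = msg.decoded  # e.g. KeyboardEvent(event_type='press', vk=87)
        if event.event_type == "press":
            pressed_keys.append(event.vk)

clip = np.stack(frames)  # shape (T, H, W, 3)
print(clip.shape, sorted(set(pressed_keys)))
```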
We provide owa-data, a data pipeline that converts this dataset into HuggingFace Datasets ready for PyTorch DataLoader with tokenization and sequence packing.
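As a rough illustration of the final step, a converted split can be consumed like any other HuggingFace dataset. The sketch below is not the owa-data API itself; the dataset path and column names are placeholders, and it assumes the pipeline output is a single-split `datasets.Dataset` saved to disk.

```python
# Minimal sketch, not the owa-data API itself: the path and column names are
# placeholders and depend on how you ran the owa-data pipeline.
from datasets import load_from_disk
from torch.utils.data import DataLoader

dataset = load_from_disk("path/to/owa-data-output")  # placeholder path, single split assumed
dataset = dataset.with_format("torch")               # yield PyTorch tensors

loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

for batch in loader:
    # Actual keys depend on the owa-data tokenization / packing configuration.
    print({k: tuple(v.shape) for k, v in batch.items() if hasattr(v, "shape")})
    break
```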
Run action prediction on any gameplay video using inference.py. The script uses uv for dependency management—no manual installation required.
- uv
- FFmpeg (for video preprocessing)
- CUDA-capable GPU (recommended, ~8GB+ VRAM)
⏱️ Inference Time: On an H100, processing 1 second of video takes ~6 seconds, so expect ~6 minutes of inference time for a 1-minute video. Use `--max-duration` to limit video length for faster testing.
```bash
# Run inference on a video (dependencies are auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap

# Specify a different model or device
uv run inference.py input_video.mp4 output.mcap --model open-world-agents/Generalist-IDM-1B
uv run inference.py input_video.mp4 output.mcap --device cpu

# Limit video duration for faster testing
uv run inference.py input_video.mp4 output.mcap --max-duration 30
```

| Option | Default | Description |
|---|---|---|
| `--model` | `open-world-agents/Generalist-IDM-1B` | Model path or Hugging Face ID |
| `--device` | `cuda` | Device to run on (`cuda` or `cpu`) |
| `--max-duration` | None | Max video duration in seconds |
| `--max-context-length` | 2048 | Max context length for the model |
| `--time-shift` | 0.1 | Time shift for actions in seconds |
The output is an MCAP file containing predicted keyboard and mouse events with timestamps synchronized to the input video. You can visualize the output using the Dataset Visualizer.
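If you prefer to inspect the predictions programmatically, the output file can be read with the same `OWAMcapReader` shown earlier. This is only a sketch: it assumes the predicted events reuse the dataset's topic names (`keyboard`, `mouse/raw`), so check the topics in your own output file.

```python
# Sketch for inspecting predictions; assumes the output reuses the dataset's
# topic names ("keyboard", "mouse/raw"); check your file if topics differ.
from mcap_owa.highlevel import OWAMcapReader

counts = {"keyboard": 0, "mouse/raw": 0}
with OWAMcapReader("output.mcap") as reader:
    for topic in counts:
        for msg in reader.iter_messages(topics=[topic]):
            counts[topic] += 1
            if counts[topic] <= 3:
                print(topic, msg.decoded)  # first few predicted events per topic

print(counts)  # total number of predicted events per topic
```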
If you find this work useful, please cite our paper:
```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```
