D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Suhwan Choi*, Jaeyoon Jung*, Haebin Seong*, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu‡, Yunsung Lee‡

Links: Project Page · arXiv · Demo · Model · Dataset (480p) · Dataset (Original)


News

Dataset

We provide 267 hours of synchronized video, audio, and input events from 29 PC games across diverse genres (FPS, open-world, sandbox, and more).

| Dataset | Resolution | Use Case |
|---|---|---|
| open-world-agents/D2E-480p | 480p 60fps | Vision-action model training |
| open-world-agents/D2E-Original | FHD/QHD | World models, video generation |
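
To browse the recordings in a dataset repo before downloading, here is a minimal sketch using huggingface_hub's standard list_repo_files API (the .mcap filter below is illustrative):

from huggingface_hub import list_repo_files

# List every file in the 480p dataset repo, then keep the MCAP recordings.
files = list_repo_files("open-world-agents/D2E-480p", repo_type="dataset")
recordings = [f for f in files if f.endswith(".mcap")]
print(recordings[:5])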

What's Included

  • Video + Audio: H.264 encoded at 60fps with game audio
  • Input Events: Keyboard (press/release), mouse (clicks, coordinates, raw HID deltas)—all with nanosecond timestamps
  • OWAMcap Format: Built on MCAP, indexed for fast random access

Quick Start

pip install mcap-owa-support owa-msgs huggingface_hub

from huggingface_hub import hf_hub_download
from mcap_owa.highlevel import OWAMcapReader

# Download a sample recording
mcap_file = hf_hub_download(
    repo_id="open-world-agents/D2E-480p",
    filename="Apex_Legends/0805_01.mcap",
    repo_type="dataset"
)

with OWAMcapReader(mcap_file) as reader:
    # Load a video frame
    for msg in reader.iter_messages(topics=["screen"]):
        screen = msg.decoded
        screen.resolve_relative_path(mcap_file)
        frame = screen.load_frame_array()  # numpy array (H, W, 3)
        break

    # Read keyboard events
    for msg in reader.iter_messages(topics=["keyboard"]):
        print(msg.decoded)  # KeyboardEvent(event_type='press', vk=87)
        break

    # Read raw mouse events
    for msg in reader.iter_messages(topics=["mouse/raw"]):
        print(msg.decoded)  # RawMouseEvent(last_x=12, last_y=-3, button_flags=0)
        break
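
Every event carries a nanosecond timestamp. Below is a minimal sketch for converting those to seconds from the start of a recording, assuming each record yielded by iter_messages exposes a nanosecond .timestamp field per the MCAP log-time convention (verify the attribute name against the mcap-owa-support docs):

from mcap_owa.highlevel import OWAMcapReader

# `mcap_file` is the path downloaded in the snippet above.
with OWAMcapReader(mcap_file) as reader:
    start_ns = None
    for msg in reader.iter_messages(topics=["keyboard"]):
        if start_ns is None:
            start_ns = msg.timestamp  # assumed nanosecond log time
        print(f"{(msg.timestamp - start_ns) / 1e9:.3f}s  {msg.decoded}")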

Visualize

Explore recordings in your browser with synchronized keyboard/mouse overlay: Open in Dataset Visualizer


For Training

We provide owa-data, a data pipeline that converts this dataset into HuggingFace Datasets with tokenization and sequence packing applied, ready for use with a PyTorch DataLoader, as sketched below.
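
As a rough sketch of the intended downstream usage (this is not the owa-data pipeline itself, and the on-disk path below is a hypothetical placeholder for its processed output):

from datasets import load_from_disk
from torch.utils.data import DataLoader

# Hypothetical path standing in for the owa-data pipeline's saved output.
ds = load_from_disk("path/to/owa-data-output")
ds = ds.with_format("torch")  # yield PyTorch tensors
loader = DataLoader(ds, batch_size=8, shuffle=True)
batch = next(iter(loader))  # ready to feed into a model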

Inference

Run action prediction on any gameplay video using inference.py. The script uses uv for dependency management—no manual installation required.

Prerequisites

  • uv
  • FFmpeg (for video preprocessing)
  • CUDA-capable GPU (recommended, ~8GB+ VRAM)

⏱️ Inference Time: On an H100, processing 1 second of video takes ~6 seconds, so a 1-minute video takes roughly 6 minutes. Use --max-duration to limit video length for faster testing.

Quick Start

# Run inference on a video (dependencies are auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap

# Specify a different model or device
uv run inference.py input_video.mp4 output.mcap --model open-world-agents/Generalist-IDM-1B
uv run inference.py input_video.mp4 output.mcap --device cpu

# Limit video duration for faster testing
uv run inference.py input_video.mp4 output.mcap --max-duration 30

Options

| Option | Default | Description |
|---|---|---|
| --model | open-world-agents/Generalist-IDM-1B | Model path or Hugging Face ID |
| --device | cuda | Device to run on (cuda or cpu) |
| --max-duration | None | Max video duration in seconds |
| --max-context-length | 2048 | Max context length for the model |
| --time-shift | 0.1 | Time shift for actions in seconds |

Output Format

The output is an MCAP file containing predicted keyboard and mouse events with timestamps synchronized to the input video. You can visualize the output using the Dataset Visualizer.
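
For programmatic inspection, the predictions can be read back with the same reader API as in Quick Start. A sketch, assuming the output file reuses the dataset's "keyboard" and "mouse/raw" topic names:

from mcap_owa.highlevel import OWAMcapReader

# Assumes the output MCAP reuses the dataset's topic names; adjust if needed.
with OWAMcapReader("output.mcap") as reader:
    for msg in reader.iter_messages(topics=["keyboard", "mouse/raw"]):
        print(msg.decoded)  # predicted keyboard/mouse events with timestamps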

Citation

If you find this work useful, please cite our paper:

@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
