RoboPlayground

Our Team

Yi Ru Wang*¹,²

Carter Ung*¹

Evan Gubarev¹

Christopher Tan¹

Siddhartha Srinivasa¹†

Dieter Fox¹,²†

¹University of Washington, ²Allen Institute for AI. *Equal contribution. †Equal advising.

Abstract

Evaluation of robotic manipulation systems has largely relied on fixed benchmarks authored by a small number of experts, where task instances, constraints, and success criteria are predefined and difficult to extend. This paradigm limits who can shape evaluation and obscures how policies respond to user-authored variations in task intent, constraints, and notions of success.

We present RoboPlayground, a framework that enables users to author executable manipulation tasks using natural language within a structured physical domain. Natural language instructions are compiled into reproducible task specifications with explicit asset definitions, initialization distributions, and success predicates. Each instruction defines a structured family of related tasks, enabling controlled semantic and behavioral variation while preserving executability and comparability. A user study shows that the language-driven interface is easier to use and imposes lower cognitive workload than programming-based and code-assist baselines. Evaluating learned policies on language-defined task families reveals generalization failures not apparent under fixed benchmark evaluations. Finally, we show that task diversity scales with contributor diversity rather than task count alone, enabling evaluation spaces to grow continuously through crowd-authored contributions.

System Overview

Language-Driven Task Authoring

Users express task intent, constraints, and success criteria in natural language. Each instruction is compiled into an executable task specification with explicit asset definitions, initialization distributions, and success predicates — enabling reproducible evaluation without writing a single line of code.
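As a rough illustration of what such a compiled specification might contain, the sketch below bundles the three ingredients named above (asset definitions, initialization distributions, and a success predicate) into one object. Every name here (`Asset`, `UniformBox`, `TaskSpec`, `reset`) is hypothetical and chosen for illustration; this is not RoboPlayground's actual schema or API.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Position = Tuple[float, float, float]
State = Dict[str, Position]  # asset name -> (x, y, z) position in metres


@dataclass(frozen=True)
class Asset:
    """Hypothetical asset definition: one object placed in the scene."""
    name: str
    shape: str
    color: str
    size: float  # edge length in metres


@dataclass(frozen=True)
class UniformBox:
    """Hypothetical initialization distribution: uniform over a table region."""
    low: Tuple[float, float]
    high: Tuple[float, float]
    z: float

    def sample(self, rng: random.Random) -> Position:
        x = rng.uniform(self.low[0], self.high[0])
        y = rng.uniform(self.low[1], self.high[1])
        return (x, y, self.z)


@dataclass
class TaskSpec:
    """Hypothetical compiled task: assets, initial-state sampler, success test."""
    instruction: str
    assets: Tuple[Asset, ...]
    init: Dict[str, UniformBox]
    success: Callable[[State], bool]

    def reset(self, seed: int) -> State:
        # Seeding the sampler makes every initial state reproducible.
        rng = random.Random(seed)
        return {a.name: self.init[a.name].sample(rng) for a in self.assets}
```

Under these assumptions, reproducibility comes from the seed: calling `reset` twice with the same seed yields the same initial state, so different policies can be compared on identical episodes.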

RainbowCubeLineArrangementTask

by Carter Ung · Feb 6, 2026

Arrange seven distinct colored cubes (red, orange, yellow, green, blue, indigo, violet) into a single straight line on the table in rainbow order. All cubes should be aligned and evenly spaced along the line.

Additional: Place a white cube on top of every cube whose color starts with a vowel (orange, indigo).

Tags: Table, ColoredCube, both, compositional, bimanual
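To make the idea of a success predicate concrete for this task, here is a minimal sketch of how the instruction could be checked geometrically. It is not the code the system generates; the function name, the state layout (color mapped to cube-center position), the 5 cm cube size, and the 1 cm tolerance are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

Position = Tuple[float, float, float]

RAINBOW = ("red", "orange", "yellow", "green", "blue", "indigo", "violet")
# Colors whose names start with a vowel: orange and indigo.
VOWEL_COLORS = tuple(c for c in RAINBOW if c[0] in "aeiou")


def rainbow_line_success(
    cubes: Dict[str, Position],
    white_cubes: Sequence[Position],
    cube_size: float = 0.05,  # assumed edge length in metres
    tol: float = 0.01,        # assumed position tolerance in metres
) -> bool:
    pts = [cubes[c] for c in RAINBOW]
    x0, y0, _ = pts[0]
    x1, y1, _ = pts[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    if length < tol:
        return False
    # Unit vector along the line from the red cube to the violet cube.
    ux, uy = (x1 - x0) / length, (y1 - y0) / length
    step = length / (len(pts) - 1)
    for i, (x, y, _) in enumerate(pts):
        dx, dy = x - x0, y - y0
        along = dx * ux + dy * uy         # signed distance along the line
        across = abs(dx * uy - dy * ux)   # perpendicular distance from the line
        # Cube i must sit on the line at i * step, which enforces
        # straightness, even spacing, and rainbow order together.
        if across > tol or abs(along - i * step) > tol:
            return False
    # Every vowel-colored cube needs a white cube resting directly on top.
    for color in VOWEL_COLORS:
        bx, by, bz = cubes[color]
        stacked = any(
            math.hypot(wx - bx, wy - by) <= tol
            and abs(wz - (bz + cube_size)) <= tol
            for wx, wy, wz in white_cubes
        )
        if not stacked:
            return False
    return True
```

Checking each cube against its expected slot `i * step` along the red-to-violet direction is one design choice among several; it folds the alignment, spacing, and ordering requirements into a single comparison per cube.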

Preview: Initial State ➔ Goal State

Policy Evaluation on Generated Tasks

We evaluate learned policies on language-defined task families to reveal generalization failures not apparent under fixed benchmark evaluations. Toggle between training and evaluation tasks, select a model, and browse individual rollout episodes.
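One simple way to surface such failures is to compare each model's success rate on training tasks with its success rate on held-out evaluation tasks. The sketch below assumes a hypothetical `Rollout` record with `model`, `task`, `split`, and `success` fields; it illustrates the bookkeeping only and does not reflect the page's actual evaluation code or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Rollout(NamedTuple):
    """Hypothetical record of one evaluation episode."""
    model: str
    task: str
    split: str     # "train" or "eval"
    success: bool


def success_rates(rollouts: Iterable[Rollout]) -> Dict[Tuple[str, str], float]:
    """Return the success rate for every (model, split) pair."""
    counts = defaultdict(lambda: [0, 0])  # (model, split) -> [successes, episodes]
    for r in rollouts:
        entry = counts[(r.model, r.split)]
        entry[0] += int(r.success)
        entry[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def generalization_gap(rates: Dict[Tuple[str, str], float], model: str) -> float:
    """Training success rate minus evaluation success rate for one model."""
    return rates[(model, "train")] - rates[(model, "eval")]
```

A large positive gap for a model would indicate that it succeeds on the tasks it was trained on but degrades on the language-defined variations.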

Chart: per-task success rate across models.

Results

We evaluate RoboPlayground along three axes: usability, diagnostic value for policy generalization, and scalability of task creation.

Try It Out!

RoboPlayground lets anyone author executable manipulation tasks using natural language. No simulation expertise required — just describe what you want, and the system compiles it into a reproducible, validated task specification. Everything is open-source.

@misc{wang2026roboplaygrounddemocratizingroboticevaluation,
      title={RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains},
      author={Yi Ru Wang and Carter Ung and Evan Gubarev and Christopher Tan and Siddhartha Srinivasa and Dieter Fox},
      year={2026},
      eprint={2604.05226},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.05226},
}

For questions, please contact Carter Ung or Yi Ru Wang.
