Fixed benchmarks. Limited tasks.
What if anyone could create them?
Goal state snapshot — validated ✔
import numpy as np


class ColorBlockPileSorting(RoboEvalEnv):
    """Sort colored cubes into distinct piles by color."""

    _GROUP_MAX_DIST = 0.12
    _MIN_SEPARATION = 0.18

    def _initialize_env(self):
        self.table = self._preset.get_props(Table)[0]
        # Three cubes per color.
        self.red_cubes = [ColoredCube(self._mojo) for _ in range(3)]
        self.green_cubes = [ColoredCube(self._mojo) for _ in range(3)]
        self.blue_cubes = [ColoredCube(self._mojo) for _ in range(3)]
        self.cubes = self.red_cubes + self.green_cubes + self.blue_cubes
        for c in self.red_cubes:
            c.set_color("red")
        for c in self.green_cubes:
            c.set_color("green")
        for c in self.blue_cubes:
            c.set_color("blue")

    def _on_reset(self):
        # Scatter all cubes around a common center so the colors start mixed.
        rng = np.random.default_rng(self.seed)
        center = np.array([0.55, 0.00])
        for cube in self.cubes:
            xy = center + rng.normal(scale=0.035, size=(2,))
            cube.set_pose(position=np.array([*xy, 0.97]))

    def _success(self):
        # Each color group must be compact.
        red_spread = self._max_pairwise(self.red_cubes)
        green_spread = self._max_pairwise(self.green_cubes)
        blue_spread = self._max_pairwise(self.blue_cubes)
        compact = all(s <= self._GROUP_MAX_DIST
                      for s in [red_spread, green_spread, blue_spread])
        # Pile centroids must be ordered red, green, blue along x.
        r_c = self._centroid(self.red_cubes)
        g_c = self._centroid(self.green_cubes)
        b_c = self._centroid(self.blue_cubes)
        ordered = r_c[0] < g_c[0] < b_c[0]
        return compact and ordered
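The success check above relies on two helpers, `_max_pairwise` and `_centroid`, whose bodies are not shown. The following is a minimal sketch of what they might compute, assuming each cube can be reduced to an xyz position. The standalone function names, the `piles_sorted` wrapper, and the use of plain position arrays instead of cube objects are illustrative assumptions, not the RoboPlayground implementation.

```python
import itertools

import numpy as np


def max_pairwise(positions):
    """Largest Euclidean distance between any two points (0.0 for fewer than two)."""
    pts = np.asarray(positions, dtype=float)
    return max(
        (float(np.linalg.norm(a - b)) for a, b in itertools.combinations(pts, 2)),
        default=0.0,
    )


def centroid(positions):
    """Mean position of a group of points."""
    return np.asarray(positions, dtype=float).mean(axis=0)


def piles_sorted(red, green, blue, group_max_dist=0.12):
    """Hypothetical stand-in for the success check: compact piles, ordered along x."""
    groups = [red, green, blue]
    compact = all(max_pairwise(g) <= group_max_dist for g in groups)
    r_c, g_c, b_c = (centroid(g) for g in groups)
    ordered = r_c[0] < g_c[0] < b_c[0]
    return bool(compact and ordered)


# Example: three tight piles laid out left to right.
red = [[0.30, 0.00, 0.97], [0.33, 0.02, 0.97], [0.31, -0.02, 0.97]]
green = [[0.55, 0.00, 0.97], [0.57, 0.02, 0.97], [0.56, -0.02, 0.97]]
blue = [[0.80, 0.00, 0.97], [0.82, 0.02, 0.97], [0.81, -0.02, 0.97]]
print(piles_sorted(red, green, blue))
```

Swapping the red and blue piles in the example breaks the left-to-right ordering, so the same check would then fail even though every pile is still compact.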
¹University of Washington, ²Allen Institute for AI. *Equal contribution, †Equal advising.
Evaluation of robotic manipulation systems has largely relied on fixed benchmarks authored by a small number of experts, where task instances, constraints, and success criteria are predefined and difficult to extend. This paradigm limits who can shape evaluation and obscures how policies respond to user-authored variations in task intent, constraints, and notions of success.
We present RoboPlayground, a framework that enables users to author executable manipulation tasks using natural language within a structured physical domain. Natural language instructions are compiled into reproducible task specifications with explicit asset definitions, initialization distributions, and success predicates. Each instruction defines a structured family of related tasks, enabling controlled semantic and behavioral variation while preserving executability and comparability. A user study shows that the language-driven interface is easier to use and imposes lower cognitive workload than programming-based and code-assist baselines. Evaluating learned policies on language-defined task families reveals generalization failures not apparent under fixed benchmark evaluations. Finally, we show that task diversity scales with contributor diversity rather than task count alone, enabling evaluation spaces to grow continuously through crowd-authored contributions.
Users express task intent, constraints, and success criteria in natural language. Each instruction is compiled into an executable task specification with explicit asset definitions, initialization distributions, and success predicates — enabling reproducible evaluation without writing a single line of code.
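To make the three parts of a compiled specification concrete, here is a rough sketch of how assets, an initialization distribution, and a success predicate could be bundled together. The `TaskSpec` class, its field names, and the example instruction are hypothetical and chosen only for illustration; the actual specification format used by RoboPlayground is not described here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import numpy as np

Scene = Dict[str, np.ndarray]  # asset name -> xyz position (assumed representation)


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for a compiled task."""
    instruction: str
    assets: List[str]
    sample_initial_state: Callable[[np.random.Generator], Scene]
    success: Callable[[Scene], bool]

    def reset(self, seed: int) -> Scene:
        # Seeding the generator makes the sampled initial state reproducible.
        return self.sample_initial_state(np.random.default_rng(seed))


ASSETS = ["red_cube", "blue_cube"]


def sample_scene(rng: np.random.Generator) -> Scene:
    # Scatter every asset around a common table location.
    center = np.array([0.55, 0.0])
    return {
        name: np.array([*(center + rng.normal(scale=0.035, size=2)), 0.97])
        for name in ASSETS
    }


def red_left_of_blue(scene: Scene) -> bool:
    return bool(scene["red_cube"][0] < scene["blue_cube"][0])


spec = TaskSpec(
    instruction="Move the red cube to the left of the blue cube.",
    assets=ASSETS,
    sample_initial_state=sample_scene,
    success=red_left_of_blue,
)

first = spec.reset(seed=0)
second = spec.reset(seed=0)
print(all(np.array_equal(first[k], second[k]) for k in first))
```

Keeping the sampler and the predicate as explicit, seedable functions is one way to get the reproducibility the text describes: the same seed yields the same initial scene, and success is decided by code rather than by inspection.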
Arrange seven distinct colored cubes (red, orange, yellow, green, blue, indigo, violet) into a single straight line on the table in rainbow order. All cubes should be aligned and evenly spaced along the line.
Additional: Place a white cube on top of every cube whose color starts with a vowel (orange, indigo).
Initial State
Goal State
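As an illustration of how the instruction above might translate into a success predicate, the sketch below checks that the seven cubes are collinear, in rainbow order, and evenly spaced, and that a white cube sits on each vowel-colored cube. The function names, the tolerances, and the `cube_size` value are assumptions made for this example, not values taken from RoboPlayground.

```python
import numpy as np

RAINBOW = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
VOWEL_COLORS = [c for c in RAINBOW if c[0] in "aeiou"]  # orange, indigo


def in_rainbow_line(positions, line_tol=0.01, gap_tol=0.01):
    """positions maps color -> xyz. Tolerances are assumed values."""
    pts = np.array([positions[c][:2] for c in RAINBOW], dtype=float)
    direction = pts[-1] - pts[0]          # line runs from red to violet
    length = np.linalg.norm(direction)
    if length == 0:
        return False
    direction = direction / length
    rel = pts - pts[0]
    along = rel @ direction                                      # position along the line
    across = rel[:, 0] * direction[1] - rel[:, 1] * direction[0]  # offset from the line
    gaps = np.diff(along)
    straight = np.all(np.abs(across) <= line_tol)
    ordered = np.all(gaps > 0)
    even = (gaps.max() - gaps.min()) <= gap_tol
    return bool(straight and ordered and even)


def white_cubes_stacked(positions, white_positions, cube_size=0.05, tol=0.01):
    """Each vowel-colored cube needs a white cube one cube height above it."""
    for color in VOWEL_COLORS:
        base = np.asarray(positions[color], dtype=float)
        target = base + np.array([0.0, 0.0, cube_size])
        if not any(
            np.linalg.norm(np.asarray(w, dtype=float)[:2] - target[:2]) <= tol
            and abs(float(w[2]) - target[2]) <= tol
            for w in white_positions
        ):
            return False
    return True


# Example goal state: cubes 8 cm apart along x, white cubes on orange and indigo.
goal = {c: np.array([0.30 + 0.08 * i, 0.0, 0.97]) for i, c in enumerate(RAINBOW)}
whites = [goal["orange"] + [0.0, 0.0, 0.05], goal["indigo"] + [0.0, 0.0, 0.05]]
print(in_rainbow_line(goal), white_cubes_stacked(goal, whites))
```

Because the line direction is taken from the red cube to the violet cube, the check accepts the rainbow laid out in any direction on the table, as long as the order along that direction is preserved.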
We evaluate learned policies on language-defined task families to reveal generalization failures not apparent under fixed benchmark evaluations. Toggle between training and evaluation tasks, select a model, and browse individual rollout episodes.
Episode Overview
Task — Success Rate Across Models
We evaluate RoboPlayground along three axes: usability, diagnostic value for policy generalization, and scalability of task creation.
@misc{wang2026roboplaygrounddemocratizingroboticevaluation,
  title={RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains},
  author={Yi Ru Wang and Carter Ung and Evan Gubarev and Christopher Tan and Siddhartha Srinivasa and Dieter Fox},
  year={2026},
  eprint={2604.05226},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2604.05226},
}
Please contact Carter Ung or Yi Ru Wang.