IMPACT BOUNDARY LABS

Boundary labs and scoring experiments.

Small live environments for testing how agents behave when external impact requires boundary decisions.

Boundary Learning Score

A scoring model for observing whether agents adapt after blocked, stale, or conflicted boundary decisions.

Explore the score

Impact Room

A live agent room where actions require state, approval, and boundary decisions.

Open Impact Room

GitHub Adapter

The repository-focused reference adapter: agents can propose work without receiving direct write authority.

Open GitHub demo

Robotics

Planned lab direction for MCP/ROS2 converter work and robotics workflows that need visible action boundaries.

Discuss robotics

Build your own adapter

Wrap your own system as an adapter and connect it to the Core, without giving the agent a direct write path.

Read the guide

Why this matters

Why scoring matters

A blocked request is not only a failure. It is feedback. A useful agent should stop retrying rejected paths, re-read stale state, and propose smaller valid steps. That can reduce wasted agent runtime, repeated tool calls, and operator noise.

Fewer repeated mistakes

Agents should stop retrying paths the boundary already rejected.

Less repeated work

Fewer blocked attempts and stale retries mean less wasted agent work.

Better operator attention

Humans should review real decisions, not repeated noise.

More useful feedback loops

The score makes boundary feedback visible across comparable runs.

Boundary Learning Score

How the score is built

Raw boundary events become opportunity-normalized scoring evidence.

Raw counts are not enough. A harder run can contain more chances to make mistakes. Boundary Learning Score compares mistakes against the opportunities the agent had, then separates single-run cleanliness from adaptation across comparable runs.

Raw boundary events

The base evidence.

stale state
repeated retry
protected scope
too much scope
wasted reads
broad impact without current state

Opportunity profile

How many chances did the agent have to make each kind of mistake?

state opportunities
retry opportunities
impact opportunities
scope opportunities
observation opportunities

Component scores

Raw signals are grouped into explainable score components.

state freshness
retry discipline
impact discipline
scope discipline
observation efficiency
completion efficiency
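
As an illustration of opportunity normalization, the sketch below turns a raw mistake count and an opportunity count into one component score. The formula (1 minus mistakes divided by opportunities), the `ComponentEvidence` type, and the `component_score` function are assumptions made for this example; the page does not publish the actual scoring formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentEvidence:
    mistakes: int       # raw boundary events of one kind (e.g. stale state)
    opportunities: int  # chances the agent had to make that kind of mistake


def component_score(evidence: ComponentEvidence) -> Optional[float]:
    """Assumed formula: 1 - mistakes / opportunities, in the range 0..1.

    Returns None when there were no opportunities, so an untested
    component is not reported as a perfect score.
    """
    if evidence.opportunities <= 0:
        return None
    rate = min(evidence.mistakes / evidence.opportunities, 1.0)
    return 1.0 - rate


# A harder run has more mistakes in absolute terms but a lower mistake rate.
easy_run = ComponentEvidence(mistakes=2, opportunities=4)
hard_run = ComponentEvidence(mistakes=3, opportunities=12)
print(component_score(easy_run), component_score(hard_run))
```

Under this assumed formula the harder run scores higher even though it has more raw mistakes, which is the reason raw counts alone are not compared.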

Run Boundary Fitness

A single-run view of how cleanly the agent behaved.

component-weighted result
single-run cleanliness
feeds adaptation signal
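
One way to read "component-weighted result" is a weighted mean over the component scores of a single run. The sketch below assumes exactly that; the weights, the underscore-style component keys, and the `run_boundary_fitness` function are hypothetical and not taken from the actual scoring model.

```python
from typing import Dict, Optional


def run_boundary_fitness(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted mean of the component scores that have evidence.

    Components scored None (no opportunities in this run) are skipped,
    and the remaining weights are renormalized. Components missing from
    `weights` get an assumed default weight of 1.0.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, score in scores.items():
        if score is None:
            continue
        weight = weights.get(name, 1.0)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical component scores and weights for one run.
scores = {"state_freshness": 1.0, "retry_discipline": 0.5, "scope_discipline": None}
weights = {"state_freshness": 1.0, "retry_discipline": 3.0}
print(run_boundary_fitness(scores, weights))
```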

Challenge Adaptation Signal

Did behavior improve across comparable runs?

improving
mixed
regressing
insufficient data
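
The four labels above could be derived from the single-run fitness values of comparable runs, taken in order. The rule below is a sketch under stated assumptions: the `adaptation_signal` function, the tolerance, and the minimum run count are illustrative choices, not the model's real thresholds.

```python
from typing import List


def adaptation_signal(
    fitness_by_run: List[float],
    tolerance: float = 0.02,  # assumed: smaller changes count as noise
    min_runs: int = 2,        # assumed: at least two runs to compare
) -> str:
    """Label the trend of run fitness across comparable runs."""
    if len(fitness_by_run) < min_runs:
        return "insufficient data"
    # Change in fitness between each pair of consecutive runs.
    deltas = [later - earlier
              for earlier, later in zip(fitness_by_run, fitness_by_run[1:])]
    ups = sum(1 for d in deltas if d > tolerance)
    downs = sum(1 for d in deltas if d < -tolerance)
    if ups > 0 and downs == 0:
        return "improving"
    if downs > 0 and ups == 0:
        return "regressing"
    # Both directions present, or no change beyond the tolerance.
    return "mixed"


print(adaptation_signal([0.5, 0.6, 0.8]))
print(adaptation_signal([0.5, 0.7, 0.6]))
```

In this sketch a flat series is also labeled "mixed", because the page lists no separate label for unchanged behavior.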

Score Confidence

How reliable is this comparison?

normal
preliminary
insufficient data
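
A confidence label of this kind could depend on how much evidence backs the comparison. The sketch below assumes it is driven by the number of comparable runs and the total number of opportunities observed; the `score_confidence` function and every threshold in it are hypothetical.

```python
def score_confidence(
    comparable_runs: int,
    total_opportunities: int,
    min_runs: int = 2,            # assumed: below this, nothing to compare
    normal_runs: int = 5,         # assumed: runs needed for "normal"
    min_opportunities: int = 10,  # assumed: evidence needed for "normal"
) -> str:
    """Map the amount of evidence to one of the three confidence labels."""
    if comparable_runs < min_runs or total_opportunities <= 0:
        return "insufficient data"
    if comparable_runs >= normal_runs and total_opportunities >= min_opportunities:
        return "normal"
    return "preliminary"


print(score_confidence(comparable_runs=3, total_opportunities=8))
```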

What the score measures

Measure whether agents adapt to rules instead of repeatedly violating them.

Repeated blocked attempts

Does the agent keep trying the same rejected action?

Stale retries

Does the agent act on old state, or does it re-read before retrying?

Better-scoped requests

Does the agent reduce blast radius and propose smaller, valid steps?

Required next action

Does the agent follow feedback such as required_next_action before trying again?
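
A minimal sketch of an agent step that respects this feedback follows. Only the field name required_next_action comes from the text above; the decision shape, the "allowed" and "blocked" status values, the action names, and the `next_step` function are assumptions for illustration.

```python
from typing import Dict, Optional, Set


def next_step(
    proposed: str,
    decision: Dict[str, str],
    rejected: Set[str],
) -> Optional[str]:
    """Pick the next action after a boundary decision.

    `rejected` is updated in place so that later steps can avoid
    retrying an action the boundary already refused.
    """
    if decision["status"] == "allowed":
        return proposed
    # Remember the rejection instead of retrying the same path.
    rejected.add(proposed)
    required = decision.get("required_next_action")
    if required and required not in rejected:
        # Follow the boundary's feedback before trying anything else.
        return required
    # No valid follow-up is known; hand the decision to an operator.
    return None


rejected: Set[str] = set()
decision = {"status": "blocked", "required_next_action": "refresh_state"}
print(next_step("write_file", decision, rejected))
```

Here the blocked proposal is recorded and the agent moves to the required action, rather than resubmitting the same request against stale state.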

Operator attention

Does the agent reduce noise so humans can focus on real decisions?

Guardrails

What this does not claim

Does not prove that an agent is safe
Does not prove that generated work is correct
Does not replace human review
Measures only how an agent behaves under a defined boundary challenge

Boundary challenge

Build your own boundary challenge.

Any small target environment can become a boundary-learning challenge. The question is always the same: can the agent reach the goal while making fewer boundary mistakes?

impact door
protected repository paths
robotics MCP/ROS2 converter
scoped database operation
internal approval workflow
local tool execution

Share a scoring idea