Hosted Agent
The Chrome extension can host a gpt-realtime-2 voice/text agent directly in the browser.
This is the end-user path for actions.json: install the extension, authorize a tab, upload site memory, add your OpenAI key, and talk to an agent that can see and operate the current website through declared actions.
What The Hosted Agent Is
The hosted agent is an extension-owned Realtime session. It uses:
- your OpenAI API key, stored in Chrome extension storage;
- the authorized browser tab as its operating surface;
browser.screenshotto inspect visible page state;- uploaded
actions.json.storageto discover and run current-site actions throughactions.site; - direct primitives such as
locator.element_info,viewport.scroll, andpointer.clickwhen stored actions are missing or insufficient; - extension-local transcript, tool, memory, and diagnostic logs.
The session is owned by an extension offscreen document, not by the page overlay. That means closing or reinjecting the visible overlay should not intentionally stop the live voice session. Use Stop when you want the session to end.
What It Can Do
Current capabilities include:
- voice conversation through
gpt-realtime-2; - a text transcript with one mutable user and assistant bubble while speech deltas arrive, plus a composer for typed input;
- page screenshots after user authorization;
- storage-backed website context and actions through
actions.site; - direct browser primitives for visible page interaction;
- generated overlays and launchers;
- session memory rehydration;
runtime.session.logdiagnostics for transcript, tool calls, failures, screenshots, and lifecycle events.
Start A Session
- Install and authorize the Chrome extension. See Getting Started.
- Click the extension icon to open the popup. In the embedded Settings area, save your OpenAI API key (the OpenAI API key section is open by default).
- Optional but recommended: upload an
actions.json.storagecheckout from the Storage folder section. - Choose Open agent overlay to open the agent pane on the page.
- Press Start voice.
- Allow microphone permission if Chrome asks.
Expected result: the agent pane shows a live voice state and a transcript area.
Agent Pane
The agent pane — the page overlay titled actions.json agent — is the conversation surface.
It contains:
- the main voice control;
- Stop and mute controls;
- a bounded transcript;
- a text composer;
- current session status.
The transcript is for conversation. Successful tool calls do not add transcript lines; the voice launcher shows a transient Using tool state instead. A single line such as Tool <name> failed: <message>. appears only when a tool call fails, and at session start one line reports Bridge tools loaded: <count>. (a count, not a list). Detailed diagnostics belong in runtime.session.log, not as token-by-token chat spam.
Settings (Popup)
All operational configuration lives in the extension popup, in collapsible settings sections:
- OpenAI API key save/delete state;
- Realtime voice selection;
- turn detection (VAD) controls for speech interruption behavior;
- bridge URL for external-agent workflows;
- storage folder Upload and Download;
- memory and log controls.
In the popup all sections are collapsed except the API key. Opened as a top-level page, all sections are expanded. Folder pickers do not work inside iframes, so the storage Upload and Download buttons open the top-level settings page automatically.
Voice and VAD settings apply to the next Realtime session. OpenAI locks the voice once audio has started in a session.
Storage-Backed Context And actions.site
Uploaded storage makes the hosted agent site-aware.
The hosted agent receives one stable site-action tool: actions.site. It uses that tool when you ask it to:
- list actions available for the current website;
- call a named current-site action;
- read storage-backed context actions such as summaries, navigation playbooks, product guides, or link lists.
There is no separate actions.context tool in the current v1 surface. Site context should be exposed as callable actions.site actions.
See Hosted Agent Tools.
Screenshots And Tool Calls
The agent can call browser.screenshot after the tab is authorized. Screenshot data is passed to the Realtime model as image input; large image payloads are not stored in session logs.
Hosted screenshots default to a compact profile: JPEG at quality 60, scaled to at most 960x960 pixels, capped at 180 KB, with a 10 second capture timeout.
For interaction, the agent should prefer stored actions. When it needs lower level control, it can use primitives such as:
locator.element_info;viewport.scroll;pointer.click;dom.list_sections;browser.extract_elements.
Direct generic primitive calls from the hosted agent must include a policy_exception_report argument (kind, intended_tool, actions_json_path, reason). Calls without it are rejected with policy_exception_report_required before reaching the bridge. actions.site calls and internal steps of compound actions are exempt.
Human-observable actions such as clicking and scrolling are paced by the runtime so bursts of requests do not execute faster than a human-visible rhythm.
Session Memory And Logs
The extension keeps a diagnostic session log in Chrome extension storage.
Use runtime.session.log when you need to inspect:
- transcript turns;
- tool call arguments and results;
- routing decisions;
- tool failures;
- screenshot metadata;
- Realtime lifecycle and audio configuration events.
runtime.session.log returns { ok, visitorId, eventCount, events }. The returned events are capped (default 80, maximum 2000) and sanitized: secrets are redacted and image payloads are stripped.
The log is for debugging. It is not a public artifact and may include browsing context from authorized tabs.
Navigation Persistence
The live Realtime session is hosted by an extension offscreen document. The visible page overlay is a controller/view that can reconnect to that durable session.
This design is intended to keep conversation state alive when:
- the overlay is closed and reopened;
- the content script is reinjected;
- the page navigates within the same site;
- the active tab target changes and the extension can retarget tools.
Browser permissions, page lifecycle behavior, and cross-domain navigation can still require user recovery in some cases. If navigation interrupts the session, use Troubleshooting and inspect runtime.session.log.
Safety And Limitations
- The OpenAI key is stored in Chrome extension storage.
- Microphone permission is controlled by Chrome.
- The extension operates only on user-authorized tabs.
- Site actions depend on uploaded storage and current page compatibility.
- Debugger-backed actions are for authoring and repair, not normal product use.
- The project is pre-1.0; schema and runtime behavior may change.