Going to Production¶
Running a HarnessAgent on your laptop is easy. Shipping it to production is another story: replicas must share sessions, users must stay isolated, untrusted code must be sandboxed, and pods must be able to resume mid-conversation after a restart. This page covers only what changes between single-node and distributed production: which components must be swapped, what to swap them with, and why the builder throws IllegalStateException when you miss something.
In the source, any component documented as “distributed-friendly”, “cross-replica”, or “shared store” — RemoteFilesystemSpec, SandboxDistributedOptions, RedisSandboxExecutionGuard, SandboxSnapshotSpec, RedisStore / JdbcStore, and so on — is purpose-built for the scenarios on this page.
At a glance: single-node defaults vs. distributed production¶
| Dimension | Single-node default (dev / demo) | Distributed production swap |
|---|---|---|
| | | |
| Filesystem | | |
| | | |
| Skill source | | |
| Sandbox state | | distributed Session-backed store |
| Sandbox snapshots | | |
| Sandbox exec serialization | none needed in-process | |
| Observability | no tracing by default | |
| Graceful shutdown | | same + |
The key validation chain:
- filesystem(RemoteFilesystemSpec) without swapping session(...) → build() throws IllegalStateException telling you to swap in RedisSession.
- filesystem(SandboxFilesystemSpec) without swapping session(...) → same.
- filesystem(SandboxFilesystemSpec) with NoopSnapshotSpec → throws, demands an explicit snapshot.

To bypass these checks in single-node tests:
.sandboxDistributed(SandboxDistributedOptions.builder().requireDistributed(false).build())
The checks live in HarnessAgentBuilderSupport#validateDistributedSandboxConfig — deliberately fail-fast so “works in dev, loses state in prod” cannot happen.
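The fail-fast idea behind this chain can be sketched in plain Java. This is not the harness's actual validation code: MiniAgentConfig and its enums are hypothetical stand-ins that only mirror the three rules listed above, and the sketch assumes that requireDistributed(false) skips every check.

```java
// Hypothetical stand-in types; the real checks live in HarnessAgentBuilderSupport.
final class MiniAgentConfig {
    enum Filesystem { LOCAL, REMOTE, SANDBOX }
    enum Session { WORKSPACE, REDIS }
    enum Snapshot { NOOP, OSS }

    Filesystem filesystem = Filesystem.LOCAL;
    Session session = Session.WORKSPACE;
    Snapshot snapshot = Snapshot.NOOP;
    boolean requireDistributed = true; // assumption: false disables all checks below

    /** Throws at build time instead of silently losing state at run time. */
    void validate() {
        if (!requireDistributed || filesystem == Filesystem.LOCAL) {
            return; // single-node setup or explicit opt-out: nothing to check
        }
        if (session == Session.WORKSPACE) {
            throw new IllegalStateException(
                    filesystem + " filesystem needs a distributed session, e.g. RedisSession");
        }
        if (filesystem == Filesystem.SANDBOX && snapshot == Snapshot.NOOP) {
            throw new IllegalStateException("sandbox filesystem needs an explicit snapshot spec");
        }
    }
}
```

Validating once, when the configuration is assembled, keeps the error next to the misconfiguration rather than surfacing later as missing state on another replica.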
1. Session backend: put AgentState somewhere durable first¶
AgentState (conversation context, compaction summary, permission rules, Plan Mode state, tool state) only survives across processes through Session.
| Implementation | Module | When to use |
|---|---|---|
| | | unit tests; everything dies on process exit |
| | | single-machine dev; one directory per |
| | | HarnessAgent default; writes to |
| | | multi-replica production default; supports Jedis / Lettuce / Redisson (Standalone / Cluster / Sentinel) |
| | | when sessions must live in a relational store (audit / reporting / joins) |
RedisSession works with any of the three client adapters through RedisSession.builder():
import io.agentscope.core.session.redis.RedisSession;
import redis.clients.jedis.JedisPooled;
// Jedis Standalone
Session session = RedisSession.builder()
.jedisClient(new JedisPooled("redis://localhost:6379"))
.keyPrefix("myapp:session:")
.build();
// Lettuce Cluster (better for write-heavy)
// .lettuceClusterClient(RedisClusterClient.create(...))
// Redisson (if you already use Redisson elsewhere)
// .redissonClient(redisson)
SessionKey design. SimpleSessionKey.of(sessionId) only covers single-tenant. In production, implement SessionKey and encode tenant / user / agent into the identifier so multi-tenant calls can’t cross-read — RedisSession uses it as part of the Redis key, MysqlSession uses it as the primary key:
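class TenantSessionKey implements SessionKey {
    private final String tenantId, userId, agentId, sessionId;
    TenantSessionKey(String tenantId, String userId, String agentId, String sessionId) {
        this.tenantId = tenantId; this.userId = userId;
        this.agentId = agentId; this.sessionId = sessionId;
    }
    // Every component is part of the identifier, so tenants cannot cross-read.
    @Override public String toIdentifier() {
        return tenantId + ":" + userId + ":" + agentId + ":" + sessionId;
    }
}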
Full mechanics in Harness — Context.
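One caveat with plain concatenation: if a component can itself contain ":", two different tuples can produce the same identifier (tenant "a:b" with user "c" versus tenant "a" with user "b:c"). A minimal sketch of one way to avoid that, assuming you control the identifier format; SafeKeyEncoder and its escaping scheme are hypothetical and not part of the library:

```java
import java.util.StringJoiner;

// Hypothetical helper: joins key parts with ':' after escaping '\' and ':' inside each part.
final class SafeKeyEncoder {
    private SafeKeyEncoder() {}

    static String join(String... parts) {
        StringJoiner joiner = new StringJoiner(":");
        for (String part : parts) {
            // Escape the escape character first, then the delimiter.
            joiner.add(part.replace("\\", "\\\\").replace(":", "\\:"));
        }
        return joiner.toString();
    }
}
```

A toIdentifier() implementation could then return SafeKeyEncoder.join(tenantId, userId, agentId, sessionId). If your IDs are guaranteed to be delimiter-free (UUIDs, numeric IDs), the simpler concatenation is sufficient.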
3. Remote-mode BaseStore backends: KV choice — and why OSS is the wrong fit¶
RemoteFilesystemSpec sits on top of a BaseStore interface. Two built-in implementations:
| Implementation | Dependency | Concurrency safety | Use it when |
|---|---|---|---|
| | | Lua-based CAS | the default; multi-replica sharing |
| | | single-statement CAS UPDATE | existing relational infra / need joins |
| | — | — | tests |
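Both concurrency-safety entries in the table rely on compare-and-set (CAS): a write succeeds only if the stored version is still the one the writer last read. The sketch below illustrates that idea inside a single JVM. VersionedStore is a hypothetical class, not the BaseStore API; ConcurrentHashMap.compute stands in for the atomicity that a Lua script or a single conditional UPDATE statement would provide.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-process illustration of version-checked writes.
final class VersionedStore {
    static final class Entry {
        final String value;
        final long version;
        Entry(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentHashMap<String, Entry> items = new ConcurrentHashMap<>();

    Entry get(String key) {
        return items.get(key);
    }

    /** Writes only if the stored version equals expectedVersion (0 means "key must be absent"). */
    boolean compareAndSet(String key, long expectedVersion, String newValue) {
        boolean[] swapped = {false};
        // compute() runs atomically per key, so check and write cannot interleave.
        items.compute(key, (k, current) -> {
            long currentVersion = current == null ? 0 : current.version;
            if (currentVersion != expectedVersion) {
                return current; // stale writer: keep the existing entry untouched
            }
            swapped[0] = true;
            return new Entry(newValue, currentVersion + 1);
        });
        return swapped[0];
    }
}
```

A caller that gets false re-reads the entry, reapplies its change, and retries with the fresh version.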
// Redis
BaseStore store = new RedisStore(new JedisPooled("redis://prod-redis:6379"));
// MySQL (same agentscope_store table; schema bootstrapped automatically)
BaseStore store = JdbcStore.builder(dataSource)
.initializeSchema(true)
.build();
HarnessAgent agent = HarnessAgent.builder()
.name("multi-tenant-agent")
.model(model)
.workspace(workspace)
.session(RedisSession.builder().jedisClient(jedis).build())
.filesystem(new RemoteFilesystemSpec(store)
.isolationScope(IsolationScope.USER)
.workspaceIndex(WorkspaceIndex.open(workspace))) // speeds up ls/glob
.build();
What about OSS / NAS / S3?¶
Do not implement a BaseStore against OSS — MEMORY.md / memory/YYYY-MM-DD.md / agents/<id>/context/<sid>/ get written several times a second; OSS latency and per-request cost will blow up immediately. The correct division of labour:
| Data shape | Backend | Owner |
|---|---|---|
| High-frequency small KV (memory, session snapshots, task records) | Redis / MySQL ( | |
| Large objects (whole sandbox workspace tar, tens of MB) | OSS / S3 | |
| Cross-node shared volume (multiple sandbox instances mounting the same dir) | NAS / EFS | |
RemoteFilesystemSpec routing table¶
To prevent key collisions across subsystems, the spec slices the workspace into independent namespace segments:
| Workspace path | Namespace segment |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| Extra: | derived automatically |
Each segment is then bucketed by IsolationScope (USER → agents/<agentId>/users/<userId>/). A Redis key ends up looking like agentscope:store:item:agents\0X\0users\0alice\0memory\0memory/2026-06-02.md.
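The layout of that example key can be reproduced with a few lines of Java. This is a reconstruction from the single example above, not the library's key-building code: StoreKeys is a hypothetical helper, and it covers only the USER scope because that is the only bucket format shown here.

```java
// Hypothetical helper that rebuilds the USER-scoped key layout from the example key.
final class StoreKeys {
    private static final String PREFIX = "agentscope:store:item:"; // taken from the example key
    private static final String SEP = "\0"; // NUL separates namespace parts, as in the example

    private StoreKeys() {}

    /** agents/<agentId>/users/<userId>/ bucket, then the namespace segment, then the item path. */
    static String userScoped(String agentId, String userId, String segment, String path) {
        return PREFIX + String.join(SEP, "agents", agentId, "users", userId, segment, path);
    }
}
```

Calling StoreKeys.userScoped("X", "alice", "memory", "memory/2026-06-02.md") yields the key shown above. Because NUL cannot appear in normal path components, it works as a separator that cannot collide with the path itself.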
CompositeFilesystem: two-layer reads + write-through¶
RemoteFilesystemSpec.toFilesystem(...) actually produces a CompositeFilesystem: a base LocalFilesystem without shell (fallback for local templates) plus one OverlayFilesystem per route (upper = RemoteFilesystem, lower = read-only LocalFilesystem template).
Effect: writes always go to Remote; reads check Remote first, fall back to the local template. That is the “two-layer read architecture” described in Workspace instantiated for Remote mode — the local <workspace>/AGENTS.md is a seed (synced via team git), and Remote takes over as soon as it has been written to.
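The read and write behaviour of one overlay route can be modelled with two maps. OverlaySketch is a hypothetical illustration of the layering only; it does not use the real OverlayFilesystem or RemoteFilesystem classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical two-layer model: a writable upper layer over a read-only template.
final class OverlaySketch {
    private final Map<String, String> upper = new HashMap<>(); // stands in for the remote store
    private final Map<String, String> lower;                   // read-only local template

    OverlaySketch(Map<String, String> localTemplate) {
        this.lower = Map.copyOf(localTemplate);
    }

    /** Reads check the upper layer first and fall back to the template. */
    Optional<String> read(String path) {
        String remote = upper.get(path);
        return Optional.ofNullable(remote != null ? remote : lower.get(path));
    }

    /** Writes always go to the upper layer; the template is never modified. */
    void write(String path, String content) {
        upper.put(path, content);
    }
}
```

With a template containing AGENTS.md, the first read returns the seed content, and every read after the first write returns the written content.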
WorkspaceIndex: optional SQLite index¶
.filesystem(new RemoteFilesystemSpec(store).workspaceIndex(WorkspaceIndex.open(workspace)))
Speeds up ls / glob / exists / grep under Remote mode. Without it, every call scans the full KV. WorkspaceIndex is a best-effort SQLite file (under <workspace>/.index/); failures degrade silently without affecting correctness.
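"Best-effort" here means the index is an accelerator, never the source of truth. A minimal sketch of that pattern, assuming a listing operation that can either consult an index or scan all keys; BestEffortListing and its parameters are hypothetical and unrelated to the real WorkspaceIndex API:

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical illustration of "index first, full scan on failure".
final class BestEffortListing {
    private BestEffortListing() {}

    /** Tries the index first; any index failure falls back to scanning every key. */
    static List<String> list(String prefix, Supplier<List<String>> indexLookup, List<String> allKeys) {
        try {
            return indexLookup.get();
        } catch (RuntimeException indexFailure) {
            // Degrade silently: slower, but the result is still correct.
            return allKeys.stream()
                    .filter(key -> key.startsWith(prefix))
                    .collect(Collectors.toList());
        }
    }
}
```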
4. Skill marketplaces: which SkillRepository to pick¶
Skills compose from low to high priority (details in Skill):
| Layer | Source | Configured by | Use it for |
|---|---|---|---|
| 1 | Project global | | personal dev box; |
| 2 | Marketplace | | cross-project sharing |
| 3 | Workspace shared | | project-specific; checked into git |
| 4 | Per-user | | user-level override |
Marketplace backends¶
| Repository | Module | Notes | Best for |
|---|---|---|---|
| | | team git repo; pulls only when HEAD changes; read-only distribution | early stage / small teams; review skill changes via PR |
| | | | platform-side central governance; multi-team multi-agent |
| | | online distribution + config-center change subscription; | Aliyun ecosystem; “change once, take effect fleet-wide” |
| | | shipped with the JAR; Spring Boot fat-JAR compatible | hard-bound capabilities baked into the product |
HarnessAgent agent = HarnessAgent.builder()
// ...
.skillRepository(new GitSkillRepository("https://github.com/your-org/team-skills.git"))
.skillRepository(MysqlSkillRepository.builder(dataSource)
.databaseName("agentscope")
.skillsTableName("skills")
.createIfNotExist(true)
.writeable(false) // read-only distribution; recommended for production
.build())
.build();
skillRepository(...) is additive; later registrations win on name collisions.
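The "later registrations win" rule behaves like merging maps in registration order. The sketch below models each repository as a plain name-to-body map; SkillMergeSketch is hypothetical and says nothing about how the harness actually stores skills.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of additive registration with last-writer-wins on name collisions.
final class SkillMergeSketch {
    private SkillMergeSketch() {}

    /** Each repository is modelled as name -> skill body; later repositories overwrite earlier ones. */
    static Map<String, String> merge(List<Map<String, String>> repositoriesInRegistrationOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> repository : repositoriesInRegistrationOrder) {
            merged.putAll(repository); // same name: the later repository replaces the earlier entry
        }
        return merged;
    }
}
```

In the builder example above, a skill defined in both the git repository and the MySQL repository would therefore resolve to the MySQL version, because that repository is registered second.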
Production checklist¶
- Prefer MysqlSkillRepository(writeable=false) or NacosSkillRepository: platform-side central governance, agents read-only; write-backs go through an admin console + review flow.
- Don’t want the agent to see workspace/skills/? Call .disableDefaultWorkspaceSkills().
- When enableSkillManageTool lets the agent draft new skills, always pair it with enableSkillPromotionGate(...); never autoPromote=true in production.
- NacosSkillRepository is AutoCloseable: close it from Spring @PreDestroy or a try-with-resources, otherwise subscriptions leak.
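The AutoCloseable point in the checklist is easy to demonstrate with a stand-in class. SubscribingRepository below is hypothetical; it only counts open "subscriptions" to show what try-with-resources guarantees.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical repository that holds a subscription until it is closed.
final class SubscribingRepository implements AutoCloseable {
    static final AtomicInteger OPEN_SUBSCRIPTIONS = new AtomicInteger();

    SubscribingRepository() {
        OPEN_SUBSCRIPTIONS.incrementAndGet(); // pretend we subscribed to config changes
    }

    @Override
    public void close() {
        OPEN_SUBSCRIPTIONS.decrementAndGet(); // release the subscription
    }

    static void useOnce() {
        try (SubscribingRepository repository = new SubscribingRepository()) {
            // ... load skills from the repository ...
        } // close() runs here, even if the body throws
    }
}
```

For a long-lived repository that outlives a single method, the same close() call belongs in the container's shutdown hook instead.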
5. When you need shell: pick a Sandbox + mandatory Snapshot¶
When you must use a sandbox:
- the model might run untrusted code (Python / shell / npm install / compilation)
- you need to recover the entire working directory across calls (node_modules, generated files, post-pip install environment)
- you need hard user isolation (no peeking into another user’s processes)
Five sandbox backends¶
| Spec | Module | Use it for |
|---|---|---|
| | | single-machine / local cluster; container from an image; most familiar |
| | | already running K8s; pods / Jobs |
| | | Daytona (dev-env-as-a-service) |
| | | E2B cloud sandboxes; fastest to ship, no self-managed infra |
| | | Aliyun AgentRun; native NAS / OSS mounts; enterprise-grade |
.filesystem(new DockerFilesystemSpec()
.image("ubuntu:24.04")
.isolationScope(IsolationScope.SESSION))
Snapshots are the sandbox’s distributed lifeline¶
Sandboxes are ephemeral by default — the next call() may land on a different node in a fresh container, losing every pip install and generated file. SandboxSnapshotSpec archives the workspace as tar so the next call() hydrates it back into a new container.
| Spec | Backend | When to use |
|---|---|---|
| | — | not for production; the builder will refuse unless you |
| | local directory | single-node debugging |
| | Alibaba Cloud OSS | default for multi-replica production; large objects belong in object storage |
| | Redis | small workspaces + short TTL (watch Redis memory cost) |
| Custom | S3 / GCS / MinIO / your own object store | anything not in the built-in list |
SandboxSnapshotSpec ossSnap = new OssSnapshotSpec(
"oss-cn-hangzhou.aliyuncs.com",
System.getenv("OSS_AK"),
System.getenv("OSS_SK"),
"agentscope-sandbox-snapshots",
"prod/"); // key prefix for multi-env isolation
// Build the session once so the agent and the snapshot options share the same instance.
var redisSession = RedisSession.builder().jedisClient(jedis).build();
HarnessAgent agent = HarnessAgent.builder()
    .name("coding-agent")
    .model(model)
    .workspace(workspace)
    .session(redisSession)
    .filesystem(new DockerFilesystemSpec()
        .image("python:3.12-slim")
        .isolationScope(IsolationScope.USER))
    .sandboxDistributed(SandboxDistributedOptions.oss(redisSession, ossSnap))
    .build();
SandboxDistributedOptions.oss(session, ossSpec) / .redis(session, redisSpec) are common shortcut factories. Snapshots carry a snapshotId (defaults to sessionId), so the same user across multiple devices fetches the same archive as long as the sessionId matches.
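The "snapshotId defaults to sessionId" rule can be illustrated with an in-memory stand-in for the snapshot backend. SnapshotSketch is hypothetical: a HashMap replaces OSS / Redis, and a byte array replaces the workspace tar.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of persisting and hydrating workspace archives.
final class SnapshotSketch {
    private final Map<String, byte[]> archives = new HashMap<>(); // stands in for OSS / Redis

    /** Falls back to the session id when no explicit snapshot id is given. */
    static String resolveId(String snapshotId, String sessionId) {
        return (snapshotId == null || snapshotId.isEmpty()) ? sessionId : snapshotId;
    }

    void persist(String snapshotId, String sessionId, byte[] workspaceTar) {
        archives.put(resolveId(snapshotId, sessionId), workspaceTar.clone());
    }

    /** Returns the archived workspace, or null when this id has never been snapshotted. */
    byte[] hydrate(String snapshotId, String sessionId) {
        byte[] tar = archives.get(resolveId(snapshotId, sessionId));
        return tar == null ? null : tar.clone();
    }
}
```

Two devices that send the same sessionId resolve to the same archive, while a different sessionId starts from an empty workspace.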
Sandbox exec serialization: RedisSandboxExecutionGuard¶
Under SESSION / USER scope, buckets are already partitioned by session/user and concurrent execs don’t collide. Under AGENT / GLOBAL scope with multiple replicas, N nodes can race to exec on the same sandbox slot. Add RedisSandboxExecutionGuard (a Redis-based distributed lock) in that case:
.filesystem(new DockerFilesystemSpec()
.image("ubuntu:24.04")
.isolationScope(IsolationScope.GLOBAL)
.executionGuard(new RedisSandboxExecutionGuard(jedis, "agentscope:guard:", Duration.ofSeconds(30))))
RedisSandboxExecutionGuard is the reference implementation of the SandboxExecutionGuard interface; you can plug in Zookeeper, etcd, or any other lock mechanism.
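To show what such a guard has to provide, here is a sketch with an in-process lock per sandbox slot. The ExecGuardSketch interface is a guess at a reasonable shape and is hypothetical; the real SandboxExecutionGuard signature may differ, and a production implementation would need a lock that is visible across replicas.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical shape; the real SandboxExecutionGuard interface may differ.
interface ExecGuardSketch {
    <T> T runExclusive(String slot, Supplier<T> exec);
}

// Single-JVM implementation: one lock per slot, so execs on the same slot never overlap.
final class InProcessGuard implements ExecGuardSketch {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    @Override
    public <T> T runExclusive(String slot, Supplier<T> exec) {
        ReentrantLock lock = locks.computeIfAbsent(slot, s -> new ReentrantLock());
        lock.lock();
        try {
            return exec.get();
        } finally {
            lock.unlock(); // always release, even when the exec throws
        }
    }
}
```

Swapping the ReentrantLock for a Redis, Zookeeper, or etcd lock keeps the same contract while extending it across nodes.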
Workspace projection: pushing seed files into the sandbox¶
SandboxFilesystemSpec projects AGENTS.md, skills, subagents, knowledge, .skills-cache (five roots) into the sandbox at start time by hydrating a content-hashed tar archive (incremental rewrites). Tweak it:
.filesystem(new DockerFilesystemSpec()
.image("...")
.workspaceProjectionRoots(List.of("AGENTS.md", "skills", "knowledge")) // drop subagents/.skills-cache
// .workspaceProjectionEnabled(false) // fully disable
)
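Content hashing is what makes the projection incremental: if the hash of the projected files is unchanged, the archive does not need to be rebuilt or re-sent. The sketch below shows one way to compute such a hash with SHA-256 over in-memory file contents. ProjectionHash is hypothetical, and the exact hashing scheme the harness uses is not documented here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical content hash over a set of projected files (path -> content).
final class ProjectionHash {
    private ProjectionHash() {}

    /** Hashes path and content of every file in sorted path order, so map order does not matter. */
    static String of(Map<String, String> files) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> file : new TreeMap<>(files).entrySet()) {
                digest.update(file.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator so "ab" + "c" differs from "a" + "bc"
                digest.update(file.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available on every Java platform", e);
        }
    }
}
```

Comparing the hash from the previous start with the current one is enough to decide whether the sandbox already holds an up-to-date projection.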
AgentRun-specific: NAS / OSS mounts¶
AgentRunFilesystemSpec is the only backend that natively supports multiple sandbox instances mounting the same directory (via NAS). When the business case is “one user sees the same workspace across different sessions”, AgentRun + NAS is more efficient than re-hydrating snapshots every time:
.filesystem(new AgentRunFilesystemSpec()
.apiKey(System.getenv("AGENTRUN_API_KEY"))
.accountId(System.getenv("ALI_ACCOUNT_ID"))
.region("cn-hangzhou")
.templateName("python-3.12")
.nasConfig(new AgentRunNasMountConfig().fileSystemId("...").mountTargetDomain("...").mountDir("/workspace"))
.addOssMount(new AgentRunOssMountConfig().bucketName("data").mountDir("/mnt/oss")))
Full fields in the AgentRunNasMountConfig / AgentRunOssMountConfig source.
6. Multi-replica deployment checklist (combined)¶
Pulling the single-component picks above into one table:
| Concern | Recommended combo |
|---|---|
| Sessions / | |
| Workspace files | |
| Large objects / snapshots | |
| Cross-node sandbox sharing | AgentRun + NAS mount, or self-managed K8s + |
| Skill governance | |
| Subagent task records | automatic via |
| Graceful shutdown | |
| Observability | |
| Rate limiting | custom |
7. A complete production builder template¶
Wiring everything together:
import io.agentscope.core.agent.RuntimeContext;
import io.agentscope.core.session.redis.RedisSession;
import io.agentscope.core.tracing.OtelTracingMiddleware;
import io.agentscope.harness.agent.HarnessAgent;
import io.agentscope.harness.agent.IsolationScope;
import io.agentscope.harness.agent.filesystem.spec.RemoteFilesystemSpec;
import io.agentscope.harness.agent.sandbox.RedisSandboxExecutionGuard;
import io.agentscope.harness.agent.sandbox.SandboxDistributedOptions;
import io.agentscope.harness.agent.sandbox.impl.docker.DockerFilesystemSpec;
import io.agentscope.harness.agent.sandbox.snapshot.OssSnapshotSpec;
import io.agentscope.harness.agent.store.RedisStore;
import io.agentscope.harness.agent.workspace.WorkspaceIndex;
import io.agentscope.core.memory.compaction.CompactionConfig;
import io.agentscope.core.memory.compaction.ToolResultEvictionConfig;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.List;
import redis.clients.jedis.JedisPooled;
public class ProductionAgentFactory {
public HarnessAgent build() {
var workspace = Paths.get("/var/agentscope/workspace");
var jedis = new JedisPooled(System.getenv("REDIS_URI"));
// 1. Distributed Session
var session = RedisSession.builder().jedisClient(jedis).build();
// 2. Pick sandbox + OSS snapshot (we need shell)
var sandboxSpec = new DockerFilesystemSpec()
.image("python:3.12-slim")
.isolationScope(IsolationScope.USER)
.executionGuard(new RedisSandboxExecutionGuard(
jedis, "agentscope:guard:", Duration.ofSeconds(30)));
var oss = new OssSnapshotSpec(
"oss-cn-hangzhou.aliyuncs.com",
System.getenv("OSS_AK"), System.getenv("OSS_SK"),
"agentscope-sandbox-snapshots", "prod/");
return HarnessAgent.builder()
.name("coding-assistant")
.model("dashscope:qwen-plus")
.workspace(workspace)
// Session + filesystem must be swapped together
.session(session)
.filesystem(sandboxSpec)
.sandboxDistributed(SandboxDistributedOptions.oss(session, oss))
// Memory compaction + large-result offload (mandatory for long-running sessions)
.compaction(CompactionConfig.builder()
.triggerMessages(50)
.keepMessages(20)
.build())
.toolResultEviction(ToolResultEvictionConfig.defaults())
// Centrally-governed skill repository (read-only distribution)
.skillRepository(io.agentscope.core.skill.repository.mysql.MysqlSkillRepository
.builder(skillsDataSource())
.createIfNotExist(false)
.writeable(false)
.build())
// Tracing
.middlewares(List.of(new OtelTracingMiddleware()))
.build();
}
}
At call time, populate RuntimeContext so every “bucket-by-user/session” subsystem sees the key:
agent.call(msg, RuntimeContext.builder()
.userId(httpRequest.tenantUserId())
.sessionId(httpRequest.sessionId())
.build()).block();
8. Common pitfalls¶
- java.nio.Files for workspace writes: under sandbox / Remote mode this lands in the wrong place. Always go through agent.getWorkspaceManager(). Exception: builder-time seed files (initWorkspaceIfAbsent-style code). There is no runtime context yet, so java.nio.Files is correct because you’re seeding the local template.
- tools.json’s allow filters built-in tools too: when whitelisting, keep read_file / memory_search / agent_spawn and friends in the list, or every built-in gets stripped.
- IsolationScope changes do not migrate existing data: pin it before launch. Changing it post-launch is equivalent to switching to a new namespace.
- WorkspaceSession single-machine constraint: a K8s multi-replica build throws IllegalStateException on the very first build(). This is intentional; you can’t park session state on one pod’s local disk.
- NacosSkillRepository not closed: subscriptions leak; at fleet scale Nacos complains. Use Spring @PreDestroy or destroyMethod="close".
- OSS / NAS without IAM: OssSnapshotSpec takes platform AK/SK; RAM Role + STS temporary credentials is more robust.
- SandboxDistributedOptions.requireDistributed(false) is a debug switch: make sure it doesn’t ship to production config.