Online Training¶
agentscope-extensions-training plugs a Trinity-style training backend into AgentScope. It samples production traffic, collects traces, computes rewards, and periodically commits training jobs, closing the loop between serving and training.
When to use¶
You run Trinity (or a compatible service) as the training backend.
You want to use live traffic for reinforcement learning or online fine-tuning.
You want the training pipeline to be transparent to the Agent’s callers.
Add the dependency¶
<dependency>
    <groupId>io.agentscope</groupId>
    <artifactId>agentscope-extensions-training</artifactId>
    <version>${agentscope.version}</version>
</dependency>
Quickstart¶
import io.agentscope.core.training.runner.TrainingRunner;
import io.agentscope.core.training.strategy.SamplingRateStrategy;

TrainingRunner runner = TrainingRunner.builder()
        .trinityEndpoint("http://localhost:8080")
        .modelName("/path/to/model")
        .selectionStrategy(SamplingRateStrategy.of(0.1)) // 10% sampling
        .rewardCalculator(agent -> 0.0)                  // custom reward
        .commitIntervalSeconds(300)                      // commit every 5 minutes
        .build();

runner.start(); // intercept Agent calls and start sampling

// Business code keeps using the Agent unmodified
agent.call(msg).block();

runner.stop(); // stop the training pipeline
Selection strategies¶
SamplingRateStrategy.of(0.1): random sampling at the given rate.
ExplicitMarkingStrategy.create(): only marked requests are sampled.
Or implement TrainingSelectionStrategy for custom behavior.
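To make the rate-based decision concrete, here is a minimal, self-contained sketch of how a sampling strategy might decide per request. It does not use the library: SelectionSketch and RateSelectionSketch are hypothetical stand-ins, and the real TrainingSelectionStrategy interface may expose different methods and arguments.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

final class SelectionDemo {
    public static void main(String[] args) {
        // Roughly 10% of requests should be selected over many draws.
        SelectionSketch strategy = new RateSelectionSketch(
                0.1, () -> ThreadLocalRandom.current().nextDouble());
        int sampled = 0;
        for (int i = 0; i < 10_000; i++) {
            if (strategy.shouldSample()) {
                sampled++;
            }
        }
        System.out.println("sampled " + sampled + " of 10000 requests");
    }
}

/** Hypothetical stand-in for a selection strategy; not the library's interface. */
interface SelectionSketch {
    boolean shouldSample();
}

/** Selects a request when a uniform draw in [0, 1) falls below the configured rate. */
final class RateSelectionSketch implements SelectionSketch {
    private final double rate;
    private final DoubleSupplier uniform;

    RateSelectionSketch(double rate, DoubleSupplier uniform) {
        if (rate < 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in [0, 1]");
        }
        this.rate = rate;
        this.uniform = uniform; // injected so the decision can be tested deterministically
    }

    @Override
    public boolean shouldSample() {
        return uniform.getAsDouble() < rate;
    }
}
```

Injecting the random source is a design choice for testability only. A rate of 0.0 never samples and a rate of 1.0 always samples, because the draw is in [0, 1).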
Reward calculation¶
rewardCalculator is a Function<AgentBase, Double>, invoked once per sampled trajectory:
A lambda — heuristics like answer length, tool-call count, etc.
A custom class implementing RewardCalculator for richer metrics.
TrainingRunner runner = TrainingRunner.builder()
        .trinityEndpoint(endpoint)
        .modelName(model)
        .selectionStrategy(SamplingRateStrategy.of(0.1))
        .rewardCalculator(new MyMetricRewardCalculator())
        .build();
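As an illustration of the heuristic style mentioned above (answer length, tool-call count), the following sketch scores a trajectory in [0, 1]. It is an assumption-laden example: TrajectorySketch is a hypothetical stand-in for whatever the calculator can read from AgentBase, and the equal 0.5 weights and the limits are arbitrary choices, not library defaults.

```java
import java.util.function.Function;

final class RewardDemo {
    public static void main(String[] args) {
        Function<TrajectorySketch, Double> reward = RewardHeuristics.conciseness(1000, 10);
        double score = reward.apply(new TrajectorySketch("a short answer", 2));
        System.out.println("reward = " + score);
    }
}

/** Hypothetical view of a sampled trajectory; the real AgentBase API is not modeled here. */
final class TrajectorySketch {
    final String answer;
    final int toolCalls;

    TrajectorySketch(String answer, int toolCalls) {
        this.answer = answer;
        this.toolCalls = toolCalls;
    }
}

final class RewardHeuristics {
    private RewardHeuristics() {}

    /**
     * Rewards short answers that needed few tool calls.
     * Each penalty is capped at 1.0 and weighted 0.5, so the result stays in [0, 1].
     */
    static Function<TrajectorySketch, Double> conciseness(int maxChars, int maxToolCalls) {
        if (maxChars <= 0 || maxToolCalls <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        return t -> {
            double lengthPenalty = Math.min(1.0, (double) t.answer.length() / maxChars);
            double toolPenalty = Math.min(1.0, (double) t.toolCalls / maxToolCalls);
            return 1.0 - 0.5 * lengthPenalty - 0.5 * toolPenalty;
        };
    }
}
```

With the real library, the same arithmetic would live in the lambda or class passed to rewardCalculator, reading its inputs from the agent instead of TrajectorySketch.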
How it works¶
After runner.start(), requests go through TrainingRouter:
sampled → routed to the Trinity backend, traces collected;
not sampled → original model is used, no side effects.
Sampled trajectories invoke the reward calculator and send feedback through TrinityClient.feedback(...).
Every commitIntervalSeconds, commit(...) triggers a training job.
runner.stop() shuts down timers and connection pools cleanly.
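The flow above can be sketched as a small in-memory model: a routing decision per request, a buffer of rewarded trajectories, and a timer that commits the buffer at a fixed interval. This is only an illustration of the control flow. PipelineSketch and its methods are hypothetical, and nothing here reflects the actual internals of TrainingRouter or TrinityClient.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class RoutingDemo {
    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch();

        // Analogous to start(): commit buffered trajectories every 300 seconds.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(pipeline::commit, 300, 300, TimeUnit.SECONDS);

        System.out.println(pipeline.handle("req-1", true, 0.8));  // sampled request
        System.out.println(pipeline.handle("req-2", false, 0.0)); // unsampled request

        // Analogous to stop(): cancel the timer, then flush what is still pending.
        timer.shutdownNow();
        pipeline.commit();
        System.out.println("committed batches: " + pipeline.committedBatches());
    }
}

/** Hypothetical in-memory model of the sample, reward, commit loop. */
final class PipelineSketch {
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> committed = new ArrayList<>();

    /** Returns which backend served the request; only sampled requests are recorded. */
    synchronized String handle(String requestId, boolean sampled, double reward) {
        if (!sampled) {
            return "original"; // no trace, no reward, no side effects
        }
        pending.add(requestId + ":" + reward);
        return "trinity";
    }

    /** Moves pending trajectories into one committed batch; does nothing when empty. */
    synchronized void commit() {
        if (pending.isEmpty()) {
            return;
        }
        committed.add(new ArrayList<>(pending));
        pending.clear();
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized List<List<String>> committedBatches() {
        return new ArrayList<>(committed);
    }
}
```

Whether the real runner flushes pending trajectories on stop() is not stated in this page; the final commit in the demo is an assumption made to keep the example complete.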
Key configuration¶
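| Field | Notes |
|---|---|
| trinityEndpoint | Trinity service URL |
| modelName | Target model path or alias |
| selectionStrategy | Sampling strategy |
| rewardCalculator | Reward function |
| commitIntervalSeconds | Commit interval in seconds, default 300 |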
Pairs well with Studio¶
Attach StudioMessageHook at the same time to see in Studio which sessions are sampled and how their rewards were computed.