Foundations
Reward Shaping and Credit Assignment
Reward shaping injects domain knowledge into the reward signal to speed up learning, while credit assignment determines which past actions actually caused a delayed reward.
intermediate · 8 min read
A robot arm performs 300 joint movements before finally slotting a peg into a hole and receiving a reward of +1. Which of those 300 actions mattered? This question - distributing credit across a long sequence of decisions that precede a single sparse signal - is one of the oldest unsolved problems in machine intelligence. Minsky named it the credit assignment problem in 1961. Sixty years of progress have produced powerful partial answers, but no clean solution.
Reward shaping is a complementary but distinct problem: given that you already know something about the task, how do you encode that knowledge in the reward function itself so the agent reaches the goal faster, without accidentally teaching it to optimise the wrong thing?
The Credit Assignment Problem
In a standard MDP with discount factor \(\gamma\), the return from timestep \(t\) is:
\[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\]The discount serves double duty: it ensures convergence for infinite horizons, and it implicitly weights proximal rewards more heavily than distal ones. But in sparse-reward environments, this geometric decay means actions taken hundreds of steps before a terminal reward receive vanishingly small gradient signal. Training an agent on chess from scratch with only win/loss feedback is the canonical torture test.
Three families of solution exist:
Temporal-difference (TD) bootstrapping. Rather than waiting for a full episode return, TD methods use the current value estimate to supply an intermediate target. The TD error \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) acts as a surrogate reward propagated one step at a time. Over many updates this backpropagates credit, but slowly: a reward at step \(T\) takes \(T\) separate updates to reach step 0.
Eligibility traces. TD(\(\lambda\)) blends \(n\)-step returns via a decay parameter \(\lambda \in [0,1]\). Actions in the recent past remain "eligible" for credit, with eligibility decaying as \((\gamma\lambda)^k\) per step back. Setting \(\lambda = 1\) recovers Monte-Carlo returns; \(\lambda = 0\) recovers one-step TD. GAE (Generalised Advantage Estimation, Schulman et al. 2016) applies this idea to advantage estimation in policy gradient algorithms, controlling the bias-variance trade-off with a single knob.
Return redistribution. Some architectures try to decompose the final return into per-timestep contributions. RUDDER (Arjona-Medina et al. 2018) trains a sequence model to predict cumulative return from trajectory prefixes, then redistributes the unexplained return as immediate reward. The key insight: if you can accurately predict final outcome from the current state, the contribution of each new observation to that prediction is a natural credit signal.
Reward Shaping: Injecting Domain Knowledge Safely
Even with perfect credit assignment, learning is slow when rewards are sparse. Reward shaping augments the environment reward \(r(s, a, s')\) with a hand-designed bonus \(F(s, a, s')\) to guide exploration. The danger is reward hacking: an agent that earns the bonus without achieving the true objective.
Ng, Russell, and collaborators (ICML 1999) proved that potential-based shaping is the unique family of bonuses that leaves the set of optimal policies invariant. The shaped reward is:
\[\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)\]where \(\Phi: \mathcal{S} \to \mathbb{R}\) is an arbitrary potential function. Any bonus of this telescoping form cancels out over a complete trajectory, so the agent's optimal policy under \(\tilde{r}\) is identical to the optimal policy under \(r\). The extra signal simply makes the value landscape smoother.
A concrete example: in a gridworld maze, set \(\Phi(s) = -d(s, \text{goal})\), the negative Manhattan distance to the goal. The agent now receives a positive bonus whenever it moves closer. The bonus vanishes across the whole episode (goal distance decreases by at most \(T\) steps and you get \(+1\) for reaching it), so the true objective is preserved.
# Potential-based shaping in Python pseudocode
def shaped_reward(s, a, s_next, env_reward, phi, gamma=0.99):
return env_reward + gamma * phi(s_next) - phi(s)
# phi could be: negative L2 distance to goal,
# learned value estimate from a pretrained policy,
# heuristic progress metric.
| Shaping type | Policy-invariant? | Example |
|---|---|---|
| Potential-based \(\gamma\Phi(s') - \Phi(s)\) | Yes | Distance-to-goal bonus |
| Non-potential bonus (arbitrary \(F\)) | No | Step-count penalty that changes optimal path |
| Intrinsic curiosity (count-based, RND) | No (by design) | Exploration bonus; optimal policy shifts |
| Sub-goal reward | Only if aligned with true goal | Reaching an intermediate state |
Note that intrinsic motivation methods (RND, ICM, count-based bonuses) deliberately violate policy invariance: their point is to change exploration behaviour. The non-invariance is a feature, not a bug, as long as the true reward still dominates at convergence.
Advantage Estimation as a Credit Assignment Tool
The advantage function \(A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\) measures how much better action \(a\) is than the average action in state \(s\). By subtracting the baseline \(V^\pi(s)\), it removes the component of the return that would have been received regardless of which action was taken - a direct answer to "did this action cause the reward?".
Subtracting a baseline does not bias the policy gradient, but it can dramatically reduce variance:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^\pi(s,a) \right]\]GAE constructs a multi-step advantage estimate as a weighted sum of \(k\)-step TD errors:
\[\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\]PPO (Schulman et al. 2017) clips the probability ratio \(r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)\) so that large advantage estimates do not cause destructive policy updates:
\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right]\]The interplay here with credit assignment is direct: the quality of \(\hat{A}_t\) determines whether the policy gradient points in a useful direction. A poor value baseline (bad credit) causes high-variance updates; a biased baseline causes the policy to optimise the wrong objective.
Practical Design Choices
When to shape rewards. Shape when the environment gives reward only at terminal states and the episode is long (>100 steps). Use potential-based shaping so you cannot accidentally corrupt the task. Good potential candidates: any heuristic you would use to evaluate intermediate states (distance to sub-goal, proportion of task completed, simulator-computed progress).
When to use eligibility traces. Use \(\lambda > 0\) when the environment dynamics are low-noise and trajectories are not too long (eligibility traces accumulate error when \(\gamma\lambda\) is close to 1 over many steps). For PPO and A3C in practice, values of \(\lambda \in [0.95, 0.99]\) consistently work well.
Hierarchical credit assignment. In multi-step tasks with natural sub-goals (assembling furniture, code generation with unit tests), defining a reward function at the sub-task level and composing policies hierarchically sidesteps the long-horizon credit problem entirely. The sub-task critic handles within-option credit; the meta-controller handles between-option credit.
When It Falls Down
Potential mispecification. If \(\Phi\) encodes a bad heuristic (one that points away from the true goal along some trajectory), the shaped reward can make learning harder than no shaping at all. Potential-based shaping preserves optimal policies but says nothing about sample efficiency; a misleading potential still misleads.
Discount horizon mismatch. Eligibility traces with high \(\lambda\) over very long episodes can produce high-variance gradients. The theoretical guarantee of TD(\(\lambda\)) convergence assumes tabular settings; in deep networks with function approximation, the trace can amplify gradient noise until training destabilises.
Reward hacking through shaping. Non-potential bonuses, especially sub-goal rewards that are not perfectly aligned with the true objective, invite Goodhart's law. A robot trained to maximise a "hand near socket" reward learns to hover its hand near the socket without inserting the peg. Carefully constructed sub-goal hierarchies mitigate this, but the risk scales with how loosely the bonus correlates with the actual goal.
Long-horizon credit under stochastic dynamics. When the environment is highly stochastic, a long sequence of actions precedes a reward that is partly luck. Counterfactual methods (off-policy variance reduction, advantage baselines) help, but separating skill from luck in a causal sense remains an active research problem.
Return redistribution reliability. RUDDER-style approaches depend on a sequence model that accurately predicts final returns from intermediate states. In environments where the outcome is genuinely unpredictable from early states, the redistributed credits become noisy, and the method offers little advantage over standard TD.
Further Reading
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arxiv.org/abs/1506.02438
- Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arxiv.org/abs/1707.06347
- Arjona-Medina, J. A. et al. (2018). RUDDER: Return Decomposition for Delayed Rewards. arxiv.org/abs/1806.07857
- OpenAI Spinning Up - Key Concepts in RL (covers returns, advantage functions, and baselines): spinningup.openai.com/en/latest/spinningup/rl_intro.html