Entropy Regularisation
Entropy regularisation adds a bonus term to the RL objective that rewards stochastic policies, improving exploration and preventing premature convergence to deterministic optima.
A policy that always picks the greedy action is brittle. Shake the environment slightly, and a fully deterministic policy that scored perfectly on training tasks can collapse. This fragility is not a fluke; it is a structural consequence of optimising reward alone. Entropy regularisation directly patches the problem by making randomness itself part of what the agent is rewarded for.
The Modified Objective
Standard RL seeks a policy that maximises expected cumulative reward:
J(π) = E_π [ Σ_t γ^t r_t ]
Maximum-entropy RL augments each step's reward with the Shannon entropy of the policy at that state:
J_MaxEnt(π) = E_π [ Σ_t γ^t ( r_t + α · H(π(·|s_t)) ) ]
where H(π(·|s)) = -Σ_a π(a|s) log π(a|s) and α > 0 is a temperature parameter controlling the relative weight of entropy versus reward.
The intuition is clean: the agent receives a reward bonus whenever it spreads probability mass across actions. Acting randomly is no longer purely wasteful; it pays off up to the point where the reward signal clearly favours one action.
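To make the bonus concrete, here is a minimal sketch that evaluates the entropy-augmented return along one short trajectory. The rewards, the two-action policies, and the values γ = 0.99 and α = 0.2 are made up for illustration; holding the rewards fixed while swapping the policy is a simplification that isolates the entropy term.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a discrete distribution (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def maxent_return(rewards, policies, gamma=0.99, alpha=0.2):
    """Discounted sum of r_t + alpha * H(pi(.|s_t)) along one trajectory."""
    total = 0.0
    for t, (r, p) in enumerate(zip(rewards, policies)):
        total += gamma**t * (r + alpha * entropy(p))
    return total

# Hypothetical 3-step trajectory with two actions per state.
rewards = [1.0, 0.0, 2.0]
greedy = [[1.0, 0.0]] * 3    # deterministic: entropy 0 at every step
uniform = [[0.5, 0.5]] * 3   # maximally random: entropy log 2 at every step

plain = maxent_return(rewards, greedy)   # equals the ordinary discounted return
soft = maxent_return(rewards, uniform)   # same rewards plus the entropy bonus
bonus = soft - plain                     # alpha * log(2) * (1 + gamma + gamma^2)
print(plain, soft, bonus)
```

With identical rewards, the two returns differ only by the discounted entropy bonus, which is the quantity α trades off against reward.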
This single change has cascading consequences:
| Property | Reward-only policy | Entropy-regularised policy |
|---|---|---|
| Exploration | Often collapses to single action | Maintains spread across promising actions |
| Robustness | Brittle to reward noise | Keeps alternative actions available when rewards or dynamics shift |
| Multiple optima | Commits to the first one found | Spreads probability across near-optimal actions |
| Policy output | Can be deterministic | Stochastic by construction |
Why Shannon Entropy Belongs Here
Shannon entropy H(p) = -E[log p(X)] measures average surprise (or unpredictability) of a distribution. For a uniform distribution over n actions, H = log n. For a one-hot distribution, H = 0.
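The two boundary cases are easy to check numerically. The sketch below assumes a small discrete action set (n = 4) and natural logarithms; the skewed distribution is an arbitrary example.

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

n = 4
h_uniform = shannon_entropy(np.full(n, 1.0 / n))     # equals log(n)
h_one_hot = shannon_entropy(np.eye(n)[0])            # equals 0
h_skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])     # strictly between the two
print(h_uniform, h_one_hot, h_skewed)
```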
Maximising entropy pushes the policy toward uniformity subject to the constraint that high-reward actions still receive higher probability. The result is a soft policy: a Boltzmann (softmax) distribution whose logits are the Q-values divided by the temperature.
In the tabular, single-step case the optimal entropy-regularised policy has the closed form:
π*(a|s) ∝ exp( Q*(s,a) / α )
This is exactly the softmax policy familiar from multi-armed bandits. The temperature α interpolates between fully random (α → ∞) and fully greedy (α → 0). What makes entropy regularisation powerful is that this structure propagates through the Bellman recursion; the soft Bellman backup replaces the hard max with a soft maximum (log-sum-exp):
V_soft(s) = α · log Σ_a exp( Q(s,a) / α )
The soft maximum is smooth and differentiable, which simplifies gradient-based optimisation. It is bounded below by the hard maximum and exceeds it by at most α · log n for n actions, so it approaches the hard maximum as α → 0.
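The Boltzmann policy and the log-sum-exp value can be sketched in a few lines. The Q-values and temperatures below are arbitrary, and the max-shift is a standard numerical-stability trick rather than part of the definition. The sketch also checks the identity V_soft(s) = Σ_a π(a|s) Q(s,a) + α · H(π(·|s)), which holds when π is the Boltzmann policy for the same Q and α.

```python
import numpy as np

def soft_value(q, alpha):
    """alpha * log sum_a exp(q_a / alpha), shifted by max(q) for stability."""
    q = np.asarray(q, dtype=float)
    m = q.max()
    return float(m + alpha * np.log(np.sum(np.exp((q - m) / alpha))))

def boltzmann_policy(q, alpha):
    """pi(a) proportional to exp(q_a / alpha)."""
    q = np.asarray(q, dtype=float)
    w = np.exp((q - q.max()) / alpha)
    return w / w.sum()

q = np.array([1.0, 2.0, 0.5])   # hypothetical Q-values for three actions
alpha = 0.5

pi = boltzmann_policy(q, alpha)
v = soft_value(q, alpha)

# Expected Q under pi plus alpha times the entropy of pi reproduces v.
identity_rhs = float(pi @ q - alpha * np.sum(pi * np.log(pi)))

# Gap to the hard maximum: non-negative and at most alpha * log(number of actions).
gap = v - q.max()

near_greedy = boltzmann_policy(q, 0.01)      # small alpha: mass concentrates on argmax
near_uniform = boltzmann_policy(q, 1000.0)   # large alpha: close to uniform
print(pi, v, identity_rhs, gap)
```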
Soft Actor-Critic: Where This Matters in Practice
Soft Actor-Critic (SAC), introduced by Haarnoja et al. (2018), is the canonical modern instantiation of maximum-entropy RL. Its actor minimises the KL divergence between the current policy and the Boltzmann target implied by the soft Q-function; its critic trains soft Q-values via the soft Bellman backup. The result is an off-policy algorithm with notably low sample complexity and stable training.
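The critic's regression target can be sketched as follows. This is an illustration, not a reference implementation: the function name and batch values are hypothetical, the inputs stand in for network outputs, and taking the minimum of two target critics is the clipped double-Q variant used in the SAC papers rather than something the soft backup itself requires.

```python
import numpy as np

def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """Soft Bellman target: r + gamma * (1 - done) * (min Q'(s', a') - alpha * log pi(a'|s')).

    next_q1, next_q2: target-critic values at (s', a') with a' sampled from the current policy.
    next_log_prob:    log pi(a'|s') for the same sampled actions.
    """
    next_q = np.minimum(next_q1, next_q2)
    soft_next_value = next_q - alpha * next_log_prob   # single-sample estimate of V_soft(s')
    return reward + gamma * (1.0 - done) * soft_next_value

# Hypothetical batch of two transitions; the second one is terminal.
reward = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])
next_q1 = np.array([2.0, 3.0])
next_q2 = np.array([1.8, 3.5])
next_log_prob = np.array([-1.0, -0.5])

targets = soft_q_target(reward, done, next_q1, next_q2, next_log_prob)
print(targets)
```

The `- alpha * next_log_prob` term is where entropy enters the value estimate: less likely actions under the current policy raise the target.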
The practical advantage is visible in the MuJoCo locomotion comparisons reported by Haarnoja et al. (2018): SAC matches or exceeds on-policy methods like PPO with substantially fewer environment interactions. Two factors contribute: off-policy reuse of replay data, and an entropy bonus that keeps exploration alive throughout training rather than relying on explicit noise schedules.
One further improvement (Haarnoja et al., 2018, arXiv:1812.05905) automates the temperature α itself. Rather than treating it as a fixed hyperparameter, they formulate a constrained optimisation: maximise expected return subject to the constraint that the policy's expected entropy stays above a minimum target H_target (typically -|A|, where |A| is the dimension of the action space). Dual gradient descent on this problem treats α as the Lagrange multiplier and adjusts it automatically during training, removing one of the most frustrating knobs in entropy-regularised RL.
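A minimal sketch of one temperature update, assuming the loss J(α) = E[-α · (log π(a|s) + H_target)] from the second SAC paper and a log-parameterisation of α to keep it positive. The function name, learning rate, and log-probabilities are illustrative; in a real agent the log-probabilities come from actions sampled by the current policy.

```python
import numpy as np

def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) with respect to log(alpha)."""
    alpha = np.exp(log_alpha)
    entropy_estimate = -np.mean(log_probs)              # Monte Carlo estimate of H(pi)
    grad = alpha * (entropy_estimate - target_entropy)  # dJ / d(log alpha)
    return log_alpha - lr * grad

action_dim = 6
target_entropy = -float(action_dim)   # the -|A| heuristic

log_alpha = 0.0   # alpha = 1

# Densities of continuous policies can exceed 1, so positive log-probs are normal.
# Estimated entropy -2 is above the target -6: alpha should shrink.
lowered = temperature_step(log_alpha, np.array([1.0, 2.0, 3.0]), target_entropy)

# Estimated entropy -8 is below the target -6: alpha should grow.
raised = temperature_step(log_alpha, np.array([7.0, 8.0, 9.0]), target_entropy)
print(np.exp(lowered), np.exp(raised))
```

The sign logic is the point: α rises when the policy is more deterministic than the target allows and falls otherwise.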
Connection to Policy Gradient and PPO
The entropy bonus also appears, in a lighter form, inside PPO (Schulman et al., 2017). PPO adds an entropy bonus to its surrogate objective as a regulariser:
L_PPO(θ) = L_CLIP(θ) - c₁ · L_VF(θ) + c₂ · H(π_θ)
Here c₂ is a small coefficient (often 0.01). This is not full maximum-entropy RL: the value targets and advantage estimates contain no entropy term, so the bonus never enters the return being optimised. It is a heuristic that reduces premature policy collapse during the early phases of on-policy training, particularly on discrete action spaces like Atari, where the policy can collapse to a single action after a few bad updates.
The difference matters: SAC bakes entropy into the value-function estimates, so the exploration incentive persists through every backup. PPO's entropy term appears only in the surrogate loss, so it regularises the action distribution at visited states without rewarding the agent for reaching states where it can act more randomly later.
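The PPO objective above can be sketched for a discrete-action batch as follows. The clip range 0.2 and c₁ = 0.5 are assumed values, c₂ = 0.01 follows the text, and all inputs are hypothetical arrays standing in for network outputs. The returned quantity is to be maximised.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, probs,
                  clip_eps=0.2, c1=0.5, c2=0.01):
    """L_CLIP - c1 * L_VF + c2 * H, averaged over the batch."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = np.mean(np.minimum(ratio * advantage, clipped * advantage))
    l_vf = np.mean((value_pred - value_target) ** 2)
    # Mean policy entropy over the batch; the small constant guards against log(0).
    ent = np.mean(-np.sum(probs * np.log(probs + 1e-12), axis=1))
    return l_clip - c1 * l_vf + c2 * ent, (l_clip, l_vf, ent)

ratio = np.array([1.3, 0.7])         # pi_new(a|s) / pi_old(a|s)
advantage = np.array([1.0, -1.0])
value_pred = np.array([1.0, 0.0])
value_target = np.array([1.5, 0.0])

peaked = np.array([[0.99, 0.01], [0.99, 0.01]])
spread = np.array([[0.5, 0.5], [0.5, 0.5]])

obj_peaked, (l_clip, l_vf, ent_peaked) = ppo_objective(
    ratio, advantage, value_pred, value_target, peaked)
obj_spread, (_, _, ent_spread) = ppo_objective(
    ratio, advantage, value_pred, value_target, spread)
print(obj_peaked, obj_spread)
```

Note that the entropy enters only as an additive term on the loss; the value targets passed in are unchanged by it.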
When It Falls Down
Temperature sensitivity on reward-sparse tasks. When reward is very sparse, a high-entropy policy may spend nearly all its budget on random actions and never encounter any reward signal. The entropy bonus can dominate the objective, and the agent effectively learns to act randomly. Curriculum techniques or reward shaping are needed before entropy regularisation helps.
Continuous action spaces with unbounded entropy. For Gaussian policies, differential entropy can be negative and has no upper bound, so the optimal policy may try to inflate its variance indefinitely in low-reward regions. SAC implementations typically handle this by predicting a bounded log standard deviation and applying a squashing nonlinearity (tanh) to the sampled action, but careless implementations can produce NaN gradients or diverging variances.
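A sketch of a tanh-squashed Gaussian sample with its log-density, under stated assumptions: the clamp range for the log standard deviation is a common implementation choice, not something fixed by the method, and the function names are illustrative. The change-of-variables term log(1 - tanh²(u)) is written in an algebraically equivalent form that stays finite for large |u|.

```python
import numpy as np

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0   # assumed clamp range

def tanh_log_det(u):
    """sum_i log(1 - tanh(u_i)^2), using 2 * (log 2 - u - softplus(-2u)) for stability."""
    return np.sum(2.0 * (np.log(2.0) - u - np.logaddexp(0.0, -2.0 * u)), axis=-1)

def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a = tanh(u), u ~ N(mean, std^2), and return (a, log pi(a))."""
    log_std = np.clip(log_std, LOG_STD_MIN, LOG_STD_MAX)   # keeps the variance bounded
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(mean.shape)
    action = np.tanh(u)                                    # lies in [-1, 1]
    gauss_log_prob = np.sum(
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi),
        axis=-1,
    )
    # Change of variables: log pi(a) = log N(u) - log |det da/du|.
    return action, gauss_log_prob - tanh_log_det(u)

rng = np.random.default_rng(0)
mean = np.zeros((4, 3))                  # batch of 4 states, 3 action dimensions
runaway_log_std = np.full((4, 3), 50.0)  # an absurd network output, tamed by the clamp
action, log_prob = sample_squashed_gaussian(mean, runaway_log_std, rng)
print(action.shape, log_prob)
```

Without the clamp, exp(50) would give an astronomically large standard deviation; with the naive log(1 - tanh²(u)), saturated actions would yield log(0).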
Target entropy tuning is not free. The automatic temperature rule H_target = -|A| works well for locomotion but is a heuristic. Tasks with highly asymmetric action costs, or very large discrete action spaces, may require manual target selection. Setting H_target too high forces uniform exploration even when the task is nearly solved; setting it too low degenerates to reward-only training.
Off-policy stability assumption. SAC's sample efficiency assumes that replay-buffer data is close enough in distribution to the current policy for the soft Bellman backup to be valid. On tasks with very sharp reward landscapes (e.g., contact-rich manipulation), distributional shift in the replay buffer can cause Q-value overestimation that the entropy term does not cure.
Partial observability. Under partial observability the entropy of the action distribution does not capture uncertainty about the state. A high-entropy policy can therefore be misleading: the agent looks exploratory (high H) but is simply uncertain about where it is rather than deliberately exploring. Dedicated methods (e.g., curiosity-based or belief-based approaches) are needed instead.
Further Reading
- Haarnoja, T., Zhou, A., Abbeel, P., Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018. https://arxiv.org/abs/1801.01290
- Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. https://arxiv.org/abs/1812.05905
- Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. https://arxiv.org/abs/1805.00909
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347