Returns, Discounting, and Episodes

An agent trained on a grid-world with no discounting, left to run for a million steps, values a reward collected one step before the deadline exactly as much as the same reward collected on step two. That pathology is not an implementation bug; it is what the maths demands unless you tell the agent that later is worth less than sooner. The return, the discount factor, and the episode boundary are the three knobs that determine what "later" even means.

From reward to return

A reward $r_t$ is a scalar signal the environment emits after a transition. It answers the question "how good was that single step?" The return $G_t$ answers a different question: "how good is everything from here onwards?"

\[G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots\]

That bare sum is only guaranteed to be finite when the interaction terminates at some finite time $T$, which is the episodic case. In continuing tasks (a stock-trading bot, a server routing daemon, a robotic manipulator that never powers off) the sum may diverge, making it useless as an objective.

Two fixes exist:

Setting	Return formula	Converges when...
Finite horizon (episodic)	$G_t = \sum_{k=0}^{T-t-1} r_{t+k+1}$	Always, because the sum is finite
Infinite horizon (discounted)	$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$	$\gamma \in [0,1)$ and rewards are bounded
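The convergence condition in the second row follows from a geometric series bound. As a short worked step, write $R_{\max}$ for an assumed bound on $|r_t|$ (a symbol introduced here for illustration, not used elsewhere in this section):

\[|G_t| \le \sum_{k=0}^{\infty} \gamma^k \, |r_{t+k+1}| \le R_{\max} \sum_{k=0}^{\infty} \gamma^k = \frac{R_{\max}}{1-\gamma}\]

For instance, with $R_{\max} = 1$ and $\gamma = 0.99$, no return can exceed 100 in magnitude. If either condition is dropped (rewards unbounded, or $\gamma = 1$), the bound no longer holds.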

The discounted formulation also applies to episodic tasks; treating the terminal state as an absorbing state with zero rewards makes the two formalisms equivalent.

The discount factor and what it really says

$\gamma \in [0,1)$ is usually introduced with the financial analogy (a pound today beats a pound tomorrow), but the more precise reading is geometric probability: imagine there is a constant probability $1-\gamma$ that the episode ends at each step. Then the expected sum of undiscounted rewards under that model equals the discounted return. This reinterpretation is exact, not metaphorical, and it matters when you choose $\gamma$ in practice.
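The equivalence can be checked numerically. The sketch below is only an illustration under one stated convention: the first reward is always collected, and after each reward the episode continues with probability gamma and ends with probability 1 - gamma. The function names are hypothetical, and the finite reward list stands in for a longer trajectory.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over the trajectory."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def expected_undiscounted_return(rewards, gamma):
    """
    Exact expectation of the plain, undiscounted sum under random termination.

    Assumed convention: after each reward the episode continues with
    probability gamma and ends with probability 1 - gamma. The episode is
    forced to end once the reward list is exhausted.
    """
    n = len(rewards)
    expected = 0.0
    for m in range(1, n + 1):
        # Probability that exactly m rewards are collected before the episode ends.
        if m < n:
            p_stop = gamma ** (m - 1) * (1.0 - gamma)
        else:
            p_stop = gamma ** (n - 1)
        expected += p_stop * sum(rewards[:m])
    return expected


rewards = [1.0, 2.0, 3.0, 4.0]
gamma = 0.9
print(discounted_return(rewards, gamma))
print(expected_undiscounted_return(rewards, gamma))
```

The two printed values agree up to floating-point rounding, because the probability of surviving long enough to collect the k-th reward is exactly gamma**k.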

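A few concrete numbers show the sensitivity, using the rule of thumb that the effective horizon is about $1/(1-\gamma)$ steps:

gamma = 0.9    -> horizon ~ 10 steps
gamma = 0.99   -> horizon ~ 100 steps   (about e^{-1}, or 37%, of the discount weight lies beyond step 100)
gamma = 0.999  -> horizon ~ 1000 steps
gamma = 1.0    -> infinite horizon (return undefined unless the episode terminates)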
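These figures can be reproduced with a few lines. The sketch assumes the common rule of thumb that the effective horizon is 1 / (1 - gamma); the helper names are hypothetical. The tail mass is the fraction of the total discount weight that falls beyond the first n steps, which works out to gamma**n.

```python
import math


def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma); infinite when gamma is 1 or more."""
    if gamma >= 1.0:
        return math.inf
    return 1.0 / (1.0 - gamma)


def tail_mass(gamma, n):
    """Fraction of the total discount weight that lies beyond the first n steps."""
    return gamma ** n


for gamma in (0.9, 0.99, 0.999):
    horizon = effective_horizon(gamma)
    mass = tail_mass(gamma, round(horizon))
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, tail mass {mass:.3f}")
```

In each case roughly 35% to 37% of the weight sits beyond the nominal horizon, so the horizon is a scale, not a hard cutoff.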

Setting $\gamma$ is a design choice with real consequences:

Too low (0.5-0.8): The agent is myopic. It will sacrifice long-run good outcomes for immediate scraps. In maze navigation this means it may collect a small reward in the wrong corridor rather than walk 20 steps to the exit.
Too high (0.999): Variance in gradient estimates grows sharply. The agent has to distinguish the contribution of an action taken 900 steps ago, which is numerically and statistically very hard.
$\gamma = 1$ with termination: Valid, but the value function must be estimated carefully. Monte Carlo is safe; bootstrapping methods can be unstable.
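The myopia case can be made concrete with a toy comparison. The numbers here are hypothetical, chosen only for illustration: a scrap worth 1 that is available immediately, against an exit worth 10 that is 20 steps away.

```python
def discounted_value(reward, delay, gamma):
    """Value today of a single reward received after `delay` further steps."""
    return gamma ** delay * reward


# Hypothetical maze: small reward now, large reward 20 steps later.
SCRAP_REWARD = 1.0
EXIT_REWARD = 10.0
EXIT_DELAY = 20


def prefers_exit(gamma):
    """True if the distant exit is worth more than the immediate scrap."""
    exit_value = discounted_value(EXIT_REWARD, EXIT_DELAY, gamma)
    scrap_value = discounted_value(SCRAP_REWARD, 0, gamma)
    return exit_value > scrap_value


# Break-even discount factor: gamma**EXIT_DELAY * EXIT_REWARD == SCRAP_REWARD.
break_even = (SCRAP_REWARD / EXIT_REWARD) ** (1.0 / EXIT_DELAY)

for gamma in (0.8, 0.99):
    print(gamma, prefers_exit(gamma))
print("break-even gamma:", round(break_even, 3))
```

With these numbers the agent takes the scrap at gamma = 0.8 and walks to the exit at gamma = 0.99; the switch happens at a discount factor of about 0.89.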

The Schulman et al. Generalised Advantage Estimation paper (arXiv 1506.02438) treats $\gamma$ as a bias-variance knob, not merely a convergence trick, which is a cleaner lens once you are working with policy gradients.

Episodic versus continuing tasks

An episode is a sequence of interactions with a defined start and a terminal state. Games are the canonical example: a chess game begins, pieces move, the game ends when a king is in checkmate or a draw is declared. The agent resets, and a fresh episode begins. Each episode is independent.

A continuing task has no terminal state. The agent simply keeps acting. The objective must use discounting (or average reward) to remain finite.

This distinction is not merely taxonomic. It affects:

Which return to compute. Monte Carlo methods require episodes to finish before computing $G_t$. They cannot be applied to continuing tasks without modification.
How to handle $\gamma = 1$. Safe in episodic tasks; requires extra care (or switching to average reward) in continuing tasks.
What "resetting" means for off-policy learning. An experience replay buffer drawn from many episodes mixes terminal and non-terminal transitions; implementations must flag which $s'$ values are terminal to zero out the bootstrap target correctly.

In practice, many environments that look continuing are made episodic by adding a time limit (a "timeout"). This is a deliberate approximation and introduces a subtle bias: when rewards are non-negative, the value function near the timeout boundary is lower than it would be under the true infinite-horizon objective, because rewards past the cutoff are never counted. OpenAI's Spinning Up documentation notes this distinction explicitly. Modern libraries often expose a separate truncated flag (the timeout case) alongside terminated (the true terminal state) precisely so that value estimation can handle both correctly.
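A minimal sketch of how the two flags might feed a one-step bootstrap target follows. The function name, the tuple layout, and the numeric values are hypothetical; the only point is that the bootstrap term is zeroed for a true terminal state and kept for a timeout.

```python
def td_target(reward, next_value, gamma, terminated):
    """
    One-step bootstrap target r + gamma * V(s').

    The bootstrap term is dropped only when s' is a true terminal state.
    A time-limit truncation is not a terminal state, so it still bootstraps.
    """
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap


# Hypothetical transitions: (reward, estimated V(s'), terminated, truncated).
transitions = [
    (1.0, 5.0, False, False),  # ordinary step: bootstrap from V(s')
    (1.0, 5.0, False, True),   # timeout: episode stops, but still bootstrap
    (1.0, 5.0, True, False),   # true terminal state: no future reward
]

for reward, next_value, terminated, truncated in transitions:
    # `truncated` tells the rollout loop to reset; it does not change the target.
    print(terminated, truncated, td_target(reward, next_value, 0.9, terminated))
```

Storing terminated (not a merged done flag) with each transition in a replay buffer keeps this decision available at training time.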

Computing returns in code

In a collected trajectory of length $T$, the returns are computed by a backward pass:

def compute_returns(rewards, gamma, last_value=0.0):
    """
    rewards: list of floats, length T
    last_value: bootstrap value if episode was truncated (not terminated)
    returns: list of floats, same length
    """
    G = last_value
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns

The last_value parameter is the key detail. If the episode terminated (agent died, goal reached), last_value = 0. If the episode was cut short by a time limit, last_value should be the estimated value of the final state; otherwise you are treating truncation as death and biasing every return in that trajectory downward.

This backward pass is $O(T)$ and numerically stable for any $\gamma < 1$ because the most distant rewards contribute least.
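To see the effect of last_value on a concrete trajectory, the snippet below repeats the function so that it runs on its own and compares the two cases. The rewards and the bootstrap value of 2.0 are made-up numbers: with a reward of 1 per step and gamma = 0.5, the true infinite-horizon value of every state is 1 / (1 - 0.5) = 2.

```python
def compute_returns(rewards, gamma, last_value=0.0):
    """Backward pass: G_t = r_{t+1} + gamma * G_{t+1}, seeded with last_value."""
    G = last_value
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns


rewards = [1.0, 1.0, 1.0]
gamma = 0.5

# Episode truly terminated: nothing follows the last reward.
as_terminated = compute_returns(rewards, gamma)

# Episode cut by a time limit: bootstrap from the (assumed) value of the final state.
as_truncated = compute_returns(rewards, gamma, last_value=2.0)

print(as_terminated)  # [1.75, 1.5, 1.0]
print(as_truncated)   # [2.0, 2.0, 2.0]
```

Treating the timeout as a termination underestimates every return in the trajectory, and the error is largest for the steps closest to the cutoff.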

When it falls down

Reward shaping and return hacking. The return is only as good as the reward signal. Dense shaping rewards that guide the agent through intermediate steps can inadvertently create local optima where the agent maximises shaped reward at the expense of the true objective. The agent is optimising the return exactly as designed; the design is wrong.

Discount mismatch across time scales. If your task has meaningful structure at both short horizons (sub-second motor control) and long horizons (multi-minute strategy), a single $\gamma$ cannot serve both. Hierarchical RL addresses this by using different discount factors at different levels of abstraction, but it adds considerable complexity.

Terminal state misidentification. Forgetting to zero the bootstrap for true terminal states is a silent bug: the value function learns to predict rewards beyond the end of an episode, inflating value estimates near terminal states. The confusion is especially common when wrapping environments that return a single done=True for both timeout and true termination, because the flag alone cannot tell you whether to bootstrap.

Sparse reward and the vanishing return problem. If the reward is only nonzero at the very end of a 1000-step episode and $\gamma = 0.99$, the return at step 0 is $0.99^{999} \approx 0.00004$ times the terminal reward. The signal is real, but vanishingly small. This is why reward shaping, curiosity-driven exploration, and return normalisation are active research areas rather than solved problems.
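The arithmetic is easy to check. The helper below is a hypothetical illustration: it returns the discount weight that a reward on the final step of an episode carries when seen from step 0.

```python
def first_step_weight(episode_length, gamma):
    """Discount applied at step 0 to a reward that arrives only on the final step."""
    return gamma ** (episode_length - 1)


for gamma in (0.99, 0.999):
    weight = first_step_weight(1000, gamma)
    print(f"gamma={gamma}: weight at step 0 = {weight:.6f}")
```

Raising gamma from 0.99 to 0.999 lifts the weight from about 0.00004 to about 0.37, which is the trade-off against the variance cost of a high discount factor described earlier.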

Infinite returns outside $[0,1)$. If $\gamma \geq 1$ in a task that never terminates, or if rewards are unbounded, the sum can diverge. This is obvious in theory and occasionally shows up in practice when a reward function is accidentally implemented without clipping or normalisation.


Further reading