1 FINITE-HORIZON OPTIMIZATION

Consider a *finite* set *S*, called the *state space*; a *finite* set *A*, called the *action space*; and, for each pair (*x*, *a*) ∈ *S* × *A*, a probability function *p*(*y*; *x*, *a*) on *S*:

  p(y; x, a) ≥ 0,  ∑_{y∈S} p(y; x, a) = 1.   (6.1)

The function *p*(*y*; *x*, *a*) denotes the probability that the state in the next period will be *y*, given that the present state is *x* and an action *a* has been taken.
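As a concrete sketch, the conditions in (6.1) can be checked numerically for a small transition function. The state names, action names, and probabilities below are illustrative, not taken from the text.

```python
# Hypothetical finite MDP data: p[(x, a)][y] plays the role of p(y; x, a).
S = ["s0", "s1"]
A = ["a0", "a1"]
p = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 1.0, "s1": 0.0},
}

def is_transition_function(p, S, A, tol=1e-9):
    """Check (6.1): p(y; x, a) >= 0 and sum_y p(y; x, a) = 1 for every (x, a)."""
    for x in S:
        for a in A:
            row = p[(x, a)]
            if any(row.get(y, 0.0) < 0.0 for y in S):
                return False
            if abs(sum(row.get(y, 0.0) for y in S) - 1.0) > tol:
                return False
    return True
```

Storing each row p(·; x, a) as a dictionary keyed by next state keeps the check a direct transcription of (6.1).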

A *policy* (or *feasible policy*) is a sequence of functions (ƒ₀, ƒ₁, …) defined as follows. ƒ₀ is a function on *S* into *A*: if the state in period k = 0 is x₀, then the action ƒ₀(x₀) = a₀ is taken. Given the state x₀ and the action a₀ = ƒ₀(x₀), the state in period k = 1 is x₁ with probability p(x₁; x₀, ƒ₀(x₀)). Next, ƒ₁ is a function on *S* × *A* × *S* into *A*: given the triple x₀, ƒ₀(x₀), x₁, the action ƒ₁(x₀, ƒ₀(x₀), x₁) = a₁ is taken. Given x₀, a₀ and the state x₁ and action a₁, the probability that the state in period k = 2 is x₂ is p(x₂; x₁, a₁). Similarly, ƒ₂ is a function on *S* × *A* × *S* × *A* × *S* into *A*: given x₀, a₀, x₁, a₁, x₂, the action a₂ = ƒ₂(x₀, a₀, x₁, a₁, x₂) is taken. Given all the states and actions up to period k = 2 (namely x₀, a₀ = ƒ₀(x₀), x₁, a₁ = ƒ₁(x₀, a₀, x₁), x₂, a₂ = ƒ₂(x₀, a₀, x₁, a₁, x₂)), the probability that state x₃ occurs in period k = 3 is p(x₃; x₂, a₂), and so on.

In general, a policy is a sequence of functions {ƒₖ : k = 0, 1, 2, …} such that ƒₖ is a function on (S × A)ᵏ × S into A (k = 1, 2, …), with ƒ₀ a function on *S* into *A*. The sequence may be finite, {ƒₖ : k = 0, 1, …, N}, called the *finite-horizon case*, or it may be infinite, the *infinite-horizon case*.
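The recursive construction above can be sketched as a simulator: each ƒₖ receives the full history (x₀, a₀, …, xₖ) and returns an action, and the next state is drawn from p(·; xₖ, aₖ). The transition data and the always-take-a₀ policy below are illustrative assumptions, not from the text.

```python
import random

# Illustrative transition function (deterministic, so a trajectory is
# easy to follow): from s0 action a0 leads to s1, and vice versa.
p = {
    ("s0", "a0"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
}

def simulate(p, policies, x0, seed=0):
    """Generate one trajectory x0, a0, x1, a1, ..., xN, aN, x_{N+1}.

    policies is the finite sequence (f_0, ..., f_N); each f_k maps the
    history (x0, a0, ..., x_k) to an action a_k, and the next state is
    drawn from the distribution p(.; x_k, a_k).
    """
    rng = random.Random(seed)
    history = [x0]
    for f in policies:
        x = history[-1]
        a = f(tuple(history))            # action may use the whole history
        ys, ws = zip(*p[(x, a)].items()) # next-state distribution p(.; x, a)
        history += [a, rng.choices(ys, weights=ws, k=1)[0]]
    return history

# A Markov (history-independent) policy for horizon N = 1: f_0 and f_1
# both ignore the history and always choose "a0".
always_a0 = [lambda h: "a0"] * 2
```

With the deterministic transitions above, `simulate(p, always_a0, "s0")` returns the alternating trajectory ["s0", "a0", "s1", "a0", "s0"].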

Consider first the case of a *finite horizon*.