1 FINITE-HORIZON OPTIMIZATION
Consider a finite set $S$, called the state space, a finite set $A$, called the action space, and, for each pair $(x, a) \in S \times A$, a probability function $p(y; x, a)$ on $S$:

$$p(y; x, a) \geq 0, \qquad \sum_{y \in S} p(y; x, a) = 1. \tag{6.1}$$

The function $p(y; x, a)$ denotes the probability that the state in the next period will be $y$, given that the present state is $x$ and an action $a$ has been taken.
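To make the definition concrete, here is a minimal Python sketch of such a transition kernel. The two-state, two-action example is hypothetical (the state names, action names, and probabilities are illustrative, not taken from the text); the assertions check the two conditions in (6.1).

```python
import random

S = ["s0", "s1"]          # finite state space S (illustrative)
A = ["a0", "a1"]          # finite action space A (illustrative)

# p[(x, a)][y] = probability that the next state is y, given
# present state x and action a, i.e. p(y; x, a)
p = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 1.0, "s1": 0.0},
}

# The two conditions in (6.1): nonnegativity, and each row sums to 1.
for (x, a), row in p.items():
    assert all(prob >= 0 for prob in row.values())
    assert abs(sum(row.values()) - 1.0) < 1e-12

def next_state(x, a):
    """Sample the next state y with probability p(y; x, a)."""
    states, probs = zip(*p[(x, a)].items())
    return random.choices(states, weights=probs)[0]
```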
A policy (or feasible policy) is a sequence of functions $(f_0, f_1, \ldots)$ defined as follows. $f_0$ is a function on $S$ into $A$. If the state in period $k = 0$ is $x_0$, then an action $f_0(x_0) = a_0$ is taken. Given the state $x_0$ and the action $a_0 = f_0(x_0)$, the state in period $k = 1$ is $x_1$ with probability $p(x_1; x_0, f_0(x_0))$. Now $f_1$ is a function on $S \times A \times S$ into $A$. Given the triple $x_0, f_0(x_0), x_1$, an action $f_1(x_0, f_0(x_0), x_1) = a_1$ is taken. Given $x_0$, $a_0$, the state $x_1$, and the action $a_1$, the probability that the state in period $k = 2$ is $x_2$ is $p(x_2; x_1, a_1)$. Similarly, $f_2$ is a function on $S \times A \times S \times A \times S$ into $A$; given $x_0, a_0, x_1, a_1, x_2$, an action $a_2 = f_2(x_0, a_0, x_1, a_1, x_2)$ is taken. Given all the states and actions up to period $k = 2$ (namely $x_0$, $a_0 = f_0(x_0)$, $x_1$, $a_1 = f_1(x_0, a_0, x_1)$, $x_2$, $a_2 = f_2(x_0, a_0, x_1, a_1, x_2)$), the probability that the state in period $k = 3$ is $x_3$ is $p(x_3; x_2, a_2)$, and so on.

In general, a policy is a sequence of functions $\{f_k : k = 0, 1, 2, \ldots\}$ such that $f_k$ is a function on $(S \times A)^k \times S$ into $A$ $(k = 1, 2, \ldots)$, with $f_0$ a function on $S$ into $A$. The sequence may be finite, $\{f_k : k = 0, 1, \ldots, N\}$, called the finite-horizon case, or it may be infinite, the infinite-horizon case. A simulation sketch of this construction follows.
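Continuing the sketch above, a policy can be represented as a list of functions, where the $k$-th entry maps the full history $(x_0, a_0, \ldots, x_k)$ to an action, matching the definition of $f_k$. The `simulate` helper below is a hypothetical illustration of how the process unfolds over a finite horizon; it reuses the `next_state` sampler from the previous sketch.

```python
def simulate(policy, x0, N):
    """Run a policy (f_0, ..., f_N) from initial state x0.

    `policy` is a list of N + 1 functions; policy[k] maps the history
    (x0, a0, x1, a1, ..., xk) to an action a_k, as in the definition of f_k.
    Returns the realized history [x0, a0, x1, a1, ..., xN, aN].
    """
    history = [x0]
    for k in range(N + 1):
        a = policy[k](tuple(history))       # a_k = f_k(x0, a0, ..., x_k)
        history.append(a)
        if k < N:
            x = next_state(history[-2], a)  # draw x_{k+1} ~ p(.; x_k, a_k)
            history.append(x)
    return history

# Example: a policy that ignores the history beyond the current state
# (a stationary policy, using only x_k = history[-1]).
f = lambda history: "a0" if history[-1] == "s0" else "a1"
print(simulate([f, f, f], "s0", N=2))
```

Note that each $f_k$ receives the entire history, even though the transition probability for the next state depends only on the current state and action; this generality is exactly what the definition of a policy allows.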
Consider first the case of a finite horizon.