Lecture Notes in Mathematics, 1299, Probability Theory and Mathematical Statistics, S. Watanabe and Yu. V. Prokhorov (Eds.), Springer-Verlag, Berlin, 1988.

ON THE VALUE FOR OLA-OPTIMAL STOPPING PROBLEM BY POTENTIAL THEORETIC METHOD

Masami Yasuda

Summary. The stopping problem on a Markov process with the OLA (One-stage Look Ahead) policy is considered. Its associated optimal value can be expressed explicitly as a potential for a charge given by the positive part of the difference between the immediate reward and the one-period-after reward. As an application to the best choice problem, the optimal values of three problems are calculated: the classical secretary problem, the problem with a refusal probability, and the one with a random number of objects.

1. Introduction.

The optimal stopping problem is a special case of Markov decision processes. The decision maker can either stop, in which case he receives a reward and the process terminates, or pay a cost and continue observing the state. If the decision is made to continue, he proceeds to the next state according to the given transition probability. The objective is to choose the policy which maximizes the expected value. A policy for this decision process means the adoption of a stopping time of the process.

Let $X_n$, $n = 0,1,2,\ldots$, be a Markov chain over a state space $S$ in $R^1$. We assume that $S$ is countable, but this is inessential for our discussion; in the last section 3 the case of the unit interval is considered. The optimal stopping problem is to find a stopping time $\tau$ which maximizes the expectation of the payoff $v(i;\tau)$ starting at $i$. Let us denote the optimal value by

(1.1) v(i) = \sup_{\tau} v(i;\tau) = \sup_{\tau} E[r(X_\tau) - \sum_{n=0}^{\tau-1} c(X_n) \mid X_0 = i], \quad i \in S,

where $r(i)$ means an immediate reward and $c(i)$ a paying cost. The admissible class of policies is the set of all finite stopping times. Detailed analyses are given by many authors, such as Chow/Robbins/Siegmund[1], Shiryaev[11] and so on.

Consider the set of states for which stopping immediately is at least as good as stopping after exactly one more period. Denote this set by

(1.2) B = \{ i \in S : r(i) \ge Pr(i) - c(i) \},

where $Pr(i) = \sum_{j \in S} P(i,j) r(j)$ and $P(i,j)$, $i,j \in S$, is a stationary transition probability. The policy defined by the hitting time of this set $B$ is called a "One-stage Look Ahead" (abridged OLA) policy by Ross[10], since it compares stopping immediately with stopping after one period. Under certain conditions, he shows that the OLA policy is optimal. The OLA policy is useful for many problems and has also been extended to continuous parameter processes by Prabhu[8].

Our aim in this note is to obtain, by applying the potential operator, the explicit optimal value associated with the OLA policy for stopping problems. With this result we give, in section 3, the explicit solutions of various versions of the best choice problem in asymptotic form. The first is the classical problem and the second is the case with a refusal probability; the solution for the case of a random number of objects is also calculated. The last case was reduced to a functional optimality equation by an ad hoc method in Yasuda[13]. The motivation for this approach arose in connection with the results of Darling[2] and of Hordijk[5]: the former gave an upper bound on the optimal value by a potential operator, and the latter gave a sufficient condition for finding an optimal stopping time.
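To fix ideas before the potential-theoretic development, the following minimal sketch computes the OLA set $B$ of (1.2) for a small finite-state chain. The chain, reward, and cost here are illustrative assumptions only, not data from the paper.

```python
import numpy as np

# Illustrative finite-state example of the OLA set B in (1.2).
# P, r, c below are made up; any row-stochastic P would do.
n = 6
P = np.triu(np.ones((n, n)), k=1)      # chain moves strictly "upward"
P[-1, -1] = 1.0                        # top state is absorbing
P /= P.sum(axis=1, keepdims=True)      # normalize rows
r = np.linspace(0.0, 1.0, n)           # immediate reward r(i)
c = np.full(n, 0.01)                   # one-period cost c(i)

Pr = P @ r                             # Pr(i) = sum_j P(i,j) r(j)
B = r >= Pr - c                        # B = {i : r(i) >= Pr(i) - c(i)}
print("OLA set B:", np.where(B)[0])    # here: only the absorbing state
```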
2. Markov potential and the optimal value.

For a transition probability $P = P(i,j)$, $i,j \in S$, a function $f = f(i)$, $i \in S$, is called a charge if $\lim_{k\to\infty} [\{I + P + \cdots + P^k\} f](i)$ exists and is finite-valued for each $i \in S$. A function $g = g(i)$, $i \in S$, is a potential if there exists a charge $f$ such that $g(i) = \lim_{k\to\infty} [\{I + P + \cdots + P^k\} f](i)$, $i \in S$. We shall use the notation $g(i) = Nf(i)$, $i \in S$. The relation between Markov potential theory and dynamic programming, or Markov decision processes, is discussed by Hordijk[5]. Since, as in section 1, the optimal stopping problem is a special case of a Markov decision process, the optimality equation of the stopping problem reduces to

(2.1) v(i) = \max\{ r(i),\ Pv(i) - c(i) \}, \quad i \in S.

Fundamental in such an investigation of dynamic programming are the uniqueness of the solution of equation (2.1) and the determination of the optimal policy and the optimal value. In this note we consider the optimal stopping problem for which the OLA policy is optimal. To give a sufficient condition, we prepare the next two propositions. The subset $B$ of $S$ in (1.2) is closed if

(2.2) P(i,j) = 0 \quad \text{for } i \in B,\ j \in \bar{B},

where $\bar{B}$ denotes the complement of the set $B$. The process is stable if the sequence defined by $v_0(i) = r(i)$, $v_n(i) = \max\{r(i),\ Pv_{n-1}(i) - c(i)\}$ for $n \ge 1$ converges uniformly and $\lim_{n\to\infty} v_n(i) = v(i)$ for $i \in S$.

Proposition 2.1 (Ross[10]). If the process is stable and the set $B$ is closed, then the OLA policy is optimal.

Although this situation occurs in many applications and is useful for determining the stopping region, the proposition does not give the optimal value. The stability is somewhat unsatisfactory to check in applications because the optimal value is unknown. So we impose an assumption on a potential of the chain. Our aim is to calculate the optimal expected value under the closedness and the following equalization assumption instead of the stability assumption, and to express it explicitly by using a potential. We call a problem for which the OLA policy is optimal an OLA-optimal stopping problem.

Proposition 2.2 (Hordijk[5]). Suppose that

(2.3) v(i) = Pv(i) - c(i) \quad \text{when } i \notin \Gamma = \{ i \in S : v(i) = r(i) \}.

If the value function is a potential, then the hitting time of the set $\Gamma$ is optimal.

Proof. This is a special case of Theorem 4.1 of Hordijk[5]. Q.E.D.

Let $P_A$ denote the restriction of the transition probability $P$ to a subset $A$ of $S$, i.e.,

(2.4) P_A(i,j) = P(i,j) 1_A(j), \quad i,j \in S,

where $1_A$ denotes the indicator of the set $A$. When the stability is dropped, as Ross shows, a stopping problem need not admit an optimal OLA policy. So we must impose a condition so as to preserve the OLA principle. We consider an equalizing condition on the reward function based on the potential notion.

Assumption 2.1. We assume that

(2.5) \lim_{k\to\infty} [(P_{\bar{B}})^k r](i) = 0 \quad \text{for } i \in \bar{B},

where $\bar{B}$ denotes the complement of the set $B$ defined by (1.2), and that the potential for $P_B r - c$ is finite-valued with respect to $P_{\bar{B}}$, that is,

(2.6) [N_{\bar{B}}(P_B r - c)](i) < \infty \quad \text{for } i \in \bar{B}.

The assumption (2.5) is equivalent to $\lim_{k\to\infty} E[r(X_k) 1_{\{T>k\}};\ X_n \in \bar{B},\ n = 1,\ldots,k-1] = 0$, where $T$ is the hitting time of $B$. The property $\lim_{k\to\infty} (P_{\bar{B}})^k v = 0$ for the optimal value is called equalizing in optimal gambling (Dubins/Savage[3] or Hordijk[5]). One might say that the amount actually received in the time periods up to $N$ and the promised earnings equalize as $N$ tends to infinity. If the optimal value satisfies this property, then the assumption (2.5) holds, since $v(i) \ge r(i)$, $i \in S$, by (2.1).
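The potential $N$ and the restriction $P_A$ of (2.4) are easy to approximate numerically for a finite chain. The helpers below are a minimal sketch under the assumption that $f$ is indeed a charge (so the partial sums converge); the names restrict and potential are ours, not the paper's.

```python
import numpy as np

def restrict(P, A):
    """Restriction (2.4): P_A(i,j) = P(i,j) * 1_A(j); A is a boolean mask."""
    return P * A[None, :]

def potential(P, f, k_max=10_000, tol=1e-12):
    """Approximate Nf(i) = lim_k [(I + P + ... + P^k) f](i) by truncation.
    Converges when f is a charge for P."""
    g, term = f.copy(), f.copy()
    for _ in range(k_max):
        term = P @ term                # next term P^k f of the series
        g += term
        if np.abs(term).max() < tol:   # partial sums have settled
            break
    return g

# E.g., with P, r, c, B from the earlier sketch, assumption (2.6) asks that
# potential(restrict(P, ~B), restrict(P, B) @ r - c) be finite on the set ~B.
```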
Theorem 2.1. Under the assumptions (2.5) and (2.6), the OLA policy $\tau_B$, the hitting time of $B$, is optimal, and the optimal value $v(i) = v(i;\tau_B)$ is given by

(2.7) v(i) = r(i) + N(Pr - r - c)^+(i) = \begin{cases} r(i) & \text{on } B, \\ N_{\bar{B}}(P_B r - c)(i) & \text{on } \bar{B}, \end{cases}

where $^+$ denotes the positive part of a function, and $N$ and $N_{\bar{B}}$ are the potentials with respect to $P$ and $P_{\bar{B}}$ respectively.

Proof. The proof of the optimality of the OLA policy is similar to Ross[10] and Hordijk[5]. In general, if $v(i)$, $i \in S$, is a solution of the optimality equation (2.1), then $v(i) \ge r(i)$ and $v(i) \ge Pv(i) - c(i)$ for $i \in S$. So

v(i) \ge P^n r(i) - \sum_{k=0}^{n-1} P^k c(i) \quad \text{for each } n.

It follows that $v(i) \ge v(i;\tau)$ for any policy $\tau < \infty$. Therefore it suffices to prove the following two assertions. The first is that the value of the OLA policy $\tau_B$ equals the right-hand side of (2.7), that is, $v(i;\tau_B) = r(i)$ on $B$ and $= N_{\bar{B}}(P_B r - c)(i)$ on $\bar{B}$; the second is that this value satisfies the optimality equation (2.1).

By the definition of the OLA policy, $v(i) = r(i)$ on $B$ and $v(i) = Pv(i) - c(i) = P_B r(i) + P_{\bar{B}} v(i) - c(i)$ on $\bar{B}$; iterating the last relation yields the first assertion immediately.

Next we show that it satisfies the optimality equation (2.1). Since $Pf(i) = P_B f(i)$ for $i \in B$ and any $f = f(i)$, by the closedness of $B$, we have

Pv(i) - c(i) = P_B v(i) - c(i) = P_B r(i) - c(i) = Pr(i) - c(i) \quad \text{on } B.

Hence

\max\{r(i),\ Pv(i) - c(i)\} = \max\{r(i),\ Pr(i) - c(i)\} = r(i) \quad \text{for } i \in B.

On the other hand, by substitution of (2.7),

Pv(i) - c(i) = P_{\bar{B}} v(i) + P_B v(i) - c(i) = P_{\bar{B}} N_{\bar{B}}(P_B r - c)(i) + P_B r(i) - c(i) = N_{\bar{B}}(P_B r - c)(i), \quad i \in \bar{B}.

On $\bar{B}$, the definition of $B$ gives $r(i) < Pr(i) - c(i) = P_{\bar{B}} r(i) + P_B r(i) - c(i)$, which implies $r(i) - P_{\bar{B}} r(i) < P_B r(i) - c(i)$. Iterating this inequality, the assumptions (2.5) and (2.6) give $r(i) \le N_{\bar{B}}(P_B r - c)(i)$ for $i \in \bar{B}$. Combining the above assertions,

\max\{r(i),\ Pv(i) - c(i)\} = \max\{r(i),\ N_{\bar{B}}(P_B r - c)(i)\} = N_{\bar{B}}(P_B r - c)(i) \quad \text{for } i \in \bar{B}.

Thus the value $v(i) = v(i;\tau_B)$ satisfies the optimality equation.

It now remains to calculate the potential $N(Pr - r - c)^+(i)$, $i \in S$. From the definition of the set $B$ in (1.2), we have $(Pr - r - c)^+(i) = 0$ on $B$. So the support of the charge is the complement of $B$, and hence, by the closedness of $B$, $N(Pr - r - c)^+(i) = 0$ on $B$. On the other hand, since $(Pr - r - c)^+(i) = (Pr - r - c)(i)$ for $i \in \bar{B}$, we have

P(Pr - r - c)^+(i) = P_{\bar{B}}(Pr - r - c)(i) = [(P_{\bar{B}})^2 r + P_{\bar{B}} P_B r - P_{\bar{B}} r - P_{\bar{B}} c](i), \quad i \in \bar{B}.

Repeating this procedure up to $k$ steps and adding the terms,

\{I + P + P^2 + \cdots + P^k\}(Pr - r - c)^+(i) = (P_{\bar{B}})^{k+1} r(i) + [\{I + P_{\bar{B}} + \cdots + (P_{\bar{B}})^k\}(P_B r - c)](i) - r(i), \quad i \in \bar{B}.

Hence

N(Pr - r - c)^+(i) = N_{\bar{B}}(P_B r - c)(i) - r(i), \quad i \in \bar{B},

follows immediately from the assumptions (2.5) and (2.6). Q.E.D.

We remark that the upper bound on the optimal value in Theorem 3.6 of Darling[2] equals exactly the optimal value in this case. That is, the bound holds with equality when the OLA policy is optimal and equalizing. This explicit solution and Propositions 2.1 and 2.2 determine the optimal value and policy completely in the OLA-optimal stopping problem.
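As a quick plausibility check of (2.7), not part of the paper, one can compare the potential expression with plain value iteration for (2.1) on the toy chain from the sketches above; P, r, c, and potential are the illustrative objects defined there, and for that chain the set B is closed and the assumptions (2.5), (2.6) hold.

```python
# Numerical check of Theorem 2.1 on the illustrative chain above.
charge = np.maximum(P @ r - r - c, 0.0)   # (Pr - r - c)^+
v_pot = r + potential(P, charge)          # v = r + N(Pr - r - c)^+, eq. (2.7)

v = r.copy()                              # value iteration for (2.1)
for _ in range(10_000):
    v_next = np.maximum(r, P @ v - c)
    if np.abs(v_next - v).max() < 1e-12:
        break
    v = v_next

assert np.allclose(v, v_pot, atol=1e-8)   # the two values agree
print(v_pot)
```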
3. The Best Choice Problem.

In this section we apply the previous method to the typical stopping problem known as the best choice problem. By taking a scale limit, the asymptotic form of the problem is considered in order to get an analytic explicit solution. In the asymptotic form the state space of the problem is not countable but the unit interval; however, the previous method applies to this case immediately.

Some of the results are well known and the optimal values have already been obtained by other methods, for example by using differential equations. Here we intend to illustrate, for these applications, how the optimal value can be determined by the unified method above. Each of these cases is an OLA-optimal stopping problem.

3.1. The Classical Secretary Problem.

The secretary problem, variously called the dowry problem or Googol, is an optimal stopping problem based on the relative ranks of objects arriving in random order; the objective is to find the stopping rule that maximizes the probability of stopping at the best object of the sequence. The optimality equation for the optimal value $v(i)$, $i \in S$, the maximal probability of a win, on the state space $S = \{1,2,\ldots,n\}$ is

v(i) = \max\{ i/n,\ i \sum_{j=i+1}^{n} v(j)/((j-1)j) \}, \quad i \in \{1,2,\ldots,n-1\}, \qquad v(n) = 1,

where there are no costs. This formulation of the problem as a Markov process is given by Dynkin and Yushkevich[4]. Also refer to Chow/Robbins/Siegmund[1] or Shiryaev[11]. To consider the problem in asymptotic form, take the scale limit $i/n \to x$ as $i, n \to \infty$. The equation becomes

(3.1) v(x) = \max\{ x,\ x \int_x^1 v(y) y^{-2}\, dy \}, \quad x \in S = [0,1].

The solution of (3.1) is well known:

(3.2) v(x) = e^{-1} \text{ on } [0, e^{-1}], \qquad v(x) = x \text{ on } [e^{-1}, 1].

Now we extend the problem by changing the reward function $r(x) = x$, which in the classical secretary problem means the probability of choosing the best object, to a general reward $r(x)$. To ensure that the OLA policy is still optimal, we assume, for the function $h(x)$ defined by

(3.3) h(x) = r(x) - x \int_x^1 r(y) y^{-2}\, dy \quad \text{on } [0,1],

that i) each term of the function $h(x)$ is finite-valued on $[0,1]$, and ii) it changes its sign from $-$ to $+$ only once as $x$ increases from 0 to 1, so that the equation $h(x) = 0$ has a unique root $a$. The optimality equation (3.1) with the reward function $r(x)$ is

(3.4) v(x) = \max\{ r(x),\ x \int_x^1 v(y) y^{-2}\, dy \}, \quad x \in (0,1],

provided that the underlying Markov chain is unchanged. The solution of this equation (3.4) is

(3.5) v(x) = r(a) \text{ on } [0,a], \qquad v(x) = r(x) \text{ on } [a,1],

where $a$ is defined in (3.3 ii). In fact, one can show (3.5) by applying the previous result. A straightforward calculation yields that the set $B$ becomes $[a,1]$ and that it is closed with respect to the transition probability

P(x, dy) = x y^{-2}\, dy \quad \text{for } 0 < x < y \le 1;

the assumptions (2.5) and (2.6) can also be checked directly, the potential in (2.6) being finite-valued for $x \in [0,a]$. (A numerical illustration of the threshold equation $h(a) = 0$ is sketched at the end of this section.)

3.2. A Problem with Refusal Probability.

A variant of the best choice problem is the case with a refusal probability, discussed by Smith[12]. We can also formulate this problem as optimal stopping on a Markov chain. The asymptotic form of the transition probability is obtained immediately, and we have

P(x, dy) = p (x/y)^p y^{-1}\, dy \quad \text{for } 0 < x < y \le 1,

where $p$ is the probability that a selected object accepts the offer.
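The threshold $a$ of (3.3) is easy to find numerically. The sketch below, which assumes SciPy's quad and brentq are available, solves $h(a) = 0$ for the classical reward $r(x) = x$ and recovers the familiar value $a = e^{-1}$ of (3.2); it is an illustration of subsection 3.1, not code from the paper.

```python
import math
from scipy.integrate import quad
from scipy.optimize import brentq

def h(x, r):
    """h(x) = r(x) - x * int_x^1 r(y) y^{-2} dy, as in (3.3)."""
    integral, _ = quad(lambda y: r(y) / y**2, x, 1.0)
    return r(x) - x * integral

r = lambda x: x                          # classical reward of subsection 3.1
a = brentq(lambda x: h(x, r), 1e-6, 1.0 - 1e-9)
print(a, 1.0 / math.e)                   # root a is close to e^{-1} = 0.3678...
print("value on [0, a]:", r(a))          # v(x) = r(a) on [0, a] by (3.5)
```

For $r(x) = x$, (3.3) reduces to $h(x) = x(1 + \ln x)$, so the root $a = e^{-1}$ and the value $r(a) = e^{-1}$ match (3.2) exactly.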