James G. March, Learning to Be Risk Averse, Psychological Review, 1996
Abstract
Standard behavioral theories of risk taking hold that people are risk averse for gains and risk seeking for losses. But this pattern may also result from accumulated learning, rather than from calculated choice based on inexplicable human traits. The simulations show that standard learning models lead to greater risk aversion for gains than for losses. This effect is significant and persistent, especially for fast learners. Learning to choose among outcomes in the negative domain leads to more risk taking in the short run but to risk neutrality in the long run.
This paper demonstrates that risk aversion, a component of rational choice, may be produced by learning from experience.
I. Explaining Risk Taking Behavior
The "two-armed bandit" problem in rational choice and the T-maze are two settings for studying individual behavior in which the consequences of two alternatives are seen only after they are chosen.
Risk taking with gains and risk taking with losses
If a person prefers k with certainty to a gamble that pays k/r with probability r (0<r<1) and nothing otherwise, they are generally seen as risk averse (and vice versa). Generally, less risky choices are made when k is positive than when it is negative. Usually the reference point is based on some aspiration level or target that is presumed stable (but may actually vary endogenously). p. 4
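As a quick check on why this comparison isolates risk attitude: the choice is between k for certain and a gamble paying k/r with probability r and 0 otherwise (the setup simulated later in the notes). The two have the same expected value, so a strict preference for one over the other reflects attitude toward risk rather than a difference in expected return. The symbol X for the gamble's payoff is introduced here only for illustration.

```latex
\mathbb{E}[X] = r \cdot \frac{k}{r} + (1 - r) \cdot 0 = k
```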
Theories of choice and risk taking
In monetary terms, risk aversion corresponds to a decreasing marginal utility for money. But this approach doesn't account for observed differences in risk taking between gains and losses. Theorists compensate by assuming a decreasing marginal disutility for losses.
Another way is to use an aspiration level. People who are above the target avoid high risks in order to reduce the chance of falling below it. But when they are below the target, they are willing to take more risk for a chance of getting above it.
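A minimal sketch of the utility argument, assuming a square-root value function chosen purely for illustration (the notes do not specify a functional form). It compares k for certain with a gamble paying k/r with probability r and 0 otherwise, once for a gain and once for a loss. The names `utility` and `compare` are hypothetical.

```python
import math

def utility(x: float) -> float:
    """Illustrative value function (an assumption, not from the paper):
    decreasing marginal utility for gains, decreasing marginal disutility for losses."""
    return math.sqrt(x) if x >= 0 else -math.sqrt(-x)

def compare(k: float, r: float) -> tuple[float, float]:
    """Return (utility of k for certain, expected utility of the gamble)."""
    certain = utility(k)
    gamble = r * utility(k / r) + (1 - r) * utility(0.0)
    return certain, gamble

for k in (1.0, -1.0):
    certain, gamble = compare(k, r=0.1)
    choice = "certain" if certain > gamble else "gamble"
    print(f"k={k:+.0f}: certain={certain:.3f}, gamble={gamble:.3f}, prefers {choice}")
```

Under this assumed function the sure thing wins for the gain and the gamble wins for the loss, which is the asymmetry the utility-based account is meant to reproduce.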
Learning and Risk Taking
Another explanation is to see it as a response learned from experience. Favorable experiences are assumed to increase the propensity to choose an alternative historically associated with them. Unfavorable experiences decrease that propensity. p. 6
Suppose individuals learn how to respond to situations involving risk the same way they learn other things, by experiencing the apparent consequences of their behavior and modifying their rules of behavior as a result of accumulated experience.
II. Models of Experiential Learning
Each model is characterized by:
1. A set of alternatives (described by a probability distribution over returns, conditional on choice)
2. A learner (defined by a probability vector over alternatives)
3. A decision rule (chosen probabilistically from probability vector)
4. An outcome rule (a return drawn from the distribution of the chosen alternative)
5. A learning rule (to modify probability vector based on the outcome)
Pi,t is the probability of choosing alternative i on trial t.
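The five components listed above can be sketched as a single trial function. This is only an illustrative skeleton: the names `run_trial`, `draw_outcome`, and `learn` are hypothetical, and the learning rule is passed in so that any of the models in these notes could be plugged into it.

```python
import random
from typing import Callable, List, Tuple

def run_trial(
    probs: List[float],
    draw_outcome: Callable[[int, random.Random], float],
    learn: Callable[[List[float], int, float], List[float]],
    rng: random.Random,
) -> Tuple[int, float, List[float]]:
    """Run one learning trial and return (choice, outcome, new probabilities)."""
    # Decision rule: pick an alternative with probability given by the vector.
    chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Outcome rule: draw a return from the chosen alternative's distribution.
    outcome = draw_outcome(chosen, rng)
    # Learning rule: revise the probability vector in light of the outcome.
    return chosen, outcome, learn(probs, chosen, outcome)

# Usage with placeholder rules (both are assumptions for illustration).
rng = random.Random(0)
no_learning = lambda probs, chosen, outcome: list(probs)
payoff = lambda i, rng: 1.0 if i == 0 else (10.0 if rng.random() < 0.1 else 0.0)
choice, outcome, probs = run_trial([0.5, 0.5], payoff, no_learning, rng)
```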
Fractional Adjustment Model
If the outcome of the previous choice is favorable, Pi,t increases. For a reward after choosing the ith alternative,
Pi,t+1 = Pi,t + a(1-Pi,t)
if not rewarded,
Pi,t+1 = (1-a)Pi,t where 0<a<1
The greater the value of a, the faster the adaptation.
For variable rewards, the model is modified to:
Pi,t+1 = 1 - [(1-a)^k * (1-Pi,t)] for positive reward (k realized)
Pi,t+1 = Pi,t(1-a)^k for negative reward (-k realized)
Initially Pi,0 is often set to 1/n so each alternative is equally likely. The simplifications in this model are that the learning parameter for reward is the same as for non-reward, and that there are learning limits of 0 and 1.
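A minimal sketch of the variable-reward update, written for the probability of the alternative that was just chosen. The function name `fractional_update` is hypothetical. Treating a return of exactly 0 as "no change" is not an extra assumption: it follows from either formula, since (1-a)^0 = 1.

```python
def fractional_update(p_chosen: float, outcome: float, a: float) -> float:
    """Updated probability of the chosen alternative under fractional adjustment."""
    if not 0.0 < a < 1.0:
        raise ValueError("a must lie strictly between 0 and 1")
    shrink = (1.0 - a) ** abs(outcome)  # (1-a)^k, with k the size of the return
    if outcome >= 0:
        # Positive return: close part of the gap between P and 1.
        return 1.0 - shrink * (1.0 - p_chosen)
    # Negative return: scale P down toward 0.
    return p_chosen * shrink

print(fractional_update(0.5, 1.0, a=0.1))
print(fractional_update(0.5, -1.0, a=0.1))
```

With two alternatives, the probability of the alternative not chosen is simply one minus the returned value.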
Average return model
In this model Pi,t changes based on the history of returns of the various alternatives.
Ai,t = Si,t/mi,t = Sum of all returns for i / number of returns for i
We only consider cases where all returns lie in either the positive or the negative domain, so that all values of Ai,t have the same sign. The probability of choosing i in period t is then proportional to past average results.
For all positive A
Pi,t = Ai,t / Sum(Aj,t) (j = 1 to n)
For all negative A
1-Pi,t = Ai,t / Sum(Aj,t)
Initially we normally assume that mi,0 = 1 and Si,0 = Pi,0 = 1/n.
This is similar to earlier "matching models." The model assumes an infinitely long memory of results.
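A minimal sketch of how choice probabilities follow from the running sums Si,t and counts mi,t. The function name `average_return_probs` is hypothetical. The sketch is restricted to two alternatives because the negative-domain rule, 1 - Pi,t = Ai,t / Sum(Aj,t), only yields probabilities that sum to 1 when n = 2.

```python
from typing import List, Sequence

def average_return_probs(sums: Sequence[float], counts: Sequence[int]) -> List[float]:
    """Choice probabilities for two alternatives from summed returns and counts."""
    if len(sums) != 2 or len(counts) != 2:
        raise ValueError("this sketch covers the two-alternative case only")
    averages = [s / m for s, m in zip(sums, counts)]  # Ai,t = Si,t / mi,t
    total = sum(averages)
    if all(a > 0 for a in averages):
        # Positive domain: probability proportional to average return.
        return [a / total for a in averages]
    if all(a < 0 for a in averages):
        # Negative domain: the larger average loss gets the smaller probability.
        return [1.0 - a / total for a in averages]
    raise ValueError("all averages must share the same nonzero sign")

print(average_return_probs([1.5, 0.5], [2, 1]))
print(average_return_probs([-1.0, -3.0], [1, 1]))
```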
Weighted return model
In this model the learner updates the memory of returns after each choice of alternative i by computing a weighted average of the previous memory and the most recent experience. Ai,t is unchanged in any period where i is not chosen. If the ith alternative is chosen at t, Ai,t+1 is a mix of Ai,t and the outcome realized in period t, Oi,t, with the weights assigned to the two being (1-B) and B respectively. Thus,
Ai,t+1 = B(Oi,t) + (1-B)Ai,t (0<B<1)
Thus more recent experiences are weighted more heavily than more remote ones. The relative rate of "learning" or "forgetting" is indexed by B. The value of Pi,t is found in the same way as in the average return model. The model captures some elements of a limited-memory version of the average return model and is linked to theories of exponential smoothing.
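A minimal sketch of the weighted return update: the memory of the chosen alternative becomes B times the new outcome plus (1-B) times the old memory, and every other memory is left untouched. The function name `weighted_update` is hypothetical. The update has the same form as simple exponential smoothing.

```python
from typing import List, Sequence

def weighted_update(memory: Sequence[float], chosen: int, outcome: float, b: float) -> List[float]:
    """Return a new memory vector after alternative `chosen` yields `outcome`."""
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    updated = list(memory)  # copy, so unchosen alternatives stay unchanged
    updated[chosen] = b * outcome + (1.0 - b) * memory[chosen]
    return updated

memory = [1.0, 1.0]
memory = weighted_update(memory, chosen=1, outcome=10.0, b=0.1)
print(memory)
```

Applying the update repeatedly puts weight B(1-B)^j on the outcome observed j choices ago, which is why recent experience dominates.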
III. Exercising the Models
All models converge on fixed r for a two alternative problem. Both the fractional adjustment and average return reach exactly r, with the weighted return slightly above r for r>0.5.
To simulate the classic case we can suppose that ri is the probability that alternative i pays off and ki the magnitude of its reward. Then r1 = 1, r2 = 0.1, k1 = 1 or -1, and k2 = 100 or -100.
Two basic questions:
1. How does the distribution of Pi,t over a series of learning trials depend on the values of ri and ki?
2. How does the distribution of Pi,t over a series of learning trials depend on the learning rates a, m, and B?
Dependence on Ri and Ki
We can then simulate the classic experiment where alternative 1 has return k and alternative 2 has a return k/r with probability r and otherwise a return of 0.
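A minimal sketch of this experiment for the fractional adjustment model, not a reproduction of the paper's runs. The values k = 1 or -1, r = 0.1, a = 0.1, 300 trials, and 500 learners are illustrative assumptions, as are the function names. Each learner starts at equal odds, the chosen alternative's probability is updated with the variable-reward formulas, and the other alternative receives the complement.

```python
import random

def simulate_learner(k: float, r: float, a: float, trials: int, rng: random.Random) -> float:
    """Return one learner's final probability of choosing the risky alternative."""
    p_risky = 0.5  # both alternatives start equally likely
    for _ in range(trials):
        chose_risky = rng.random() < p_risky
        if chose_risky:
            outcome = k / r if rng.random() < r else 0.0  # k/r with probability r
        else:
            outcome = k  # the certain alternative always returns k
        p_chosen = p_risky if chose_risky else 1.0 - p_risky
        shrink = (1.0 - a) ** abs(outcome)
        if outcome >= 0:
            p_chosen = 1.0 - shrink * (1.0 - p_chosen)
        else:
            p_chosen = p_chosen * shrink
        p_risky = p_chosen if chose_risky else 1.0 - p_chosen
    return p_risky

def mean_risky_probability(k: float, r: float = 0.1, a: float = 0.1,
                           trials: int = 300, learners: int = 500, seed: int = 0) -> float:
    """Average final probability of the risky choice across independent learners."""
    rng = random.Random(seed)
    total = sum(simulate_learner(k, r, a, trials, rng) for _ in range(learners))
    return total / learners

gains = mean_risky_probability(k=1.0)
losses = mean_risky_probability(k=-1.0)
print(f"mean P(risky), gains:  {gains:.3f}")
print(f"mean P(risky), losses: {losses:.3f}")
```

Comparing the two printed averages gives a rough view of how the propensity to take the risky alternative differs between the gain and loss domains under these assumed settings.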
Greater Risk Aversion for gains than for losses
The results show that these learning models tend to choose the riskier alternative more often when k is negative than when k is positive. The tendency is the same for fractional adjustment and weighted return, and the average return model only converges after 50,000 trials. However, in the long run only the fractional adjustment model maintains risk seeking for losses. Learning that conforms to the average return and weighted return models ultimately results in risk neutrality for losses (in combination with risk aversion for gains).
Risk Premiums
The least biased model (average return) requires that the expected value of the risky reward be at least 4 times the certain reward to get equality in choice between the two alternatives (19 times for fractional adjustment and 13 times for weighted return).
Effects of a, m, and B
High values of a and B and low values of m are associated with fast learning. Fast learning accelerates the tendency toward risk aversion in the positive domain. Slow learners are more likely than fast learners to make enough risky choices to realize an occasional very favorable consequence.
Effects of Skewness
When r is less than 0.5, the risky alternative in the positive domain has a positively skewed outcome distribution (and a negatively skewed one for losses). The skewness of these distributions affects the mix of outcomes experienced, so one might speculate that it is what produces the risk-averse or risk-seeking behavior. But this speculation is not borne out in the simulation: for no value of r is average risk aversion greater for losses than for gains.
IV. Interpreting the Results
Note that the same process produces a distribution of risk taking behavior (some are prone to take risks for gains and not for losses).
Learning as Sequential Sampling
Learning modifies the sampling rate of the alternatives (by changing behavior), which in turn changes the outcomes experienced. By shifting sampling from inferior to superior alternatives, the process improves performance. However, reducing the sampling rate of an alternative reduces the ability to measure it accurately, which can be disadvantageous for alternatives with high-variance returns.
Quickly focusing choice can thus reduce the sampling rates of good alternatives. Small-sample learning with respect to risky alternatives is quite likely to be misguided. Alternatives that are usually fairly poor but occasionally very good are likely to be interpreted as worse than they really are. Likewise, alternatives that are usually fairly good but occasionally poor will be seen as better than they really are. Implicit overestimation is self-correcting, but underestimation is not.
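A small worked calculation of why small samples mislead, assuming a risky alternative that pays off with probability r = 0.1 (an illustrative value) and otherwise returns 0. A learner who tries it n times and never sees the payoff has observed an average of 0, below its true expected value. The function name `prob_never_rewarded` is hypothetical.

```python
def prob_never_rewarded(r: float, n: int) -> float:
    """Probability that n independent tries of the risky alternative all return 0."""
    return (1.0 - r) ** n

for n in (1, 5, 10, 50):
    print(f"n={n:>2}: P(no payoff seen) = {prob_never_rewarded(0.1, n):.3f}")
```

The chance of a misleadingly poor sample falls only slowly with n, so a learner who abandons the alternative after a handful of tries will usually do so without ever having seen its good outcome.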
In this simple two-choice model the certain choice is likely to be more rewarding, which translates into a propensity to choose the less risky alternative. The joint probability of choosing the risky alternative and being rewarded for it becomes smaller and smaller. But in the domain of losses, sampling is more self-correcting. Each choice of the certain alternative results in a greater loss than most choices of the risky alternative, which reduces the probability of making that choice. Thus, there is more sampling of the risky alternative. The result is that learners tend to oscillate between the two alternatives, which brings overall behavior closer to risk neutrality.
Implications of a Learning Perspective on Risk Taking
Based on this model, risk "preference" can be interpreted as a learned response. The greater the risk and the faster the learning, the more profound the effect.
Implications for Interpreting Risk Aversion
The usual arguments are that knowledge increases ability and reliability, reducing involuntary risk taking. Also, local experience gives an advantage to familiar, safe alternatives. As the learning process proceeds, those clearer and closer returns come to dominate returns that are more distant.
The same processes that teach people to add sums correctly also teach risk aversion for gains, and those who learn one quickly also learn the other quickly.
Much of experiential learning has to be somehow accumulated across situations and individuals, often through heuristics or rules for dealing with risky situations. Such rules accumulate the lessons across individuals and non-repetitive situations in the same basic way that individuals do across repetitive situations. Rules of "risk aversion" in the positive domain, if shared, will produce risk-averse behavior.
"The behavior stems from rules, and the rules stem from a history of ordinary experiential learning that yields "risk" propensities as a side-effect."
Implications for Influencing Risk Taking
Insofar as risk aversion in the positive domain is a result of learning, encouraging risk seeking for gains involves slowing down the rate of learning. For any level of riskiness there is a learning rate slow enough to find the true values of the alternatives. Slow learning can result from indoctrination outside of experience, from limited learning experience, or from noise in the learning experience.
Implications for Adaptiveness
These results imply that maybe fast, precise learning isn't all that adaptive.
Cautions
It appears that the learning effects described here will be generated by any learning process in which information about the alternatives can only be gained from choosing them and in which choice depends on experience with alternatives.
The model is not generalized beyond two alternatives or to situations where one has information about the other alternatives. This case also makes the success/failure reference point stable.