Advice:
A Clarification
The term “leading practice” needs
some clarification. The leading practice in ranking IT investments
could mean the most popular methods, or methods with the most
scientifically measured evidence of actually improving decisions.
Unfortunately, there is almost no overlap between these two groups. Not
only is there no evidence that the most popular methods work, there is
significant evidence in published, scientifically controlled studies
that they may be worse than unaided intuition. On the other hand, the
methods that show measurable improvement in management decisions are
not widely used. Fortunately, although the most effective methods are
obscure, they are entirely practical.
There has been a significant
amount of research in the field of decision analysis regarding the
effectiveness of various decision analysis methods, including weighted
scoring methods. This Personal Advisory Report will build on the prior
research to discuss the issues with scoring methods, and identify
methods that have scientifically measured evidence of improving
decisions. Finally, we will discuss some additional factors that you
may consider in your model, and how they might be defined.
Critical Issues with Scoring Methods
The most popular type of decision
analysis method used in ranking IT projects is very likely the weighted
ordinal score. An “ordinal” score is not a quantitative measurement
like orders per week, duration in months or average lifetime revenue per
customer, but simply a statement that one thing is more than another.
It is a common mistake to treat ordinal scores as proper scales.
For example, a four-star restaurant is better than a two-star restaurant,
but not necessarily exactly twice as good. Some methods then assign some
number of points to each of these values or weight them in some other
way.
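To make the point about arbitrary point values concrete, the following minimal sketch ranks two projects under two point mappings that both respect the order low < medium < high. The project names and ratings are hypothetical and invented purely for illustration; the 1/5/10 and 1/3/9 mappings are the ones mentioned elsewhere in this report.

```python
# Hypothetical ratings for two projects on three equally weighted factors.
ratings = {
    "Project A": ["high", "low", "low"],
    "Project B": ["medium", "medium", "medium"],
}

# Two point mappings; both preserve the ordinal order low < medium < high.
mapping_1_5_10 = {"low": 1, "medium": 5, "high": 10}
mapping_1_3_9 = {"low": 1, "medium": 3, "high": 9}


def total_score(levels, mapping):
    """Sum the points assigned to each ordinal response."""
    return sum(mapping[level] for level in levels)


def ranking(mapping):
    """Return project names ordered from highest to lowest total score."""
    return sorted(
        ratings,
        key=lambda name: total_score(ratings[name], mapping),
        reverse=True,
    )


for label, mapping in [("1/5/10", mapping_1_5_10), ("1/3/9", mapping_1_3_9)]:
    print(label, ranking(mapping))
```

The ordinal ratings never change, yet the order of the two projects flips between the mappings. The rank is therefore driven by the choice of points, not by anything measured about the projects.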
Although there appears to be a
strong belief that decisions are better with ordinal scoring methods, there
is no published evidence that they have caused decision makers to make
measurably more consistent, more reliable, or better performing
decisions in any area (IT portfolios, public health, etc.). The scores
and the calculations in which they are used are arbitrarily defined
and are not based on any quantitative analysis that optimizes the track
record of decisions. These methods seem to be developed in complete
isolation from the well established and still growing field of decision
science / decision analysis. In fact, there is evidence that these
methods actually add significant error to the decision making
process, and any perception of an improved decision process is purely
imagined. Here is a summary of the research combined with this author’s
own documented measurements:
1. Studying the “scoring behavior” of
previous evaluations is critical but rarely done. No public opinion
survey would ever be considered credible without detailed consideration
for response behavior, yet most scoring methods make no attempt to
consider these issues. People choose their responses for all sorts of
reasons which the designers of the system never intended to affect the
outcomes. For example, arbitrary changes in wording or question order
seem to significantly change responses.
2. Another effect is inadvertent
weighting. Because responses are distributed differently from one
factor to another, the factors end up with an unintended relative
importance. If we look at how past responses are distributed in a
three-state score (high/medium/low), some factors will have most of
their responses in a single value and others will be evenly
distributed among high, medium and low. The factors that happen to
have more widely distributed responses will end up having a larger
impact on the resulting rank than a factor where, say, 85% of
responses are “medium,” even if the latter is considered to be equally
or even more important.
3. A particularly important effect of
scales with only three possible responses is what one researcher calls
“range compression.” This is a tendency for many very different
projects to cluster within a small range of total scores: since there
are only three choices, projects with very different characteristics
must be given the same scores for several factors. As a result,
changing a single factor on one project from low to medium may make a
huge change in rank.
4. Another assumption in scoring methods
that adds significant error is the assumption that the scales or
weights roughly approximate some real-world measure. In your case, the
scoring method prescribes a value of 1, 5 and 10 for low, medium and
high, respectively, and this is applied identically to every factor. This
literally means that medium is considered five times as desirable as
low but only half as desirable as high. But when a historical analysis of
projects is completed, it is often found that the relative values of
high, medium and low are dramatically different when optimized to
predict outcomes. In one case, I found that a three-point scale for
technology maturity that assigned relative scores of 1, 3 and 9 was actually
closer to 1, 16.2 and 17.5 when optimized, and in another case the
scale for sponsor level was closer to 1, 1.1 and 10.5.
5. Finally, it appears that even scales that their designers
consider very well defined will be interpreted very
differently by various users. This creates what researchers call an
“illusion of communication,” where two managers agree that some value is
“medium” but on further questioning reveal that they believed very different
things about what they were evaluating.
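The inadvertent weighting described in item 2 can be checked directly from past responses. The sketch below is only an illustration under stated assumptions: the response counts are hypothetical (20 past projects per factor), the 1/5/10 points are the ones this report attributes to your current method, and the standard deviation of the points is used as a rough proxy for how much a factor can move a project's total score.

```python
from statistics import pstdev

# Point values from the scoring method discussed in this report.
POINTS = {"low": 1, "medium": 5, "high": 10}

# Hypothetical past responses for two factors across 20 projects.
evenly_spread = ["low"] * 7 + ["medium"] * 6 + ["high"] * 7
mostly_medium = ["low"] * 1 + ["medium"] * 17 + ["high"] * 2  # 85% medium


def score_spread(responses):
    """Standard deviation of the points a factor contributes to total scores."""
    return pstdev([POINTS[r] for r in responses])


print("evenly spread factor:", round(score_spread(evenly_spread), 2))
print("mostly medium factor:", round(score_spread(mostly_medium), 2))
```

With these made-up counts, the evenly spread factor varies roughly twice as much as the factor that is 85% “medium.” It therefore does more to separate projects in the final ranking, even though both factors carry the same nominal weight.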
All of these added
(and completely avoidable) sources of error in the decision-making
process have led some researchers to conclude that such scoring methods
are literally “worse than useless.” Whatever decision errors such
methods may remove, they apparently add more. That is,
in controlled measurements of
decisions (based on portfolio returns, forecasting accuracy, etc.),
expert intuition alone often beats experts who use these weighted
scores.
However, there are methods that
have been shown, under controlled measurements, to help decision makers
make better decisions. The problems introduced above can be avoided
with certain controls.
Recommended Methods
It is highly recommended that you
investigate the following methods. These are slightly more complicated
than the simple scoring method you have been using (and will take more
time to explain in detail than one PAR) but all of these methods have
been implemented in real-world environments many times. Furthermore,
the additional effort is easily justified for the IT budget a company
like yours is likely to have. Each of these topics has also been
described in more detail in this author’s book How to Measure
Anything: Finding the Value of Intangibles in Business:
1. Historical Regression Models: One of
the best methods appears to be based on regression models of historical
data. Although managers often assume that new projects and investments
are so unique that historical analysis would be useless, the fact is
that historical analysis routinely outperforms the intuitive judgments
of managers. The only reason this may not be the preferred approach is
that historical outcomes have not been consistently documented. If you
have been diligently tracking all past evaluations, and tracking them
against project results, then this is the best approach.
2. The Lens Method: Unlike historical
regression, this method requires no historical data of any kind, relies
entirely on subjective inputs, yet consistently outperforms managers’
intuition. The Lens method was developed in the 1950s and has been used
in a variety of decision analyses including military logistics, IT
portfolios, cancer patient prognosis, business risks, and hiring staff.
Everywhere it has been applied, it has shown a measurable improvement
on expert decision making, and in some situations the improvement was
dramatic. Instead of applying scores like 1, 5, and 10 to responses,
the Lens method computes relative weights based on a statistical
interpretation of a large number of subjective judgments. In effect,
the Lens method creates a statistical model of an expert’s judgment,
with the benefit that the model beats the unaided expert. This appears
to happen because, regardless of the level of experience of experts,
they are still unable to apply their judgment consistently. The Lens
method creates a 100% consistent model of the expert’s judgment.
3. Calibrated Probability Assessments:
Risk needs to be expressed in terms of probabilities and magnitudes of
specific losses. A common error is to assume that, because there is
insufficient data for “exact” probabilities, soft scales somehow
alleviate the problem of insufficient data. As previously mentioned,
the soft scales add their own error. Furthermore, even subjective
probabilities can be useful under certain circumstances. First,
extensive research shows that managers can be trained to provide
“calibrated” subjective probability estimates. That is, when they claim
to be 80% confident in their predictions, they will be correct about 80% of
the time. Without such training, managers seem to have great difficulty
assessing risks. Second, predictions and actual results need to be
tracked and documented so that subjective probabilities can improve
over time. This can be used in conjunction with the Lens method.
4. Several of the existing factors may be
“soft,” but this is not necessarily a hindrance to measurably improved
decisions. If several independent assessments are collected by secret
ballot and averaged, the result can still serve as a useful measure.
This can also be used in conjunction with the development of a model
based on the Lens method.
5. Probabilistic Simulations: For some projects, you should
consider a proper risk/return analysis based on Monte Carlo
simulations. A company the size of yours would likely have at least one
project each year that would justify a detailed quantitative
analysis, and it would be entirely practical. I have applied such
methods to more than 60 major investments of various sizes — some were
as small as $1 million. (I would consider that the lower limit for this
level of analysis.) If this method is included in your repertoire, then
the basic ranking approach discussed so far becomes a method to
categorize projects as “accept”, “reject” or “needs detailed
risk/return analysis.” While it is likely that simple ranking methods
(even with the improvements mentioned) are all that is needed to rank
most projects, it is also very likely that some projects justify a much
more detailed analysis.
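As a rough illustration of the probabilistic simulation in item 5, the sketch below runs a very small Monte Carlo model for a single project. Every input is a hypothetical assumption made for this example: the 90% confidence intervals for cost and annual benefit, the 15% chance of cancellation, the three-year benefit horizon, and the choice of normal distributions. A real analysis would use the organization's own estimates and a richer model.

```python
import random


def normal_from_90ci(rng, lower, upper):
    """Draw from a normal distribution whose 90% interval is (lower, upper)."""
    mean = (lower + upper) / 2
    sd = (upper - lower) / 3.29  # a 90% interval spans about 3.29 standard deviations
    return rng.gauss(mean, sd)


def simulate_net_benefit(rng):
    """One scenario: three-year benefits minus cost, or a total loss if cancelled."""
    cost = normal_from_90ci(rng, 0.8e6, 1.6e6)            # hypothetical range
    annual_benefit = normal_from_90ci(rng, 0.2e6, 0.9e6)  # hypothetical range
    cancelled = rng.random() < 0.15                        # hypothetical probability
    if cancelled:
        return -cost
    return 3 * annual_benefit - cost


def run(trials=10_000, seed=1):
    """Return the mean net benefit and the share of scenarios that lose money."""
    rng = random.Random(seed)
    results = [simulate_net_benefit(rng) for _ in range(trials)]
    mean_net = sum(results) / trials
    prob_loss = sum(1 for r in results if r < 0) / trials
    return mean_net, prob_loss


mean_net, prob_loss = run()
print(f"mean net benefit: {mean_net:,.0f}")
print(f"chance of a net loss: {prob_loss:.0%}")
```

The output is a distribution rather than a single ROI figure, which is what allows risk to be stated as a probability of a specific loss.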
Comments on Decision Criteria
Whether using historical
regressions or the Lens method, some new factors could be added to the
model. Many of the existing factors you currently use could be modified
using nothing but item (4) above.
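The idea of building a statistical model of an expert's judgment can be sketched with an ordinary least-squares regression of the expert's estimates on the factors shown to them. This is an assumption about the general form of such a model, not a description of the exact procedure used in practice. The ten scenarios and the expert's success estimates are invented for illustration, and `fit_linear`, `solve` and `predict` are hypothetical helper names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(cues, judgments):
    """Least-squares fit; returns [intercept, weight_1, ..., weight_k]."""
    rows = [[1.0] + list(c) for c in cues]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, judgments)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, cue_values):
    """Apply the fitted model to one project's factor values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], cue_values))


# Hypothetical scenarios: (planned duration in months, C-level sponsor 0/1,
# new software development 0/1), each with one expert's estimated chance of success.
scenarios = [(6, 1, 0), (12, 1, 0), (24, 1, 1), (6, 0, 0), (18, 0, 1),
             (36, 0, 1), (12, 0, 0), (30, 1, 0), (9, 1, 1), (24, 0, 0)]
expert_judgments = [0.90, 0.80, 0.50, 0.75, 0.35, 0.15, 0.70, 0.55, 0.70, 0.50]

weights = fit_linear(scenarios, expert_judgments)
print("fitted weights:", [round(w, 3) for w in weights])
print("model estimate for a new project:", round(predict(weights, (15, 1, 0)), 2))
```

Once fitted, the model gives the same answer every time it sees the same inputs, which is the consistency benefit described under the Lens method above. The fitted weights replace arbitrary values such as 1, 5 and 10.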
Based on other historical
regression models I have performed, the first four items listed below
seem to have a consistent and significant impact on project success and
relative value. Some of these are new factors or specific recommendations
for factors already considered in your method:
- Sponsor level — In
objective historical models, this factor has turned out to have
one of the highest correlations to project success. Ideally, this
person has formally accepted business responsibility for the
project by signing a document to that effect. The levels may be
“C-level,” “SVP,” “VP,” “Manager” or whatever applies in the
organization.
- Planned duration
(months) — The initially planned duration of a project is also a
strong predictor of eventual success. The longer a project, the
more likely unforeseen issues will interfere with success, and the
harder it is to accurately predict costs. For these reasons, the
chance of even outright project cancellation is strongly
correlated with planned duration.
- Technology familiarity
— Many firms identify some key success factor related to how well
proven a new technology may be, or at least how familiar their own
staff is with it. Binary (yes/no) statements are recommended —
e.g., “We have used this technology for more than two years” or
“This technology has been used in the industry for more than three
years.”
- Vendor reliability —
Vendor-related concerns often boil down to proven reliability,
competence and long-term viability. Again, for simplification, you
may define a yes/no question such as “This vendor has been used
previously and was given a favorable rating for its deliverables.”
- New software
development — Large, new software development initiatives are
among the highest risk projects most firms will engage in.
Separating out how much of the proposed project is actually new
software development (as opposed to packaged software, hardware,
facilities, etc.) is often enlightening. This can be expressed as
a dollar amount or FTEs unique to the software development part of
the initiative.
- New business process —
This can be a binary (yes/no) question about whether the project
involves developing a new business process. If the project merely
automates an existing process, or simply updates hardware or packaged
software, it will probably not carry risks as high as a project
developing a previously untested process. However, new processes
often have the potential for higher return.
- Organizational scope —
Enterprise-wide projects are invariably higher risk than those
with a narrower scope. Likewise, projects in certain narrow areas of
critical operations probably have straightforward benefits as well as
lower risk. As with new business processes, a wider organizational scope
may bring both higher risk and higher potential return. One or more
binary questions like “Is this an enterprise-wide initiative?” or
“Is this project limited to supporting sales and marketing?” would
be useful.
- Cost elimination per
year/investment size — At this stage, ROI based on exact,
deterministic estimates is less useful than many managers think,
but only because the methods typically used to compute ROI turn
out to be highly unreliable and, therefore, don’t correlate
strongly with project success. (If best practices are used, a
proper risk/return analysis should be all that is required for any
investment decision.) But if there are known categories of
significant cost elimination, this should be shown as a ratio
compared to the planned investment size. An independent source
from business operations should confirm these savings.
- Customer service or
revenue — A customer survey specifically indicated a need or
problem specifically addressed by the proposed initiative. The
survey indicates that it is a significant factor in future
purchasing decisions. Claims that customer service or even revenue
will increase are among the most consistently abused in ranking
methods. If any such claim is made, it must be based on the
evidence from detailed measurements after previous projects or
some extensive customer survey combined with purchasing behavior
research. On the other hand, if a full risk/return analysis is
used, this can be incorporated in a practical way.
- Strategic — The
strategic factor you currently identify could be made more
objective if it were based on a survey of upper management and/or
based on specific references in published strategic plans. Again,
an average of assessments from several independent persons could be
used to assess this value.
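Averaging several independent assessments, as suggested for the strategic factor and other “soft” factors, takes very little machinery. In the sketch below the 0 to 10 rating scale, the five assessors and their ballots are all hypothetical. The spread is reported next to the mean as a simple, assumed way to flag factors on which assessors disagree.

```python
from statistics import mean, stdev

# Hypothetical secret-ballot ratings (0 to 10) of strategic alignment,
# one rating per independent assessor.
ballots = {
    "Project A": [7, 8, 6, 7, 9],
    "Project B": [3, 8, 5, 9, 2],
}


def summarize(scores):
    """Average the independent ratings and report how much they disagree."""
    return {"mean": mean(scores), "spread": stdev(scores)}


for project, scores in ballots.items():
    summary = summarize(scores)
    print(project, round(summary["mean"], 1), round(summary["spread"], 2))
```

A large spread, as for the second project here, suggests that the assessors may not share the same understanding of the factor, which is worth resolving before the average is used in a model.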
Sources and Referrals:
D. Kahneman, P. Slovic, and A. Tversky, Judgement under Uncertainty:
Heuristics and Biases, Cambridge: Cambridge University Press. 1982.
D. Kahneman and A. Tversky,
“Subjective Probability: A Judgement of Representativeness”. Cognitive
Psychology. 4: p. 430-454(1972).
D. Kahneman and A. Tversky, “On
the Psychology of Prediction”. Psychological Review. 80:
p. 237-251(1973).
E. Brunswik, “Representative
Design and Probabilistic Theory in a Functional Psychology”. Psychological
Review. 62: p. 193-217(1955).
N. Karelaia and R.M. Hogarth, “Determinants of Linear Judgment: A Meta-Analysis of
Lens Studies”. Psychological Bulletin. 134(3): p.
404-426(2008).
C.P. Bradley, “Can
We Avoid Bias?” British Medical Journal. 330: p.
784(2005)
D.V. Budescu, S. Broomell, and
H.-H. Por, “Improving Communication of Uncertainty in the Reports of
the Intergovernmental Panel on Climate Change”. Psychological
Science. 20(3): p. 299-308(2009)
L.A. Cox Jr., “What’s Wrong with
Risk Matrices?” Risk Analysis. 28(2): p. 497-512(2008)
L.A. Cox Jr. et al., “Some
Limitations of Aggregate Exposure Metrics”. Risk Analysis. 27(2)
(2007)
D. Hubbard, How to Measure Anything: Finding the Value of
Intangibles in Business: John Wiley & Sons. 2007
S. Lichtenstein, B. Fischhoff,
and L.D. Phillips, “Calibration of Probabilities: The State of the Art
to 1980”, in Judgement under Uncertainty: Heuristics and Biases,
D. Kahneman, P. Slovic, and A. Tversky, Editors. 1982, Cambridge
University Press: Cambridge. p. 306-334
G.S. Simpson, F.E. Lamb, J.H.
Finch, and N.C. Dinnie, “The Application of Probabilistic and
Qualitative Methods to Asset Management Decision Making (Paper 59455)”,
in SPE Asia Pacific Conference on Integrated Modeling for Asset
Management. 2000: Yokohama, Japan.
W. Bailey, B. Couet, F. Lamb, G.
Simpson, and P. Rose, “Taking a Calculated Risk”. Oilfield Review. 12(3):
p. 20-35(2000).