The Advisory Council

Personal Advisory Report

To:

Jane Smith

RE:

Ranking Projects Based upon Criteria That Are as Objective as Possible

Ref Number:

3876

Date:

Mar 18 2009 4:44PM

Response by:

Douglas Hubbard

Question restatement:

What are the leading practices in the industry in the areas of ranking projects based on criteria that are as objective as possible?

Advice:

A Clarification

The term “leading practice” needs some clarification. The leading practice in ranking IT investments could mean the most popular methods, or the methods with the most scientifically measured evidence of actually improving decisions. Unfortunately, there is almost no overlap between these two groups. Not only do the most popular methods have no evidence of working, but there is significant evidence in published, scientifically controlled studies that they may be worse than unaided intuition. On the other hand, the methods that show measurable improvement in management decisions are not widely used. Fortunately, although the most effective methods are obscure, they are entirely practical.

There has been a significant amount of research in the field of decision analysis regarding the effectiveness of various decision analysis methods, including weighted scoring methods. This Personal Advisory Report will build on the prior research to discuss the issues with scoring methods, and identify methods that have scientifically-measured evidence of improving decisions. Finally, we will discuss some additional factors that you may consider in your model, and how they might be defined.

Critical Issues with Scoring Methods

The most popular type of decision analysis method used in ranking IT projects is very likely the weighted ordinal score. An “ordinal” score is not a quantitative measurement like orders per week, duration in months or average lifetime revenue per customer, but simply a statement that one thing is more than another. It is a common mistake to treat ordinal scores as proper quantitative scales. For example, a four-star restaurant is better than a two-star restaurant, but not necessarily exactly twice as good. Some methods then assign some number of points to each of these values or weight them in some other way.

Although there appears to be a strong belief that decisions are better with ordinal scoring methods, there is no published evidence that they have caused decision makers to make measurably more consistent, more reliable, or better performing decisions in any area (IT portfolios, public health, etc.). The scores and the calculations with which they are used are arbitrarily defined, and are not based on any quantitative analysis that optimizes the track record of decisions. These methods seem to be developed in complete isolation from the well-established and still growing field of decision science / decision analysis. In fact, there is evidence that these methods actually add significant error to the decision-making process, and any perception of an improved decision process is purely imagined. Here is a summary of the research combined with this author’s own documented measurements:

1.     Studying the “scoring behavior” of previous evaluations is critical but rarely done. No public opinion survey would ever be considered credible without detailed consideration for response behavior, yet most scoring methods make no attempt to consider these issues. People choose their responses for all sorts of reasons which the designers of the system never intended to affect the outcomes. For example, arbitrary changes in wording or question order seem to significantly change responses.

2.     Another effect is inadvertent weighting. It is a consequence of different scores being distributed differently, which has an unintended effect on the relative importance of different factors. If we look at how past responses are distributed in a three-state score (high/medium/low), some factors will have most of their responses in a single value and some will be evenly distributed among high, medium and low. The factors that happen to have more distributed responses will end up having a larger impact on the resulting rank than a factor where, say, 85% of responses are “medium,” even if the latter is considered to be equally or even more important.

3.     A particularly important effect of scales with only three possible responses is what one researcher calls “range compression.” This is a tendency for many very different projects to cluster within a small range of total scores: since there are only three choices, projects with very different characteristics must be given the same scores for several factors. As a result, changing a single factor on one project from low to medium may make a huge change in rank.

4.     Another assumption in scoring methods that adds significant error is that the scales or weights roughly approximate some real-world measure. In your case, the scoring method prescribes values of 1, 5 and 10 for low, medium and high, respectively, and these are applied identically to every factor. This literally means that medium is considered five times as desirable as low but only half as desirable as high. But when a historical analysis of projects is completed, it is often found that the relative values of high, medium and low are dramatically different when optimized to predict outcomes. In one case, I found that a three-point scale for technology maturity that assigned relative scores of 1, 3 and 9 was actually closer to 1, 16.2 and 17.5; in another, the optimized values for sponsor level were 1, 1.1 and 10.5.

5.     Finally, it appears that even scales that are thought by their designers to be very well defined will be interpreted very differently by various users. This creates what researchers call an “illusion of communication,” where two managers agree that some value is “medium” but on further questioning reveal that they believed very different things about what they were evaluating.
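
The inadvertent weighting effect described in item 2 can be illustrated with a short sketch. The response counts below are invented for illustration and are not taken from any actual evaluation; the 1/5/10 point values are those mentioned in item 4. The sketch compares how much each of two nominally equal factors actually varies across projects, since a factor that barely varies can barely change a ranking.

```python
from statistics import pstdev

# Point values for the three-state scale (the 1/5/10 scale mentioned in item 4).
SCALE = {"low": 1, "medium": 5, "high": 10}


def score_spread(counts):
    """Population standard deviation of the points a factor contributes,
    given how many past responses fell into each category."""
    points = [SCALE[label] for label, n in counts.items() for _ in range(n)]
    return pstdev(points)


# Hypothetical response histories for two equally weighted factors,
# 100 past projects each. These counts are assumptions for illustration.
concentrated = {"low": 7, "medium": 85, "high": 8}    # 85% "medium"
distributed = {"low": 33, "medium": 34, "high": 33}   # roughly even

spread_concentrated = score_spread(concentrated)
spread_distributed = score_spread(distributed)
ratio = spread_distributed / spread_concentrated

print(f"Spread of concentrated factor: {spread_concentrated:.2f} points")
print(f"Spread of distributed factor:  {spread_distributed:.2f} points")
print(f"Ratio of spreads:              {ratio:.2f}")
```

Under these assumed counts, the evenly distributed factor contributes roughly twice the spread of the concentrated one, so it would have the larger effect on the ranking even though both factors carry the same nominal weight.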

All of these added (and completely avoidable) sources of error in the decision-making process lead some researchers to conclude that such scoring methods are literally “worse than useless.” Whatever decision errors may be removed by such methods, more errors are apparently added. That is, in controlled measurements of decisions (based on portfolio returns, forecasting accuracy, etc.), expert intuition alone often beats experts who use these weighted scores.

However, there are methods that have proven that decision makers — under controlled measurements — can make better decisions. The problems introduced above can be avoided with certain controls.

Recommended Methods

It is highly recommended that you investigate the following methods. These are slightly more complicated than the simple scoring method you have been using (and will take more time to explain in detail than one PAR) but all of these methods have been implemented in real-world environments many times. Furthermore, the additional effort is easily justified for the IT budget a company like yours is likely to have. Each of these topics has also been described in more detail in this author’s book How to Measure Anything: Finding the Value of Intangibles in Business:

1.     Historical Regression Models: One of the best methods appears to be based on regression models of historical data. Although managers often assume that new projects and investments are so unique that historical analysis would be useless, the fact is that historical analysis routinely outperforms the intuitive judgments of managers. The only reason this may not be the preferred approach is that historical outcomes have not been consistently documented. If you have been diligently tracking all past evaluations, and tracking them against project results, then this is the best approach.

2.     The Lens Method: Unlike historical regression, this method requires no historical data of any kind, relies entirely on subjective inputs, yet consistently outperforms managers’ intuition. The Lens method was developed in the 1950s and has been used in a variety of decision analyses including military logistics, IT portfolios, cancer patient prognosis, business risks, and hiring staff. Everywhere it has been applied, it has shown a measurable improvement on expert decision making, and in some situations the improvement was dramatic. Instead of applying scores like 1, 5, and 10 to responses, the Lens method computes relative weights based on a statistical interpretation of a large number of subjective judgments. In effect, the Lens method creates a statistical model of an expert’s judgment, with the benefit that the model beats the unaided expert. This appears to happen because, regardless of the level of experience of experts, they are still unable to apply their judgment consistently. The Lens method creates a 100% consistent model of the expert’s judgment.

3.     Calibrated Probability Assessments: Risk needs to be expressed in terms of probabilities and magnitudes of specific losses. A common error is to assume that since there is insufficient data for “exact” probabilities, then soft scales somehow alleviate the problem of insufficient data. As previously mentioned, the soft scales add their own error. Furthermore, even subjective probabilities can be useful under certain circumstances. First, extensive research shows that managers can be trained to provide “calibrated” subjective probability estimates. That is, when they claim to be 80% confident in a prediction, then they will be correct 80% of the time. Without such training, managers seem to have great difficulty assessing risks. Second, predictions and actual results need to be tracked and documented so that subjective probabilities can improve over time. This can be used in conjunction with the Lens method.

4.     Several of the existing factors may be “soft,” but this is not necessarily a hindrance to measurably improved decisions. If several independent assessments are collected by secret ballot and averaged, the average can still serve as a useful measure. This can also be used in conjunction with the development of a model based on the Lens method.

5.     Probabilistic Simulations: For some projects, you should consider a proper risk/return analysis based on Monte Carlo simulations. A company the size of yours would likely have at least one project each year that would justify a detailed quantitative analysis, and it would be entirely practical. I have applied such methods to more than 60 major investments of various sizes; some were as small as $1 million. (I would consider that the lower limit for this level of analysis.) If this method is included in your repertoire, then the basic ranking approach discussed so far becomes a method to categorize projects as “accept”, “reject” or “needs detailed risk/return analysis.” While it is likely that simple ranking methods (even with the improvements mentioned) are all that is needed to rank most projects, it is also very likely that some projects justify a much more detailed analysis.
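
As a rough illustration of the kind of risk/return analysis item 5 refers to, the sketch below runs a simple Monte Carlo simulation in Python. Every input is hypothetical: the 90% confidence intervals for cost and annual benefit, the 15% chance of cancellation, the three-year horizon and the function names are assumptions made for the example, and treating each interval as a normal distribution is a simplification. A real analysis would use calibrated estimates for each variable and a model specific to the project.

```python
import random


def normal_from_90ci(rng, lower, upper):
    """Draw from a normal distribution whose 90% confidence interval is
    (lower, upper). A 90% interval spans about 3.29 standard deviations."""
    mean = (lower + upper) / 2
    sd = (upper - lower) / 3.29
    return rng.gauss(mean, sd)


def simulate_net_benefit(n_trials=10_000, seed=1):
    """Simulate the net benefit of one hypothetical project many times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcomes = []
    for _ in range(n_trials):
        # Hypothetical 90% confidence intervals, in dollars.
        cost = normal_from_90ci(rng, 800_000, 1_600_000)
        annual_benefit = normal_from_90ci(rng, 200_000, 900_000)
        # Assumed 15% chance that the project is cancelled with no benefit.
        cancelled = rng.random() < 0.15
        benefit = 0.0 if cancelled else 3 * annual_benefit  # 3-year horizon
        outcomes.append(benefit - cost)
    return outcomes


outcomes = simulate_net_benefit()
mean_net = sum(outcomes) / len(outcomes)
prob_loss = sum(1 for x in outcomes if x < 0) / len(outcomes)

print(f"Mean net benefit:    ${mean_net:,.0f}")
print(f"Probability of loss: {prob_loss:.1%}")
```

The point of the output is the probability of a loss alongside the average return, which a single deterministic ROI figure cannot show.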

Comments on Decision Criteria

Whether using historical regressions or the Lens method, some new factors could be added to the model. Many of the existing factors you currently use could be modified using nothing but item (4) above.
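
A minimal sketch of the idea behind the Lens method, assuming a plain least-squares fit, follows. The scenarios, the factor names and the simulated “expert” (a hidden linear policy plus a small inconsistency term) are all hypothetical, chosen only so the example runs without real data. In practice the judgments would come from actual experts evaluating a large number of scenarios, and the regression would normally be done in a spreadsheet or statistics package.

```python
def fit_linear_model(rows, judgments):
    """Ordinary least squares via the normal equations (X'X) w = X'y.
    A leading 1.0 is added to each row, so weights[0] is the intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, judgments))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights, row):
    """Apply the fitted model to one project."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical scenarios shown to the expert:
# (C-level sponsor? 1/0, planned duration in months, proven technology? 1/0)
scenarios = [
    (1, 6, 1), (1, 12, 1), (1, 24, 0), (1, 36, 0),
    (0, 6, 1), (0, 12, 0), (0, 24, 1), (0, 36, 0),
    (1, 18, 1), (0, 18, 0), (1, 30, 1), (0, 30, 1),
]

# Simulated expert: estimated chance of success from an assumed hidden
# policy, with an alternating +/-0.03 term standing in for inconsistency.
judgments = [
    0.55 + 0.20 * s - 0.01 * m + 0.10 * p + (0.03 if i % 2 == 0 else -0.03)
    for i, (s, m, p) in enumerate(scenarios)
]

weights = fit_linear_model(scenarios, judgments)
new_project = (1, 9, 1)  # hypothetical project to be ranked
print("Fitted weights:", [round(w, 4) for w in weights])
print("Model estimate for new project:", round(predict(weights, new_project), 3))
```

The fitted model gives the same answer every time it sees the same inputs, which is the consistency the unaided expert lacks. New factors are added simply by extending each scenario with another column.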

Based on other historical regression models I have performed, the first four items listed below seem to have a consistent and significant impact on project success and relative value. Some of these are new factors or specific recommendations for factors already considered in your method:

  • Sponsor level — In objective historical models, this factor has turned out to have one of the highest correlations to project success. Ideally, this person has formally accepted business responsibility for the project by signing a document to that effect. The levels may be “C-level,” “SVP,” “VP,” “Manager” or whatever applies in the organization.
  • Planned duration (months) — The initially planned duration of a project is also a strong predictor of eventual success. The longer a project, the more likely unforeseen issues will interfere with success, and the harder it is to accurately predict costs. For these reasons, the chance of even outright project cancellation is strongly correlated with planned duration.
  • Technology familiarity — Many firms identify some key success factor related to how well proven a new technology may be, or at least how familiar their own staff is with it. Binary (yes/no) statements are recommended — e.g., “We have used this technology for more than two years” or “This technology has been used in the industry for more than three years.”
  • Vendor reliability — Vendor-related concerns often boil down to proven reliability, competence and long-term viability. Again, for simplification, you may define a yes/no question such as “This vendor has been used previously and was given a favorable rating for its deliverables.”
  • New software development — Large, new software development initiatives are among the highest risk projects most firms will engage in. Separating out how much of the proposed project is actually new software development (as opposed to packaged software, hardware, facilities, etc.) is often enlightening. This can be expressed as a dollar amount or FTEs unique to the software development part of the initiative.
  • New business process — This can be a binary (yes/no) question about whether the project involves developing a new business process. If it were merely automating an existing process, or simply updating hardware or packaged software, it will probably not have risks as high as a project developing a previously untested process. However, new processes often have the potential for higher return.
  • Organizational scope — Enterprise-wide projects are invariably higher risk than those with a narrower scope. Likewise, certain narrow areas of critical operations probably have straightforward benefits as well as lower risk. Like new business processes, a broad organizational scope may mean both higher risk and higher potential return. One or more binary questions like “Is this an enterprise-wide initiative?” or “Is this project limited to supporting sales and marketing?” would be useful.
  • Cost elimination per year/investment size — At this stage, ROI based on exact, deterministic estimates is less useful than many managers think, but only because the methods typically used to compute ROI turn out to be highly unreliable and, therefore, don’t correlate strongly with project success. (If best practices are used, a proper risk/return analysis should be all that is required for any investment decision.) But if there are known categories of significant cost elimination, this should be shown as a ratio compared to the planned investment size. An independent source from business operations should confirm these savings.
  • Customer service or revenue — This factor applies when a customer survey has indicated a need or problem that the proposed initiative specifically addresses, and the survey indicates that it is a significant factor in future purchasing decisions. Claims that customer service or even revenue will increase are among the most consistently abused in ranking methods. If any such claim is made, it must be based on evidence from detailed measurements after previous projects or on some extensive customer survey combined with purchasing behavior research. On the other hand, if a full risk/return analysis is used, this can be incorporated in a practical way.
  • Strategic — The strategic factor you currently identify could be made more objective if it was based on a survey of upper management and/or based on specific references in published strategic plans. Again, an average of several independent persons could be used to assess this value.

TAC SmartGrid

Alignment

“Alignment” for decision analysis methods should mean that decisions were measurably improved. That is, failure rates were actually reduced, total return on the IT portfolio increased, and so on. If you at least use methods that have scientifically proven track records, then alignment is achievable.

Cost Reduction



Cost reduction of the decision analysis process itself is not the key concern when the decisions affect very large portfolios. Spending just half of 1% of an annual IT budget on decision analysis and metrics may be easily justified, yet it is probably more than the cost currently committed to that task. On the other hand, no cost is justified if the decision analysis methods can’t show that they measurably improve decisions.

Organization



Some new skills may need to be introduced into the IT organization to adopt methods that can show real improvements in decisions (or even just to measure those improvements). Depending on the size of the IT budget, it may justify hiring an individual trained in quantitative decision analysis. At a minimum, it should require training of selected existing staff in these methods.

Technology



The technology requirements for improved decision analysis do not need to be more than Microsoft Excel. More important than the technology, however, is the skill to use the right algorithms and methods on a given problem. As the organization becomes more advanced in these methods, it may make sense to adopt some enterprise solutions for simulations and decision analysis.

 

Sources and Referrals:


D. Kahneman, P. Slovic, and A. Tversky, Judgement under Uncertainty: Heuristics and Biases, Cambridge: Cambridge University Press. 1982.

D. Kahneman and A. Tversky, “Subjective Probability: A Judgement of Representativeness”. Cognitive Psychology. 4: p. 430-454(1972).

D. Kahneman and A. Tversky, “On the Psychology of Prediction”. Psychological Review. 80: p. 237-251(1973).

E. Brunswik, “Representative Design and Probabilistic Theory in a Functional Psychology”. Psychological Review. 62: p. 193-217(1955).

N. Karelaia and R.M. Hogarth, “Determinants of Linear Judgment: A Meta-Analysis of Lens Studies”. Psychological Bulletin. 134(3): p. 404-426(2008).

C.P. Bradley, “Can We Avoid Bias?” British Medical Journal. 330: p. 784(2005).

D.V. Budescu, S. Broomell, and H.-H. Por, “Improving Communication of Uncertainty in the Reports of the Intergovernmental Panel on Climate Change”. Psychological Science. 20(3): p. 299-308(2009)

L.A. Cox Jr., “What’s Wrong with Risk Matrices?” Risk Analysis. 28(2): p. 497-512(2008)

L.A. Cox Jr. et al., “Some Limitations of Aggregate Exposure Metrics”. Risk Analysis. 27(2) (2007).

D. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business: John Wiley & Sons. 2007

S. Lichtenstein, B. Fischhoff, and L.D. Phillips, “Calibration of Probabilities: The State of the Art to 1980”, in Judgement under Uncertainty: Heuristics and Biases, D. Kahneman, P. Slovic, and A. Tversky, Editors. 1982, Cambridge University Press: Cambridge. p. 306-334

G.S. Simpson, F.E. Lamb, J.H. Finch, and N.C. Dinnie, “The Application of Probabilistic and Qualitative Methods to Asset Management Decision Making (Paper 59455)”, in SPE Asia Pacific Conference on Integrated Modeling for Asset Management. 2000: Yokohama, Japan.

W. Bailey, B. Couet, F. Lamb, G. Simpson, and P. Rose, “Taking a Calculated Risk”. Oilfield Review. 12(3): p. 20-35(2000).

©2002-2009 The Advisory Council Inc. All rights reserved.