QUANTIFYING THE VALUE OF HUMAN INTERVENTION
IN EXTENDED WEATHER FORECASTS
by W. M. (Willi) Purcell
D. D. (Dave) Carlsen
Prairie and Arctic Storm Prediction Centre
Meteorological Service of Canada
Winnipeg, Manitoba, Canada
Everyone talks about the weather, but nobody does anything about it. The old adage certainly seemed appropriate with extended weather forecasts issued by the Prairie and Arctic Storm Prediction Centre (PASPC) starting in 2001, when the products were automated. The forecasts were the focus of many complaints, especially among meteorologists who believed they could add considerable value to the raw products issued by SCRIBE, the computer-based forecast tool developed by the Canadian Meteorological Centre (CMC). In the fall of 2004, a PASPC meteorologist undertook a 10-week experiment to evaluate human involvement in the extended forecasting process. The results indicated that although changes can produce improvements, the gains are small and could easily be automated.
1. THE HISTORY
Environment Canada has been issuing forecasts for the extended period for many years, but an increasing move toward automation gathered momentum just a few years ago. The development and implementation of forecast tools such as SCRIBE and updatable model output statistics (UMOS) produced steady improvements in forecast accuracy, assessed both by CMC and the storm prediction centres.
By 2001, there was little difference between the raw CMC forecasts and the products issued by the various weather centres, although such comparisons were generally limited to temperature forecasts. PASPC verification work indicated that meteorologists did reduce forecast error in the extended period, but the differences were generally small. Over a five-year period ending on January 1, 2001, meteorologists reduced the mean absolute error in day-three maximum temperatures by just 0.15 degree, to a value of 4.01 degrees. With overnight lows, the automated PASPC verification indicated that meteorologists actually degraded the CMC product.
Meanwhile, verification data for precipitation and cloud cover were spotty at best. As a result, conclusions regarding the accuracy of extended forecasts were largely based on anecdotal information, as well as potentially erroneous conclusions drawn from the temperature verification.
The PASPC fully automated the extended forecasts in 2001, primarily due to workload issues. That resulted in increased complaints about the quality of extended forecasts from the general public, and especially from meteorologists convinced that they could do a better job.
2. THE CHALLENGE
The Winnipeg office of the PASPC began detailed verification of its public forecasts in 2001, using an in-house utility designed for its Phoenix program (Ref. 2, 3). The system weighted forecast parameters and error ranges according to public perceptions contained in a 2002 survey by Decima Research (Ref. 1). Considerable improvement in the official forecasts followed, prompted by the revival of short-term forecasting techniques and information gleaned from the verification data.
The PASPC limited its detailed verification to the first two days, as there was no valid comparison possible in the extended period once the SCRIBE product became the official forecast. Nevertheless, a challenge was extended to all meteorologists that any unofficial forecast changes offered in real time would be assessed. Despite the frequent complaints about the SCRIBE forecast, there were no formal takers on the verification offer for more than three years.
That changed in the fall of 2004, when Mr. Carlsen, a PASPC meteorologist, agreed to participate in an extended experiment. Mr. Carlsen was a clear choice for this test, as he was acknowledged as having the best verification scores during 2004 and he was the runner-up during 2003. The test ran from Sept. 30 to Dec. 15, 2004 and involved evaluating and revising the extended forecasts for Edmonton, Calgary, Regina, Saskatoon and Winnipeg. The work took place a few hours after the extended forecast issue time, but the process mimicked the operational environment.
The bulk of the forecast complaints resulted from clear biases in the SCRIBE product, although unilateral assumptions based on this knowledge frequently led to unfortunate situations. On occasion, a PASPC meteorologist would alter the SCRIBE product. More frequently, a PASPC meteorologist would offer a contrary opinion during a media broadcast. Purely anecdotal assessments showed these alterations had a very high rate of failure. It was difficult to argue against full automation in the face of this mounting evidence.
As a result, the PASPC chose to put the matter to a more rigorous test. Mr. Carlsen’s experiment provided a clear test of a meteorologist’s ability to add value to the automated forecasts, although there was a key bias that affected the outcome. The test period covered the fall months, when SCRIBE and all numerical guidance are typically at their worst across the Prairies, due to the frequency of boundary layer cloud and precipitation. This gave the meteorologist an advantage over his SCRIBE competition that would be less apparent during the remainder of the year.
A key part of the experiment was a comparison with existing automated forecasts, but any such test would not be complete without the inclusion of at least one other automated scheme that would attempt to correct clear biases in the SCRIBE forecasts. Ultimately, for the experiment to prove the value of reintroducing a human influence, the meteorologist would have to outperform all automated methods.
3. HUMAN INTERVENTION
A detailed examination of model guidance was the key to determining the most promising changes to the existing SCRIBE extended forecasts. The first task in the daily assessment was a diagnosis of the recent performance of the numerical weather prediction models; it was also the most difficult part of the experiment.
The experiment routinely used several models. The list included the Global Environmental Multiscale (GEM) global model, the GEM global ensemble, the North American Mesoscale (NAM) model, the Global Forecast System (GFS), the United Kingdom Met Office (UKMET) model and the European Centre for Medium-Range Weather Forecasts (ECMWF) model. In all cases, the experiment utilized the 00 UTC model runs.
A first consideration was an evaluation of model initialization. In addition to cursory examinations, the meteorologist used the extended forecast guidance discussions prepared by the Canadian Meteorological Centre (CMC) and the National Weather Service (NWS). This evaluation proved of marginal use at best, as all the models seemed to initialize with comparable effectiveness.
The next step was an assessment of what each model was predicting. The meteorologist examined the veracity of each scenario, with pattern recognition playing a key role. After the assessment of each model was complete, a likely solution was developed. On occasion, recent model performance dictated the accepted solution, while statistical likelihood influenced other selections.
The meteorologist compared the SCRIBE extended forecast with forecasts from various outside sources, including private forecasting agencies and the model output. These data formed the members in what effectively was an ensemble approach. The forecasts of each forecast element were averaged for each location and date.
It was necessary to apply a probability of precipitation (POP) to every mention of precipitation for forecast consistency and verification, although not all the members supplied an explicit value. In those cases, a subjective interpretation of the forecast’s wording or icon was required. Cloud cover also required some interpretation, but temperature and wind forecasts did not.
All forecast members were converted to a standard format. The temperature guidance was simply an integer average of all the members, including the SCRIBE forecast. The cloud cover forecast by each of the sources was fitted into one of five categories ranging from “sunny” to “cloudy”. The wind forecast was simply a “yes” or “no” forecast, based on a sustained wind threshold of 30 km/h. Precipitation forecasts were slotted into five categories. The lowest group contained POP forecasts of 20 per cent or less, while the top group contained forecasts of 80 per cent or more. The middle group contained POP forecasts of 50 per cent. The two remaining categories contained POP forecasts of 30 and 40 per cent, and 60 and 70 per cent respectively.
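The standardization scheme above can be sketched in a few functions. This is an illustrative reconstruction, not the tool actually used in the experiment; the function names are assumptions, and the POP boundaries follow the five categories described in the text.

```python
def pop_category(pop):
    """Slot a probability-of-precipitation value (per cent) into one of
    the five categories described in the text."""
    if pop <= 20:
        return 1  # 20 per cent or less
    if pop <= 40:
        return 2  # 30 or 40 per cent
    if pop == 50:
        return 3  # 50 per cent
    if pop <= 70:
        return 4  # 60 or 70 per cent
    return 5      # 80 per cent or more

def wind_forecast(sustained_kmh):
    """Binary wind forecast based on the 30 km/h sustained-wind threshold."""
    return sustained_kmh >= 30

def ensemble_temperature(member_temps):
    """Integer average of all member temperatures, SCRIBE included."""
    return round(sum(member_temps) / len(member_temps))
```

Once every member is expressed in these common units, averaging across members for each location and date becomes a mechanical step.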
The forecasts for each location and date were compared to the SCRIBE forecast. If the difference between the SCRIBE forecast and the averaged ensemble forecast was within a certain threshold, the SCRIBE forecast would usually stand. If significant discrepancies existed, further investigation followed.
A three-degree threshold was used for temperature forecasts, and due to the binary nature of the wind forecasts, any variation was significant. Sky condition forecasts and probability of precipitation forecasts required a difference of two categories to be deemed significant.
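Taken together, those thresholds amount to a simple significance check. A minimal sketch follows, assuming the thresholds are applied inclusively (the text does not state whether a difference exactly at the threshold counted as significant):

```python
def needs_investigation(element, scribe, ensemble_mean):
    """Flag a SCRIBE/ensemble discrepancy as worth further investigation.
    Sky condition and POP values are category indices (1-5)."""
    if element == "temperature":
        return abs(scribe - ensemble_mean) >= 3  # three-degree threshold
    if element == "wind":
        return scribe != ensemble_mean           # binary: any variation counts
    return abs(scribe - ensemble_mean) >= 2      # two-category threshold
```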
The ensembles formed the basis of most decisions to change the forecasts. There were times, however, when the meteorologist rejected the ensemble. There were also times, although fairly rare during this experiment, when both the ensemble and the existing Environment Canada (SCRIBE) forecast were rejected in favour of meteorological judgement.
Pattern recognition, regional climatology and recent events were important factors in the forecasts. Temperature forecasts were occasionally changed because the values would drift toward the climatological temperature – a known trait of the SCRIBE forecasts. The difficulty was in figuring out which forecasts were most likely to deviate significantly from climatological values.
4. AUTOMATED METHODS
A very crude method of removing SCRIBE forecast bias was developed. The method was officially named Value Improvement Product Enhancing Routine (VIPER), but became colloquially known as the Very Idiotic but Potentially Embarrassing Routine.
The latter name clearly had merit, as there was no meteorology involved in the forecast revisions. Alterations to the existing routines were occasionally made, although retroactive adjustments were not permitted.
Sky conditions were bumped up by one or two categories as a matter of course, making it impossible to have anything less than “sunny with cloudy periods” as a forecast. An attempt was made to nudge POP forecasts by one category, given certain prevailing sky condition forecasts. This latter technique worked with some degree of effectiveness at the start, but forecast gains quickly dissipated and the approach was dropped at about the break-even point. No further changes were made to precipitation forecasts once that point was reached.
Several SCRIBE temperature forecasts had limited ranges between the highs and lows. If the range was less than a set limit, the VIPER routine boosted the high temperature to provide the set minimum spread. There were no changes to the minimum forecasts. The minimum variation was nine degrees throughout the period. Lowering the threshold by a degree or two as fall progressed was contemplated, but the existing value appeared to be sufficient. No changes were made to any wind forecasts.
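The VIPER rules described above can be captured in a few lines. This is a hypothetical reconstruction, not the actual routine: only “sunny” and “cloudy” are named as category endpoints in the text, so the intermediate category names are assumptions, and the default two-category bump is inferred from the stated net effect on “sunny” forecasts.

```python
# Five sky-condition categories; the middle names are illustrative.
SKY = ["sunny", "sunny with cloudy periods", "a mix of sun and cloud",
       "mainly cloudy", "cloudy"]

def viper_sky(index, bump=2):
    """Bump a sky-condition category toward cloudy, capped at 'cloudy'."""
    return min(index + bump, len(SKY) - 1)

def viper_high(low, high, min_spread=9):
    """Boost the forecast high so the high-low range is at least
    min_spread degrees; the low is never changed."""
    return max(high, low + min_spread)
```

For example, a SCRIBE forecast of a low of -10 and a high of -5 would have its high boosted to -1 to restore the nine-degree spread.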
The net effect of the VIPER changes was that the most optimistic forecast of sky condition was “a mix of sun and cloud,” as all “sunny” forecasts were degraded to that prediction. This would undoubtedly have changed during the other seasons.
5. THE RESULTS
Curiously, SCRIBE tended to significantly underforecast precipitation, and the meteorologist and VIPER both tended to compensate for this, although by entirely different methods. Neither approach proved successful.
Meteorologist-induced changes to the SCRIBE precipitation forecasts resulted in a degradation of the forecast accuracy over the course of the experiment. There were 75 precipitation changes, and just 27 were improvements over SCRIBE. That resulted in an increased error rate of about 7 per cent over what was already a significant degree of error in the SCRIBE product. VIPER’s success rate was below 50 per cent and it added 1 per cent to SCRIBE’s error score.
Both the meteorologist and VIPER improved the sky condition forecasts. The human intervention produced 310 changes, and the 254 for the better yielded a success rate of about 82 per cent. Not surprisingly, that cut SCRIBE’s error rate by 57 per cent.
VIPER managed a comparable performance. The automated routine altered 583 sky condition forecasts and scored an improvement with 78 per cent of the changes. That resulted in a reduction of the SCRIBE error by 54 per cent. Those rates were nearly a match for what the detailed use of ensembles and meteorology delivered.
The meteorologist made improvements on temperatures, but only on daytime highs. There were 83 changes, and 77 per cent delivered improvements. Alterations to overnight lows fared poorly: only 42 per cent of the 19 changes were for the better. Overall, temperature changes resulted in a reduction of 6 per cent in the SCRIBE error rate.
If there was a discrepancy between the ensemble forecast high and the SCRIBE value, the ensemble value was usually taken, and it usually won out. Still, on days when the ensemble turned out to be wrong, it was very wrong. An ensemble high temperature forecast is often not the best forecast, but it is perhaps the least wrong on average.
There were a few successes with overnight lows at the start of the experiment, but a barrage of huge losses followed the early improvements. As it became apparent that changing lows was degrading the performance, the meteorologist scrapped the approach as a very bad idea.
Meanwhile, the VIPER routine was more eager to make temperature changes, but only to daytime highs. There were 156 alterations, and 71 per cent resulted in gains. As a result of the changes, VIPER reduced SCRIBE’s temperature error rates by 18 per cent.
Consistent with experience gained in the short-term forecasts, VIPER made no changes to wind forecasts. The meteorologist did alter eight wind predictions in one brief flurry. Only three worked out, and the alterations added a modest degree of error to the SCRIBE product.
Overall, there were 495 changes made to the SCRIBE products, and 356 resulted in lower error scores. The resulting 72-per-cent rate of improvement was encouraging, but VIPER managed to surpass that performance. The automated routine produced 748 changes, and 569 were for the better, a success rate of 76 per cent.
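As a quick sanity check, the quoted success rates follow directly from the raw counts; the calculation is trivial and is included only to make the comparison explicit:

```python
# Improvements divided by total changes, expressed as whole percentages.
meteorologist_rate = round(100 * 356 / 495)  # 356 improvements in 495 changes
viper_rate = round(100 * 569 / 748)          # 569 improvements in 748 changes
print(meteorologist_rate, viper_rate)        # prints: 72 76
```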
In general, the SCRIBE error rate was considerable over all three days of the extended forecast. The error for day three was about 30 per cent higher than day two and the rate escalated to nearly double the day two error rate by the fifth day. The error rates for the meteorologist and VIPER followed a similar trend, although with a modest reduction in the absolute values.
Overall, the meteorologist achieved an error reduction of just 4 per cent over the raw SCRIBE product, while VIPER achieved a 9 per cent drop in the error rate.
Although the meteorologist did improve on the SCRIBE product, the gains were less than what were achieved by a simplistic automated routine. Further, the boundary layer weather predominating during the fall season contributed to most of the gains produced by both methods.
The meteorologist had success with changes to forecast highs and in the cloud amount, based on climatology pertinent to the upper air patterns, but Mr. Carlsen noted that his process would be easily automated. As a result, VIPER also delivered significant gains with changes to these parameters.
The poorest performance was with precipitation, which is deemed the most important element in a forecast by the general public, according to the 2002 survey by Decima Research. Placing a higher weight on precipitation quickly wipes out most of the improvements in sky conditions, which ranks as the least important of the parameters in the same survey.
There were also significant errors in forecasts that were unaltered. Agreement among the members clearly did not imply accuracy. Pattern recognition was an important ingredient in the precipitation forecasts, but the results indicated that basing a firm forecast of precipitation on the synoptic patterns indicated by the model guidance in the longer range was a very bad strategy. This tends to corroborate the anecdotal conclusions reached over the years.
In general, changes to the extended forecast that corrected known biases in the raw SCRIBE forecast had the best chance of success, and the VIPER routine proved that meteorological expertise played no real role in lowering those error scores. Meteorology was the key component in forecasting precipitation events or significant warm or cold periods. These events are less common, however, and the experiment indicated that forecasts of uncommon events had an unacceptably high error rate, regardless of the method of preparation.
Firm forecasts of precipitation led to the greatest error in the extended forecasts issued by the meteorologist. The SCRIBE product tended to have fewer forecasts of precipitation, and that resulted in a lower error score. Further, hedging usually resulted in the lowest error scores.
The reason for that result is intuitive. It is often difficult to forecast an event with less than a 20-per-cent error in timing, and in the extended periods, such a rate can easily displace a precipitation event by a day. As well, minor spatial errors grow progressively in magnitude over several days, leading to another common source of forecast inaccuracy. As a result of these timing and spatial issues, firm forecasts of precipitation are all too frequently incorrect.
Unfortunately, this will remain the extended forecasting challenge for years to come. The PASPC does plan a second attempt to add value to the day two and day three forecasts in the spring of 2005, during a two-week experiment using a combination of short-term forecasting techniques and ensemble approaches.
REFERENCES
1. Decima Research, 2002: National Survey on Meteorological Products and Services – 2002. Final Report. Prepared for Environment Canada.
2. Purcell, W. M., 2001a: Project Phoenix – Preparing Meteorologists for an Intensive Man-Machine Mix. Internal MSC Report.
3. Purcell, W. M., 2001b: Project Phoenix II – Preparing Meteorologists for an Intensive Man-Machine Mix. Internal MSC Report.