This post originated from an RSS feed registered with Agile Buzz
by James Robertson.
Original Post: This is interesting
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.
Remember the air traffic control problem in the western US about a month ago? TechWorld has a story on it, and the cause seems to be a mix of technology and process:
The failure was ultimately down to a combination of human error and a design glitch in the Windows servers brought in over the past three years to replace the radio system's original Unix servers, according to the FAA.
The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days. An improperly trained employee failed to reset the system, leading it to shut down without warning, the official said. Backup systems failed because of a software failure, according to a report in The New York Times.
I'm with Strongly Typed on one thing here - the above is just screaming for more information - what the heck is a "data overload"? Even so, I think we can see the outlines of the problem - using MS Windows for a critical service.
In an office environment, reboots may be a pain, but they can be done relatively easily (if a file server is unavailable for a few minutes at 2 am, few workers are going to care) - and if someone forgets to reboot said server and it crashes, it's likely to be more of an irritation than a life-threatening problem. Not so in an air traffic control situation. What was the fallout from this outage?
The radio system shutdown, which lasted more than three hours, left 800 planes in the air without contact to air traffic control, and led to at least five cases where planes came too close to one another, according to comments by the Federal Aviation Administration reported in the LA Times and The New York Times. Air traffic controllers were reduced to using personal mobile phones to pass on warnings to controllers at other facilities, and watched close calls without being able to alert pilots, according to the LA Times report.
That's a pretty high level of risk to accept from a system that is - according to the story - known to fail catastrophically at a known interval. Now, it's not like the server running this blog is a "critical" system - but I will point out that typing "uptime" at the console prompt yields an answer of 313 1/2 days (dating back to the last power outage, before IT installed a generator). You think maybe the FAA should have insisted on a system that didn't need the addition of a "reboot on a regular schedule" process? Here's the money quote:
Soon after installation, however, the FAA discovered that the system design could lead to a radio system shutdown, and put the maintenance procedure into place as a workaround, the LA Times said. The FAA reportedly said it has been working on a permanent fix but has only eliminated the problem in Seattle. The FAA is now planning to institute a second workaround - an alert that will warn controllers well before the software shuts down.
The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.
Microsoft told Techworld it was aware of the reports but was not immediately able to comment.
I think I'd say "no comment" if I were in their shoes as well...
Update: I got a link to this MS article in the comments, pointing out that Win 95/98 systems may hang after 49.7 days (which happens to be the time interval given in the air traffic story). So... are they really running an air traffic control system on 95/98? The match seems too close to be a coincidence.
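For what it's worth, 49.7 days is almost exactly how long it takes an unsigned 32-bit counter of milliseconds to run out of values and wrap back to zero. Neither the story nor the FAA says that this is the mechanism behind the radio system shutdown, so treat the following as a back-of-the-envelope sketch under that assumption. The function names are made up for illustration and are not taken from any Windows API.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day
COUNTER_RANGE = 2 ** 32           # distinct values of an unsigned 32-bit counter


def days_until_wrap() -> float:
    """Days before a 32-bit millisecond counter rolls over to zero."""
    return COUNTER_RANGE / MS_PER_DAY


def tick_after(uptime_ms: int) -> int:
    """Value such a counter would show after uptime_ms of real uptime."""
    return uptime_ms % COUNTER_RANGE


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Wrap-safe elapsed time; valid only if fewer than 2**32 ms have passed."""
    return (now_tick - start_tick) % COUNTER_RANGE


if __name__ == "__main__":
    print(f"{days_until_wrap():.2f} days")  # prints "49.71 days"
```

Code that compares two readings with a plain `now_tick - start_tick` gets a large negative number once the counter wraps, which is the sort of thing that can make a timer never fire. Doing the subtraction modulo 2**32, as in `elapsed_ms`, is the usual way around that, as long as the interval being measured is itself shorter than 49.7 days.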