Last updated July 15, 2008

 

Hello, HAL: Computer Systems That Fix Themselves

When I first heard that some computer vendors were working on "self-healing" systems, my first thought was: Where do I sign up? While working on this article, my Outlook crashed three times, my Explorer crashed twice, and, at least once, I had to reboot.

So the first company I called was Microsoft, and I got the bad news out of the way immediately: a spokesperson told me that Microsoft had no information for me about autonomic computing or self-healing systems. But she added "at this time," leaving open the possibility that Microsoft might visit this topic in the future.

Meanwhile, Hewlett-Packard, Sun Microsystems and IBM are busy paving the way for HAL 9000-like systems that promise to make our lives easier and our computer systems more stable.

Analysts and vendors predict that Wall Street firms, usually among the first commercial adopters of any new technology, will be using autonomic technologies within 12 to 16 months and will be able to manage entire systems within the next couple of years.

Many firms, including T. Rowe Price and Oppenheimer Funds, are already using some aspects of this technology, such as keeping track of which systems are installed where and monitoring performance and security issues. Putnam Lovell Securities is among many firms automating the process of deploying and upgrading software and operating systems.

There's still a while to go before the computers reach HAL 9000 levels, however. Gartner analyst Donna Scott said full-fledged autonomic computing is still only a vision, not even close to being available today except in certain environments like mainframes.

Most management tools today offer only monitoring, not real management, she said. And in order for system-wide management tools to be effective, technology vendors need to develop good monitoring and management tools for each individual part of the system. "But it's coming," she said. "Piece by piece, they're getting the plumbing down."

Once the groundwork is in place, which Scott expects will take a couple of years, true management tools can be rolled out, and the benefits will be huge.

Today, for example, when enterprises build a new service they make sure it won't run out of resources by installing two or three times the power and memory it will typically need. This translates into higher equipment costs as well as higher maintenance and labor. With better management tools, applications could share their excess capacity, reducing costs and improving quality of service, she said.

New kinds of distributed applications are also extremely complex to handle, Scott said. "Vendors can't sell their technology anymore because the users don't have the ability to deploy it. The management costs are too high. We need a different model."

Matthew Williamson, a research scientist at HP Labs Bristol, admits that "as computers have gotten bigger and more complicated, they've become more and more difficult to manage." As a result, management tools have to get more clever, he said, and more autonomous, picking up routine tasks that systems managers have to do manually.

This is similar to the way biological systems work, he added.

"The ways we presently think about managing systems don't tend to scale well. But around us, in nature, are many systems that apparently scale very well," he said.

To explore how biological systems can be applied to computers, HP has formed a Biologically Inspired Complex Adaptive Systems group at its Bristol Labs. For example, Williamson has been studying how the human body contains damage caused by viruses, which can spread so quickly that it seems it would be physically impossible to react fast enough to hold back each new attack once it has started. But our bodies do this regularly, with reflexive actions. A similar approach can apply to computers, with feedback loops running on individual machines that cause them to slow down the rate of messaging if the volume gets too high.

"It's something we call virus throttling," he said. Like a reflex, the rate at which a virus is distributed can be slowed down enough so that a human manager can step in and address the problem before it overwhelms a network.

This combines the best of human and machine strengths, he said. "Common sense decisions you should leave to the humans. And, to the computer, leave the things they're good for, which is very quick, routine tasks."
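For readers who want to see the reflex in concrete terms, here is a minimal Python sketch of the throttling idea. It is an illustration only, not HP's implementation: the class name, the working-set size and the alert threshold are all assumptions. Connections to recently contacted hosts go through immediately, requests to unfamiliar hosts wait in a queue that is released one per time step, and a long queue is the signal for a human to step in.

```python
from collections import deque


class ConnectionThrottle:
    """Toy rate limiter for outbound connections to unfamiliar hosts.

    Hypothetical sketch; names and default values are assumptions.
    """

    def __init__(self, working_set_size=4, alert_threshold=10):
        # Recently contacted hosts; the oldest entry is dropped when full.
        self.recent = deque(maxlen=working_set_size)
        # Requests to new hosts, waiting to be released.
        self.pending = deque()
        self.alert_threshold = alert_threshold

    def request(self, host):
        """Return True if the connection may proceed now, False if queued."""
        if host in self.recent:
            return True
        self.pending.append(host)
        return False

    def tick(self):
        """Call once per time step; releases at most one queued host."""
        if not self.pending:
            return None
        host = self.pending.popleft()
        if host not in self.recent:
            self.recent.append(host)
        return host

    def needs_attention(self):
        """A long queue suggests abnormal traffic a human should review."""
        return len(self.pending) > self.alert_threshold
```

Normal traffic, which tends to revisit the same few hosts, is barely delayed under this scheme. A program that tries to contact many new hosts at once fills the queue quickly, which both slows it down and raises the flag.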

This process is also known as "policy-based computing" and Sun Microsystems has been working on it for the past two years, said Adam Hawley, group manager for Sun's N1 product marketing.

The idea is that a human manager sets parameters, specifying, for example, that certain applications or customers have priority when it comes to resources. Then the system reconfigures memory and processor power on the fly, as needed, within those parameters. If anything unexpected comes up, the manager steps back in.
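A rough Python sketch shows what that division of labor might look like. It is not Sun's N1 software; the function name, the application names and the "units" of capacity are all hypothetical. The manager supplies the priority ranking, the code hands out capacity in that order, and anything it cannot satisfy is returned for a person to resolve.

```python
def allocate(capacity, demands, priority):
    """Hand out capacity to applications in priority order.

    capacity: total units available (a hypothetical unit, e.g. CPU shares)
    demands:  dict of app name -> units currently requested
    priority: dict of app name -> rank set by the manager (1 = most important)
    Returns (grants, shortfalls); shortfalls are left for a human to resolve.
    """
    grants, shortfalls = {}, {}
    remaining = capacity
    for app in sorted(demands, key=lambda a: priority[a]):
        granted = min(demands[app], remaining)
        grants[app] = granted
        remaining -= granted
        if granted < demands[app]:
            shortfalls[app] = demands[app] - granted
    return grants, shortfalls


# Example policy: trading outranks reporting, which outranks batch jobs.
grants, shortfalls = allocate(
    10,
    {"trading": 6, "reports": 3, "batch": 4},
    {"trading": 1, "reports": 2, "batch": 3},
)
```

In this example the batch job is the only one left short, and it is the shortfall list, not the routine allocation, that would land on the manager's desk.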

To help further this along, Sun bought Fremont, Calif.-based Terraspring last fall. Terraspring handles data center virtualization and provisioning, so if, for example, one server in a data center goes down, its work is reallocated to a backup machine.

On Feb. 10, Sun announced a product that will run on Sun blade servers, and is already being piloted at Cingular Wireless. Everyone else will be able to get what Sun calls the N1 Provisioning Server 3.0 Blades Edition by the second quarter of this year. The company said this new provisioning technology will reduce server deployment from weeks to hours while increasing utilization and availability.

Meanwhile, Sun continues to work on a version that supports heterogeneous systems, everything from Linux to Windows 2000, said Hawley. "That's not available yet, but we are working with our own professional services to implement some pilot customers right now, including some in the financial industry on the East Coast," he said, though he declined to identify any.

Provisioning servers is only the first step, however. The next step is service provisioning, which moves the concept up the software stack. "So for example, you can have patch management across a lot of different pieces of your network, so when a new patch shows up that only applies to a specific version of Oracle, it knows where the patch is needed and just automatically applies it," he said. "That will start to kick in the latter half of this year."
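The patch example Hawley describes comes down to matching a patch against an inventory of what is installed where. The Python sketch below assumes a simple inventory format invented for illustration; the host names and version strings are made up, and a real tool would also have to apply the patch and confirm that it took.

```python
def hosts_needing_patch(inventory, product, version):
    """Return the hosts that run exactly the given product version.

    inventory: dict of host name -> dict of product name -> installed version
    (a hypothetical format, not any vendor's actual data model)
    """
    return sorted(
        host
        for host, installed in inventory.items()
        if installed.get(product) == version
    )


# Made-up inventory for illustration only.
inventory = {
    "db01": {"oracle": "9.2", "linux": "2.4"},
    "db02": {"oracle": "8.1"},
    "web01": {"apache": "1.3"},
}
targets = hosts_needing_patch(inventory, "oracle", "9.2")
```

Only the machines that match would be touched, which is why an accurate inventory is the first thing these tools need.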

Then, in 2004 and 2005, Hawley said, Sun customers can expect to see more sophisticated policy-based automation. IT managers will be able to specify high-level policy or service-level objectives and let the computers handle all the details, even for services that span multiple applications on multiple machines and platforms.

IBM is also on the autonomic computing bandwagon. Last fall, it formed the Autonomic Computing Organization to lead a multipronged initiative to bring self-configuring, self-healing, self-optimizing and self-protecting capabilities to everything from desktops to mainframes, software applications to middleware.

But management tools are not enough, said Steve Wojtowecz, director of strategy for IBM Tivoli. Companies also need to have the right processes in place, and their employees need the right skill sets.

He said IBM has more than 50 management tool products that can monitor and manage databases, ERP software, Web application servers, access and identity, and storage.

T. Rowe Price, for example, uses Tivoli Access Manager and WebSphere Application Server to quickly roll out Web applications, maintain security and allow customer self-management.

In general, security is one of the top areas in which management tools are deployed on Wall Street, said Wojtowecz, but, all in all, about 85 percent of IBM's Wall Street customers don't use automated management tools at all, or use them only for monitoring and analysis.

For those firms already doing automated monitoring, it's a short step to the next level, to where the computers make suggestions and recommendations about possible changes.

"The tools are there, but unless you've got the skills and processes that you need, you're not going to get there," he said.

Al Wasserberger, chairman and CEO at Chicago-based Spirian Technologies, said the tools to monitor systems, analyze behaviors, recommend changes and put those changes into practice have been around for seven or eight years, in various bits and pieces. "It's been cobbled together into an end-to-end solution over the last year or so," he said.

And some of these products work pretty well, he added. "The computer can analyze data more rapidly, and can implement solutions faster than human beings can," he said. "The primary thing keeping that technology from taking off is the limitations of people and their emotions. If you think about what scares people, it's the whole 2001 Space Odyssey Hal fear. We, as IT managers, aren't ready to accept that our ability to double-check after the fact is probably sufficient."

As an interim step, he said, Spirian and other management technology vendors are putting in extra permission points, where the computer, in effect, raises its hand and asks the administrator if it's okay to do something. Already, users are getting accustomed to giving permission for automatic virus updates, he said.
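A permission point of this kind can be sketched in a few lines of Python. This is a hypothetical illustration, not Spirian's product: the function and the category names are assumptions. Actions in categories the administrator has already approved run on their own, and everything else waits for a yes.

```python
def run_with_permission(category, action, ask_admin, pre_approved=frozenset()):
    """Run action() only if its category is pre-approved or the admin agrees.

    category:     label for the kind of change (hypothetical names)
    action:       zero-argument callable that makes the change
    ask_admin:    callable taking the category, returning True to allow it
    pre_approved: categories the admin has already agreed to automate
    Returns (ran, result).
    """
    if category in pre_approved or ask_admin(category):
        return True, action()
    return False, None
```

As administrators grow more comfortable, categories move into the pre-approved set, and the computer has to raise its hand less often.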

"As we grow emotionally we'll learn that self-management really is not threatening, really it's the most effective way to do this," he said. "Computers can recognize things that people can't, just because they have so many more cycles."

The next step, he said, is for self-management tools to be driven by learning algorithms, so that they will no longer be dependent on systems administrators to explicitly list all potential problems, their symptoms, and solutions. "Once we reach that point, then we will have really achieved true self-management," he said.

Not everyone agrees with this vision of the future, however.

Avi Rubin, professor and technical director of the Information Security Institute at Baltimore-based Johns Hopkins University, said the answer to increasing complexity isn't necessarily increased autonomy. Instead, he said, systems administrators should be aiming for increased simplicity.

"You want to disable everything that comes standard on the computer except for the core application that you need," he said. "If you don't need it, don't run it. You want to be minimalist on the systems. You want them to do as little as possible. Complexity is the enemy of stability and security."

Automation won't solve the problem of increased complexity, he said, because computers don't have any common sense. "Some aspects of staying up can be automated but you can't get away from a good administrator," he said. "You have to babysit your system, you have to check it all the time. The things you can automate are the things you can predict in advance, but computer systems fail, and they fail in unanticipated ways."

As a result, management tools aren't keeping up with the rising threats from viruses, bugs, and just plain accidental mismanagement. Rubin, whose book "Firewalls and Internet Security" is coming out in its second edition this week, said what IT managers should do is reduce the number of applications, improve interfaces so that it's easier for users to make the right security decisions, and switch to less virus-plagued operating systems like Linux.

 

Maria Trombly can be reached at 011-86-21-6387-7243 or by email at maria@trombly.com