Summary
In this interview with Artima, Electric Cloud founder John Ousterhout discusses the use of metrics in software production, what managers should look for when judging a software project's overall health, and the attitude developers need to take to keep projects under control.
Electric Cloud, a company founded by John Ousterhout, of Tcl fame, released a new version of its Electric Commander continuous integration tool. Electric Commander aims to provide insight into a project's build and test cycle, something Ousterhout calls the heartbeat of the development process.
In this interview with Artima, Ousterhout explains which project metrics development managers should watch to gauge a project's overall health and status, why builds break, and the two main modes of operation in running a development project.
Frank Sommers: Continuous integration tools, such as your Electric Commander, can provide a lot of data about builds, tests, or commits. In your view, what metrics convey a project's overall health the best?
John Ousterhout: Historically, the back-end of software development—what I call software production—has not received very much attention. As a first step, we have to have tools and automation. Once you have a basic level of automation, you can start using those tools to extract information about whatever area you want to look at. With that information, you can give people a much better feel for what's going on.
The most telling metrics about a project's health measure the build and test cycle. You can think of the build and test cycle as the heartbeat of your development process. If that doesn't work reliably, then you don't have a software development process under control.
The build and test processes have traditionally been mysterious things that barely worked, and when they did work, it was difficult to know how well. If they didn't work, you didn't know what was wrong or how to make it better. With tools such as our Electric Commander product, you can obtain useful data and reports that give you a feel for what's going on in those processes, visualize it, and then use that data to drive decisions in your software development activities.
Especially as a development manager, the most important thing you want to know is whether your software development process is under control. By under control, I mean: Can we make a plan for what we're going to do, execute that plan, and deliver the results in a predictable fashion? That's the number one job for every software development team.
What drives higher-level managers crazy is not that a project takes longer than expected, which is often the case. Instead, it's that you can't predict when the project is going to be done. That makes it hard to plan for the rest of your organization. If you told me the project would take two months longer than we had hoped, but you were certain we could deliver on that later date, that project is, in fact, under control.
The only way you know if a project is under control is if you're actually producing things that work. When a developer comes and tells me they are ninety percent done, and that they'll be ready to check their work in within another week, I don't believe that. Software developers can't predict things that well. You don't really know how much you've got done until it's really, truly, completely done, in the sense that it's been checked in, it has passed all the tests, and it's part of your production build and test cycles.
If you don't know what's going on with your production build and test cycles, you don't really have a good feel for what's there and how reliable your full development process is. That makes it very difficult for you to make accurate predictions about when a project will complete.
So the most important thing, if you're a manager, but also for developers, is to have a good sense of whether things are working or not, and, if they're not, what kinds of things are going wrong. The first step is to keep track of that.
Frank Sommers: What are the most surprising things managers find out about build and test cycles when armed with accurate project metrics?
John Ousterhout: The most surprising thing for people is how unreliable their builds are. It may be that the data shows that most, or all, of your builds fail. At Electric Cloud, for instance, we went through a month when our products didn't have any successful production builds. If you hadn't been collecting that data and looking at trends, you'd think the builds had broken maybe on the last day; surely it couldn't have been a month without a successful build.
Another example is resource management. We have a lot of data on how resources are being utilized. You may see, for example, that at certain times of the day, there is a huge backlog of people waiting to do production builds. That's affecting your software development, but you can't see it until you actually get some metrics on your build and test process.
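To make that concrete, here is a minimal sketch, in Python, of the kind of trend and resource reporting Ousterhout describes; the build records, field layout, and numbers are hypothetical, not Electric Commander's actual schema or output:

# A hypothetical build log: one record per production build.
from collections import defaultdict
from datetime import datetime

builds = [
    # (start time, seconds spent waiting for a build machine, succeeded?)
    (datetime(2007, 3, 5, 9, 14), 1260, False),
    (datetime(2007, 3, 5, 17, 2), 180, True),
    (datetime(2007, 3, 6, 9, 41), 1545, False),
    # ... one record per build over the period of interest
]

# Trend view: success rate per calendar day. A month of mostly-red days
# stands out here, even when each day looks like a one-off failure.
per_day = defaultdict(lambda: [0, 0])          # date -> [successes, total]
for started, _, ok in builds:
    day = per_day[started.date()]
    day[0] += ok
    day[1] += 1
for day in sorted(per_day):
    ok, total = per_day[day]
    print(f"{day}  {ok}/{total} builds succeeded ({100 * ok / total:.0f}%)")

# Resource view: average wait for a build machine by hour of day, which
# surfaces the kind of daily backlog described above.
per_hour = defaultdict(list)                   # hour -> wait times (s)
for started, wait, _ in builds:
    per_hour[started.hour].append(wait)
for hour in sorted(per_hour):
    waits = per_hour[hour]
    print(f"{hour:02d}:00  avg wait {sum(waits) / len(waits) / 60:.1f} min")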
Frank Sommers: What makes the build and test cycle so unreliable?
John Ousterhout: In my experience, it's never any one thing that's terribly interesting. It tends to be lots of little things.
What happens is that people allow what I call lint to accumulate. No single piece of lint is that bad, but when you go to clean your dryer, there is a lot of lint there, and it's very dirty. In software development, the tendency is to let little things go by. These are not so much bugs in the code as small things, such as a flaky test that's hard to fix because it has timing dependencies, so it sometimes passes and sometimes fails.
You let these build up to the point where, all of a sudden, in every test run you have a few tests that fail. And then you wonder, Are these really flaky tests, or is there something wrong with the product? By the time you get to that point, though, ninety percent of the failing tests are flaky, and in the remaining ten percent there is, in fact, a real product problem. But you've been ignoring all those occasional test failures, including the ones that indicate a product problem.
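As an illustration of the timing-dependent flakiness Ousterhout describes, here is a hypothetical Python example (the indexer job and test names are invented for this sketch): the first test races a hard-coded sleep against a background job and fails only occasionally; the second waits on the job's completion signal, so a failure means something is actually wrong.

import random
import threading
import time

def start_indexer(done: threading.Event) -> None:
    """Hypothetical background job whose duration varies from run to run."""
    def work():
        time.sleep(random.uniform(0.01, 0.08))   # simulated variable workload
        done.set()
    threading.Thread(target=work, daemon=True).start()

def test_indexer_flaky():
    # Races a hard-coded sleep against the job: sometimes passes,
    # sometimes fails, depending on scheduling and machine load.
    done = threading.Event()
    start_indexer(done)
    time.sleep(0.04)
    assert done.is_set()

def test_indexer_deterministic():
    # Waits on the completion signal itself, with a generous timeout,
    # so a failure points at a real product problem.
    done = threading.Event()
    start_indexer(done)
    assert done.wait(timeout=5.0)

The design point is to wait on the event the test actually cares about, with a generous timeout, rather than guessing how long the work will take.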
That buildup actually happened to us here about six months ago. We finally decided to go on a campaign and eradicate every single little problem. Some developers complained about spending time on something that might just be a flaky test and not a product problem: Why should I spend time fixing that? Wouldn't it be better to work on the next new feature? We said, No, we can't live in a world where we don't have a reliable build and test process.
Once we got things fixed, we decided to be absolutely religious about fixing every one of these little problems as soon as they came up again. Then, suddenly, our builds became totally stable: about ninety-five percent of our builds are now perfect. It was not just one thing, or a major architectural flaw, but an accumulation of lots of little things.
In a software development project, there are basically two modes of operation. One mode is to fix every little problem right away, and so to stay constantly in a state where things are pretty stable. Once you get things to that state, it's easier to keep them there, and you don't need a huge ongoing effort to keep things fixed.
The other way is to decide that you don't have time to fix those little things, that you've got to focus on the bigger things, new features, for example. Then so many little things accumulate that eventually things just don't work, and you end up constantly fixing the little things anyway: to go from a hundred percent build failures to ninety percent, you're fixing the same number of problems on average as in the first mode of operation. But you're living in a world that's continuously unstable, where you always fix just enough to get back to a modest amount of stability, and then within a week things break again.
I think most teams are in that second mode. To get from the second mode to the first, you have to switch to zero tolerance. You have to have a campaign, and you will suffer for a while. You'll just have to work your way through it; it's going to be painful, people are going to grumble and complain, and it may seem like you're never going to get to the end of it all, like it's one problem after another. But you do actually get to the end. And once you get there, it's relatively easy to stay there.
I probably have a more radical view on this than many people. I really believe that the lint counts, and that you need to develop a mentality in your projects where you just don't allow it. You fix things right away.