Summary
"... when you have eliminated the impossible, whatever remains, however improbable, must be the truth..."
I've been hacking on the typesystem for this pretty cool language used in our applications. It's a little gnarly and you really want to keep your eyes on the road while you're moving bits around down there. Anyway, I added my featurey thing, ran some tests (they passed), and checked in. Our poor test harness is having some growth issues these days, and that leads to some pretty long latencies between a checkin and test failure alerts. So about 3 hours after I checked in I got the first of what would be many test-break mails. 248 broken tests. Uh-oh. Another mail. 1187 broken tests.
Actually this was promising. Test-break records are rare these days; to get one you have to break something pretty fundamental, so I was hopeful. In the end, though, I didn't even come close to a record, breaking a measly 11,000 tests. I had no idea why, though, and without the pride of a record to motivate me, figuring it out was going to be just another day of tree-traversal development.
Keep in mind it's been 3 hours since I checked in so I'm on a whole different thought train. Now I have to stop what I'm doing and start traversing up the tree to figure out what happened and how to fix it. This is even more frustrating because, as I mentioned, I did run some tests, both the new ones for the featurey thing and the ones that I concluded were relevant to the code I touched. And the test harness is telling me I broke things hell-and-gone away from where I was chopping up the code. It's kind of like putting a nail in the wall of your house to hang up a picture and having your car burst into flame in the driveway because of it.
(I should mention that the cause of the break wasn't actually in code we ship; it was some test harness code that, I argue, was not as well considered as it could have been. So it wasn't that tests were breaking so much as testing was breaking.)
These sorts of things, which could broadly be classified as unexpected work, are coming up with some regularity these days, and I was complaining about it, as you do, while several folks from the crew I'm on and I were heading out to lunch at an entirely mediocre deli. The root of the complaint was "Why?" "Why are we seeing an increasing rate of unexpected work?" Dark matter. Call it what you will, it sucks. It sucks your will to live. This tree-search way of working gets tiring in a hurry. If you've been doing this for even a short while you know what I mean. How absolutely draining and demoralizing it is. How much it makes life, inside and outside of work, truly suck.
You know, you hit another problem you weren't expecting, you traverse back up the tree, sit back in your chair, maybe take a deep breath, and stare at the screen for a few minutes because, now, with so many traversals these days, it takes a real act of will to go deal with whatever it is that's suddenly, annoyingly, in your way. This despite the fact that you're going to fall behind in your current task, which is already behind because of unexpected work during your previous task, and you should get on this right now. And, if you love coding, really dig putting things together, you take that sighing resignation home with you when you stop typing for the day. How could you not?
So why? I've been doing this for a little while now and, though I've asked the question many times, I don't think I've ever come up with the answer. Probably because there is no silver bullet. But I think the core cause is pretty apparent. Time. Time and pressure. We're required to make commitments to dates with so little information that the commitments are meaningless. But they're made, and being professionals, more or less, we try hard to stick to them. At the expense of quality, which is really just an indirect way of also saying at the expense of the quality of life of developers. Because if you're like me, you stress bad when you know you're throwing code around too quickly, too often.
This is an old, old, old story and complaint. And spare me, please, the cry of XP! XP is fine when you have 2, 5, even 10 folks hacking on code for oneish customer and the reqs are reasonably solid and well understood. But if you have 50 developers writing code for customers literally around the world, code that is to be consumed by hundreds of other developers, XP just doesn't work. Indeed I think it's probably exactly the wrong sort of process to use. But that's another post.
Nevertheless, what to do? Ultimately I think the answer is that two possible things can happen. First, change how this sort of software gets sold. For whatever reason the software sales guys seem hard pressed to sell what's in the box. Rather they sell the next version or, perhaps more correctly, what might be in the next version. Or so it seems to me; I'm not a sales type. Second, change how this sort of software gets purchased. It seems that the purchasers are entirely inclined to suicide. It appears, as best as I can discern, that these folks would rather have software that does N things only marginally well on a particular date than have N-X features that work reasonably well on the same date, with the remaining X features arriving later. This despite the fact that they really aren't even in a position to consume N features, because their developers are dealing with their own load of unexpected work. It's truly insane. But changing people is either hard or impossible, so I'm not holding out much hope.
You'll notice I don't talk about anything I should do. I don't think there's anything I can do. I have, with the very best of intentions, tried most everything. Top down, structured, UML, XP, C, C++, Java. Given those "in my professional opinion" type warnings about quality and TCO. And I do mean best of intentions. I've given it my very best shot for years and years, and still I end up in circumstances where I'm not comfortable with the product of my labor because, again, I don't seem to have the time. In all fairness I'm not a great developer, something I think I can say because I have worked and do work with great developers. Those 10x-more-productive types. But, with no undue hubris, I'm not a sucky developer either. So let's call me competent. And committed. If I'm both of these things, and I've given it my best shot at crafting long-lived software that doesn't require me to kill myself with quick-fix rescue efforts to ship on time, and failed, what am I to conclude? Seriously. What?
Part of the XP problem is that it doesn't have a phase that lets people elaborate on the risk of tackling a task.
The man with a plan always wins, even if it is not a good plan. The good-willed developer writing code to meet someone else's deadline will always lose to a manager with an impossible schedule. Good-will is not a plan, an unrealistic schedule is :-)
If your test harness is compromised, that is technical debt that must be paid and not spread across the cost of new development.
My take on it, assuming people would agree the tests are broken, is to raise it as a road-block to management, estimate the cost to fix it and press for a decision. Assuming this little snafu cost you a day or two, I would pad all sizings with a conditional 1-2 days until the problem is fixed. I would make clear that these 1-2 days are being tacked on to the estimates due to the potential of problems with the test harness so that there is no surprise.
Most of the time I found myself working funny hours on unexpected work was because (1) I was surprised by a situation like yours or (2) because I didn't size the work correctly.
> I've been hacking on the typesystem for this pretty cool
> language used in our applications. It's a little gnarly
> and you really want to keep your eyes on the road while
> you're moving bits around down there. Anyway, I added my
> featurey thing, ran some tests (they passed), and checked in.
May I ask a question? Isn't this the kind of problem for which all the CS guys spent the active days of their lives inventing proof techniques that no one outside of their little community seems to know and use? I ran into comparable stuff last year when, sort of, translating grammars into particular NFAs, and I wished I could delegate tasks to a graduate student who had the time, patience, and education to do the actual proofs, gradually improve the algo according to the books, and eventually crank out a little paper in the end. After reaching a stable state I was done and needed a week of vacation.
It was one of those problems not really well located in business at all but living at the border of development and research, something not many people are doing these days and the SE discourse is mostly silent about. Maybe I misrepresent what you are doing entirely, but the story sounded like something I'm very familiar with, since I've recently spent more time developing/debugging nontrivial algos [1] than drawing and connecting little UML boxes and filling them with content that is easy to implement once the spec is understood.
[1] Which is a total headache and makes life miserable. On the other hand, once you are living the life of a virtual grandson of Donald Knuth you can become addicted to it and hanker for these kinds of challenges after some timeout. Doing this within the bounds of a company might not be too bad, though, because one remains socialized.
ummm...why does it take 3 hours to get notified that you broke the code? I think you need to seriously reconsider how your continuous integration server is configured.
> It was one of those problems not really well located in
> business at all but living at the border of development
> and research, something not many people are doing these
> days and the SE discourse is mostly silent about.
That's what folks in management tend not to understand when they ask you for an estimate on a task. There always seems to be the notion that developers know everything, and if they don't, it's just a matter of minutes to understand the problem or technology. Don't get me wrong, it's not entirely the fault of managers; if anything it's the fault of the developers.
Most of the developers I came across fall into one of the two following categories: a) they completely overrate their capabilities, or b) they know they can't do it in that time but don't have the courage to stand up and speak their minds, because you might look stupid if you do.
It's up to you which category is worse. I just think there should be more time to research, investigate or try to find a proper solution for a problem. If management wants you to be more productive, they need to give you more time to investigate and you should be the one asking for it!
> ummm...why does it take 3 hours to get notified that you
> broke the code? I think you need to seriously reconsider
> how your continuous integration server is configured.
I'll answer that one, since I work at the same company... We have about 42,000 tests running on each of about 30 branches; the tests are split across maybe 40 suites (maybe more). So when there's no contention, the turnaround time is maybe an hour between building all the products, deploying to the test servers, and waiting for the longest suite to run, with faster suites giving faster feedback. The problem is that we don't have 1,200 test servers set up so that we can run all tests on all branches simultaneously (I think we've got around 120 instances total), so at peak check-in times there can be some nasty backlogs that take an hour or two to clear before the tests actually run. Our investment in, and ability to manage, test hardware hasn't really kept pace with the number of tests and branches we're managing.
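Just to put rough numbers on that backlog (everything here comes from the figures above except the average suite duration, which is purely my guess):

    // Back-of-envelope CI queueing estimate. The branch, suite, and
    // instance counts come from the description above; the ~10-minute
    // average suite duration is an assumption, not a measured figure.
    public class CiBackOfEnvelope {
        public static void main(String[] args) {
            int branches = 30;          // branches under test
            int suitesPerBranch = 40;   // roughly 40 suites each
            int instances = 120;        // test server instances available

            // Running every suite on every branch at once would want one
            // instance per suite run: 30 * 40 = 1200.
            int suiteRunsAtPeak = branches * suitesPerBranch;
            double oversubscription = (double) suiteRunsAtPeak / instances; // 10x

            double avgSuiteMinutes = 10.0; // assumed average suite duration
            double queueHoursAtPeak = oversubscription * avgSuiteMinutes / 60.0;

            System.out.printf("Suite runs wanted at peak: %d%n", suiteRunsAtPeak);
            System.out.printf("Oversubscription: %.0fx%n", oversubscription);
            // ~1.7 hours, in line with the "hour or two" backlog above.
            System.out.printf("Rough queue delay at peak: %.1f hours%n", queueHoursAtPeak);
        }
    }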
I should add that part of that is due to some architectural choices we made around structuring the tests; most of our tests now run basically against the full server stack using Jetty and an H2 database, even if they're only testing part of the code. We'd originally gone the "mock everything" route, only to have it blow up horrifically all the time, as the dependencies of any given test were impossible to determine a priori, the number of mocks was out of control, and the mocks would diverge from the production code too often. So we made the choice to run all our tests with the same set of real, production dependencies, which solved most of those problems but made the tests slower.
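For what it's worth, that style might look something like this minimal sketch (my invention, not our actual harness): a shared JUnit 4 base class that boots one embedded Jetty server and an in-memory H2 database for the whole suite. The servlet and port are invented placeholders, not the real application wiring, and I'm assuming the javax-era Jetty servlet APIs.

    // A minimal sketch of the "same real dependencies everywhere" style
    // described above: every suite extends one base class that boots an
    // embedded Jetty server and an in-memory H2 database instead of
    // per-test mocks. Assumes JUnit 4 and javax-era Jetty 9 APIs.
    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.servlet.ServletContextHandler;
    import org.h2.jdbcx.JdbcDataSource;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;

    public abstract class FullStackTestBase {
        protected static Server server;
        protected static JdbcDataSource dataSource;

        // Placeholder servlet; a real harness would mount the full
        // application stack here instead.
        public static class PingServlet extends HttpServlet {
            @Override
            protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                    throws IOException {
                resp.getWriter().print("pong");
            }
        }

        @BeforeClass
        public static void bootStack() throws Exception {
            // In-memory H2 stands in for the production database;
            // DB_CLOSE_DELAY=-1 keeps it alive between connections.
            dataSource = new JdbcDataSource();
            dataSource.setURL("jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1");

            // One embedded Jetty instance per suite, shared by all tests.
            server = new Server(8080); // invented port
            ServletContextHandler context = new ServletContextHandler();
            context.setContextPath("/");
            context.addServlet(PingServlet.class, "/*");
            server.setHandler(context);
            server.start();
        }

        @AfterClass
        public static void stopStack() throws Exception {
            server.stop();
        }
    }

The trade-off is exactly the one described: no mock drift and no hidden dependency surprises, at the cost of every test paying for the full stack.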
And the answer that you've found, as to why the pain keeps happening, rings true: "Time. Time and pressure. We're required to make commitments to dates with so little information that the commitments are meaningless."
The truth is that your salesforce has to lie.
They have to promise feature X on June 1st.
If they don't promise this, then your competitors will promise it; then your company will collapse and you'll be out of a job. Your company is fighting savagely for survival in the tiniest of Darwinian niches, generating a pressure that necessarily pushes all programmers (in your and your competitors' companies) to prioritise time-to-market over quality (essentially prioritising the short-term over the long-term).
Of course, you can still ask: why? In which case you'll probably get the answer: because your customers can make more money with early, crappy software than late, good software.
And if you ask again: why? You'll probably get the answer: because your customers' customers are tolerant of temporary and (possibly) frequent high-tech failure.
Again, why? Because software has reduced the severity granularity of modern failure.
Why? Because software enables the low-cost re-try.
Two hundred years ago, without all our modern "conveniences," there were simply fewer things that could go wrong, but what could go wrong was potentially more costly to solve than today. Your horse might start coughing as you head out to plough the fields; not a good sign: you can either rest him (missing valuable ploughing time, or ploughing manually yourself) or press him to work (and risk losing the poor thing to illness).
Of course, today planes tumble from the sky and bridges tear themselves apart; the modern age is not immune to catastrophe. But there's just so much other, unnecessary stuff in the world today, and a lot of it runs software. In general, when it doesn't work, you just try again. Phone call doesn't get through? Try again. Coffee machine dispenses something vague and unexpected? Try again. Web-page doesn't load? Try again. Bar-code doesn't scan? Try again.
In fact, there's probably one more layer of "Why?" that we could peel back. The reason we're accepting of the quick try-again is the short amount of time it takes to perform the try-again compared to the amount of time it takes us to come to that decision point in whatever process we were exploring.
If we are booking a cinema seat over the web, and we've selected our film, time, and seat position, and we are about to reach the credit card information screen when the site throws us a "Page not found, please reload," then we'll probably reload (or go back) to see whether all our details have been remembered, so that we can try to reach the credit card screen again. We won't immediately abandon the site and try somewhere else (or start from the beginning again) because of the time it's taken us to shovel in all the data; it's simply more cost-effective to try to reload the screen than to start from the beginning.
I'm sure there's a cut-off point, some magic ratio of duration-of-process over duration-of-step-retry; let's call it the perseverance ratio. If the time it takes to complete the overall process, compared to the time it takes to overcome an obstacle in a single step, is above that ratio, then we'll try the step again and persevere towards our process goal; below it, we'll abandon.
At each obstacle, furthermore, we probably make an intuitive extrapolation about the number of obstacles we can expect before achieving the goal. If the goal might take 10 seconds, and we encounter a 1-second step retry at the 3-second stage, we will probably retry; but if we encounter another at the 4-second stage, and another at the 5-second stage, etc., then we'll probably abandon.
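As a toy model of that decision rule (my own formalization; the cut-off value and the example numbers are invented, since I'm only positing that such a threshold exists):

    // Toy model of the "perseverance ratio" sketched above. The cut-off
    // value and the example numbers are invented for illustration.
    public class Perseverance {
        static final double CUTOFF = 5.0; // invented threshold

        // processSeconds: time the overall process takes end to end;
        // retrySeconds: time to overcome the obstacle at this step.
        static boolean persevere(double processSeconds, double retrySeconds) {
            double ratio = processSeconds / retrySeconds;
            return ratio > CUTOFF; // retries cheap relative to the goal: keep going
        }

        public static void main(String[] args) {
            // The cinema-booking case: minutes of data entry vs. one reload.
            System.out.println(persevere(300, 5)); // true: reload the page
            // A retry nearly as long as the whole process: abandon.
            System.out.println(persevere(10, 8));  // false: walk away
        }
    }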
So, why is programming such a pain?
Because software enables a high perseverance ratio.
Rick! I don't know what your position in the company is, but if you are a developer, most of that situation is NOT your fault! The old 80-20 rule (I think it's from Tom DeMarco) applies here: a developer has roughly 20% influence on the outcome of a software project, whereas management has about 80%. That's why those guys get the bigger paychecks.
I think part of your 20% is the fact that you accepted that situation for so long. There's a critical question you have to ask yourself: what happens if you just fix your test suite and forget about the deadline for a while? Will a star collapse? Will there be a war breaking out? Will aliens invade the planet?
Will the situation be any different if:
a. You miss the deadline because hurrying and sloppiness make you break even more tests along the way?
b. You miss the deadline because you fixed the tests?
Good Luck!
Sebastian
P.S.:
To XP: A change in process would not help you in that situation, at least not in the short term. A sane project manager would. 30 branches is really something. Almost Microsoftism ;-).
I'd love to add to your guilt and misery, Rick -- just because that's the sort of person I am. But, no, not all bad software is your fault.
This particular piece of bad software is your fault, however.
I've seen some pathetic whining in my time, but yours takes the cake. You checked it in; it was automatically tested; it broke. That's sort of your fault, really.
Not sales-peoples'. Not bosses'. Yours.
As you say, XP (or any other silver bullet, not that XP qualifies as anything like a silver bullet unless you're a raving maniac trying to fight off imaginary code vampires) won't work here.
The only marginally interesting point in this entire post is your comment that you can't tell whether the code is broken, or the 11,000-odd tests are broken.
I'm with you here: no silver bullet. Ever.
Just make sure you get away from an environment where it takes three hours of running broken tests against minor changes before you get the red flag. That will never work.