
The Healthcare Exchanges – A Failure of Leadership

As the launchpad explosion of the Obamacare Healthcare exchanges continues to play out, the risk builds of putting the blame on the programmers.  This would be a mistake, and I don’t say this out of some sense of fraternal protectiveness for the developers.

The blame here lies squarely on leadership.

Notice that I don’t say management.  I say leadership, in particular, the political leadership.

It’s true that the code isn’t very good.  In fact, it would be fair to say that the piece of the code we’ve been able to see – mostly the javascript – is lousy with bugs and poor coding practices.  And it’s also true that Barack Obama and Kathleen Sebelius didn’t write the code or manage its development.  But they are responsible for the overall environment in which that code was written, including the timeline and the expectations.

They were so committed to the October 1 launch date for the entire system that they didn’t leave enough time for proper development and testing.

It’s quite clear that the development team simply wasn’t given the time it needed to do the job.  By some reports, top-level decision-makers were so slow in approving requests that certain key functionality had about ten months in development.  That’s simply not enough time for a large, complex system that involves a lot of cross-communication with other systems.

What’s also not clear unless you’ve worked in software development for a while is that such large systems are developed with a lot of back-and-forth between the pieces.  Data that is needed may not be available at a certain point in the process; you may find out later on that you want to keep users from entering certain kinds of data in combination, and so forth.  A lot of this only becomes clear once development is well underway, and UI developers – the guys who create and program the User Interface that you see – are usually at the tail end of the process.  This often gives them the least amount of time to do their work.  Since the sorts of error messages we’re seeing show that even the underlying middle-tier and database code still isn’t ready, it’s no surprise that the user interface keeps breaking.

As mentioned above, the political leadership, evidently fearing political fallout from any delay whatsoever, has decided to act like Soviet leadership facing a poor harvest.  (Was there ever any other kind?)  Instead of facing reality, and preparing people for that reality, they resorted to a number of strategies.

In such situations, it’s almost impossible to make up for that by throwing more people and resources at it.  Design and development can only bear so many chefs; the complexity isn’t in the small pieces of the code that need to be written, but in the overall picture of how the various systems fit together, and the business rules that need to be enforced at each step.  Adding more people to the design process isn’t going to get the work done any more quickly.  And just adding more developers to the coding process doesn’t solve that problem at all.  In fact, it can make it worse, as different coding styles begin to conflict with each other.

And yet, that’s just what the administration seems to have done, with an initial price tag of $100 million ballooning to over $630 million as of this writing.  When they asked how things were coming, they were told, “not well.”  And when they asked if more resources would help, and were told, “yes,” they wrote a bigger check.  They may not even have asked the second question, and the contractors may have said yes in order to get a bigger check.  And no doubt the contractors will be hauled in front of a Congressional committee to tell this story.  The point is that it’s possible to imagine a scenario where Washington reacted as Washington always has – write a bigger check – and such a reaction was doomed to compound the failure, not ameliorate it.

Their public reaction to the failure has been equally Soviet.  They first blamed circumstances beyond their control – for the Russians, it was the weather; for this crew, it was server load.  In neither case was that factor beyond normal.  And yet they stuck with it for days after it became clear.

And now, we’re being told that “it’s getting better every day.”

Here’s the secret: it’s not.  What’s happening is that the developers and designers are now rushing to meet another hard deadline – December 31, when people will, by law, have to have signed up for health insurance, or risk fines from the IRS.  The scenario they’re desperate to avoid is one where someone can’t sign up, files his taxes, has his refund garnished by the IRS to enforce the penalty, and files a class-action lawsuit in the middle of election season to remind everyone of last year’s Hindenburg.

The development team still isn’t getting the time it needs to do this right, and in continuing to rush to rebuild a system that already exists, it’s only going to make things worse.  It may succeed in hiding some of the more public failures, but the back-system stuff is going to be held together with chewing gum and baling wire, and is going to be ripe for hackers and routine breakages.

As I said before, just be glad that nobody’s actual care is depending on this thing.  Yet.


Healthcare Exchanges – Why and How They Failed

This post originally appeared at PJ Media Lifestyle (“No Good Excuses Exist for the Failure of Obamacare’s Expensive Website”).

By now, it’s hard to decide if the launch failure of the Obamacare exchange websites isn’t funny anymore, or just keeps getting funnier.

Sites went down — including the individual state sites for states that are running their own exchanges. When people weren’t getting “due to an extraordinarily high volume of calls” errors, they were getting 404 Not Found messages, and pages were finding new and creative ways of erroring out. Even Wednesday afternoon, I was getting server errors just trying to finish the account creation process on the California site.

Almost as quickly as the train wreck itself unfolded, the explanations for it evolved. First, both President Obama and then Press Secretary Jay Carney claimed with straight faces that the failures were a result of the massive interest in the exchanges. Then, others claimed that these were normal rollout errors that occur with all large, complex systems. Finally, as the engineers rolled the platform back to the hangar for retooling, there was no hiding the fact that this was indeed a software failure, not just a set of normal launch “glitches” (to use the press’s word du jour).

The exchanges’ bad day brought to mind a number of other high-profile website failures, including the Romney campaign’s spectacular white elephant of a killer whale, Orca.

I’ve been in web development for most of my professional career. I’ve participated in successful launches, and launches that needed to be rolled back and fixed. I’ve spent very long days dealing with one error after another, and equally long, uneventful days waiting for the deluge that mercifully never came.

It’s always easy to criticize someone else’s failures, and with my luck, tomorrow the QA guys will rain down trouble tickets on my head like nobody’s business. Nevertheless, it remains inescapably true that while there were reasons this happened, they weren’t good reasons, and could have been avoided. Given three years and hundreds of millions of dollars for development, they should have.

Here’s why, and how.

How Web Systems Work

First, a very simplified description of how large, commercial websites are put together nowadays. They basically have three layers of servers – 1) the web layer, which talks to you, the user; 2) the database layer, where the data is stored; and 3) middle-tier layers, which figure out what questions they need to ask the database, and what they need to tell the database, in order for the front-end that you see to work properly.

Each layer consists of many servers. You may be talking to Web Server 1 for a little bit, and then switch over to talk to Web Server 2. And Web Server 1 may send your first request to Middle Tier 1, and your next request to Middle Tier 5. This lets them answer many more questions at once, and talk to many users at once. It’s how Google is able to get results back to literally millions of simultaneous requests almost instantaneously.

These layers have traffic cops (called “routers”) to make sure that no one computer is trying to handle too many questions at once. Other traffic managers keep track of who you are and where you are on the site, so you don’t have to keep starting over.
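To make the traffic-cop idea concrete, here is a minimal sketch of the simplest scheduling rule, round-robin, where each new request goes to the next server in line. The class name and the server names are made up for illustration; real traffic managers also track server health and user sessions, which this leaves out.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hypothetical traffic cop: hand each request to the next server in turn."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self._servers = cycle(servers)  # endlessly repeats the server list

    def route(self, request):
        # Pick the next server in the rotation for this request.
        server = next(self._servers)
        return server, request


# Illustrative server names, not anything from a real exchange site.
router = RoundRobinRouter(["web-1", "web-2", "web-3"])
assignments = [router.route(f"req-{i}")[0] for i in range(5)]
print(assignments)  # ['web-1', 'web-2', 'web-3', 'web-1', 'web-2']
```

No single machine gets two requests in a row until every other machine has had one, which is the whole point.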

There are even multiple databases. Data that change a lot (this is called “volatile”), like information about you, or your orders, or billing information, may only be stored once (and backed up regularly). But information that doesn’t change very often, like plan pricing and terms, may be stored in more than one database, to make it faster and easier to get to.

Web systems have used this basic architecture for over a decade now, and launching large, complex sites is now less art and more science.

What Can Go Wrong

Of course, no technology is foolproof, and large, complex websites do fail.

First, users are unpredictable. There’s a saying that you can make something foolproof, but you can’t make it damn-foolproof. People are ingenious in the ways they will misuse something that you put in front of them, and programmers are always complaining about users “doing it wrong.” Of course, it’s not the users who are “doing it wrong,” it’s the programmers who didn’t anticipate their doing it that way.

Second, servers will fail, network connections will fail, routers will fail. Sometimes this just happens, and there’s not much you can do about it, except hope that whatever’s left can handle the load, while you work to get the servers back up.

Sometimes, the load really is too large for the servers’ performance limits and number of servers. Web servers can only handle so many questions per second; the same is true for middle-tier and database servers. This is what happened to the Colorado Rockies in 2007, when seemingly all of Colorado tried to buy World Series tickets at once. The traffic jam brought the website to its knees, and people had to wait a day for the engineers to rework it so that wouldn’t happen again.

And sometimes, programmers just mess up. The database isn’t designed right, and it either loses information or takes too long to answer questions. The middle tier doesn’t ask the database the right questions, or fails to store what the customer needs stored. The web server can ask for information that isn’t there, lose track of where you are in the site, show you stuff you didn’t ask for, or let you choose things that don’t make sense in combination.

And the layers can send the wrong information to each other, or misread the information that gets sent to them by other layers.

How You Keep Things From Going Wrong

Test.

Test.

Test.

Of course, programmers are responsible for testing their own code as far as possible. But programmers are usually the worst people to test their own code. They know where all the bodies are buried, and only the most disciplined are likely to test things they know are likely to break. After all, they’ve fixed it before, and are heartily sick of making sure that the date field doesn’t bomb when someone enters 11//1994, instead of 1/1/1994.
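For the curious, the date-field case can be sketched in a few lines. This is only an illustration, assuming a month/day/year text field; the function name is hypothetical. The idea is that bad input comes back as "not a date" instead of blowing up the page.

```python
from datetime import date


def parse_us_date(text):
    """Parse M/D/YYYY text; return a date, or None if the input is malformed."""
    parts = text.strip().split("/")
    if len(parts) != 3:
        return None
    try:
        # int("") raises ValueError, which catches the "11//1994" typo;
        # date() raises ValueError for impossible dates such as 2/30.
        month, day, year = (int(p) for p in parts)
        return date(year, month, day)
    except ValueError:
        return None


print(parse_us_date("1/1/1994"))
print(parse_us_date("11//1994"))
```

The tester's job is then to throw every malformed string he can think of at it and confirm that none of them produces anything but a polite rejection.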

There are QA testers, who make sure that things work as advertised. They’re given a list of expected behaviors, and run through the site, making sure that it does the things the programmers say it will do. More importantly, they run through the site, deliberately making mistakes, to be sure that the site doesn’t break.

There’s beta testing, which basically is a larger group of people who aren’t given any specific instruction. They’re the ones most likely to imitate actual users, since ideally, they have no preconceptions of how the site is supposed to behave, and where it might break.

There’s load testing, which simulates a huge number of hits, all at once, to make sure that the servers don’t buckle and fold like a cheap suit when everyone tries to buy that cool toy all at the same time.
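A toy version of a load test looks something like the sketch below. Everything here is an assumption for illustration: the fake server, its capacity limit, and the "200"/"503" strings standing in for success and overload responses. Real load tests hit real servers with dedicated tools, but the principle is the same: fire a burst of simultaneous requests and count how many get turned away.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """Stand-in server that can only work on `capacity` requests at once."""

    def __init__(self, capacity):
        self._slots = threading.Semaphore(capacity)

    def handle(self, request_id):
        # Refuse immediately if every slot is busy, like an overloaded site.
        if not self._slots.acquire(blocking=False):
            return "503"
        try:
            time.sleep(0.01)  # pretend to do some work
            return "200"
        finally:
            self._slots.release()


def load_test(server, n_requests):
    """Fire n_requests at the server at once and tally the outcomes."""
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        results = list(pool.map(server.handle, range(n_requests)))
    return {"ok": results.count("200"), "rejected": results.count("503")}


print(load_test(FakeServer(capacity=5), 50))
```

The exact split between accepted and rejected requests will vary from run to run, since it depends on thread timing. What matters is that you find out the rejection rate before launch day, not on it.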

What Went Wrong

From the evidence, it’s clear that the Obamacare exchange servers saw errors of all different kinds. They weren’t prepared for the load, even though this was never very heavy. California reported about 600,000 unique visitors, and Colorado reported about 55,000 unique visitors.

There were screen captures of database errors, not because the data was bad, but because the structure that holds the data was misdesigned.

There were 404 errors, which are purely design errors, meaning that the web server was trying to get to a page that didn’t exist. (This led to the best hashtag of the day, #404care.)

There were nondescript server errors like the one I got from the California server.

There were user-interface errors. At about 10:00 AM, Colorado suspended new accounts on its site (it’s one of the ones using its own site, not the main exchange site), and didn’t get around to allowing new accounts again until 3:00 PM. At that point, the “New Account” button sent you to the login page for existing accounts. If you chose to enter your childhood phone number for a secret question, it wouldn’t take it, no matter what format (certainly not the format it used when asking for your current phone number).
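The phone number problem is the kind of thing that takes only a few lines to get right. As a sketch, and assuming ten-digit US numbers (I have no idea what the Colorado site actually did internally, and the function name is made up), you strip everything that isn’t a digit and then check what’s left:

```python
import re


def normalize_us_phone(text):
    """Return the 10 digits of a US phone number, or None if it isn't one."""
    digits = re.sub(r"\D", "", text)  # drop spaces, dots, dashes, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # tolerate a leading country code
    return digits if len(digits) == 10 else None


for sample in ["(303) 555-0100", "303.555.0100", "+1 303 555 0100", "555-0100"]:
    print(sample, "->", normalize_us_phone(sample))
```

With that in place, the user can type the number in whatever format he likes, and the same rule can serve both the current phone number field and the secret question.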

This is why I say it was clear that this wasn’t just one of those things. The volume of inquiries wasn’t high by large-system standards, and the rest of the errors were in the control of the programmers.

These were design and execution errors, pure and simple. They were all catchable, with proper beta and load testing.

What Could Have Been Done

Test. Test. Test.

If you’re going to have a big, splashy rollout of a controversial government service that half the country is rooting against anyway, you need to test it until it’s bulletproof.

Because failures are often ambiguous from the user side, it’s hard to tell exactly where a lot of these errors originated. It’s certainly true that the data (involving as it does multiple insurance companies, with multiple plans, at different prices based on location and number of people covered) is incredibly complicated, and that some states didn’t have final price and deductible information available.

As a programmer, I can tell you with certainty that simply logging into a system shouldn’t produce an error.

And with three years and tens of millions per site at the ready, this was inexcusable.

It didn’t have to be that way. Instead of announcing October 1 as the date that Obamacare would save the world, they could have had a series of smaller rollouts, opening up various portions of the registration process at, say, monthly intervals.

In effect, ask the public to act as your beta testers. They would have lost some of the sizzle in return for a robust system that wasn’t freighted with unrealistic expectations, but right now, I think that’s a trade they would happily have made.

It’s true that it’s hard to get a real feeling for how much of the problem was data-driven, since many times we couldn’t get far enough into the site to find out. But again, it could have been rolled out in pieces, letting people browse before the law said they could buy.

All of the code would still have needed merciless QA testing and beta testing, but each section would have been solid before the next one was rolled out, and where that wasn’t possible, the potential weaknesses would have been known beforehand, making it easier to locate the launch-day failures that remained.

In the cases cited earlier, the damage was either limited, or over. People’s irritation at not being able to score World Series tickets was tempered somewhat by the fact that they were seeing their team in the World Series at all. The Romney campaign had one day to make Orca work. Once it didn’t, it was game over, and there was no payoff at all for getting it working Wednesday.

Obamacare exchanges are different. Not only are they supposed to be the tool by which tens of millions of Americans will — forever — select their health insurance, they’re a precursor to the systems that will store actual medical information for patients, insurers, hospitals, doctors, regulators.

In the end, the only good thing about these websites is that nobody’s actual health depended on their working.

This time.
