Yesterday I hosted a client event using WebEx Event Center. We had multiple presenters and I made sure that everyone was ready to log in and call in 30 minutes ahead of the scheduled start time, per standard best practices. I clicked through on the link to get to the client's branded event site and received an error message. I tried entering the URL manually. No luck. The browser reported errors as if there were no such web page.
I called WebEx tech support and spent more than ten minutes on hold, being repeatedly told that my call was very important. The tech support guy at first said that he wasn't aware of any problems, but then went off and did another check. More time went by while I sat with a panicked client on one phone line in one ear and tech support hold music in the other.
My tech support guy came back and verified that there was indeed a server problem with both primary and backup servers. The only workaround he could suggest was to have him start an ad hoc meeting in a working demo account on another server and then to try to contact all our meeting participants and tell them how to access the new meeting. I didn't think that was practical, given the short time remaining before our scheduled session.
We hung around on our meeting call-in line and told the participants as they arrived that there were technical problems and we were waiting till the last minute before canceling the session. Amazingly, our server came back online at one minute before the hour and I was able to upload the presentation and start up the event properly. We didn't suffer any lost productivity.
I talked to Colin Smith, the Director of Corporate Communications at WebEx, to get their statement on what had happened. He sent me a note saying that "3 of the 40 event session clusters switched to backup sites. The switch happened instantly and in-process sessions were not impacted. In some cases, the switch could have delayed people joining scheduled sessions or looking at event reports for 10 to 20 minutes. Refreshing the browser would have solved the problem. The MediaTone Network's automatic routing systems worked according to design and limited the impact of the cluster failover."
That didn't satisfy me, so I asked for more info. My clients were out of commission for 30 minutes and refreshing the browser did nothing to alleviate the problem. The site was completely inaccessible. Colin got some more information from his engineering team and told me that when their network switches to a Global Site Backup and back to Primary, there are DNS changes involved. It takes time for those changes to propagate so that every system resolves the site's network address properly, which affects people trying to browse a WebEx site during the switchover. Again, he pointed out that customers who were in meetings at the time were switched over without losing their connections or having to reenter the meeting.
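For anyone curious what that DNS lag looks like from the customer's side, here is a minimal Python sketch that keeps asking the local resolver for a site's address until the answer differs from the last known one. It is purely illustrative and is not anything WebEx provided: the function names, the polling interval, and the hostname `events.example.com` are my own assumptions, and operating system or browser caches can hold a stale answer longer than this would suggest.

```python
import socket
import time
from typing import Callable, Iterable, Optional, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver reports for hostname."""
    try:
        infos = socket.getaddrinfo(
            hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # The name did not resolve at all right now.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return {info[4][0] for info in infos}


def wait_for_dns_change(
    hostname: str,
    old_addresses: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
    interval_seconds: float = 30.0,
    max_attempts: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Set[str]]:
    """Poll until hostname resolves to something other than old_addresses.

    Returns the new address set, or None if nothing changed within
    max_attempts lookups. (Hypothetical helper for illustration only.)
    """
    old = set(old_addresses)
    for attempt in range(max_attempts):
        current = resolver(hostname)
        # An empty result means "not resolvable", which is not a usable change.
        if current and current != old:
            return current
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return None


if __name__ == "__main__":
    # Placeholder hostname and address, not a real WebEx site.
    new = wait_for_dns_change("events.example.com", {"192.0.2.10"}, max_attempts=3)
    print("new addresses:", new)
```

The resolver and sleep function are passed in as parameters only so the loop can be tried out without touching the network. A check like this tells you when your own machine sees the switch, not when your attendees' machines will.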
Despite the tone of things so far, I am not writing a diatribe against WebEx (aside from their tech support hold recordings!). They acted quickly to automatically recover from a catastrophic failure somewhere in their network. The majority of their customers in active meetings were able to continue business without interruption. Our interrupted access to the site was for a total of 30 minutes, which is a short time in the world of network failure recovery. It is doubtful that most corporate IT sites would have been able to do backup failover and recovery any faster.
And yet. There is a psychological helplessness that sets in when your hosted service provider suddenly stops working. You can't get an insider's view of what has gone wrong, who and how many people are working to fix it, whether it is truly a priority to them, and what the current status is in terms of repair operations. You just sit around and cross your fingers, no matter how powerful or connected you might be within your own organization.
This is one of the things that makes some enterprise organizations leery of relying on Software as a Service (SaaS) for key business operations. Your IT team might not be able to fix things any faster than a third party provider, but at least you know who to call up and yell at. Or plead with. Or fire. There is a sense that you might be more comfortable with problems you can identify yourself and are in charge of fixing than with a resolution that happens outside your control.
There is an inescapable tradeoff in deciding on your preferred application infrastructure. Nobody is immune from technical failures. If there is a service provider out there that has never had a customer-perceptible failure, I want to meet them! And Mr. Murphy is just as ubiquitous in corporate environments with his list of things that can go wrong - and will. You either decide to take on the burden of managing everything yourself for the added control and information that gives you, or you cede that control to a provider for the reductions in your own overhead.
In WebEx's case, I'll say one thing for them... There is a serious incentive to get problems fixed quickly that comes from the top down. Colin told me that the executive team has bonuses directly tied to service uptime on their system. When something goes wrong, you can bet that it gets attention across the organization. It looks like a few people will be getting a slightly smaller bonus check this quarter! :)