Both Cisco and Citrix experienced catastrophic failures of their online hosted web meeting services on Friday, March 8. Cisco customers lost access to WebEx (I am still trying to determine whether the outage affected all of the Event Center, Meeting Center, and Training Center versions or only a subset). Citrix customers lost access to GoToMeeting, GoToWebinar, and GoToTraining.
Based on a review of Twitter messages, WebEx appears to have been down from approximately 12:30pm to 2:15pm US Eastern time, and Citrix from approximately 11:15am to 1:15pm US Eastern time. Those are very rough estimates… So far I have found no official information, logs, or communications from either company, nor any statements about the outages other than a Friday tweet from each stating that “Service should now be restored.”
Citrix Online maintains a blog for their GoTo services. The last entry is dated March 7. Cisco maintains several blogs, including ones labeled “Collaboration”, “Cisco Support Community”, “The Platform”, and “Inside Cisco IT.” As of Sunday night, none of them mentions anything about the outage.
Looking through tweets from affected customers trying to run web meetings, it is easy to see that the primary frustration is lack of communication. Neither company appeared to have a formal communication policy for updating affected users. Neither company put any updates on its website during or after the outage. One user posted this picture captured from the WebEx website, definitely giving the wrong impression:
Others mentioned that they could not log in to the WebEx support site to communicate, nor could they get through to the telephone support number (it announced that mailboxes were full).
Several questions beg answers at this point:
- Why did two of the largest web collaboration services both die on the same day at almost the same time? Was it a targeted attack? Or was it the failure of an underlying content delivery network or other shared resource that acts as a single point of failure for multiple vendors?
- Why were failover or emergency backup procedures ineffective? Is two hours a reasonable recovery period on the web, when transactions are measured in milliseconds?
- How can the companies communicate better with affected customers in the event of a hosted service failure? Can’t they put an emergency plan in place that allows one employee to act as a public communications officer on a separate server or network? Since Twitter is independent of the companies’ internal networks, couldn’t they do a better job of putting up situation reports?
- How can the companies communicate more clearly and more quickly following a system-wide outage? Customers want reassurance that the vendor understands the impact on customer business operations, that it cares, and that it acknowledges the crash and takes responsibility for the problem. Right now it just feels like each vendor would prefer to sweep the incident under the rug and pretend it never happened. That gives us no confidence in the security and reliability of future events.
I am sure both companies were scrambling like mad on Friday to identify and solve the problems while they were occurring. But since both services were restored in the early afternoon, it would have been comforting to see something later the same day acknowledging the disruption, with assurances that people were working to make sure it didn’t happen again. We didn’t see that. And we still haven’t seen it, more than two days later. In an international, 24x7 web-connected world, weekends don’t count. Two days is two days.
Maybe we will see more information from Cisco and Citrix on Monday. I hope so. When a “normal” company’s website goes down for a while, it harms that company and inconveniences its own business. That’s bad enough. But when a web conferencing vendor crashes, it harms the businesses of many clients. That is not just inconvenient… It demands swift and clear communications and resolution. Hosted service vendors who want customers to rely on them for real-time business operations automatically assume greater responsibility for real-time communications to those customers.