Article series: Is 100% uptime possible? (Part 1)

In this series of articles, compiled from material collected around the web, we look at whether 100% uptime is actually achievable.

“Do you guys have the most uptime in the industry?”

Questions about uptime are some of the most commonly asked when folks evaluate WP Engine’s managed WordPress hosting platform, or any hosting solution for that matter. Nobody wants downtime, and customers who are paying for a hosting solution have a right to ask and be informed about a company’s history of downtime.

In particular, people want to know who has the least downtime, or whether any company has achieved 100% uptime. The reality is that 100% uptime, while the goal every company sets its sights on, is an unattainable perfection.

In the past 10 days, four well-known WordPress hosting providers all had similar amounts of downtime, all for different reasons, and all in different datacenters. A rash of downtime across four major WordPress hosting providers inside the same two-week period is uncommon, but it indirectly helps answer the question, “How does your uptime stack up against [insert other hosting provider]?”

Here’s what happened:

What can we learn from these situations?

Uptime is never 100%. A world of factors conspires against 100% uptime, any of which can disrupt the flow of bits from the server to your browser. But despite all those factors, most hosting companies stay at or above 99.9% uptime.

There isn’t a single hosting provider with 100% uptime. Amazon AWS, one of the most robust operations, is (famously) not 100%. GMail isn’t. Facebook isn’t. Twitter definitely isn’t. Rackspace isn’t. ServerBeach isn’t. FireHost isn’t. We could keep naming folks, but no hosting provider, WP Engine included, has achieved 100% uptime over any meaningful time-scale (like years).

Are all these companies “stupid”? Is each of them unable to hire top systems-engineering talent? TechCrunch had some choice things to say about this in their post after 15 minutes of downtime (referenced above). Since none of these industry leaders are at 100% uptime, does that mean they don’t care?

Of course not.

As we mentioned, most of these companies maintain an over-99% uptime rate. They often reach 99.99% uptime, and sometimes a bit more.

So what’s the difference, in terms of cost and technical complexity, between 99% and 99.9% uptime? What about the difference between 99.9% and 99.99%?

First off, 99.9% uptime sounds like a lot, but it’s nearly nine hours of downtime per year.

Can you imagine how you’d react if you had nine straight hours of downtime on your site? Not well.

99.99% uptime is still a non-trivial 53 minutes or so of downtime per year.

Every “9” you add to uptime (e.g. 99%, 99.9%, 99.99%) cuts downtime by an order of magnitude, and it’s often a multiple more complex and expensive to deliver. At some point, trying to eliminate a few minutes of downtime now and then means doubling or tripling the cost of the service.
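To make those figures concrete, here is a minimal Python sketch of the arithmetic (plain calendar math, not any provider’s measured numbers):

```python
# Convert an uptime percentage into the downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(uptime_pct: float) -> float:
    """Minutes of downtime per year implied by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for uptime in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(uptime)
    print(f"{uptime:>7}% uptime -> {minutes / 60:7.2f} hours "
          f"({minutes:8.2f} min) of downtime per year")
```

Each added nine divides the allowed downtime by ten (87.6 hours, 8.76 hours, 52.6 minutes, 5.3 minutes), which is exactly why each one is disproportionately harder to deliver.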

But let’s break that cost down for a moment.

For example, to guard against hardware and software failures on a single server, you can run a cluster of servers that can take over for one another. Running multiple servers instead of a single one multiplies the cost by the number of servers. Nearly any hosting company will have some quantity of redundant servers, but some providers oversubscribe, packing in more customers than their spare capacity can absorb. When one server goes down, if the remaining servers don’t have enough capacity for 100% of the traffic, the cluster still goes down, despite the precaution.
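That capacity check is simple arithmetic, sketched below in Python; the server counts and traffic units are hypothetical, chosen only to illustrate the point:

```python
def survives_single_failure(num_servers: int,
                            capacity_per_server: float,
                            total_load: float) -> bool:
    """True if the servers remaining after one failure can still
    carry 100% of the traffic."""
    remaining_capacity = (num_servers - 1) * capacity_per_server
    return remaining_capacity >= total_load

# Hypothetical cluster: 3 servers, each able to serve 100 units of traffic.
print(survives_single_failure(3, 100, 180))  # True:  200 remaining >= 180 load
print(survives_single_failure(3, 100, 250))  # False: oversubscribed, so losing
                                             # one server takes the cluster down
```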

But almost all the examples above were data-center failures rather than single-server failures. To combat those, you need servers in entirely different data centers, once again with sufficient capacity to handle 100% of the traffic alone, which means another 2x on top of a cost that was already 2-3x.

Avoiding all the issues above is at least 6x more expensive in hardware alone (2-3x for the cluster, doubled again across data centers), not to mention significantly more human and administrative effort. Plus, anytime you add more components as a redundant measure to prevent downtime, then ironically enough, each additional component increases the likelihood that one of your system’s components will have trouble at any given time.

In order to add redundant measures as a hosting provider, you have to add infrastructure. More infrastructure means more complexity, and adds more potential for trouble, which you then need to take steps to mitigate!
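A toy probability model shows that irony in numbers. Assuming, purely for illustration, independent components that are each up 99.9% of the time, the odds that everything is healthy at once shrink as the component count grows:

```python
# Chance that ALL components are healthy at once, assuming each one is
# independently up 99.9% of the time (an illustrative assumption, not a
# measured availability figure for any real system).
P_COMPONENT_UP = 0.999

for n in (1, 10, 50, 100):
    p_all_up = P_COMPONENT_UP ** n
    print(f"{n:3d} components -> {p_all_up:.3f} chance all are up, "
          f"{1 - p_all_up:.1%} chance something needs attention")
```

With 100 such components there is roughly a 10% chance that something, somewhere, is ailing at any given moment. Redundancy means an ailing component doesn’t have to mean downtime, but it does mean constant firefighting, which is the operational cost described above.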

So what does all this mean? That we shouldn’t try? That we should just say “it’s hard, too bad” when things fail? That we shouldn’t continue investing in infrastructure, technology, and techniques that our customers individually could never afford to pull off by themselves? Of course not; in fact, that’s exactly what WP Engine, and all the members of the hosting industry listed here and elsewhere, do. We’re always shooting for 100% uptime, and we go into battle mode at the slightest blip of downtime.

It’s our responsibility to hold ourselves to ever-higher standards.

That’s why only 1% of our customers had trouble the other day, not the other 99%. That’s pretty good! But we immediately began working to bring everyone back online, and then to make improvements for the next time. If we don’t continuously improve, that 99%+ could slip. And next time it would be awesome if only 0.1% of our customers had trouble. The bar must always be moved higher.

But perfection is unattainable, for WP Engine and for everyone else hosting WordPress, or anything else, on the Internet. We can ask better questions than, “Which hosting provider has perfect, 100% uptime?”

Instead, we can ask:

Those are the questions that matter the most when you evaluate a particular hosting platform and compare it to another. Everyone aims to have as little downtime as possible, and each of the previous questions gets at an answer to the bigger one: “What are you doing to make sure you can mitigate this issue with zero downtime next time?”

How the company chooses to answer that question goes a long way toward letting you know that your websites, and therefore your business, are indeed in good hands.

For another perspective on this, Uri Budnik wrote a detailed post on the RightScale Blog titled “Lessons Learned from Recent Cloud Outages.”