This is a pattern that I have noticed time and time again with many services. Why even have a status page if it is not going to be accurate in real time? It's also not uncommon that smaller issues never get acknowledged.
Is this a factor of how Atlassian Statuspage works?
Edit: Redditstatus finally acknowledged the issue as of 04:27 PST, a good 20+ minutes after the Down Detector charts showed the spike
I don't think it's a factor in how Statuspage works. Cloudflare, for example, uses them, and usually it's pretty fast to update their status page and release outage information.
For companies that need to monitor critical dependencies, my company ( https://isdown.app ) helps by aggregating status page information with crowdsourced reports. This way, companies can be alerted way sooner than when the status page is updated.
Back when I worked at a major cloud provider (which admittedly was >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on call gets paged in to investigate and validates that the issue is real (and not trivially correctable). There was also automatic escalation if the on call doesn't acknowledge the issue after 15 minutes.
If so, a manager gets paged in to coordinate the response, and if the manager considers the outage to be serious (or to affect a key customer), a director or above gets paged in. The director/VP has the ultimate say about posting an outage, but in parallel they consult the PR/comms team on the wording/severity of the notification, any partnership managers for key affected clients, and legal regarding any contractual requirements the outage may be breaching...
So in a best-case scenario you'd have 3 minutes (for a fast alarm to raise) plus ~5 minutes for the on call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions... all before anyone with permission to edit the status page can go ahead and do so
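Summing those assumed stage delays gives a rough lower bound on how long the status page stays green; a trivial sketch (all numbers are the rough figures from the scenario above):

```python
# Best-case timeline before anyone can touch the status page
# (stage durations are the rough figures assumed above, in minutes).
stages = {
    "fast alarm fires": 3,
    "on call engages": 5,
    "initial investigation": 10,
    "escalations and discussions": 20,
}
print(f"~{sum(stages.values())} minutes, best case")  # ~38 minutes, best case
```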
Nobody cares about internal escalations, or whether a manager is getting chewed out - that's not service status, that's the internal process for dealing with the mess. It can surface as extra timestamped comments next to the service STATUS.
When you've guaranteed 4 or 5 nines worth of uptime to the customer, every acknowledged outage results in refunds (and potentially being sued over breach of contract)
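For a sense of scale, the standard per-year downtime budgets behind those guarantees are tiny; a quick calculation:

```python
# Allowed downtime per year at four and five nines of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (4, 5):
    availability = 1 - 10 ** -nines
    budget_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.3%}): ~{budget_minutes:.1f} min/year of downtime")

# 4 nines (99.990%): ~52.6 min/year of downtime
# 5 nines (99.999%): ~5.3 min/year of downtime
```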
I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too. I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
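A minimal sketch of that idea, assuming a couple of hypothetical public API endpoints and a status page backend that can render a deliberately vague banner (all names are made up):

```python
import requests

# Hypothetical public endpoints of the service's own API.
HEALTH_ENDPOINTS = [
    "https://api.example.com/health",
    "https://api.example.com/v1/me",
]

def current_banner() -> str:
    """Return a status banner based on calls to our own API."""
    failures = 0
    for url in HEALTH_ENDPOINTS:
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code >= 500:
                failures += 1
        except requests.RequestException:
            failures += 1
    if failures:
        # Deliberately vague: no scope, no RCA, just an acknowledgement.
        return "We may be experiencing issues and are looking into it."
    return "All systems operational."
```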
Early-stage startups typically have engineering own the status page, but as they grow, ownership usually transfers to customer support. These teams optimize for controlling the message rather than for technical detail, which explains the shift toward vaguer and slower incident descriptions.
I used to work at a very big cloud service provider, and as the initial comment mentioned, we'd get a ton of escalations/alerts in a day, but the majority didn't necessarily warrant a status page update (only affecting X% of users, or not 'major' enough, or not having any visible public impact).
I don't really agree with that, but that was how it was. A manager would decide whether or not to update the status page, the wording was reviewed before being posted, etc. All that takes a lot of time.
And honestly, having been on a few customer escalations where they threatened legal action over outages, one kind of starts to see things the business way...
I heard this years ago from someone, but there's material impact to a company's bottom line if those pages get updated, so that's why someone fairly senior has to usually "approve" it. Obviously it's technically trivial, but if they acknowledge downtime (for example, like in the AWS case), investors will have questions, it might make quarterly reports, and it might impact stock price.
So it's not just a "status page," it's an indicator that could affect market sentiment, so there's a lot of pressure to leave everything "green" until there's no way to avoid it.
As a company, you don’t want to declare an outage readily and you definitely don’t want it to be declared frequently. Declaring an outage frequently means:
• Telling your exec team that your department is not running well
• Negative signal to your investors
• Bad reputation with your customers
• Admitting culpability to your customers and partners (inviting lawsuits and refunds)
• Telling your engineering leadership team that your specific team isn’t running well
• Messing up your quarterly goals, bonuses, etcetera for outages that aren’t real
So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real. You want to make sure you get it right. Therefore, you don’t just want to flip a status page because a few API calls had a timeout.
I would argue that every social and incentive structure along the way basically signals that you don't want to declare an outage, even when it is real. You should still do it though or it becomes meaningless.
Great example for Goodhart's law.
I've personally challenged some details in these policies, which I won't discuss publicly. What I generally agree with is that it's important to have a human in the loop, and to be very thoughtful about when to update a status page and what is put there.
Those status pages are often linked to contractual SLAs and updating the page tangibly means money lost.
So there’s an incentive to only update it when the issue is severe and not quickly remediated.
It’s not an engineer’s tool, it’s a liability tool.
I feel that the tech industry does not have sole ownership of this powerful tool
The funny thing is reddit's status page used to have real-time graphs of things like error rate, comment backlog, visits, etc. Not with any numbers on the Y-axis, so you could only see relative changes, really, but they were still helpful to see changes before humans got around to updating the status page.
We are now regularly detecting outages long before providers acknowledge them, which is hugely beneficial to IT teams.
For this Reddit outage, we alerted 13 minutes before the official status page.
For last week's Azure outage, it was 42 minutes prior (!?!).
I have considered building something to address this and even own honeststatuspage.com to eventually host it on. But it’s a complex problem without an obviously correct answer.
It's not even just intermediate networks, it's sometimes direct connections. For example, a flood of people reporting an outage on mobile phone network X when the problem they are experiencing is not being able to call a loved one who is on phone network Y, which is the one that is actually down. This happened a little while back in the UK, leading the other phone providers to have to deny there was some broad outage (which is not an easy thing to reassure people about when so many MVNOs share network Y).
https://news.ycombinator.com/item?id=44578490
You’d think that for such a company they’d notice if global traffic for one of their important services for a given minute had dropped below 50% compared with the last hour, but apparently not.
And that’s Cloudflare, who I would expect better of than most.
edit: I should have followed your link before commenting, because this sentiment is well covered there.
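A minimal sketch of the kind of traffic-drop check described above, assuming per-minute request counts are already available (the function name and the 50% threshold are illustrative):

```python
from statistics import median

def traffic_dropped(last_hour_counts: list[int], current_minute: int,
                    threshold: float = 0.5) -> bool:
    """Alert if this minute's global traffic falls below `threshold`
    of the typical per-minute volume over the last hour."""
    baseline = median(last_hour_counts)  # robust to a few noisy minutes
    return baseline > 0 and current_minute < threshold * baseline

# ~1M requests/minute for the past hour, 300k this minute -> alert.
print(traffic_dropped([1_000_000] * 60, 300_000))  # True
```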
There are plenty of status page solutions [1] that tie in uptime monitoring with status updates, essentially providing a "if we get an alert, anyone can follow along through the status page" for near real-time updates. But it means showing _all_ users that something went wrong, when maybe only a handful noticed it in the first place.
It's a flawed tactic to try and hide/dismiss any downtime (people will notice), but it's in our human nature to try and hide the bad things?
[1] e.g. https://ohdear.app/features/status-pages
"A status page is used to communicate real-time information about a company's system health, performance, and any ongoing incidents to users. It helps reduce support tickets, improve transparency, and build trust by keeping users informed during outages or maintenance"
real-time. for multiple good reasons. reduces confusion for everyone.
You kinda answered your own question here. The intent of the status pages is to report any major issues and not every small glitch.
There are a ton of moving pieces in software and networks these days. There is no straightforward way to declare an outage via health checks, especially if declaring said outage can cost you $$ due to various SLAs. Manual verification + updates take time.
By definition, status pages should reflect reality and be real-time. It makes sense to get worked up over this, because we rely on the company's status page to be the ultimate arbiter of the reality of their servers. It's not always obvious whether the problem is on our end or theirs. Even if some other third-party checker is showing problems, that doesn't mean it's 100% accurate either.
I didn’t break Reddit by trying to access the homepage.
This third-party status page reflects the issues much better [2].
[1]: https://metastatus.com/ads-manager
[2]: https://statusgator.com/services/meta
I would add that I once had to file an FCC complaint when my internet went down. Comcast kept sending me to an AI bot; it appeared that they automated their status page and customer support phone line to the point where I didn't believe that an actual person was aware of the problem and fixing it.
Some time later, you might add an automated check thing that makes some synthetic requests to the service and validates what's returned. And maybe you wire that directly to the status page so issues can be shown as soon as possible.
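That "automated check thing" can be as simple as a scripted request plus a validation of the response; a sketch (the endpoint and expected payload are made up):

```python
import requests

def synthetic_check(url: str = "https://api.example.com/v1/ping") -> bool:
    """Make a synthetic request and validate what comes back."""
    try:
        resp = requests.get(url, timeout=5)
        body = resp.json()
        # Pass only if the service answers and the payload looks sane.
        return resp.status_code == 200 and body.get("status") == "ok"
    except (requests.RequestException, ValueError):
        return False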
Then, false alarms happen. Maybe someone forgot to rotate the credentials for the test account and it got locked out. Maybe the testing system has a bug. Maybe a change is pushed to the service that changes the output such that the test thinks the result is invalid. Maybe a localized infrastructure problem is preventing the testing system from reaching the service. There's a lot of ways for false alarms to appear, some intermittent and some persistent.
So then you spread out. You add more testing systems in diverse locations. You require some N of M tests to fail, and if that threshold is reached the status page gets updated automatically. That protects you from a few categories of false alarms, but not all of them.
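The N-of-M rule is essentially a quorum over independent vantage points before the page flips automatically; a sketch with an arbitrary threshold:

```python
def should_flip_status(results: dict[str, bool], required_failures: int = 3) -> bool:
    """`results` maps probe location -> whether the check passed.
    Only auto-update the status page when at least `required_failures`
    independent locations report a failure."""
    failures = sum(1 for passed in results.values() if not passed)
    return failures >= required_failures

# Two regions failing is not enough; three is.
print(should_flip_status({"us-east": False, "eu-west": False, "ap-south": True}))   # False
print(should_flip_status({"us-east": False, "eu-west": False, "ap-south": False}))  # True
```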
You could go further and keep whacking away at the false alarm sources, but as you go you run into the same problem as with service reliability, where each additional "9" costs much more than the one that came before. You reach a point where you realize that making your automatic status page updates fully accurate is prohibitively expensive.
So you go back to having a human assess the alarm and authorize a status page update if it is legitimate.
Engineers are working the problem. They have a pretty good understanding of the impact of the outage. Then an external comms person asks an engineer to proofread the external outage comms, which triggers rounds of "no, this part is not technically correct" and "I know the impact in terms of internal systems, but not how that maps to the external product names you want to communicate".
Sure, it'd be nice if the message "we are investigating an issue with… uh… some products" would come up faster.
> I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
Ah, so you're saying the status page should be hooked up to internal monitoring probers?
So how sure are you that it's the service that's broken, and not the probers? How sure are you that the granularity of the probers reflects the actual scope of the outage?
Also, this opens the door to "well, why don't you have probing on the EXACT workflow that happened to break this time?!". Because honestly, that's not helpful.
Say you have a complete end-to-end workflow for your web store. Should you publish "100% outage, the webstore is down!!" on your status page, automatically, because the very diligent prober failed to get into the shoe section of your store? That's probably not helpful to anybody.
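Even with probers wired straight in, you would still need some mapping from which probes failed to what (if anything) gets published; a hypothetical sketch:

```python
# Hypothetical mapping from failed probes to what the status page shows.
CRITICAL_PROBES = {"checkout", "login"}      # worth an automatic banner
MINOR_PROBES = {"shoe_section", "wishlist"}  # not "the webstore is down"

def published_status(failed_probes: set[str]) -> str:
    if failed_probes & CRITICAL_PROBES:
        return "major outage"
    # A failed shoe-section probe alone should not flip the page.
    return "all systems operational"

print(published_status({"shoe_section"}))  # all systems operational
print(published_status({"checkout"}))      # major outage
```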
> Clearly these metrics and alerts exist internally too.
Well, no. Probers can never cover every dimension across which a service can have an outage. You may think that the service is simple and has an obvious status, but you're using like 0.1% of the user surface, and have never even heard of the weird things that 99% of actual traffic does.
How do you even model your minority use case? Is it an outage? Or is your workflow maybe a tiny weird one, even though you think it's the straightforward one?
Especially since the nature of outages in complex systems tends to be complex to describe accurately. And a status page needs to boil it down to something simple.
In many cases even engineers inspecting the system cannot always be certain whether real users are experiencing an outage, or they're chasing an internal user, or nothing is user-visible because internal retries are taking care of everything, or what.
Complex systems are often complex because the world is complex. And if the problem is simple and unevolving then there would be no reason to have outages in the first place.
And often engineers helping phrase an outage statement need to trade detail for clarity.
Another thing is what do you do if you start serving 500s to 90% of traffic? An outage, right? Surely auto-publish to a status page? Oh, but it turns out this was a DoS attack, and no non-DoS traffic was affected. Can your monitoring detect the difference? Unlikely.