
Cloudflare CEO details what led to the massive global outage

The company has detailed what happened in a new blog post.
By Matthews Martins
Credit: Algi Febri Sugita / SOPA Images / LightRocket via Getty Images


101 Comments


  1. Quite a crazy turn of events; glad to see things are back to normal, however. The very thing that was designed to protect ended up causing Cloudflare the damage.

    ReplyDelete
  2. So the file that managed threat traffic was larger than expected but it wasn't caused by any threats. 🤔

    ReplyDelete
  3. Reading between the lines... cyber attack. To be admitted in 3 years' time.

    ReplyDelete
  4. World is moving
    New Era
    By wise

    ReplyDelete
  5. I almost forgot Mashable existed. 😂. Mashable really fell that hard.

    ReplyDelete
  6. At "This is what happened" article is missing what is actually happened.
    Please fix.

    ReplyDelete
  7. Someone pushed an update in prod.
    Other explanations are excuses and futile...

    ReplyDelete
  8. https://media1.tenor.co/m/ikEP2PJ787AAAAAC/trump-everythings-fine.gif?

    ReplyDelete
  9. Most recommended, always makes me want to come back to 𝐃𝐄𝐋𝐈𝐂𝐔𝐀𝐍 again and again 🤩

    ReplyDelete
  10. Super duper platform
    https://media1.tenor.co/m/giEWYOz5WgQAAAAC/best-sis-ever-sister-love.gif?

    ReplyDelete
  11. AI Servers, it's happening .. 😈
    Go check the comments : 👇
    https://youtu.be/hoxBaTSPwLM?si=lYg4XCJffH-8iTpd

    ReplyDelete
  12. Bullshit
    This is due to bugs in the Rust language

    ReplyDelete
  13. It was a damn DDoS. Deny it all you want, but your ass was wide open. Secure your servers better, you morons

    ReplyDelete
  14. Who, me?
    Whups

    -EFBIG

    ReplyDelete
    Replies
    1. Someone did mention a possible "Who, me?" in the original article... it would make one of those epic columns we all enjoy. Hopefully The Reg will be able to pull that one out of their hat :) Kudos to the admin for really screwing up badly, and please contact El Reg :D

      Delete
  15. Something missed in testing....

    ReplyDelete
  16. I guess they test in production then..

    ReplyDelete
    Replies
    1. Everyone has a testing environment, some lucky people have a separate production environment as well.

      Just got to hope that any future test env doesn't use a smaller DB than prod.

      Delete
    2. This comment has been removed by a blog administrator.

      Delete
    3. Or their test environment isn't sufficiently realistic. If it only produces a handful of entries in those database files each day, then a query doubling them will hardly be noticeable.
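
      Roughly the kind of thing a prod-scale fixture would have caught, as a sketch only (the loader, the names and the 200 limit below are stand-ins, not Cloudflare's actual code):

        // Hypothetical sketch: drive a feature loader with prod-scale input so the
        // "query suddenly returns twice as many rows" case shows up in CI, not in prod.
        const MAX_FEATURES: usize = 200; // assumed cap, mirroring the reported limit

        fn load_features(rows: &[String]) -> Result<Vec<String>, String> {
            if rows.len() > MAX_FEATURES {
                return Err(format!("feature file too large: {} > {}", rows.len(), MAX_FEATURES));
            }
            Ok(rows.to_vec())
        }

        #[test]
        fn survives_doubled_feature_rows() {
            // Simulate the duplicated query output: twice the expected row count.
            let rows: Vec<String> = (0..2 * MAX_FEATURES).map(|i| format!("feature_{i}")).collect();
            // The loader must hand back an error, not panic the proxy.
            assert!(load_features(&rows).is_err());
        }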

      Delete
    4. *live scenes from Cloudflare testing*

      "Dear AI, can you test this for me to make sure it works?"

      "What a great idea! You are such a clever person for asking me this. Not a lot of people would think about testing something as extravagant as this so let me not let you down and test it for you. One moment fish...."

      "... fish?"

      "Ahh I see the problem! I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring..."

      "stop stop stop"

      "I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring, I am a herring..."

      *looks at the clock*

      "Fuck it my shift is over I'll pick it up in the morning"

      *next morning*

      "Yawn - I guess ######### did the testing, looks like it ran without a problem, git push origin master"

      Delete
    5. "Hmm, taking a while to push.. Oh well, time for coffee"

      * phones ring; alarms blare *

      OMG, it's broken!

      ChatGPT, fix it!

      ...ChatGPT?

      Oh no!

      Delete
    6. Or maybe it was 'vibe coding' done not quite right? After all Saint Linus of Torvalds approves, sort of:

      https://www.theregister.com/2025/11/18/linus_torvalds_vibe_coding/?td=rt-3a

      Oh, but then I read the article:

      "Linux and Git inventor Linus Torvalds discussed AI in software development in an interview earlier this month, describing himself as "fairly positive" about vibe coding, but as a way into computing, not for production coding where it would likely be horrible to maintain." (My emphasis.)

      Oh well, as you were.

      Delete
    7. "horrible to maintain."

      Well, exactly- I was saying this to someone at work recently.

      You might be able to get an AI to generate something from scratch, but for anything serious, you're going to have to maintain and update it.

      Which is a completely different kettle of fish, and something that- as far as I'm aware- most current gen AI systems aren't going to manage reliably.

      So either the "programmers" who wrote the original prompt are going to have to maintain code they didn't write and possibly don't have the skill to understand.

      Or the company is going to have to bring in real programmers which will cost them more, especially as most of them will say "fuck, no!" to having to maintain confusing, auto-generated code unless they're paid through the nose to make it worth their time.

      Which will either be an ongoing cost to maintain, or they rewrite it, in which case it will likely cost them as much or more than doing it by hand.

      But the type of company using cheap prompt engineers to create gen AI code won't- and possibly can't- pay that sort of money in the first place.

      So, yeah.

      Delete
    8. I think the missing thing in testing is the testing.

      Delete
  17. The picture in my mind …

    a teenage intern with copies of SQL for Dummies and The Complete Idiot's Guide to SQL open at the GRANT pages. ;)

    ReplyDelete
    Replies
    1. And ChatGPT open on their laptop...

      Delete
    2. I know you're joking, but it's about a lack of experience, which can happen at any age, especially if someone is poorly trained and given access to important stuff!

      Delete
    3. No. I don't care WHAT system I'm on. If I'm asked to push untested code to prod, I'm sending a very clear email in protest. THEN I'm going to go figure out what "testing" means in the new environment while waiting on an email reply.

      I've earned these white hairs...

      Delete
  18. Data model assumption gone bad

    Sounds like the classic referential integrity mishap leading to more query results than expected

    ReplyDelete
    Replies
    1. Or, since ClickHouse has row/column-level security, the change allowed some badly written queries to return more columns and data than expected.
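
      If that's the failure mode, the fix is boring: pin the metadata query to one database and dedupe defensively before generating anything from it. A rough sketch only; the table and column names are for illustration and may not match Cloudflare's actual schema:

        // Unpinned: making a second database visible to the account silently
        // doubles the rows this returns.
        const NAIVE: &str =
            "SELECT name, type FROM system.columns WHERE table = 'http_requests_features'";
        // Pinned to one database, so a permissions change can't widen the result set.
        const PINNED: &str =
            "SELECT name, type FROM system.columns \
             WHERE database = 'default' AND table = 'http_requests_features'";

        // Belt and braces: drop duplicates before building the feature file.
        fn dedupe(names: Vec<String>) -> Vec<String> {
            let mut seen = std::collections::BTreeSet::new();
            names.into_iter().filter(|n| seen.insert(n.clone())).collect()
        }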

      Delete
  19. The irony.

    mfw DDoS prevention vendor DoS-es itself.

    ReplyDelete
  20. Who watches the global kill switches?

    As a consummate pessimist, one might eventually be concerned that all these additional "global kill switches", which future "important" code updates might require, will themselves become the next source of outages if they turn out to be as easy to trigger as an actual fire alarm.

    Complacency breeds.... something?

    ReplyDelete
    Replies
    1. Recall what I pointed out recently re one aspect of the systemic risk being introduced by forced infestation of core tools by Rust code written by Rust Religious Zealots. (https://forums.theregister.com/forum/all/2025/11/12/asio_cyber_sabotage_warnings/#c_5178787)

      ----

      Yes, Cloudflare appears infested with the ideology: they recently proudly announced they'd switched Production to "more secure" Rust code. (https://forums.theregister.com/forum/all/2025/11/18/cloudflare_outage/#c_5182286)

      Delete
    2. So, it was Rust database permissions and a Rust database query then.

      Delete
    3. The underlying error - hit when the file discussed became much larger - was in Rust code, actually:

      "The FL2 Rust code that makes the check and was the source of the unhandled error is shown below:"

      A quick Google suggests "FL2" is their Rust-based replacement for "FL", which is written in... PHP.

      This isn't intended to be a Rust vs PHP comment, though. When you move a large, complex piece of software from X to Y, you are going to introduce (or reintroduce) bugs unless you have a thorough testing regime.

      Delete
    4. It was an explicit, arbitrary size limit, set fairly low, within the Rust context, so exceeding it was very much a Known State AND likely to occur in practice, so it necessarily needed to be handled. Ideally, like a grownup. But instead the only downstream flag of the breached limit was a null. A silent technical state, out of band from the code. So that's a coding deficiency right there, and then a coding error in (not) handling it.

      The fact that what happened to trip the limit _this_ time was a bug in database permissions and/or query is utterly irrelevant as to what the source of the Rust error was.

      The ultimate human source of that sort of error is the literally insane belief that Rust is a Magic Wand Of Virtue. It is not. It is a tool. With various advantages & disadvantages vs others in various circumstances.

      It is the seizing on of Rust by the type of people who desperately chase group domination by shrieking their superior virtue, which is the reason for Rust now being a yellow-to-red flag for systemic-level risks.

      E.g., CloudFlare.

      Delete
    5. One observation - it's only people outside of the rust ecosystem who seem to think people inside the rust ecosystem believe it is magic.

      Had this code been written in C, the out-of-bounds error could easily have resulted in undefined behaviour. As it was, the assignment of a value to the 201st element of a 200-element vector produced an error, rather than blind adherence to pointer arithmetic.

      The error was unhandled; it was an obvious error and easy to catch.
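
      In sketch form (every name below is invented, not the actual FL2 code), the difference between the two worlds looks something like this:

        // A hard cap on appends surfaces as an Err in Rust, where C pointer
        // arithmetic could have quietly scribbled past the end of the buffer.
        fn append_feature(buf: &mut Vec<u64>, value: u64, cap: usize) -> Result<(), &'static str> {
            if buf.len() >= cap {
                return Err("too many features"); // a known, expected state
            }
            buf.push(value);
            Ok(())
        }

        fn main() {
            let mut features = Vec::new();
            for i in 0..300u64 {
                match append_feature(&mut features, i, 200) {
                    Ok(()) => {}
                    // The grown-up option: log it, truncate, keep serving traffic.
                    Err(e) => { eprintln!("config oversized ({e}); truncating"); break; }
                }
                // The outage option, in spirit: append_feature(&mut features, i, 200).unwrap();
            }
        }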

      Delete
    6. But how, in 2025, do we have configurations of indeterminate and flexible length being stored in a static, fixed-size buffer?

      I mean, OK, thanks to Rust it's not going to result in a security problem but neither is it going to fail gracefully.

      Delete
    7. It is a source of minor anthropological fascination, seeing the modern commentards completely rewrite anything to suit a position.

      >>As it was the assignment of a value to the 201st element of a 200 element long vector

      I mean, this is just pure fiction. It was attempting to load the entire Config. In one bite. Not iterating elements through a loaded Config.

      >configurations of indeterminate and flexible length

      It was a known, exogenously determined, internally specified, fixed-max length. It was an explicit performance-focussed constraint: prevent CPU hogs. No "indeterminate and flexible": a hard upper limit specified in and by the code.

      Summary: no "reading past element N", no "flexible".

      By the bye, it appears over 20% of GitHub Rust code that has nothing to do with test, uses unwrap.

      Delete
  21. Some people are already excusing them based on their size/cost/uptime records, but none of those excuse a system that could allow this to happen, however much an employee screws up.

    After all, next time it could be a hacker or malicious state actor. There's a reason to make redundancy inherent in any design.

    Another example is that software for Windows a year ago that managed to nuke Windows installs worldwide. You can bet they've tightened procedures and are more careful with testing etc., but I bet, architecturally, their system hasn't changed.

    ReplyDelete
    Replies

    1. I think the report is detailed, self-critical and plausible. As have other reports when they've had problems in the past.

      From the description, the problem could stem from a race condition that probably wouldn't have happened in testing, or not unless you were specifically looking for it. That is always going to be possible, which is why you want the option of stop and rollback of everything, if you can't quickly figure out the problem. I think Google came to the same conclusion several years ago.

      Crowdstrike is what you're referring to and they were nothing like as quick or open with the fixes, while the downtime caused much greater problems. I assume board bonuses weren't affected as many customers had no choice but to live with it. It's much easier to switch CDNs.

      Delete

    2. The closest you can get to ensuring you have caught race conditions in software is thorough review by someone with the expertise to do it. Of course, this is by someone other than the person who wrote the code.

      In HW, you can guarantee things like how long an instruction takes to execute. Except that designers, being human, have been known to forget certain edge cases, with the usual result.

      Delete

    3. I seem to remember something about software now being so complex that it's impossible to find all bugs by review, which is one of the reasons behind fuzzing.

      Anyway, when something like this happens, it's important to have the kind of "engineering" culture where fessing up to having been the cause of it does not come with fear of being sacked. And I like Prince's decision to report on the incident and not to apportion blame to whichever PFY was ultimately responsible. Of course, we don't know if that is the case at CloudFlare.

      Delete
    4. quote: You can bet they've tightened procedures, and are more careful with testing etc.

      At Microsoft? You jest.

      If software ever generates out of parameter data, a klaxon should sound, a detailed description should pop up on screen, and a prepared default workaround (zeroing a variable, clearing a field, whatever) should kick in whilst the staff look for the flaw. A DDoS attack, no matter how big, should produce known behaviours on such a system. So any unexpected behaviour shouldn't be considered to be a DDoS attack. Well written software should not go wrong, because there should be a reliable error detection response built in.

      Delete
    5. You do know that the "they" in the sentence you quoted was talking about CrowdStrike, not Microsoft, who weren't the they in any sentences in that comment. And that, while CloudFlare does handle lots of DDOS attacks, that's not related to what they were doing this time? Your comments are not making much sense in context, and devoid of that context appear to simplify to "Programs should just never have errors" which is a very nice option if you can make it happen.

      Delete
    6. “…You can bet they've tightened procedures, and are more careful with testing etc…”

      Hahahahaha…. I have to admire your optimism

      Delete
  22. Untested change

    It sounds like a very simple untested change was made.

    Which is incredible. No basic change control?

    ReplyDelete
    Replies
    1. What is this quality control you speak of? Sounds kinda like librul commie rules if you ask me!

      Delete
    2. Or it's plenty of control by people who have no idea what they are controlling, resulting in corresponding control efficiency.

      Delete
    3. > It sounds like a very simple untested change was made.

      Which is incredible. No basic change control?

      Change control does not prevent errors from happening.

      It is intended to make them less likely to happen.

      Delete
  23. stabilized in the failing state

    If I were still managing IT systems, I would remember this phrase. Stable systems are the gold standard.

    ReplyDelete
    Replies
    1. If you can't be stable, you can at least be consistent...

      Therefore, surely better to be consistently failing, than intermittently working?

      Delete
    2. All humour aside, intermittent faults are the worst kind to deal with

      I identified a problem on one site that had been causing intermittents for 19 YEARS. The worst part was that it wasn't even technical or in our equipment, but a mislabelled power outlet which should have been on the essential power but was actually on non-essential and got switched off at night/weekends. Lead acid float battery systems don't like repeated deep discharge and (of course) whenever techs showed up during working hours there was no fault found

      The lesson is to get as much information as possible from the enduser and DO NOT let your ticket handlers "distill" it down. I have a pet hate for people who write something completely different to what the user told them and I'd quite happily roast their nuts over an open fire for doing so

      Delete
    3. Absolutely, intermittent is a bitch to resolve. But I was commenting more on the management speak. Our systems aren't completely screwed, we've stabilized them so none of our customers are getting what they paid for.

      Delete
    4. So so agree with the 'distillation' gripe.

      This is a common problem because Service Desk staff are under pressure to close so many calls per hour that they cannot afford the time to 'slowly' document the problem as the user gradually explains it. 'Mental leaps' are made based on experience and what you THINK the user means, BUT these 'mental leaps' can be wrong ... we have all had calls where you start going down a 'rabbit hole' to solve the problem, when you suddenly realise you took a 'wrong' turn. Right at the start of the process you thought you understood an issue and made a 'mental leap' forward on that basis which turns out to be wrong.

      The temptation is to listen to the problem and 'interpret' the information into what you think the user is saying and to 'guide' the reader of the 'ticket' in the right direction you think the problem is pointing to.

      This is why you should record calls to ensure that what the caller says is not misunderstood.

      This is where your training and focus on detail should 'kick in', you should be clarifying the things that are ambiguous in the call and ensuring that there is not a degree of 'filling in the gaps' going on where you are assuming facts/details that appear to be obvious BUT were never actually stated by the caller.

      Working on Service Desks is very hard work and very stressful.

      The primary skill is to be able to listen ... very carefully !!!

      Often the caller/user gives more information than they think they are doing, if you listen carefully ... particularly useful when they are tempted to be 'economical with the truth' to cover a misuse/abuse of kit etc.

      :)

      Delete
    5. The service desk "interpreting" a customer's fault report is also a big problem in vehicle repair.

      Delete
  24. I was visiting about 10 different websites at the time this all started and saw varied CloudFlare errors indicating that my browser was OK, the CloudFlare server had failed, and the destination server was OK...

    The message suggested that the problem was NOT with CloudFlare's systems but most probably at my end...

    For the first ten minutes or so the https://www.cloudflarestatus.com/ site indicated no problems...

    After https://www.cloudflarestatus.com/ indicated a problem I noted that the messages when accessing affected sites started saying that there was actually a problem with CloudFlare's systems...

    Most hilarious was when I did a search for "is cloudflare down" and found that 19 out of 20 sites I visited failed due to the CloudFlare outage...

    ReplyDelete
    Replies
    1. ... only for DownDetector to fail with a Cloudflare error message.

      Delete
    2. A lot of websites returned a message like "Please unblock challenges.cloudflare.com". I find it funny that when you're unable to connect to cloudflare, they just assume it's your fault. Cloudflare never goes down, right?

      Delete
  25. I remember an SAP gig
    I worked on that was led by a highly paranoid guy that worried about stuff like this. I was dealing with storage and wasn't involved on the deployment end, but I'd shoot the shit with the guys who were. They had a number of SAP application servers but when they made a change in production they applied it to ONLY the odd numbered ones first, then they'd later apply it to the even numbered ones.

    The logic was that if they encountered some type of intermittent error related to the change they could track which application server the user's client was connected to and figure out if it was one of the ones with the change, and see if forcing their client to reconnect to an even numbered application server resolved their issues.

    Because they'd massively overprovisioned the number of application servers to deal with huge end of quarter / end of year loads if they ran into something like this they could force everyone to an application server without the change and reverse it on the odd ones. If there were no issues they could push it to the even servers. Obviously they were in a change freeze during EOQ/EOY so they always had that slack capacity to work with, so the project leader used it in a smart way. I assume he'd learned the hard way at some point that no matter how much testing you do, sometimes when you push a change into production it doesn't work the way you expect.

    They had similar strategies like this for the database too apparently, though I have no idea how that worked.
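
    For what it's worth, the odd/even trick is easy to sketch (everything below is invented, not how that SAP shop actually did it): apply the change to half the fleet, watch, and only then touch the other half.

      fn split_fleet<'a>(servers: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
          // Odd-numbered hosts first (app01, app03, ...), even-numbered held back.
          let odd = servers.iter().copied().step_by(2).collect();
          let even = servers.iter().copied().skip(1).step_by(2).collect();
          (odd, even)
      }

      fn main() {
          let fleet = ["app01", "app02", "app03", "app04", "app05", "app06"];
          let (first_wave, second_wave) = split_fleet(&fleet);
          println!("apply change to {:?}", first_wave);
          // ... monitor; if errors appear, steer users back to the untouched half ...
          println!("then, if healthy, apply to {:?}", second_wave);
      }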

    ReplyDelete
    Replies
    1. we just made a big red "rollback" button

      at the slightest sign of new errors after go-live, the button is pressed, which backs out the code and undoes the migrations; then you investigate at your leisure, because no one wants to investigate while the fires are still burning. No easy rollback strategy = no go-live without a big risk assessment.

      and that's AFTER a fucking massive set of CI tests

      Delete
    2. What he described sounds like a rolling deploy. For their particular case, it was deemed better to have a failing bird in hand to examine than not.

      Certainly, automatic rollbacks are the right solution for many situations, manual ones for others, but if the business is happy with a 50% rollout that allows for live troubleshooting, I'm not ready to pronounce the solution "wrong".

      Delete
    3. The way it worked they could block new connections to the updated servers, and force people already connected to reconnect to non-updated ones. They had a list of "power users" they'd go to for help in figuring out why something their testing didn't catch is hitting in the field.

      AFAIK this only happened once while I was there, or at least only once I heard of. If it had never been triggered it is unlikely I would have known about it since I wasn't involved on that end of the project at all. Just found out from "water cooler" discussions lol

      Delete
  26. What is missing from the list:

    - Remove SELECT * queries....
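
    As an illustration only (the table and column names are made up): the explicit list is the one that fails loudly when the schema grows, instead of silently dragging new columns into whatever gets generated downstream.

      // SELECT * inherits every future schema change, wanted or not.
      const RISKY: &str = "SELECT * FROM bot_features";
      // An explicit column list only ever returns what the consumer was built for,
      // and breaks visibly (rather than silently growing) if a column is renamed.
      const EXPLICIT: &str =
          "SELECT feature_name, feature_type FROM bot_features ORDER BY feature_name";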

    ReplyDelete
  27. Something goes wrong and the reaction is "We're under attack" and not "Oops, was that me?" from someone who just changed something.

    ReplyDelete
    Replies
    1. I have a little sympathy because a) they are always under attack, that is kind of their whole deal, and b) they must have enough people that even if the left hand does know what the right hand is doing, neither of them know about the 200 other hands also poking different parts of the system. An operation at this size is not, and cannot be made, simple and in a way it's impressive that this kind of thing doesn't happen more often, although I guess if it did the impact would be much less because nobody would use them.

      Also a big central outage like this is a bit of a dog-in-the-playground day for those of us who work on the internet. I was particularly tickled that DownDetector wasn't accessible.

      Delete
    2. I wondered about this too. I wonder how many core config changes they are making per day, maybe there are so many this was lost in the noise.

      When I ran a comparatively tiny cloud service, we used to monitor pretty carefully after deploying any production changes, I'd hope Cloudflare would be even more diligent.

      From the description of the bug, it also sounds like it should be catchable in testing if they're using a realistic configuration, so it's a double fail.

      Delete
    3. There's a saying . . . "When you see hoofprints, think horses and not zebras"

      Cloudflare lives in a DDOS world. Not surprised they saw it that way. But it's always good to have a court jester around to ask the awkward questions.

      Delete
    4. Rare things happen rarely.

      Delete
  28. Bot attack

    So, the AI has control of the internet, and can take it away at will.

    ReplyDelete
  29. Centralization is the problem

    What most people aren't getting is that the existence of systems like Cloudflare IS the problem.

    It should NOT be possible for an error on one system to take down a huge chunk of the internet. But that's EXACTLY what happened. Cloudflare, and the entire concept of ANY single company that has that much power to cause disruption, are a menace to the internet.

    Organizations need to seriously rethink their use of such systems at all.

    ReplyDelete
    Replies
    1. We lost that particular argument 30 years ago when the telcos expanded into and took over the Tier1 businesses

      Delete
    2. I'm not a fan of moving everything to 'the cloud', whether it's CloudFlare or AWS or Azure. But they are useful tools that we will all appreciate next week when the holiday sales kick off. The lesson here is compartmentalization: a problem like this shouldn't be able to cross datacenters.

      From my developer days I realize that there is never enough time to test everything. There are often not the correct tools available or procedures in place, and management hates to spend the money to duplicate production systems so you have a reliable test environment. And while it complicated deploys, I appreciated when processes were in place that would keep my fuckup from putting a company-wide, or worldwide, spotlight on me. An example: I had a policy of only updating one node in a cluster per day on critical applications. Not only did that mean updates took a lot longer, it also meant that version x had to be able to work with version x+1 and version x-1 (sketched at the end of this comment).

      And for those romanticizing the good old days of Big Telephone, you should go back and refresh your memory of the 1990 AT&T 4ESS long distance outage.
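
      The x-1/x+1 rule mentioned above, as a trivial sketch (version numbers and names invented): a node only accepts peers within one release of itself, which is what makes the one-node-per-day upgrade safe.

        fn compatible(local: u32, peer: u32) -> bool {
            // Accept the previous and the next release, nothing further.
            local.abs_diff(peer) <= 1
        }

        fn main() {
            assert!(compatible(42, 41));  // previous release still speaks to us
            assert!(compatible(42, 43));  // next release accepted mid-rollout
            assert!(!compatible(42, 44)); // two jumps means somebody skipped a step
        }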

      Delete
    3. "Organizations need to seriously rethink their use of such systems at all."

      Or, possibly treat the "proxy service" as, you know, a service, and if said service fails, fall back to direct connect, ie take the failed proxy out of the circuit. After all, the sites which use Cloudflare mostly used to do this themselves. Of course, they probably don't have the people or expertise in-house any more because someone else's server "cloud" :-/

      Delete
  30. Bad Rust code was the problem...

    For those who sanctify Rust every day, a plain .unwrap() in production code :-O

    https://blog.cloudflare.com/18-november-2025-outage/

    Remember, Rust is neither the holy grail nor a silver bullet; it is just another tool, and one that can be used by incompetent programmers. Having said that, stop this current nonsense of trying to rewrite everything in Rust.
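
    To be concrete about what proper handling could look like, a minimal sketch (not Cloudflare's code; the limit and names are invented):

      fn parse_config(raw: &str) -> Result<Vec<String>, String> {
          let items: Vec<String> = raw.lines().map(|l| l.to_string()).collect();
          if items.len() > 200 {
              return Err(format!("too many features: {}", items.len()));
          }
          Ok(items)
      }

      fn main() {
          let oversized = "feature\n".repeat(500);

          // What bit them, in spirit: panic in the hot path.
          // let features = parse_config(&oversized).unwrap();

          // The boring alternative: complain loudly and keep the last known good config.
          let last_known_good = vec!["feature_0".to_owned()];
          let features = parse_config(&oversized).unwrap_or_else(|e| {
              eprintln!("rejecting new config: {e}");
              last_known_good.clone()
          });
          println!("serving with {} features", features.len());
      }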

    ReplyDelete
    Replies
    1. Given a choice of Rust or PHP, I'll take Rust.

      Delete
    2. How have people who blame Rust for any bug written in it become so much more annoying than the Rust promoters they attempt to decry? Maybe it's because they use a similar tactic, but they took it one level farther. Rust fans have often pointed to any memory-related bug and said "look, that's why you should not use C", but at least the replacement they suggested would have actually done something about those. Whether it's this or one of a couple other articles, we're now beset by people who blame Rust for any bug written in it even if, as in this case, exactly the same bug could have been written with equal ease in any language of your choice.

      If you don't want more people to support Rust just because they're annoyed at you, you would do well to follow your own statements. "Remember, Rust is [...] just another tool", and if you insist on blaming it when the tool is not the reason for the problem, you're going to have trouble getting agreement except for those who already hated that tool. I don't like writing code in Javascript, but I don't blame it for every time someone does something I dislike with it. Unless JS made that happen, which it occasionally does because it does have some defects, the specific piece of code and its writer, not the language or its promoters or other things written in it or people rewriting something in it, is to blame.

      Delete
    3. I think the sentiment is that it was rewritten because they wanted it to be rust, and without that motivation they would be running the previous version... Overlooking the fact that presumably the previous version had some significant issues for them to conclude that the solution was to put in the resources to rewrite from scratch?

      I do wonder how much of it the "rust is amazing" stuff comes from people whose argument was actually "You have been refusing permission to rewrite from scratch, so how about we migrate to rust instead?"

      Delete
    4. It was originally PHP, then LuaJIT - CloudFlare were having to improve the LuaJIT compiler to keep it (their business) running.

      There was no re-write - the existing solution did not scale any further, and they had significant internal experience in Rust, so they went with what they knew for v2.

      v1 is still in the wild - it too suffered from this database issue, however it failed silently rather than taking out the process. This led to much higher traffic onto client services, as bots were able to make it through.

      Somehow their internal processes missed this obvious code smell - given the things Rust does protect from, imagine the issues they might have had with the same level of due diligence applied to C code...

      Delete
    5. > There was no re-write - the existing solution did not scale any further and they had significant internal experience in Rust so went with what they knew for v2.

      I wasn't saying that Cloudflare has rewritten its service in Rust. What I was trying to say is that Rust code has bugs too, especially because there is a lot of (bad) Rust PR that is brainwashing programmers, who end up thinking that, if it is (re) written in Rust, it will be better than a long time established piece of code written in C or C++.

      This PR is so bad that I saw posts from Rustaceans praising the .unwrap() call, because (they said) "without it, an undefined behavior would have occurred with unknown implications". Yes, the UB didn't occur, but half the Internet was down! - only because the error was not properly managed (or because .unwrap_or_default()/.unwrap_or(...) were not used). No language, not even Rust, can protect against programmer stupidity/laziness.

      Delete
  31. Respect

    Lots of blame going around and legitimate frustration at the impact but, personally, I actually appreciate the statement that was put out. This was very detailed and clearly taking accountability, not just for the outage, but for the remediation. Shit happens and it's about how you deal with it and improve resilience. They've identified the root causes in impressive time and committed to actions to resolve those issues. In my book this is a perfect response.

    ReplyDelete
  32. Man, Little Bobby Tables all grown up and working for a big company like Cloudflare now

    ReplyDelete
    Replies
    1. Oh how the Tables have turned!

      Delete
    2. I was thinking Bobby Droptables, but close enough.

      Delete
  33. Hardened Data

    "Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input"

    Should have been done already. IMO, you should sanitise/check input data no matter what the source...maybe even more so for auto-generated stuff. Just because you got it from another computer in the same company doesn't make the data quality any better.

    Auto-generated data is just data that has been written by a 'Human Nth removed' programmer. If they are bad or don't understand the task, then of course things like that will happen.

    We hear of 'Move fast and break things'....looks like someone did!

    Icon is for their change control and risk management.
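
    In practice that hardening can be cheap. A sketch of the shape of it (the limits and names are made up, not Cloudflare's): refuse to apply a generated file that fails basic size and shape checks, and keep the previous one instead.

      struct Limits { max_bytes: usize, max_entries: usize }

      fn validate(raw: &str, limits: &Limits) -> Result<(), String> {
          if raw.len() > limits.max_bytes {
              return Err(format!("file is {} bytes, limit {}", raw.len(), limits.max_bytes));
          }
          let entries = raw.lines().filter(|l| !l.trim().is_empty()).count();
          if entries > limits.max_entries {
              return Err(format!("{entries} entries, limit {}", limits.max_entries));
          }
          Ok(()) // only now is the file allowed anywhere near the proxies
      }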

    ReplyDelete
    Replies
    1. Input validation can be quite slow. And Cloudflare's systems MUST be fast. I would need to know a lot more about their systems before I would agree with your prescription for this particular ailment.

      Delete
  34. Microsoft:

    Guys, we've taken tens of millions of users offline far too many times this year; we must be the most unreliable out there!

    Cloudflare:

    Hold my beer!

    ReplyDelete
  35. Login Hell

    So I'm getting bad login, I decide to change my password, 2-factor code not coming through,

    I JUST WANT TO ADJUST MY DAMN THERMOSTAT.

    The joke about your toilet needing your email address is brilliant and not funny.

    ReplyDelete
  36. They never learn

    It was in 1978 that I learned that you *always* validate your input data, particularly size. Almost 50 years later and the same mistakes still happen.

    ReplyDelete
  37. The elephant in the room

    There's been a great deal of discussion here, a few jokes, and a fair amount of war stories. Strange, though, that everyone seems to ignore the bigger issue, which is that for a very large chunk of the Internet, Cloudflare has become a single point of failure.

    And if that's not bad enough, said SPOF doesn't seem to have adequate change management procedures in place, to the point where they didn't even realize it was their own change that broke mission-critical services that large numbers of the world's businesses and private citizens rely on.

    Ehm... That's bad, right?

    ReplyDelete
  38. I’ve seen the future brother, it is murder

    I find it darkly feckin’ hilarious that a globally distributed network designed to be fault tolerant has gotten, 40 years later, to a situation where large amounts of it become unusable when just one of a half-dozen large companies has a bad day.

    How long’s it been since the AWS us-east-1 debacle? Like a month or something?

    ReplyDelete
  39. What we need is

    What we need is an Internet designed to not rely on a single point of failure!

    Oh, hang on…

    ReplyDelete
  40. Eggs

    Eggs in one basket again? Or should that be one bad apple ...

    ReplyDelete
  41. Reminds me of when I kept confusing F2 and F4 when coding in BASIC: one key loads, the other saves. Guess who?

    ReplyDelete
  42. So we're positive it wasn't DNS this time?

    Because historians need to know.

    ReplyDelete
  43. Too big to fail is not good

    Having everyone depend on the same infrastructure providers is a problem. When they go bad they take down too much of the internet. Countries should pass laws that encourage lots of these providers instead of supporting the very big ones.

    ReplyDelete