thoughts on the latest cloudflare outage
against my best instincts, i will put forward my opinion on the recent cloudflare outage. you should probably read their postmortem for context.
intro and background
tl;dr is that a quarter of the internet went down because a database change caused oversized config files to be propagated to a critical component of their core proxy. this component used a fixed amount of preallocated memory as an optimization, and nothing validated the size of the config it was parsing. when the oversized config was loaded, the application crashed, which translated into an http 500 server error, which blocked all traffic attempting to traverse that component. in other words:
cloudflare: i would like to parse dynamic config generated by an external service
me: fair enough
cloudflare: i would like to restrict the size of this config, for performance
me: fair enough
me: what will you do if the config is too big?
cloudflare: die instantly
me: ah.
spooky scary sloppy rust code
the snippet of code provided in the postmortem is as follows:
pub fn fetch_features(
&mut self,
input: &dyn BotsInput,
features: &mut Features,
) -> Result<(), (ErrorFlags, i32)> {
features.checksum &= 0xFFFF_FFFF_0000_0000;
features.checksum |= u64::from(self.config.checksum);
let (feature_values, _) = features
.append_with_names(&self.config.feature_names)
.unwrap();
the problematic part is that last line: .unwrap(). in rust, errors are handled using sum types, namely, Result<T, E>. this Result object can then be dealt with in various ways: either explicitly with match, if let, or let ... else; automatically with ? or .unwrap_or_else(); dangerously with .unwrap(); or unsafely with .unwrap_unchecked(). here's how each would have looked; don't mind the owl references, we'll get to it:
//we don't know the actual return type of .append_with_names(),
//so this is a guesstimate:
let res: Result<(Vec<FeatureValues>, Unknown), AppendError>
= features.append_with_names(&self.config.feature_names);
//handle the error directly
match res {
Ok((feature_values, _)) => /*do whatever you needed to do here*/,
Err(error) => {
log_error!(error, "failed to append features when loading config");
//draw the rest of the owl
},
}
//find a safe default
let (feature_values, _) = res.unwrap_or_else(|error| {
log_error!(error, "failed to append features when loading config");
(vec![], Default::default()) //draw the rest of the owl
});
//bubble it up - let the caller draw the rest of the owl
let Ok((feature_values, _)) = res else {
log_error!("failed to append features when loading config");
return Err((ErrorFlags::AppendError, 120))
//who knows what this error return should be
};
//another way to bubble it up, more conveniently but without a local log
//(assuming the error type is convertible)
let (feature_values, _) = res?;
//the bare minimum - cloudflare's actual solution.
//crash the program if you can't append
let (feature_values, _) = res.unwrap();
//more or less equivalent to the above, with slightly better debuggability
let (feature_values, _) = res.expect(
"there should be a max of 200 feature values"
);
//abdicate all responsibility. perform undefined behaviour if you can't append.
//here there be memory corruption. summon the nasal demons.
//thankfully cloudflare had the sense not to do this
let (feature_values, _) = unsafe {
res.unwrap_unchecked()
};
what's wrong with unwrapping
unwraps are to be used in one of two1 scenarios:
- you do not care about your code very much. if it crashes, who cares
- the error is believed to be completely impossible. there is an invariant being upheld that rust's type system is insufficiently smart to prove. if the value is an error, something has gone catastrophically wrong and no assumptions can be made about the state of the program. the only safe response is to panic, unwind the stack and kill the entire thread. we can only pray that is enough.
this unwrap falls in neither of these. cloudflare's core proxy is critical infrastructure and thus must make every effort not to crash, eliminating the first point. the second is where cloudflare-positive discourse seems to have placed them - the engineer and code reviewer must have believed it impossible to fail here. i disagree - in a situation where you are parsing input provided from an external system, generated by a database, across a network boundary, no assumptions can be made. operate as if it is, if not actively malicious, at least highly suspicious.
cloudflare's core proxy code should probably have zero potential panics in it. this makes it more difficult to write rust, i understand that. if they must write panicking code, at the very least it should have safety-style comments above each potential panic indicating why the engineer believes a panic is impossible.
additionally, this component of the core proxy was a filter designed to keep llm scrapers and other (semi-)malicious bots from accessing a client site. in the event of a catastrophic failure like this, the component should obviously fail open - bot management is not a reliable security boundary, and increasing load / visibility is far superior to taking the website off the internet altogether. present one of those horrible little javascript challenges on each page load if you must (i'm thinking of ticket-sellers looking to avoid being bought out by bots), but don't break the web. different components have different threat models; i'm sure e.g. their zero trust product would prefer failing closed.
crux
the core of the issue is an unstated, unvalidated assumption - the config will never exceed 200 features. rust did its best to warn the coder about this issue; the coder responded by explicitly ignoring the concern, and this outage is the result. a sufficiently conscientious programmer and reviewer would have looked at this Result, thought "how might i handle this properly", and pursued a proper solution.
now we finally return to "draw the rest of the owl"; it is not obvious what this solution should have been. i have several ideas: there could have been a validator in the config generator or distributor, plus a //SAFETY comment immediately before this unwrap; that would have prevented the outage. i'm hesitant about this, though - if the validator fails, we're back where we started. the solution could have also been a configuration manager that stores past versions, so the proxy could revert to a previous known-good config in the event of a misconfiguration like this.
how i would have fixed this
configuration is about defining a point in multidimensional space. this point is fed to your program to define its initial state and rules about how it handles input and output. if configuration fails, a critically important program should continue to work at all costs, in this case by statically choosing an acceptable default point in the state-space, inhabiting it, and logging a maximum-severity error message. ideally, such a message would trigger an alert and page an on-call engineer to fix the problem immediately, but until then, nothing is fundamentally broken. for this program, the sensible static default is "allow every request through".
postscriptum - response to "this is rust's fault"
if this had been c, cloudflare would have failed to check the error code and proceeded to perform undefined behaviour. whee, memory safety cves! if this had been java, javascript, python, go, haskell, c#, kotlin, lisp, etc., this program would have been ripped out and rewritten in rust; if cloudflare is so memory-constrained as to only allow 200 entries in an array, a garbage-collected language would have been right out. if this had been c++, it would have thrown an unchecked exception, same as the panic, and the outage would have been exactly the same. maybe zig solves this; i haven't written any zig.
rust is not a perfect language. the naming of .unwrap() is suboptimal; error handling in tutorials and educational work is often brushed under the rug with it. maybe in rust 2.0, which will never happen, we can rename unwrap. until then, it is the professional responsibility of engineers, at least those who create programs which must have extremely high reliability, to understand the language in which they work and not litter their code with potential crashes.
-
there is a third2, false option: if you handle panics with
catch_unwindor a panic hook, and you will convert the panic into an error 500 or something. cloudflare probably had this, it's common in web frameworks. this is not something you want to rely on. panic catching is dangerous; if you panic within a panic (either in aDropimplementation or the panic handler) it hard-aborts your program immediately. also, responding with a 500 is not actually the correct response in this case! as i said above, this filter should have failed open, not responded to every request with a 500. listen to the rust documentation: "it is not recommended to use this [...] for a general try/catch mechanism" ↩ -
another, even less defensible idea: you are allowed to panic if there is no sensible way of responding to such an error. i disagree with this - in a webserver, you want to respond to the client no matter what; in a user-driven app with a gui, tui, or cli, splatting a panic stacktrace is not a friendly user experience; in a library you should really never panic but always return some indication of failure, etc. ↩