Exceptions happen every day: the bigger (and the more distributed) the system, the
higher the chances of things going south.
Most of us learned this lesson when the architectures we had idealized
bit us back in the form of catastrophic downtime that could have been avoided,
maybe just by adding a timeout where one was needed
or by keeping a few best practices for distributed systems
in mind. We are now better architects, who understand that failures are an option
and we have to build resilient systems that embrace them and work towards mitigating
There is one thing, though, that most of us (including me) still suck at: throwing
We have great infrastructure in place to log information and monitor our systems
and, in theory, everything is taken care of. Then the day comes when disaster strikes
in the form of a nasty bug, and we’re left trying to understand what’s
going on with our software.
How many times, after fixing a bug, have you found yourself saying
“let’s add some more logs though”? If you’ve been as frustrated as I have,
I’d recommend you read on.