There is this spooky term that floats around in almost all engineering teams:
Breaking Changes
But what are they, really? Why would we dread them so much? And most importantly, how can you deploy them safely?
What are Breaking Changes?
Simply put, breaking changes are a dependency problem between one software system and one or more dependent software systems. While there might be a lot of changes to a software system over its lifetime, not all of them run the risk of being breaking. Usually, only those changes that alter the exposed surface of a software system (its interface) can be breaking.
Of course, if a software system exposes all of its internals as its interface, every change to it may be breaking (which is why encapsulation is such an important concept!).
But that is just an upper bound for breaking changes. Not every modification of an HTTP API's response body schema, not every change to a method's return type, and not every additional header value in a protocol frame is a breaking change.
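To make that concrete, here is a minimal, hypothetical sketch (the payloads and the consumer are made up for illustration) of an additive change that a downstream consumer tolerates versus a rename that it does not:

```python
# Hypothetical JSON payloads returned by an upstream HTTP API.
old_response = {"id": 42, "name": "Ada"}

# Adding a field is usually tolerated: consumers that only read
# "id" and "name" keep working.
expanded_response = {"id": 42, "name": "Ada", "email": "ada@example.com"}

# Renaming (or removing) a field breaks every consumer that still
# reads the old key.
renamed_response = {"id": 42, "full_name": "Ada"}


def downstream_greeting(payload: dict) -> str:
    # This downstream code depends on the "name" key of the interface.
    return f"Hello, {payload['name']}!"


print(downstream_greeting(old_response))       # works
print(downstream_greeting(expanded_response))  # still works
try:
    print(downstream_greeting(renamed_response))
except KeyError as missing_key:
    print(f"broken downstream: missing key {missing_key}")  # the breaking case
```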
In my opinion, the definition of breaking changes cannot just be made once and then never change. Instead, I sympathize with an approach similar to the definition of the term Software Architecture by the late Stefan Tilkov in his GOTO 2019 Talk "Good Enough" Architecture:
> "Architecture is a property of a system, not a description of its intended design"
I view breaking changes in a similar way: as properties of a system in the context of those depending on it. Every change in a software system (upstream) that causes unwanted changes in at least one dependent software system (downstream) is a breaking one.
Unwanted changes may be compilation errors, runtime errors, or changes in runtime behavior that are not necessarily errors. It all depends on the downstream systems and their tolerance for change.
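As a hypothetical illustration of that last category, an upstream change that only flips an ordering can break a downstream system without producing a single error:

```python
# Hypothetical upstream function listing orders, oldest first.
def list_orders_before() -> list[dict]:
    return [{"id": 1, "total": 10}, {"id": 2, "total": 25}]


# Same schema, same types, no errors anywhere: just newest first now.
def list_orders_after() -> list[dict]:
    return [{"id": 2, "total": 25}, {"id": 1, "total": 10}]


# A downstream system that silently relied on "the first order is the
# oldest one" now behaves differently, and nothing crashes to tell you so.
print(list_orders_before()[0]["id"])  # 1
print(list_orders_after()[0]["id"])   # 2
```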
Now, with a proper definition of breaking changes (the term, not the changes themselves, obviously) at hand, let's see how to deploy them safely.
Changing without Breaking
Please take the following advice with a grain of salt, since I am speaking from the perspective of someone who currently works in an environment of tens of interdependent services, not hundreds or thousands.
When deploying breaking changes, there are several ways to go about it:
Plan A: YOLO
Just ship it. See what breaks. Fix it.
That's bad. It's bad for the downstream systems, and it requires long shifts to find the broken dependents, either by you or by the maintainers of the downstream systems. If the latter is the case, you just made a lot of enemies.
Do not do it like that.
Plan B: Informed YOLO
Make a plan and prepare the changes for the downstream systems that would break. Then deploy the upstream change, followed by the downstream changes.
In some cases, this is a valid option, but be aware of the risks. From the moment you deploy the breaking change upstream, the clock is ticking for the downstream changes to be deployed.
If interactions between downstream and upstream are infrequent, you might get away with it (e.g., nightly batch tasks, low-priority calls, etc.).
This tactic can work in small teams and limited scopes, but it quickly hits its limits when deployments take a long time or the constraints mentioned above are not met.
Do this only if you need to.
Plan C: YOLO, so be careful
You plan the required changes and spread them over multiple deployments so that there is never a state of inconsistency.
How exactly can you achieve this? Here is the blueprint:
1. Expand Downstream
    - Figure out a superset interface that includes both the old and the new interface
    - Expand each downstream system to accept the superset interface
    - Deploy the changes to the downstream systems
2. Expand Upstream
    - Implement the same superset interface in the upstream system
    - Deploy the changes to the upstream system
    - Verify that each downstream system correctly handles the new interface as part of the superset interface
3. Reduce Downstream
    - In each downstream system, remove the handling of the old part of the superset interface, accepting only the new interface
    - Deploy the changes to the downstream systems
4. Reduce Upstream
    - Change the upstream system to only provide the new interface
    - Deploy the changes to the upstream system
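Here is a minimal sketch of those four steps, assuming a hypothetical upstream JSON payload whose name field is being renamed to full_name (all names are made up for illustration):

```python
# Step 1 - Expand Downstream: accept the superset (old or new field).
def downstream_handler(payload: dict) -> str:
    full_name = payload.get("full_name") or payload.get("name")
    if full_name is None:
        raise ValueError("payload carries neither 'name' nor 'full_name'")
    return f"Hello, {full_name}!"


# Step 2 - Expand Upstream: emit the superset (both fields).
def upstream_payload_expanded() -> dict:
    return {"id": 42, "name": "Ada", "full_name": "Ada"}


# Step 3 - Reduce Downstream: once every downstream reads "full_name",
# drop the fallback to "name".
def downstream_handler_reduced(payload: dict) -> str:
    return f"Hello, {payload['full_name']}!"


# Step 4 - Reduce Upstream: stop emitting the old field.
def upstream_payload_reduced() -> dict:
    return {"id": 42, "full_name": "Ada"}


# At no point does a deployed combination fail:
print(downstream_handler({"id": 42, "name": "Ada"}))            # old upstream, expanded downstream
print(downstream_handler(upstream_payload_expanded()))          # both expanded
print(downstream_handler_reduced(upstream_payload_expanded()))  # reduced downstream, expanded upstream
print(downstream_handler_reduced(upstream_payload_reduced()))   # both reduced
```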
You just deployed a breaking change without making enemies of your coworkers!
Pros and Cons
Pro
- Deploy breaking changes with zero downtime
- Applicable in many contexts:
    - Database migrations: write to the old and the new table, read only from the new table, remove the old table (see the sketch after this list)
    - Code Interfaces: create a new interface function, use the old one as fallback, remove the old function
    - Network Protocols: ... kidding, just pray (I am looking at you, IPv6)
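To illustrate the database-migration case, here is a minimal sketch using an in-memory SQLite database and a hypothetical users-to-accounts table rename:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Expand: create the new table and write to both during the transition.
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")


def create_user(name: str) -> None:
    # Dual write keeps the old and the new table consistent while readers migrate.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    conn.execute("INSERT INTO accounts (name) VALUES (?)", (name,))


create_user("Ada")

# Readers are pointed at the new table ...
print(conn.execute("SELECT name FROM accounts").fetchall())  # [('Ada',)]

# Reduce: once nothing writes to or reads from "users" anymore, drop it.
conn.execute("DROP TABLE users")
```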
Con
- Many deployments
- Oftentimes ugly intermediary code after steps 1 and 2
- Only effective when all downstream systems can be controlled
As you can imagine, the entire concept falls apart when you cannot control the downstream dependents of a software system. But that is a challenge outside the scope of this post.
Closing Thoughts
Breaking changes are a reality for every developer shipping production systems. Shipping them successfully oftentimes requires a lot of hard work on top of the changes themselves, and you should consider that work part of the effort, not just an afterthought.
There are, of course, many other ways to accomplish this task, and I am truly sorry for everybody who provides a public API or ships mobile apps. While the latter "just" increases the delay between deployments, the former entirely kills the idea of ever shipping breaking changes and instead introduces deprecation, another nasty term.
So, please be kind to your downstream and ship breaking changes without breaking stuff.
Bye!
Chris