Part 1 - Setting the Stage

Introduction

Configuring cloud environments can be cumbersome, and like many others I often wonder whether some providers have intentionally designed their GUIs that way. Earlier in my career, I was often flown in as a consultant to help teams and companies get a grip. The tools at hand mainly consist of a way of working, a bit of culture, and a bit of management balance between delivering fast and delivering well. Without proper infrastructure-as-code (IaC), this job can become 10 times more tedious, even if you are able to navigate the GUI or the APIs.

As such, I’ve seen firsthand the value of IaC, and within large cloud environments I would never advise against it. Not even when you bring AI into the mix. At the time of writing, my teams at NN are using Terraform to manage 100K+ resources across dozens of different environments. IaC is essential.

Things get hairier once you start crossing the boundaries of the cloud, or clouds if you’re multi-cloud. In many cases Terraform providers do exist, but their quality differs, and in some cases the underlying technology hasn’t really been adapted to work well with the API that exposes it. With state involved, it is essential to have correct APIs that can actually report the current configuration back to the state.
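
To make that last point concrete, here is a minimal sketch in Python. It is not how Terraform or any provider is actually implemented. The function `detect_drift` and all attribute names are hypothetical, and the sketch assumes a tool that can only compare the attributes an API actually returns.

```python
# Illustrative sketch only: the function and attribute names are hypothetical
# and do not correspond to Terraform internals or any real provider.

def detect_drift(recorded: dict, live: dict) -> dict:
    """Compare recorded state with what the API reports.

    Only attributes the API actually returns can be compared; anything
    missing from `live` is silently skipped (an assumption of this sketch).
    Returns {attribute: (recorded_value, live_value)} for each difference.
    """
    return {
        key: (recorded[key], live[key])
        for key in recorded
        if key in live and recorded[key] != live[key]
    }


# What the IaC tool believes it configured (hypothetical resource).
recorded_state = {"sku": "premium", "public_access": False, "retention_days": 30}

# An API that reports the full configuration: the manual change is visible.
complete_read = {"sku": "premium", "public_access": True, "retention_days": 30}

# An API that only exposes part of the configuration: the same manual
# change to public_access can no longer be seen.
partial_read = {"sku": "premium"}

print(detect_drift(recorded_state, complete_read))
print(detect_drift(recorded_state, partial_read))
```

With the complete read, the changed `public_access` value shows up as drift. With the partial read, nothing is reported even though the live configuration has changed, which is the kind of blind spot an incomplete API introduces.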

You already cross this boundary when you start to provision databases, or SaaS services like Databricks Workspaces. And some organizations might even have built their own boundaries within the cloud to limit the possibilities for other teams. This might sound like a bad thing, but it is not by definition. It depends.

So IaC and API availability go hand in hand. And more often than not, API design greatly influences IaC design. If the API fits well with the responsibilities of a certain team, that team will generally be pushed to implement IaC at some point. If it does not fit at all, or if there’s no API available, it is a very different story. No API, no IaC.

Finally, there is this grey area in the middle. A certain service is configurable through APIs, but not fully. For some parts of the service, privilege escalations, third-party approval, or perhaps some technicality prevent teams from further automating their work.

This series of blog posts is about such a story.

APIs that do not fit with the organization

In the scenario at hand, some parts of the service to be configured are locked behind admin permissions. Since my teams are responsible for provisioning and enablement, rather than the more abstract central parts related to administration, there are simply some things that they are not allowed to do. The reasoning is always security, and in many cases it does make sense.

Nonetheless, the configuration of admin-level objects presents a stringent limitation, especially because my teams are moving significantly faster than the central administrative team. If you think about it, that is probably often the case. The more central your IT service becomes, the more stakeholders you will have and the harder it will be to scope your service well.

This does pose a significant challenge. Since managing those admin-level objects is important for several new features of the underlying service as a whole, there is a direct need for speedy development cycles. The last thing you want is to have a separate team in the loop to scrutinize and approve your moves. Besides the risk of slowing down the overall process, there’s also an impact on teams and people being able to move autonomously. This slight lack of control also pushes the enablement team away from taking responsibility for the service as a whole towards its customers.

So what do companies do? They look for improvements to the process. One example is formal ticketing between the two teams, which, if done wrong, mostly makes it easier for the central team to hide behind formalities. The other methods rely on either shared codebases, shared CI/CD pipelines, or APIs to automate the interfacing.

In our case, a short-term solution involving a shared pipeline behind a third-party API was chosen. In addition, work was defined and planned to turn this short-term solution into a standalone API that does not depend on third-party compute and pipelines. I pushed very strongly for this, since CI/CD pipelines are not really meant to host APIs and interfaces between separate teams.

Long Story Short

We have a large Terraform codebase that manages 100K+ resources, and we are now confronted with a custom internal API whose underlying objects are closely integrated with other objects in the existing codebase.