Saturday, 19 October 2019

Operating Manual for the Ship of Theseus

Over the last five or more years I've more or less abstained from conferences and software architecture books. Industry focus was on Cloud, Serverless and ML, which left system design stalling, with the occasional rare exception (KNative, Learned Structures, ISA and Simon's ongoing quest for explainability come to mind).

Conference speakers still explain Agile, DevOps, ADRs, EDA and Resilience, while people still pile up tech debt and big balls of mud, just now using Serverless or Kubernetes. The 20th anniversary edition of The Pragmatic Programmer, which I hold in my hands, says it is “just as relevant today as it was back then”. Having seen the limits of Agile, I became more interested in the product operations and SRE side of systems, and in how observability, explainability, human collaboration and supportability spin system evolution to converge towards simplicity (or not) and build a community-centric narrative that hopefully enables, in the long term, better socio-technological structures.

Over the last year or so, though, I found myself surprised by an influx of interesting material, in particular with regard to bias, culture, empathy, failure, discovery, resilience (SRE) and risk awareness in the orchestration complexity of socio-technical systems.

A welcome change, and a long overdue self-correction of VC-fueled get-rich-quick startup culture. Concrete examples are my favourite book Accelerate, which inspired a great Böckeler / Fowler talk at Craft Conf and a very entertaining self-reflection by Tilkov, and other nice perspectives like George Fairbanks' Continuous Design talk and Nygard thinking about state, Videla's, Wichary's, Steenson's and Ullman's thoughts about weird languages, Design It! (and the 2nd edition of Release It!), but also some really great distributed systems research by Howard and Kleppmann challenging our perspective on concurrency, and new, humble ways to create frameworks.

Empathy for Entropy

It's not simply that Microservices have made microblogging-driven startups suddenly realize the value of BDUF. Rather, it is the agile move away from Enforcement towards Observation, with better tools, empathetic user-centric techniques and ways of thinking about consistency and concurrency. It reminds me of my W-Jax talk 7 years ago, when I first read about Spanner and its formalization of time. When Kleppmann argues for OLEP and Local First, that the queue is the database, it reminds me how Spanner argues that the database is the queue. Both arrive at the same insight: that information and time are entangled, and that entropy or consistency are derived from that. As Howard beautifully analyzed, for our systems it’s easy to replace the concept of time with state transitions (Lamport clocks). This allows us to step away from seeing the system as uncontrollable and ourselves as reactive, and to focus on the real, human cause for entropy.
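
To make the "time as state transitions" idea a bit more tangible, here is a minimal sketch of a Lamport clock in Python. The class and method names are my own, chosen for illustration, and not taken from Howard's or Lamport's writing. The clock never reads wall time; it only counts events and merges counters when messages arrive.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: counts state transitions instead of reading wall time."""
    time: int = 0

    def tick(self) -> int:
        # A local event (state transition) advances the clock by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is an event too; the timestamp travels with the message.
        return self.tick()

    def receive(self, message_time: int) -> int:
        # On receipt, move past both our own time and the sender's time.
        self.time = max(self.time, message_time) + 1
        return self.time


# Two hypothetical processes exchanging one message.
a, b = LamportClock(), LamportClock()
a.tick()             # local event on a
stamp = a.send()     # a sends a message carrying its timestamp
b.receive(stamp)     # b is now guaranteed to be "after" the send
```

The only guarantee is ordering: if one event causally precedes another, its timestamp is smaller. Nothing here says how much time has passed, which is exactly the point.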

Thursday, 29 November 2018

An annotated Philosophy of Software Design

We build our computer (systems) the way we build our cities:
over time, without a plan, on top of ruins
Ellen Ullman

After much hype I've read John Ousterhout's A Philosophy of Software Design, which he uses to teach his course at Stanford and presents as a personal-experience "opinion piece". Its basic goal seems to be to develop an awareness of, and intuition for, how and when to manage the complexity emerging out of proper problem decomposition in Software Design.

Undergraduate students thus seem to be the main audience, which explains why, despite the title, architecture or non-abstract large system design, distributed systems, developer workflow and design thinking are not covered much, and why a prosaic, almost aphoristic writing style was chosen. My assumption is that this is also why barely any references or annotations are given. It's supposed to be the beginning of an iterative journey, similar to language learning. I figured it might help to share mine, though, to go the next step in that journey (GitHub would feel too official, so blog):

Preface, Iterative process to this book. Exercises in Programming Style (2014), Beautiful Code (2008), Software Craftsmanship (2001) (Craftsperson) and before that The Pragmatic Programmer (1999) and even Programming Pearls (1986) developed an iterative, small-wisdom-based way of learning software design, which later developed into katas.

Saturday, 28 April 2018


After a recent conference I had some good questions and discussions about the current state of service meshes (Istio in this case), so I decided to note down what I find interesting about them.

I'm a fan of infrastructure-level (polyglot) service meshes. The premise excites me as much as Android did ten years ago. Having worked over those years with SOA, EDA, REST Hypermedia APIs, API Management, Microservices and language-bound Service Meshes (Netflix / Pivotal, Lightbend's Reactive), I tend to be careful about the resolution of their promises, though.

This post is about three lesser-known effects of service meshes. My last post already covered complexity, emergence and observability in a more general way, so I'll keep those topics short.


Horizontal scalability in the multi-core age is one of the main arguments behind basically all modern stateless software architectures. Consistency in distributed systems, immutability and state-handling are often-mentioned common properties, typically used to justify functional programming paradigms. To me it seems, though, that their most important common property is a complex, emergent network graph structure. Layers and tiers cannot represent contemporary systems anymore. Those systems have a complex adaptive graph structure in space (infrastructure, users, component interactions) and time (versioning / one codebase, rainbow releases, experiments, DevOps, event order, routing).

Over the last years we could observe a lot of frameworks quietly moving towards declarative graph alterations. The first I remember were Android Intents and Puppet. But more recently it was React, TensorFlow, Beam, Kubernetes and Eve (RIP), not to forget the re-emergence of SQL combined with flexible consistency models and stream processing.

All of those come with a robust, well-defined domain vocabulary and set of patterns that allow one to precisely define desired behaviour. A graph encourages modularization and reuse, and it allows for division of labor: better specialization while making the overall concept easier to understand. This, in turn, allows a wider, more diverse group of people to reason and converse about the behaviour of the system. The shared language and culture may hopefully enable them to learn alongside the system, what Nora Bateson calls "symmathesy". It requires all actors in the system to define goals and dependencies, versioned together in one codebase across layers and components, documentation, test (spec), customer support feedback and architectural decisions. That's why all good (micro) service-architecture principles contain continuous delivery and lean.
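
As a rough illustration of what a "declarative graph alteration" boils down to, here is a small Python sketch. It is not how React or Kubernetes implement it. The `plan` function and the service names are hypothetical. You describe the graph you want, and the steps are derived by comparing it with the graph you have.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]  # node -> set of nodes it depends on


def edges(graph: Graph) -> Set[Tuple[str, str]]:
    """Flatten the adjacency map into a set of (source, target) edges."""
    return {(src, dst) for src, deps in graph.items() for dst in deps}


def plan(actual: Graph, desired: Graph) -> List[str]:
    """Derive the alterations that turn `actual` into `desired`."""
    steps = []
    for node in sorted(desired.keys() - actual.keys()):
        steps.append(f"add node {node}")
    for src, dst in sorted(edges(desired) - edges(actual)):
        steps.append(f"add edge {src} -> {dst}")
    for src, dst in sorted(edges(actual) - edges(desired)):
        steps.append(f"remove edge {src} -> {dst}")
    for node in sorted(actual.keys() - desired.keys()):
        steps.append(f"remove node {node}")
    return steps


# Hypothetical system: the shop should additionally depend on a search service.
actual = {"shop": {"payments"}, "payments": set()}
desired = {"shop": {"payments", "search"}, "payments": set(), "search": set()}
steps = plan(actual, desired)
```

Nobody writes the steps by hand. The desired graph is the shared artefact that people version, review and talk about, and the alterations follow from it.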

The biggest difference between those graph-based declarative approaches and Model-Driven concepts (MDD/MDA) is that they are bottom-up, and designed to support evolution*. Instead of requiring a canonical model, tribal ("bounded") domain language or strict interface contracts externally, it is very easy to implement domain-event messaging on the infrastructure level, because the infrastructure itself has meaning, allowing for independently composed distributed systems - in other words choreographed rather than orchestrated.
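
To show the difference between choreographed and orchestrated in the smallest possible form, here is a Python sketch. The in-memory `EventBus` is a hypothetical stand-in for the infrastructure-level messaging, and the event names are made up. There is no central coordinator: each participant only knows which event it reacts to and which event it emits.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """In-memory stand-in for infrastructure-level domain-event messaging."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.log: List[str] = []  # order in which events were published

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        self.log.append(event_type)
        for handler in self._handlers[event_type]:
            handler(event)


bus = EventBus()

# "Payments" reacts to orders, "shipping" reacts to payments.
# Neither knows about the other, and nothing scripts the overall flow.
bus.subscribe("order_placed", lambda e: bus.publish("payment_taken", e))
bus.subscribe("payment_taken", lambda e: bus.publish("parcel_shipped", e))

bus.publish("order_placed", {"order_id": "42"})
```

The end-to-end flow only exists as the emergent path through the subscriptions, recorded here in `bus.log`. An orchestrated version would instead have one function calling payments and shipping in sequence.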

The declarative, domain-event-driven approach shares some advantages of MDA, though: the vocabulary, patterns, and visible graph of dependencies. It makes it a lot easier to follow, though, and a lot harder to ignore. Once the implications of changes to the graph are commonly understood, it's a lot easier to reason about the graph (see, for instance, the original Flume paper). On the low level that service meshes target, the infrastructure level, this quality makes it a lot easier to reason about and iteratively learn the entire system (including the mesh itself and the DevOps process around it), and to version, document, track and test the system.

Sunday, 31 December 2017

Mandala or The end of control

A good friend of mine asked me why I don’t blog anymore, so I took my New Year's flight as an opportunity to write some random thoughts down. Happy New Year!

We used to build Information Systems or Control Systems. Sometimes they were clumsily merging, but they have finally become something entirely new: Intelligent Intent Systems. I don’t like catchy slang, though, so let’s just say we finally have universal “Systems”.

A sufficiently lean online shop is essentially an easy interface for sending a signal into an extremely complex, often entirely automated, logistics chain that translates information into physical control commands - the Information System part manages the feedback loop to humans. The more feedback, an idea from Control Systems, became incorporated into Information Systems, the smarter, faster, and more intuitively we were able to interact. Like frames in a movie, we’re now at the point where it becomes seamless and continuous. The IoT, spatial computing, but actually just technology becoming synonymous with information, close the loop back into our world (the "real" world). We used to interpret our systems as essentially closed and deterministic, in both imperative and functional programming styles. But the new types of systems have rapidly become Probabilistic Systems.

With the planet-scale cloud of distributed services, risk-driven models, from complexity and uncertainty theory, have taken center stage. With Machine Learning rapidly surpassing human experience in programming and research, we will ourselves have to model the real, human, world in a descriptive, probabilistic way as part of those systems, observing and inferring, rather than imperatively defining the message flows between agents and their consistency properties, the data flows between processors or structural limitations (think column- vs row-oriented data).

We don’t observe from outside the system, but as a part of it, much like quantum physics, psychology or sociology (especially of power) told us. We humans are only agents receiving signals in this system. We inhabit a second-order Cybernetics Technosphere we call reality, built on platforms that define what they want and value economically in entirely new, sometimes alien ways, like real-time exchanges including spatial and social dimensions - with the gig- and experience economy only being the tip of the iceberg. It’s not a matrix or a mastermind though, it’s just a real, techno-human ecosystem with its own, uncontrollable evolutionary goals.

Although I don't like the gig economy, I like the idea of evolutionary systems. Most writing focuses on the iterative process to avoid unpredictability, though, assuming some external, given, linearity, to incorporate feedback. That’s important for organizations, but it’s rarely mentioned where the feedback comes from: the goal, the adversary, the predator, or the local maximum.

I mainly work on the integration between real world and software, the hard end of mobile / ubiquitous / spatial / pervasive computing: building independent, distributed service meshes, in a DevOps and Design Thinking (or DesignOps, whatever the cool kids call it these days) way. In such systems, you’re always going after the local maximum; the goal is unclear and, more often than not, there are multiple conflicting ones. You’re naturally dealing with (domain) verticals rather than horizontals. Every day brings a new trade-off between the best experience possible and the technical realities. Those systems are not linear evolutions, they are mandalas of expanding and contracting system boundaries. They have to be observable, though, as with observation comes empathy, and with empathy learning.

In the future, we may refactor the parameters of our systems based on deeper insights about their non-deterministic behaviour. That's what I like about SRE-style work. It's the non-deterministic, the probabilistic part of software engineering. It focuses on observability and serviceability, and ML RCAs involve explainability - correctness becomes an optimization goal, not an axiom. When I first spoke about Spanner publicly in 2012, I compared it to the twisted experience of time in movies like Spaceballs and The Hitchhiker's Guide to the Galaxy. The powerful takeaway is: Nothing is fixed, if we can reason about it, we can change it.

To understand the magnitude of change to our profession, we have to understand the societal context. Of all the possible futures, a dystopian scenario is interesting here - I shorten my version after watching Charlie Stross’ talk from 34c3, which tells it better: In this scenario, the we is not a harmonic, transhuman, unity. The new we is us and algorithms from us, for us. A dark (in the sense of dark matter) singularity not of eternal life but of thoughtless, mutual uncertainty, where biased algorithms and biased, dumbed down or even corrupt, people push each other further into the edges, not becoming market segments but mobs which reinforce themselves. The algorithm is as helpless as its users, because the society and economy around it require the entropy as fuel, making regulation impossible.

Quick, personalized, adjustment, unlearning, or one-shot learning, can maybe avoid this scenario - it seems AI is already forgetting easier than us, controlling and optimizing itself faster than us, collaborating and sharing surprising insights nicer than us. In a "thinking fast, thinking slow" model, maybe the human ought to become the "slow" part - as Nate Silver once put it "complementary roles that computer processing speed and human ingenuity can play in prediction". Instead of turning away, the human needs to be enabled to act.

Right now, we train machine learning systems by saying “yes” - soon we will reach the point where transfer is so good that we’ll start saying “no”. That “no” has to be slow enough, it has to be thoughtful, and it has to weigh more than assumed silent consent. We have to introduce reason, empathy and ethics, not only into individual machine learning models, but into the whole system that is driven by technology, into all of the information and the human organizational complex around it.

What excites me about working with systems from an observability, serviceability, and explainability perspective is that we can bring in all of that rich knowledge from physics, psychology or sociology, hermeneutics, but also art, and start reasoning about the overall behaviour, rather than deterministic, imperative requirements. We only have to keep talking to each other - and try to understand.

Sunday, 19 March 2017

On the other side of Certainty

"We have to create the preconditions under which
Exaptation can happen naturally...
which actually means introducing inefficiency into systems"

"The question is no longer how systems behave ...
but how to ask for the probability distribution
of the properties that change the system"

Moving from the project into the product world last year, I wanted to experience the real long-term view, because only strategy can tackle complexity. And our software grows more complex, though arguably less complicated, every year. But I also wanted to understand uncertainty, the dark matter of complexity, better. After some time in, I understand it's probably more important for an architect* to have strong operations knowledge than very strong algorithmic skills.

Resilience is sometimes mistaken for being adaptive, or agile. It just means expecting uncertainty and disruption, but also working properly under normal conditions. Just calling everything disruptive, and reacting in an agile way to every random demand, is not resilient. For instance, "The datacenter as a computer"** came to us, rather unexpectedly, from a relatively "boring" infrastructure level, rather than from new frameworks and startups, but also not from consortium standards and language ecosystems. Similarly, in true grassroots manner, polyglot programming and JavaScript on top of Unix principles catapulted us into the 21st century and will eventually enable domain logic to exist, as Adrian Cockcroft calls it, in functions, unaware of most technological constraints. In order to understand resilience, you need to care about your product, and want to improve it, have a real goal, a story you want to tell. Ironically, you have to be a little bit un-adaptive, inefficient and un-agile in order not to overfit, but to really improve.

We are still waiting for the 4th paradigm of programming. My guess is it's going to be more than just goal-oriented, it will be probabilistic, in the sense of an abstract goal corridor. While engineers will live inside the goal corridor, making sure its workings are predictable, specializing in certainty, architects live on the outside, in the long tail of the probability distribution, the Multiverses where the Dragon Kings live, outside predictive models, specializing in uncertainty. That’s what makes a system resilient. It implies engineers have a fairly static/discrete/fitted view of a system, the perfect snapshot, the position, whereas architects have a time-smeared continuum perspective, the story, the momentum. It's the whole story, not only the goal, that differentiates between emergence and evolution. But it's also important to understand that both of those roles are equally creative and forward-looking, neither is a "higher level", they are two sides of the same coin.

Helga Nowotny's distinction between risk and uncertainty fits nicely here - risk can be computed, uncertainty cannot. Risk is a relation between snapshots, uncertainty is the continuum in between. A weather forecast has risk, the climate is uncertainty. In that sense, a risk-driven architecture involves everyone, but uncertainty needs a different perspective. The future leaves to the architect (the systems-of-systems-carer, archeological gardener, forensic librarian, ontological cartographer or whatever you like to call her) the role of the curator, and maybe narrator, of the uncomputable. Paraphrasing what Akka Architects say: Architects don't design system interactions, they curate the context for a discourse about system interactions.

"It is not a question of establishing limits with walls,
but by other means"

Why is all of this important? Why should we get used to speaking in terms of probabilities and, most importantly, tackle certainty and uncertainty differently, but with the same importance?

In my last job I learned the importance of maintainability and traceability. In every single one of my projects the first thing I introduced was proper monitoring, alerting and analytics - robustness was core, as was accountability, to be able to become lean and agile. With new infrastructure, whether in the cloud or not, this has become the default. The battle for certainty, traceability, and robustness is won.

While we were busy fighting this battle, uncertainty has come back as what Ms. Nowotny would call "systemic risk", risk deeply embedded in the complex relationships of our services: "Uncertainty switches gestalt". A cloud service going down is a risk that can be predicted - the dashboard not showing it, because it is hosted on the same cloud service, is systemic risk, the type of Dragon King uncertainty which we don't expect. Soon it will be mainstream that coders will pair with AI, and operations and product teams will regularly train models rather than manually define metrics. The relationships between components will become so complex that, more often than not, it will take a long time to even recognize errors, or feature usage, let alone find the root cause or customer need.
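
The dashboard example can be made concrete with a few lines of Python. This is only a sketch under made-up assumptions: the component names, the `depends_on` and `watches` maps and the `blind_spots` helper are all hypothetical. It flags every observer that shares a dependency with the thing it is supposed to observe.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: component -> what it runs on.
depends_on: Dict[str, Set[str]] = {
    "checkout": {"cloud-a"},
    "status-dashboard": {"cloud-a"},
    "pager": {"cloud-b"},
}

# Hypothetical monitoring map: observer -> component it watches.
watches: Dict[str, str] = {
    "status-dashboard": "checkout",
    "pager": "checkout",
}


def blind_spots(
    depends_on: Dict[str, Set[str]], watches: Dict[str, str]
) -> Dict[str, List[str]]:
    """Return observers that share a dependency with what they observe."""
    result = {}
    for observer, target in watches.items():
        shared = depends_on[observer] & depends_on[target]
        if shared:
            # If any of these fail, the observer goes down with its target.
            result[observer] = sorted(shared)
    return result


shared = blind_spots(depends_on, watches)
```

A check like this only catches the relationships someone thought to write down. The Dragon King part is the dependency nobody recorded, which is why this is a sketch of the risk side and not a cure for the uncertainty.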

Despite all the data we accumulate, Observability still does not mean Explainability (sometimes as beautifully visualized as our old architecture diagrams) or Introspection, the ability to rationalize the inscrutability of AI.

Most of our architecture diagrams are nothing more than Thomassons, useless depictions of a fictional state which we only keep because it took us so long to create them. But similar to code, it's actually more productive to delete them. What we need is a visualization of the statistical complexity of actual state. In my book, a good friend of mine wrote a story about how we made network traffic audible in order to hear inconsistencies - that's what we need to understand the architecture of a system. A real-time explanation of what's going on, mapped to the architecturally relevant components, such as interfaces, deployment units (like functions) and, last but not least, rules inside machine learning systems.

I am looking forward to new, real-time programming and data toolkits, think Eve, Jupyter and Glitch, especially because they enable a different kind of coder to build software. And with serverless and deep learning we will be able to scale those apps and the data required quicker than ever before. But it will require architects to understand them, if something is wrong, if a use case needs to be developed or a feature is behaving unexpectedly - i.e. in operations and product development. These architects won't be able to look at diagrams anymore, they won't be certain about documentation (not that they ever were), and they won't even be able to observe the system in its entirety. Architects will indeed become archeological gardeners or forensic librarians. Most importantly, they will become like anthropologists or biologists in the early days, trying to understand evolution, with the help of experimentation and collection. We'll finally see architects developing models instead of diagrams. And with that, we will see very different people in this role too. Which is very exciting.

*) separating this from a Tech Lead here, sometimes it can make sense to combine the two roles, though
**) e.g. serverless computing platforms such as Lambda and globally distributed databases such as Spanner, with them a different approach to time, where evergreen becomes an axiom, but also a comeback of Spreadsheets through functional programming and k/v stores