Sunday, 31 May 2020

Observability, Debt or the Bret-Victor-Ization of distributed systems

I've been thinking how the different way of conceptualizing cost (in the broader sense of investment) in Cloud changes the tech debt metaphor. It never was a good metaphor to start with and allowed too many excuses, but I like the idea of expressing a suboptimal, incomplete or leaky level of abstraction somehow, as a dialectical, critical tool. I like declarative systems because they allow comparison of state over time, but they do limit expressiveness of our mental models, and omissions, when writing down code. Debt, with its pseudo-quantifiable touch is such a mental model limit. No one wants to keep ADR's for each of these*. How to solve this?


It's exciting to see how the next step in polyglot programming is taking stacks apart and designating a layer to developer experience for humans based on “progressive disclosure of complexity”, and how we argument for this is feedback time or, in other words, Software Delivery Performance. What a16z recently called The Decade of Design (but combined with craft and lean), the best example probably being Stripe which shows that beautiful API's and documentation and a beautiful website (and beautiful books) might after all correlate, and maybe because they take empathy to their heart (no surprises here for anyone who has seen or used PayPal). 

When I first saw this at Google working with Dremel / F1 and all the Data Mesh tools around it, an ontological, rhizomatic approach to data - without oversight, yet with structure, as side effect of, essentially DDD (or set / category theory, referring to ontology below). And the same when seeing Borg and the Service Mesh around it. Both were built as products with a builder focus, with "a builder" meaning universally anyone who wants to build something, meaning to contribute to a shared idea. When we say everything-as-code we need to go beyond engineering components, we, and that means everyone and the maximum of diverse perspectives, need to look at the product and all of its users. Similar to medical doctors who moved away from seeing "man as machine", or architects seeing the city or the house as a machine, we are slowly moving away from seeing a software system as a machine, perfectly controllable. Technology is not neutral, and a constant process and struggle that goes far beyond engineering.

Saturday, 19 October 2019

Operating Manual for the Ship of Theseus

Over the last 5 or more years I've had kind of an abstinence from conferences and software architecture books. Industry focus was on Cloud, Serverless and ML, leaving system design stalling, with the occasional, rare exception (KNative, Learned Structures, ISA and Simon's ongoing quest for explainability come to mind).

Conference speakers still explain Agile, DevOps, ADR's, EDA and Resilience, while people still pile up tech debt and big balls of mud, just now using Serverless or Kubernetes. The 20 year anniversary edition of The Pragmatic Programmer, which I hold in my hands, says it is “just as relevant today as it was back then”. Given that I saw the limits of Agile, I became more interested in the product operations, SRE side of systems and how observability, explainability, human collaboration and supportability spins system evolution to converge towards simplicity (or not) and builds a community-centric narrative that hopefully enables, in the long term, better socio-technological structures.

Over the last year or so, though, I found myself surprised by an influx of interesting material, in particular in regards to bias, culture, empathy, failure, discovery, resilience (SRE) and risk awareness in orchestration complexity of socio-technical systems.

A welcome change from a long overdue self correction of VC-fueled get-rich-quick startups culture. Concrete examples are my favourite book Accelerate, which inspired a great Böckeler / Fowler talk at Craft Conf and a very entertaining self reflection by Tilkov, and other nice perspectives like George Fairbanks Continuous Design Talk and Nygard thinking about state, Videla's, Wichary's, Steenson's and Ullman's thoughts about weird languages, Design It! (and the 2nd edition of Release It!), but also some really great distributed systems research by Howard and Kleppmann challenging our perspective on concurrency, and new, humble ways to create frameworks.

Empathy for Entropy

It's not simply that Microservices have made microblogging-driven startups suddenly realize the value of BDUF. It rather is the agile move away from Enforcement towards Observation with better tools, empathetic user-centric techniques and ways of thinking about consistency and concurrency. It reminds me of my W-Jax talk 7 years ago, when I first read about Spanner and its formalization of time. When Kleppmann argues for OLEP and Local First, that the queue is the database, it reminds me how Spanner argues that the database is the queue. Both arrive at the same insight: That information and time are entangled, and that entropy or consistency are derived from that. As Howard beautifully analyzed, for our systems it’s easy to replace the concept of time with state transitions (Lamport clocks). This allows us to step away from seeing the system as uncontrollable, and us reactive, to focus on the real, human cause for entropy.

Thursday, 29 November 2018

An annotated Philosophy of Software Design

We build our computer (systems) the way we build our cities:
over time, without a plan, on top of ruins
Ellen Ullman

After much hype I've read John Ousterhout's A Philosophy of Software Design which he uses to teach his course at Stanford and presents as a personal experience "opinion piece". Its basic goal seems to be to develop an awareness and intuition about how and when to manage complexity emerging out of proper problem decomposition in Software Design.

Undergraduate students thus seem to be the main audience, which explains why, despite the title, architecture or non-abstract large system design, distributed systems, developer workflow and design thinking are not covered much, and why a prosaic, almost aphorismic writing style were chosen. My assumption is this is  also why barely any references or annotations are given. It's supposed to be the beginning of an iterative journey, similar to language learning. I figured it might help to share mine, though, to go the next step in that journey (Github would feel too official, so blog):

Preface, Iterative process to this book. Exercises in Programming Style 2014Beautiful Code, 2008Software Craftsmanship 2001 (Craftsperson) and before that The Pragmatic Programmer 1999 and even Programming Pearls 1986 developed iterative, small-wisdom, later developed into Kata's, based learning of software design.

Saturday, 28 April 2018


After a recent conference I had some good questions and discussions about the current state of services meshes (Istio in this case), thus decided to note down what I find interesting about them.

I'm a fan of infrastructure-level (polyglot) service meshes. The premise excites me as much as Android ten years ago. Working over those years with SOA, EDA, REST Hypermedia API's, API Management, Microservices and language-bound Services Meshes (Netflix / Pivotal, Lightbend's Reactive), I tend to be careful about the resolution of their promises though.

This post is about 3 lesser known effects of service meshes. My last post already covered complexity, emergence and observability in a more general way so I'll limit those topics.


Horizontal scalability in the multi-core age is one of the main arguments behind basically all modern stateless software architectures. Consistency in distributed systems, immutability and state-handling are often mentioned common properties, typically used to justify functional programming paradigms. To me it seems though, their most important common property is a complex, emergent network graph structure. Layers and tiers cannot represent contemporary systems anymore. Those systems have a complex adaptive graph structure in space (infrastructure, users, component interactions) and time (versioning / one codebaserainbow releases, experiments, DevOps, event order, routing).

We could observe a lot of frameworks in the last years quietly move towards declarative graph alterations. The first I remember were Android Intents and Puppet. But more recently it was ReactTensorflowBeamKubernetesEve (RIP) not to forget the re-emergence of SQL combined with flexible consistency models and stream processing.

All of those come with a robust, well-defined domain vocabulary and set of patterns that allows to precisely define desired behaviour. A graph encourages modularization and reuse, it allows for division of labor: Better specialization while making the overall concept better understandable. This, in turn, allows a wider, more diverse group of people to reason and converse about the behaviour of the system. The shared language and culture may hopefully enable them to learn alongside the system, what Nora Bateson calls "symmathesy". It requires all actors in the system to define goals and dependencies, versioned together in one codebase across layers and components, documentation, test (spec), customer support feedback and architectural decisions. That's why all good (micro) service-architecture principles contain continuous delivery and lean.

The biggest difference between those graph-based declarative approaches and Model-Driven concepts (MDD/MDA) is that they are bottom-up, and designed to support evolution*. Instead of requiring a canonical model, tribal ("bounded") domain language or strict interface contracts externally, it is very easy to implement domain-event messaging on the infrastructure level, because the infrastructure itself has meaning, allowing for independently composed distributed systems - in other words choreographed rather than orchestrated.

The declarative, domain-event-driven approach shares some advantage of MDA though: The vocabulary, patterns, and visible graph of dependencies. It makes it a lot easier to follow, though, and a lot harder to ignore. Once the implications of changes to the graphs are commonly understood, it's a lot easier to reason about the graph (see, for instance, the original Flume paper). On the low level that service meshes target, the infrastructure level, this quality makes it a lot easier to reason and iteratively learn the entire system (including the mesh itself and the DevOps process around), and to version, document, track and test the system.

Sunday, 31 December 2017

Mandala or The end of control

A good friend of mine asked me why I don’t blog anymore, so I took my new-years flight as an opportunity to write some random thoughts down. Happy new year!

We used to build Information Systems or Control Systems. Sometimes, they were clumsily merging - but finally become something entirely new: Intelligent Intent Systems. I don’t like catchy slang, though, let’s just say we finally have universal “Systems”.

A sufficiently lean online shop is essentially an easy interface for sending a signal into an extremely complex, often entirely automated, logistics chain that translates information into physical control commands - the Information System part manages the feedback loop to humans. The more feedback, an idea from Control Systems, became incorporated into Information Systems, the smarter, faster, and more intuitive we were able to interact. Like frames in a movie, we’re now at the point where it becomes seamless and continuous. The IoT, spatial computing, but actually just technology becoming synonymous with information close the loop back into our world (the "real" world). We used to interpret our systems as essentially closed and deterministic, in both imperative and functional programming styles. But the new types of systems have rapidly become Probabilistic Systems.

With the planet-scale cloud of distributed services risk-driven models, from complexity and uncertainty theory, have taken center stage. With Machine Learning rapidly surpassing human experience in programming and research, we will ourselves have to model the real, human, world in a descriptiveprobabilistic way as part of those systems, observing and inferring, rather than imperatively defining the message flows between agents, and its consistency properties, the data flows between processors or structural limitations (think column- vs row-oriented data).

We don’t observe outside of the system, but as a part of it. Much like quantum physics, psychology or sociology (especially of power), told us. We humans are only agents receiving signals in this system. We inhabit a second-order Cybernetics Technosphere we call reality, built on platforms that define what they want and value economically in entirely new, sometimes alien ways, like real-time exchanges including spatial and social dimensions - with the gig- and experience economy only being the tip of the iceberg. It’s not a matrix or a mastermind though, it’s just a real, techno-human ecosystem with its own, uncontrollable evolutionary goals.

Despite me not liking the gig economy, I like the idea of evolutionary systems. Most writing focuses on the iterative process to avoid unpredictability, though, assuming some external, given, linearity, to incorporate feedback. That’s important for organizations, but it’s not mentioned where the feedback comes from: The goal, the adversary, the predator, or the local maximum.

I mainly work on the integration between real world and software, the hard end of mobile / ubiquitous / spatial / pervasive computing. Building independent, distributed service meshes, in a DevOps and Design Thinking (or DesignOps, whatever the cool kids call it these days) way. In such systems, you’re always going after the local maximum, the goal is unclear, more often than not multiple conflicting ones. You’re naturally dealing with (domain) verticals rather than horizontals. Every day has a new trade off between best experience possible and the technical realities. Those systems are not linear evolutions, they are mandalas of expanding and contracting system boundaries. They have to be observable, though, as with observation comes empathy, and with empathy learning.

In the future, we may refactor the parameters of our systems based on deeper insights about their non-deterministic behaviour. That's what I like about SRE-style work. It's the non-deterministic, the probabilistic part of software engineering. It focuses on observability and serviceability, and ML RCA’s involve explainability - correctness becomes an optimization goal, not an axiom. When I spoke about Spanner first time publicly in 2012, compared it to the twisted experience of time in movies like Spaceballs and The Hitchhiker's Guide to the Galaxy. The powerful takeaway is: Nothing is fixed, if we can reason about it, we can change it.

To understand the magnitude of change to our profession, we have to understand the societal context. Of all the possible futures, a dystopian scenario is interesting here - I shorten my version after watching Charly Stross’ talk from 34c3 which tells it better: In this scenario, the we is not a harmonic, transhuman, unity. The new we is us and algorithms from us, for us. A dark (in the sense of dark matter) singularity not of eternal life but thoughtlessmutual uncertainty, where biased algorithms and biased, dumbed down or even corrupt, people push each other further into the edges, not becoming market segments but mobs which reinforce themselves. The algorithm is as helpless as their users, because the society and economy around it require the entropy as fuel making regulation impossible.

Quick, personalized, adjustment, unlearning, or one-shot learning, can maybe avoid this scenario - it seems AI is already forgetting easier than us, controlling and optimizing itself faster than us, collaborating and sharing surprising insights nicer than us. In a "thinking fast, thinking slow" model, maybe the human is ought to become the "slow" part - as Nate Silver once put it "complementary roles that computer processing speed and human ingenuity can play in prediction". Instead of turning away, the human needs to be enabled to act.

Right now, we train machine learning systems by saying “yes” - soon we will reach the point where transfer is so good that we’ll start saying “no”. That “no” has to be slow enough, it has to be thoughtful, and it has to weigh more than assumed silent consent. We have to introduce reason, empathy and ethics, not only into individual machine learning models, but into the whole system that is driven by technology, into all of the information and the human organizational complex around it.

What excites me about working with systems from an observability, serviceability, and explainability perspective is that we can bring all of that rich knowledge from physics, psychology or sociology, hermeneutics, but also art in, and start reasoning about the overall behaviour, rather than deterministic, imperative requirements. We only have to keep talking to each other - and try to understand.