Sunday 31 May 2020

Observability, Debt or the Bret-Victor-Ization of distributed systems

I've been thinking about how the different way of conceptualizing cost (in the broader sense of investment) in the Cloud changes the tech debt metaphor. It was never a good metaphor to start with and allowed too many excuses, but I like the idea of somehow expressing a suboptimal, incomplete or leaky level of abstraction as a dialectical, critical tool. I like declarative systems because they allow comparison of state over time, but they do limit the expressiveness of our mental models, and of our omissions, when writing down code. Debt, with its pseudo-quantifiable touch, is such a mental model limit. No one wants to keep ADRs for each of these*. How to solve this?

Product-as-code

It's exciting to see how the next step in polyglot programming is taking stacks apart and designating a layer to developer experience for humans based on “progressive disclosure of complexity”, and how we argue for this with feedback time or, in other words, Software Delivery Performance. This is what a16z recently called The Decade of Design (but combined with craft and lean). The best example is probably Stripe, which shows that beautiful APIs and documentation and a beautiful website (and beautiful books) might after all correlate, maybe because they take empathy to heart (no surprises here for anyone who has seen or used PayPal).

I first saw this at Google when working with Dremel / F1 and all the Data Mesh tools around it: an ontological, rhizomatic approach to data - without oversight, yet with structure, as a side effect of, essentially, DDD (or set / category theory, referring to ontology below). I saw the same with Borg and the Service Mesh around it. Both were built as products with a builder focus, with "a builder" meaning universally anyone who wants to build something, that is, to contribute to a shared idea. When we say everything-as-code we need to go beyond engineering components: we, and that means everyone and the maximum of diverse perspectives, need to look at the product and all of its users. Similar to medical doctors who moved away from seeing "man as machine", or architects who moved away from seeing the city or the house as a machine, we are slowly moving away from seeing a software system as a perfectly controllable machine. Technology is not neutral; it is a constant process and struggle that goes far beyond engineering.


Understanding the product

With observability our systems become not only monitorable or measurable but tangible to every part of the ever-emergent (just as ontological) socio-technological system, and they give us a better feeling for time, helping to avoid biases like reification and Goodhart's law. In other words, it establishes a way to talk about the ontological difference between what we think our systems are and what they really are by leveraging category / set theory (maybe?) or OOO.

We can check those biases by seeing the effect of our ideas in the real world (the production system) and iterating on the real effect of systems: The Bret-Victor-Ization of distributed systems.

This is very much in the spirit of SRE's focus on user happiness or user pain - in other words, empathy. Systems are for everybody and need to involve everybody, all of their unexpected users (something real-world architecture is finally starting to do). It's worded a little technically, but "SLOs are the API for Your Engineering Team", where API essentially means OKR and SLO means promise to all users, is a catchphrase I like.

But what I really like about observability is that it also shows us there is not only one system - the one in the flowchart, in the architecture PowerPoint - but there are many, constantly emerging ones in the heads of all users. "Deep" systems are never fully up or down, and we humans merely hint intents at the complex, speculative, declarative system, which turns our relationship with the system from a fixed-state SDLC into, essentially, promise theory. Read Mark Burgess 2019:
It's shown that the concepts of statefulness or statelessness are artifacts of observational scale and causal bias towards functional evaluation. If we include feedback loops, recursion, and process convergence, which appear acausal to external observers, the arguments about (im)mutable state need to be modified in a scale-dependent way. In most cases the intended focus of such remarks is not terms like `statelessness' but process predictability. 
The good news is that we've learned a lot of concepts of deep systems from recent advances in Machine Learning (ML). In ML, Goodhart's law is called overfitting. What ML calls dropout regularization we call canary or A/B testing, embeddings can be understood as modularization or bounded context (let's not discuss whether that means sets / categories), and so forth. I'm not proposing to use ML to 'learn' the architecture of a product, but to understand the research in those techniques in order to come up with our own tools to explain the real structure of systems as a discursive, critical tool, not as ideology. Coming back to my experience with Dremel, I was stunned by how normal and easy it was to embed ML into it - for instance to test a hypothesis against a large body of logs by training an explainable random forest.

In that sense, a product is just a promise (an old marketing adage, like a product is a story that people personally want to be fulfilled). Likewise, quality is defined in the ISO sense as the sum of expected properties and the extent to which they are fulfilled in reality (and maybe how close marketing comes to that, or goes beyond). When we decide on a product, we instinctively trust it (with some bias), and we accept the promise of its quality and value. Internal understanding of a product is therefore like quality management. Unsurprisingly, SRE very much acts like the steward of quality expectations and of the promise of its value, the SLO. Which, I guess, is why ITIL recently introduced a Service Value System, too (I'm skipping the "Value Chain" here; it seems they couldn't completely detach from Taylorism / Fordism internally). Given that a product is a promise, and so are the relationships in our system, we can conclude that an ontological analysis of the whole system is possible - as long as everyone is included, the socio-technical system is open for participation, and we aim to reduce bias as much as possible.


Debt as promise

Recently I stumbled upon a fascinating analogy when reading three books in parallel, which surprisingly touched on different subjects in a similar way: the topic of debt - or the topic of promises.



The first book is Carlota Perez' "Technological Revolutions and Financial Capital", which I had read about and been recommended a few times, but I was afraid it would be too Schumpeter-hype-cycle-idealistic-technological-deterministic. I picked it up again after a post on debt in technology. Yes, it is clearly pre-financial-crisis, pre-precarious-employment and western, but with the open source world being predominantly western, it provides an interesting insight into how governments and societal changes drive consolidation, in particular when she talks about capital, debt and technology dualism. The second book is "Resistance and the politics of truth" by Iain MacKenzie, a random Google find when I was searching for how Deleuze's and Foucault's concepts of violence and power relate and how that influences bias in socio-technological systems, in particular how to transform systems from within using resistance instead of revolution or reform. The concept of bias and how it affects risk started with Barthes and Foucault and was popularised in our software craft by the Agile movement and Fairbanks; MacKenzie talks about how embracing uncertainty helps to counter systems based on the hegemonic truth of the "algorithmic control society". And finally the third book is Giorgio Agamben's "Creation and Anarchy", which I found when stumbling over Benjamin again in a Critical Theory book and wondering how he would relate to Max Weber**. Agamben explains how debt withholds anarchy and limits resistance, what Benjamin originally calls a "verschuldeten Kultus", meaning not only owing to but indebted to a cult of dependencies - the debt is the cult. Agamben writes that "capitalism has no beginning or foundation" because it is "the anarchy of power". If debt, or non-debt, becomes a cult, it tranquillizes, while freedom lies in "chance and uncertainty".

When reading Graeber's "Debt" long ago I remember nodding and agreeing that indeed it was debt, not money, that was created first, because debt is an obligation to reciprocity. In other words, debt is also just a promise - an asymmetric, slow one, with risk (cost) attached. So in an ontological sense we need to look at categories, not just sets, to model relations between events (like in OOO). The relation here is trust and risk, or happiness and pain in SRE teams. Debt is why tit-for-tat is a surprisingly efficient strategy in the iterated Prisoner's Dilemma. Especially when compared to Mauss' Gift as a form of community violence or power, which Graeber of course extended to the idea that the state is essentially enforcement of debt.
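
To make the reciprocity point a bit more tangible, here is a minimal sketch of tit-for-tat in a repeated Prisoner's Dilemma. The payoff numbers are the conventional textbook values, assumed purely for illustration, and all function names are hypothetical. Tit-for-tat starts by cooperating and then simply repays the last move it received, which is one way to read debt as a remembered promise.

```python
from typing import Callable, List, Tuple

# Assumed illustrative payoffs: (my_move, their_move) -> my score.
# "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

# A strategy only sees the moves the other side has made so far.
Strategy = Callable[[List[str]], str]


def tit_for_tat(opponent_history: List[str]) -> str:
    """Cooperate first, then repay whatever the opponent did last."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(opponent_history: List[str]) -> str:
    """Never honour the promise."""
    return "D"


def play(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int]:
    """Play repeated rounds and return the total scores of a and b."""
    seen_by_a: List[str] = []  # b's past moves
    seen_by_b: List[str] = []  # a's past moves
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = a(seen_by_a)
        move_b = b(seen_by_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        seen_by_a.append(move_b)
        seen_by_b.append(move_a)
    return score_a, score_b


if __name__ == "__main__":
    print("tit-for-tat vs tit-for-tat:", play(tit_for_tat, tit_for_tat))
    print("tit-for-tat vs always-defect:", play(tit_for_tat, always_defect))
    print("always-defect vs always-defect:", play(always_defect, always_defect))
```

Under these assumed payoffs, two reciprocating players end up far better off than two defectors, and a defector gains only a small one-off advantage over tit-for-tat before the relationship collapses into mutual defection.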

Coming back to Perez' model, that's when "unfulfilled promises had been piling up" after the hype cycle, which leads to speculation, a debt crisis, "power seeking", and finally regulation stepping in. Sounds to me like limiting emergence in software systems by creating an "Architecture" institution enforcing rules, which then cause less resilience and a worse user experience. In our case, that means our socio-technical systems have to optimize for user trust and reduce risk, ideally without enforcement from institutions, to avoid the "anarchy of power" of Agamben or the "power seeking" of Perez or MacKenzie's / Badiou's / Foucault's "politics of truth", which are the first steps towards a complete loss of faith. That's exactly what SREs and good open source projects do: help make the system more resilient from within. SRE introduces the chance for resistance using error budgets. SREs embrace uncertainty by not getting trapped by queuing. There is not one truth in an SRE system, there is only the art of influence, of user empathy. Systems decay, but engineers are not blamed for debt. Instead of a debt/credit metaphor, toil is used to express a relationship between people. Following SRE is architecting without architects.
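
For readers who have not worked with error budgets, the arithmetic behind them is small enough to sketch. This is a simplified, request-based reading under assumed numbers; the names `Window`, `allowed_failures`, `budget_remaining` and `can_take_risk` are hypothetical and not taken from any SRE tooling.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one SLO window (for example 30 days)."""
    total_requests: int
    failed_requests: int


def allowed_failures(slo_target: float, window: Window) -> float:
    """Failures the SLO tolerates in this window, e.g. 0.1% of requests at 99.9%."""
    return (1.0 - slo_target) * window.total_requests


def budget_remaining(slo_target: float, window: Window) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is broken."""
    allowed = allowed_failures(slo_target, window)
    if allowed <= 0:
        # No traffic or a 100% target leaves nothing to spend.
        return 0.0
    return 1.0 - window.failed_requests / allowed


def can_take_risk(slo_target: float, window: Window) -> bool:
    """The promise is still being kept, so there is room for risky change."""
    return budget_remaining(slo_target, window) > 0


if __name__ == "__main__":
    calm = Window(total_requests=1_000_000, failed_requests=250)
    rough = Window(total_requests=1_000_000, failed_requests=1_200)
    print(budget_remaining(0.999, calm), can_take_risk(0.999, calm))
    print(budget_remaining(0.999, rough), can_take_risk(0.999, rough))
```

With a 99.9% target and a million requests, roughly a thousand failures are tolerated. The first window has spent about a quarter of that, while the second has overspent, which is the signal to slow down releases rather than to assign blame.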

What does that mean for tech debt in general? Is tech debt a good thing, a bad thing? Does it always have to be resolved, or only inventoried? Who gets to define what counts as debt and what as quality, and why does that person or institution have power over this decision?

Given my learnings above, I am careful about enforcing resolution of tech debt, especially when there is no clear user happiness or pain impact. User trust debt needs to be resolved first, and that includes resilience debt, supportability debt or brand debt. Actually I don't like the term debt, because it implies it 'belongs' to someone; it reinforces the hegemonic power. A decision, a tradeoff, toil can be documented without implying a positive or negative value. The point has to be to resist this power position, and establish a culture of caring for the user and for the system. This requires a degree of uncertainty to allow creativity. A system that lacks the trust of its users, including operations, for instance because of failing or malfunctioning releases or features, enforces wrong incentives. It will lead to power grabs and survival of the fittest, to a focus on meaningless metrics to hold onto positions of power, and it loses track of the original story and credibility that your users wanted to be fulfilled. Therefore everyone has to be able to observe the system from all of its angles, and that observability has to be transparent and democratic for all users, not just operators or developers. Only this way can a consensus be formed on which elements risk user trust, at which level, and where uncertainty is too much.

That's why I am careful about advising Design Thinking as a general remedy. This instance of Lean favours very fast feedback cycles and is therefore well suited for the fast attention economy. User empathy, and empathy for each other, and understanding are not fast, they are not as quantifiable as debt, but more importantly they are not engineered. They require non-engineering users to have a say. There are, at best, proxy metrics for such feedback. That is an assumption which Design Thinking, with its legacy from Cybernetics and Control Theory, barely hides. Tech debt fits Design Thinking because it's an output: it gives the impression that a prototype is fine as long as revenue is higher than debt - but that's not always true. Trust, sustainability and empathy cannot be offset by debt. I like feedback loops, but prefer to investigate larger systemic feedback and second-order thinking over direct control loops - in other words observing and resisting rather than exerting power. Resilience comes from resisting the temptation to move fast and break things, seeing what flow the socio-technological system can sustain, and doing what's best for the user, what fulfils the product promise.


*) I've recently been building a hacky prototype of how pairing conversations could be stored alongside code. I got the idea from speech-to-text combined with an ELIZA-like interface to drill down based on questions rather than free association, where the questions are based on NLP keyphrase extraction.


**) Since reading Max Weber, I have been fascinated by the similarities between Puritan and Folk Chinese / Taoist idealist promises of a luck- and bias-free meritocracy and the equivalence of prosperity, fortune and status, which continuously biases the system towards the more powerful. In particular the similarity in manicured meritocracy between the Anglo-Saxon University System and the Imperial Exams, which still cast a mythological shadow on the NCEE.
