From RPC to transactions and durable executions

I spent some time reading about “Durable Execution Engines” (e.g. Temporal) and explored possible connections to earlier concepts like database transactions, distributed transactions, and building RPC/microservice-based systems in a fault-tolerant manner. In this post I’ll try to summarize some of my learnings. How useful it is will depend on how much of this you already know! Among other things, I relied on these great overviews: The Modern Transactional Stack (by some a16z folks) and What is Durable Execution?. What’s fascinating about this topic is that both the problem domain and the solution space are quite large and thus offer a range of options.

A simple place to start is this talk called Six Little Lines of Fail by Jimmy Bogard, from many years back. He gives a compelling example of how writing even a single small function that updates a database, then sends an email and a separate message to an event bus, is full of pitfalls. Any of those parts could fail, you might have to retry (how many times? how far apart?), or roll back previous actions and so on. As a spoiler, many of these pain points get addressed in a great introductory blog post about Temporal, called Temporal: the iPhone of System Design.
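To make the failure modes concrete, here is a minimal sketch of the scenario (all class and function names are hypothetical, not from the talk): three side effects that cannot be made atomic together, with fakes standing in for a real database, SMTP server, and event bus.

```python
# Hypothetical sketch of the "six little lines of fail" scenario.

class TransientError(Exception):
    pass

class FakeDB:
    def __init__(self):
        self.rows = []
    def save(self, order_id):
        self.rows.append(order_id)            # commits immediately

class FlakyMailer:
    def send(self, to, body):
        raise TransientError("SMTP timeout")  # fails AFTER the DB commit

class FakeBus:
    def __init__(self):
        self.events = []
    def publish(self, topic, payload):
        self.events.append((topic, payload))

def approve_order(order_id, email, db, mailer, bus):
    db.save(order_id)                         # 1. database write commits...
    mailer.send(email, "Order approved")      # 2. ...then the email times out...
    bus.publish("OrderApproved", order_id)    # 3. ...so this never runs

db, mailer, bus = FakeDB(), FlakyMailer(), FakeBus()
try:
    approve_order(42, "a@example.com", db, mailer, bus)
except TransientError:
    pass  # now what? retry (duplicate emails?), roll back the row, crash?
```

After the failure, the database row exists but downstream consumers were never told: exactly the inconsistent middle state the talk dwells on.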

The Jimmy Bogard talk references an old, great article called Your coffee shop doesn’t use two phase commit. Its author also wrote the famous book Enterprise Integration Patterns (EIP). In the world of EIP, the solutions to handling function calls across services touch upon event sourcing, the “Saga Pattern”, compensating transactions, and so on (more on this later).

But chronologically, the path to “durable execution” seems to have meandered through “distributed transactions”. Though not used much these days, the idea was an extension of database transactions to handle, say, two or more different nodes of an RDBMS, or two or more entirely different RDBMSes, or an RDBMS and a message queue… and so on. A great resource for the evolution of transactions to their distributed variants (and beyond) is the chapter from the book Designing Data Intensive Applications (DDIA). The main algorithm (as far as I know) is the Two-Phase Commit (2PC) algorithm/protocol. In the Java world – and probably in many other places – you are likely to encounter distributed transactions via the XA Standard (spec pdf) for 2PC, which, curiously, originated from vendors of UNIX! Quoting from that spec, “The January 1987 edition of the X/Open Portability Guide committed X/Open to standardise facilities by which commercial applications could achieve distributed transaction processing (DTP) on UNIX systems.” That spec is a collection of C header files and C function definitions that compliant systems are expected to implement. Its Java counterpart is JTA (Java Transaction API), which can be enabled transparently in Spring.

But what was the catch with “distributed transactions”? Well, it’s essentially algorithmic. To ensure that multiple independent systems all succeed or fail together in every possible scenario, they must not only save temporary information to disk (the PREPARE phase); once they’ve PREPARE-d, they have to wait arbitrarily long until a proper COMMIT/ROLLBACK command is received. This “arbitrarily long” might include crashes of just about everything: any of the systems themselves, or the transaction coordinator. It’s like they’re stuck in a game of “statues”! Not just that: if you care about data integrity, you’d probably have let each system exclusively lock some tables/rows/other resources for the duration of the transaction. So a crash can bring down your availability drastically. In a great section titled Distributed Transactions in Practice, the DDIA book concludes, “Distributed transactions thus have a tendency of amplifying failures, which runs counter to our goal of building fault-tolerant systems”! There’s an easy-to-follow lecture titled Distributed Transactions from MIT, where the professor demonstrates the drawbacks of 2PC well (towards the end).
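The blocking problem shows up even in a toy model of the protocol (this is a sketch of 2PC’s control flow, not the XA API; all names are made up): a participant that has voted “yes” can neither commit nor abort on its own if the coordinator disappears.

```python
# Toy two-phase commit: illustrates why a prepared participant is stuck
# ("in doubt"), holding its locks, until the coordinator decides.

class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "idle"      # idle -> prepared -> committed/aborted

    def prepare(self):
        # Phase 1: durably stage the work, hold locks, vote yes.
        self.state = "prepared"
        return True

    def commit(self):
        assert self.state == "prepared"
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants, coordinator_alive=True):
    # Phase 1: collect votes from everyone.
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"
    # Between phases every participant is "in doubt". If the coordinator
    # crashes here, they may not unilaterally commit OR abort.
    if not coordinator_alive:
        return "in doubt"
    # Phase 2: broadcast the decision.
    for p in participants:
        p.commit()
    return "committed"

happy = [Participant("db"), Participant("queue")]
stuck = [Participant("db"), Participant("queue")]
outcome_ok = two_phase_commit(happy)
outcome_crash = two_phase_commit(stuck, coordinator_alive=False)
```

In the crash case, both participants end the run still in the `prepared` state, which is precisely the lock-holding limbo the DDIA quote complains about.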

An aside: On transactions
I got curious about database transactions in general, and skimmed one of the original papers on the topic, from 1981, by Jim Gray: The Transaction Concept: Virtues and Limitations. This paper describes the basics of the transaction concept and a couple of ways to implement it, but, interestingly, it also brings up the ideas of distributed transactions, “long lived transactions” (basically what we would call workflows today) and nested transactions. Long lived transactions are explored with the example of going on a trip, and making sure your travel agent books a flight, a hotel and a car (and not just some of these!). He suggests settling for a lower level of consistency, and the concept of a “compensating transaction” – say, cancelling the hotel if a flight can’t be found. The fact that “enterprise integration” patterns twenty years later used these same examples and patterns, and that today’s durable execution engines forty years later still do, points to how fundamentally hard these problems are. There’s a small but good para at the end called “Integration with Programming languages”, where he suggests that transactions could happen transparently behind the scenes if a programming language supported methods like BEGIN and ABORT on an object, backed by durable logs. Hold on to this thought for now!
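Gray’s travel example maps directly onto what we now call a saga: run the steps in order and, on failure, run compensating actions for the already-completed steps in reverse. A toy sketch (all names made up for illustration):

```python
# Gray's trip-booking example as a saga: each step pairs an action with
# a compensation; on failure, completed steps are undone in reverse.

class BookingFailed(Exception):
    pass

def run_saga(steps):
    """steps: list of (action, compensation) pairs, tried in order."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except BookingFailed:
        # Unwind: compensate completed steps, most recent first.
        for compensation in reversed(done):
            compensation()
        return "compensated"
    return "booked"

log = []

def no_flights():
    raise BookingFailed("no seats left")

# A trip where the flight booking fails after hotel and car succeeded:
trip = [
    (lambda: log.append("hotel booked"), lambda: log.append("hotel cancelled")),
    (lambda: log.append("car booked"),   lambda: log.append("car cancelled")),
    (no_flights,                         lambda: None),
]
result = run_saga(trip)
```

Note what we gave up relative to a real transaction: the hotel and car bookings were visible to the outside world before being cancelled. That is the “lower level of consistency” Gray is talking about.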

The JTA and Java Activity Service strands
Given that all these patterns were around, and distributed and long-lived transactions had been spoken about for decades, I checked whether the Java world ever took this concept any further within its JSR standardization process. Indeed, in a terrific article called A History of Extended Transactions, Mark Little goes over the advancements attempted in the Java world. The main one was JSR-95: Activity Service for Extended Transactions, of which he was a co-author. It defined interfaces that could be used to implement long-running and nested transactions, including Activities, Signals and SignalSets. One of the examples cited in that spec? A travel agent booking a trip that includes a hotel, car, etc.! It seems that IBM, who proposed the standard, were the only ones to do a reference implementation (within WebSphere); you can find implementations of the UserActivity, ActivityManager, etc. interfaces within their package. Mark Little also co-authored an entire book on Java Transactions.

However, he admits in that article that, “Unfortunately, despite it also being adopted with J2EE, there were few opportunities for using extended transactions in these environments”. He expresses hope (this was 2006) that upcoming Web Service related standards might have better luck in getting these ideas adopted!

The (failed?) Web Service standards
If the APIs for “extended transactions” weren’t taking off within the Java world, I don’t know why people assumed they’d do better with Web Services – which are more loosely coupled and run by different companies on diverse tech stacks, interacting over HTTP using XML, SOAP, etc. But the first decade of the 2000s was generally full of optimism/hype about the effectiveness of XML+SOAP based “web services”.

Anyway, roughly the same set of people were involved as in JSR-95, and this gave rise to the WS-AtomicTransaction and, further, WS-BusinessActivity standards. I’m not sure how successful (or not) these were, but for sure there were critics galore. Tim Bray writes in one of his (many, many critical) blog posts from that era about these endless (bloated?) standards regarding web services. An MSDN (archive) link from that time shows the transaction-related specs. It is possible that these were intended to eventually work with workflow engines that implemented WS-Business Process Execution Language (like BizTalk, jBPM, Camunda)… but I can’t be sure.

A quick word about .NET
Since .NET was also heavily used for enterprise applications and for web services, the same ideas appear to have surfaced there (I’m barely familiar with .NET). The XA implementation was Microsoft’s Distributed Transaction Coordinator (DTC). One rant about DTC points mainly to the inherent issues with the 2PC protocol/algorithm, which no API or standard can address. I should highlight here what looks like a slick piece of work: NServiceBus, which is an implementation/packaging of common enterprise integration patterns. It works with multiple queueing systems, and bundles patterns like Outbox, Saga, etc. They have (or had?) a nice Distributed Systems Design course as well.

Mix and match approaches
Given that heavyweight standards didn’t find adoption, it was left to developers to adopt the ideas piecemeal. The microservices boom of the last decade led to libraries like Resilience4j, Failsafe, Netflix Hystrix (deprecated) and Spring’s Circuit Breaker. Probably all of these co-existed with company-specific implementations of EIP’s ideas in some shape or form, and formed the state of the art for backend development. Frameworks like Spring Integration and Apache Camel also exist.

Even if you eschew RPC and rely heavily on events/messages and corresponding asynchronous handlers, you may still have design-time coupling between your services, as discussed in depth in this conversation. That conversation also touches upon the terms “orchestration” and “choreography”. The former tends to be synchronous (though not a must) but is definitely characterized by one service commanding another to take a specific action; the latter hews closer to event sourcing, where services respond to past events that they are interested in, rather than outright commands/requests.
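Choreography can be sketched in miniature (hypothetical topics and handlers): no service commands another; each just reacts to events it has subscribed to, and the overall flow emerges from the chain of reactions. Orchestration, by contrast, would be one function invoking the payment and shipping services directly, in order.

```python
# Choreography in miniature: a tiny in-process event bus where
# "services" (here, plain handlers) react to past events, not commands.

events_seen = []
subscribers = {}

def subscribe(topic, handler):
    subscribers.setdefault(topic, []).append(handler)

def publish(topic):
    events_seen.append(topic)
    for handler in subscribers.get(topic, []):
        handler()

# Payment reacts to OrderPlaced; shipping reacts to PaymentTaken.
# Neither knows about the other -- the flow emerges from subscriptions.
subscribe("OrderPlaced", lambda: publish("PaymentTaken"))
subscribe("PaymentTaken", lambda: publish("OrderShipped"))

publish("OrderPlaced")
```

The upside is loose runtime coupling; the downside, as the conversation above notes, is that the end-to-end flow lives nowhere in particular, and design-time coupling can persist in the event schemas themselves.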

A para from Golem’s blog post is so apt: “The hidden pain of modern cloud app development is that a significant part of the cost of developing, maintaining, and operating applications is compensating for unreliability, caused by our shift from centralized state to distributed state. Many patterns we reach for by default these days, including event-sourcing, state machines, and CQRS, exist as compensatory mechanisms for the unreliability of application logic.”

Today’s Durable Execution engines
Systems like Temporal (or Restate, Flawless, DBOS, Golem) are standalone pieces of software that try to hide much of the complexity related to fault tolerance and asynchronicity. But you can’t forget the cloud providers themselves! I’m sure they each have an offering in this space. One of the Temporal founders’ own journey traversed the world of AWS SWF (its modern variant is Step Functions), Azure Durable Functions, and Uber’s Cadence (open source).

If you want to quickly come up to speed on much of the history presented in my article and the current baseline, Max’s 2023 keynote talk at Temporal’s conference is a good one. He says it’s a paradigm shift to start thinking in terms of “durable execution” instead of events (although of course it’s backed by event sourcing behind the scenes), and leads to less design-time coupling across services.

The baseline is that all these systems roughly agree on the need to make it easy for a developer to interact with many (micro)services in a fault-tolerant manner. They all give up on 2PC/distributed transactions, and eschew extreme (or any?) standardization. But the solution space is still large enough to lead to different design choices. I’ll just call out a few here. This is not a thorough comparison by any means!

Jim Gray’s old question – should transactions be integrated into the programming language? – comes back with a vengeance! At one extreme is Autokitteh, which modifies your Python program’s AST and generates Temporal Activities behind the scenes! Otherwise, in Temporal, you have to explicitly split all your logic into either Activities or Workflows – each of which has to adhere to different semantics. There’s a central Temporal server, and individual workers which all work off a common event history and ensure durable execution. You could also obtain durable execution through a library that you link into your program (much like Hibernate): you annotate the relevant functions appropriately, and the library creates an event log behind the scenes. This is the approach DBOS aims for: it launches no external processes beyond those that comprise your program itself, and uses Postgres as the event store (the company was founded by one of the original creators of Postgres).
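The library idea can be sketched in a few lines: wrap each side-effecting function so its result is appended to a log, and after a crash, replay recorded results instead of re-running the effect. This is a toy illustration of the general technique, not the DBOS or Temporal API (real engines persist the log durably, e.g. in Postgres, rather than in memory):

```python
# Toy "durable execution": a decorator records each step's result in a
# log; on re-execution after a (simulated) crash, recorded results are
# replayed so side effects run exactly once.

class DurableLog:
    def __init__(self):
        self.events = []
        self.cursor = 0
    def restart(self):
        self.cursor = 0          # simulate a process restart: replay from the top

log = DurableLog()

def durable_step(fn):
    def wrapper(*args):
        if log.cursor < len(log.events):   # replaying: skip the side effect
            result = log.events[log.cursor]
            log.cursor += 1
            return result
        result = fn(*args)                 # first run: execute and record
        log.events.append(result)
        log.cursor += 1
        return result
    return wrapper

charges = []

@durable_step
def charge_card(amount):
    charges.append(amount)                 # the effect we must not duplicate
    return f"charged {amount}"

@durable_step
def send_receipt(to):
    return f"receipt sent to {to}"

def workflow():
    a = charge_card(100)
    b = send_receipt("a@example.com")
    return [a, b]

first = workflow()
log.restart()          # "crash" and re-run the whole workflow
second = workflow()
```

The workflow function re-executes from the top after the restart, yet the card is charged only once: the replayed log stands in for the steps that already happened. This is also why engines like Temporal demand that workflow code itself be deterministic.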

Restate uses an event log rather than a database to store the details of the RPC calls, and that probably helps it achieve higher speeds (source: a HN comment by one of their developers). In a way, I agree that a raw event log might be a better starting point for event sourcing compared to a database, but it’s also a matter of engineering and performance optimization. Restate shares some developer history and design with Apache Flink’s Stateful Functions. Restate also has a “push” model, in that it invokes your “handlers” or functions with the parameters from a specific event, unlike Temporal, which has long-running workers that poll the event history (and related queues) for tasks to execute.

Which brings us to the serverless/workers design choice. With Temporal, your handler can’t be “serverless”, e.g. an AWS Lambda – though that seems to be a possibility for the future, and a CNCF-backed specification seems to be emerging in that space. Also, the concept of a long-running workflow is fundamentally antithetical to short-lived Lambdas. In one talk at a Temporal conference, some ways to make Temporal work with Azure Durable Functions are presented.

Thus, as the requirements for “durable execution” across services keep expanding, and if you add enterprise requirements like security, authorization, and needing different teams to talk to each other easily, the demands on the IPC (inter-process communication) mechanism begin to grow as well. Hence Temporal now has Nexus, which is built on top of Nexus RPC: “Nexus RPC is a protocol designed with durable execution in mind. It supports arbitrary-duration Operations that extend beyond a traditional RPC — a key underpinning to connect durable executions within and across Namespaces, clusters, regions, and cloud boundaries”
