adr

Journal of Architectural Decision Records made for the project

Use NATS as the foundation for lattice

Status

Accepted

Context and Problem Statement

Lattice is the self-healing, self-forming network extension to the WebAssembly actor runtime host. Lattice needs to be able to create flattened network topologies regardless of the number of intervening hops, routers, gateways, clouds, and physical infrastructures. As the core of the networking infrastructure, whatever lattice is built with must be boring¹, flexible, self-maintaining, secure, reliable, fast, and easy to use.

Decision Drivers

Impact on developers (for good or bad)
Impact on operations and support
Cost of running, cost of associated infrastructure
Reliability, Stability, Performance, and other quantifiable ratings
Extensibility and support for future “unknown” use cases
Developer, support, and open-source community size and engagement level

Considered Options

NATS
Apache Kafka
RabbitMQ
Redis
Write our Own
Cloud-Coupled/Proprietary (SQS, Google, Azure, etc)

Decision Outcome

Chosen option: NATS, because it was by far the simplest yet most powerful, flexible, easy-to-run, and easy-to-support distributed messaging system evaluated. The power and flexibility brought to bear by “leaf nodes”² alone is a compelling enough argument to use NATS.

Positive Consequences

A truly powerful foundational building block on which we can build all kinds of enhancements and features
Leaf nodes
Access to NGS with the same set of APIs, libraries, and primitives as used internally
- Leaf nodes + NGS == <3
Drop-dead simple developer experience (just start the nats-server binary)
Amazing decentralized, account-based, multi-tenant security system that originally inspired our use of JWTs in wascap for signing actor modules.
Streaming is available as an easy add-on if we want it, but won’t get in the way if we don’t
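
The account-based security model listed above can be illustrated with a toy sketch of its issuer chain: an operator issues account tokens, and an account issues user tokens. Everything named here is an assumption made for illustration, including the key strings, the `fake_token` helper, and the placeholder header and signature. Real NATS JWTs are signed with Ed25519 nkeys; this sketch skips signature verification and signing keys entirely and only shows how the `iss` and `sub` claims link operator, account, and user.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def fake_token(claims: dict) -> str:
    """Hypothetical helper: build an UNSIGNED three-part token for the demo."""
    header = b64url_encode(json.dumps({"typ": "JWT", "alg": "none"}).encode())
    body = b64url_encode(json.dumps(claims).encode())
    return f"{header}.{body}.{b64url_encode(b'not-a-real-signature')}"


def decode_claims(token: str) -> dict:
    """Read the claims segment WITHOUT verifying the signature."""
    _header, claims, _signature = token.split(".")
    return json.loads(b64url_decode(claims))


def issuer_chain_ok(operator_key: str, account_jwt: str, user_jwt: str) -> bool:
    """Check that the operator issued the account and the account issued the user."""
    account = decode_claims(account_jwt)
    user = decode_claims(user_jwt)
    return account["iss"] == operator_key and user["iss"] == account["sub"]


# Placeholder identifiers; real nkeys are encoded public keys, not these strings.
operator = "O_EXAMPLE_OPERATOR"
account_jwt = fake_token({"iss": operator, "sub": "A_EXAMPLE_TENANT"})
user_jwt = fake_token({"iss": "A_EXAMPLE_TENANT", "sub": "U_EXAMPLE_USER"})
print(issuer_chain_ok(operator, account_jwt, user_jwt))
```

The point of the sketch is that each tenant's identity is carried in tokens that reference only public identifiers, so nothing in the chain requires a server to hold a private key.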

Negative Consequences

It is possible that, by choosing NATS, we might miss out on some powerful persistence technology available in a heavy-duty (possible reading: bloated) broker, but at this point we don’t think our needs warrant the additional baggage.

NATS is also not quite as well-known as Apache Kafka and RabbitMQ, though we think it should be. We might have to defend our decision to use NATS more often than if we had picked a different broker, but it’s worth it.

Pros and Cons of the Options

NATS

NATS is a small, lightweight message broker that has been designed from the very beginning to be low-maintenance, easy to configure, flexible, and fast. It remains one of the most “cloud native” messaging systems we’ve encountered.

Good, because “it just works”
- In use in production without a single moment of downtime for some extremely critical use cases.
Good, because as of 2.0 it offers a decentralized security model that is ideal for federation, multi-tenancy, and giving tenants power and flexibility.
- This same decentralized model means an entire NATS cluster can be compromised without the loss of a single private key, because the servers hold only public keys and signed JWTs
Good, because it uses a simple protocol that doesn’t force us to use specific schemas or serialization patterns.
Good, because of leaf nodes and interoperability with NGS.
Good, because of broad community acceptance, use, and support.
- NATS is a part of the CNCF and is popular in that community.
Good, because streaming and the complexity that comes with it is opt-in. We don’t bear that burden until/unless we need it.
Good, because it’s incredibly easy to run during development and manage in production
Bad, because we might someday decide we need the ultra-heavy streaming support of something like Kafka (honestly, this is a stretch, and we doubt we’ll need it in the foreseeable future)
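
To make the "simple protocol" point above concrete, here is a minimal sketch of how a publish is framed in the NATS text protocol: a `PUB` control line carrying the subject, an optional reply subject, and the payload size in bytes, followed by the payload itself. The function name `frame_pub` and the subjects used are assumptions for illustration, not part of lattice or any NATS client library. The payload is treated as opaque bytes, which is why no schema or serialization format is imposed.

```python
from typing import Optional

CRLF = b"\r\n"


def frame_pub(subject: str, payload: bytes, reply_to: Optional[str] = None) -> bytes:
    """Build one PUB frame: control line, CRLF, opaque payload, CRLF."""
    if not subject or any(ch.isspace() for ch in subject):
        raise ValueError("subject must be non-empty and contain no whitespace")
    parts = ["PUB", subject]
    if reply_to:
        parts.append(reply_to)
    parts.append(str(len(payload)))  # size is counted in bytes, not characters
    # This sketch restricts the control line to ASCII for simplicity.
    return " ".join(parts).encode("ascii") + CRLF + payload + CRLF


# Hypothetical subjects; JSON and raw binary are framed identically.
json_frame = frame_pub("lattice.events", b'{"kind":"started"}')
binary_frame = frame_pub("lattice.blobs", bytes([0, 255]), reply_to="_INBOX.1")
print(json_frame)
print(binary_frame)
```

A real client also handles the connection handshake, subscriptions, and keep-alives, none of which are shown here.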

RabbitMQ

RabbitMQ is written in Erlang and is used to support all kinds of incredibly large-scale production workloads. “Nobody would get fired” for picking Rabbit.

Good, because you can programmatically control queues and partitions
Good, because it is relatively lightweight and easy to run in development
Bad, because it still requires post-installation configuration and management through its API or console
Bad, because it has a relatively limited/narrow security story

Apache Kafka

Kafka is the 800-pound gorilla of the brokers we evaluated. It has a nearly endless list of extensions and integrations with a huge range of ecosystems. It has broad support and is running in production at just about every scale and configuration imaginable.

Good, because it is one of the de-facto choices for message brokers/streaming systems
Good, because it has robust streaming and persistence capabilities
Good, because it has well-known scaling characteristics
Bad, because someone has to manually perform stream repartitioning to adapt to new scaling patterns (this goes against the “hands off” and “boring” maintenance needs)
Bad, because it’s a huge pain to install and manage locally, and just as hard to run well in production. Without the right staff, we’d need a hosted solution to take on that burden.
Bad, because the “embarrassment of riches” of options, configurations, plug-ins, and everything else makes it hard for developers to work with.
Bad, because though everything is pluggable, its security model isn’t as powerful as what we want (multi-tenancy isn’t a native concept without add-ons)

Redis

In recent years Redis has become a kitchen sink of services, providing far more than just a distributed key-value store with optional persistence.

Good, because you get a bunch of stuff “for free”, like the aforementioned key-value store and access to a community of extensions and plugins (e.g. graph databases, geo support, etc)
Good, because Redis is easy for developers to use. The server “just works” locally and is usually relatively low maintenance in production
Bad, because the message broker part (channels) is a relatively simple “add-on” to Redis
Bad, because the security model isn’t decentralized or as robust as we’d like
Bad, because it’s too easy (and tempting) to put state and messaging in the same place, which can cause problems given the aforementioned security concerns.

Write our Own

Included here as an option for the sake of showing all possibilities. This would involve us creating our own networking code, likely starting from low-level TCP/UDP primitives and building up from there.

Good, because we could potentially have the most control over networking behavior compared to all other message brokers
Bad, because our hedgehog is not making message brokers, it’s making a distributed actor system that runs WebAssembly actors.
Bad, because it would be foolish to think that we could create something on our own that rivals any of the other options in this list
Bad, because even if we had the skills and resources to create our own “lattice from scratch”, it would take so much time that we would have to drop everything from our backlog to accommodate the effort.

Cloud-Coupled/Proprietary

This is only included for the sake of completeness. We only briefly considered this before deciding against it.

Links

NATS
Synadia, the company that hosts NGS, a “global messaging dial-tone”
Whitepaper: A Study on Modern Messaging Systems: Kafka, RabbitMQ, and NATS Streaming

¹ From the article, “Technology for its own sake is snake oil”
² https://docs.nats.io/nats-server/configuration/leafnodes