InfoQ Presentation: Scaling Slack — The Good, the Unexpected, and the Road Ahead
A long series of learning notes on infrastructure
Materials are taken from the presentation here. All credit goes to InfoQ and the presenter.
November 6, 2018
- Download full workspace model with all channels, users, emoji, etc.
- Establish real time websocket
- Push updates via websocket
- API calls for channel history, message edits, create channels, etc.
- Assign a workspace to a DB and MS shard at creation
- Metadata table lookup for each API request to route
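The per-request routing step above can be sketched as a simple metadata lookup. This is illustrative only; the table layout, field names, and workspace IDs are assumptions, not Slack's actual schema.

```python
# Metadata table: workspace id -> (database shard, message-server shard),
# assigned once when the workspace is created. (Hypothetical data.)
WORKSPACE_SHARDS = {
    "T123": {"db_shard": "db-7", "ms_shard": "ms-2"},
    "T456": {"db_shard": "db-3", "ms_shard": "ms-9"},
}

def route_api_request(workspace_id: str) -> dict:
    """Every API request first consults the metadata table to find
    the shards that own this workspace's data."""
    try:
        return WORKSPACE_SHARDS[workspace_id]
    except KeyError:
        raise LookupError(f"unknown workspace: {workspace_id}")
```

Because every workspace's data lives on exactly one pair of shards, a single large workspace can make its shard a hot spot, which is one of the challenges listed below.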
“Herd of Pets”
- DBs run in active/active pairs with application failover
- Service hosts are addressed in config and manually replaced
Challenges with Scaling Up
- Large organizations: Boot metadata download is slow and expensive
- Thundering Herd: Load to connect >> Load in steady state
- Hot spots: Overwhelm database hosts (mains and shards) and other systems
- Herd of Pets: Manual operation to replace specific servers
- Cross Workspace Channels: Need to change assumptions about partitioning
In computer science, the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win. All processes will compete for resources, possibly freezing the computer, until the herd is calmed down again. — Wikipedia
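A standard client-side mitigation for the reconnect-storm variant of this problem is jittered exponential backoff, which spreads a mass of simultaneous reconnects out in time. A minimal sketch (full-jitter style; the parameter values are assumptions, not Slack's):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: each client waits a random
    amount of time in [0, min(cap, base * 2**attempt)) before
    reconnecting, so a mass disconnect does not translate into a
    synchronized reconnect spike."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

With randomized delays, connection load ramps up gradually instead of arriving as one wave that far exceeds steady-state load.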
Thin Client Model
- Use Flannel Service — a globally distributed edge cache to cache metadata e.g. workspace model
- When clients connect, a routing tier at each edge routes them, with workspace affinity, to a Flannel cache that is likely to already have the model for that workspace
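Workspace affinity can be achieved by deterministically hashing the workspace ID onto the set of caches in an edge region, so every client from the same workspace lands on the same cache. A toy sketch under that assumption (the hostnames and hashing scheme are illustrative, not Flannel's actual routing logic):

```python
import hashlib

# Hypothetical Flannel cache hosts in one edge region.
EDGE_CACHES = ["flannel-1", "flannel-2", "flannel-3"]

def pick_cache(workspace_id: str, caches=EDGE_CACHES) -> str:
    """Deterministically map a workspace to one cache host, so all
    clients from that workspace hit a cache that already holds (or
    will soon hold) the workspace model."""
    digest = hashlib.sha256(workspace_id.encode()).digest()
    return caches[int.from_bytes(digest[:8], "big") % len(caches)]
```

The payoff is that the expensive boot-metadata download is served from a warm cache near the client instead of being recomputed from the backing databases for every connection.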
- Vitess: a fine-grained sharding and topology management solution that sits on top of MySQL.
- In Vitess, the application connects to a routing tier called VtGate. VtGate knows the set of tables that are in the configuration. It knows which set of backend servers are hosting those tables, and it knows which column a given table is sharded by.
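The routing idea can be sketched as hashing the sharding column to a point in a keyspace and finding the shard that owns the containing range. This is a deliberately simplified model, not Vitess's actual vindex implementation; the shard names loosely follow Vitess's hex-range naming convention:

```python
import hashlib

# Each shard owns a contiguous range of the hashed keyspace.
# Here the keyspace is one byte (0..255) split into two halves,
# named in the style of Vitess ranges. (Illustrative only.)
SHARDS = [("-80", 0x80), ("80-", 0x100)]  # (name, exclusive upper bound)

def keyspace_id(sharding_key: str) -> int:
    """Hash the sharding column's value to a keyspace position."""
    return hashlib.md5(sharding_key.encode()).digest()[0]

def pick_shard(sharding_key: str) -> str:
    """Route a query to the shard whose range contains the key,
    the way a VtGate-style router would."""
    kid = keyspace_id(sharding_key)
    for name, upper in SHARDS:
        if kid < upper:
            return name
    raise AssertionError("keyspace id out of range")
```

The application never needs to know the shard layout; it sends the query with the sharding key and the routing tier does the rest.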
The topology database consists of a local topology database, unique to a node, and a network topology database, whose entries are replicated across all network nodes in the same topology subnetwork. The topology database stores and maintains the nodes and the links (transmission groups or TGs) in the networks and their characteristics.
A component called the topology database manager (TDM) creates and maintains the topology database.
To enable a network node to provide routing functions to and from itself and the end nodes it serves, every network node maintains a network topology database that has complete and current topology information about the network. This topology information consists of the characteristics of all network nodes in the network and of all transmission groups (TGs) between network nodes. The end nodes in the network and the TGs connected to them are not considered network topology information.
- In Vitess, we have a single writable master and rely on Orchestrator, an open-source project out of GitHub, to manage failover by promoting a replica when the master fails.
Migrating to a channel-sharded / user-sharded data model helps mitigate hot spots for large teams and thundering herds:
- Retains MySQL at the core for developer / operations continuity
- More mature topology management and cluster expansion systems
- Data migrations that change the sharding model take a long time
Everything is a pub/sub “channel”, including message channels as well as workspace / user metadata channels.
- Clients / Flannel subscribe to updates for all relevant objects
- Each Message Service has dedicated clear roles and responsibilities
- Self-healing cluster orchestration to maintain availability
- Each user session now depends on many more servers being available
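The "everything is a channel" model above can be illustrated with a minimal in-memory pub/sub bus; the topic names and event shape here are assumptions for the sketch, not Slack's actual wire format:

```python
from collections import defaultdict

class PubSub:
    """Minimal in-memory pub/sub: message channels, workspace
    metadata, and user metadata are all just topics that a client
    (or a Flannel cache) subscribes to."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for cb in self.subscribers[topic]:
            cb(event)

bus = PubSub()
received = []
# A session subscribes to every object it cares about (hypothetical ids).
for topic in ("channel:C1", "workspace:T123", "user:U9"):
    bus.subscribe(topic, received.append)
bus.publish("user:U9", {"type": "user_change", "user": "U9"})
```

Unifying messages and metadata under one subscription model is what lets a thin client stay current by listening rather than re-downloading the workspace model.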
Herd of Pets to Service Mesh
Topology Management
For each of these projects (and more), the architecture evolved from hand-configured server hostnames to a discovery mesh.
- Enables self-registration and automatic cluster repair
- Adds reliance on service discovery infrastructure (consul)
- Led to changes in service ownership and on-call rotation
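Self-registration with health checking is the core mechanic behind the move away from hand-configured hostnames. The toy registry below sketches the idea with a TTL and heartbeats; Consul works similarly via registered health checks, but this is not Consul's actual API:

```python
import time

class ServiceRegistry:
    """Toy discovery registry: services self-register with a TTL and
    must heartbeat to stay listed. Stale entries simply drop out of
    the healthy set, enabling automatic cluster repair."""
    def __init__(self, ttl: float = 10.0):
        self.ttl = ttl
        self.entries = {}  # name -> (address, last_heartbeat_time)

    def register(self, name, address, now=None):
        self.entries[name] = (address, time.time() if now is None else now)

    def heartbeat(self, name, now=None):
        addr, _ = self.entries[name]
        self.entries[name] = (addr, time.time() if now is None else now)

    def healthy(self, now=None):
        now = time.time() if now is None else now
        return {n: a for n, (a, t) in self.entries.items()
                if now - t < self.ttl}
```

The trade-off listed above follows directly: clients discover servers through the registry, so the registry itself becomes critical infrastructure.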
As hard as it is to add new services into production under load, it has proven as hard, if not harder, to remove old ones.
- With few exceptions, all 2016 services still in production
- Need to support legacy clients and integrations
- Data migrations require application changes and take time