InfoQ Presentation — Scaling Slack — The Good, the Unexpected, and the Road Ahead

E.Y.
4 min read · Sep 19, 2021


A long series of learning notes on infrastructure


Materials are taken from the presentation here. All credit goes to InfoQ and the presenter.

November 6, 2018

Previously

Application

Initial login:

  • Download full workspace model with all channels, users, emoji, etc.
  • Establish real time websocket

While connected:

  • Push updates via websocket
  • API calls for channel history, message edits, channel creation, etc. (flow sketched below)

Database

Workspace Sharding

  • Assign a workspace to a database (DB) shard and a message server (MS) shard at creation
  • Look up a metadata table on each API request to route it to the assigned shard (sketched below)

“Herd of Pets”

  • DBs run in active/active pairs with application failover
  • Service hosts are addressed in config and manually replaced

Challenges with Scaling Up

  • Large organizations: Boot metadata download is slow and expensive
  • Thundering Herd: Load to connect >> Load in steady state
  • Hot spots: Overwhelm database hosts (mains and shards) and other systems
  • Herd of Pets: Manual operation to replace specific servers
  • Cross Workspace Channels: Need to change assumptions about partitioning

In computer science, the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win. All processes will compete for resources, possibly freezing the computer, until the herd is calmed down again. — wiki

Solutions

Thin Client Model

  • Use the Flannel service, a globally distributed edge cache, to serve metadata such as the workspace model
  • When a client connects, a routing tier at each edge point of presence routes it, with workspace affinity, to a Flannel cache that likely already holds the model for that workspace (see the sketch below)

Vitess

  • A fine-grained sharding and topology management solution that sits on top of MySQL.
  • In Vitess, the application connects to a routing tier called VtGate. VtGate knows the set of tables that are in the configuration. It knows which set of backend servers are hosting those tables, and it knows which column a given table is sharded by.

Topology database

The topology database consists of a local topology database, unique to a node, and a network topology database, whose entries are replicated across all network nodes in the same topology subnetwork. The topology database stores and maintains the nodes and the links (transmission groups or TGs) in the networks and their characteristics.

A component called the topology database manager (TDM) creates and maintains the topology database.

To enable a network node to provide routing functions to and from itself and the end nodes it serves, every network node maintains a network topology database that has complete and current topology information about the network. This topology information consists of the characteristics of all network nodes in the network and of all transmission groups (TGs) between network nodes. The end nodes in the network and the TGs connected to them are not considered network topology information.

— IBM

  • In Vitess, we have a single writable master and rely on Orchestrator, an open-source project out of GitHub, to manage failover: it promotes a replica when the master fails.

Migrating to a channel-sharded / user-sharded data model helps mitigate hot spots for large teams and thundering herds (sketched after this list):

  • Retains MySQL at the core for developer / operations continuity
  • More mature topology management and cluster expansion systems
  • Data migrations that change the sharding model take a long time

Service Decomposition

Everything is a pub/sub “channel”, including message channels as well as workspace / user metadata channels (see the sketch after this list).

  • Clients / Flannel subscribe to updates for all relevant objects
  • Each message service has clear, dedicated roles and responsibilities
  • Self-healing cluster orchestration to maintain availability
  • Each user session now depends on many more servers being available

Themes

Herd of Pets to Service Mesh

Topology Management

For each of these projects (and more), the architecture evolved from hand-configured server hostnames to a discovery mesh.

  • Enables self-registration and automatic cluster repair
  • Adds reliance on service discovery infrastructure (Consul)
  • Led to changes in service ownership and on-call rotation
https://www.consul.io/

Deprecation Challenges

As hard as it is to add new services into production under load, it has proven as hard, if not harder, to remove old ones.

  • With few exceptions, all of the 2016 services are still in production
  • Need to support legacy clients and integrations
  • Data migrations require application changes and take time

That’s it!

Happy Reading!
