InfoQ Presentation— Scaling Instagram Infra

A long series of learning notes on infrastructure

Photo by Daniel Klein on Unsplash

Materials taken from the presentation from here. All tributes go to InfoQ and the Presenter.

March 7th, 2017

  • Scale out
  • Scale up
  • Scale the dev team

Scale out

Two types of resources:

  • Storage: needs to be consistent across data centers
  • Computing: driven by user traffic, as needed basis


  • PostgreSQL: relational database: user, media, friendship etc. One master for write and slave replicas for read.
  • Cassandra: user feeds, activities etc. Masterless replicas that have different read/write eventual consistency configuration.


Due to attributes below, memcache is not suitable for cross region update.

  • High performance key-value store in memory
  • Millions of reads/writes per second
  • Sensitive to network condition
  • Cross region operation is prohibitive

But if they don’t update, then the user will read stale cache and causes confusion.

So the solution is for local Postgres to invalidate the local cache when performing database operation.

But this adds burden to Postgres, for queries like select count(*) from user_likes_media where media_id=12345 this will join multiple tables together. Another step is to have denormalised table for queries like this.

But this is still not enough. As when all workers try to access the DB to get one data with cache misses, this will starve the workers. This is also called thundering herds problem aka when so many workers want to read/write from the same key.

The solution is to have Memcache lease. This allows for zero communication among clients (to decide which client should go to the backend, retrieve/recompute the value, and refill cache).

So instead of getting the worker going into db and get the value, the worker gets a lease-get( a 64-bit token), meaning the worker is told to retry in a short time and by then can hopefully get a value. Since memcache returns a token only once every 10 s/key, if another request comes in within this time frame, when it retries, it will probably see the data been set by the earlier worker with the first lease key.

This is also used to solve problems with stale sets aka workers set stale value in the cache, as when worker sets the key, the lease will be verified by memcache.

Scale up

Don’t count the servers, make the servers count

Use as few CPU instructions as possible

Monitor Optimize Analyz

  • collect performance data
  • monitor regression
  • analyse code path

Use as few servers as possible

  • fewer codes (optimisation, dead code )
  • shared memory: move configuration into shared memory, disable garbage collection
  • async call

Scale the Dev

No branches:

  • Continous integration
  • Collaborate easily
  • Fast bisect and revert
  • Continuous performance monitoring

That’s it! Happy Reading!

Hi :)