How to Build a Production Grade Workflow with SQL Modelling

By Michelle Ark and Chris Wu

In January of 2014, Shopify built a data pipeline platform for the data science team called Starscream. Back then, we were a smaller team and needed a tool that could deal with everything from ad hoc explorations to machine learning models. We chose to build with PySpark to get the power of a generalized distributed computer platform, the backing of the industry standard, and the ability to tap into the Python talent market. 

Fast forward six years and our data needs have changed. Starscream now runs 76,000 jobs and writes 300 terabytes a day! As we grew, some types of work went away, but others (like simple reports) became so commonplace we do them every day. While our PySpark-based tooling was computationally powerful, it wasn’t optimized for these commonplace tasks. If a product manager needed a simple rollup for a new feature by country, pulling and modelling the data wasn’t a fast task.

We’ll show you how we moved to a SQL modelling workflow by leveraging dbt (data build tool) and created tooling for testing and documentation on top of it. Altogether, these features provide Shopify’s data scientists with a robust, production-ready workflow to quickly build straightforward pipelines.

The Problem

When we interviewed our users to understand their workflow on Starscream, we discovered two issues: development time and thinking.

Development time encompasses the time data scientists use to prototype the data model they’d like to build, run it, see the outcome, and iterate. The PySpark platform isn’t ideal for running straightforward reporting tasks: it often forces data scientists to write boilerplate, and it yields long runtimes. This led to long iteration cycles when trying to build models on unfamiliar data.

The second issue, thinking, is more subtle and deals with the way the programming language forces you to look at the data. Many of our data scientists prefer SQL to Python because its structure forces consistency in business metrics. When interviewing users, we found a majority would write out a query in SQL, then translate it to Python when prototyping. Unfortunately, query translation is time-consuming and doesn’t add value to the pipeline.

To understand how widespread these problems were, we audited the jobs run and surveyed our data science team for the use cases. We found that 70% or so of the PySpark jobs on Starscream were full batch queries that didn’t require generalized computing. We viewed this as an opportunity to make a kickass optimization for a painful workflow. 

Enter Seamster

Our goal was to create a SQL pipeline for reporting that enables data scientists to create simple reporting data faster, while still being production ready. After exploring a few alternatives, we felt that the dbt library came closest to our needs. Their tagline “deploy analytics code faster with software engineering practices” was exactly what we were looking for in a workflow. We opted to pair it with Google BigQuery as our data store and dubbed the system and its tools, Seamster.

We knew that any off-the-shelf system wouldn’t be one size fits all. In moving to dbt, we had to implement our own:

  • source and model structure to modularize data model development
  • unit testing to increase the types of testable errors
  • continuous integration (CI) pipelines to provide safety and consistency guarantees.

Source Independence and Safety

With dozens of data scientists making data models in a shared repository, a great user experience would

  • maximize focus on work 
  • minimize the impact of model changes by other data scientists.

By default, dbt declares raw sources in a central sources.yml. This quickly became a very large file, as it included the schema for each source in addition to the source name. It created a huge bottleneck for teams editing the same file across multiple PRs.

To mitigate the bottleneck, we leveraged the flexibility of dbt and created a top-level ‘sources’ directory to represent each raw source with its own source-formatted yaml file. This way, data scientists can parse only the source documentation that’s relevant for them and contribute source definitions without stepping on each other’s toes.
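For illustration, a per-source file in that directory might look like the following sketch. The source and column names are invented; the layout follows dbt’s standard source schema:

```yaml
# sources/shop_events.yml (illustrative)
version: 2

sources:
  - name: shop_events
    tables:
      - name: orders
        columns:
          - name: order_id
            description: Primary key of the raw orders table.
          - name: created_at
            description: When the order was created, in UTC.
```

Each raw source gets one such file, so a pull request touching one source never conflicts with a pull request touching another.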

Base models are one-to-one interfaces to raw sources.

We also created a Base layer of models using the ‘staging’ concept from dbt to implement their best practice of limiting references to raw data. Our Base models serve as a one-to-one interface to raw sources. They don’t change the grain of the raw source, but do apply renaming, recasting, or any other cleaning operation that relates to the source data collection system.

The Base layer serves to protect users from breaking changes in raw sources. Raw external sources are by definition out of the control of Seamster and can introduce breaking changes for any number of reasons at any point in time. If and when this happens, you only need to apply the fix to the Base model representing the raw source, as opposed to every individual downstream model that depends on the raw source. 
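A Base model in this scheme might look like the sketch below (table and column names are invented). It keeps the grain of the raw source and only renames and recasts:

```sql
-- models/base/base_shop_events__orders.sql (illustrative)
select
    id                             as order_id,
    cast(created_at as timestamp)  as created_at,
    total_price_usd                as order_total_usd
from {{ source('shop_events', 'orders') }}
```

If the raw source renames a column, only this file changes; every downstream model keeps selecting from the Base model unchanged.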

Model Ownership for Teams

We knew that the tooling improvements of Seamster would be only one part of a greater data platform at Shopify. We wanted to make sure we’re providing mechanisms to support good dimensional modelling practices and support data discovery.

In dbt, a model is simply a .sql file. We’ve extended this definition in Seamster to define a model as a directory of files, including:

  • model_name.sql
  • schema.yml

You can further organize models into directories that indicate a data science team at Shopify like ‘finance’ or ‘marketing’. 

To support a clean data warehouse we’ve also organized data models into these rough layers that differentiate between:

  • base: data models that are one-to-one with raw data, but cleaned, recast and renamed
  • application-ready: data that isn’t dimensionally modelled but still transformed and clean for consumption by another tool (for example, training data for a machine learning algorithm)
  • presentation: shareable and reliable data models that follow dimensional modelling best practices and can be used by data scientists across different domains.

With these two changes, a data consumer can quickly understand the data quality they can expect from a model and find the owner in case there is an issue. We also pass this metadata upstream to other tools to help with the data discovery workflow.

More Tests

dbt has native support for ‘schema tests’, which are encoded in a model’s schema.yml file. These tests run against production data to validate data invariants, such as the presence of null values or the uniqueness of a particular key. This feature in dbt serves its purpose well, but we also want to enable data scientists to write unit tests for models that run against fixed input data (as opposed to production data).

Testing on fixed inputs allows the user to test edge cases that may not be in production yet. In larger organizations, there can and will be frequent updates and many collaborators for a single model. Unit tests give users confidence that the changes they’re making won’t break existing behaviour or introduce regressions. 

Seamster provides a Python-based unit testing framework. Data scientists write their unit tests in a test file in the model directory. The central object in this framework is a ‘mock’ data model, which has an underlying representation of a Pandas dataframe. You can pass fixed data to the mock constructor as either a csv-style string, a Pandas dataframe, or a list of dictionaries to specify input data.

Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation.

The framework creates a test query in which a common table expression (CTE) represents each input mock data model, and any references to production models (identified using dbt’s ‘ref’ macro) are replaced by references to the corresponding CTE. Once the query executes, you can compare the output to an expected result. In addition to an equality assertion, we extended our framework to support all expectations from the open-source Great Expectations library to provide more granular assertions and error messaging.
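To make the mechanism concrete, here is a minimal, self-contained sketch of the CTE-substitution idea. The function names are invented, and sqlite3 stands in for BigQuery; Seamster’s real framework is internal:

```python
import re
import sqlite3

def mock_cte(name, rows):
    # Render fixed rows (a list of dicts) as a named CTE of literal SELECTs.
    selects = [
        "SELECT " + ", ".join(f"{value!r} AS {col}" for col, value in row.items())
        for row in rows
    ]
    return f"{name} AS ({' UNION ALL '.join(selects)})"

def run_model(model_sql, mocks):
    # Replace dbt-style {{ ref('name') }} references with bare CTE names,
    # prepend the mock CTEs, and execute the resulting query.
    sql = re.sub(r"\{\{\s*ref\('(\w+)'\)\s*\}\}", r"\1", model_sql)
    ctes = ", ".join(mock_cte(name, rows) for name, rows in mocks.items())
    with sqlite3.connect(":memory:") as conn:
        conn.row_factory = sqlite3.Row
        return [dict(r) for r in conn.execute(f"WITH {ctes} {sql}")]

# Fixed input data plays the role of an input MockModel.
orders = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 5}]
model = "SELECT COUNT(*) AS n, SUM(amount) AS total FROM {{ ref('orders') }}"
result = run_model(model, {"orders": orders})
assert result == [{"n": 2, "total": 15}]
```

The real framework does the same substitution but sends the query to BigQuery, so the SQL dialect under test matches production.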

The main downside to this framework is that it requires a roundtrip to the query engine to construct the test data model given a set of inputs. Even though the query itself is lightweight and processes only a handful of rows, these roundtrips to the engine add up. It becomes costly to run an entire test suite on each local or CI run. To solve this, we introduced tooling both in development and CI to run the minimal set of tests that could potentially break given the change. This was straightforward to implement with accuracy because of dbt’s lineage tracking support; we simply had to find all downstream models (direct and indirect) for each changed model and run their tests. 
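The selection step described above is a plain graph traversal over the lineage. A sketch, with invented model names:

```python
from collections import deque

def downstream(children, changed):
    # Breadth-first search collecting all direct and indirect descendants
    # of the changed models in a model -> [child models] lineage map.
    seen = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

children = {
    "base_orders": ["orders_fact"],
    "orders_fact": ["sales_report", "ltv_model"],
}

# Changing base_orders means running tests for it and everything below it.
affected = downstream(children, ["base_orders"]) | {"base_orders"}
assert affected == {"base_orders", "orders_fact", "sales_report", "ltv_model"}
```

dbt already tracks this lineage from the ‘ref’ macros, so the CI job only has to read it and run the tests belonging to the affected set.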

Schema and Directed Acyclic Graph Validation on the Cheap

Our objective in Seamster’s CI is to give data scientists peace of mind that their changes won’t introduce production errors the next time the warehouse is built. They shouldn’t have to wonder whether removing a column will cause downstream dependencies to break, or whether they made a small typo in their SQL model definition.

To achieve this accurately, we would need to build and tear down the entire warehouse on every commit. This isn’t feasible from either a time or cost perspective. Instead, on every commit we materialize every model as a view in a temporary BigQuery dataset, which is created at the start of the validation process and removed as soon as validation finishes. If we can’t build a view because its upstream model doesn’t provide a certain column, or if the SQL is invalid for any reason, BigQuery fails to build the view and produces relevant error messaging.
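In BigQuery DDL terms, the validation pass amounts to something like the following sketch (project, dataset, and model names are invented):

```sql
-- Illustrative CI validation: build each model as a view in a throwaway
-- dataset. A missing upstream column or a SQL typo fails the CREATE VIEW
-- with a descriptive BigQuery error, without scanning any data.
CREATE SCHEMA IF NOT EXISTS `my-project.ci_validation_8f3a`;

CREATE VIEW `my-project.ci_validation_8f3a.orders_fact` AS
SELECT order_id, order_total_usd
FROM `my-project.ci_validation_8f3a.base_shop_events__orders`;

-- Torn down as soon as validation finishes:
DROP SCHEMA `my-project.ci_validation_8f3a` CASCADE;
```

Because views are metadata-only, this catches schema and SQL errors at a tiny fraction of the cost of actually building the warehouse.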

Currently, we have a warehouse consisting of over 100 models, and this validation step takes about two minutes. We reduce validation time further by only building the portion of the directed acyclic graph (DAG) affected by the changed models, as in the unit testing approach.

dbt’s schema.yml serves purely as metadata and can contain columns with invalid names or types (data_type). We employ the same view-based strategy to validate the contents of a model’s schema.yml file, ensuring that the schema.yml is an accurate depiction of the actual SQL model.

Data Warehouse Rules

Like many large organizations, we maintain a data warehouse for reporting where accuracy is key. To power our independent data science teams, Seamster helps by enforcing conformance rules on the layers mentioned earlier (base, application-ready, and presentation layers). Examples include naming rules or inheritance rules which help the user reason over the data when building their own dependent models.

Seamster CI runs a collection of such rules that ensure consistency of documentation and modelling practices across different data science teams. For example, one warehouse rule enforces that all columns in a schema conform to a prescribed nomenclature. Another warehouse rule enforces that only base models can reference raw sources (via the ‘source’ macro) directly. 

Some warehouse rules apply only to certain layers. In the presentation layer, we enforce that any column name needs a globally unique description to avoid divergence of definitions. Since everything in dbt is YAML, most of this rule enforcement is just simple parsing.
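As an illustration of how lightweight such a rule can be, here is a sketch of a snake_case naming check over a parsed schema.yml. The rule and all names are invented; Seamster’s actual rules are internal, and the parsed YAML is represented here as a plain dict:

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def naming_violations(schema):
    # Return (model, column) pairs whose column names break the convention.
    return [
        (model["name"], col["name"])
        for model in schema.get("models", [])
        for col in model.get("columns", [])
        if not SNAKE_CASE.match(col["name"])
    ]

schema = {
    "models": [
        {"name": "orders_fact",
         "columns": [{"name": "order_id"}, {"name": "OrderTotal"}]},
    ]
}
assert naming_violations(schema) == [("orders_fact", "OrderTotal")]
```

CI fails the build and reports the offending model and column whenever this list is non-empty.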

So, How Did It Go?

To ensure we got it right and worked out the kinks, we ran a multiweek beta of Seamster with some of our data scientists who tested the system out on real models. Since you’re reading about it, you can guess by now that it went well!

While productivity measures are always hard, the vast majority of users reported they were shipping models in a couple of days instead of a couple of weeks. In addition, documentation of models increased because this is a feature built into the model spec.

Were there any negative results? Of course. dbt’s current incremental support doesn’t provide safe and consistent methods to handle late-arriving data, key resolution, and rebuilds. For this reason, a handful of models (Type 2 dimensions or models in the 1.5B+ event territory) that required incremental semantics weren’t doable—for now. We’ve got big plans though!

Where to Next?

We’re focusing on updating the tool to ensure it’s tailored to Shopify’s data scientists. The biggest hurdle for a new product (internal and external) is adoption. We know we still have work to do to ensure that our tool is top of mind when users have simple (but not easy) reporting work. We’re spending time with each team to identify upcoming work that we can speed up by using Seamster. Their questions and comments will be part of our tutorials and documentation for new data scientists.

On the engineering front, an exciting next step is looking beyond batch data processing. Apache Beam and Beam SQL provide an opportunity to consider a single SQL-centric data modelling tool for both batch and streaming use cases.

We’re also big believers in open source at Shopify. Depending on the dbt community’s needs, we’d also like to explore contributing our validation strategy and unit testing framework to the project.

If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.

Adopting Sorbet at Scale

Join us November 25, 2020 at 1 pm EST for ShipIt! Presents: The State of Ruby Static Typing at Shopify as we talk about static typing at Shopify. We’ll share why we chose Sorbet for the monolith and the lessons we learned along the way. Please Register.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our monolith 40 times a day. Shopify is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale, with a dynamic language, even with the most rigorous review process and over 150,000 tests, it’s a challenge to ensure that everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

In my first post, I talked about how we brought static typing to our core monolith. We adopted Sorbet in 2019, and the Ruby Infrastructure team continues to work on ways to make the development process safer, faster, and more enjoyable for Ruby developers. Currently, Sorbet is only enforced on our main monolith, but we have 60 internal projects using Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration (CI) platform for every pull request and fails builds if type checking errors are found. As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures.

In this second post, I’ll present how we got from no Sorbet in our monolith to almost full coverage in the span of a few months. I’ll explain the challenges we faced, the tools we built to solve them, and the preliminary results of our experiment to reduce production errors with static typing.

Our Open-Source Tooling for Sorbet Adoption

Currently, Sorbet can’t understand all the constructs available in Ruby. Furthermore, Shopify relies on a lot of gems and frameworks, including Rails, that bring their own set of idioms. Increasing type coverage in our monolith meant finding ways to make Sorbet understand all of this. These are the tools we created to make it possible. We open sourced them to share our work with the community and make typing adoption easier for everyone.

Making Code Sorbet-compatible with RuboCop Sorbet

Even with gradual typing, moving our monolith to Sorbet required a lot of changes to remove or replace Ruby constructs that Sorbet couldn’t understand, such as non-constant superclasses or accessing constants through meta-programming with const_get. For this, we created RuboCop Sorbet, a suite of RuboCop rules allowing us to:

  • forbid some of the constructs not recognized by Sorbet yet 
  • automatically correct those constructs to something Sorbet can understand.

We also use these cops to require a minimal typed level on all files of our monolith (at least typed: false for now, but soon typed: true) and enforce some styling conventions around the way we write signatures.
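For illustration, a project’s .rubocop.yml might enable a couple of rubocop-sorbet cops like this (the selection here is ours, not necessarily Shopify’s configuration):

```yaml
# .rubocop.yml (illustrative)
require:
  - rubocop-sorbet

Sorbet/FalseSigil:           # require at least `typed: false` in every file
  Enabled: true

Sorbet/ConstantsFromStrings: # forbid const_get-style constant lookups
  Enabled: true
```

Running RuboCop in CI then turns “Sorbet can’t understand this construct” from a late type-checking failure into an early, often auto-correctable, lint error.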

Creating RBI Files for Gems with Tapioca

One considerable piece missing when we started using Sorbet was Ruby Interface file (RBI) generation for gems. For Sorbet to understand code from required gems, we had two options: 

  1. pass the full code of the gems to Sorbet which would make it slower and require making all gems compatible with Sorbet too
  2. pass a light representation of the gem content through a Ruby Interface (RBI) file.

Because this was before the birth of Sorbet’s srb tool, we created our own: Tapioca. Tapioca provides an automated way to generate the appropriate RBI file for a given gem with high accuracy. It generates the definitions for all statically defined types and most of the runtime defined ones exported from a Ruby gem. It loads all the gems declared in the dependency list from the Gemfile into memory, then performs runtime introspection on the loaded types to understand their structure, and finally generates a complete RBI file for each gem with a versioned filename.

Tapioca is the de facto RBI generation tool at Shopify and is used by a few renowned projects, including Homebrew.

Creating RBI Files for Domain Specific Languages

Understanding the content of the gems wasn’t enough to allow type checking our monolith. At Shopify we use a lot of internal Domain Specific Languages (DSLs), most of them coming directly from Rails and often based on meta-programming. For example, the Active Record association belongs_to ends up defining tens of methods at runtime, none of which are statically visible to Sorbet. To enhance Sorbet coverage on our codebase we needed it to “see” those methods.

To solve this problem, we added RBI generation for Rails DSLs directly into Tapioca. Again, using runtime introspection, Tapioca analyzes the code of our application to generate RBI files containing a static definition for all the runtime-generated methods from Rails and other libraries.
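To make this concrete, here is a hand-written sketch of the kind of RBI such DSL generation produces for an Active Record association. The model and the generated output are invented and abridged, and RBI files are declarations only, never executed:

```ruby
# app/models/comment.rb
class Comment < ApplicationRecord
  belongs_to :post # defines #post, #post=, #build_post, ... at runtime
end

# sorbet/rbi/dsl/comment.rbi (generated, abridged, illustrative)
# typed: true
class Comment
  sig { returns(Post) }
  def post; end

  sig { params(value: Post).returns(Post) }
  def post=(value); end
end
```

With the RBI in place, Sorbet can statically resolve and type-check calls like `comment.post` that only exist at runtime.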

Today Tapioca provides RBI generation for a lot of DSLs we use at Shopify:

  • Active Record associations
  • Active Record columns
  • Active Record enums
  • Active Record scopes
  • Active Record typed store
  • Action Mailer
  • Active Resource
  • Action Controller helpers
  • Active Support current attributes
  • Rails URL helpers
  • FrozenRecord
  • IdentityCache
  • Google Protobuf definitions
  • SmartProperties
  • StateMachines
  • …and the list is growing everyday

Building Tooling on Top of Sorbet with Spoom

As we began using Sorbet, the need for tooling built on top of it was more and more apparent. For example, Tapioca itself depends on Sorbet to list the symbols for which we need to generate RBI definitions.

Sorbet is a really fast Ruby parser that can build an Abstract Syntax Tree (AST) of our monolith in a matter of seconds versus a few minutes for the Whitequark parser. We believe that in the future a lot of tools such as linters, cops, or static analyzers can benefit from this speed.

Sorbet also provides a Language Server Protocol (LSP) with the option --lsp. Using this option, Sorbet can act as a server that is interrogated by other tools programmatically. LSP scales much better than using the file output by Sorbet with the --print option (see for example parse-tree-json or symbol-table-json) that spits out GBs of JSON for our monolith. Using LSP, we get answers in a few milliseconds instead of parsing those gigantic JSON files. This is generally how the language plugins for IDEs are implemented.

To facilitate the development of external tools for Sorbet, we created Spoom, our toolbox for using Sorbet programmatically. It provides a set of useful features to interact with Sorbet: parsing the configuration files, listing the type checked files, collecting metrics, and automatically bumping files to higher strictnesses. It also comes with a Ruby API to connect with Sorbet’s LSP mode.

Today, Spoom is at the heart of our typing coverage reporting and provides the beautiful visualizations used in our SorbetMetrics dashboard.

Sharing Lessons Learned

After more than a year using Sorbet on our codebases, we learned a lot. I’ll share some insights about what typing did for us, which benefits it brings, and some of the limitations it implies.

Build, Measure, Learn

There’s a very scientific way to approach building products, encapsulated in the Build-Measure-Learn loop pioneered by Eric Ries. Our team believes in intuition, but we still prefer empirical proofs when we have access to them. So when we started with static typing in Ruby, we all believed it would be useful for our developers, but wanted to measure its effects and have hard data. This allows us to decide what we should concentrate on next based on the outcome of our measurements.

I talked about observing metrics, surveying developer happiness, or getting feedback through interviews in part 1, but my team wanted to go further and correlate the impact of typing on production errors. So, we conducted a series of controlled experiments to validate our assumptions.

Since our monolith evolves very fast, it becomes hard to observe the direct impact of typing on production. New features are added every day which gives rise to new errors while we work to decrease errors in other areas. Moreover, our monolith has about 500 gem dependencies (including transitive dependencies), any of which could introduce new errors in a version bump.

For this reason, we decreased our scope and targeted a smaller codebase for our experiment. Our internal developer tool, aptly named dev, was an ideal candidate. It’s a mature codebase that changes slowly and by a few people. It’s a very opinionated codebase with no external dependencies (the few dependencies it has are vendored), so it could satisfy the performance requirements of a command-line tool. Additionally, dev almost uses no meta-programming, especially not the kind normally coming from external libraries. Finally, it’s a tool with heavy usage since it’s the main development tool used by all developers at Shopify for their day-to-day work. It runs thousands of times a day on hundreds of different computers, there’s no edge case—at this scale, if something can break, it will.

We started monitoring all errors raised by dev in production, categorized the errors, analyzed their root cause, and tried to understand how typing could avoid them.

typed: ignore means typed: debt

Our first realisation was that we should keep away from typed: ignore. Ignoring a file can cause errors to appear in other files, because those files may reference something defined in the ignored one.

For example, if we opt to ignore this file:

Sorbet will raise errors in this file:

Since Sorbet doesn't even parse the file a.rb, it won’t know where constant A was defined. The more files you ignore, the more this case arises, especially when ignoring library files. This makes it harder and harder for other developers to type their own code.
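A minimal reconstruction of the two files described above (the original post showed them as screenshots; apart from a.rb and the constant A, the contents are invented):

```ruby
# a.rb
# typed: ignore
class A
  SOME_CONST = 42 # hypothetical contents
end

# b.rb
# typed: true
# Sorbet reports "Unable to resolve constant A" on the next line: a.rb is
# ignored, so A's definition was never parsed, even though this runs fine.
puts A::SOME_CONST
```

The error lands in b.rb, a file its author may consider perfectly well typed, which is what makes ignored files such expensive debt.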

As a rule of thumb at Shopify, we aim to have all our application files at least at typed: true and our test files at least at typed: false. We reserve typed: ignore for some test files that are particularly hard to type (because of mocking, stubbing, and fixtures), or some very specific files such as Protobuf definition files (which we handle through DSLs RBI generation with Tapioca).

Benefits Realized, Even at typed: false

Even at typed: false, Sorbet provides safety in our codebase by checking that all the constants resolve. Thanks to this, we now avoid mistakes triggering NameErrors either in CI or production.

Enabling Sorbet on our monolith allowed us to find and fix a few mistakes such as:

  • StandardException instead of StandardError
  • NotImplemented instead of NotImplementedError

We found dead code referencing constants deleted months ago. Interestingly, while most of the main execution paths were covered by tests, code paths for error handling were the places where we found the most NameErrors.
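Reconstructed for illustration (this isn’t actual Shopify code), the StandardException typo above only bites when the error-handling path actually runs, which is exactly where tests rarely reach:

```ruby
def risky
  raise "boom"
rescue StandardException => e # typo: the real class is StandardError
  "handled: #{e.message}"
end

begin
  risky
rescue NameError => e
  # The rescue clause itself blows up at runtime with
  # "uninitialized constant StandardException".
  puts e.message
end
```

Sorbet flags the unresolved constant statically, before the broken rescue ever ships.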

A bar graph showing the decreasing amount of NameErrors in dev over time
NameErrors raised in production for the dev project

During our experiment, we started by moving all files from dev to typed: false without adding any signatures. As soon as Sorbet was enabled in October 2019 on this project, no more NameErrors were raised in production.

Stacktrace showing NameError raised in production after Sorbet was enabled because of meta-programming like const_get
NameError raised in production after Sorbet was enabled because of meta-programming

The same observation was made on multiple projects: enabling Sorbet on a codebase eradicates all NameErrors due to developers’ mistakes. Note that this doesn’t avoid NameErrors triggered through metaprogramming, for example, when using const_get.

While Sorbet is a bit more restrictive when it comes to resolving constants, this strictness can be beneficial for developers:

Example of constant resolution error raised by Sorbet

typed: true Brings More Benefits

A circular tree map showing the relationship between strictness level and helpers in dev
Files strictnesses in dev (the colored dots are the helpers)

With our next experiment on gradual typing, we wanted to observe the effects of moving parts of the dev application to typed: true. We moved a few of the typed: false files to typed: true by focusing on the most reused part of the application, called helpers (the blue dots).

A bar graph showing the decrease in NoMethodErrors for files typed: true over time
NoMethodErrors in production for dev (in red the errors raised from the helpers)

By typing only this part of the application (~20% of the files) and still without signatures, we observed a decrease in NoMethodErrors for files typed: true.

Those preliminary results gave us confidence that stricter typing can impact other classes of errors. We’re now in the process of adding signatures to the typed: true files in dev so we can observe their effect on TypeErrors and ArgumentErrors in production.

The Road Ahead of Us

The team working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable, and safe Ruby language for Shopify. As part of that mandate, we believe there are more things we can do to increase Ruby performance when the types of variables are known ahead of time. This is an interesting area to explore for the team as our adoption of typing increases and we’re certainly thinking about investing in this in the near future.

Support for Ruby 2.7

We keep our monolith as close as possible to Ruby trunk. This means we moved to Ruby 2.7 months ago. Doing so required a lot of changes in Sorbet itself to support syntax changes such as beginless ranges and numbered parameters, as well as new behaviors like forbidding circular argument references. Some work is still in progress to support the new forwarding arguments syntax and pattern matching. Stripe is currently working on making the keyword arguments compatible with Ruby 2.7 behavior.

100% Files at typed: true

The next objective for our monolith is to move 100% of our files to at least typed: true and make it mandatory for all new files going forward. Doing so implies making both Sorbet and our tooling smarter to handle the last Ruby idioms and constructs we can’t type check yet.

We’re currently focusing on changing Sorbet to support Rails ActiveSupport::Concerns (Sorbet pull requests #3424, #3468, #3486) and providing a solution to inclusion requirements, as well as improving Tapioca for better RBI generation for generics and GraphQL support.

Investing in the Future of Static Typing for Ruby

Sorbet isn’t the end of the road concerning Ruby type checking. Matz announced that Ruby 3 will ship with RBS, another approach to type check Ruby programs. While the solution isn’t yet mature enough for our needs (mainly because of speed and some other limitations explained by Stripe) we’re already collaborating with Ruby core developers, academics, and Stripe to make its specification better for everyone.

Notably, we have open-sourced RBS parser, a C++ parser for RBS capable of translating a subset of RBS to RBI, and are now working on making RBS partially compatible with Sorbet.

We believe that typing, whatever the solution used, is greatly beneficial for Ruby, Shopify, and our merchants, so we'll continue to invest heavily in it. We want to increase typing adoption across the whole community.

We’ll continue to work with collaborators to push typing in Ruby even further. As we lessen the effort needed to adopt Sorbet, specifically on Rails projects, we’ll start making gradual typing mandatory on more internal projects. We will help teams start adopting at typed: false and move to stricter typing gradually.

As our long term goal, we hope to bring Ruby on par with compiled and statically typed languages regarding safety, speed and tooling.

Do you want to be part of this effort? Feel free to contribute to Sorbet (there are a lot of good first issues to begin with), check our many open-source projects or take a look at how you can join our team.

Happy typing!

—The Ruby Infrastructure Team

ShipIt! Presents: The State of Ruby Static Typing at Shopify

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything works properly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

During this event you will learn:

  • Why we use Sorbet for static typing
  • What it means to treat typing as a product
  • What supporting tools we built
  • How we measure the effectiveness of gradual typing
  • What lessons we learned

Date: November 25, 2020 at 1 pm EST


Static Typing for Ruby

Join us November 25, 2020 at 1 pm EST for ShipIt! Presents: The State of Ruby Static Typing at Shopify as we talk about static typing at Shopify. We’ll share why we chose Sorbet for the monolith and the lessons we learned along the way. Please Register.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

The Three Key Requirements for a Typing Solution in Ruby

Even in 2018, typing for Ruby wasn’t a novelty. A few attempts had been made to integrate type annotations directly into the language, through external tools (RDL, Steep), or as libraries (dry-types).

Which solution would best fit Shopify considering its codebase and culture? For the Ruby Infrastructure team, the best match for a typing solution needs:

  • Gradual typing: Typing a monolith isn’t a simple task and can’t be done in a day. Our code evolves fast, and we can’t ask developers to stop coding while we add types to the existing codebase. We need the flexibility to add types without blocking the development process or limiting our ability to satisfy merchants’ needs.
  • Speed: Considering the size of Shopify’s codebase, speed is a concern. If our goal is to provide quick feedback on errors and remove pressure from continuous integration (CI), we need a solution that’s fast.
  • Full Ruby support: We use all of Ruby at Shopify. Our culture embraces the language and benefits from all its features, even hard-to-type ones like metaprogramming, overloading, and class reopening. Support for Rails is also a must. From code elegance to developer happiness, the chosen solution needs to be as compatible as possible with all Ruby features.

With such a list of requirements, none of the contenders at the time could satisfy our needs, especially the speed requirement. We started thinking about developing our own solution, but a perfectly timed meeting with Stripe, who were working on a solution to the problem, introduced us to Sorbet.

Sorbet was closed-source at the time and under heavy development, but it was already promising. It’s built for gradual typing with phenomenal performance (able to analyze 100,000 lines per second per core), making it significantly faster than running automated tests. It can handle hard-to-type things like metaprogramming thanks to Ruby Interface (RBI) files. This is how, at the start of 2019, Shopify began its journey toward static type checking for Ruby.

Treat Static Typing as a Product

With only a three-person team and a lot on our plate, fully typing our monolith with Sorbet called for an approach based on Shopify’s Get Shit Done (GSD) framework.

  1. We tested the viability of Sorbet on our core monolith by typing only a few files, checking that we could observe benefits from it without impairing other developers’ work. Sorbet’s gradual approach proved to work.
  2. We manually created RBI files to represent what Sorbet could not yet understand, and checked that we supported Ruby’s most advanced features as well as Rails constructs.
  3. We added more and more files while keeping an eye on performance, ensuring Sorbet would scale with our monolith.

This gave us confidence that Sorbet was the right choice to solve our problem. Once we officially decided to go with Sorbet, we reflected on how we could reach 100% adoption in the monolith. To determine our roadmap we looked at:

  • how many files needed to be typed
  • the content of the files
  • the Ruby features they used

Track Static Typing Adoption

Type checking in Sorbet comes in different levels of strictness. The strictness is defined on a per-file basis by adding a magic comment to the file, called a sigil, written # typed: LEVEL, where LEVEL can be one of the following:

  • ignore: At this level, the file is not even read by Sorbet, and no errors are reported for it at all.
  • false: Only errors related to syntax, constant resolution, and correctness of sigs are reported. At this level, Sorbet doesn’t check method calls, even if the methods called don’t exist anywhere in the codebase.
  • true: This is the level where Sorbet actually starts to type check your code. All methods called need to exist in the codebase. For each call, Sorbet checks that the argument count matches the method definition. If the method has a signature, Sorbet also checks the argument types.
  • strict: At this level, all methods must have a signature, and all constants and instance variables must have explicitly annotated types.
  • strong: Sorbet no longer allows untyped variables. In practice, this level is actually unusable for most files because Sorbet can’t type everything yet, and even Stripe advises against using it.
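To make the levels concrete, here’s a small illustrative sketch (not Shopify code; the class and method names are invented). The file runs as plain Ruby, and the comments describe what Sorbet would report at typed: true:

```ruby
# typed: true
# At `typed: true`, Sorbet verifies that every method called exists and that
# the argument count matches the definition; signatures only become mandatory
# at `typed: strict`.

class Discount
  # With sorbet-runtime loaded, this method could carry a signature:
  #   sig { params(price: Integer, percent: Integer).returns(Integer) }
  def apply(price, percent)
    price - (price * percent / 100)
  end
end

# Sorbet would reject this call at `typed: true` (wrong argument count):
#   Discount.new.apply(100)
puts Discount.new.apply(100, 20) # => 80
```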

Once we were able to run Sorbet on our codebase, we needed a way to track our progress and identify which parts of our monolith were typed at which strictness, or which parts needed more work. To do so, we created SorbetMetrics, a tool that collects and displays typing coverage metrics for all our internal projects. We started tracking three key metrics to measure Sorbet adoption:

  • Sigils: how many files are typed ignore, false, true, strict or strong
  • Calls: how many calls are sent to a method with a signature
  • Signatures: how many methods have a signature
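As a sketch of what the Sigils metric measures (SorbetMetrics itself is internal; the file names and contents below are invented), one can scan Ruby sources for the magic comment and tally strictness levels:

```ruby
# Sorbet treats a file with no sigil as `typed: false`.
SIGIL = /\A#\s*typed:\s*(ignore|false|true|strict|strong)/

sources = {
  "app/models/order.rb"  => "# typed: true\nclass Order; end\n",
  "app/models/shop.rb"   => "# typed: strict\nclass Shop; end\n",
  "lib/legacy_report.rb" => "class LegacyReport; end\n", # no sigil
}

tally = Hash.new(0)
sources.each_value do |code|
  level = code[SIGIL, 1] || "false" # default for sigil-less files
  tally[level] += 1
end

puts tally["true"]   # => 1
puts tally["strict"] # => 1
puts tally["false"]  # => 1
```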

[Figure: SorbetMetrics dashboard homepage, showing Sorbet usage per project over time with the percentage of sigils, calls, and signatures in each project]

Each day SorbetMetrics pulls the latest version of our monolith and other Shopify projects using Sorbet, computes those metrics and displays them in a dashboard internally available to all our developers.

[Figure: SorbetMetrics dashboard for our monolith, showing sigil, call, and signature percentages and their trends over time, a strictness map by component, and typechecking time across Sorbet versions]

Sorbet Support at Scale

If we treat typing as a product, we also need to focus on supporting and enabling our “customers” who are developers at Shopify. One of our goals was to have a strong support system in place to help with any problems that arise and slow developers down.

Initially, we supported developers through a dedicated Slack channel where Shopifolk could ask the team questions. We answered these questions in real time and helped Shopifolk with typing efforts where our input was important.

This white-glove support model obviously didn’t scale, but it was an excellent learning opportunity for our team—we came to understand the biggest challenges and recurring problems. We ended up solving some problems over and over again, but doing so helped us recognize the patterns and decide which features to work on next.

Using Slack also meant our answers weren’t easily discoverable later. We moved most of the support and conversation to our internal Discourse platform, increasing discoverability and the broader sharing of knowledge. This also allows us to record solutions in a single place and lets developers self-serve as much as possible. As we onboard more and more projects onto Sorbet, this solution scales better.

Understand Developer Happiness

Beyond unblocking our users, we also need to ensure their happiness. Sorbet, and more generally static typing in Ruby, wouldn’t be a good fit for us if it made our developers miserable. We’re aware that it introduces a bit more work, so the benefits need to outweigh the inconvenience.

Our first tool to measure developers’ opinions of Sorbet is surveys. Twice a year, we send a “Typing @ Shopify” survey to all developers and collect their sentiments regarding Sorbet’s benefits and limitations, as well as what we should focus on in the future.

[Figure: Responses from our “Sorbet @ Shopify” surveys, with “strongly agree” answers growing over time for both “I want Sorbet to be applied to other Shopify projects” and “I want more code to be typed”]

We use simple questions (“yes” or “no”, or a “Strongly Disagree” (1) to “Strongly Agree” (5) scale) and then look at how the answers evolve over time. The survey results gave us interesting insights:

  • Sorbet catches more errors on developer’s pull requests (PR) as adoption increased
  • Signatures help with code understanding and give developers confidence to ship
  • This added confidence contributed directly to increasingly positive opinions about static typing in Ruby
  • Over time developers wanted more code and more projects to be typed
  • Developers get used to Sorbet syntax over time
  • IDE integration with Sorbet is a feature developers are rooting for

Our main observation is that developers enjoy Sorbet more as typing coverage increases, which further motivates us to reach 100% of files at typed: true and to maximize the number of methods with a signature.

The second tool is interviews with individual developers. We select a team working with Sorbet and meet with each member to talk about their experience using Sorbet, either in the monolith or on another project. We get a better understanding of their likes and dislikes and what we should improve, but also how we can better support them when introducing Sorbet so that the team keeps Sorbet in their project.

The Current State of Sorbet at Shopify

Currently, Sorbet is only enforced on our main monolith, and about 60 other internal projects have opted to use Sorbet as well. On our main monolith, we require all files to be at least typed: false, and Sorbet runs on our continuous integration (CI) platform for every PR, failing builds if any type checking errors are found. We’re currently evaluating the idea of enforcing valid type checking on CI even before running the automated tests.

[Figure: Typing coverage metrics for Shopify’s monolith, showing the percentage of sigils, calls, and signatures]

As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed, and half of our methods have signatures. All of this can be type checked in under 15 seconds on our developers’ machines.

[Figure: Files strictness map in Shopify’s monolith]

The circle map shows which parts of our monolith are at which strictness level. Each dot represents a Ruby file (excluding tests). Each circle represents a component (a set of Ruby files serving the same application concern). Yes, it looks like a Petri dish and our goal is to eradicate the bad orange untyped cells.

[Figure: Shopify projects using Sorbet, increasing over time]

Outside of the core monolith, we’ve also observed a natural increase of Shopify projects, both internal and open-source, using Sorbet. As I write these lines, more than 60 projects now use Sorbet. Shopifolks like Sorbet and use it on their own without being forced to do so.

[Figure: Manual dev tc runs from developers’ machines on our monolith in 2019]

Finally, we track how many times our developers ran the command dev tc to typecheck a project with Sorbet on their development machine. This shows us that developers don’t wait for CI to use Sorbet—everyone enjoys a faster feedback loop.

The Benefits of Types for Ruby Developers 

Now that Sorbet is fully adopted in our core monolith, as well as in many internal projects, we’re starting to see its benefits for our codebases and our developers. The team working on Sorbet adoption is part of the broader Ruby Infrastructure team, which is responsible for delivering a fast, scalable, and safe Ruby language for Shopify. As part of that mandate, we believe that static typing has a lot to offer Ruby developers, especially those working on big, complex codebases.

In this post, I focused on the process we followed to ensure Sorbet was the right solution for our needs, treated static typing as a product, and showed the benefits of this product for our customers: Shopifolk working on our monolith and beyond. Curious how we got there? Then you’ll be interested in the second part, Adopting Sorbet at Scale, where I present the tools we built to make adoption easier and faster, the projects we open-sourced to share with the community, and the preliminary results of our experiment with static typing to reduce production errors.

Happy typing!

—The Ruby Infrastructure Team

Shipit! Presents: The State of Ruby Static Typing at Shopify

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything works properly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

​​​​​​​During this event you will learn

  • Why we use Sorbet for static typing
  • What it means to treat typing as a product
  • What supporting tools we built
  • How we measure the effectiveness of gradual typing
  • What lessons we learned

Date: November 25, 2020 at 1 pm EST


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, visit our Engineering career page to find out about our open positions and learn about Digital by Default.

How to Introduce Composite Primary Keys in Rails

Databases are a key scalability bottleneck for many web applications. But what if you could make a small change to your database design that would unlock massively more efficient data access? At Shopify, we dusted off some old database principles and did exactly that with the primary Rails application that powers online stores for over a million merchants. In this post, we’ll walk you through how we did it, and how you can use the same trick to optimize your own applications.


A basic principle of database design is that data that is accessed together should be stored together. In a relational database, we see this principle at work in the design of individual records (rows), which are composed of bits of information that are often accessed and stored at the same time. When a query needs to access or update multiple records, this query will be faster if those rows are “near” to each other. In MySQL, the sequential ordering of rows on disk is dictated by the table’s primary key.

Active Record is the portion of the Rails application framework that abstracts and simplifies database access. This layer introduces database practices and conventions that greatly simplify application development. One such convention is that all tables have a simple automatically incrementing integer primary key, often called `id`. This means that, for a typical Rails application, most data is stored on disk strictly in the order the rows were created. For most tables in most Rails applications, this works just fine and is easy for application developers to understand.

Sometimes the pattern of row access in a table is quite different from the insertion pattern. In the case of Shopify’s core API server, it is usually quite different, due to Shopify’s multi-tenant architecture. Each database instance contains records from many shops. With a simple auto-incrementing primary key, table insertions interleave the insertion of records across many shops. On the other hand, most queries are only interested in the records for a single shop at a time.

Let’s take a look at how this plays out at the database storage level. We will use details from MySQL using the InnoDB storage engine, but the basic idea will hold true across many relational databases. Records are stored on disk in a data structure called a B+ tree. Here is an illustration of a table storing orders, with the integer order id shown, color-coded by shop:

Individual records are grouped into pages. When a record is queried, the entire page is loaded from disk into an in-memory structure called a buffer pool. Subsequent reads from the same page are much faster while it remains in the buffer pool. As we can see in the example above, if we want to retrieve all orders from the “yellow” shop, every page will need loading from disk. This is the worst-case scenario, but it turned out to be a prevalent scenario in Shopify’s main operational database. For some of our most important queries, we observed an average of 0.9 pages read per row in the final query result. This means we were loading an entire page into memory for nearly every row of data that we needed!

The fix for this problem is conceptually very simple. Instead of a simple primary key, we create a composite primary key [shop_id, order_id]. With this key structure, our disk layout looks quite different:

Records are now grouped into pages by shop. When retrieving orders for the “yellow” shop, we read from a much smaller set of pages (in this example it’s only one page less, but imagine extrapolating this to a table storing records for 10,000 shops and the result is more profound).
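The effect of clustering can be made concrete with a toy simulation (page size and row counts invented, not Shopify data): lay the same rows out in pages ordered by insertion versus by (shop_id, order_id), and count how many pages one shop’s orders touch:

```ruby
ROWS_PER_PAGE = 4

# Three shops whose orders arrive interleaved, as with an auto-incrementing id.
orders = (1..24).map { |id| { shop_id: id % 3, order_id: id } }

# Count the pages that contain at least one row for the given shop.
def pages_touched(rows, shop_id, per_page)
  rows.each_slice(per_page).count { |page| page.any? { |r| r[:shop_id] == shop_id } }
end

by_insertion = orders                                        # PRIMARY KEY (id)
by_shop = orders.sort_by { |r| [r[:shop_id], r[:order_id]] } # PRIMARY KEY (shop_id, id)

puts pages_touched(by_insertion, 1, ROWS_PER_PAGE) # => 6 (every page holds shop 1 rows)
puts pages_touched(by_shop, 1, ROWS_PER_PAGE)      # => 2 (shop 1 rows are contiguous)
```

With insertion ordering, every page contains rows from every shop, so reading one shop’s orders loads every page; with the composite ordering, the same rows sit in a contiguous run of pages.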

So far, so good. We have an obvious problem with the efficiency of data access and a simple solution. For the remainder of this article, we’ll go through some of the implementation details and challenges we came across rolling out composite primary keys in our main operational database, along with the impact on our Ruby on Rails application and other systems directly coupled to our database. We’ll continue using the example of an “orders” table, both because it is conceptually simple to understand and because it was one of the critical tables we applied this change to.

Introducing Composite Primary Keys

The first challenge we faced with introducing composite primary keys was at the application layer. Our framework and application code contained various assumptions about the table’s primary key. Active Record, in particular, assumes an integer primary key, and although there is a community gem to monkey-patch this, we didn’t have confidence that this approach would be sustainable and maintainable in the future. On deeper analysis, it turned out that nearly all such assumptions in application layer code continued to hold if we changed the `id` column to be an auto-incrementing secondary key. We can leave the application layer blissfully unaware of the underlying database schema by forcing Active Record to treat the `id` column as a primary key:

class Order < ApplicationRecord
  self.primary_key = :id
  # ... remainder of order model ...
end

Here is the corresponding SQL table definition:

CREATE TABLE `orders` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `shop_id` bigint(20) NOT NULL,
  -- ... other columns ...
  PRIMARY KEY (`shop_id`, `id`),
  KEY `id` (`id`)
  -- ... other secondary keys ...
);

Note that we chose to leave the secondary index as a non-unique key here. There is some risk to this approach because it is possible to construct application code that results in duplicate models with the same id (but with different shop_id in our case). You can opt for safety here and make a unique secondary key on id. We took this approach because the method we use for live schema migrations is prone to deadlock on tables containing multiple unique constraints. Specifically, we use Large Hadron Migrator (LHM), which uses MySQL triggers to copy records into a shadow table during migrations. Unique constraints are enforced in InnoDB through an exclusive table-level write lock. Since there are two tables accepting writes, each containing two exclusive locks, all of the necessary deadlock conditions are present during migration. You may be able to keep a unique constraint on `id` if any of the following are true:

  • You don’t perform live migrations on your application.
  • Your migrations don’t use SQL triggers (such as the default Rails migrations).
  • The write throughput on the table is low enough that a low volume of deadlocks is acceptable for your application.
  • The code path for writing to this table is resilient to database transaction failures.

The remaining area of concern is any data infrastructure that directly accesses the MySQL database outside of the Rails application layer. In our case, we had three key technologies that fell into this category: 

  • Our database schema migration infrastructure, already discussed above.
  • Our live data migration system, called Ghostferry. Ghostferry moves data across different MySQL instances while the application is still running, enabling load-balancing of sharded data across multiple databases. We implemented support for composite primary keys in Ghostferry as part of this work, introducing the ability to specify an alternate column for pagination during migration.
  • Our data warehousing system does both bulk and incremental extraction of MySQL tables into long term storage. Since this system is proprietary to Shopify we won’t cover this area further, but if you have a similar data extraction system, you’ll need to ensure it can accommodate tables with composite primary keys.


Before we dig into specific results, a disclaimer: every table and corresponding application code is different, so the results you see in one table do not necessarily translate into another. You need to carefully consider your data’s access patterns to ensure that the primary key structure produces the optimal clustering for those access patterns. In our case of a sharded application, clustering the data by shop was often the right answer. However, if you have multiple closely connected data models, you may find another structure works better.

To use a common example, if an application has “Blog” and “BlogPost” models, a suitable primary key for the blog_posts table may be (blog_id, blog_post_id), because typical data access patterns will tend to query posts for a single blog at once. In some cases, we found no overwhelming advantage to a composite primary key because there was no such singular data access pattern to optimize for. In one more subtle example, we found that associated records tended to be written within the same transaction, and so were already sequentially ordered, eliminating the advantage of a composite key. To extend the previous blog example, imagine if all posts for a single blog were always created in a single transaction, so that blog post records were never interleaved with the insertion of posts from other blogs.

Returning to our leading example of an “orders” table, we measured a significant improvement in database efficiency:

  • The most common queries that consumed most database capacity had a 5-6x improvement in elapsed query time.
  • Performance gains corresponded linearly with a reduction in MySQL buffer pool page reads per query. Adding a composite key on our single most queried table reduced the median buffer pool reads per query from 1.8 to 1.2.
  • There was a dramatic improvement in tail latency for our slowest queries. We maintain a log of slow queries, which showed a roughly 80% reduction in distinct queries relating to the orders table.
  • Performance gains varied greatly across different kinds of queries. The most dramatic improvement was 500x on a particularly egregious query. Most queries involving joins saw much lower improvement due to the lack of similar data clustering in other tables (we expect this to improve as more tables adopt composite keys).
  • A useful measure of aggregate improvement is to measure the total elapsed database time per day, across all queries involving the changed table. This helps to add up the net benefit on database capacity across the system. We observed a reduction of roughly one hour per day, per shard, in elapsed query time from this change.

There is one notable downside on performance that is worth clearly calling out. A simple auto-incrementing primary key has optimal performance on insert statements because data is always clustered in insertion order. Changing to a composite primary key results in more expensive inserts, as more distinct database pages need to be both read and flushed to disk. We observed a roughly 10x performance degradation on inserts by changing to a composite primary key. Most data are queried and updated far more often than inserted, so this tradeoff is correct in most cases. However, it is worth keeping this in mind if insert performance is a critical bottleneck in your system for the table in question.

Wrapping Up

The benefits of data clustering and the use of composite primary keys are well-established techniques in the database world. Within the Rails ecosystem, established conventions around primary keys mean that many Rails applications lose out on these benefits. If you are operating a large Rails application, and either database performance or capacity are major concerns, it is worth exploring a move to composite primary keys in some of your tables. In our case, we faced a large upfront cost to introduce composite primary keys due to the complexity of our data infrastructure, but with that cost paid, we can now introduce additional composite keys with a small incremental effort. This has resulted in significant improvements to query performance and total database capacity in one of the world’s largest and oldest Rails applications.


Building Mental Models of Ideas That Don’t Change

I hope these mental models are as valuable for you as they are for me. I presented these ideas at ShipIt! Presents: Building Mental Models of Ideas That Don’t Change on October 28, 2020 and the video is available. I went over the process of prioritizing new ideas and coming up with a system of models for yourself to organize these ideas. If you find this useful, stay updated by following me on Twitter or my blog.

There’s always new stuff: new frameworks, new languages, and new platforms. All of this adds up. Sometimes it feels like you’re just treading water and not actually getting better at what you do. I’ve tried spending more time learning this stuff, but that doesn’t work—there’s always more. I’ve found a better approach is learning things at a deeper level and using those lessons as a checklist. These core principles are called mental models.

I learned this approach by studying how bright people think. You might have heard Richard Feynman describe the handful of algorithms that he applies to everything. Maybe you’ve seen Elon Musk describe his approach as thinking by fundamental principles. Charlie Munger also credits most of his financial success to mental models. All of these people are amazing, and you won’t get to their level with mental models alone, but mental models give you a nudge in the right direction.

So, how does one integrate mental models into their life and work? The first thing that you need is a method for prioritizing new concepts that you should learn. After that, you’ll need a good system for keeping track of what you have identified as important. With this process, you’ll identify mental models and use them to make more informed decisions. Below I start by describing some engineering and management mental models that I have found useful over the years.

Table of Contents

  • Engineering Mental Models
  • Management Mental Models

Engineering Mental Models

Avoid Silent Failures

When something breaks, you should hear about it. This is important because small issues can help you find larger structural issues. Silent failures typically happen when exceptions are silenced—this may be in a networking library or in the code that handles exceptions. Failures can also be silent when one of your servers is down. You can prevent this by using a third-party system that pings each of your critical components.

As your project gets more mature, set up a dashboard to track key metrics and create automated alerts. Generally, computers should tell you when something is wrong. Systems become more difficult to monitor as they grow, so measure and log everything from the beginning rather than waiting until something goes wrong. You can encourage other developers to do this by creating helper classes with really simple APIs, since things that are easy and obvious are more likely to be used. Once you are logging everything, create automated alerts. Post these alerts in shared communication channels, and automatically page the on-call developer for emergencies.
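A minimal sketch of such a helper (with_alerting and the ALERTS list are invented stand-ins for a real logger or pager): it reports the exception and then re-raises, so nothing fails silently:

```ruby
ALERTS = [] # stand-in for a pager or metrics service

# Run a block; report any exception, then re-raise so it's never swallowed.
def with_alerting(task_name)
  yield
rescue => e
  ALERTS << "#{task_name} failed: #{e.class}"
  raise
end

begin
  with_alerting("nightly_rollup") { raise ArgumentError, "bad input" }
rescue ArgumentError
  # The failure still propagates; it was reported, not silenced.
end

puts ALERTS.first # => "nightly_rollup failed: ArgumentError"
```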

Do Minimal Upfront Work and Queue the Rest

A system is scalable when it handles unexpectedly large bursts of incoming requests. The faster your system handles a request, the faster it gets to the next one. It turns out that, in most cases, you don’t have to give a full response right away—just a response indicating you’ve started working on the task. In practice, you queue a background job after receiving a request. Once your job is in a queue, you have the added benefit of making your system fault tolerant, since failed jobs can be retried.
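The idea can be sketched with Ruby’s built-in thread-safe Queue (a stand-in for a real job backend like Sidekiq or Resque; handle_request and the job hash are invented):

```ruby
JOBS = Queue.new # stand-in for a real job queue

# The request handler only enqueues work and acknowledges immediately.
def handle_request(order_id)
  JOBS << { job: :send_receipt, order_id: order_id }
  { status: 202, body: "accepted" }
end

# A background worker drains the queue; a failed job could be re-enqueued.
worker = Thread.new do
  loop do
    job = JOBS.pop
    break if job == :stop
    # ... perform the slow work for `job` here ...
  end
end

puts handle_request(42)[:status] # => 202, before the work has run
JOBS << :stop
worker.join
```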

Scaling Reads with Caching and Denormalizing

Read-heavy systems mean some data is being read multiple times. This can be problematic because your database might not have enough capacity to deal with all of that work. The general approach to solving this is pre-computing the data (called denormalizing) and storing it somewhere fast. In practice, instead of letting each request hit multiple tables in a database, you pre-compute the expected response and store it in a single place. Ideally, you store this information somewhere that’s really fast to read from (think RAM). In practice, this means storing data in data stores like Memcached.
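A toy read-through cache (a plain Hash standing in for Memcached; cache_fetch, order_summary, and its canned result are invented) shows the shape of the pattern—compute once, then serve from memory:

```ruby
CACHE = {} # stand-in for Memcached

# Read-through fetch: return the cached value, or compute and store it.
def cache_fetch(key)
  CACHE.fetch(key) { CACHE[key] = yield }
end

DB_CALLS = [0] # counts how often the expensive path runs

def order_summary(shop_id)
  cache_fetch("summary:#{shop_id}") do
    DB_CALLS[0] += 1 # stands in for an expensive multi-table query
    { shop_id: shop_id, total_orders: 12 }
  end
end

order_summary(1)
order_summary(1)
puts DB_CALLS[0] # => 1; the expensive query ran only once
```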

Scaling Writes with Sharding, NoSQL Datastore, or Design Choices

Write-heavy systems tend to be difficult to deal with. Traditional relational databases can handle reads pretty well but have trouble with writes. They take more time processing writes because relational databases spend more effort on durability, which can lock up writes and create timeout errors.

Consider the scenario where a relational database is at its write capacity and you can’t scale up anymore. One solution is to write data to multiple databases. Sharding is the process of splitting your database into multiple parts (known as shards), which allows you to group related data into one database. Another method of dealing with a write-heavy system is writing to non-relational (NoSQL) databases. These databases are optimized to handle writes, but there’s a tradeoff. Depending on the type of NoSQL database and its configuration, it gives up:

  • atomic transactions (they don’t wait for other transactions to fully finish), 
  • consistency across multiple clusters (they don’t wait for other clusters to have the same data),
  • durability (they don’t spend time writing to disk). 

It may seem like you are giving up a lot, but you mitigate some of these losses with design choices. 

Design choices help you cover some of the weaknesses of SQL databases. For example, consider that updating rows is much more expensive than creating new rows. Design your system so you avoid updating the same row in multiple flows—insert new rows to avoid lock contention. With all of that said, I recommend starting out with a SQL database, and evolving your setup depending on your needs.
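
A sketch of the insert-instead-of-update idea, using SQLite for illustration (the table and column names are assumptions). Instead of repeatedly updating one `balance` row, every change appends a new row, and the current value is derived on read:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE balance_events (
        account_id INTEGER,
        delta      INTEGER,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def record_change(account_id, delta):
    # Insert a new row instead of UPDATE-ing a running total:
    # no two writers ever contend for the same row.
    conn.execute(
        "INSERT INTO balance_events (account_id, delta) VALUES (?, ?)",
        (account_id, delta),
    )

def current_balance(account_id):
    row = conn.execute(
        "SELECT COALESCE(SUM(delta), 0) FROM balance_events WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]
```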

Horizontal Scaling Is the Only Real Long Term Solution

Horizontal scaling refers to running your software on multiple small machines, while vertical scaling refers to running it on one large machine. Horizontal scaling is more fault tolerant, since the failure of one machine doesn’t mean an outage; instead, the work of the failed machine is routed to the other machines. In practice, horizontally scaling a system is the only long-term approach to scaling. All systems that appear ‘infinitely scalable’ are horizontally scaled under the hood: cloud object stores like S3 and GCS, NoSQL databases like Bigtable and DynamoDB, and stream processing systems like Kafka are all horizontally scaled. The cost of horizontal scaling is application and operational complexity. It takes significant time and effort to horizontally scale your system, but you want to be in a situation where you can linearly scale your system by adding more computers.

Things That are Harder to Test Are More Likely to Break

Among competing approaches to a problem, you should pick the most testable solution (this is my variant of Occam’s Razor). If something is difficult to test, people tend to avoid testing it. This means that future programmers (or you) will be less likely to fully test this system, and each change will make the system more brittle. This model is important to remember when you first tackle a problem because good testability needs to be baked into the architecture. You’ll know when something is hard to test because your intuition will tell you.

Antifragility and Root Cause Analysis

Nassim Taleb uses the analogy of the hydra in Antifragile: it grows back a stronger head every time it’s struck. The software industry has championed this idea too. Instead of treating failures as shameful incidents that should be avoided at all costs, they’re now treated as opportunities to improve the system. Netflix’s engineering team is known for Chaos Monkey, a resiliency tool that turns off random components. Once you anticipate random failures, you can build a more resilient system. When failures do happen, they’re treated as an opportunity to learn.

Root cause analysis is a process where the people involved in a failure try to extract the root cause in a blameless way: start with what went right, then dive into the failure without blaming anyone.

Big-O and Exponential Growth

The Big-O notation describes how an algorithm’s cost grows with the size of its input. There’s a lot to this, but you’ll get very far if you just understand the difference between constant, linear, and exponential growth. In layman’s terms: an algorithm whose cost stays flat regardless of input size (constant) beats one whose cost grows in proportion to the input (linear), which in turn beats one whose cost multiplies with every additional element (exponential). I have found this issue visible at an architectural level as well.
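
To make the growth classes concrete, here is a small illustrative sketch: a worst-case list scan does work proportional to the input (linear), a hash lookup takes a single probe at any size (constant), and enumerating subsets doubles with every element (exponential):

```python
def lookup_steps_in_list(items, target):
    """Linear: a list scan's work grows with the input size."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps

def lookup_steps_in_dict(index, target):
    """Constant: a hash lookup takes one probe regardless of size."""
    _ = index.get(target)
    return 1

def subsets(items):
    """Exponential: the number of subsets doubles with every element (2^n)."""
    result = [[]]
    for item in items:
        result += [s + [item] for s in result]
    return result
```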

Margin of Safety

Accounting for a margin of safety means leaving some room for errors or exceptional events. For example, you might be tempted to run each server at 90% of its capacity. While this saves money, it leaves your servers vulnerable to spikes in traffic. You’ll have more confidence in your setup if you have auto-scaling in place, but there’s a problem with this too: an overworked server can cause cascading failures in the whole system, and by the time auto-scaling kicks in, the new server may hit disk, connection pool, or an assortment of other random fun issues. Expect the unexpected and give yourself some room to breathe. Margin of safety also applies to planning releases of new software: add a buffer of time, because unexpected things will come up.

Protect the Public API

Be very careful when making changes to the public API. Once something is in the public API, it’s difficult to change or remove. In practice, this means having a very good reason for your changes, and being extremely careful with anything that affects external developers; mistakes in this type of work affect numerous people and are very difficult to revert.

Redundancy

Any system with many moving parts should be built to expect failures of individual parts. This means having backup providers for systems like Memcached or Redis. For permanent data stores like SQL, failovers and backups are critical. Keep in mind that you shouldn’t consider something a backup unless you do regular drills to make sure you can actually recover the data.

Loose Coupling and Isolation

Tight coupling means that different components of a system are closely interconnected. This has two major drawbacks. The first is that tightly coupled systems are more complex, and complex systems, in turn, are more difficult to maintain and more error prone. The second is that failure in one component propagates faster. When systems are loosely coupled, failures can be self-contained and components can be swapped out for backups (see Redundancy). At a code level, reducing tight coupling means following the single responsibility principle, which states that every class has a single responsibility and communicates with other classes through a minimal public API. At an architecture level, you improve tightly coupled systems by following service-oriented architecture, which divides components by their business services and only allows communication between those services through a strict API.

Be Serious About Configuration

Most failures in well-tested systems occur due to bad configuration, such as environment variable updates or DNS settings changes. Configuration changes are particularly error prone because of the lack of tests and the differences between the development and production environments. In practice, add tests that cover different configurations, and make the dev and prod environments as similar as possible. If something works in development but not production, spend some time thinking about why that’s the case.
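
One illustrative way to catch bad configuration early is to validate required settings at boot rather than deep inside a request. The setting names and types here are assumptions, not a real schema:

```python
import os

# Illustrative settings: each required variable and the type it must parse as.
REQUIRED = {"DATABASE_URL": str, "POOL_SIZE": int}

def load_config(env=os.environ):
    """Fail fast at boot if the environment is misconfigured."""
    config = {}
    for name, cast in REQUIRED.items():
        if name not in env:
            raise RuntimeError(f"missing required setting: {name}")
        try:
            config[name] = cast(env[name])
        except ValueError:
            raise RuntimeError(f"setting {name} is not a valid {cast.__name__}")
    return config
```

The same `load_config` can be run in a test against both dev-like and prod-like environments, which shrinks the gap the passage warns about.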

Explicit Is Better than Implicit

The explicit is better than implicit model is one of the core tenets from the Zen of Python, and it’s critical to improving code readability. It’s difficult to understand code that expects the reader to have all of the context of the original author. An engineer should be able to look at a class and understand where all of the different components come from. I have found that simply having everything in one place is better than convoluted design patterns. Write code for people, not computers.

Code Review

Code review is one of the highest leverage activities a developer can perform. It improves code quality and transfers knowledge between developers. Great code reviewers change the culture and performance of an entire engineering organization. Have at least two other developers review your code before shipping it. Reviewers should give thorough feedback all at once, as it’s really inefficient to have multiple rounds of reviews. You’ll find that your code review quality will slip depending on your energy level. Here’s an approach to getting some consistency in reviews: 

  1. Why is this change being made? 
  2. How can this approach or code be wrong? 
  3. Do the tests cover this or do I need to run it locally? 

Perceived Performance

Based on UX research, 0.1 second (100 ms) is the gold standard of loading time. Slower applications risk losing the user’s attention. Accomplishing this load time for non-trivial apps is actually pretty difficult, so this is where you can take advantage of perceived performance. Perceived performance refers to how fast your product feels. The idea is that you show users placeholder content at load time and then add the actual content on the screen once it finishes loading. This is related to the Do Minimal Upfront Work and Queue the Rest model.

Never Trust User Input Without Validating it First

The internet works because we managed to create predictable and secure abstractions on top of unpredictable and insecure networks of computers. These abstractions are mostly invisible to users but there’s a lot happening in the background to make it work. As an engineer, you should be mindful of this and never trust input without validating it first. There are a few fundamental issues when receiving input from the user.

  1. You need to validate that the user is who they say they are (authentication).
  2. You need to ensure that the communication channel is secure, and no one else is snooping (confidentiality).
  3. You need to validate that the incoming data was not manipulated in the network (data integrity).
  4. You need to prevent replay attacks and ensure that the same data isn’t being sent multiple times.
  5. You could also have the case where a trusted entity is sending malicious data.

This is simplified, and there are more things that can go wrong, so you should always validate user input before trusting it.
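
As one small illustration of the data-integrity point above, a shared-secret signature lets a receiver check that a payload wasn’t manipulated in transit. This is a sketch of just that one check, not a complete authentication scheme:

```python
import hmac
import hashlib

SECRET = b"shared-secret"  # illustrative; never hardcode real secrets

def sign(payload: bytes) -> str:
    """Sender attaches this signature to the payload."""
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    """Receiver recomputes the signature and compares."""
    expected = sign(payload)
    # compare_digest avoids leaking information through timing
    return hmac.compare_digest(expected, signature)
```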

Safety Valves

Building a system means accounting for all possibilities. In addition to worst case scenarios, you have to be prepared to deal with things that you cannot anticipate. The general approach for handling these scenarios is stopping the system to prevent any possible damage. In practice, this means having controls that let you reject additional requests while you diagnose a solution. One way to do this is adding an environment variable that can be toggled without deploying a new version of your code.
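
A sketch of such a safety valve: a single environment variable that, when flipped, sheds incoming work without a deploy (the variable and handler names are illustrative):

```python
import os

def requests_allowed(env=os.environ):
    """Kill switch: flip one variable to stop taking traffic, no deploy needed."""
    return env.get("ACCEPT_REQUESTS", "true").lower() != "false"

def handle(request, env=os.environ):
    if not requests_allowed(env):
        # Reject early while you diagnose, instead of doing possible damage.
        return {"status": 503, "body": "temporarily unavailable"}
    return {"status": 200, "body": "ok"}
```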

Automatic Cache Expiration

Your caching setup can be greatly simplified with automatic cache expiration. To illustrate why, consider the example where the server renders a product on a page, and you want to expire the cache whenever that product changes. The manual method is running cache-expiration code after the product is changed. This requires two separate steps: 1) changing the product, and then 2) expiring the cache. If you build your system with key-based caching, you avoid the second step altogether. It’s typically done by using a combination of the product’s ID and its last_updated_at_timestamp as the key for the product’s cache. When a product changes, it gets a different last_updated_at_timestamp field. Since you’ll have a different key, you won’t find anything in the cache matching that key and will fetch the product in its newest state. The downside of this approach is that your cache datastore (e.g., Memcached or Redis) fills up with stale entries. You can mitigate this by adding an expiry time to all caches so old entries automatically disappear. You can also configure Memcached to evict the oldest entries to make room for new ones.
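
A minimal sketch of key-based caching, with a dict standing in for Memcached or Redis (the product shape and render output are illustrative):

```python
cache = {}  # stands in for Memcached or Redis

def product_cache_key(product):
    # The timestamp is part of the key, so updating the product
    # "expires" the old entry: its key is simply never read again.
    return f"product:{product['id']}:{product['last_updated_at_timestamp']}"

def render_product(product):
    key = product_cache_key(product)
    if key not in cache:
        cache[key] = f"<h1>{product['title']}</h1>"  # the expensive render
    return cache[key]
```

Note that the stale entry stays in the store until a TTL or eviction policy removes it, which is exactly the downside described above.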

Introducing New Tech Should Make an Impossible Task Possible or Something 10x Easier

Most companies eventually have to evaluate new technologies; in the tech industry, you have to do this to stay relevant. However, introducing a new technology has two negative consequences. First, it becomes more difficult for developers to move across teams, which creates knowledge silos within the company and slows down career growth. Second, fewer libraries and insights can be shared across the company because of the tech fragmentation. Moving to new tech often comes up because of people’s tendency to want to start over and write things from scratch—it’s almost always a bad idea. On the other hand, there are cases where introducing a new technology makes sense: when it enables your company to take on previously impossible tasks, or when a technical limitation of your current stack is preventing you from reaching your product goals.

Failure Modes

Designers and product folks focus on the expected use cases. As an engineer, you also have to think about the worst-case scenarios, because that’s where the majority of your time will go. At scale, all bad things that can happen do happen. Asking “What could go wrong?” or “How can I be wrong?” really helps; these questions also cancel out our bias towards confirming our existing ideas. Think about what happens when no data, or a lot of data, is flowing through the system (think “Min-Max”). Expect computers to occasionally die and handle those cases gracefully, and expect network requests to be slow or stall altogether.

Management Mental Models

The key insight here is that "Management" might be the wrong name for this discipline altogether. What you are really doing is growing people. You’ll rarely have to manage others if you align your interests with those of the people who report to you. Compared to engineering, management is fuzzy and subjective, which is why engineers struggle with it. It's really about calibration: you are calibrating approaches for yourself and your reports. What works for you might not work for me, because the world has different expectations of us. Likewise, just reading books on this stuff doesn't help much, because advice from a book is calibrated for its author.

With that said, I believe the following mental models are falsifiable. I also believe that doing the opposite of these will always be harmful. I find these particularly valuable while planning my week. Enjoy!

Create Motivation by Aligning Incentives

Incentives drive our behavior above all else. You probably feel this yourself when you procrastinate on tasks that you don't really want to do. Work with your reports to identify the intersection of:

  1. What do they want to work on?
  2. What does the product need?
  3. What does the company need?

Venn diagram: the intersection of the three incentives

Magic happens when these questions produce themes that overlap. The person will have intrinsic motivation for tasks that build their skills, improve their product, and help their company. Working on themes where just two of the three intersect can be fruitful too. You can replace 'product' with 'direct team' if appropriate.

 Occasionally you'll find someone focusing on a task that's only:

  • done because that's what the person wants (neglecting the product and the company)
  • what the product needs (neglecting the person's needs or the company)
  • what the company wants (neglecting the person's needs or their product). 

This is fine in the short term, but not a good long term strategy. You should nudge your reports towards these overlapping themes.

Create Clarity by Understanding the "Why" and Having a Vision for the Product

You should have a vision for where your product needs to go. This ends up being super helpful when deciding between competing tactical options, and it also helps clear up general confusion. You must communicate the vision to your team. While being a visionary isn't in your job description, aligning on a "why" often counteracts the broken-telephone effect in communication and entropy in organizations.

Focus on High Leverage Activities

This is the central idea in High Output Management. The core idea is similar to the Pareto principle: focus your energy on the 20% of tasks that have 80% of the impact. If you don't do this, your team will spend a lot of time but not accomplish much. So, take some time to plan your approach and focus on the activities that give you the most leverage. I found Donella Meadows’s research on leverage points super useful for understanding leverage. A few examples of this include:

Promote Growth Mindset in Your Team

If you had to start with zero, what you'd want is the ability to acquire new skills. You want to instil a growth mindset in yourself and the rest of the team. Create an environment where reflection happens and failures are talked about openly; the lessons you truly learn are the ones you learn by making mistakes. Create a culture where people obsess about the craft and consider failures a learning opportunity.

Align Your Team on the Common Vision

Aligning your team towards a common direction is one of the most important things you can do; it means people pull in the same general direction.

Build Self-organizing Teams

Creating self-sufficient teams is the only way to scale yourself. You can enable a team to do this by promoting a sense of ownership. You can give your input without taking authority away from others and offer suggestions without steamrolling leaders.

Communication and Structural Organization

You should focus on communication and organization tools that keep the whole team organized. Communication fragmentation leads to massive waste.

Get the Architecture Right

This is where your engineering chops will come in handy. From a technical perspective, getting the core pieces of the architecture right ends up being critical and defines the flow of information in the system.

Don’t Try to be Efficient with Relationships

As an engineer, your brain is optimized to seek efficiency. But efficiency isn’t a good approach when it comes to relationships with people; it often has the opposite of the intended effect. I have found that 30-minute one-on-ones with reports are too short. You want to leave time for banter and a free flow of information. This eases people up; you have better conversations, and they often end up sharing more critical information than they would otherwise. Of course, you don't want to spend all your time in meetings, so I prefer longer, infrequent meetings over frequent short ones.

This model also applies to pushing for changes or influencing others in any way. This is a long game, and you should be prepared for that. Permanent positive behavioral changes take time.

Hire Smart People Who Get Stuff Done and You Want to Be Around

"Pick business partners with high intelligence, energy, and, above all, integrity." -@Naval

Hiring, when done right, is one of the highest-leverage activities that you can work on. You are looking for three key signals when hiring:

  • smart
  • gets stuff done
  • good to be around. 

When looking for the "smart" signal, be aware of the "halo effect" and be wary of charmers. "Gets stuff done" is critical because you don't want to be around smart people who aren’t adding value to your company. Just like investing, your aim should be to identify people on a great trajectory. "Good to be around" is tricky because it's filled with personal bias. A good rule is to never hire assholes. Even if they are smart and get stuff done, they’ll do it at the expense of others and wreak havoc on the team's culture. Avoid! Focus on hiring people that you’d want long-term relationships with.

Be Useful

You could behave in a number of different ways in any given interaction. What you want is to be useful and add value. There is a useful way of giving feedback to someone who reports to you, and a useful way to review code. Apply this to yourself and plan your day to be more useful to others. Our default instinct is to seek confirmation and think about how we are right, and we don’t give others the same courtesy. The right approach is to reverse that default: think “how can I make this work” for other people’s ideas, and “how can this be wrong” for your own.

Don’t compete with your reports either. As a manager, this is particularly important because you want to grow your reports. Be aware of situations or cases where you might be competing, and default to being useful instead of pushing your own agenda.

Get the Requirements Right Early and Come Up with a Game Plan

Planning gets a bad rep in agile organizations, but it ends up being critical in the long term. Some planning almost always ends up being much better than no planning at all. What you want to do is plan until you have a general direction defined, then start iterating towards it. A few questions can help get these requirements right:

  • What are the things that we want in the near future? You want to pick the path that gives you the most options for the expected future.
  • How can this be wrong? Counteract your confirmation bias with this question by explicitly thinking about failure modes.
  • Where do you not want to go? Inversion ends up being really useful; it’s easier to avoid stupidity than to seek brilliance.
  • What happens once you get there? Seek second order effects. What will one path unlock or limit?
  • What other paths could you take? Has your team settled on a local maxima instead and not the global maxima?

Once you have a decent idea of how to proceed with this, you are responsible for communicating this plan with the rest of the team too. Not getting the requirements right early on means that your team can potentially end up going in the wrong direction which ends up being a net negative.

Establish Rapport Before Getting to Work

You will be much more effective at work, if you connect with the other people before getting to work. This could mean banter, or just listening—there’s a reason why people small-talk.  Get in the circle before attempting to change the circle. This leads to numerous positive improvements in your workflow. Slack conversations will sound like conversations instead of arguments and you'll assume positive intent.  You’ll also find that getting alignment in meetings and nudging reports towards positive changes ends up being much more useful this way. Icebreakers in meetings and room for silliness helps here.

There Is No One-size-fits-all Approach to People. Personality Tests Are Good Defaults

Management is about calibration. You are calibrating  your general approach to others, while calibrating a specific approach to each person. This is really important because an approach that might work for one person won’t work on others. You might find that personality tests like the Enneagram serve as great defaults to approaching a person. Type-5, the investigators, work best when you give them autonomy and new ideas. Type-6, the loyalists, typically want frequent support and the feeling of being entrusted. The Last Dance miniseries on Netflix is a master class on this topic.

Get People to Lead with Their Strengths and Address Their Growth Areas as a Secondary Priority

There are multiple ways to approach a person's personal growth. I’ve found that what works best is first identifying their strengths and then their areas of improvement. Find people’s strengths and obsessions, then point them to those. You have to get people to lead with their strengths. It’s the right approach because it gives people confidence and momentum; it turns out that’s also how they add the most value. Ideally, one should develop to be more well-rounded, so it’s also important to come up with a game plan for addressing any areas of improvement.

Focus on the Positives and Don't Over Index on the Negatives

For whatever reasons, we tend to focus on the negatives more than we should. It might be related to "deprival super reaction syndrome" where we hate losing more than we like winning. In management, we might have the proclivity to focus on what people are doing poorly instead of what they’re doing well. People may not feel appreciated if you only focus on the negatives. I believe this also means that we end up focusing on improving low-performers more than amplifying high-performers. Amplifying high-performers may have an order of magnitude higher impact.

People Will Act Like Owners if You Give Them Control and Transparency

Be transparent and don’t do everything yourself. Talk to people and make them feel included. When people feel left out of the loop, they generally grow more anxious as they feel that they’re losing control. Your ideal case is that your reports act like owners. You can do this by being transparent about how decisions are made. You also have to give others control and autonomy. Expect some mistakes as they calibrate their judgement and nudge in the right direction instead of steamrolling them.

There are other times where you'll have to act like an owner and lead by example. One hint of this case will be when you have a nagging feeling about something that you don't want to do. Ultimately, the right thing here is to take full ownership and not ask others to do what you wouldn't want.

Have High Standards for Yourself and the People Around You

Tech markets and products are generally winner-take-all. This means that second place isn’t a viable option—winning, in tech, leads to disproportionately greater rewards. Aim to be the best in the world in your space and iterate towards that. There’s no point in doing things in a mediocre way.  Aiming high, and having high standards is what pushes everything forward.

To make this happen, you need to partner with people who expect more from you. You should also have high standards for your reports. One interesting outcome of this is that you get the positive effects of the Pygmalion effect: people will rise to your positive expectations.

Hold People Accountable

When things don't get delivered or when you see bad behavior, you have to have high standards and hold people accountable. If you don't, that's an implicit message that these bad outcomes are ok. There are many cases where not holding others accountable can have a spiraling effect.

Your approach to this has to be calibrated for your work style and the situation. Ultimately, you should enforce all deal-breaker rules. Set clear expectations early on. When something goes wrong, work with the other person or team to understand why. Was it a technical issue, does the tooling need to be improved, or was it an issue with leadership?

Bring Other People Up with You

We like working with great and ambitious people because they raise the bar for everyone else. We’re allergic to self-obsessed people who only care about their own growth. Your job as a manager is to bring other people up. Don’t take credit for work that you didn’t do and give recognition to those that did the work. What's interesting is that most people only really care about their own growth. So, being the person who actually spends time thinking about the growth of others differentiates you, making more people want to work with you.

Maintain Your Mental Health with Mindfulness, Rest, and Distance

A team's culture is top down and bottom up. This means people mimic the behavior of those in positions of authority—for better or for worse. Keeping this in mind, you have to be aware of your own actions. Generally, most people become less aware of their actions as fatigue builds up, so be mindful of your energy levels when entering meetings. Energy and positivity are phenomenal because they're something you can give to others, and they don't cost you anything.

Stress management is another important skill to develop. Most people can manage problems with clear yes/no solutions. Trickier problems with nuances and unclear paths, or split decisions tend to bubble up. Ambiguity and conflicting signals are a source of stress for many people and treating this like a skill is really important. Dealing with stressful situations by adding more emotion generally doesn't help. Keep your cool in stressful situations.

Aim to Operate Two to Three Months Into the Future

As an engineer, you typically operate on a scope of days to weeks. As you expand your influence, you have to start thinking over a longer time horizon. As a manager, your horizon will be longer than a typical engineer's but shorter than that of someone focused on high-level strategy. This means you need to project your team and reports a few months into the future and anticipate their challenges. Ideally, you can help resolve these issues before they even happen. This exercise also helps you be proactive instead of reacting to daily events.

Give Feedback to People That You Want Long Term Relationships with

Would you give feedback to a random stranger doing something that's bad for them? Probably not. Now, imagine that you knew this person. You would try to reason with this person, and hopefully nudge them in the right direction. You give feedback to people that you care about and want a long term relationship with. I believe this is also true at work. Even if someone isn’t your report, it’s worth sharing your feedback if you can deliver it usefully.

Giving feedback is tricky since people often get defensive. There are different schools of thought on this, but I try to build rapport with someone before giving them direct feedback. Once you convince someone that you are on their side, they're much more receptive to it. Get in the circle. While code review feedback is best delivered all at once, that isn't necessarily true for one-on-one feedback. Many people default to quick feedback, but I think that doesn't work for people you don't have good rapport with; it only really works if you are in a position of authority. The shortest path is not always the path of least resistance, so build rapport before getting to work.


How to Do an In-depth Liquid Render Analysis with Theme Inspector

Shopify’s Online Store provides great flexibility for building themes that represent the brand of a merchant’s online store. However, as with any programming language, one often writes code without being aware of all the performance impact it creates. Whether it’s performance impact on Shopify’s servers or observable performance impact in the browser, ultimately it’s the customers that experience the slowness. The speed of server-side rendering is one of the most important performance timings to optimize for, because customers wait on a blank screen until server-side rendering completes. Even though we’re working hard to make server-side rendering as fast as possible, bottlenecks may originate from the Liquid source itself.

In this article, I’ll look at:  

  • how to interpret flame graphs generated by the Shopify Theme Inspector
  • what kinds of flame graphs unoptimized Liquid code patterns generate
  • tips for spotting and avoiding these performance issues.

Install the Shopify Theme Inspector

With a Google Chrome browser, install the Shopify Theme Inspector extension. Follow this article on Debug Liquid Render Performance with Shopify Theme Inspector for Chrome for how to start with the extension and get to a point where you can produce a flame graph on your store.

A flame graph example

The flame graph produced by this tool is a data representation of the code path and the time it took to execute. With this tool, as a developer, you can find out how long a piece of code took to render.

Start with Clean Code

We often forget what a clean implementation looks like. This is how we at Shopify envision a piece of Liquid code being used—it’s often not the reality, as developers find their own ways to achieve their goals. As time passes, code becomes complicated, and we need to go back to the clean implementation to understand what makes it take time to render.
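
As an illustrative sketch (the exact snippet is an assumption), a clean paginated loop that renders product titles from a standard collection template might look like:

```liquid
{% paginate collection.products by 10 %}
  {% for product in collection.products %}
    {{ product.title }}
  {% endfor %}
{% endpaginate %}
```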

The simple code above creates a flame graph that looks like this image below:

Flame graph for a 10 item paginated collection

The template section took 13 ms to complete rendering. Let’s get a better understanding of what we’re seeing here.

Highlighted flame graph for a 10 item paginated collection

The area where the server took time to render is where the code for the pagination loop executes. In this case, we rendered 10 product titles. Then there’s a block of time that seems to disappear. It’s actually the time spent on Shopify’s side collecting all the information that belongs to the products in the paginated collection.

Look at Inefficient Code

To know what inefficient code is, one must know what it looks like, why it’s slow, and how to recognize it in a flame graph. This section walks through side-by-side comparisons of code and its flame graphs, showing how a simple change can result in bad performance.

Heavy Loop

Let’s take that clean code example and make it heavy.
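As a sketch (the specific attributes accessed are assumptions), each iteration now reads a handful of product attributes without changing the rendered markup:

```liquid
{% paginate collection.products by 10 %}
  {% for product in collection.products %}
    {% comment %} Each attribute access adds server-side work per iteration. {% endcomment %}
    {% assign desc = product.description %}
    {% assign vendor = product.vendor %}
    {% assign price = product.price %}
    {% assign available = product.available %}
    {% assign first_variant = product.variants.first %}
    <h2>{{ product.title }}</h2>
  {% endfor %}
{% endpaginate %}
```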

What I’ve done here is access attributes on each product while iterating through the collection. Here’s the corresponding flame graph:

Flame graph for a 10 item paginated collection with access to its attributes

The total render time of this loop is now at 162 ms compared to 13 ms from the clean example. The product attributes access changes a less than 1 ms render time per tile to a 16 ms render time per tile. This produces exactly the same markup as the clean example but at the cost of 16 times more rendering time. If we increase the number of products to paginate from 10 to 50, it takes 800 ms to render.


To avoid heavy loops:

  • Instead of focusing on how many 1 ms bars there are, focus on the total rendering time of each loop iteration
  • Clean up any attributes that aren’t being used
  • Reduce the number of products on a paginated page (potentially AJAX the next page of products)
  • Simplify the functionality of the rendered product

Nested Loops

Let’s take that clean code example and make it render with nested loops.
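A sketch of the pattern (option and variant details are assumptions), with two inner loops nested inside the product loop:

```liquid
{% paginate collection.products by 10 %}
  {% for product in collection.products %}
    <h2>{{ product.title }}</h2>
    {% comment %} Inner loop 1: the product's options. {% endcomment %}
    {% for option in product.options_with_values %}
      <span>{{ option.name }}</span>
    {% endfor %}
    {% comment %} Inner loop 2: the product's variants. {% endcomment %}
    {% for variant in product.variants %}
      <span>{{ variant.title }}</span>
    {% endfor %}
  {% endfor %}
{% endpaginate %}
```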

This code snippet is a typical example of iterating through the options and variations of a product. Here’s the corresponding flame graph:

Flame graph for two nested loop example

This code snippet is a two-level nested loop rendering at 55 ms.

Nested loops are hard to notice just by looking at code, because the loops are often separated across files. In the flame graph, however, we can see the stack start to grow deeper.

Flame graph of a single loop on a product

As highlighted in the above screenshot, the two inner for-loops stack side by side. This is okay if there are only one or two loops; however, each iteration’s rendering time will vary based on how many inner iterations it has.

Let’s look at what a three-level nested loop looks like.

Flame graph for three nested loop example

This three-level nested loop rendered at 72 ms. Nesting can get out of hand really quickly if we aren’t careful: a small addition to the code inside the loops could blow your server rendering time budget.


To spot and avoid nested loops:

  • Look for a sawtooth-shaped flame graph to target potential performance problems
  • Evaluate each flame graph layer and see if the nested loops are required

Mixed Usage of Multiple Global Liquid Scopes

Let’s now take that clean code example and add access to another globally scoped Liquid variable.
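A sketch of the pattern (markup and matching logic are assumptions): inside each variant iteration, the globally scoped cart object is accessed as well:

```liquid
{% paginate collection.products by 10 %}
  {% for product in collection.products %}
    <h2>{{ product.title }}</h2>
    {% for variant in product.variants %}
      {% comment %} Global cart access inside the loop: cost grows with cart size. {% endcomment %}
      {% for item in cart.items %}
        {% if item.variant_id == variant.id %}
          <span>Already in your cart</span>
        {% endif %}
      {% endfor %}
    {% endfor %}
  {% endfor %}
{% endpaginate %}
```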

And here’s the corresponding flame graph:

Flame graph of when there’s one item in the cart with rendering time at 45 ms

Flame graph of when there’s 10 items in the cart with rendering time at 124 ms

This flame graph shows a badly nested loop where each variant iteration accesses the cart items. As more items are added to the cart, the page takes longer to render.


To spot and avoid this pattern:

  • Look for a hair-comb or sawtooth-shaped flame graph to target potential performance problems
  • Compare flame graphs between one item and multiple items in the cart
  • Don’t mix global Liquid variable usage. If you have to, use AJAX to fetch cart items instead

What is Fast Enough?

Try to aim for 200 ms, and no more than 500 ms, of total page rendering time as reported by the extension. We didn’t just pick a number out of a hat; it comes from carefully allocating the time available during a page load if we want to hit a performance goal. Google Web Vitals states that a good score for Largest Contentful Paint (LCP) is less than 2.5 seconds. However, the LCP depends on many other timings, like Time to First Byte (TTFB) and First Contentful Paint (FCP). So, let’s make a time allocation! First, let’s understand what each segment represents:

Flow diagram: Shopify Server to Browser to FCP to LCP

  • From Shopify’s server to a browser is the network overhead time required. It varies based on the network the browser is on. For example, navigating your store on 3G or Wi-Fi.
  • From a browser blank page (TTFB) to showing anything (FCP) is the time the browser needs to read and display the page.
  • From the FCP to the LCP is the time the browser needs to fetch all other resources (images, CSS, fonts, scripts, video, etc.) to complete the page.

The goal is an LCP of less than 2.5 seconds to receive a good score:

  • Server → Browser: 300 ms for network overhead
  • Browser → FCP: 200 ms for the browser to do its work
  • FCP → LCP: 1.5 seconds for above-the-fold images and assets to download

Which leaves us 500 ms for total page render time.

Does this mean that as long as we keep server rendering below 500 ms, we get a good LCP score? No: there are other considerations, like the critical rendering path, that aren’t addressed here, but we’re at least halfway there.


  • Optimizing the critical rendering path at the theme level can bring the 200 ms browser-to-FCP allocation down to a lower number.

So, we have 500 ms for total page render time, but this doesn’t mean you have all 500 ms to spare. There’s some mandatory server render times that are dedicated to Shopify and others that the theme dedicates to rendering global sections like the header and footer. Depending how you want to allocate the rendering resources, the available rendering time you leave yourself for the page content varies. For example:


Total page render budget: 500 ms

  • Shopify (content for header): 50 ms
  • Header (with menu): 100 ms
  • Footer: 25 ms
  • Other global sections: 25 ms
  • Page content: 300 ms

I mentioned trying to aim for 200 ms total page rendering time—this is a stretch goal. By keeping ourselves mindful of a goal, it’s much easier to start recognizing when performance starts to degrade.

An Invitation to the Shopify Theme Developer Community

We couldn’t possibly know every combination of how the world is using Shopify. So, I invite you to share your experience with Shopify’s Theme Inspector and let us know how we can improve, or tweet us at @shopifydevs.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

How to Build an Experiment Pipeline from Scratch

One of the most compelling ways to prove the value of any decision or intervention—to technical and non-technical audiences alike—is to run an A/B test. But what if that wasn’t an option on your current stack? That’s the challenge we faced at Shopify. Our amazing team of engineers built robust capabilities for experimentation on our web properties and Shopify admin experiences, but testing external channels like email was unexplored. When it came time to ship a new recommendation algorithm that generates personalized blog post suggestions, we had no way to measure its incremental benefit against the generic blog emails.

To address the problem I built an email experimentation pipeline from the ground up. This quick build helps the Marketing Data Science team solve challenges around experimentation for external channels, and it’s in use by various data teams across Shopify. Below is a guide that teaches you how to implement a similar pipeline with a relatively simple setup from scratch. 

The Problem

Experimentation is one of the most valuable tools for data teams, providing a systematic proof of concept mechanism for interface tweaks, product variations, and changes to the user experience. With our existing experimentation framework Verdict, we can randomize shops, as well as sessions for web properties that exist before the login gate. However, this didn’t extend to email experiments since the randomization isn’t triggered by a site visit and the intervention is in the user’s inbox, not our platform.

As a result, data scientists randomized emails themselves, shipped the experiment, and stored the results in a local text file. This was problematic for a number of reasons: 

  1. Local storage isn’t discoverable and can be lost or deleted. 
  2. The ad hoc randomization didn’t account for users that unsubscribed from our mailing list and didn’t resolve the many-to-many relationship of emails to shops, creating the risk for cross-contamination between the variations. Some shops have multiple staff each with an email address, and some people create multiple shops under the same email address.
  3. Two marketers can simultaneously test the same audience with no exclusion criteria, violating the assumption that all other variables are controlled. 

Toward the end of 2019, as email experimentation grew even more popular among marketers and the marketing department at Shopify expanded, it became clear that a solution was overdue.

Before You Start

There are few things I find more mortifying than shipping code just to ship more code to fix your old code, and I’m no stranger to this. My pull requests (PRs) were rigorously reviewed, but the reviewers and I were in uncharted territory. Exhibit A: a selection of my failed attempts at building a pipeline through trial and error:

GitHub PR montage showing a series of bug fixes

All that to say that requirements gathering isn’t fun, but it’s necessary. Here are some steps I’d recommend before you start.

1. Understanding the Problem

The basic goal is to create a pipeline that can pull a group constrained by eligibility criteria, randomly assign each person to one of many variations, and disseminate the randomized groups to the necessary endpoints to execute the experiment. The ideal output is repeatable and usable across many data teams. 

We define the problem as: given a list of visitors, we want to randomize them so that each person is in at most one experiment at a time, and so that the experiment subjects can be fairly split among data scientists who want to test on a portion of the visitor pool. At this point, we won’t outline the how; we’re just trying to understand the what.

2. Draw a System Diagram

Get the lay of the land with a high-level map of how your solution will interact with its environment. It’s important to be abstract to prevent prescribing a solution; the goal is to understand the inputs and outputs of the system. This is what mine looked like:

Example of a system diagram for email experiment pipeline

In our case, the data come from two sources: our data warehouse and our email platform.

In a much simpler setup—say, with no ETL at all—you can replace the inputs in this diagram with locally-stored CSVs and the experiment pipeline can be a Jupyter notebook. Whatever your stack may be, this diagram is a great starting point.
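For instance, a notebook-grade pipeline only needs a deterministic way to bucket subjects. The sketch below (the general idea only, not Shopify’s Verdict implementation; the function and variant names are mine) hashes each subject together with the experiment name, so assignments are stable across reruns with no state to store:

```python
import hashlib

def assign_variant(subject_id, experiment, variants=("control", "treatment")):
    """Deterministically bucket a subject for an experiment.

    Hashing the subject with the experiment name means the same subject
    always lands in the same bucket for a given experiment, while different
    experiments produce independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{subject_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Mapping this function over each row of a locally stored CSV is essentially the whole notebook version of the pipeline.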

3. Plan the Ideal Output

I anticipated the implementation portion to be complicated, so I started by whiteboarding my ideal production table and reverse-engineered the necessary steps. Some of the immediate decisions that arose as part of this exercise were:

  1. Choosing the grain of the table: subjects will get randomized at the shop grain, but the experience of the experiment variation is surfaced with the primary email associated with that shop.
  2. Considering necessary resolvers: each experiment is measured on its own success metric, meaning the experiment output table needs to be joined to other tables in our database.
  3. Compatibility with existing analysis framework: I didn’t want to reinvent the wheel; we already have an amazing experiment dashboard system, which can be leveraged if my output is designed with that in mind.

I built a table with one row per email, per shop, per experiment, with some additional attributes detailing the timing and theme of the experiment. Once I had a rough idea of this ideal output, I created a mock version of an experiment with some dummy data in a CSV file, which I uploaded as a temporary table in our data warehouse. With this, I brainstormed some common use cases and experiment KPIs and attempted to query my fake table. This allowed me to identify pitfalls in my first iteration; for example, I realized that my keys wouldn’t be compatible with the email engagement data we get from the API of the email platform that sends our emails.

I sat with some stakeholders that included my teammates, members of the experimentation team, and non-technical members of the marketing organization. I did a guided exercise where I asked them to query my temporary table and question whether the structure can support the analysis required for their last few email experiments. In these conversations, we nailed down several requirements: 

  • Exclude subjects from other experiments: all subjects in a current experiment should be excluded from other experiments for a minimum of 30 days, but the tool should support an override for longer exclusion periods for testing high risk variations, such as product pricing.
  • Identify missing experiment category tags: the version of the output table I had was missing the experiment category tags (e.g. research, promotional), which are helpful for experiment discoverability.
  • Exclude linked shops: if an email was linked to multiple shops that qualified for the same experiment, all shops linked to that email should be excluded altogether.
  • Enable on-going randomization of experiments: the new pipeline should allow experiments to randomize on an ongoing basis, assigning new users as they qualify over time (as opposed to a one-time batch randomization).
  • Backfill past experiments into the pipeline: all past email experiments needed to be backfilled into the pipeline, and if a data scientist inadvertently bypassed this new tool, the pipeline needs to support a way to backfill these experiments as well. 

After a few iterations and with stakeholders’ blessing, I was ready to move to technical planning.

4. Technical Planning

At Shopify, all major projects are drafted in a technical document that’s peer-reviewed by a panel of data scientists and relevant stakeholders. My document included the ideal output and system requirements I’d gathered in the planning phase, as well as expected use cases and sample queries. I also had to draw a blueprint for how I planned to structure the implementation in our ETL platform. After chatting with my lead and discussions on Slack, I decided to build the pipeline in three stages, demonstrated by the diagram below.

Proposed high-level ETL structure for the randomization pipeline

Data scientists may need to ship experiments simultaneously, so for the first phase I needed to create an experiment definition file that defines the criteria for candidates in the form of a SQL query. For example, you may want to limit a promotional offer to shops that have been on the platform for at least a year, and only in a particular region. The definition file also lets you tag your experiment with the necessary categories and specify a maximum sample size, if applicable. All experiment definition files are validated against an output contract, since they need to agree in shape to be unioned in the next phase.
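As a sketch, such a definition file might look like the following. Shopify’s actual file format isn’t shown in the post, so every field name here is illustrative only:

```python
# Hypothetical experiment definition file; all field names are illustrative.
EXPERIMENT = {
    "name": "promo_offer_emea_2020",
    "categories": ["promotional"],  # tags for experiment discoverability
    "exclusion_days": 30,           # minimum cross-experiment exclusion window
    "max_sample_size": 50_000,      # optional cap on the audience
    # Eligibility criteria expressed as SQL: shops at least a
    # year old, in one region.
    "eligibility_sql": """
        SELECT shop_id, primary_email
        FROM shops
        WHERE created_at <= CURRENT_DATE - INTERVAL '1 year'
          AND region = 'EMEA'
    """,
}
```

Because every definition exposes the same fields, phase two can validate and union them mechanically.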

Phase two contains a many-to-one transform stage that consolidates all incoming experiments into a single output. If an experiment produces randomizations over time, it continues to append new rows incrementally. 

In phase three, the table is filtered down in many ways. First, users that have been chosen for multiple experiments are only included in the first experiment to avoid cross-contamination of controlled variables. Additionally, users with multiple shops within the same experiment are excluded altogether. This is done by deduping a list at the email grain with a lookup at the shop grain. Finally, the job adds features such as date of randomization and indicators for whether the experiment included a promotional offer.
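The exclusion and dedupe rules in this phase can be sketched in plain Python (the actual job is a PySpark transform, and the helper and column names here are assumptions):

```python
from collections import defaultdict

def filter_randomized(rows):
    """Phase-three filtering, sketched in plain Python.

    - An email linked to multiple shops within the same experiment is
      excluded altogether.
    - Each email is kept only in the first experiment it was chosen for,
      avoiding cross-contamination of controlled variables.
    """
    rows = sorted(rows, key=lambda r: r["randomized_at"])

    # Dedupe at the email grain with a lookup at the shop grain.
    shops_per_email = defaultdict(set)
    for r in rows:
        shops_per_email[(r["experiment"], r["email"])].add(r["shop_id"])

    first_experiment = {}
    kept = []
    for r in rows:
        if len(shops_per_email[(r["experiment"], r["email"])]) > 1:
            continue  # same email, multiple shops, same experiment: drop all
        assigned = first_experiment.setdefault(r["email"], r["experiment"])
        if assigned == r["experiment"]:
            kept.append(r)
    return kept
```

The real job would then append features such as the date of randomization and the promotional-offer indicators before writing the final table.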

With this blueprint in hand, I scheduled a technical design review session and pitched my game plan to a panel of my peers. They challenged potential pitfalls, provided engineering feedback, and ultimately approved the decision to move into build.

5. Building the Pipeline

Given the detailed planning, the build phase followed as the incremental implementation of the steps described above. I built the jobs in PySpark and shipped them in increments small enough to be consumable by code reviewers, since the PRs totalled several thousand lines of code.

6. Ship, Ship, Ship! 

Once all PRs were shipped into production, the tool was ready to use. I documented its use in our internal data handbook and shared it with the experimentation team. Over the next few weeks, we successfully shipped several email experiments using the pipeline, which allowed me to work out small kinks in the implementation as well. 

The biggest mistake I made in the shipping process was not sharing the tool widely enough across the data organization. I found that many data scientists didn’t know the tool existed and continued to use local files as their workaround. Better late than never: I did a more thorough job of sending team-wide emails, posting in relevant Slack channels, and setting up GitHub alerts to notify me when other contributors edit experiment files.

As a result, over the past year the tool has been used not only by Marketing Data Science, but across the data organization by teams focused on shipping, retail, and international growth. The table produced by this pipeline integrated seamlessly with our existing analysis framework, so no additional work is required to see statistical results once an experiment is defined.

Key Takeaways

To quickly summarize, the most important takeaways are:

  1. Don’t skip out on requirement gathering! Understand the problem you’re trying to solve, create a high-level map of how your solution will interact with its environment, and plan your ideal output before you start.
  2. Draft your project blueprint in a technical document and get it peer-reviewed before you build.
  3. When building the pipeline, keep PRs smaller where possible, so that reviewers can focus on detailed design recommendations and so production failures are easier to debug.
  4. Once shipped, make sure you share effectively across your organization.

Overall, this project was a great lesson that a simple solution, built with care for engineering design, can quickly solve for the long-term. In the absence of a pre-existing A/B testing framework, this type of project is a quick and resourceful way to unlock experimentation for any data science team with very few requirements from the data stack.

If you’re passionate about experiments and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

Images as Code: Representing Localized and Evolving Products on Marketing Pages

Last year, our marketing team kicked off a large effort, based on user research, to revamp our website to better serve the needs of our site visitors. We recognized that visitors wanted to see screenshots and visuals of the product itself; however, we found that most of the screenshots across our website were outdated.

Old Shopify POS software

The above image was used to showcase our Shopify POS software; however, it misrepresented our product once our POS software was updated and rebranded.

While we first experimented with a Scalable Vector Graphics (SVG) based solution for these visuals, we found that it wouldn’t scale, forcing us to restrict usage to only “high-value” pages. Still, other teams expressed interest in the approach, so we recreated these visuals in HTML and JavaScript (JS) and compared the lift between the two. The biggest question was how to get these to resize in a given container: with SVG, all content, including text size, grows and shrinks proportionally at a width of 100%, appearing as an image to users. With CSS, there’s no way to get font sizes to scale proportionally to a container, only to the window. We created a solution that resizes all the contents of an element at the same rate in response to container size, and reused it across our marketing pages.

The Design Challenge

We wanted to create new visuals of our product that needed to be available and translated across more than 35 different localized domains. Many domains support different currencies, features, and languages. Re-capturing screenshots on each domain to keep them in sync with all our product changes is extremely inefficient.

Screenshots of our product were simplified in order to highlight features relevant to the page or section.

After a number of iterations and as part of a collaborative effort outlined in more detail by Robyn Larsen on our UX blog, our design team came up with simplified representations of our user interface, UI Illustrations as we called them, for the parts of the product that we wanted to showcase. This was a clever solution to drive user focus to the parts of the product that we’re highlighting in each situation, however it required that someone maintain translations and versions of the product as separate image assets. We had an automated process for updating translations in our code but not in the design editor. 

What Didn’t Work: The SVG Approach

As an experimental solution, we attempted to export these visuals as SVG code and added those SVGs inline in our HTML. Then we’d replace the text and numbers with translated and localized text.

SVGs don’t support word wrapping, so visuals with long translations would look broken.

Exported SVGs were cool; they actually worked to accomplish what we had set out to do, but they had a bunch of drawbacks. Certain effects like Gaussian blur caused performance issues in Firefox, and SVG text doesn’t wrap when reaching a max-width like HTML text can. This resulted in some very broken-looking visuals (see above); languages with longer word lengths, like German, had overflowing text. In addition, SVG export settings in our design tool needed to be consistent for every developer to avoid massive changes to the whole SVG structure every time someone else exported the same visual. Even with a consistent export process, the developer would have to go through the whole process of swapping out text with our own hooks for translated content again. It was a huge mess. We were writing a lot of documentation just to create consistency in the process, and new challenges kept popping up when new settings in Sketch were used. It felt like we had just replaced one arduous process with another.

Our strategy of using SVGs for these visuals was quickly becoming unmanageable, and that was just with a few simple visuals. A month in, we still saw a lot of value in creating visuals as code, but needed to find a better approach.

Our Solution: The HTML/JavaScript Approach

After toying around with using JS to resize the content, we ended up with a utility we call ScaleContentAsImage. It calculates how much size is available for the visual and then resizes it to fit in that space. Let’s break down the steps required to do this.

Starting with A Simple Class

We start by creating a simple class that accepts a reference to the DOM element we want to scale, and initialize it by storing the computed width of the element in memory. This assumes we’ve already assigned the element a fixed pixel width somewhere in our code (a width matching that of the visual in the design file). Then we override the width of the element to 100% so that it can fill the space available to it. I’ve purposely separated the initialization sequence into a method distinct from the constructor. While not demonstrated in this post, that separation allows us to add lazy or conditional loading to save on performance.

Creating an Element That We Can Transform

Next we’ll need to create an element that will scale as needed using a CSS transform. We assign that wrapper the fixed width that its parent used to have. Note that we haven’t actually added this element anywhere in the DOM yet.

Moving the Content Over

We transfer all the contents of the visual out from where it is now and into the wrapper we just created, and put that back into the parent element. This method preserves any event bindings (such as lazy load listeners) that previously were bound to these elements. At this point the content might overflow the container, but we’ll apply the transform to resolve that.

Applying the Transformation

Now, we determine how much the wrapper should scale the contents by and apply that property. For example, if the visual was designed to be 200px wide but it’s rendered in an element that’s 100px wide, the wrapper would be assigned transform: scale(0.5);.

Preserving Space in the Page

A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right. The image is of the Shopify admin, with the Shopify logo, a search bar, and a user avatar next to “Helen B.” on the top of the screen. Below in a grid are summaries of emails delivered, purchase totals, and total growth represented in graphs.

So now our visual is resizing correctly; however, our page layout is looking all wonky. Our text content and the visual are meant to display side by side at equal widths, like the above.

So why does the page look like this? The colored highlight shows what’s happening with our CSS transform.

A screenshot of a webpage in a desktop viewport where text is pushed to the very left of the screen taking up one sixth of the width. The image of the Shopify admin to the right is only one third of the screen wide. The entire right half of the page is empty. The image is highlighted in blue, however a larger green box is also highlighted in the same position taking up more of the empty space but matching the same dimensions of the image. A diagonal line from the bottom right corner of the larger green box to the bottom right corner of the highlighted image hints at a relationship between both boxes.

CSS transforms don’t change the content flow, so even though our visual’s size is reduced correctly, the element still takes up its original fixed width. We add some additional logic to fix this: an empty element that takes up the correct amount of vertical space. Unless the visual contains images that are lazy loaded, or animates vertically in some way, we only need to make this calculation once, since a simple CSS trick to maintain an aspect ratio works just fine.

Removing the Transformed Element from the Document Flow

We also need to set the transformed element to be absolutely positioned, so that it doesn’t affect the document flow.

A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right. The image is of the Shopify admin, with the Shopify logo, a search bar, and a user avatar next to “Helen B.” on the top of the screen. Below in a grid are summaries of emails delivered, purchase totals, and total growth represented in graphs.

Binding to Resize

Success! Looks good! Now we just add a bit of logic to update our calculations if the window is resized.

Finishing the Code

Our class is now complete.
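Putting the steps together, the finished utility might look like this sketch (method and property names are my reconstruction of the behaviour described above, not Shopify’s actual source):

```javascript
// Sketch of the ScaleContentAsImage utility described above. Names are a
// reconstruction of the behaviour, not Shopify's source.
class ScaleContentAsImage {
  constructor(element) {
    // Assumes `element` has a fixed pixel width assigned in CSS and is
    // positioned relatively, so the absolutely positioned wrapper stays inside.
    this.element = element;
  }

  init() {
    // Remember the fixed design width, then let the element fill its container.
    this.fixedWidth = this.element.offsetWidth;
    this.element.style.width = "100%";

    // Wrapper that carries the CSS transform, removed from the document flow.
    this.wrapper = document.createElement("div");
    this.wrapper.style.width = `${this.fixedWidth}px`;
    this.wrapper.style.position = "absolute";
    this.wrapper.style.transformOrigin = "top left";

    // Move the visual's contents into the wrapper, preserving event bindings.
    while (this.element.firstChild) {
      this.wrapper.appendChild(this.element.firstChild);
    }

    // Placeholder that preserves the scaled height in the document flow.
    this.placeholder = document.createElement("div");
    this.element.appendChild(this.placeholder);
    this.element.appendChild(this.wrapper);

    this.update();
    window.addEventListener("resize", () => this.update());
  }

  update() {
    const scale = ScaleContentAsImage.computeScale(
      this.element.offsetWidth,
      this.fixedWidth
    );
    this.wrapper.style.transform = `scale(${scale})`;
    this.placeholder.style.height = `${this.wrapper.offsetHeight * scale}px`;
  }

  // Pure helper: how much to shrink a fixed-width design to fit its container.
  static computeScale(availableWidth, designWidth) {
    return availableWidth / designWidth;
  }
}
```

For the example in the walkthrough below, computeScale(378, 757) comes out at roughly 0.5, which is the scale the wrapper is assigned.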

Looking at the Detailed Changes in the DOM

1. Before JS is initialized, in this example the container width is 378px and the assigned width of the element is 757px. The available space is about 50% of the original size of the visual.

A screenshot of a page open in the browser with developer tools open. In the page, a UI Illustration is shown side-by-side with some text, and the highlighted element in the inspector matches that as described above. The override for the container width can be seen in the style inspector. In addition, the described element is assigned a property of “aria-hidden: true”, and its container is a `div` with `role: “img”` and an aria-label describing the visual as “View of the Shopify admin showing emails delivered, purchase totals, and total growth represented in graphs”

2. As seen in our HTML post-initialization, in JS we have overridden the size of the container to be 100%.

3. We’ve also moved all the content of the visual inside of a new element that we created, to which we apply a scale of 0.5 (based on the 50% calculated in step 1).

4. We absolutely position the element that we scaled so that it doesn’t disturb the document flow.

5. We added a placeholder element to preserve the correct amount of space in the document flow.

A Solution for a React Project

For a project using React, the same thing is accomplished without any of the logic we wrote to create, move, or update the DOM elements. The result is a much simpler snippet that only needs to worry about determining how much space is available within its container. A project using CSS-in-JS benefits in that the fixed width is directly passed into the element.

Problems with Localization and Currency

Shopify admin order summary page

An interesting problem we ran into was displaying prices in local currencies for fictional products. For instance, we started off with a visual of a product, a checkout button, and an order summary. Shown in the order summary were two chairs, each priced at ~$200, which were made-up prices and products for demonstrative purposes only.

It didn’t occur to us that 200 Japanese Yen is the equivalent of under $1.89 USD (today), so when we just swapped the currency symbol the visual of the chair did not realistically match the price. We ended up creating a table of currency conversion rates pulled on that day. We don’t update those conversion values on a regular basis, since we don’t need accurate rates for our invented prices. We’re ok with fluctuations, even large ones, as long as the numbers look reasonable in context. We obviously don’t take this approach with real products and prices.

Comparing Approaches: SVG vs HTML/JavaScript

The HTML/JS approach took some time to build upfront, but its advantages clearly outweighed the developer lift required, even from the start. The UI Illustrations were fairly quick to build out, given how simply and consistently they were designed. We started finding that other projects were reusing our visuals and creating their own using the same approach. We created a comparison chart between the two approaches, evaluating how well each supported our major considerations.

Text Support

While SVG resizes text automatically and in proportion to the visual, it doesn’t support word wrap, which is available in HTML.

Implementation and Maintenance

HTML/JS had a lot going for it compared to the SVG approach when it came to implementation and maintenance. Using HTML and JS means that developers don’t need technical knowledge of SVGs, and they can code these visuals with the help of our existing components. The code is easy to parse and can be tested using our existing testing framework. From an implementation standpoint, the only thing SVG really had going for it was that it usually resulted in fewer lines of code, since styles are inline and elements are absolutely positioned relative to each other. That in itself isn’t reason to choose a less maintainable and less human-readable solution.


While both would support animations—something we may want to add in the future—an HTML/JS approach allows us to easily use our existing play/pause buttons to control these animations.


The SVG approach works with JS disabled; however, it’s less performant and caused a lot of jankiness on the page when certain properties, like shadows, were applied to it.


Design is where HTML/JS really stood out against SVG. With our original SVG approach, designers needed to follow a specific process and use a specific design tool that worked with that process. For example, we started requiring that shadows applied to elements were consistent, in order to prevent multiple versions of Gaussian blur from being added to the page and creating jankiness. It also required our designers to design in a way that text would never break onto a new line, because of the lack of support for word wrapping. Without SVG, none of these concerns applied, and designers had the flexibility to use whatever tools they wanted and design freely.

Documentation and Ramp-up

HTML/JS was a clear winner, as we did away with all of the documentation describing the SVG export process, design guidelines, and quirks we discovered. With HTML, the only thing we’d need to document that wouldn’t apply to SVGs is how to apply the resize functionality to the content.

Scaling Our Solution

We started off with a set of named visuals, and designed our system around a single component that accepted a name (for example “Shopify admin dashboard” or “POS software”) and rendered the desired visual. We thought that having a single entry point would help us better track each visual and restrict us to a small, maintainable set of UI Illustrations. That single component was tested and documented and for each new visual we added respective tests and documentation.

We worried about overuse, given that each UI Illustration needed to be maintained by a developer. But with this system, a good portion of that development effort ended up going toward educating developers on the structure, maintaining documentation, and writing tests for basic HTML markup that’s only used in one place. We’ve since provided a more generic container that can be used to wrap any block of HTML for initialization with our ScaleContentLikeImage module, and that provides a consistent implementation of descriptive text for screen readers.

The Future of UI Illustrations

ScaleContentLikeImage and its application for our UI Illustrations is a powerful tool for our team to highlight our product in a very intentional and relevant way for our users. Jen Taylor dives deeper into our UX considerations and user-focused approach to UI Illustrations on the Shopify UX Blog. There are still performance and structural wins to be had, specifically around how we recalculate sizing for lazy loaded images, and how we document existing visuals for reuse. However, until there’s a CSS-only solution to handle this use case our HTML/JS approach seems to be the cleanest. Looking to the future, this could be an excellent application to explore with CSS Houdini once the layout API is made available (it’s not yet supported in any major browser).

Based on Anton Dosov’s CSS Houdini with Layout API demo, I can imagine a scenario where we can create a custom layout renderer and then apply this logic with a few lines of CSS.

We’ve all learned a lot in this process and like any system, its long term success relies on our team’s collaborative relationship in order to keep evolving and growing in a maintainable, scalable way. At Shopify one of our core values is to thrive on change, and this project certainly has done so.

If this sounds like the kind of project you want to be a part of, please check out our open positions.

How to Use Quasi-experiments and Counterfactuals to Build Great Products

Descriptive statistics and correlations are every data scientist’s bread and butter, but they often come with the caveat that correlation isn’t causation. At Shopify, we believe that understanding causality is the key to unlocking maximum business value. We aim to identify insights that actually indicate why we see things in the data, since causal insights can validate (or invalidate) entire business strategies. Below I’ll discuss different causal inference methods and how to use them to build great products.

The Causal Inference “Levels of Evidence Ladder”

A data scientist can use various methods to estimate the causal effects of a factor. The “levels of evidence ladder” is a great mental model that introduces the ideas of causal inference.

Levels of evidence ladder. First level (clearest evidence): A/B tests (a.k.a statistical experiments). Second level (reasonable level of evidence): Quasi-experiments (including Difference-in-differences, matching, controlled regression). Third level (weakest level of evidence): Full estimation of counterfactuals. Bottom of the chart: descriptive statistics—provides no direct evidence for causal relationship.

The ladder isn’t a ranking of methods, instead it’s a loose indication of the level of proof each method will give you. The higher the method is on the ladder, the easier it is to compute estimates that constitute evidence of a strong causal relationship. Methods at the top of the ladder typically (but not always) require more focus on the experimentation setup. On the other end, methods at the bottom of the ladder use more observational data, but require more focus on robustness checks (more on this later).

The ladder does a good job of explaining that there is no free lunch in causal inference. To get a powerful causal analysis you either need a good experimental setup, or a good statistician and a lot of work. It’s also simple to follow. I’ve recently started sharing this model with my non-data stakeholders. Using it to illustrate your work process is a great way to get buy-in from your collaborators and stakeholders.

Causal Inference Methods

A/B tests

A/B tests, or randomized controlled trials, are the gold standard method for causal inference—they’re on rung one of the levels of evidence ladder! For A/B tests, group A and group B are randomly assigned. The environment both groups are placed in is identical except for one parameter: the treatment. Randomness ensures that both groups are clones “on average”. This enables you to deduce causal estimates from A/B tests, because the only way they differ is the treatment. Of course in practice, lots of caveats apply! For example, one of the frequent gotchas of A/B testing is when the units in your treatment and control groups self-select to participate in your experiment.
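A minimal, purely illustrative simulation (standard library only, all numbers invented) of why random assignment lets a simple difference in means recover a causal effect:

```python
import random

random.seed(42)

# Simulate 10,000 units with a baseline metric; the treatment adds a true lift of 2.0.
baselines = [random.gauss(10, 1) for _ in range(10_000)]
treated, control = [], []
for baseline in baselines:
    # Random assignment makes the two groups "clones on average",
    # so the only systematic difference between them is the treatment.
    if random.random() < 0.5:
        treated.append(baseline + 2.0)  # true treatment effect of 2.0
    else:
        control.append(baseline)

# The simple difference in means is an unbiased estimate of the causal effect.
effect = sum(treated) / len(treated) - sum(control) / len(control)
print(round(effect, 2))  # close to the true effect of 2.0
```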

Setting up an A/B test for products is a lot of work. If you’re starting from scratch, you’ll need:

  • A way to randomly assign units to the right group as they use your product.
  • A tracking mechanism to collect the data for all relevant metrics.
  • To analyze these metrics and their associated statistics to compute effect sizes and validate the causal effects you suspect.

And that only covers the basics! Sometimes you’ll need much more to be able to detect the right signals. At Shopify, we have the luxury of an experiments platform that does all the heavy work and allows data scientists to start experiments with just a few clicks.


Sometimes it’s just not possible to set up an experiment. Here are a few reasons why A/B tests won’t work in every situation:

  • Lack of tooling. For example, if your code can’t be modified in certain parts of the product.
  • Lack of time to implement the experiment.
  • Ethical concerns (for example, at Shopify, randomly leaving some merchants out of a new feature that could help them with their business is sometimes not an option).
  • Just plain oversight (for example, a request to study the data from a launch that happened in the past).

Fortunately, if you find yourself in one of the above situations, there are still methods that enable you to obtain causal estimates.

A quasi-experiment (rung two) is an experiment where your treatment and control group are divided by a natural process that isn’t truly random, but are considered close enough to compute estimates. Quasi-experiments frequently occur in product companies, for example, when a feature rollout happens at different dates in different countries, or if eligibility for a new feature is dependent on the behaviour of other features (like in the case of a deprecation). In order to compute causal estimates when the control group is divided using a non-random criterion, you’ll use different methods that correspond to different assumptions on how “close” you are to the random situation.

I’d like to highlight two of the methods we use at Shopify. The first is linear regression with fixed effects. In this method, the assumption is that we’ve collected data on all factors that divide individuals between treatment and control group. If that is true, then a simple linear regression on the metric of interest, controlling for these factors, gives a good estimate of the causal effect of being in the treatment group.
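A minimal sketch of that idea on synthetic data (standard library only; all names and numbers are invented). Here the control for the confounding factor is done by stratifying on it, which for a single categorical factor plays the same role as the fixed-effect dummies in a regression:

```python
import random
from statistics import mean

random.seed(0)

# A non-random factor ("country") drives both treatment assignment and the metric.
rows = []
for _ in range(30_000):
    country = random.randrange(3)
    treated = random.random() < 0.2 + 0.3 * country   # adoption varies by country
    country_effect = [0.0, 1.5, 3.0][country]
    y = 5.0 + (2.0 if treated else 0.0) + country_effect + random.gauss(0, 1)
    rows.append((country, treated, y))

# A naive difference in means is confounded by country...
naive = mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)

# ...but comparing treated vs. control *within* each country (the role played
# by fixed-effect dummies in a regression) recovers the true effect of 2.0.
within = [
    mean(y for c, t, y in rows if t and c == k)
    - mean(y for c, t, y in rows if not t and c == k)
    for k in range(3)
]
adjusted = mean(within)
print(round(naive, 2), round(adjusted, 2))  # naive is biased upward; adjusted ~ 2.0
```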

The parallel trends assumption for difference-in-differences. In the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is a constant. Plotting both lines in a temporal graph like this can help check the validity of the assumption. Credits to Youcef Msaid.


The second is also a very popular method in causal inference: difference-in-differences. For this method to be applicable, you have to find a control group whose trend in the metric of interest is parallel to your treatment group’s, prior to any treatment being applied. Then, after treatment happens, you assume any break in the parallel trend is due only to the treatment itself. This is summed up in the diagram above.
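The difference-in-differences arithmetic itself is simple. With invented pre/post averages for the metric:

```python
# Hypothetical weekly metric averages around a feature rollout.
treatment_pre, treatment_post = 100.0, 130.0
control_pre, control_post = 80.0, 95.0

# Under the parallel trends assumption, the control group's change (+15)
# estimates what the treatment group would have done without the treatment,
# so the extra change in the treatment group is attributed to the treatment.
did_estimate = (treatment_post - treatment_pre) - (control_post - control_pre)
print(did_estimate)  # 15.0
```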


Finally, there will be cases when you want to detect causal factors from data that consists only of observations of the treatment. A classic example in tech is estimating the effect of a new feature that was released to the entire user base at once: no A/B test was done, and there’s no one who could serve as a control group. In this case, you can try making a counterfactual estimation (rung three).

The idea behind counterfactual estimation is to create a model that allows you to compute a counterfactual control group. In other words, you estimate what would have happened had the feature not existed. It isn’t always simple to compute an estimate. However, if you have a model of your users that you’re confident about, then you have enough material to start doing counterfactual causal analyses!
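A minimal sketch of the approach on synthetic data (standard library only; the covariate, numbers, and model are invented, and real counterfactual models are far richer):

```python
import random
from statistics import mean

random.seed(1)

# Pre-rollout: feature usage tracks a covariate the rollout doesn't touch
# (say, overall platform activity).
activity_pre = [random.gauss(100, 10) for _ in range(400)]
usage_pre = [0.5 * a + random.gauss(0, 1) for a in activity_pre]

# Fit a one-variable linear model (closed-form least squares) on pre-rollout data.
ax, uy = mean(activity_pre), mean(usage_pre)
slope = sum((a - ax) * (u - uy) for a, u in zip(activity_pre, usage_pre)) / \
        sum((a - ax) ** 2 for a in activity_pre)
intercept = uy - slope * ax

# Post-rollout: the model's prediction is the counterfactual (what usage
# "would have been"), which we compare against what we actually observed.
activity_post = [random.gauss(100, 10) for _ in range(100)]
counterfactual = [intercept + slope * a for a in activity_post]
observed = [0.5 * a + random.gauss(0, 1) for a in activity_post]  # no true effect here

gap = mean(observed) - mean(counterfactual)
print(round(gap, 2))  # near zero: a null result, i.e. no detectable causal effect
```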

Example of time series counterfactual vs. observed data

A good way to explain counterfactuals is with an example. A few months ago, my team faced a situation where we needed to assess the impact of a security update. The security update was important and it was rolled out to everyone, however it introduced friction for users. We wanted to see if this added friction caused a decrease in usage. Of course, we had no way of finding a control group among our users.

With no control group, we created a time-series model to get a robust counterfactual estimate of usage of the updated feature. We trained the model on data such as usage of other features not impacted by the security update, as well as global trends describing the overall level of activity on Shopify. All of these variables were independent of the security update we were studying. When we compared our model’s prediction to actuals, we found no significant difference. This was a great null result, showing that the security update did not negatively affect usage.

When using counterfactual methods, the quality of your prediction is key. If a confounding factor that’s independent from your newest rollout varies, you don’t want to attribute this change to your feature. For example, if you have a model that predicts daily usage of a certain feature, and a competitor launches a similar feature right after yours, your model won’t be able to account for this new factor. Domain expertise and rigorous testing are the best tools to do counterfactual causal inference. Let’s dive into that a bit more.

The Importance of Robustness

While quasi-experiments and counterfactuals are great methods when you can’t perform a full randomization, they come at a cost! The tradeoff is that it’s much harder to compute sensible confidence intervals, and you’ll generally have to deal with a lot more uncertainty—false positives are frequent. The key to avoiding these traps is robustness checks.

Robustness really isn’t that complicated: it just means clearly stating the assumptions your methods and data rely on, then gradually relaxing each of them to see if your results still hold. It acts as an efficient coherence check; if your findings change dramatically because of a single variable, especially one subject to noise or measurement error, you should be wary of them.

Directed Acyclic Graphs (DAGs) are a great tool for checking robustness. They help you clearly spell out assumptions and hypotheses in the context of causal inference. Popularized by the famous computer scientist Judea Pearl, DAGs have gained a lot of traction recently in tech and academic circles.

At Shopify, we’re really fond of DAGs. We often use Dagitty, a handy browser-based tool. In a nutshell, when you draw an assumed chain of causal events in Dagitty, it provides you with robustness checks on your data, like certain conditional correlations that should vanish. I recommend you explore the tool.
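One such testable implication can be sketched in a few lines (standard library only; a toy chain, not a real Dagitty model): in the DAG X → Y → Z, the correlation between X and Z should vanish once we condition on Y.

```python
import random
from statistics import mean

random.seed(2)
n = 20_000

# Assumed causal chain: X -> Y -> Z (synthetic, illustrative data).
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 1) for xi in x]
z = [3 * yi + random.gauss(0, 1) for yi in y]

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(target, given):
    # Remove the linear influence of `given`, i.e. condition on it.
    mt, mg = mean(target), mean(given)
    slope = sum((t - mt) * (g - mg) for t, g in zip(target, given)) / \
            sum((g - mg) ** 2 for g in given)
    return [t - slope * g for t, g in zip(target, given)]

# X and Z are strongly correlated marginally, but if the assumed DAG is right,
# the correlation should vanish once we condition on Y.
print(round(corr(x, z), 2))
print(round(corr(residuals(x, y), residuals(z, y)), 2))  # should be ~ 0
```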

The Three Most Important Points About Causal Inference

Let’s quickly recap the most important points regarding causal inference:

  • A/B tests are awesome and should be a go-to tool in every data science team’s toolbox.
  • However, it’s not always possible to set up an A/B test. Instead, look for natural experiments to replace true experiments. 
  • If no natural experiment can be found, counterfactual methods can be useful. However, you shouldn’t expect to be able to detect very weak signals using these methods. 

I love causal inference applications for business, and I think there’s huge untapped potential in the industry. Just as generalizing A/B tests led to a very successful “Experimentation Culture” from the end of the 1990s onward, I hope the 2020s and beyond will be the era of a “Causal Culture”! I hope sharing how we do it at Shopify will help. If any of this sounds interesting to you, we’re looking for talented data scientists to join our team.

Enforcing Modularity in Rails Apps with Packwerk

On September 30, 2020 we held ShipIt! presents: Packwerk by Shopify. A video for the event is now available for you to learn more about our latest open source tool for creating packages with enforced boundaries in Rails apps. Click here to watch the video.

The Shopify core codebase is large, complex, and growing by the day. To better understand these complex systems, we use software architecture to create structural boundaries. Ruby doesn't come with a lot of boundary enforcements out of the box. Ruby on Rails only provides a very basic layering structure, so it's hard to scale the application without any solid pattern for boundary enforcement. In comparison, other languages and frameworks have built-in mechanisms for vertical boundaries, like Elixir’s umbrella projects.

As Shopify grows, it’s crucial we establish a new architecture pattern so large scale domains within the monolith can interact with each other through well-defined boundaries, and in turn, increase developer productivity and happiness. 

So, we created an open source tool to build a package system that can be used to guide and enforce boundaries in large scale Rails applications. Packwerk is a static analysis tool used to enforce boundaries between groups of Ruby files we call packages.

High Cohesion and Low Coupling In Code

Ideally, we want to work on a codebase that feels small. One way to make a large codebase feel small is for it to have high cohesion and low coupling.

Cohesion refers to the measure of how much elements in a module or class belong together. For example, functional cohesion is when code is grouped together in a module because they all contribute to one single task. Code that is related changes together and therefore should be placed together.

On the other hand, coupling refers to the level of dependency between modules or classes. Elements that are independent of each other should also be independent in location of implementation. When a certain domain of code has a long list of dependencies of unrelated domains, there’s no separation of boundaries. 

Boundaries are barriers between code. An example of a code boundary is to have a separate repository and service. For the code to work together in this case, network calls have to be made. In our case, a code boundary refers to different domains of concern within the same codebase.

With that, there are two types of boundaries we’d like to enforce within our applications—dependency and privacy. A class can have a list of dependencies of constants from other classes. We want an intentional and ideally small list of dependencies for a group of relevant code. Classes shouldn’t rely on other classes that aren’t considered their dependencies. Privacy boundaries are violated when there’s external use of private constants in your module. Instead, external references should be made to public constants, where a public API is established.

A Common Problem with Large Rails Applications

If there are no code boundaries in the monolith, developers find it harder to make changes in their respective areas. You may remember making a straightforward change that surprisingly broke unrelated tests in a different part of the codebase, or digging around a codebase to find a class or module with more than 2,000 lines of code.

Without any established code boundaries, we end up with anti-patterns such as spaghetti code and large classes that know too much. As a codebase with low cohesion and high coupling grows, it becomes harder to develop, maintain, and understand. Eventually, it’s hard to implement new features, scale and grow. This is frustrating to developers working on the codebase. Developer happiness and productivity when working on our codebase is important to Shopify.

Rails Is Like an Open-concept Living Space

Let’s think of a large Rails application as a living space within a house without any walls. An open-concept living space is like a codebase without architectural boundaries. In an effort to separate concerns of different types of living spaces, you can arrange the furniture in a strategic manner to indicate boundaries. This is exactly what we did with the componentization efforts in 2017. We moved code that made sense together into folders we call components. Each of the component folders at Shopify represent domains of commerce, such as orders and checkout.

In our open-concept analogy, imagine having a bathroom without walls—it’s clear where the bathroom is supposed to be, but we would like it to be separate from other living spaces with a wall. The componentization effort was a great first step towards modularity for the great Shopify monolith, but we are still far from a modular codebase—we need walls. Cross-component calls are still being made, and Active Record models are shared across domains. There’s no wall imposing those boundaries, just an agreed upon social contract that can be easily broken.

Boundary Enforcing Solutions We Researched

The goal is to find a solution for boundary enforcement. The Ruby we all know and love doesn’t come with boundary enforcement out of the box. It allows specifying visibility on the class level only, and loads all dependencies into the global namespace. There’s no difference between direct and indirect dependencies.

There are some existing ways of potentially enforcing boundaries in Ruby. We explored a combination of solutions: using the private_constant keyword to set private constants, creating gems to set boundaries, using tests to prevent cross-boundary associations, and testing out external gems such as Modulation.

Setting Private Constants

The private_constant keyword is a built-in Ruby method to make a constant private so it cannot be accessed outside of its namespace. A constant’s namespace is the modules or classes where it’s nested and defined. In other words, using private_constant provides visibility semantics for constants on a namespace level, which is desirable. We want to establish public and private constants for a class or a group of classes.

However, there are drawbacks of using the private_constant method of privacy enforcement. If a constant is privatized after it has been defined, the first reference to it will not be checked. It is therefore not a reliable method to use.

There’s no trivial way to tell if there’s a boundary violation when using private_constant. When declaring a constant private to your class, it’s hard to determine whether the constant is being bypassed or used appropriately. Plus, this only addresses privacy, not dependency.

Overall, only using private_constant is insufficient to enforce boundaries across large domains. We want a tool that is flexible and can integrate into our current workflow. 

Establishing Boundaries Through Gems

The other method of creating a modular Rails application is through gems. Ruby gems are used to distribute and share Ruby libraries between Rails applications. People may place relevant code into an internal gem, separating concerns from the main application. The gem may also eventually be extracted from the application with little to no complications.

Gems provide a list of dependencies through the gemspec which is something we wanted, but we also wanted the list of dependencies to be enforced in some way. Our primary concern was that gems don't have visibility semantics. Gems make transitive dependencies available in the same way as direct dependencies in the application. The main application can use any dependency within the internal gem as it would its own dependency. Again, this doesn't help us with boundary enforcement.

We want a solution where we’re able to still group code that’s relevant together, but only expose certain parts of that group of code as public API. In other words, we want to control and enforce the privacy and dependency boundaries for a group of code—something we can’t do with Ruby gems.

Using Tests to Prevent Cross-component Associations

We added a test case that rejects any PRs that introduce Active Record associations across components, which is a pattern we’re trying to avoid. However, this solution is insufficient for several reasons. The test doesn’t account for the direction of the dependency. It also isn’t a complete test. It doesn’t cover use cases of Active Record objects that aren’t associations and generally doesn’t cover anything that isn’t Active Record.

The test was good enforcement, but lacked several key features. We wanted a solution that determined the direction of dependencies and accounted for different types of Active Record associations. Nonetheless, the test case still exists in our codebase, as we’ve found it helpful in prompting developers to think about and discuss whether an association between components is truly needed.

Using the Modulation Ruby Gem

Modulation is a Ruby gem for file-level dependency management within Ruby applications that was experimental at the time of our exploration. Modulation works by overriding the default Ruby code loading, which is concerning, as we’d have to replace the whole autoloading system in our Rails application. It also performs dependency introspection at runtime, adding complexity to both the code and the application’s runtime behaviour.

There are obvious risks that come with modifying how our monolith works for an experiment. If we went with Modulation as a solution and had to change our minds, we’d likely have to revert changes to hundreds of files, which is impractical in a production codebase. Plus, the gem works at file-level granularity which is too fine for the scale we were trying to solve.

Creating Microservices?

The idea of extracting components from the core monolith into microservices in order to create code boundaries is often brought up at Shopify. In our monolith’s case, creating more services in an attempt to decouple code is solving code design problems the wrong way.

Distributing code over multiple machines is a topology change, not an architectural change. If we try to extract components from our core codebase into separate services, we introduce the added concern of networked communication and create a distributed system. A poorly designed API within the monolith will still be a poorly designed API within a service, but now with additional complexities. These complexities come in forms such as a stateless network boundary, serialisation between systems, and reliability issues with networked communication. Microservices are a great solution when a service is isolated and unique enough to justify the tradeoff of the network boundary and the complexities that come with it.

The Shopify core codebase still stands as a majestic modular monolith, with all the code broken up into components and living in a singular codebase. Now, our goal is to advance our application’s modularity to the next step—by having clear and enforced boundaries.

Packwerk: Creating Our Own Solution

Taking our learnings from the exploration phase for the project, we created Packwerk. There are two violations that Packwerk enforces: dependency and privacy. Dependency violations occur when a package references a private constant from a package that hasn’t been declared as a dependency. Privacy violations occur when an external constant references a package’s private constants. However, constants within the public folder, app/public, can be accessed and won't be a violation.

How Packwerk Works 

Packwerk parses and resolves constants in the application statically with the help of an open-sourced Shopify Ruby gem called ConstantResolver. ConstantResolver uses the same assumptions as Zeitwerk, the Rails code loader, to infer the constant's file location. For example, Some::Nested::Model will be resolved to the constant defined in the file path, models/some/nested/model.rb. Packwerk then uses the file path to determine which package defines the constant.
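As a rough sketch of that inference (real resolution also consults the app’s autoload paths, e.g. an app/models prefix, and handles custom inflections; this toy version only does the CamelCase-to-snake_case conversion):

```python
import re

def constant_to_path(constant: str) -> str:
    """Toy Zeitwerk-style inference: Some::Nested::Model -> some/nested/model.rb.
    The autoload path prefix (e.g. app/models) would be prepended from the
    application's configured load paths."""
    parts = constant.lstrip(":").split("::")
    snake = [re.sub(r"(?<!^)(?=[A-Z])", "_", part).lower() for part in parts]
    return "/".join(snake) + ".rb"

print(constant_to_path("Some::Nested::Model"))       # "some/nested/model.rb"
print(constant_to_path("Components::OnlineStore"))   # "components/online_store.rb"
```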

Next, Packwerk will use the resolved constants to check against the configurations of the packages involved. If all the checks are enforced (i.e. dependency and privacy), references from Package A to Package B are valid if:

  1. Package A declares a dependency on Package B, and
  2. the referenced constant is a public constant in Package B.

Ensuring Application Validity

Before diving into further details, we have to make sure that the application is in a valid state for Packwerk to work correctly. To be considered valid, an application has to have a valid autoload path cache, package definition files and application folder structure. Packwerk comes with a command, packwerk validate, that runs on a continuous integration (CI) pipeline to ensure the application is always valid.

Packwerk also checks for any cyclic dependencies within the application. According to the Acyclic Dependencies Principle, no cycles should be allowed in the component dependency graph. If packages depend on each other in a cycle, making a change to one package will create a domino effect and force a change on all packages in the cycle, making the dependencies difficult to manage.
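Cycle detection on a package dependency graph can be sketched with a depth-first search (package names invented; this is the general algorithm, not Packwerk’s actual implementation):

```python
def find_cycle(graph: dict) -> bool:
    """Return True if the dependency graph contains a cycle (DFS with a
    'currently visiting' set to detect back edges)."""
    visiting, done = set(), set()

    def dfs(node: str) -> bool:
        if node in visiting:
            return True          # back edge: we've found a cycle
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(dep) for dep in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(node) for node in graph)

acyclic = {"orders": ["checkout"], "checkout": ["payments"], "payments": []}
cyclic = {"orders": ["checkout"], "checkout": ["orders"]}
print(find_cycle(acyclic), find_cycle(cyclic))  # False True
```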

In practical terms, imagine working on a domain of the codebase concurrently with 100 other developers. If your codebase has cyclic dependencies, your change will ripple out to the components that depend on your component. When you’re done with your work and want to merge it into the main branch along with other developers’ changes, integration becomes a nightmare, because the dependencies in the cycle have to be modified in each iteration of the application.

An application with an acyclic dependency graph can be tested and released independently without having the entire application change at the same time.

Creating a Package 

A package is defined by a package.yml file at the root of the package folder. Within that file, specific configurations are set. Packwerk allows a package to declare the type of boundary enforcement that the package would like to adhere to. 

Additionally, other useful package-specific metadata can be specified, like team and contact information for the package. We’ve found that having granular, package-specific ownership makes it easier for cross-team collaboration compared to ownership of an entire domain.

Enforcing Boundaries Between Packages

Running packwerk check

Packwerk enforces boundaries between packages through a check that can be run both locally and on the CI pipeline. To perform a check, simply run the line packwerk check. We also included this in Shopify’s CI pipeline to prevent any new violations from being merged into the main branch of the codebase.

Enforcing Boundaries in Existing Codebases

Because Rails doesn’t impose much code structure, large scale legacy Rails apps tend to have existing dependency and privacy violations between packages. In that case, we want to stop the bleeding and prevent new violations from being added to the codebase.

Users can still enforce boundaries within the application despite existing violations, ensuring the list of violations doesn't continue to increase. This is done by generating a deprecated references list for the package.

We want to allow developers to continue with their workflow, but prevent any further violations. The list of deprecated references can be used to help a codebase transition to a cleaner architecture. It iteratively establishes boundaries in existing code as developers work to reduce the list.

List of deprecated references for components/online_store

The list of deprecated references contains useful information about each violation within the package. In the example above, we can tell that there is a privacy violation: the listed files refer to the ::RetailStore constant defined in the components/online_store package.

By surfacing the exact references where the package’s boundaries are breached, we essentially get a to-do list that can be worked through.
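As a rough sketch, a deprecated references file recording the ::RetailStore privacy violation might look like this. The structure follows Packwerk's generated format, but the package and file paths here are invented for illustration.

```yaml
# components/shipping/deprecated_references.yml (paths are illustrative)
components/online_store:
  "::RetailStore":
    violations:
      - privacy
    files:
      - components/shipping/app/models/rate_calculator.rb
```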

Originally, the deprecated references list was meant to let developers start enforcing an application’s boundaries immediately, despite existing violations, and to serve as a ledger of technical debt to pay down. However, the Shipping team at Shopify also found success using the list to extract a domain out of the main application into its own service, and it’s equally useful when extracting a package into a gem. Ultimately, we make sure to let developers know that the list of deprecated references should be used to refactor the code and reduce the number of violations in the list.

The purpose of Packwerk would be defeated if we merely kept adding to the list of violations (though we’ve made some exceptions to this rule). When a team can’t add a dependency in the correct direction because the needed pattern doesn’t exist yet, we recommend recording the violation in the list of deprecated references. Doing so ensures that once such a pattern exists, we eventually refactor the code and remove the violation from the list. That’s a better alternative than hard-wiring a dependency in the wrong direction.

Preventing New Violations 

After you’ve created packages within your application and enforced boundaries for those packages, Packwerk is ready to go. It displays violations whenever packwerk check is run, either locally or on the CI pipeline.

The error message displays the type of violation, the location of the violation, and actionable next steps for developers. The goal is to make developers aware of the changes they make and mindful of any boundary-breaking changes they add to the code.

The Caveats 

Statically analyzing Ruby is complex. If a constant isn’t autoloaded, Packwerk ignores it. This design decision is the Packwerk team’s strategy for handling the inaccuracy that comes with Ruby static analysis: it ensures the results produced by Packwerk won’t have any false positives, though it can create false negatives. If we get most of the references right, that’s enough to shift the code quality in a positive direction.

How Shopify Is Using Packwerk

There was no formal push for the adoption of Packwerk within Shopify. Several teams were interested in the tool and volunteered to beta test before it was released. Since its release, many teams and developers are adopting Packwerk to enforce boundaries within their components.

Currently Packwerk runs in six Rails applications at Shopify, including the core monolith. Within the core codebase, we have 48 packages with 30 boundary enforcements within those packages. Packwerk integrates in the CI pipeline for all these applications and has commands that can run locally for packaging-related checks.

Since Packwerk was released for use within the company, it has sparked new conversations about software architecture. As developers worked on removing technical debt and refactoring the code using Packwerk, we noticed there are no established patterns for decoupling code and creating single-direction dependencies. We’re currently researching and discussing inversion of control and establishing patterns for dependency inversion within Rails applications.

Start Using Packwerk. It’s Open Source!

Packwerk is now out in the wild and ready for you to try it out!

To get Packwerk installed in your Rails application, add it as a gem and simply run the command packwerk init. The command will generate the configuration files needed for you to use Packwerk.
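Concretely, assuming a Bundler-managed Rails app, setup amounts to something like the following (invocation details such as a bin stub or bundle exec prefix may vary by project):

```
# Add to the Gemfile, then run bundle install:
gem "packwerk"

# Generate Packwerk's configuration files from the application root:
packwerk init
```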

The Packwerk team will be maintaining the gem and we’re stoked to see how you will be using the tool. You are also welcome to report bugs and open pull requests in accordance with our contribution guidelines.


Packwerk is inspired by Stripe’s internal Ruby packages solution, with its ideas adapted to the more complex world of Rails applications.

ShipIt! Presents: Packwerk by Shopify

Without code boundaries in a monolith, it’s difficult for developers to make changes in their respective areas. Like when you make a straightforward change that shockingly results in breaking unrelated tests in a different part of the codebase, or dig around a codebase to find a class or module with more than 2,000 lines of code!

You end up with anti-patterns like spaghetti code and large classes that know too much. The codebase is harder to develop, maintain and understand, leading to difficulty adding new features. It’s frustrating for developers working on the codebase. Developer happiness and productivity is important to us.

So, we created an open source tool to establish code boundaries in Rails applications. We call it Packwerk.

During this event you will:

  • Learn more about the problems Packwerk solves.
  • See how we built Packwerk.
  • Understand how we use Packwerk at Shopify.
  • See a demo of Packwerk.
  • Learn how you can get started with Packwerk.

Additional Information 

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

Under Deconstruction: The State of Shopify’s Monolith

Ruby on Rails is a great framework for rapidly building beautiful web applications that users and developers love. But if an application is successful, there’s usually continued investment, resulting in additional features and increased overall system complexity.

Shopify’s core monolith has over 2.8 million lines of Ruby code and 500,000 commits. Rails doesn’t provide patterns or tooling for managing the inherent complexity and adding features in a structured, well-bounded way.

That’s why, over three years ago, Shopify founded a team to investigate how to make our Rails monoliths more modular. The goal was to help us scale towards ever increasing system capabilities and complexity by creating smaller, independent units of code we called components. The vision went like this:

  • We can more easily onboard new developers to just the parts immediately relevant to them, instead of the whole monolith.
  • Instead of running the test suite on the whole application, we can run it on the smaller subset of components affected by a change, making the test suite faster and more stable.
  • Instead of worrying about the impact on parts of the system we know less well, we can change a component freely as long as we’re keeping its existing contracts intact, cutting down on feature implementation time.

In summary, developers should feel like they are working on a much smaller app than they actually are.

It’s been 18 months since we last shared our efforts to make our Rails monoliths more modular. I’ve been working on this modularity effort for the last two and a half years, currently on a team called Architecture Patterns. I’ll lay out the current state of my team’s work, and some things we’d do differently if we started fresh right now.

The Status Quo

We generally stand by the original ideas as described in Deconstructing the Monolith, but almost all of the details have changed. We make consistent progress, but it's important to note that making changes at this scale requires a significant shift in thinking for a critical mass of contributors, and that takes time.

While we’re far from finished, we already reap the benefits of our work. The added constraints on how we write our code trigger deep software design discussions throughout the organization. We see a mindset shift across our developers with a stronger focus on modular design. When making a change, developers are now more aware of the consequences on the design and quality of the monolith as a whole. That means instead of degrading the design of existing code, new feature implementations now more often improve it. Parts of the codebase that received heavy refactoring in recent years are now easier to understand because their relationship with the rest of the system is clearer.

We automatically triage exceptions to components, enabling teams to act on them without having to dig through the sometimes noisy exception stream for the whole monolith. And with each component explicitly owned by a team, whole-codebase chores like Rails upgrades are easily distributed and collaboratively solved. Shopify is running its main monolith on the newest, unreleased revisions of Rails. The clearly defined ownership for areas of the codebase is one of the factors enabling us to do that.

What We Learned so Far

Our main monolith is one of the oldest, largest Rails codebases on the planet, under continuous development since at least 2006, with hundreds of developers currently adding features.

A refactor on this scale needs to be approached completely differently from smaller efforts. We learned that all large scale changes start

  • with understanding and influencing developer behavior
  • at the grassroots
  • with a holistic perspective on architecture 
  • with careful application of tooling
  • with being aware of the tradeoffs involved

Understand Developer Behaviour

A single centralized team can’t make change happen by working against the momentum of hundreds of developers adding features.

Also, it can’t anticipate all the edge cases and have context on all domains of the application. A single team can make simple change happen on a large scale, or complex change on a small scale. To modularize a large monolith though, we need to make complex change happen on a large scale. Even if a centralized team could make it happen, the design would degrade once the team switches its focus to something else. 

That’s why making a fundamental architecture change to a system that’s being actively worked on is in large part a people problem. We need to change the behavior of the average developer on the codebase. We need to all iteratively evolve the system towards the envisioned future together. The developers are an integral part of the system.

Dr. B.J. Fogg, founder of the Behavior Design Lab at Stanford University, developed a model for thinking about behaviors that matches our experiences. The model suggests that for a behavior to occur, three things need to be in place: Ability, Motivation, and Prompt.

Fogg Behaviour Model by BJ Fogg, PhD

In a nutshell, prompts are necessary for a desired behavior to happen, but they're ineffective unless there's enough motivation and ability. Exceptionally high motivation can, within reason, compensate for low ability and vice versa.

Automated tooling and targeted manual code reviews provide prompts. That’s the easy part. Creating ability and motivation to make positive change is harder. Especially when that goes against common Ruby on Rails community practices and requires a view of the system that’s much larger than the area that most individual developers are working on. Spreading an understanding of what we’re aiming for, and why, is critical.

For example, we invested quite a bit of time and energy into developing patterns to ensure some consistency in how component boundary interfaces are designed. Again and again we pondered: how should components call each other? We then pushed developers to use these patterns everywhere. In hindsight, this strategy didn’t increase developer ability or motivation. It didn’t solve the problems actually holding them back, and it didn’t explain the reasons or long term goals well enough. Pushing for consistency added rules, which always add some friction, because they have to be learned, remembered, and followed. It didn’t make any hard problem significantly easier to solve. In some cases, the patterns were helpful. In other cases, they led developers to redefine their problem to fit the solution we provided, which degraded the overall state of the monolith.

Today, we’re still providing some general suggestions on interface consistency, but we have far fewer hard rules. We’re focusing on finding the areas where developers are hungry to make positive change, but don’t end up doing it because it’s too hard. Often, making our code more modular is hard because legacy code and tooling are based on assumptions that no longer hold true. One of the most problematic outdated assumptions is that all Active Record models are OK to access everywhere, when in this new componentized world we want to restrict their usage to the component that owns them. We can help developers overcome this problem.

So in the words of Dr. Fogg, these days we’re looking for areas where the prompt is easy, the motivation is present, and we just have to amp up the ability to make things happen.

Foster the Grassroots

As I mentioned, we, as a centralized team, can’t make this change happen by ourselves. So, we work to create a grassroots movement among the developers at Shopify. We aim to increase the number of people that have ability, motivation and prompt to move the system a tiny step further in the right direction.

We give internal talks, write documentation, share wins, embed in other teams, and pair with people all over the company. Embedding and pairing make sure we’re solving the problems that product developers are most struggling with in practice, avoiding what’s often called Ivory Tower Syndrome where the solutions don’t match the problems. It also lets us gain context on different areas of the codebase and the business while helping motivated people achieve goals that align with ours.

As an example, we have a group called the Architecture Guild. The guild has a Slack channel for software architecture discussions and bi-weekly meetups. It’s an open forum and a way to grow more architecture-conscious mindsets while encouraging architectural thinking. The Architecture Patterns team provides some content that we think is useful, but we encourage other people to share their thoughts, and most of the contributions come from other teams. Currently, the Architecture Guild has ~400 members and 54 documented meetups with meeting notes and recordings that are shared with all developers at Shopify.

The Architecture Guild grew organically out of the first Componentization team at Shopify after the first year of Componentization. If I were to start a similar effort again today, I’d establish a forum like this from the beginning to get as many people on board with the change as early as possible. It’s also generally a great vehicle to spread software design knowledge that’s siloed in specific teams to other parts of the company.

Other methods we use to create fertile ground for ambitious architecture projects are

  • the Developer Handbook, an internal online resource documenting how we develop software at Shopify.
  • Developer Talks, our internal weekly livestreamed and recorded talks about software development at Shopify.

Build Holistic Architecture

Some properties of software are so closely related that they need to be approached in pairs. By working on one property and ignoring its “partner property,” you could end up degrading the system.

Balance Encapsulation With A Simple Dependency Graph

We started out by focusing our work on building a clean public interface around each component to hide the internals. The expectation was that this would allow reasoning about and understanding the behavior of a component in isolation. Changing internals of a component wouldn’t break other components—as long as the interface stays stable.

It’s not that straightforward though. The public interface is what other components depend on; if a lot of components depend on it, it’s hard to change. The interface needs to be designed with those dependencies in mind, and the more components depend on it, the more abstract it needs to be. It’s hard to change because it’s used everywhere, and it will have to change often if it contains knowledge about concrete parts of the business logic.

When we started analyzing the graph of dependencies between components, it was very dense, to the point that every component depended on over half of all the other components. We also had lots of circular dependencies.

Circular Dependencies

Circular dependencies are situations where, for example, component A depends on component B while component B also depends on component A. The cycles don’t have to be direct, either; they can involve more than two components. For example, A depends on B, which depends on C, which depends on A.
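To make the idea concrete, here’s a minimal sketch in plain Ruby (names invented for illustration, not part of Packwerk) that detects such cycles in a dependency graph with a depth-first search:

```ruby
# Detect whether a dependency graph contains a cycle.
# The graph maps each component to the components it depends on.
def cyclic?(graph)
  state = {}
  visit = lambda do |node|
    return true  if state[node] == :visiting # back edge: we found a cycle
    return false if state[node] == :done
    state[node] = :visiting
    cycle = (graph[node] || []).any? { |dep| visit.call(dep) }
    state[node] = :done
    cycle
  end
  graph.keys.any? { |node| visit.call(node) }
end

cyclic?("A" => ["B"], "B" => ["A"])               # => true  (direct cycle)
cyclic?("A" => ["B"], "B" => ["C"], "C" => ["A"]) # => true  (longer cycle)
cyclic?("A" => ["B"], "B" => ["C"])               # => false (acyclic)
```

An acyclic graph always admits a topological order, which is exactly what lets components be tested and released independently.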

These properties of the dependency graph mean that the components can’t be reasoned about, or evolved, separately. Changes to any component in a cycle can break all other components in the cycle. Changes to a component that almost all other components depend on can break almost everything. Such changes therefore require a lot of context. A dense, cyclical dependency graph undermines the whole idea of Componentization: it blocks us from making the system feel smaller.

When we ignored the dependency graph, in large parts of the codebase the public interface turned out to just be an added layer of indirection in the existing control flows. This made it harder to refactor these control flows because it added additional pieces that needed to be changed. It also didn’t make it a lot easier to reason about parts of the system in isolation.

The simplest possible way to introduce a public interface to a private implementation

The diagram shows that the simplest possible way to introduce a public interface can just mean that a previously problematic design leaks into a separate interface class, making the underlying design problem harder to fix by spreading it across more files.

Discussions about the desirable direction of a dependency often surface these underlying design problems. We routinely discover objects with too many responsibilities and missing abstractions this way.

Perhaps not surprisingly, one of the central entities of the Shopify system is the Shop, and so almost everything depends on the Shop class. That means that if we want to avoid circular dependencies, the Shop class can depend on almost nothing.

Luckily, there are proven tools we can use to straighten out the dependency graph. We can make arrows point in a different direction by either moving responsibilities into the component that depends on them or applying inversion of control. Inversion of control means inverting a dependency so that control flow and source code dependency point in opposite directions. This can be done, for example, through a publish/subscribe mechanism like ActiveSupport::Notifications.
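As a rough sketch of what that inversion looks like (plain Ruby standing in for ActiveSupport::Notifications; all class and event names are invented), the publishing component emits an event without any source code dependency on its subscribers:

```ruby
# Minimal publish/subscribe bus: the publishing component never
# references its subscribers, so the source code dependency points
# from subscriber to publisher while control flows the other way.
class EventBus
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(event, &handler)
    @subscribers[event] << handler
  end

  def publish(event, payload)
    @subscribers[event].each { |handler| handler.call(payload) }
  end
end

BUS = EventBus.new

# Hypothetical: an Inventory component reacts to shop creation
# without the Shop component ever knowing Inventory exists.
provisioned = []
BUS.subscribe("shop.created") { |payload| provisioned << payload[:id] }

# Somewhere inside the Shop component:
BUS.publish("shop.created", id: 42)
provisioned # => [42]
```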

This strategy of eliminating circular dependencies naturally guides us towards removing concrete implementation from classes like Shop, turning it into a mostly empty container holding only the identity of a shop and some abstract concepts.

If we apply the aforementioned techniques while building out the public interfaces, the result is therefore much more useful. The simplified graph allows us to reason about parts of the system in isolation, and it even lays out a path towards testing parts of the system in isolation.

Dependencies diagram between Platform, Supporting, and Frontend components

If determining the desired direction of all the dependencies on a component ever feels overwhelming, we think about the components grouped into layers. This allows us to prioritize and focus on cleaning up dependencies across layers first. The diagram above sketches out an example. Here, we have platform components, Platform and Shop Identity, that purely provide functionality to other components. Supporting components, like Merchandising and Inventory, depend on the platform components but also provide functionality to others and often serve their own external APIs. Frontend components, like Online Store, are primarily externally facing. The dependencies crossing the dotted lines can be prioritized and cleaned up first, before we look at dependencies within a layer, for example between Merchandising and Inventory.

Balance Loose Coupling With High Cohesion

Tight coupling with low cohesion and loose coupling with high cohesion

Meaningful boundaries like those we want around components require loose coupling and high cohesion. A good approximation for this is Change Locality: The degree to which code that changes together lives together.

At first, we solely focused on decoupling components from each other. This felt good because it was an easy, visible change, but it still left us with cohesive parts of the codebase that spanned component boundaries. In some cases, we reinforced a broken state. The consequence was that small changes to the system’s functionality often still meant code changes across multiple components, and the developers involved needed to know and understand all of those components.

Change Locality is a sign of both low coupling and high cohesion, and it makes evolving the code easier. The codebase feels smaller, which is one of our stated goals. Change Locality can also be made visible: for example, we’re working on automation that analyzes every pull request on our codebase to determine which components it touches. The number of components touched should go down over time.

An interesting side note here is that different kinds of cohesion exist. We found that where our legacy code respects cohesion, it’s mostly informational cohesion—grouping code that operates on the same data. This arises from a design process that starts with database tables (very common in the Rails community). Change Locality can be hindered by that. To produce software that is easy to maintain, it makes more sense to focus on functional cohesion—grouping code that performs a task together. That’s also much closer to how we usually think about our system. 

Our focus on functional cohesion is already showing benefits by making our business logic, the heart of our software, easier to understand.

Create a SOLID foundation

There are ideas in software design that apply in a very similar way on different levels of abstraction—coupling and cohesion, for example. We started out applying these ideas on the level of components. But most of what applies to components, which are really large groups of classes, also applies on the level of individual classes and even methods.

On a class level, the most relevant software design ideas are commonly summarized as the SOLID principles. On a component level, the same ideas are called “package principles.” Here’s a SOLID refresher from Wikipedia:

Single-responsibility principle

A class should only have a single responsibility, that is, only changes to one part of the software's specification should be able to affect the specification of the class.

Open–closed principle

Software entities should be open for extension, but closed for modification.

Liskov substitution principle

Objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program.

Interface segregation principle

Many client-specific interfaces are better than one general-purpose interface.

Dependency inversion principle

Depend upon abstractions, not concretions.
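The dependency inversion principle can be sketched in a few lines of Ruby (class names invented for illustration): the high-level Checkout depends only on a tax calculator duck type, not on any concrete tax class, so implementations can be swapped without touching Checkout.

```ruby
# High-level policy: depends only on the abstraction
# "responds to #tax_for", not on a concrete tax implementation.
class Checkout
  def initialize(tax_calculator)
    @tax_calculator = tax_calculator
  end

  def total(subtotal)
    subtotal + @tax_calculator.tax_for(subtotal)
  end
end

# Low-level detail: one concrete calculator among many possible ones.
class FlatTax
  def tax_for(subtotal)
    (subtotal * 0.10).round(2)
  end
end

Checkout.new(FlatTax.new).total(100.0) # => 110.0
```

The same arrow-flipping idea, scaled up from classes to packages, is what the Stable Dependencies and Stable Abstractions principles below describe.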

The package principles express similar concerns on a different level, for example (source):

Common Closure Principle

Classes that change together are packaged together.

Stable Dependencies Principle

Depend in the direction of stability.

Stable Abstractions Principle

Abstractness increases with stability.

We found that it’s very hard to apply the principles on a component level if the code doesn’t follow the equivalent principles on a class and method level. Well designed classes enable well designed components. Also, people familiar with applying the SOLID principles on a class level can easily scale these ideas up to the component level.

So if you’re having trouble establishing components with strong boundaries, it may make sense to take a step back and make sure your organization gets better at software design at the scale of methods and classes first.

This, again, is mostly a matter of changing people’s behavior, which requires motivation and ability. Both can be increased by spreading awareness of the problems and of approaches to solving them.

In the Ruby world, Sandi Metz is great at teaching these concepts. I recommend her books, and we’re lucky enough to have her teach workshops at Shopify repeatedly. She really gets people excited about software design.

Apply Tooling Deliberately

To accelerate our progress towards the modular monolith, we’ve made a few major changes to our tooling based on our experience so far.

Use Rails Engines

While we started out with a lot of custom code, our components evolved to look more and more like Rails engines. We’re doubling down on engines going forward. They’re the one modularity mechanism that comes with Rails out of the box. They have the familiar look and features of Rails applications but, unlike standalone apps, multiple engines can run in the same process. And should we decide to extract a component from the monolith, an engine is easily transformed into a standalone application.

Engines don’t fit the use case perfectly though. Some of the roughest edges are related to libraries and tooling assuming a Rails application structure, not the slightly different structure of an engine. Others relate to the fact that each engine can (and probably should) specify its own external gem dependencies, and we need a predictable way to unify them into one set of gems for the host application. Thankfully, there are quite a few resources out there from other projects encountering similar problems. Our own explorations have yielded promising results with multiple production applications currently using engines for modularity, and we’re using engines everywhere going forward.

Define and Enforce Contracts

Strong boundaries require explicit contracts. Contracts in code and documentation allow developers to use a component without reading its implementation, making the system feel smaller.

Initially, we built a hash schema validation library called Component::Schema based on dry-schema. It served us well for a while, but we ran into problems keeping up with breaking changes and runtime performance for checking more complex contracts.

In 2019, Stripe released their static Ruby type checker, Sorbet. Shopify was involved in its development before that release and has a team contributing to Sorbet, as we are using it heavily. Now it’s our go-to tool for expressing input and output contracts on component boundaries. Configured correctly, it has barely any runtime performance impact, it’s more stable, and it provides advanced features like interfaces.

This is what an entrypoint into a component looks like using Component::Schema:

And this is what that entrypoint looks like today, using Sorbet:

Perform Static Dependency Analysis

As Kirsten laid out in the original blog post on Componentization at Shopify, we initially built a call graph analysis tool we called Wedge. It logged all method calls during test suite execution on CI to detect calls between components.

We found the results produced were often not useful. Call graph logging produces a lot of data, so it’s hard to separate the signal from the noise. Sometimes it’s not even clear which component a call is from or to. Consider a method defined in component A which is inherited by a class in component B. If this method is making a call to component C, which component is the call coming from? Also, because this analysis depended on the full test suite with added instrumentation, it took over an hour to run, which doesn’t make for a useful feedback cycle.

So, we developed a new tool called Packwerk to analyze static constant references. For example, the line Shop.first contains a static reference to Shop and a call to the method first on that class. Packwerk only analyzes the static constant reference to Shop. There’s less ambiguity in static references, and because they’re always explicitly introduced by developers, highlighting them is more actionable. Packwerk runs a full analysis on our largest codebase in a few minutes, so we’re able to integrate it with our pull request workflow. This allows us to reject changes that break the dependency graph or component encapsulation before they get merged into our main branch.
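As a toy illustration of the difference (not Packwerk’s actual implementation, which handles constant resolution, nesting, and namespacing), Ruby’s built-in Ripper library can surface the constant tokens that appear statically in a snippet:

```ruby
require 'ripper'

# Extract the constant tokens that appear statically in a source snippet.
# Ripper.lex yields [[line, col], token_type, token, state] tuples;
# constants lex as :on_const.
def constant_tokens(source)
  Ripper.lex(source)
        .select { |_, type, _| type == :on_const }
        .map { |_, _, token| token }
end

constant_tokens("Shop.first") # => ["Shop"]
```

Note that the method call first never shows up: only the static reference to Shop does, which is exactly the signal Packwerk acts on.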

We’re planning to make Packwerk open source soon. Stay tuned!

Decide to Prioritize Ownership or Boundaries

There are two major ways to partition an existing monolith and create components from a big ball of mud. In my experience, all large architecture changes end up in an incomplete state. Maybe that’s a pessimistic view, but the temporary incomplete state will, at the very least, last longer than you expect. So choose an approach based on which intermediate state is most useful for your specific situation.

One option is to draw lines through the monolith based on some vision of the future and strengthen those lines over time into full fledged boundaries. The other option is to spin off parts of it into tiny units with strong boundaries and then transition responsibilities over iteratively, growing the components over time.

For our main monolith, we took the first approach; our vision was guided by the ideas of Domain Driven Design. We defined components as implementations of subdomains of the domain of commerce and moved the files into corresponding folders. The main advantage is that even though we’re not finished building out the boundaries, responsibilities are roughly grouped together, and every file has a stewardship team assigned. The disadvantage is that almost no component has a complete, strong boundary yet; because the components contain large amounts of legacy code, it’s a huge amount of work to establish them. This vision-of-the-future approach is good if well-defined ownership and a clearly visible partition of the app are most important for you, which they were for us because of the huge number of people working on the codebase.

On other large apps within Shopify, we’ve tried out the second approach. The advantage is that large parts of the codebase live in isolated, clean components, creating good examples for people to work towards. The disadvantage is that we still have a considerably sized ball of mud within the app that has no structure whatsoever. This spin-off approach is good if clean boundaries are your priority.

What We’re Building Right Now

While feature development on the monolith continues as fast as ever, many developers are making things more modular at the same time. We see an increasing number of people in a position to do this, and the number of good examples around the codebase is expanding.

We currently have 37 components in our main monolith, each with public entrypoints covering large parts of its responsibilities. Packwerk is used on about a third of the components to restrict their dependencies and protect the privacy of their internal implementation. We’re working on making Packwerk enticing enough that all components will adopt it.

Through increased adoption we’re progressively enforcing properties of the dependency graph. Total acyclicity is the long-term goal, but the more edges we can remove from the graph in the short term, the easier the system will be to reason about.

We have a few other monolithic apps going through similar processes of componentization right now; some with the goal of splitting into separate services long term, some aiming for the modular monolith. We are very deliberate about when to split functionality out into separate services, and we only do it for good reasons. That’s because splitting a single monolithic application into a distributed system of services increases the overall complexity considerably.

For example, we split out storefront rendering because it’s a read-only use case with very high throughput and it makes sense for us to scale and distribute it separately from the interface that our merchants use to manage their stores. Credit card vaulting is a separate service because it processes sensitive data that shouldn’t flow through other parts of the system.

In addition, we’re preparing to have all new Rails applications at Shopify componentized by default. The idea is to generate multiple separately tested engines out of the box when creating a Rails app, removing the top level app folder and setting up developers for a modular future from the start.

At the same time, we’re looking into some of the patterns necessary to unblock further adoption of Packwerk. First and foremost that means making the dependency graph easy to clean up. We want to encourage inversion of control and more generally dependency inversion, which will probably lead us to use a publish/subscribe mechanism instead of straightforward method calls in many cases.
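
As an illustration of the kind of decoupling we’re after, here’s a minimal publish/subscribe sketch. It’s written in TypeScript for brevity (our codebase is Ruby, and all names here are invented for illustration): the publishing component never references its consumers, so the direct dependency edge between the two components disappears from the graph.

```typescript
type Handler = (payload: unknown) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(event: string, handler: Handler): void {
    this.handlers.set(event, [...(this.handlers.get(event) ?? []), handler]);
  }

  publish(event: string, payload: unknown): void {
    (this.handlers.get(event) ?? []).forEach((h) => h(payload));
  }
}

// "Orders" publishes an event; "Shipping" reacts to it without Orders
// ever referencing Shipping — the dependency is inverted.
const bus = new EventBus();
const shippingLog: string[] = [];
bus.subscribe("order.created", (payload) => {
  shippingLog.push((payload as { id: string }).id);
});
bus.publish("order.created", { id: "order-1" });
```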

The second big blocker is efficiently querying data across components without coupling them too tightly. The most interesting problems in this area are

  • Our GraphQL API exposes a partially circular graph to external consumers while we’d like the implementation in the components to be acyclic.
  • Our GraphQL query execution and ElasticSearch reindexing currently heavily rely on Active Record features, which defeats the “public interface, private implementation” idea.

The long term vision is to have separate, isolated test suites for most of the components of our main monolith.

Last But Not Least

I want to give a shout out to Josh Abernathy, Bryana Knight, Matt Todd, Matthew Clark, Mike Chlipala and Jakob Class at GitHub. This blog post is based on, and is indirectly the result of, a conversation I had with them. Thank you!

Anita Clarke, Edward Ocampo-Gooding, Gannon McGibbon, Jason Gedge, Martin LaRochelle, and Keyfer Mathewson contributed super valuable feedback on this article. Thank you BJ Fogg for the behavior model and use of your image.

If you’re interested in the kinds of challenges I described, you should join me at Shopify!

    Tophatting in React Native

    On average in 2019, Shopify handled billions of dollars of transactions per week. Therefore, it’s important to ensure new features are thoroughly tested before shipping them to our merchants. A vital part of the software quality process at Shopify is a practice called tophatting. Tophatting is manually testing your coworker’s changes and making sure everything is working properly before approving their pull request (PR).

    Earlier this year, we announced that React Native is the future of mobile development at the company. However, the workflow for tophatting a React Native app was quite tedious and time-consuming. The reviewer had to:

    1. save their current work
    2. switch their development environment to the feature branch
    3. rebuild the app and load the new changes
    4. verify the changes inside the app.

    To provide a more convenient and painless experience, we built a tool enabling React Native developers to quickly load their peer’s work within seconds. I’ll explain how the tool works in detail.

    React Native Tophatting vs Native Tophatting

    About two years ago, the Mobile Tooling team developed a tool for tophatting native apps. It works by storing the app’s build artifacts in cloud storage; mobile developers can then download the app and launch it in an emulator or a simulator on demand. However, the tool’s performance can be improved when there are only React Native code changes, because we don’t need to rebuild and re-download the entire app. One major difference between React Native and native apps is that React Native apps produce an additional build artifact: the JavaScript bundle. If a developer only changes the React Native code and not the native code, then the only build artifact needed to load the changes is the new JavaScript bundle. We leveraged this fact and developed a tool that stores any newly built JavaScript bundles, so React Native apps can fetch any bundle and load the changes almost instantly.

    Storing the JavaScript Bundle

    The main idea behind the tool is to store the JavaScript bundle of any new builds in our cloud storage, so developers can simply download the artifact instead of building it on demand when tophatting.

    New PR on React Native project triggers a CI pipeline in Shopify Build

    When a developer opens a new PR or pushes a new commit in a React Native project on GitHub, it triggers a CI pipeline in Shopify Build, our internal continuous integration/continuous delivery (CI/CD) platform. The pipeline performs the following steps:

    1. The pipeline first builds the app’s JavaScript bundle.
    2. The pipeline compresses the bundle along with any assets that the app uses.
    3. The pipeline makes an API call to a backend service that writes the bundle’s metadata to a SQL database. The metadata includes information such as the app ID, the commit’s Secure Hash Algorithm (SHA) checksum, and the branch name.
    4. The backend service generates a unique bundle ID and a signed URL for uploading to cloud storage.
    5. The pipeline uploads the bundle to cloud storage using the signed URL.
    6. The pipeline makes an API call to the backend service to leave a comment on the PR.
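
    The backend-facing half of these steps (3 through 6) can be sketched as a single function. This is a hypothetical TypeScript sketch, not Shopify’s actual code: the Backend interface and every name in it are invented, and in reality these would be HTTP calls to the backend service.

```typescript
// Hypothetical shape of the backend service, from the pipeline's point of view.
interface Backend {
  // Writes metadata to SQL and returns a bundle ID plus a signed upload URL (steps 3-4).
  register(meta: { appId: string; sha: string; branch: string }): { bundleId: string; uploadUrl: string };
  // Uploads the compressed bundle to cloud storage via the signed URL (step 5).
  upload(signedUrl: string, bundle: Uint8Array): void;
  // Leaves a comment on the PR (step 6).
  comment(prNumber: number, bundleId: string): void;
}

function publishBundle(
  backend: Backend,
  meta: { appId: string; sha: string; branch: string },
  bundle: Uint8Array,
  prNumber: number
): string {
  const { bundleId, uploadUrl } = backend.register(meta);
  backend.upload(uploadUrl, bundle);
  backend.comment(prNumber, bundleId);
  return bundleId;
}
```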

    QR code that developers can scan on their mobile device

    The PR comment confirms that the bundle upload succeeded and gives developers three options for downloading the bundle:

    • A QR code that developers can scan on their mobile device, which opens the app on their device and downloads the bundle.
    • A bundle ID that developers can enter in the Tophat screen to download the bundle without leaving the app. This is useful when developers are working in a simulator or emulator.
    • A link that developers can use to download the bundle directly from a GitHub notification email. This allows developers to tophat without opening the PR on their computer.

    Loading the JavaScript Bundle

    Once the CI pipeline uploads the JavaScript bundle to cloud storage, developers need a way to easily download the bundle and load the changes in their app. We built a React Native component library providing a user interface (called the Tophat screen) for developers to load the changes.

    The Tophat Component Library 

    The component library registers the Tophat screen as a separate component and a URL listener that handles specific deep link events. All developers need to do is to inject the component into the root level of their application.

    The library also includes an action that shows the Tophat screen on demand. Developers open the Tophat screen to see the current bundle version or to reset the current bundle. In the example below, we use the action to construct a “Show Tophat” button, which opens the Tophat screen on press.

    The Tophat Screen

    The Tophat screen looks like a modal or an overlay in the app, but it’s separate from the app’s component tree, so it introduces a non-intrusive UI for React Native apps. 

    React Native Tophat screen in action

    Here’s an example of using the tool to load a different commit in our Local Delivery app.

    The Typical Tophat Workflow

    Typical React Native tophat workflow

    The typical workflow using the Tophat library looks like:

    1. The developer scans the QR code or clicks the link in the GitHub PR comment, which resolves to a URL in the format “{appId}://tophat_bundle/{bundle_id}”.
    2. The URL opens the app on the developer’s device and triggers a deep link event.
    3. The component library captures the event and parses the URL for the app ID and bundle ID.
    4. If the app ID in the URL matches the current app, then the library makes an API call to the backend service requesting a signed download URL and metadata for the corresponding JavaScript bundle.
    5. The Tophat screen displays the bundle’s metadata and asks the developer to confirm whether or not this is the bundle they wish to download.
    6. Upon confirmation, the library downloads the JavaScript bundle from cloud storage and saves the bundle’s metadata using local storage. Then it decompresses the bundle and restarts the app.
    7. When the app is restarting, it detects the new JavaScript bundle and starts the app using that bundle instead.
    8. Once the developer verifies the changes, they can reset the bundle in the Tophat screen.
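
    Step 3’s URL parsing is straightforward. Here’s a sketch of what the library might do — the function name and regex are illustrative; only the “{appId}://tophat_bundle/{bundle_id}” format comes from the workflow above.

```typescript
// Parse a tophat deep link of the form "{appId}://tophat_bundle/{bundle_id}".
// Returns null if the URL doesn't match the expected format.
function parseTophatUrl(url: string): { appId: string; bundleId: string } | null {
  const match = /^([\w.-]+):\/\/tophat_bundle\/([\w-]+)$/.exec(url);
  return match ? { appId: match[1], bundleId: match[2] } : null;
}
```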

    Managing Bundles Using a Backend Service

    In the native tophatting project, we didn’t use a backend service. For this tool, however, we decided to use one to handle most of the business logic. This adds maintenance and infrastructure costs to the project, but we believe the advantages outweigh them. There are two main reasons why we chose to use a backend service:

    1. It abstracts away authentication and implementation details with third-party services.
    2. It provides a scalable way to store metadata, which enables better UI capabilities.

    Abstracting Implementation Details

    The tool requires the use of Google Cloud’s and GitHub’s SDKs, which means the client needs to have an authentication token for each of these services. If a backend service didn’t exist, then each app and its respective CI pipeline would need to configure their own tokens. The CI pipeline and the component library would also need to have consistent storage path formats. This introduces extra complexity and adds additional steps in the tool’s installation process.

    The backend service abstracts away the interactions with third-party services, such as authentication, uploading assets, and creating GitHub comments. The service also generates each bundle’s storage path, eliminating the issue of inconsistent paths across different components.

    Storing Metadata

    Each JavaScript bundle has important metadata that developers need to quickly retrieve along with the bundle. A solution used by the native tophatting project is to store the metadata in the filename of the build artifact. We could leverage the same technique to store the metadata in the JavaScript Bundle’s storage path. However, this isn’t scalable if we wish to include additional metadata. For example, if we want to add the author of the commit to the bundle’s metadata, it would introduce a change in the storage path format, which requires changes in every app’s CI pipeline and the component library.

    By using a backend service, we store more detailed metadata in a SQL database and decouple it from the bundle’s storage. This opens up the possibility of adding features like a confirmation step before downloading the bundle and querying bundles by app IDs or branch names.

    What’s Next?

    The first iteration of the tool is complete, and React Native developers tophat each other’s pull requests by simply scanning a QR code or entering a bundle ID. There are improvements that we want to make in the future:

    • Building and uploading the JavaScript bundle directly from the command line.
    • Showing a list of available JavaScript bundles in the Tophat screen.
    • Detecting native code changes.
    • Designing a better UI in the Tophat screen.

    Almost all of the React Native projects at Shopify are now using the tool and my team keeps working to improve the tophatting experience for our React Native developers.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    5 Ways to Improve Your React Native Styling Workflow

    In April, we announced Shop, our digital shopping assistant that brings together the best features of Arrive and Shop Pay. The Shop app started from our React Native codebase for our previous package tracking app Arrive, with every screen receiving a complete visual overhaul to fit the new branding.

    While our product designers worked on introducing a whole new design system that would decide the look and feel of the app, we on the engineering side took the initiative to evolve our thinking around how we work with styling of screens and components. The end-product became Restyle, our open source library that allowed us to move forward quickly and easily in our transformation from Arrive to Shop.

    I'll walk you through the styling best practices we learned through this process. They served as the guiding principles for the design of Restyle. However, anyone working with a React app can benefit from applying these best practices, with or without using our library.

    The Questions We Needed to Answer

    We faced a number of problems with the styling approach we had in Arrive, and these were the questions we needed to answer to take our workflow to the next level:

    • With a growing team working in different countries and time zones, how do we make sure that the app keeps a consistent style throughout all of its different screens?
    • What can we do to make it easy to style the app to look great on multiple different device sizes and formats?
    • How do we allow the app to dynamically adapt its theme according to the user’s preferences, to support, for example, dark mode?
    • Can we make working with styles in React Native a more enjoyable experience?

    With these questions in place, we came up with the following best practices that provided answers to them. 

    #1. Create a Design System

    A prerequisite for being able to write clean and consistent styling code is for the design of the app to be based on a clean and consistent design system. A design system is commonly defined as a set of rules, constraints and principles that lay the foundation for how the app should look and feel. Building a complete design system is a topic far too big to dig into here, but I want to point out three important areas that the system should define its rules for.

    Spacing

    Size and spacing are the two parameters used when defining the layout of an app. While sizes often vary greatly between different components presented on a screen, the spacing between them should often stay as consistent as possible to create a coherent look. This means that it’s preferred to stick to a small set of predefined spacing constants that’s used for all margins and paddings in the app.

    There are many conventions to choose between when deciding how to name your spacing constants, but I've found the t-shirt size scale (XS, S, M, L, XL, etc.) works best. The order of sizes is easy to understand, and the system is extensible in both directions by prefixing with more X’s.

    Color

    When defining colors in a design system, it’s important not only to choose which colors to stick with, but also how and when they should be used. I like to split these definitions up into two layers:

    • The color palette - This is the set of colors that’s used. These can be named quite literally, e.g. “Blue”, “Light Orange”, “Dark Red”, “White”, “Black”.
    • The semantic colors - A set of names that map to the color palette and describe how it should be applied, that is, what each color’s function is. Some examples are “Primary”, “Background”, “Danger”, “Failure”. Note that multiple semantic colors can map to the same palette color; for example, “Danger” and “Failure” could both map to “Dark Red”.

    When referring to a color in the app, it should be through the semantic color mapping. This makes it easy to later change, for example, the “Primary” color to be green instead of blue. It also allows you to easily swap out color schemes on the fly to, for example, easily accommodate a light and dark mode version of the app. As long as elements are using the “Background” semantic color, you can swap it between a light and dark color based on the chosen color scheme.

    Typography

    Similar to spacing, it‘s best to stick to a limited set of font families, weights, and sizes to achieve a coherent look throughout the app. A grouping of these typographic elements is defined together as a named text variant. Your “Header” text might be size 36, have a bold weight, and use the font family “Raleway”. Your “Body” text might use the “Merriweather” family with a regular font weight, and size 16.

    #2. Define a Theme Object

    A carefully put together design system following the spacing, colour, and typography practices above should be defined in the app‘s codebase as a theme object. Here‘s how a simple version might look:
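
    For instance (every name and value below is illustrative, not Shopify’s actual theme; the text variants reuse the examples from above):

```typescript
// The raw palette, named literally. It stays private to this file.
const palette = {
  blue: "#3B6EA5",
  darkRed: "#8B0000",
  white: "#FFFFFF",
  black: "#000000",
};

// The theme exposes only semantic names, t-shirt-sized spacing, and text variants.
const theme = {
  colors: {
    primary: palette.blue,
    background: palette.white,
    foreground: palette.black,
    danger: palette.darkRed,
  },
  spacing: { xs: 4, s: 8, m: 16, l: 24, xl: 40 },
  textVariants: {
    header: { fontFamily: "Raleway", fontWeight: "bold", fontSize: 36 },
    body: { fontFamily: "Merriweather", fontWeight: "normal", fontSize: 16 },
  },
};
```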

    All values relating to your design system and all uses of these values in the app should be through this theme object. This makes it easy to tweak the system by only needing to edit values in a single source of truth.

    Notice how the palette is kept private to this file, and only the semantic color names are included in the theme. This enforces the best practice with colors in your design system.

    #3. Supply the Theme through React‘s Context API

    Now that you've defined your theme object, you might be tempted to start directly importing it in all the places where it's going to be used. While this might seem like a great approach at first, you’ll quickly find its limitations once you’re looking to work more dynamically with the theming. In the case of wanting to introduce a secondary theme for a dark mode version of the app, you would need to either:

    • Import both themes (light and dark mode), and in each component determine which one to use based on the current setting, or
    • Replace the values in the global theme definition when switching between modes. 

    The first option will introduce a large amount of tedious code repetition. The second option will only work if you force React to re-render the whole app when switching between light and dark modes, which is typically considered a bad practice. If you have a dynamic value that you want to make available to all components, you’re better off using React’s context API. Here’s how you would set this up with your theme:
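
    In React you’d reach for createContext/useContext, with a ThemeProvider component wrapping the app. To keep this sketch self-contained and runnable outside React, here’s a toy provider that captures the mechanism: consumers subscribe to the current theme, and switching themes notifies every consumer — in a real app, React’s context provider triggers these re-renders for you.

```typescript
type Theme = { background: string };
const lightTheme: Theme = { background: "#FFFFFF" };
const darkTheme: Theme = { background: "#000000" };

// Toy stand-in for React's context provider.
class ThemeProvider {
  private listeners: Array<(t: Theme) => void> = [];
  constructor(private current: Theme) {}

  get(): Theme {
    return this.current;
  }

  subscribe(listener: (t: Theme) => void): void {
    this.listeners.push(listener);
  }

  // Switching themes notifies every consumer, like a context-driven re-render.
  set(next: Theme): void {
    this.current = next;
    this.listeners.forEach((l) => l(next));
  }
}

const provider = new ThemeProvider(lightTheme);
let renderedBackground = provider.get().background;
provider.subscribe((t) => {
  renderedBackground = t.background; // a "component" re-rendering with the new theme
});
provider.set(darkTheme);
```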

    The theme in React’s context will make sure that whenever the app changes between light and dark mode, all components that access the theme will automatically re-render with the updated values. Another benefit of having the theme in context is being able to swap out themes on a sub-tree level. This allows you to have different color schemes for different screens in the app, which could, for example, allow users to customize the colors of their profile page in a social app.

    #4. Break the System into Components

    While it‘s entirely possible to keep reaching into the context to grab values from the theme for any view that needs to be styled, this will quickly become repetitious and overly verbose. A better way is to have components that directly map properties to values in the theme. There are two components that I find myself needing the most when working with themes this way, Box and Text.

    The Box component is similar to a View, but instead of accepting a style object property to do the styling, it directly accepts properties such as margin, padding, and backgroundColor. These properties are configured to only receive values available in the theme, like this:
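
    Concretely, the prop types might be constrained like this (a sketch; Restyle’s real prop set is richer):

```typescript
// Minimal stand-in theme values.
const spacing = { s: 8, m: 16 } as const;
const colors = { primary: "#3B6EA5" } as const;

// Box only accepts keys that exist in the theme.
type BoxProps = {
  margin?: keyof typeof spacing; // "s" | "m"
  padding?: keyof typeof spacing;
  backgroundColor?: keyof typeof colors; // "primary"
};

// In JSX this reads: <Box margin="m" padding="s" backgroundColor="primary"> ... </Box>
const example: BoxProps = { margin: "m", padding: "s", backgroundColor: "primary" };
```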

    The “m” and “s” values here map to the spacings we‘ve defined in the theme, and “primary” maps to the corresponding color. This component is used in most places where we need to add some spacing and background colors, simply by wrapping it around other components.

    While the Box component is handy for creating layouts and adding background colors, the Text component comes into play when displaying text. Since React Native already requires you to use its Text component around any text in the app, this becomes a drop-in replacement for it:
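
    Usage might look like the following sketch (the variant and color names are the illustrative ones from the design-system examples above):

```typescript
// Stand-in theme fragments.
const textVariants = {
  header: { fontFamily: "Raleway", fontWeight: "bold", fontSize: 36 },
  body: { fontFamily: "Merriweather", fontWeight: "normal", fontSize: 16 },
} as const;
const colors = { foreground: "#000000", danger: "#8B0000" } as const;

type TextProps = {
  variant?: keyof typeof textVariants; // "header" | "body"
  color?: keyof typeof colors;
};

// In JSX this reads: <Text variant="header" color="foreground">Hello</Text>
const example: TextProps = { variant: "header", color: "foreground" };
```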

    The variant property applies all the properties that we defined in the theme for textVariant.header, and the color property follows the same principle as the Box component’s backgroundColor, but for the text color instead.

    Here’s how both of these components would be implemented:
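
    The heart of both components is a function that maps theme-constrained props to a concrete style object; the component then reads the theme from context and spreads the result into the underlying View or Text. A simplified, React-free sketch of that mapping (prop and theme names are illustrative):

```typescript
const theme = {
  spacing: { s: 8, m: 16 },
  colors: { primary: "#3B6EA5", foreground: "#000000" },
  textVariants: { header: { fontFamily: "Raleway", fontWeight: "bold", fontSize: 36 } },
} as const;

type Theme = typeof theme;

interface BoxProps {
  margin?: keyof Theme["spacing"];
  padding?: keyof Theme["spacing"];
  backgroundColor?: keyof Theme["colors"];
}

// Translate Box props into a React Native style object.
function boxStyle(props: BoxProps, t: Theme): Record<string, unknown> {
  const style: Record<string, unknown> = {};
  if (props.margin) style.margin = t.spacing[props.margin];
  if (props.padding) style.padding = t.spacing[props.padding];
  if (props.backgroundColor) style.backgroundColor = t.colors[props.backgroundColor];
  return style;
}

interface TextProps {
  variant?: keyof Theme["textVariants"];
  color?: keyof Theme["colors"];
}

// Translate Text props: the variant's typography plus an optional color.
function textStyle(props: TextProps, t: Theme): Record<string, unknown> {
  return {
    ...(props.variant ? t.textVariants[props.variant] : {}),
    ...(props.color ? { color: t.colors[props.color] } : {}),
  };
}
```

    In the real components, a hook reads the theme from context and the returned object is passed to the underlying View’s or Text’s style prop.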

    Styling directly through properties instead of keeping a separate style sheet might seem weird at first. I promise that once you start doing it you’ll quickly start to appreciate how much time and effort you save by not needing to jump back and forth between components and style sheets during your styling workflow.

    #5. Use Responsive Style Properties

    Responsive design is a common practice in web development where alternative styles are often specified for different screen sizes and device types. It seems that this practice has yet to become commonplace within the development of React Native apps. The need for responsive design is apparent in web apps where the device size can range from a small mobile phone to a widescreen desktop device. A React Native app only targeting mobile devices might not work with the same extreme device size differences, but the variance in potential screen dimensions is already big enough to make it hard to find a one-size-fits-all solution for your styling.

    An app onboarding screen that displays great on the latest iPhone Pro will most likely not work as well with the limited screen real estate of a first-generation iPhone SE. Small tweaks to the layout, spacing, and font size based on the available screen dimensions are often necessary to craft the best experience for all devices. In responsive design this work is done by categorizing devices into a set of predefined screen sizes defined by their breakpoints, for example:
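
    A breakpoints definition matching the categories described below might look like:

```typescript
// Widths (in points) at which each device category begins.
const breakpoints = {
  smallPhone: 0,
  phone: 321,
  tablet: 768,
} as const;
```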

    With these breakpoints we're saying that anything below 321 pixels in width should fall in the category of being a small phone, anything above that but below 768 is a regular phone size, and everything wider than that is a tablet.

    With these set, let's expand our previous Box component to also accept specific props for each screen size, in this manner:
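
    For example, per-breakpoint values could be passed like this (a sketch of the prop shape, not Restyle’s exact API):

```typescript
type Breakpoint = "smallPhone" | "phone" | "tablet";
// A responsive value is either a plain value or one value per breakpoint.
type ResponsiveValue<T> = T | Partial<Record<Breakpoint, T>>;

// In JSX this reads: <Box padding={{ smallPhone: "s", phone: "m", tablet: "l" }}>
const padding: ResponsiveValue<"s" | "m" | "l"> = {
  smallPhone: "s",
  phone: "m",
  tablet: "l",
};
```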

    Here's roughly how you would go about implementing this functionality:
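
    A simplified resolver might pick, for each prop, the value of the largest breakpoint the current screen width satisfies (again a sketch with illustrative names; as noted below, a hook should supply the live screen width):

```typescript
const breakpoints = { smallPhone: 0, phone: 321, tablet: 768 } as const;
type Breakpoint = keyof typeof breakpoints;
type ResponsiveValue<T> = T | Partial<Record<Breakpoint, T>>;

// Resolve a (possibly responsive) prop value for the current screen width.
function resolveResponsiveValue<T>(value: ResponsiveValue<T>, screenWidth: number): T | undefined {
  if (typeof value !== "object" || value === null) return value as T;
  const perBreakpoint = value as Partial<Record<Breakpoint, T>>;
  // Of the breakpoints this width satisfies, take the widest one that has a value.
  const active = (Object.keys(breakpoints) as Breakpoint[])
    .filter((bp) => screenWidth >= breakpoints[bp] && bp in perBreakpoint)
    .sort((a, b) => breakpoints[a] - breakpoints[b])
    .pop();
  return active !== undefined ? perBreakpoint[active] : undefined;
}
```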

    In a complete implementation of the above you would ideally use a hook based approach to get the current screen dimensions that also refreshes on change (for example when changing device orientation), but I’ve left that out in the interest of brevity.

    #6. Enforce the System with TypeScript

    This final best practice requires you to be using TypeScript for your project.

    TypeScript and React pair incredibly well together, especially when using a modern code editor such as Visual Studio Code. Instead of relying on React’s PropTypes validation, which only happens when the component is rendered at run-time, TypeScript allows you to validate these types as you are writing the code. This means that if TypeScript isn’t displaying any errors in your project, you can rest assured that there are no invalid uses of the component anywhere in your app.

    Using the prop validation mechanisms of TypeScript

    TypeScript isn’t only there to tell you when you’ve done something wrong, it can also help you in using your React components correctly. Using the prop validation mechanisms of TypeScript, we can define our property types to only accept values available in the theme. With this, your editor will not only tell you if you're using an unavailable value, it will also autocomplete to one of the valid values for you.

    Here's how you need to define your types to set this up:
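
    The trick is deriving the prop types from the theme object itself with keyof, so the compiler (and autocomplete) only accepts theme keys. A sketch:

```typescript
const theme = {
  spacing: { s: 8, m: 16, l: 24 },
  colors: { primary: "#3B6EA5", background: "#FFFFFF" },
} as const;

type Theme = typeof theme;

interface BoxProps {
  margin?: keyof Theme["spacing"]; // "s" | "m" | "l"
  padding?: keyof Theme["spacing"];
  backgroundColor?: keyof Theme["colors"]; // "primary" | "background"
}

const ok: BoxProps = { margin: "m", backgroundColor: "primary" };
// const bad: BoxProps = { margin: "xl" }; // compile-time error: "xl" is not in the theme
```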

    Evolve Your Styling Workflow

    Following the best practices above via our Restyle library made a significant improvement to how we work with styles in our React Native app. Styling has become more enjoyable through the use of Restyle’s Box and Text components, and by restricting the options for colors, typography and spacing it’s now much easier to build a great-looking prototype for a new feature before needing to involve a designer in the process. The use of responsive style properties has also made it easy to tailor styles to specific screen sizes, so we can work more efficiently with crafting the best experience for any given device.

    Restyle’s configurability through theming allowed us to maintain a theme for Arrive while iterating on the theme for Shop. Once we were ready to flip the switch, we just needed to point Restyle to our new theme to complete the transformation. We also introduced dark mode into the app without it being a concrete part of our roadmap—we found it so easy to add we simply couldn't resist doing it.

    If you've asked some of the same questions we posed initially, you should consider adopting these best practices. And if you want a tool that helps you along the way, our Restyle library is there to guide you and make it an enjoyable experience.

    Wherever you are, your next journey starts here! Intrigued? We’d love to hear from you.

    How to Track State with Type 2 Dimensional Models

    Application databases are generally designed to only track current state. For example, a typical user’s data model will store the current settings for each user. This is known as a Type 1 dimension. Each time they make a change, their corresponding record will be updated in place:

    | id | … | created_at | updated_at |
    | --- | --- | --- | --- |
    | 1 | … | 2019-01-01 12:14:23 | 2019-01-01 12:14:23 |
    | 2 | … | 2019-01-01 15:21:45 | 2019-01-02 05:20:00 |

    This makes a lot of sense for applications. They need to be able to rapidly retrieve settings for a given user in order to determine how the application behaves. An indexed table at the user grain accomplishes this well.

    But, as analysts, we not only care about the current state (how many users are using feature “X” as of today), but also the historical state. How many users were using feature “X” 90 days ago? What is the 30 day retention rate of the feature? How often are users turning it off and on? To accomplish these use cases we need a data model that tracks historical state:

    | id | … | valid_from | valid_to |
    | --- | --- | --- | --- |
    | 1 | … | 2019-01-01 12:14:23 | |
    | 2 | … | 2019-01-01 15:21:45 | 2019-01-02 05:20:00 |
    | 2 | … | 2019-01-02 05:20:00 | |

    This is known as a Type 2 dimensional model. I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).

    Implementing Type 2 Dimensional Models at Shopify

    I currently work as a data scientist in the International product line at Shopify. Our product line is focused on adapting and scaling our product around the world. One of the first major efforts we undertook was translating Shopify’s admin in order to make our software available to use in multiple languages.

    Shopify admin translated

    At Shopify, data scientists work across the full stack—from data extraction and instrumentation, to data modelling, dashboards, analytics, and machine learning powered products. As a product data scientist, I’m responsible for understanding how our translated versions of the product are performing. How many users are adopting them? How is adoption changing over time? Are they retaining the new language, or switching back to English? If we default a new user from Japan into Japanese, are they more likely to become a successful merchant than if they were first exposed to the product in English and given the option to switch? In order to answer these questions, we first had to figure out how our data could be sourced or instrumented, and then eventually modelled.

    The functionality that decides which language to render Shopify in is based on the language setting our engineers added to the users data model. 

    | id | language | created_at | updated_at |
    | --- | --- | --- | --- |
    | 1 | en | 2019-01-01 12:14:23 | 2019-06-01 07:15:03 |
    | 2 | ja | 2019-02-02 11:00:35 | 2019-02-02 11:00:35 |

    User 1 will experience the Shopify admin in English, User 2 in Japanese, and so on. Like most data models powering Shopify’s software, the users model is a Type 1 dimension. Each time a user changes their language, or any other setting, the record gets updated in place. As I alluded to above, this data model doesn’t allow us to answer many of our questions, as they involve knowing what language a given user was using at a particular point in time. Instead, we needed a data model that tracked users’ languages over time. There are several ways to approach this problem.

    Options For Tracking State

    Modify Core Application Model Design

    In an ideal world, the core application database model is designed to track state: rather than having a record be updated in place, new settings are appended as a new record. Because the data is tracked directly in the source of truth, you can fully trust its accuracy. If you’re working closely with engineers prior to the launch of a product or new feature, you can advocate for this data model design. However, you will often run into two challenges with this approach:

    1. Engineers will be very reluctant to change the data model design to support analytical use cases. They want the application to be as performant as possible (as should you), and having a data model which keeps all historical state is not conducive to that.
    2. Most of the time, new features or products are built on top of pre-existing data models. As a result, modifying an existing table design to track history will come with an expensive and risky migration process, along with the aforementioned performance concerns.

    In the case of rendering languages for the Shopify admin, the language field was added to the pre-existing users model, and updating this model design was out of the question.

    Stitch Together Database Snapshots

    System that extracts newly created or updated records from the application databases on a fixed schedule

    At most technology companies, snapshots of application database tables are extracted into the data warehouse or data lake. At Shopify, we have a system that extracts newly created or updated records from the application databases on a fixed schedule.

    These snapshots can be leveraged as an input source for building a Type 2 dimension. However, given the fixed schedule of the data extraction system, it is possible that you will miss updates happening between one extract and the next.

    If you are using dbt for your data modelling, you can leverage its nice built-in solution for building Type 2s from snapshots!

    Add Database Event Logging

    Each newly created or updated record is stored in a log in Kafka

    Another alternative is to add a new event log. Each newly created or updated record is stored in this log. At Shopify, we rely heavily on Kafka as a pipeline for transferring real-time data between our applications and data land, which makes it an ideal candidate for implementing such a log.

    If you work closely with engineers, or are comfortable working in your application codebase, you can get new logging in place that will stream any new or updated record to Kafka. Shopify is built on the Ruby on Rails web framework. Rails has something called “Active Record Callbacks”, which allows you to trigger logic before or after an alteration of an object’s (read “database record’s”) state. For our use case, we can leverage the after_commit callback to log a record to Kafka after it has been successfully created or updated in the application database.

    While this option isn’t perfect, and comes with a host of other caveats I will discuss later, we ended up choosing it as it was the quickest and easiest solution to implement that provided the required granularity.

    Type 2 Modelling Recipes

    Below, I’ll walk through some recipes for building Type 2 dimensions from the event logging option discussed above. We’ll stick with our example of modelling users’ languages over time, and assume we added event logging to the users model from day one (i.e. when the table was first created). Here’s an example of what our user_update event log would look like:

    | user_id | language | created_at | updated_at |
    | ------- | -------- | ------------------- | ------------------- |
    | 1 | en | 2019-01-01 12:14:23 | 2019-01-01 12:14:23 |
    | 2 | en | 2019-02-02 11:00:35 | 2019-02-02 11:00:35 |
    | 2 | fr | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 |
    | 2 | fr | 2019-02-02 11:00:35 | 2019-02-02 13:01:17 |
    | 2 | en | 2019-02-02 11:00:35 | 2019-02-02 14:10:01 |

    This log describes the full history of the users data model.

    1. User 1 gets created at 2019-01-01 12:14:23 with English as the default language.
    2. User 2 gets created at 2019-02-02 11:00:35 with English as the default language.
    3. User 2 decides to switch to French at 2019-02-02 12:15:06.
    4. User 2 changes some other setting that is tracked in the users model at 2019-02-02 13:01:17.
    5. User 2 decides to switch back to English at 2019-02-02 14:10:01.

    Our goal is to transform this event log into a Type 2 dimension that looks like this:

    | user_id | language | is_current | valid_from | valid_to |
    | ------- | -------- | ---------- | ------------------- | ------------------- |
    | 1 | en | true | 2019-01-01 12:14:23 | null |
    | 2 | en | false | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 |
    | 2 | fr | false | 2019-02-02 12:15:06 | 2019-02-02 14:10:01 |
    | 2 | en | true | 2019-02-02 14:10:01 | null |

    We can see that the current state for all users can easily be retrieved with a SQL query that filters for WHERE is_current. These records also have a null value for the valid_to column, since they are still in use. However, it is common practice to fill these nulls with something like the timestamp at which the job last ran, since the actual values may have changed since then.
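    Here’s a plain-Python sketch of that current-state filter, plus the kind of point-in-time lookup a Type 2 dimension makes easy (illustrative only; in the warehouse these would be SQL queries against the dimension table):

    ```python
    # A Type 2 dimension as rows of (user_id, language, is_current, valid_from, valid_to);
    # valid_to is None for records that are still in effect.
    dim = [
        (1, "en", True,  "2019-01-01 12:14:23", None),
        (2, "en", False, "2019-02-02 11:00:35", "2019-02-02 12:15:06"),
        (2, "fr", False, "2019-02-02 12:15:06", "2019-02-02 14:10:01"),
        (2, "en", True,  "2019-02-02 14:10:01", None),
    ]

    # Current state: the equivalent of SELECT user_id, language ... WHERE is_current.
    current = {user_id: lang for user_id, lang, is_current, _, _ in dim if is_current}

    def language_at(dim, user_id, ts):
        """Point-in-time lookup: the language a user had at timestamp ts."""
        for uid, lang, _is_current, valid_from, valid_to in dim:
            if uid == user_id and valid_from <= ts and (valid_to is None or ts < valid_to):
                return lang
        return None
    ```

    Zero-padded timestamp strings compare correctly as plain strings, which keeps the sketch dependency-free.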


    Due to Spark’s ability to scale to massive datasets, we use it at Shopify for building our data models that get loaded to our data warehouse. To avoid the mess that comes with installing Spark on your machine, you can leverage a pre-built docker image with PySpark and Jupyter notebook pre-installed. If you want to play around with these examples yourself, you can pull down this docker image with docker pull jupyter/pyspark-notebook:c76996e26e48 and then run docker run -p 8888:8888 jupyter/pyspark-notebook:c76996e26e48 to spin up a notebook where you can run PySpark locally.

    We’ll start with some boilerplate code to create a Spark DataFrame containing our sample of user update events:
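    As a stand-in sketch, here is that sample of user update events as plain Python tuples; the actual job builds the equivalent Spark DataFrame (e.g. via spark.createDataFrame), but plain tuples are enough to follow the transformation logic:

    ```python
    # Sample user_update events: (user_id, language, created_at, updated_at).
    events = [
        (1, "en", "2019-01-01 12:14:23", "2019-01-01 12:14:23"),
        (2, "en", "2019-02-02 11:00:35", "2019-02-02 11:00:35"),
        (2, "fr", "2019-02-02 11:00:35", "2019-02-02 12:15:06"),
        (2, "fr", "2019-02-02 11:00:35", "2019-02-02 13:01:17"),  # non-language update
        (2, "en", "2019-02-02 11:00:35", "2019-02-02 14:10:01"),
    ]
    ```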

    With that out of the way, the first step is to filter our input log to only include records where the columns of interest were updated. With our event instrumentation, we log an event whenever any record in the users model is updated. For our use case, we only care about instances where the user’s language was updated (or created for the first time). It’s also possible that you will get duplicate records in your event logs, since Kafka clients typically support “at-least-once” delivery. The code below will also filter out these cases:
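    Here’s a plain-Python sketch of that logic (not the actual PySpark job): sort by user and update time, keep the first event per user plus any event where the language changed, and let no-op updates and exact duplicates fall through:

    ```python
    # Sample user_update events: (user_id, language, created_at, updated_at).
    events = [
        (1, "en", "2019-01-01 12:14:23", "2019-01-01 12:14:23"),
        (2, "en", "2019-02-02 11:00:35", "2019-02-02 11:00:35"),
        (2, "fr", "2019-02-02 11:00:35", "2019-02-02 12:15:06"),
        (2, "fr", "2019-02-02 11:00:35", "2019-02-02 13:01:17"),  # non-language update
        (2, "en", "2019-02-02 11:00:35", "2019-02-02 14:10:01"),
    ]

    def language_changes(events):
        """Keep the first event per user plus events where the language changed."""
        changes, last_language = [], {}
        for user_id, language, _created_at, updated_at in sorted(
            events, key=lambda e: (e[0], e[3])
        ):
            if user_id not in last_language or last_language[user_id] != language:
                changes.append((user_id, language, updated_at))
            # No-op updates and at-least-once duplicates carry the same language,
            # so they fall through without emitting a new change.
            last_language[user_id] = language
        return changes
    ```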

    We now have something that looks like this:

    | user_id | language | updated_at |
    | ------- | -------- | ------------------- |
    | 1 | en | 2019-01-01 12:14:23 |
    | 2 | en | 2019-02-02 11:00:35 |
    | 2 | fr | 2019-02-02 12:15:06 |
    | 2 | en | 2019-02-02 14:10:01 |
    The last step is fairly simple; we produce one record per period for which a given language was enabled:
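    Conceptually this is a window function (SQL’s LEAD(updated_at) partitioned by user), sketched here in plain Python: each change’s valid_to is the next change’s timestamp for that user, and the open-ended records are filled with the job’s run time:

    ```python
    def to_type2(changes, job_run_time):
        """changes: (user_id, language, updated_at) rows, ordered by time per user.
        Emits (user_id, language, valid_from, valid_to), one row per period."""
        rows, per_user = [], {}
        for user_id, language, updated_at in changes:
            per_user.setdefault(user_id, []).append((language, updated_at))
        for user_id, periods in sorted(per_user.items()):
            for i, (language, valid_from) in enumerate(periods):
                # valid_to is the next change for this user, i.e. LEAD(updated_at);
                # the last (current) record is capped at the job's run time.
                is_last = i + 1 == len(periods)
                valid_to = job_run_time if is_last else periods[i + 1][1]
                rows.append((user_id, language, valid_from, valid_to))
        return rows

    changes = [
        (1, "en", "2019-01-01 12:14:23"),
        (2, "en", "2019-02-02 11:00:35"),
        (2, "fr", "2019-02-02 12:15:06"),
        (2, "en", "2019-02-02 14:10:01"),
    ]
    rows = to_type2(changes, "2020-05-23 00:56:49")
    ```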

    | user_id | language | valid_from | valid_to |
    | ------- | -------- | ------------------- | ------------------- |
    | 1 | en | 2019-01-01 12:14:23 | 2020-05-23 00:56:49 |
    | 2 | en | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 |
    | 2 | fr | 2019-02-02 12:15:06 | 2019-02-02 14:10:01 |
    | 2 | en | 2019-02-02 14:10:01 | 2020-05-23 00:56:49 |

    dbt is an open source tool that lets you build new data models in pure SQL. It’s a tool we’re currently exploring at Shopify to supplement modelling in PySpark, and one I’m really excited about. When writing PySpark jobs, you’re often composing the SQL in your head and then figuring out how to translate it to the PySpark API. Why not just build the model in pure SQL? dbt lets you do exactly that:

    This SQL replicates the same steps as the PySpark example and produces the same output shown above.

    Gotchas, Lessons Learned, and The Path Forward

    I’ve leveraged the approaches outlined above with multiple data models now. Here are a few of the things I’ve learned along the way.

    1. It took us a few tries before we landed on the approach outlined above. 

    In some initial implementations, we were logging the record changes before they had been successfully committed to the database, which resulted in some mismatches in the downstream Type 2 models. Since then, we’ve been sure to always leverage the after_commit callback based approach.

    2. There are some pitfalls with logging changes from within the code:

    • Your event logging becomes susceptible to future code changes. For example, an engineer refactors some code and removes the after_commit call. These are rare, but can happen. A good safeguard against this is to leverage tooling like the CODEOWNERS file, which notifies you when a particular part of the codebase is being changed.
    • You may miss record updates that are not triggered from within the application code. Again, these are rare, but it is possible to have an external process that is not using the Rails User model when making changes to records in the database.

    3. It is possible to lose some events in the Kafka process.

    For example, if one of the Shopify servers running the Ruby code were to fail before the event was successfully emitted to Kafka, you would lose that update event. Same thing if Kafka itself were to go down. Again, rare, but nonetheless something you should be willing to live with. There are a few ways you can mitigate the impact of these events:

    • Have continuous data quality checks running that compare the Type 2 dimensional model against the current state and check for discrepancies.
    • If and when any discrepancies are detected, you could augment your event log using the current state snapshot.
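    A minimal sketch of such a check (hypothetical helper names; a real version would run as a scheduled job comparing warehouse tables):

    ```python
    def find_discrepancies(type2_rows, current_state):
        """type2_rows: (user_id, language, is_current) tuples from the Type 2 model.
        current_state: {user_id: language} from a snapshot of the source of truth.
        Returns user_ids whose current Type 2 record disagrees with the snapshot."""
        latest = {uid: lang for uid, lang, is_current in type2_rows if is_current}
        return sorted(
            uid for uid, lang in current_state.items() if latest.get(uid) != lang
        )
    ```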

    4. If deletes occur in a particular data model, you need to implement a way to handle this.

    Otherwise, the deleted events will be indistinguishable from normal create or update records with the logging setup I showed above. Here are some ways around this:

    • Have your engineers modify the table design to use soft deletes instead of hard deletes. 
    • Add a new field to your Kafka schema that records the type of event that triggered the change (create, update, or delete), and handle each accordingly in your Type 2 model code.
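    Here’s a minimal sketch of the second option, assuming a hypothetical event_type field in the Kafka schema (plain Python, not the actual model code): a delete closes out the user’s current record rather than opening a new one:

    ```python
    def language_changes(events):
        """events: (user_id, language, event_type, updated_at) tuples, where
        event_type is one of "create", "update", "delete" (hypothetical field)."""
        changes, last_language = [], {}
        for user_id, language, event_type, updated_at in sorted(
            events, key=lambda e: (e[0], e[3])
        ):
            if event_type == "delete":
                # Emit a tombstone; downstream code uses it to set valid_to
                # and stop the history for this user.
                changes.append((user_id, None, updated_at))
                last_language.pop(user_id, None)
            elif user_id not in last_language or last_language[user_id] != language:
                changes.append((user_id, language, updated_at))
                last_language[user_id] = language
        return changes
    ```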

    Implementing Type 2 dimensional models for Shopify’s admin languages was truly an iterative process and took investment from both data and engineering to successfully implement. With that said, we have found the analytical value of the resulting Type 2 models well worth the upfront effort.

    Looking ahead, there’s an ongoing project at Shopify by one of our data engineering teams to store the MySQL binary logs (binlogs) in data land. Binlogs are a much better source for a log of data modifications, as they are directly tied to the source of truth (the MySQL database), and are much less susceptible to data loss than the Kafka based approach. With binlog extractions in place, you don’t need to add separate Kafka event logging to every new model as changes will be automatically tracked for all tables. You don’t need to worry about code changes or other processes making updates to the data model since the binlogs will always reflect the changes made to each table. I am optimistic that with binlogs as a new, more promising source for logging data modifications, along with the recipes outlined above, we can produce Type 2s out of the box for all new models. Everybody gets a Type 2!

    Additional Information

    SQL Query Recipes

    Once we have our data modelled as a Type 2 dimension, there are a number of questions we can start easily answering:

    Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    ShipIt! Presents: A Look at Shopify's API Health Report


    On July 17, 2020, ShipIt!, our monthly event series, presented A Look at Shopify's API Health Report. Our guests, Shuting Chang, Robert Saunders, Karen Xie, and Vrishti Dutta, join us to talk about Shopify’s API Health Report, the tool this multidisciplinary team built to surface breaking changes affecting Shopify Partner apps.

    Additional Information

    The links shared to the audience during the event:

    API Versioning at Shopify

    Shopify GraphQL

    API Support Channels

    Other Links

    If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions.

    Continue reading

    How Shopify Reduced Storefront Response Times with a Rewrite


    In January 2019, we set out to rewrite the critical software that powers all online storefronts on Shopify’s platform to offer the fastest online shopping experience possible, entirely from scratch and without downtime.

    The Storefront Renderer is a server-side application that loads a Shopify merchant's storefront Liquid theme, along with the data required to serve the request (for example product data, collection data, inventory information, and images), and returns the HTML response back to your browser. Shaving milliseconds off response time leads to big results for merchants on the platform as buyers increasingly expect pages to load quickly, and failing to deliver on performance can hinder sales, not to mention other important signals like SEO.

    The previous storefront implementation, whose development started over 15 years ago when Tobi launched Snowdevil, lived within Shopify’s Ruby on Rails monolith. Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith: it has much stricter performance requirements and can accept more implementation complexity to improve performance, whereas other components (such as payment processing) need to favour correctness and readability.

    In addition to this difference in paradigm, storefront requests progressively became slower to compute as we saw more storefront traffic on the platform. This performance decline led to a direct impact on our merchant storefronts’ performance, where time-to-first-byte metrics from Shopify servers slowly crept up as time went on.

    Here’s how the previous architecture looked:

    Old Storefront Implementation

    Before, the Rails monolith handled almost all kinds of traffic: checkout, admin, APIs, and storefront.

    With the new implementation, traffic routing looks like this:

    New Storefront Implementation

    The Rails monolith still handles checkout, admin, and API traffic, but storefront traffic is handled by the new implementation.

    Designing the new storefront implementation from the ground up allowed us to think about the guarantees we could provide: we took the opportunity of this evergreen project to set us up on strong primitives that can be extended in the future, which would have been much more difficult to retrofit in the legacy implementation. An example of these foundations is the decision to design the new implementation on top of an active-active replication setup. As a result, the new implementation always reads from dedicated read replicas, improving performance and reducing load on the primary writers.

    Similarly, by rebuilding and extracting the storefront-related code in a dedicated application, we took the opportunity to think about building the best developer experience possible: great debugging tools, simple onboarding setup, welcoming documentation, and so on.

    Finally, with improving performance as a priority, we work to increase resilience and capacity in high load scenarios (think flash sales: events where a large number of buyers suddenly start shopping on a specific online storefront), and invest in the future of storefront development at Shopify. The end result is a fast, resilient, single-purpose application that serves high-throughput online storefront traffic for merchants on the Shopify platform as quickly as possible.

    Defining Our Success Criteria

    Once we clearly outlined the problem we’re trying to solve and scoped out the project, we defined three main success criteria:

    • Establishing feature parity: for a given input, both implementations generate the same output.
    • Improving performance: the new implementation runs on active-active replication setup and minimizes server response times.
    • Improving resilience and capacity: in high-load scenarios, the new implementation generally sustains traffic without causing errors.

    Building A Verifier Mechanism

    Before building the new implementation, we needed a way to make sure that whatever we built would behave the same way as the existing implementation. So, we built a verifier mechanism that compares the output of both implementations and returns a positive or negative result depending on the outcome of the comparison.

    This verification mechanism runs on storefront traffic in production, and it keeps track of verification results so we can identify differences in output that need fixing. Running the verifier mechanism on production traffic (in addition to comparing the implementations locally through a formal specification and a test suite) lets us identify the most impactful areas to work on when fixing issues, and keeps us focused on the prize: reaching feature parity as quickly as possible. This incremental approach is desirable for multiple reasons:

    • giving us an idea of progress and spreading the risk over a large amount of time
    • shortening the period of time that developers at Shopify work with two concurrent implementations at once
    • providing value to Shopify merchants as soon as possible.

    There are two parts to the entire verifier mechanism implementation:

    1. A verifier service (implemented in Ruby) compares the two responses we provide and returns a positive or negative result depending on the verification outcome. Similar to a `diff` tool, it lets us identify differences between the new and legacy implementations.
    2. A custom nginx routing module (implemented in Lua on top of OpenResty) sends a sample of production traffic to the verifier service for verification. This module acts as a router depending on the result of the verifications for subsequent requests.

    The following diagram shows how each part interacts with the rest of the architecture:

    Legacy implementation and new implementation at the same conceptual layer

    The legacy implementation (the Rails monolith) still exists, and the new implementation (including the Verifier service) is introduced at the same conceptual layer. Both implementations are placed behind a custom routing module that decides where to route traffic based on the request attributes and the verification data for this request type. Let’s look at an example.

    When a buyer’s device sends an initial request for a given storefront page (for example, a product page from shop XYZ), the request is sent to Shopify’s infrastructure, at which point an nginx instance handles it. The routing module considers the request attributes to determine if other shop XYZ product page requests have previously passed verification.

    First request routed to Legacy implementation

    Since this is the first request of this kind in our example, the routing module sends the request to the legacy implementation to get a baseline reference that it will use for subsequent shop XYZ product page requests.

    Routing module sends original request and legacy implementation’s response to the new implementation

    Once the response comes back from the legacy implementation, the Lua routing module sends that response to the buyer. In the background, the Lua routing module also sends both the original request and the legacy implementation’s response to the new implementation. The new implementation computes a response to the original request and feeds both its response and the forwarded legacy implementation’s response to the verifier service. This is done asynchronously to make sure we’re not adding latency to responses we send to buyers, who don’t notice anything different.

    At this point, the verifier service received the responses from both the legacy and new implementations and is ready to compare them. Of course, the legacy implementation is assumed to be correct as it’s been running in production for years now (it acts as our reference point). We keep track of differences between the two implementations’ responses so we can debug and fix them later. The verifier service looks at both responses’ status code, headers, and body, ensuring they’re equivalent. This lets us identify any differences in the responses so we make sure our new implementation behaves like the legacy one.

    Time-related and randomness-related exceptions make it impossible to have exactly byte-equal responses, so we ignore certain patterns in the verifier service to relax the equivalence criteria. The verifier service uses a fixed time value during the comparison process and sets any random values to a known value so we reliably compare the outputs containing time-based and randomness-based differences.

    The verifier service sends the comparison result back to the Lua module

    The verifier service sends the outcome of the comparison back to the Lua module, which keeps track of that comparison outcome for subsequent requests of the same kind.

    Dynamically Routing Requests To the New Implementation

    Once we had verified our new approach, we tested rendering a page using the new implementation instead of the legacy one. We iterated upon our verification mechanism to allow us to route traffic to the new implementation after a given number of successful verifications. Here’s how it works.

    Just like when we only verified traffic, a request arrives from a client device and hits Shopify’s architecture. The request is sent to both implementations, and both outputs are forwarded to the verifier service for comparison. The comparison result is sent back to the Lua routing module, which keeps track of it for future requests.

    When a subsequent storefront request arrives from a buyer and reaches the Lua routing module, it decides where to send it based on the previous verification results for requests similar to the current one (based on the request attributes).

    For subsequent storefront requests, the Lua routing module decides where to send it

    If the request was verified multiple times in the past, and nearly all outcomes from the verifier service were “Pass”, then we consider the request safe to be served by the new implementation.

    If nearly all verifier service results are “Pass”, then it uses the new implementation

    If, on the other hand, some verifications failed for this kind of request, we’ll play it safe and send the request to the legacy implementation.

    If most verifier service results are “Fail”, then it uses the old implementation

    Successfully Rendering In Production

    With the verifier mechanism and the dynamic router in place, our first goal was to render one of the simplest storefront pages that exists on the Shopify platform: the password page that protects a storefront before the merchant makes it available to the public.

    Once we reached full parity for a single shop’s password page, we tested our implementation in production (for the first time) by routing traffic for this password page to the new implementation for a couple of minutes to test it out.

    Success! The new implementation worked in production. It was time to start implementing everything else.

    Increasing Feature Parity

    After our success with the password page, we tackled the most frequently accessed storefront pages on the platform (product pages, collection pages, etc). Diff by diff, endpoint by endpoint, we slowly increased the parity rate between the legacy and new implementations.

    Having both implementations running at the same time gave us a safety net to work with so that if we introduced a regression, requests would easily be routed to the legacy implementation instead. Conversely, whenever we shipped a change to the new implementation that would fix a gap in feature parity, the verifier service starts to report verification successes, and our custom routing module in nginx automatically starts sending traffic to the new implementation after a predetermined time threshold.

    Defining “Good” Performance with Apdex Scores

    We collected Apdex (Application Performance Index) scores on server-side processing time for both the new and legacy implementations to compare them.

    To calculate Apdex scores, we defined a parameter for a satisfactory threshold response time (this is the Apdex’s “T” parameter). Our threshold response time to define a frustrating experience would then be “above 4T” (defined by Apdex).

    We defined our “T” parameter as 200ms, which lines up with Google’s PageSpeed Insights recommendation for server response times. We consider server processing time below 200ms as satisfying and a server processing time of 800ms or more as frustrating. Anything in between is tolerated.

    From there, calculating the Apdex score for a given implementation consists of setting a time frame, and counting three values:

    • N, the total number of responses in the defined time frame
    • S, the number of satisfying responses (faster than 200ms) in the time frame
    • T, the number of tolerated responses (between 200ms and 800ms) in the time frame

    Then, we calculate the Apdex score: (S + T/2) / N.
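    Using the standard Apdex formula, the score is the satisfied count plus half the tolerated count, divided by the total number of responses; a quick sketch:

    ```python
    def apdex(satisfied, tolerated, total):
        """Apdex = (satisfied + tolerated / 2) / total, yielding a score in [0, 1]."""
        return (satisfied + tolerated / 2) / total

    # e.g. out of 1,000 responses, 700 under 200ms and 200 between 200ms and 800ms
    # (the numbers here are made up for illustration):
    score = apdex(satisfied=700, tolerated=200, total=1000)
    ```

    A score of 1.0 means every response was satisfying; lower scores mean more tolerated or frustrating responses.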

    By calculating Apdex scores for both the legacy and new implementations using the same T parameter, we had common ground to compare their performance.

    Methods to Improve Server-side Storefront Performance

    We want all Shopify storefronts to be fast, and this new implementation aims to speed up what a performance-conscious theme developer can’t by optimizing data access patterns, reducing memory allocations, and implementing efficient caching layers.

    Optimizing Data Access Patterns

    The new implementation uses optimized, handcrafted SQL multi-select statements maximizing the amount of data transferred in a single round trip. We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries.

    Reducing Memory Allocations

    We reduce the number of memory allocations as much as possible so Ruby spends less time in garbage collection. We use methods that apply modifications in place (such as #map!) rather than those that allocate more memory space (like #map). This kind of performance-oriented Ruby paradigm sometimes leads to code that’s not as simple as idiomatic Ruby, but paired with proper testing and verification, this tradeoff provides big performance gains. It may not seem like much, but those memory allocations add up quickly, and considering the amount of storefront traffic Shopify handles, every optimization counts.

    Implementing Efficient Caching Layers

    We implemented various layers of caching throughout the application to reduce expensive calls. Frequent database queries are partitioned and cached to optimize for subsequent reads in a key-value store, and in the case of extremely frequent queries, those are cached directly in application memory to reduce I/O latency. Finally, the results of full page renders are cached too, so we can simply serve a full HTTP response directly from cache if possible.

    Measuring Performance Improvement Successes

    Once we could measure the performance of both implementations and reach a high enough level of verified feature parity, we started migrating merchant shops. Here are some of the improvements we’re seeing with our new implementation:

    • Across all shops, average server response times for requests served by the new implementation are 4x to 6x faster than the legacy implementation. This is huge!
    • When migrating a storefront to the new implementation, we see that the Apdex score for server-side processing time improves by +0.11 on average.
    • When only considering cache misses (requests that can’t be served directly from the cache and need to be computed from scratch), the new implementation increases the Apdex score for server-side processing time by a full +0.20 on average compared to the previous implementation.
    • We heard back from merchants mentioning a 500ms improvement in time-to-first-byte metrics when the new implementation was rolled out to their storefront.

    So another success! We improved store performance in production.

    Now how do we make sure this translates to our third success criteria?

    Improving Resilience and Capacity

    While we worked on the new implementation, the verifier service identified potential parity gaps, which helped tremendously. However, a few times we shipped code to production that broke in exceedingly rare edge cases the verifier couldn’t catch.

    As a safety mechanism, we made it so that whenever the new implementation would fail to successfully render a given request, we’d fall back to the legacy implementation. The response would be slower, but at least it was working properly. We used circuit breakers in our custom nginx routing module so that we’d open the circuit and start sending traffic to the legacy implementation if the new implementation was having trouble responding successfully. Read more on tuning circuit breakers in this blog post by my teammate Damian Polan.

    Increase Capacity in High-load Scenarios

    To ensure that the new implementation responds well to flash sales, we implemented and tweaked two mechanisms. The first is an automatic scaling mechanism that adds or removes computing capacity in response to the amount of load on the current swarm of computers that serve traffic. If load increases as a result of an increase in traffic, the autoscaler detects this increase and starts provisioning more compute capacity to handle it.

    Additionally, we introduced an in-memory cache to reduce load on external data stores for storefronts that put a lot of pressure on the platform’s resources. This provides a buffer that reduces load on very high-traffic shops.

    Failing Fast

    When an external data store isn’t available, we don’t want to serve buyers an error page. If possible, we’ll try to gracefully fall back to a safe way to serve the request. It may not be as fast, or as complete as a normal, healthy response, but it’s definitely better than serving a sad error page.

    We implemented circuit breakers on external datastores using Semian, a Shopify-developed Ruby gem that controls access to slow or unresponsive external services, avoiding cascading failures and making the new implementation more resilient to failure.

    Similarly, if a cache store isn’t available, we’ll quickly consider the timeout as a cache miss, so instead of failing the entire request because the cache store wasn’t available, we’ll simply fetch the data from the canonical data store instead. It may take longer, but at least there’s a successful response to serve back to the buyer.
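    The pattern looks something like this (a Python sketch with hypothetical helper functions; the actual implementation is in Ruby):

    ```python
    def fetch(key, cache_get, db_get):
        """Read through a cache, treating cache errors or timeouts as plain misses."""
        try:
            value = cache_get(key)  # may raise if the cache store is unhealthy
            if value is not None:
                return value
        except (TimeoutError, ConnectionError):
            pass  # behave exactly like a cache miss instead of failing the request
        return db_get(key)  # slower, but still a successful response for the buyer
    ```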

    Testing Failure Scenarios and the Limits of the New Implementation

    Finally, as a way to identify potential resilience issues, the new implementation uses Toxiproxy to generate test cases where various resources are made available or not, on demand, to generate problematic scenarios.

    As we put these resilience and capacity mechanisms in place, we regularly ran load tests using internal tooling to see how the new implementation behaves in the face of a large amount of traffic. As time went on, we increased the new implementation’s resilience and capacity significantly, removing errors and exceptions almost completely even in high-load scenarios. With BFCM 2020 coming soon (which we consider as an organic, large-scale load test), we’re excited to see how the new implementation behaves.

    Where We’re at Currently

    We’re currently in the process of rolling out the new implementation to all online storefronts on the platform. This process happens automatically, without the need for any intervention from Shopify merchants. While we do this, we’re adding more features to the new implementation to bring it to full parity with the legacy implementation. The new implementation is currently at 90%+ feature parity with the legacy one, and we’re increasing that figure every day with the goal of reaching 100% parity to retire the legacy implementation.

    As we roll out the new implementation to storefronts we are continuing to see and measure performance improvements as well. On average, server response times for the new implementation are 4x faster than the legacy implementation. Rhone Apparel, a Shopify Plus merchant, started using the new implementation in April 2020 and saw dramatic improvements in server-side performance over the previous month.

    We learned a lot during the process of rewriting this critical piece of software. The strong foundations of this new implementation make it possible to deploy it around the world, closer to buyers everywhere, to reduce network latency involved in cross-continental networking, and we continue to explore ways to make it even faster while providing the best developer experience possible to set us up for the future.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Building Reliable Mobile Applications

    Merchants worldwide rely on Shopify's Point Of Sale (POS) app to operate their brick and mortar stores. Unlike many mobile apps, the POS app is mission-critical. Any downtime leads to long lineups, unhappy customers, and lost sales. The POS app must be exceptionally reliable, and any outages resolved quickly.

    Reliability engineering is a well-solved problem on the server-side. Back-end teams are able to push changes to production several times a day. So, when there's an outage, they can deploy fixes right away.

    This isn't possible in the case of mobile apps as app developers don’t own distribution. Any update to an app has to be submitted to Apple or Google for review. It's available to users for download only when they approve it. A review can take anywhere between a few hours to several days. Additionally, merchants may not install the update for weeks or even months.

    It's important to reduce the likelihood of bugs as much as possible and resolve issues in production as quickly as possible. In the following sections, we will detail the work we’ve done in both these areas over the last few years.


    Automation Testing

    We rely heavily on automated testing at Shopify. Every feature in the POS app has unit, integration, functional, and UI snapshot tests. Developers on the team write these as they add new functionality to the code-base. Changes aren’t merged unless they include automated tests that cover them. These tests run for each push to the repo in our Continuous Integration environment. You can learn more about our testing strategy here.

    Besides automation testing, we also perform manual testing at various stages of development. Features like pairing a Bluetooth card reader or printing a receipt are difficult to test using automation. While we use mocks and stubs to test parts of such features, we manually test the full functionality.

    Sometimes tests that could be automated inadvertently end up in the manual test suite, causing us to spend time testing something manually when computers could do it for us. To avoid this, we audit the manual test suite every few months to weed out such test cases.

    Code Reviews

    Changes made to the code-base aren’t merged until reviewed by other engineers on the team. These reviews allow us to spot and fix issues early in the life-cycle. This process works only if the reviewers are knowledgeable about that particular part of the code-base. As the team grew, finding the right people to do reviews became difficult.

    To overcome this, we have divided the code-base into components. Each team owns the component(s) that make up the feature that they are responsible for. Anyone can make changes to a component, but the team that owns it must review them before merging. We have set up Code Owners so that the right team gets added as reviewers automatically.

    Reviewers must test changes manually, or in Shopify speak, "tophat", before they approve them. This can be a very time-consuming process. They need to save their work, pull the changes, build them locally, and then deploy to a device or simulator. We have automated this process, so any Pull Request can be top-hatted by executing a single command:

    `dev android tophat <pull-request-url>`


    `dev ios tophat <pull-request-url>`


    You can learn more about mobile tophatting at Shopify here.

    Release Management

    Historically, updates to POS were shipped whenever the team was “ready.” When the team decided it was time to ship, a release candidate was created, and we spent a few hours testing it manually before pushing it to the app stores.

    These ad-hoc releases made sense when only a handful of engineers were working on the app. As the team grew, our release process started to break down. We decided to adopt the release train model and started shipping monthly.

    This method worked for a few months, but the team grew so fast that it stopped working. During this time, we went from a single engineering team to a large team of teams, each responsible for a particular area of the product. We were shipping large changes every month, so testing release candidates took several days.

    In 2018, we decided to switch to weekly releases. At first, this seemed counter-intuitive as we were doing the work to ship updates more often. In practice, it provided several benefits:

    • The number of changes that we had to test manually reduced significantly.
    • Teams weren’t as stressed about missing a release train as the next train left in a few days.
    • Non-critical bug fixes could be shipped in a few days instead of a month.

    We then made it easier for the team to ship updates every week by introducing Release Captain and ShipIt Mobile.

    Release Captain

    Initially, the engineering lead(s) were responsible for shipping updates, which included:
    • making sure all the changes are merged before the cut-off
    • incrementing the build and version numbers
    • updating the release notes
    • making sure the translations are complete
    • creating release candidates for manual testing
    • triaging bugs found during testing and getting them fixed
    • submitting the builds to app stores
    • updating the app store listings
    • monitoring the rollout for any major bugs or crashes

    As you can see, this is quite involved and can take a lot of time if done by the same person every week. Luckily, we had quite a large team, so we decided to make this a rotating responsibility.

    Each week, the engineer responsible for the release is called the Release Captain. They work on shipping the release so that the rest of the team can focus on testing, fixing bugs, or working on future releases.

    Each engineer on the team is the Release Captain for two weeks before the next engineer in the schedule takes over. We leverage PagerDuty to coordinate this, and it makes it very easy for everyone to know when they will be Release Captain next. It also simplifies planning around vacations, team offsites, etc.

    To simplify things even further, we configured our friendly chatbot, spy, to automatically announce when a new Release Captain shift begins.

    ShipIt Mobile

    We’ve automated most of the manual work involved in doing releases using ShipIt Mobile. With just a few clicks, the Release Captain can generate a new release candidate.

    Once ready, the rest of the team is automatically notified in Slack to start testing.

    After fixing all the bugs found, the update is submitted to the app store with just a single click. You can learn more about ShipIt Mobile here. These improvements not only make weekly releases easier, but they also make it significantly faster to ship hotfixes in case of a critical issue in production.

    Staged Rollouts

    Despite our best efforts, bugs sometimes slip into production. To reduce the surface area of a disruption, we make the updates available only to a small fraction of our user base at first. We then monitor the release to make sure there are no crashes or regressions. If everything goes well, we gradually increase the percentage of users the update is available to over the next few days. This is done using Phased Releases and Staged Rollouts in iOS AppStore and Google Play, respectively.

    The only exception to this approach is when a fix for a critical issue needs to go out immediately. In such cases, we make the update available to 100% of the users right away. We can also block users from using the app until they update to the latest version.

    We do this by having the POS app query the server for the minimum supported version that we set. If the current version is older than that, the app blocks the UI and provides update instructions. This is quite disruptive and can be annoying to merchants who are trying to make a sale. So we do it very rarely and only for critical security issues.
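
The version gate described above reduces to a comparison between the installed version and the server-provided minimum. The sketch below is illustrative, not the actual POS implementation; the function names and dotted-integer version format are assumptions:

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string like '6.3.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def should_block_app(current: str, minimum_supported: str) -> bool:
    """Block the UI when the installed app is older than the
    server-provided minimum supported version."""
    return parse_version(current) < parse_version(minimum_supported)
```

Tuple comparison handles multi-digit components correctly (6.10.0 is newer than 6.3.0), which naive string comparison would get wrong.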

    Beta Flags

    Staged rollouts are useful for limiting how many users get the latest changes. But, they don’t provide a way to explicitly pick which users. When building new features, we often handpick a few merchants to take part in early-access. During this phase, they get to try the new features and give us feedback that we can work on before a final release.

    To do that, we put features, and even big refactors, behind server-side beta flags. Only merchants for whose stores we’ve explicitly set a beta flag will see the new feature in the app. This makes it easy to run closed betas with selected merchants. We can also do staged rollouts for beta flags, which gives us another layer of flexibility.
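
Conceptually, a server-side check like this combines an explicit allow-list with a deterministic staged-rollout bucket. This is a hypothetical sketch (not Shopify’s actual beta flag system); all names are invented for illustration:

```python
import hashlib

def flag_enabled(flag: str, shop_id: int,
                 beta_shops: set, rollout_percent: int) -> bool:
    """A flag is on for explicitly enrolled shops, or for the stable
    hash bucket covered by the staged-rollout percentage."""
    if shop_id in beta_shops:
        return True
    digest = hashlib.sha256(f"{flag}:{shop_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # deterministic bucket 0..99
    return bucket < rollout_percent
```

Hashing the flag name together with the shop id keeps each shop’s bucket stable per flag, so a shop doesn’t flicker in and out of a rollout between requests.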

    Automated Monitoring and Alerts

    When something goes wrong in production, we want to be the first to know about it. The POS app and backend is instrumented with comprehensive metrics, reported in real-time. Using these metrics, we have dashboards set up to track the health of the product in production.

    Using these dashboards, we can check the health of any feature in a geography with just a few clicks. For example, the % of successful chip transactions made using a VISA credit card with the Tap, Chip & Swipe reader in the UK, or the % of successful tap transactions made using an Interac debit card with the Tap & Chip reader in Canada for a particular merchant.

    While this is handy, we didn’t want to have to keep checking these dashboards for anomalies all the time. Instead, we wanted to get notified when something goes wrong. This is important because while most of our engineering team is in North America, Shopify POS is used worldwide.

    This is harder to do than it may seem because the volume of commerce varies throughout the year. Time of day, day of the week, holidays, seasons, and even the ongoing pandemic affect how much merchants are able to sell. Setting manual thresholds to detect issues can cause a lot of false positives and alert fatigue. To overcome this, we leverage Datadog’s Anomaly Detection. Once the selected algorithm has enough data to establish a baseline, alerts only fire if there’s an anomaly for that particular time of the year.
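
The idea behind seasonal anomaly detection can be illustrated with a toy baseline check (Datadog’s production algorithms are far more sophisticated): compare the current value against past observations for the same time slot, such as the same hour of the same weekday:

```python
from statistics import mean, stdev

def is_anomalous(history: list, current: float, threshold: float = 3.0) -> bool:
    """Flag `current` when it deviates from the seasonal baseline by more
    than `threshold` standard deviations. `history` holds past observations
    for the same time slot (e.g. same hour of the same weekday)."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold
```

Because the baseline is built per time slot, a Saturday-evening traffic spike isn’t compared against a Tuesday-morning lull, which is what makes fixed global thresholds so noisy.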

    We direct these alerts to Slack so that the right folks can investigate and fix them.

    Handling Outages

    Air Traffic Control

    In the early days of POS, bugs and outages were reported in the team Slack channel, and whoever on the team had the bandwidth investigated them. This worked well when we had a handful of developers, but the approach didn’t scale as the team grew. Issues kept going to the few folks who had the most context, and teams kept getting distracted from regular project work, causing delays.

    To fix this, we set up a rotating on-call schedule called Retail ATC (Air Traffic Control). Every week, there is a group of developers on the team dedicated to monitoring how things are working in production and handling outages. These developers are responsible only for this and are not expected to contribute to regular project work. When there are no outages, ATCs spend time tackling tech debt and helping our Technical Merchant Support team.

    Every developer on the team is on-call for two weeks at a time. The first week they are Primary ATC, and the next week they are Secondary ATC. Primary ATC is paged when something goes wrong, and they are responsible for triaging and investigating it. If they need help or are unavailable (commute time, connectivity issues, etc.), the Secondary ATC is paged. ATCs aren’t expected to fix all issues that arise by themselves, though they often can. They are instead responsible for working with the team that has the most context.


    Since we offer the POS app on both Android and iOS, we have ATC schedules for the developers who work on each of those apps. Some areas, like payments, need a lot of domain knowledge to investigate issues, so we have dedicated ATCs for developers who work in those areas.

    Having folks dedicated to handling issues in production frees up the rest of the team to focus on regular project work. This approach has greatly reduced the amount of context switching teams had to do. It has also reduced the stress that comes with the responsibility of working on a mission-critical mobile application.

    Over the last couple of years, ATC has also become a great way for us to help new team members onboard faster. Investigating bugs and outages exposes them to various tools and parts of our codebase in a short amount of time. This allows them to become more self-sufficient quickly. However, being on-call can be stressful. So, we only add them to the schedule after they have been on the team for a few months and have undergone training. We also pair them with more experienced folks when they go on call.

    Incident Management

    When an outage occurs, it must be resolved as quickly as possible. To do this, we have a set of best practices the team can follow, so we spend more time investigating the issue and less time figuring out how to proceed.

    An incident is started by the ATC in response to an automated alert. ATCs use our ChatOps tools to start the incident in a dedicated Slack channel.


    Incidents are always started in the same channel, and all communication happens in it. This is to ensure that there is a single source of information for all stakeholders.

    As the investigation goes on, findings are documented by adding the 📝 emoji to messages. Our chatbot, spy, automatically adds them to a service disruption document and confirms it by adding an emoji to the same message.

    Once we identify the cause of the outage and verify that it has been resolved, the incident is stopped.



    The ATC then schedules a Root Cause Analysis (RCA) for the incident on the next working day. We have a no-blame culture, and the meeting is focused on determining what went wrong and how we can prevent it from happening in the future.

    At the end of the RCA, action items are identified and assigned owners. Keeping track of outages over time allows us to find areas that need more engineering investment to improve reliability.

    Thanks to these efforts, we've been able to take an app built for small stores and scale it for some of our largest merchants. Today, we support a large number of businesses selling products worth billions of dollars each year. Along the way, we also scaled up our engineering team, and we can now ship faster while improving reliability.

    We are far from done, though, as each year we are onboarding bigger and bigger merchants onto our platform. If these kinds of challenges sound interesting to you, come work with us! Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Using DNS Traffic Management to Add Resiliency to Shopify’s Services

    If you’re unfamiliar with DNS, traffic management, or why we would use it, read Part 1: Introduction to DNS traffic management.

    Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management (TM) is a well-used approach to do so. In this second part of the two-part series, we’re sharing Shopify’s DNS traffic management journey from the numerous manually set-up, maintained, and updated traffic management approaches to the fully automated self-served system used by 40+ domains owned amongst 12+ different teams, handling 100M+ requests per day.

    Shopify’s Previous Approaches to DNS Traffic Management

    DNS traffic management isn’t entirely new at Shopify. Before we automated it in 2019, a number of different teams had their own way of doing traffic management through DNS changes, each with its own set of features and techniques for updating records.

    Traffic management through DNS changes

    Streaming Platform Team

    The team handling our Kafka pipelines used Kubernetes ConfigMaps to define target clusters, so making changes required a slow, multi-step manual process.

    On top of the process duration, which isn’t ideal for failover time, this manual approach doesn’t open the door to any active/active configuration (where we share the traffic between two active clusters), since that would require using two target clusters at once, while this only allows using the one defined in the ConfigMaps. At the time, being able to share traffic wasn’t considered necessary for Kafka.

    Search Platform Team

    The team handling Elasticsearch set up the beginnings of DNS update automation. They used our chatops bot, spy, to run commands requesting failovers. The bot creates a PR in our record-store repository (which uses the record_store open source project) with the requested DNS change, which then needs to be approved, merged, and deployed after passing the tests. Except for the automation, it’s a similar approach to the manual one, so it has the same limitations: failover delays and no active/active capabilities.

    Edgescale Team

    Part of the Edgescale team's responsibilities is to handle our assets (images and static files) and Content Delivery Networks (CDN), which bring the assets as close to the client as possible thanks to a network of distributed servers that store those files. To succeed in this mission, the team wanted more control and active/active capabilities, so they used DNS providers that allow weighted traffic management: they set weights on their records to define which share of traffic goes to which endpoint, letting them share the traffic between their two CDN providers. To set this up, they used the DNS providers’ APIs with weights that could span from 0 (disabled) to 15, using DNS A records (hostname to IP address). To make their lives easier, they wrote a spy cdn command responsible for making the API calls to the two DNS providers. This reduced the failover time limitations and provided active/active capabilities. However, we couldn’t produce a stable, easy-to-reproduce, and version-controlled configuration of the providers. Adding endpoints to the traffic management had to be done manually, and was thus prone to errors. Moreover, the providers we use for our CDN needs don’t perform the same way in every region of the world and incur different costs in those regions. Using different traffic shares depending on the geographical location of the requesters is one of the traffic-management use cases we presented in Part 1: Introduction to DNS traffic management, but that wasn’t available here.

    Other Teams

    Finally, a few other teams decided to manually create records in one of the DNS providers used by the team handling our assets. They had fast failover and active/active capabilities but didn’t have a stable configuration, nor redundancy from using a single DNS provider.

    We had too many setups going in all directions, which created maintainability problems. Also, any impactful change needed a lot of coordination and moving pieces. All of these setups had similar use cases and needs, so we started working on how to improve and consolidate our approach to DNS traffic management to

    • reduce the limitations encountered by teams
    • connect other teams with similar needs
    • improve maintainability
    • create ownership

    Consolidation and Ownership of DNS Traffic Management

    The first step of creating a better service used by many teams at Shopify was to define the requirements and goals. We wanted a reliable and redundant system that would provide

    • regionalized traffic
    • fine-grained traffic sharing
    • failover capabilities
    • easy setup and updating by the teams owning traffic-managed domains

    The final state of our setup provides a unified way of setting up and handling our services. We use a git repository to store the domain's configuration that’s then deployed to two DNS providers. The configuration can be tweaked in a fast and easy manner for both providers through a set of spy commands, allowing for efficient failovers. Let's talk about those choices to build our system, and how we built it.

    Establish Reliability and Redundancy

    Each domain name has a set of nameservers; when using a DNS client, one of those nameservers is selected and queried first, and another is used if a timeout occurs. Shopify used a single DNS provider until 2016, when a large DNS outage happened while our DNS provider was under a distributed denial of service (DDoS) attack, effectively dropping a large number of legitimate requests. We learned from our mistakes and increased our reliability and redundancy by using more than one DNS provider.

    When CDN traffic management was set up, it used two different DNS providers, following in the steps of our static DNS records in the record_store. The decision for our new system was easy to make: since using two providers prevents being dependent on a single one, we followed the same approach and adopted two providers for our new standard.

    Define The Traffic Management Layers

    Two of our DNS providers allowed for regionalized and weighted traffic management, as well as multiple failover layers. It was just a matter of defining how we wanted things to work and building the equivalent approach for both providers.

    We defined our approach in layers of traffic management, where each layer makes a decision that reduces the set of options the next layer can choose from.

    Layer 1 Geographical Fencing

    The geographical fencing layer selects the endpoints that best match the region the request originates from. Globally matching endpoints are mandatory: we always have an answer to a DNS request for an existing traffic-managed domain, even if no region specifically matches the requester, because a global region is selected when nothing more specific matches. For instance, we set a rule to have endpoint A answered for Canada and endpoint B for Quebec. When we get a request originating from Montreal, we return B; if the request originates from Ottawa, we return A.
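
The fencing decision can be sketched as follows, assuming the resolver can map the client to an ordered list of candidate regions, most specific first (the data structures here are illustrative, not our providers’ APIs):

```python
def geo_fence(endpoints: dict, client_regions: list) -> list:
    """Layer 1: pick endpoints for the most specific region matching the
    client, falling back to the mandatory 'global' entry. `client_regions`
    is ordered most-specific first (e.g. ['quebec', 'canada', 'global'])."""
    for region in client_regions:
        if region in endpoints:
            return endpoints[region]
    return endpoints["global"]  # global endpoints are mandatory
```

Using the example from the text: with endpoint A configured for Canada and B for Quebec, a Montreal request resolves to B and an Ottawa request to A.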

    Layer 2 Endpoint Status

    We provided a way to enable and disable the use of endpoints depending on their status, which is set manually or automatically. Automatically setting endpoint status depends on a process called health checking (or monitoring), where we try to reach the endpoint regularly to verify whether it does (healthy) or doesn’t (unhealthy) answer. We added a layer of traffic management based on endpoint status, aimed at selecting only the endpoints currently considered healthy. However, we don’t want the requester to receive an empty answer for a domain that does exist, as it would trigger the negative TTL, most of the time higher than our traffic-managed domain TTL. If any of the endpoints is healthy, then only the healthy endpoints are returned by Layer 2. If none of the endpoints are healthy, all the endpoints are returned. The logic behind this is simple: returning something that doesn’t work is better than not returning anything, as it allows the client to start using the service again as soon as endpoints are back online.
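
Layer 2’s decision reduces to a filter with a fallback. This illustrative sketch assumes each endpoint carries a health flag maintained by the monitoring process:

```python
def filter_by_status(endpoints: list) -> list:
    """Layer 2: return only healthy endpoints; if none are healthy,
    return them all, since answering something that might not work beats
    an empty answer that triggers the negative TTL."""
    healthy = [e for e in endpoints if e["healthy"]]
    return healthy if healthy else endpoints
```
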

    Layer 3 Endpoint Priority

    Another aspect we want control over is the failover approach for our endpoints. We allow users to define priority levels for the traffic shares of their domains. For instance, they could define, as the highest priority, that three endpoints A, B, and C receive 100%, 0%, and 0% of the traffic respectively, but that when A is unhealthy, B receives 100% of the traffic instead of A’s share being split between B and C. This can’t be done without a layer selecting endpoints based on their priority: otherwise we’d either send a share of the normal traffic to B, or have no automated control over how B and C share the traffic when A is failing. Endpoints of higher-priority layers with a weight set to 0 (not receiving traffic) are also considered down for those layers. This means that when the endpoints receiving traffic are found unhealthy through health checking, the higher-priority endpoints get discarded at Layer 2, allowing Layer 3 to keep only the highest-priority endpoints left in the returned list.
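
One way to sketch the priority decision (illustrative only; the convention that a lower number means higher priority is an assumption, and real provider configuration works differently): walk the priority levels in order and keep the first level that still carries traffic.

```python
def select_by_priority(endpoints: list) -> list:
    """Layer 3: among the endpoints that survived Layer 2, keep only those
    at the highest-priority level (lowest number, by this sketch's
    convention) that has at least one non-zero weight."""
    for priority in sorted({e["priority"] for e in endpoints}):
        level = [e for e in endpoints if e["priority"] == priority]
        if any(e["weight"] > 0 for e in level):
            return level
    return endpoints  # every level is 0-weighted; defer to Layer 4
```

So with A at priority 1 and B at priority 2, A alone is kept while healthy; once Layer 2 has removed an unhealthy A, B’s level becomes the highest one left.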

    Layer 4 Weighted Selection

    This final layer deals with the weights defined for the endpoints. Each endpoint E reaching this layer has a probability P(E) of being selected as the answer, where P(E) = weight(E) / (sum of the weights of all endpoints reaching Layer 4). Any 0-weighted endpoints are automatically discarded, unless there are only 0-weighted endpoints, in which case all endpoints have an equal chance of being selected and returned to the requester.
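
The selection probabilities follow directly from the formula above; this sketch also handles the all-zero-weights edge case the same way:

```python
def selection_probabilities(endpoints: list) -> dict:
    """Layer 4: P(E) = weight(E) / sum(weights). Zero-weighted endpoints
    are discarded, unless every endpoint is zero-weighted, in which case
    each has an equal chance of being selected."""
    total = sum(e["weight"] for e in endpoints)
    if total == 0:
        return {e["name"]: 1 / len(endpoints) for e in endpoints}
    return {e["name"]: e["weight"] / total
            for e in endpoints if e["weight"] > 0}
```

For example, weights of 95 and 5 yield selection probabilities of 0.95 and 0.05; a resolver answering one record at a time would sample from this distribution.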

    Deploying and Maintaining The Traffic-Managed Domains

    We try to build tooling in a self-service way. It creates a new standard requiring us to make tooling easily accessible for other teams to deploy their traffic-managed domains. Since we use Terraform with Atlantis for a number of our deployments, we built a Terraform module that receives only the required parameters for an application owner and hides most of the work happening behind the scenes to configure our providers.
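
As a rough illustration of the shape of such a module call (the module name, source path, fields, and values below are all invented for this sketch, not our real configuration):

```hcl
module "traffic_managed_domain" {
  source = "../../modules/dns-traffic-manager"

  endpoints = {
    "central-4" = { address = "10.0.4.10", weight = 95, priority = 1 }
    "central-5" = { address = "10.0.5.10", weight = 5,  priority = 1 }
  }

  health_check_path = "/ping"
  page_on_failover  = true
}
```

The point is that the application owner declares only their endpoints and a few options, while the module hides the provider-specific configuration happening behind the scenes.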

    The above code represents the gist of what an application owner needs to provide to deploy their own traffic-managed domain.

    We work to keep our deployments organized, so we derive the zone and subdomain parameters from the path of the domain being terraformed. For example, the path to this file under terraform/ allows deriving the zone, and that the subdomain is test.

    When an application owner wants to make changes to their traffic-managed domain, they just need to update the file and apply the terraform change. There are a number of extra features that control

    • automated monitoring and failover for their domains 
    • monitoring configuration for domains
    • paging (or not) when a failover is automatically triggered

    When we make changes to how traffic-managed domains are deployed, or add new features, we update the module and move domains to the new module version one by one. Everything stays transparent for the application owners and easy to maintain for us, the Edgescale team.

    Everyday Traffic Steering Operations

    We allow the users of our new standard to make changes that are fast and easily applied to their traffic-managed domains. We built a new command in our chatops bot, spy endpoints, to perform operations on the traffic-managed domains.

    Those commands operate on relative and absolute domains. Relative domains automatically receive our default traffic-managed zone as a suffix. It’s also possible to specify in which regions the change should apply by using square brackets; for example, cdn[us-*,na] concerns the cdn traffic-managed domain, but only in the na region and in regions starting with us-.
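
The target syntax can be illustrated with a small parser. This is a hypothetical sketch: the default zone name and the exact semantics of spy’s real parser are assumptions.

```python
import fnmatch
import re

DEFAULT_ZONE = "tm.example.com"  # hypothetical default traffic-managed zone

def parse_target(target: str, known_regions: list) -> tuple:
    """Parse a spy-endpoints-style target like 'cdn[us-*,na]' into a fully
    qualified domain and the list of matching regions. Relative domains get
    the default zone as a suffix; patterns expand against known regions."""
    match = re.fullmatch(r"([^\[\]]+)(?:\[([^\]]+)\])?", target)
    domain, region_spec = match.group(1), match.group(2)
    if "." not in domain:
        domain = f"{domain}.{DEFAULT_ZONE}"  # relative domain
    if region_spec is None:
        return domain, known_regions  # no brackets: apply to all regions
    regions = []
    for pattern in region_spec.split(","):
        regions.extend(r for r in known_regions if fnmatch.fnmatch(r, pattern))
    return domain, regions
```
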

    The spy endpoints get command gets the current traffic shares between endpoints for a given domain. If all providers hold the same data (which should be the case most of the time), the command shows the results without any mention of the DNS providers. When the results differ (the providers went out of sync), the data is shown per provider to make sure the current traffic shares are known.

    The spy endpoints set command changes the traffic shares using specific weight values we provide. It updates every provider and then runs spy endpoints get to show the new traffic shares. Instead of specifying the weights for each endpoint, it’s possible to use a number of defined profiles that set predefined traffic shares. For example, applying the profile mostly[central-4] to our test domain with its two endpoints defines weights of 95 for central-4 and 5 for central-5.

    Our Success Stories

    Moving to ElasticSearch 7.0.0 - We talked about the process that was used by the ElasticSearch team and the fact that it was limited to failovers and didn’t allow traffic shares. When we moved our internal ElasticSearch clusters to ElasticSearch 7.0.0, the team was able to use the weighted load balancing provided by our tools to move the traffic chunk by chunk and ensure everything was working properly. It allowed them to keep the regular traffic going and mitigate any issue they might have encountered along the way, making the transition to ElasticSearch 7.0.0 seamless to the different systems using it.

    Recovering from Kafka overload during a flash sale - During a large flash sale, the Kafka brokers in one of our clusters started overloading from the traffic they had to handle. Once the problem was identified, it took a few minutes for the Kafka team to realize that they now had traffic share capabilities from our DNS traffic manager. They used it to divert half the traffic from the overloaded region and send it to another available region. Less than five minutes after making that change, the Kafka queues started recovering.

    Relieving on-call stress - Being on-call is stressful, especially when you’re running errands and don’t want to be stuck at home waiting for your phone to ring. Even with the great on-call culture at Shopify, and people always happy to cover parts of your shift, being able to use the DNS traffic manager to steer an application’s traffic to another cluster when something happens helps in so many cases. One aspect the different teams appreciated is that the work can easily be done from a phone (thanks spy!). Another is that Shopifolk can stay serene while solving incidents, which are mitigated thanks to traffic management and don’t impact our merchants and their clients. In summary, easy-to-use tooling and practical features together improve the experience of both our merchants and coworkers.

    Since the creation of our new DNS traffic management standard in the middle of 2019, we’ve onboarded more than 40 different domains across more than 12 different teams.

    Why Ownership Is Important. Demonstrated By Example

    A few months ago, while we were moving teams to the initial version of our DNS traffic manager, we got an email from one of our DNS providers letting us know that they would discontinue their service because it would be merged with the services of the company that bought them a few years prior. Of course, we weren’t so lucky: having their systems merged together required action on our part. We needed to manually migrate our zones to the new provider.

    As a result, we launched a project to find our next DNS provider. Since we needed to manually migrate our zones, and consequently update all of our tooling, we might as well evaluate our options. We looked at more than 40 providers, keeping in mind our needs for our static zones and traffic management requirements. We selected a few providers that fit our needs and decided which one to sign a contract with.

    Once we chose the provider, the big migration happened. First, we updated our terraform module to support the new provider and deployed the traffic-managed domains in the three providers. Then we updated our spy endpoints tooling to update all providers when making changes so everything was ready and in sync. Next, we moved the nameservers of our traffic-managed zone one by one from the DNS provider we were leaving to the new DNS provider, making sure that in case of a problem only a controlled share of the traffic would be affected. We explained our migration plans to the different teams owning domains in the traffic manager, letting them know when it would happen, and that it should be transparent to them, but if anything unexpected seems to be happening, they should contact us. We also told the incident manager on-call of the changes happening and the timelines.

    Everything was in place for the change to happen. However, 30 minutes before change time, the provider we were leaving had an incident that prevented us from moving traffic, at the same time as one of the application owners was having an incident of their own that they wanted to mitigate with the traffic manager. This delayed our timeline, but we then continued with the change without any issue, fully transparently to all the application owners.

    Looking back to how things were before we rolled out our new standard for DNS traffic management, we easily can say that moving to a new DNS provider wouldn’t have been that smooth. We would have had to

    • contact every team using their own approach to gather their needs and usage so we could find a good alternative to our current DNS provider (luckily this was done while preparing and building this project)
    • coordinate between those teams for the change to happen, and then chase after them to make sure they updated any tooling used

    The change couldn’t be handled for all of them as a whole, since there wasn’t one product handled by one team, but many products handled by many teams.

    With our DNS traffic management system, we brought ownership to this aspect of our infrastructure: we understand the capabilities and requirements of teams, and we can maintain and evolve the system as our teams’ needs evolve, improving the experience of our merchants and their customers.

    Our DNS traffic management journey took us from many manually set up, maintained, and updated traffic management approaches to a fully automated, self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    An Introduction to DNS Traffic Management

    Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management is a well-used approach to do so. In this first part of a two-part series, we aim to give a broad overview of DNS and how it’s used for traffic management, as well as the different reasons why we want to use DNS traffic management.

    If you already have context on what DNS is, what traffic management is, and why you would need DNS traffic management, you can skip directly to Part 2: Shopify’s DNS Traffic Management, where we share our journey and the improvements we made at Shopify.

    A Summarized History of DNS

    Everything started with humans trying to communicate and a plain text file, even before the advent of the modern internet.

    The Advanced Research Projects Agency Network (ARPANET) was conceived in 1966 to enable access to remote computers. In 1969, the first computers were connected to ARPANET, followed by the implementation of the Network Control Program (NCP) in 1970. Driven by the need to connect more and more computers together, and as work on the Transmission Control Protocol (TCP), started in 1974, evolved, TCP/IP was created in the late 1970s to provide the ability to join separate networks into a network of networks; it replaced NCP in ARPANET on January 1st, 1983.

    At the beginning of ARPANET there were just a handful of computers from four different universities connected together, so it was easy for people to remember their addresses. This became challenging as new computers joined the network. The Stanford Research Institute provided, through file sharing, a manually maintained file containing the hostnames and related addresses of hosts, as provided by member organizations of ARPANET. This file, originally named HOSTS.TXT, is now also broadly known as the /etc/hosts file on Unix and Unix-like systems.

    A growing network with an increasingly large number of computers meant an increasingly large file to download and maintain. By the early 1980s, this process became slow and an automated naming system was required to address the technical and personnel issues of the current approach. The Domain Name System (DNS) was born, a protocol converting human-readable (and rememberable) domain names into Internet Protocol (IP) addresses.

    What is DNS?

    Let’s consider that DNS is a very large library where domains are organized from the least to the most meaningful parts of their names. For instance, if you (the client) want to resolve shops.myshopify.com, you would consider .com the least meaningful part, as it’s shared with many domain names, and shops the most meaningful part, as it specifies the subdomain you’re requesting. Finding shops.myshopify.com in this DNS Library would thus mean going to the .com shelves and finding the myshopify book. Once the book is in hand, we would open it to the shops page and see something that looks like the following:

    DNS Library Book

    The image tells us which IP address shops.myshopify.com corresponds to. Our DNS Library also provides us with the equivalent of a due date, called Time To Live (TTL). It corresponds to the amount of time the association of hostname to IP address is valid. We remember, or cache, that information for the given amount of time. If we need this information after expiration, we have to “find that book” again to verify that the association is still valid.

    The opposite concept also exists: if you’re trying to find a page in the book and can’t find it, chances are you won’t wait there until someone writes it down for you. In DNS, this concept is driven by the negative TTL, which represents the duration for which an NXDOMAIN (non-existing domain) answer can be cached. This means that the author of a new page in this book can’t assume their update is known by everyone until that period of time has elapsed.

    Another relevant element is that the DNS Library doesn't necessarily hold only one book but multiple identical ones, from different editors, enabling others to be consulted if one copy is unavailable.

    In DNS terms, the editors are DNS providers. The shelves contain multiple sections for each domain nameserver, the servers that provide DNS resolution as a source of truth for a domain. The books are zones in the domain nameservers, and the book pages are DNS records, the direct relation between the queried record and the value it should resolve to. The DNS Library is what we call root servers, a set of 13 nameservers (named from a to m) that hold the keys to the root of the hostnames. The root servers are responsible for helping to locate the shelves, the nameservers of the Top Level Domains (TLD), the domains at the highest level in the hierarchy for DNS.

    What is Traffic Management?

    Traffic management is a key branch of logistics that aims to plan and control everything required to provide for the safe, orderly, and efficient movement of persons and goods. Traffic management helps to manage situations such as congestion or roadblocks, by redirecting traffic or sharing traffic between multiple routes. For instance, some navigation applications use data they get from their users (current location, current speed, etc.) to know where congestion is happening and improve the situation by suggesting alternative routes instead of sending them to the already overloaded roads.

    A more generic description is that traffic management uses data to decide where to direct the traffic. We could have different paths depending on the country of origin (think country border waiting lines for the booths, where the checks are different depending on the passport you hold), different paths depending on vehicle size (bike lanes, directions for trucks vs. cars, etc.) or any other information we find relevant.

    DNS + Traffic Management = DNS Traffic Management

    Bringing the concept of traffic management to DNS means serving data-driven answers to DNS queries resulting in different answers depending on the location of the requester or for each request. For instance, we could have two clusters of servers and want to split the traffic between the two: we can decide to answer 50% of the requests with the first cluster and the other 50% with the second. The clients obtaining the answers would connect to the cluster they got directed to, without any other action on their part.

    However, from the previous section, DNS queries are cached to avoid overloading servers with queries. Each time a query is cached by a resolver, it won’t be repeated by that resolver for the duration of its TTL. Using a low TTL will make sure that the information is kept around but not for too long. For example, returning a TTL of 15 seconds means that after 15 seconds the client needs to resolve the record again, and can get a different answer than before.

    A low TTL needs careful consideration, as the time it takes to obtain the DNS record’s content from the DNS servers, called DNS resolution time, sometimes can dominate the time it takes to retrieve a resource like a webpage. The connection performance and accuracy of the result are thus often at odds. For instance, if I want my changes to appear to users in at most 15 seconds (hence setting a 15 seconds TTL), but the DNS resolution time takes 1 second means that every 15 seconds the users will take 1 more second to reach the service they are connecting to. Over a day, this added resolution time adds up to 5760 seconds, or 1 hour and 36 minutes. If we slightly sacrifice the accuracy by moving the TTL to 60 seconds, the resolution time becomes 1440 seconds over a day, or only 24 minutes, improving the overall performance.
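The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model, assuming a client that re-resolves exactly once per TTL expiry, around the clock:

```python
# Back-of-the-envelope model of the TTL vs. resolution-time trade-off:
# total seconds per day spent on DNS resolution, assuming one
# re-resolution per TTL expiry.

def daily_resolution_overhead(ttl_s: float, resolution_s: float) -> float:
    """Seconds per day spent resolving, for a given TTL and resolution time."""
    resolutions_per_day = 24 * 60 * 60 / ttl_s
    return resolutions_per_day * resolution_s

# A 15s TTL with a 1s resolution time costs 5760s (1h36m) per day;
# raising the TTL to 60s cuts that to 1440s (24 minutes).
print(daily_resolution_overhead(15, 1))  # 5760.0
print(daily_resolution_overhead(60, 1))  # 1440.0
```

The model ignores resolver caching shared across clients, so it overstates the cost for popular records, but it illustrates why the TTL choice is a performance/accuracy trade-off.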

    The use of caching and TTLs implies that DNS traffic management isn’t instant. There’s a short delay in refreshing the record, which should be at most the TTL we configured. In practice, it can be slightly more, as some DNS resolvers, unbeknownst to the client, might cache results for longer than the TTL as they see fit. Such TTL overrides shouldn’t happen often, but they’re something to be aware of when choosing DNS for traffic management.

    Examining Four DNS Traffic Management Use Cases

    DNS traffic management is interesting for systems that don’t necessarily have load-balancing capabilities at the network level, whether through an IP-level load balancer or a front-facing proxy, i.e. systems reached directly once a client is connected to the service. There are many reasons to use DNS traffic management in front of services, and multiple reasons why we use it at Shopify.

    Easy Failover

    One of Shopify’s use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reason. With DNS traffic management configured to target two clusters but use only one by default, we simply redirect traffic to the second cluster whenever the first one crashes, then redirect it back when the first recovers. This is commonly called active-passive. If you’re able to identify the unavailability of your main cluster in a timely fashion, this approach makes the failover almost seamless (considering the TTL) to the clients using the service, as they use the still-working cluster while the issue is solved, either automatically or through the intervention of the responsible on-call team (as a last resort). The pressure on those on-call teams is relieved, as they know that clients can still use the service while they solve the issue, sometimes even pushing the work to the next working day.

    Share Traffic Between Endpoints

    Services inevitably grow and end up receiving requests from many clients. Those requests then need to be shared between available endpoints offering the exact same service. This is called active-active. Another motivation behind this approach is financial: when using external vendors (companies contracted to provide a service to your users) that have minimum commitments, sharing your traffic load between those vendors ensures you reach those commitments. You define the percentage of traffic sent to each endpoint, which corresponds to the percentage of DNS requests answered with that endpoint.
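As an illustration (not Shopify's implementation), a weighted active-active answer can be modelled as a random choice proportional to each endpoint's weight; the endpoint names below are made up:

```python
import random

# Sketch of a weighted active-active DNS answer: each endpoint is
# returned for a share of queries proportional to its weight.

def pick_endpoint(weights: dict, rng: random.Random) -> str:
    endpoints, shares = zip(*weights.items())
    return rng.choices(endpoints, weights=shares, k=1)[0]

# 50/50 split between two hypothetical clusters.
weights = {"cluster-a.example.com": 50, "cluster-b.example.com": 50}
rng = random.Random(0)  # seeded for reproducibility
answers = [pick_endpoint(weights, rng) for _ in range(10_000)]

# Roughly half the answers direct clients to each cluster.
print(answers.count("cluster-a.example.com"))
```

Changing the weights to, say, 80/20 would shift the expected share of answers (and hence of client connections) accordingly.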

    Deploy a Change Progressively

    When developing production services, you sometimes need to make a potentially disruptive change, such as deploying a new feature, changing the behavior of an existing one, or updating a system to a new version. In such cases, deploying your change and crossing your fingers while hoping for success is risky at best. DNS traffic management can help by letting you move a small percentage of your traffic to a cluster that’s already updated, then move more and more chunks of traffic until all of it has been moved to the cluster with the new feature. This approach is called blue-green deployment. You can then update the other cluster, which leaves you ready for the next update or failover.
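A rollout plan like this can be sketched as a series of DNS weight steps; the percentages and cluster labels below are purely illustrative:

```python
# Sketch of a blue-green rollout expressed as DNS weight steps:
# traffic moves to the updated ("green") cluster in growing chunks.

ROLLOUT_STEPS = [1, 5, 25, 50, 100]  # percent of traffic on green

def weights_at(step: int) -> dict:
    """DNS weights for blue and green at a given rollout step."""
    green = ROLLOUT_STEPS[step]
    return {"blue": 100 - green, "green": green}

for step in range(len(ROLLOUT_STEPS)):
    print(weights_at(step))
# At the final step, blue serves 0% of traffic and can be updated in
# turn, becoming the standby cluster for the next deploy or failover.
```

In practice you would pause at each step long enough to observe the updated cluster (and at least one TTL) before moving on.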

    Regionalize Traffic Decisions

    You might find cases where some endpoints are more performant in some regions than others, which might happen when using external vendors. If performance is important for users, as it is for Shopify’s merchants and their customers, then you want to make sure the most performant endpoints are used for users in each region by allowing DNS answers based on the client’s location. Most of the time, geolocation can be fine-grained to the country, state or province level, or applied to a broader region of the world. Routing rules are defined to indicate what should be answered depending on the origin of requests. Once done, a client connecting from a location will get the answer that fits them. 
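A minimal sketch of such routing rules, with hypothetical regions and vendor endpoints, plus a default answer for unmatched locations:

```python
# Sketch of geo-based DNS routing rules: answers keyed by the
# requester's region, with a default for everyone else. Region names
# and endpoints are illustrative.

ROUTING_RULES = {
    "north-america": "na.vendor-a.example.com",
    "europe": "eu.vendor-b.example.com",
}
DEFAULT_ANSWER = "global.vendor-a.example.com"

def answer_for(region: str) -> str:
    """Return the endpoint a client in this region should be given."""
    return ROUTING_RULES.get(region, DEFAULT_ANSWER)

print(answer_for("europe"))         # eu.vendor-b.example.com
print(answer_for("south-america"))  # falls back to the default answer
```

Real providers let these rules be fine-grained to country or state level, but the lookup-with-fallback shape is the same.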

    Our DNS traffic management journey took us from many manually set up, maintained, and updated traffic management approaches to a fully automated, self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Migrating Large TypeScript Codebases To Project References

    In 2017, we began migrating the merchant admin UI of Shopify from a traditional Ruby on Rails Embedded Ruby (ERB) based front end to an entirely new codebase: TypeScript paired with React and GraphQL. Using TypeScript enabled our ever-growing Admin teams to leverage TypeScript’s compiler to catch potential bugs and errors well before they ship. The VSCode editor also provides useful TypeScript-specific language features, such as inline feedback from TypeScript’s type-checker.

    Many teams work independently but in parallel on the Shopify merchant admin UI. These teams need to stay highly aligned and loosely coupled to ensure they quickly ship reliable code. We chose TypeScript to empower our developers with great tools that help them ship with confidence. Pairing TypeScript with VSCode provides developers with actionable real-time feedback right in their editor without the need to run separate commands or push to CI. With these tools, teams leverage the benefits of TypeScript’s static type-checker right in the editor as they pull in the latest changes.

    The Problems with Large TypeScript Codebases

    Over the years, as teams grew and many new features shipped, the codebase dramatically increased in size, and teams had a poorer experience in the editor. Our codebase was taxing the editor tooling available to us. TS Server (VSCode’s built-in language server for TypeScript) would seemingly halt, leaving developers with all the weight of TypeScript’s syntax and none of the editor tooling payoff. The editor took ~2-3 minutes to load all of TypeScript’s great tooling, slowing how fast we ship great features for our merchants and creating a very frustrating development experience.

    We need to ship commerce features fast, so we investigated solutions to bring the developer experience in the editor up to speed.

    Understanding What VSCode Didn’t Like About Our Codebase

    First, we started by understanding why and where our codebase was taxing VSCode and TypeScript’s type-checker. Thankfully, the TypeScript team has an in-depth wiki page on Performance, and the VSCode team has an insightful wiki page on Performance Issues.

    Before implementing any solutions, we needed to understand what VSCode didn’t like about our codebase. To my pleasant surprise, around this time Taryn Musgrave, a developer from our Orders & Fulfillment team, had recently attended JSConfHi 2020. At the conference, she met a few folks from TC39 (the team that standardizes the JavaScript language under the ECMAScript specification), one of them being Daniel Rosenwasser, the Program Manager for TypeScript at Microsoft. After the conference, we connected and shared our codebase diagnostics with the core TypeScript team at Microsoft.

    The TypeScript team asked us for more insight into our codebase. We ran some initial diagnostics against the single TypeScript configuration (tsconfig.json) file for the project. The TypeScript compiler provides an --extendedDiagnostics flag that we can pass to get useful diagnostic info, such as file counts, memory usage, and check time.

    Their team was surprised by the size of our codebase and mentioned that it was in the top 1% of codebase sizes they’d seen. After sharing some more context into our tools, libraries, and build setup they suggested we break up our codebase into smaller projects and leverage a new TypeScript feature released in 3.0 called project references.

    Measuring Improvements

    Before diving into migrating our entire admin codebase, we needed to take a step back to decide how we could measure and track our improvements. To confidently know we’re making the right improvements, we need tools that enable us to measure changes.

    At first, we used VSCode’s TS Server logs to measure and verify our changes. For our Admin codebase, the TS Server would spit out ~80,000 lines of logs over the course of ~2m30s on every boot of VSCode.

    We quickly realized that this approach wasn’t scalable for our teams: expecting them to parse through 80,000 lines of logs to verify an improvement wasn’t feasible. For this reason, we set out to build a VSCode plugin to help teams at Shopify measure and track their editor initialization times over time. Internally, we called this plugin TypeTrack.

    TypeTrack in action

    TypeTrack is a lightweight plugin to measure VSCode's TypeScript language feature initialization time. This plugin is intended to be used by medium-large TypeScript codebases to track editor performance improvements as projects migrate their code to TypeScript project references.

    Migrating to Project References

    In large projects, migrating to project references in one go isn’t feasible. The migration would have to be done gradually over time. We found it best to start out by identifying leaf nodes in our project’s dependency graph. These leaf nodes generally include the most widely shared parts of our codebase and have little to no coupling with the rest of our codebase.

    Our Admin codebase’s structure loosely resembles this:
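(The original diagram isn't reproduced here; the sketch below is hypothetical, built from the folders named throughout this post — packages, tests, and config/typescript — with the rest illustrative.)

```
admin/
├── app/                # Admin section code (illustrative name)
├── packages/           # shared, isolated packages (e.g. @shopify/address)
├── tests/              # shared test utilities
└── config/
    └── typescript/     # shared TypeScript configuration
```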

    In our case, a good place to start was by migrating the packages and tests folders. Both of these folders are meant to be treated as “isolated projects” already. Migrating them to leverage project references not only improves their initialization performance in VSCode, but also encourages a healthier codebase. Developers are encouraged to break up their codebase into smaller pieces that are more manageable and reusable.

    Starting At The Leaves

    Once we’ve identified our leaf nodes, we start by creating an entrypoint for the project references.

    Whenever a project is referenced, that folder is also migrated to a project reference, meaning it requires a project-level tsconfig specifying the dependencies that the project, in turn, references. In the case of the above-mentioned project-references.tsconfig.json, the ../../packages path reference looks for a packages/tsconfig.json.
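As a rough illustration (the exact contents of our configs aren't shown in this post), the entrypoint and a leaf project's tsconfig might look like:

```json
// config/typescript/project-references.tsconfig.json (illustrative)
{
  "files": [],
  "references": [
    { "path": "../../packages" },
    { "path": "../../tests" }
  ]
}
```

```json
// packages/tsconfig.json (illustrative)
{
  "extends": "../config/typescript/tsconfig.base.json",
  "compilerOptions": {
    "composite": true
  },
  "include": ["."],
  "references": []
}
```

Setting "composite": true is what allows a project to be referenced by others; tsc --build then builds the referenced projects in dependency order.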

    Once you’ve got a few folders with project references created, run the TypeScript compiler in build mode (tsc --build) to check that your project builds.

    > Protip: Use the --diagnostics flag to get insight into what the compiler is doing.

    At this point it’s very likely that you’ll get tons of errors, the most common being that the compiler can’t find a module that your project is referencing.

    To illustrate this, see the screenshot below. The compiler isn’t able to find the module @shopify/address-consts from within the @shopify/address package.

    Compiler can’t find a module the project is referencing

    The solution here would be:

    1. Create a project-level tsconfig.json for the module being depended on. You can think of this as a child node in your project dependency graph. For our example error from above, we need to create a tsconfig.json in packages/address-consts.
    2. Reference the new project reference (the tsconfig.json created from 1) in the consuming parent tsconfig.json that requires the new module. In our example, we need to include the new project reference.
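Concretely, the two steps above might look like the following (paths follow the example; the compiler options shown are illustrative):

```json
// Step 1: packages/address-consts/tsconfig.json (illustrative)
{
  "extends": "../../config/typescript/tsconfig.base.json",
  "compilerOptions": { "composite": true },
  "include": ["."],
  "references": []
}
```

```json
// Step 2: packages/address/tsconfig.json — reference the new project
{
  "extends": "../../config/typescript/tsconfig.base.json",
  "compilerOptions": { "composite": true },
  "include": ["."],
  "references": [{ "path": "../address-consts" }]
}
```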

    Keep in mind that you'll have to respect the following restrictions for project references:

    • The include/files compiler option must include all the input files that the project reference relies on.
    • Any project that’s referenced must itself have a references array (which may be empty).
    • Any local namespaces you import must be listed in the compilerOptions.paths field of config/typescript/tsconfig.base.json.

    General Steps to Migrating Your Codebase to Project References

    Once the TypeScript compiler successfully builds the leaf nodes that you’ve migrated to project references, you can start working your way up your project dependency graph. Outlined below are general steps you’ll take to continue migrating your whole codebase:

    1. Identify the folder you plan to migrate. Add this to your project’s config/typescript/project-references.tsconfig.json
    2. Create a tsconfig.json for the folder you’ve identified.
    3. Run the TypeScript compiler
    4. Fix the errors. Determine what other dependencies outside of the identified folder need to be migrated. Migrate and reference those dependencies.
    5. Go to step 3 until the compiler succeeds.

    For large projects, you may see hundreds or even thousands of compiler errors in step 4. If this is the case, pipe out those errors to a log file and look for patterns in the errors.
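For instance, grouping errors by TypeScript error code quickly surfaces the dominant pattern. A minimal sketch (the log sample is made up for illustration; in practice you'd read the file you piped tsc's output into):

```python
from collections import Counter
import re

# Hypothetical sample of piped-out tsc errors.
log = """\
packages/address/index.ts(3,1): error TS2307: Cannot find module '@shopify/address-consts'.
packages/address/format.ts(9,5): error TS2307: Cannot find module '@shopify/address-consts'.
tests/app.tsx(1,1): error TS6059: File is not under 'rootDir'.
"""

# Count occurrences of each TypeScript error code to find the
# dominant pattern worth fixing first.
codes = Counter(re.findall(r"error (TS\d+)", log))
print(codes.most_common())  # [('TS2307', 2), ('TS6059', 1)]
```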

    Here are some common solutions we’ve found when we’ve come across this roadblock:


    • The folder is just too large. At the root of the folder, create a tsconfig.json that includes references to child folders, then migrate the child folders with the general steps mentioned above.
    • This folder contains spaghetti code. It’s possible this folder could be reaching into other dependencies/folders that it shouldn’t. In our case we’ve found it best to move shared code into our packages folder consisting of shared isolated packages.

    We Saw Drastic Improvements in VSCode

    Once we migrated our packages and tests folders in our Admin codebase, we made great improvements in the amount of time it takes VSCode to initialize the editor’s TypeScript language features in those folders.

    File                                                           Before  After
    packages/@web-utilities/env/index.ts                           ~155s   ~13s
    packages/@shopify/address-autocomplete/AddressAutocomplete.ts  ~155s   ~8s
    tests/utilities/environment/app.tsx                            ~155s   ~10s

    How Does This Make Sense? Why Is This 10x Faster?

    When a TypeScript file is opened up in VSCode, the editor sends off a request to the TypeScript Server for analysis on that file. The TS Server then looks for the closest tsconfig.json relative to that file for it to understand the types associated with that project. In our case, that’s the project-level tsconfig.json which included our whole codebase.

    With the addition of project references (i.e., a tsconfig.json with references to its dependent projects) in the packages and tests folders, we’re explicitly separating the codebase into smaller blocks of code. Now, when VSCode loads a file that uses project references, the TS Server only loads the files that project depends on.

    Scaling Migration Duties to Other Teams

    In our case, given the size of our Admin codebase, my team couldn’t migrate all of it ourselves. For this reason we wrote documentation and provided tools (our internal VSCode plugin) to other Admin section teams to improve their section’s editor performance.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    ShipIt! Presents: AR/VR at Shopify

    On June 20, 2020, ShipIt!, our monthly event series, presented AR/VR at Shopify. Daniel Beauchamp, Head of AR/VR at Shopify talked about how we’re using this increasingly ubiquitous technology at Shopify.

    The missing Shopify AR stroller video from the AR/VR at Shopify presentation can be viewed at

    Additional Information

    Daniel Beauchamp



    Shopify AR



    How We’re Solving Data Discovery Challenges at Shopify

    Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes (one trillion gigabytes) in 2018. The estimate for 2025 is 175 ZBs, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines.

    The nature of data usage is problem driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to help decision making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata, saving time for data professionals, increasing data reuse, and providing data consumers with more accessibility to an organization’s data assets.

    To make sense of all of these data assets at Shopify, we built a data discovery and management tool named Artifact. It aims to increase productivity, provide greater accessibility to data, and allow for a higher level of data governance.

    The Discovery Problem at Shopify

    Data discovery and management is applicable at every point of the data process:

    • Where is the data coming from?
    • What is the quality of this data?
    • Who owns the data source? 
    • What transformations are being applied?
    • How can this data be accessed?
    • How often does this process run?
    • What business logic is being applied?
    • Is the model stale or current?
    • Are there other similar models out there?
    • Who are the main stakeholders?
    • How is this data being applied?
    • What is the provenance of these applications?

    High level data process

    The data discovery issues at Shopify can be categorized into three main challenges: curation, governance, and accessibility.


    “Is there an existing data asset I can utilize to solve my problem?”

    Before Artifact, finding the answer to this question at Shopify often involved asking team members in person, reaching out on Slack, digging through GitHub code, sifting through various job logs, etc. This game of information tag resulted in multiple sources of truth, lack of full context, duplication of effort, and a lot of frustration. When we talked to our Data team, 80% felt the pre-Artifact discovery process hindered their ability to deliver results. This sentiment dropped to 41% after Artifact was released.

    The current discovery process hinders my ability to deliver results survey answers


    “Who is going to be impacted by the changes I am making to this data asset?”

    Data governance is a broad subject that encompasses many concepts, but our challenges at Shopify relate to a lack of granular ownership information and to change management. The two are related, but generally refer to the process of managing data assets through their life cycle. The Data team at Shopify spent a considerable amount of time understanding the downstream impact of their changes, with only 16% of the team feeling they understood how their changes impacted other teams:

    I am able to easily understand how my changes impact other teams and downstream consumers survey answers

    Each data team at Shopify practices their own change management process, which makes data asset revisions and changes hard to track and understand across different teams. This leads to loss of context for teams looking to utilize new and unfamiliar data assets in their workflows. Artifact has helped each data team understand who their downstream consumers are, with 46% of teams now feeling they understand the impact their changes have on them.


    “How many merchants did we have in Canada as of January 2020?”

    Our challenge here is surfacing relevant, well-documented data points our stakeholders can use to make decisions. Reporting data assets are a great way to derive insights, but those insights often get lost in Slack channels, private conversations, and archived PowerPoint presentations. The lack of metadata surrounding these report and dashboard insights directly impacts decision making, causes duplication of effort for the Data team, and increases stakeholders’ reliance on a data-as-a-service model, which in turn inhibits our ability to scale our Data team.

    Our Solution: Artifact, Shopify’s Data Discovery Tool

    We spent a considerable amount of time talking to each data team and their stakeholders. On top of the higher level challenges described above, there were two deeper themes that came up in each discussion:

    • The data assets and their associated metadata are the context that informs the data discovery process.
    • There are many starting points to data discovery, and the entire process involves multiple iterations.

    Working off of these themes, we wanted to build a couple of different entry points to data discovery, enable our end users to quickly iterate through their discovery workflows, and provide all available metadata in an easily consumable and accessible manner.

    Artifact is a search and browse tool built on top of a data model that centralizes metadata across various data processes. Artifact allows all teams to discover data assets, their documentation, lineage, usage, ownership, and other metadata that helps users build the necessary data context. This tool helps teams leverage data more effectively in their roles.

    Artifact’s User Experience

    Artifact Landing Page

    Artifact’s landing page offers a choice to either browse data assets from various teams, sources, and types, or perform a plain English search. The initial screen is preloaded with all data assets ordered by usage, providing users who aren’t sure what to search for a chance to build context before iterating with search. We include the usage and ownership information to give the users additional context: highly leveraged data assets garner more attention, while ownership provides an avenue for further discovery.

    Artifact Search Results

    Artifact leverages Elasticsearch to index and store a variety of objects: data asset titles, documentation, schema, descriptions, etc. The search results provide enough information for users to decide whether to explore further, without sacrificing the readability of the page. We accomplished this by providing the users with data asset names, descriptions, ownership, and total usage.

    Artifact Data Asset Details

    Clicking on the data asset leads to the details page that contains a mix of user and system generated metadata organized across horizontal tabs, and a sticky vertical nav bar on the right hand side of the page.

    The lineage information is invaluable to our users as it:

    • Provides context on how a data asset is utilized by other teams.
    • Lets data asset owners know what downstream data assets might be impacted by changes.

    Artifact Data Asset Lineage

    This lineage feature is powered by a graph database, and allows the users to search and filter the dependencies by source, direction (upstream vs. downstream), and lineage distance (direct vs. indirect).
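    The filters described above amount to a bounded graph traversal. As a rough sketch (the asset names and the in-memory adjacency list are hypothetical; Artifact's real lineage lives in a graph database), a breadth-first walk can answer "what is downstream of this asset, and how far away":

```ruby
# Hypothetical lineage edges: each asset maps to its direct
# downstream consumers.
DOWNSTREAM = {
  "orders_raw"      => ["orders_model"],
  "orders_model"    => ["sales_dashboard", "finance_report"],
  "sales_dashboard" => [],
  "finance_report"  => [],
}

# Breadth-first traversal with a lineage-distance cutoff.
# Returns { asset => distance }; distance 1 is a direct dependency,
# anything greater is indirect.
def lineage(graph, start, max_distance: Float::INFINITY)
  seen = {}
  queue = [[start, 0]]
  until queue.empty?
    node, dist = queue.shift
    graph.fetch(node, []).each do |dep|
      next if seen.key?(dep) || dist + 1 > max_distance
      seen[dep] = dist + 1
      queue << [dep, dist + 1]
    end
  end
  seen
end
```

    Upstream lineage is the same walk over a reversed edge map, and the direct vs. indirect filter falls out of the recorded distance.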

    Artifact’s Architecture

    Before starting the build, we decided on these guiding principles:

    1. There are no perfect tools; instead solve the biggest user obstacles with the simplest possible solutions.
    2. The architecture design has to be generic enough to easily allow future integrations and limit technical debt.
    3. Quick iterations lead to smaller failures and clear, focused lessons.

    Artifact High Level Architecture

    With these in mind, we started with a generic data model, and a simple metadata ingestion pipeline that pulls the information from various data stores and processes across Shopify. The metadata extractor also builds the dependency graph for our lineage feature. Once processed, the information is stored in Elasticsearch indexes, and GraphQL APIs expose the data via an Apollo client to the Artifact UI.


    Buy vs. Build

    The recent growth in data, and applications utilizing data, has given rise to data management and cataloguing tooling. We researched a couple of enterprise and open source solutions, but found the following challenges were common across all tools:

    • Every organization’s data stack is different. While some of the upstream processes can be standardized and catalogued appropriately, the business context of downstream processes creates a wide distribution of requirements that are near impossible to satisfy with a one-size-fits-all solution.
    • The tools didn’t capture a holistic view of data discovery and management. You can catalogue some data assets effectively, but support for the processes surrounding those assets was lacking: usage information, communication and sharing, change management, and so on.
    • At Shopify, we have a wide range of data assets, each requiring its own set of metadata, processes, and user interaction. The tooling available in the market doesn’t offer support for this type of variety without heavy customization work.

    With these factors in mind, the buy option would’ve required heavy customization, technical debt, and large efforts for future integrations. So we went with the build option, as it:

    • Was the best fit for our use case.
    • Provided the most flexibility.
    • Left us in full control of how much technical debt we take on.

    Metadata Push vs. Pull

    The architecture diagram above shows the metadata sources our pipeline ingests. During the technical design phase of the build, we reached out to the teams responsible for maintaining the various data tools across the organization. The ideal solution was for each tool to expose a metadata API for us to consume. All of the teams understood the value in what we were building, but writing APIs was new incremental work to their already packed roadmaps. Since pulling the metadata was an acceptable workaround and speed to market was a key factor, we chose to write jobs that pull the metadata from their processes; with the understanding that a future optimization will include metadata APIs for each data service.

    Data Asset Scope

    Our data processes create a multitude of data assets: datasets, views, tables, streams, aliases, reports, models, jobs, notebooks, algorithms, experiments, dashboards, CSVs, etc. During the initial exploration and technical design, we realized we wouldn’t be able to support all of them with our initial release. To cut down the data assets, we evaluated each against the following criteria:

    • Frequency of use: how often are the data assets being used across the various data processes?
    • Impact to end users: what is the value of each data asset to the users and their stakeholders?
    • Ease of integration: what is the effort required to integrate the data asset in Artifact?

    Based on our analysis, we decided to integrate the top queryable data assets first, along with their downstream reports and dashboards. The end users would get the highest level of impact with the least amount of build time. The rest of the data assets were prioritized accordingly, and added to our roadmap.

    What’s Next for Artifact?

    Since its launch in early 2020, Artifact has been extremely well received by data and non-data teams across Shopify. In addition to the positive feedback and the improved sentiment, we are seeing over 30% of the Data team using the tool on a weekly basis, with a monthly retention rate of over 50%. This has exceeded our expectations of 20% of the Data team using the tool weekly, with a 33% monthly retention rate.

    Our short term roadmap is focused on rounding out the high impact data assets that didn’t make the cut in our initial release, and integrating with new data platform tooling. In the mid to long term, we are looking to tackle data asset stewardship, change management, introduce notification services, and provide APIs to serve metadata to other teams. The future vision for Artifact is one where all Shopify teams can get the data context they need to make great decisions.

    Artifact aims to be a well organized toolbox for our teams at Shopify, increasing productivity, reducing the business owners’ dependence on the Data team, and making data more accessible. 

    If you’re passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.


    How We Built

    On July 20, 2020 we held ShipIt! Presents: AR/VR at Shopify. Daniel talked about what we’re doing with AR/VR at Shopify. The video of the event is available. is a free tool built by Shopify’s Augmented Reality (AR) team that lets anyone view the size of a product in the space around them using their smartphone camera.

    The tool came out of our April Hack Days focused on helping retailers impacted by Covid-19. Our idea was to create an easy way for merchants to show how big their products are—something that became increasingly important as retail stores were closed. While Shopify does support 3D models of products, it does take time and money to get 3D models made. We wanted to provide a quick stopgap solution.

    The ideal flow is for merchants to provide a link on their product page that would open up an AR view when someone clicked on it. For this to be as seamless as possible, it had to be done entirely on the web (no app required!), fast, accurate, and work on both iOS and Android.

    Let’s dive into how we pulled it off.

    AR on the Web

    3D on the web has been around for close to a decade. There are great WebGL libraries like Three.js that make it quick to display 3D content. For, all we needed to show was a 3D cube, which is essentially the “Hello World” of computer graphics.

    The problem is that WebAR, the ability to power AR experiences on the web with JavaScript, isn’t supported on iOS. To build fully custom AR experiences on iOS, you need a native app.

    Luckily, there’s a workaround. iOS has a feature called AR Quick Look, a native AR viewer built into the operating system. When you open a link to a 3D model file online, an AR viewer launches right in your browser to let you view the model. Android has something similar called Scene Viewer. The functionality of both viewers is limited to placing and moving a single 3D object in your space, but for that was all we needed.

    Example of a link to view a 3D model with iOS AR Quick Look:

    Example of a link to view a 3D model with Android Scene Viewer:

    You might have noticed that the file extensions in the above viewers are different between Android and iOS. Welcome to the wonderful world of 3D formats!

    A Quick Introduction to 3D File Formats

    Image files store pixels, and 3D files store information about an object’s shape (geometry) and what it’s made of (materials and textures). Let’s take a look at a simple object: a cube. To represent a cube as a 3D model, we need to know the positions of each of its corners. These points are called vertices.

    Representing a cube in 3D format. Left: each vertex of the cube. Right: 3D coordinates of each vertex.

    These can be represented in array like this:
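    The embedded snippet didn’t survive extraction. A plausible version of the vertex array (coordinates assumed: a unit cube centred at the origin, with vertices 0-3 ringing the bottom and 4-7 ringing the top) is:

```ruby
# Eight corner vertices of a cube, each an [x, y, z] coordinate.
# The indices 0..7 are what the face arrays refer to.
VERTICES = [
  [-0.5, -0.5,  0.5], # 0: front bottom left
  [ 0.5, -0.5,  0.5], # 1: front bottom right
  [ 0.5, -0.5, -0.5], # 2: back bottom right
  [-0.5, -0.5, -0.5], # 3: back bottom left
  [-0.5,  0.5,  0.5], # 4: front top left
  [ 0.5,  0.5,  0.5], # 5: front top right
  [ 0.5,  0.5, -0.5], # 6: back top right
  [-0.5,  0.5, -0.5], # 7: back top left
]
```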

    We also need to know how each side of the cube is connected. These are called the faces.

    A face made up of vertices 0, 1, 5, 4

    The face array [0, 1, 5, 4] denotes that there’s a face made up by connecting vertices 0, 1, 5, and 4 together. Our full cube would look something like this:
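    The embedded snippet for the full cube is missing here. Assuming a numbering in which vertices 0-3 ring the bottom of the cube and 4-7 ring the top (consistent with the face [0, 1, 5, 4] above), the six faces could be written as:

```ruby
# Six quad faces, each listing the four vertex indices that bound it.
FACES = [
  [0, 1, 5, 4], # front (the face from the example above)
  [1, 2, 6, 5], # right
  [2, 3, 7, 6], # back
  [3, 0, 4, 7], # left
  [0, 3, 2, 1], # bottom
  [4, 5, 6, 7], # top
]
```

    A full cube is then just this face list plus the eight vertex coordinates it indexes into.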

    There is also a scale property that lets us resize the whole object instead of moving the vertices manually. A scale of (1,2,3) would scale by 1 along the x-axis, 2 along the y-axis, and 3 along the z-axis.

    A scale of (1,2,3) used to resize the whole object

    So what file format can we use to store this information? Well, just like how images have different file formats (.jpeg, .png, .tiff, .bmp, etc), there’s more than one type of 3D format. And because the geometry data can get quite large, these formats are often binary instead of ASCII based.

    Android’s Scene Viewer uses glTF, which is quickly becoming an industry standard for 3D. Its aim is to be the JPEG of 3D. The binary version of a .glTF file is a .glb file. Apple, on the other hand, uses its own format called USDZ, which is based on Pixar’s USD file format.

    For to work on both operating systems, we needed a way to create these files dynamically. When a user entered dimensions we’d have to serve them a 3D model of that exact size.

    Approach 1: Creating 3D Models Dynamically

    There are many libraries out there for creating and manipulating .glTF, and they’re very lightweight. You can use them client side or server side.

    USDZ is a whole other story. It’s a real pain to compile all the tools, and an even bigger pain to get running on a server. Apple distributes precompiled executables, but they only work on OSX. We definitely didn’t want to spend half of Hack Days wrestling with tooling, but assuming we could get it working, the idea was:

    1. Generate a cube as a .gltf file
    2. Use the usdzconvert tool to convert the cube.gltf to a .usdz

    The problem here is that this process could end up taking a non-trivial amount of time. We’ve seen usdzconvert take up to 3-4 seconds to convert, and if you add the time it takes to create the .gltf, users might be waiting 5 seconds or more before the AR view launches.

    There had to be a faster and easier way.

    Approach 2: Pre-generate All possible Models

    What if we generated every possible combination of box beforehand in both .gltf and .usdz formats? Then when someone entered in their dimensions, we would just serve up the relevant file.

    How many models would we need? Let’s say we limited sizes for width, length, and depth to be between 1cm and 1000cm; there likely aren’t many products over 10 meters. We’d then have to pick how granular to go. People likely wouldn’t be able to visually tell the difference between 6.25cm and 6cm, so we could go in increments of 0.25cm. That would require 4000 * 4000 * 4000 * 2, or 128,000,000,000 models. We didn’t really feel like generating that many models, nor did we have the time during Hack Days!

    How about if we went in increments of 1cm? That would need 1000 * 1000 * 1000 * 2, or 2 billion models. That’s a lot. At 3 KB a model, that’s roughly 6TB of models.
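    A quick sanity check of that arithmetic (the 3 KB figure is the rough per-model size quoted above):

```ruby
models      = 1000 * 1000 * 1000 * 2  # 1 cm increments, two file formats
total_bytes = models * 3 * 1000       # ~3 KB per model
# models is 2 billion; total_bytes is about 6 * 10**12, i.e. roughly 6 TB
```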

    This approach wasn’t going to work.

    Approach 3: Modify the Binary Data Directly

    All cubes have the same textures and materials, and the same numbers of vertices and faces. The only difference between them is their scale. It seemed inefficient to regenerate a new file every time just to change this one parameter, and seemed wasteful to pre-generate billions of almost identical files.

    But what if we took the binary data of a .usdz file, found which bytes correspond to the scale property, and swapped them out with new values? That way we could bypass all of the usdz tooling.

    The first challenge was to find the byte location of the scale values. The scale value (1, 1, 1) would be hard to look for because the value “1” likely comes up many times in the cube’s binary data. But if we scaled a cube with values that were unlikely to be elsewhere in the file, we could narrow down the location. We ended up creating a cube with scale = (1.7,1.7,1.7).

    By loading up our file in a hex editor, we were able to search for that value. USDZ stores values as 32-bit floats, so 1.7 is represented as 0x9a99d93f. With a quick search, we found the values corresponding to the scale along the x, y, and z axes at byte offset 1344.

    Identifying 0x9a99d93f within the USDZ binary
    Identifying 0x9a99d93f within the USDZ binary

    To test our assumption, the next step was to try changing these bytes and seeing that the scale would change.
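    A minimal sketch of that experiment (the helper name is hypothetical; it assumes, per the hex-editor finding above, that the three scale floats are 32-bit little-endian values at byte offset 1344):

```ruby
SCALE_OFFSET = 1344 # byte offset of the x, y, z scale floats in our template

# Overwrite the scale values inside a template .usdz byte string.
def patch_scale(template, width, height, depth)
  bytes = template.dup
  # "e" packs a 32-bit little-endian float, matching the USDZ layout.
  bytes[SCALE_OFFSET, 12] = [width, height, depth].pack("e3")
  bytes
end
```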

    It worked! With this script we could generate .usdz files on the fly, and it was really fast. The best part is that this could also run completely client side with a few modifications. We could modify the .usdz in the browser, encode the data in a URL, and pass that URL to our AR Quick Look link:

    Unfortunately, our dreams of running this script entirely on the client were dashed when it came to Android. Turns out you can’t launch Scene Viewer from a local data URL. The file has to be served somewhere, so we’re back to needing to do this on a server.

    But before we went about refactoring this script to run on a little server written in Ruby, we wanted to give our lovely cube a makeover. The default material was this grey colour that looked a bit boring.

    The new cube is semi-transparent and has a white outline that makes it easier to align with the room
    The new cube is semi-transparent and has a white outline that makes it easier to align with the room

    Achieving the transparency effect was as simple as setting the opacity value of the material. The white outline proved to be a bit more challenging because it could easily get distorted with scale.

    If one side of the cube is much longer than the other, the outline starts stretching

    The outline starts stretching on the box

    You can see in the above image how the outlines are no longer consistent in thickness. The ones on the sides are now twice as thick as the others.

    We needed a way to keep the outline thickness consistent regardless of the overall dimensions, and that meant that we couldn’t rely on scale. We’d have to modify the vertex positions individually.

    Vertices are highlighted in red

    Above is the structure of the new cube we landed on with extra vertices for the outline. Since we couldn’t rely on the scale property anymore for resizing our cube, we needed to change the byte values of each vertex in order to reposition them. But it seemed tedious to maintain a list of all the vertex byte offsets, so we ended up taking a “find and replace” approach.

    Left: Outer vertices. Right: Inner outline vertices

    We gave each x, y, z position of a vertex a unique value that we could search for in the byte array and replace. In this case the outer vertices for x, y, z were 51, 52, 53 respectively. We picked these numbers at random just like we picked the number 1.7 before. We wanted something unique.

    Vertex values that affected the outline were 41, 42, and 43. Vertex positions with the same value meant that they moved together. Also, since the cube is symmetrical, we gave opposite vertices the same value except negated.

    As an example, let’s say we wanted to make a cube 2m long, and 1m wide and tall. We’d first search for all the bytes with a float value of 51 (0x00004c42) and replace them with the value 2 * 0.5 = 1 (0x0000803f). We use 1 instead of 2 because the vertices are symmetrical: if the left-most corner has a value of -1 and the right-most has a value of 1, then the distance between them is 2.

    Distance between the vertices is 2

    We’d then move the outline vertices by looking for all the bytes with value 41 (0x00002442) and replacing them with 1.98 * 0.5 = 0.99 (0xa4707d3f) to keep the outline 2cm thick. We’d repeat this process for the width and the height.

    The template cube is transformed to the proper dimensions with consistent outline thickness.

    Here’s what part of our server side Ruby code ended up looking like to do this:
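    The original snippet isn’t preserved in this copy, so here is a reconstruction in the same spirit (method names and the OUTLINE_INSET constant are assumptions): scan the byte string for the sentinel floats (51/52/53 for the outer vertices, 41/42/43 for the outline, negated for their mirrored twins) and overwrite each with the desired half-dimension:

```ruby
OUTLINE_INSET = 0.02 # the outline ring sits 2 cm inside the outer edge

# Replace every occurrence of one 32-bit little-endian float with another.
# (A production version would also check 4-byte alignment of each match.)
def replace_float(bytes, from, to)
  needle, replacement = [from].pack("e"), [to].pack("e")
  offset = 0
  while (i = bytes.index(needle, offset))
    bytes[i, 4] = replacement
    offset = i + 4
  end
  bytes
end

# Turn the sentinel-valued template cube into a box of the given size.
def resize_box(bytes, width:, height:, depth:)
  { 51 => width, 52 => height, 53 => depth }.each do |sentinel, size|
    replace_float(bytes, sentinel.to_f, size * 0.5)    # outer vertices
    replace_float(bytes, -sentinel.to_f, -size * 0.5)  # mirrored twins
  end
  { 41 => width, 42 => height, 43 => depth }.each do |sentinel, size|
    inner = (size - OUTLINE_INSET) * 0.5               # outline vertices
    replace_float(bytes, sentinel.to_f, inner)
    replace_float(bytes, -sentinel.to_f, -inner)
  end
  bytes
end
```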

    Et voilà! We now had a way to generate beautiful looking cubes on the fly. And most importantly, it’s blazingly fast: this new way can create .usdz files in well under 1 millisecond, much better than relying on the Python USD toolset.

    All that remained was to write the .glb version, which we did using the same approach as above.

    What’s Next for

    We’re really happy with how turned out, and with the simplicity of its implementation. Now anytime you are shopping and want to see how big a product is, you can simply go to and visualize the dimensions.

    But why stop at showing 3D cubes? There are lots of standard-sized products that we could help visualize, like rugs, posters, frames, and mattresses. Imagine being able to upload a product photo of a 6x9 rug and instantly load it up in front of you in AR. All we need to figure out is how to dynamically insert textures into .usdz and .glb files.

    Time to boot up the ol’ hex editor again…

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.


    Media at Scale: Callbacks vs pipelines

    The Shopify admin is a Ruby on Rails monolith that hundreds of developers work on, with a continuous deployment cycle used by millions of people.

    Shopify recently launched the ability to add videos and 3D models to products as a native feature within the admin. This feature expands a merchant's ability to showcase their products through media, using videos and 3D models in addition to the already existing images. Images have been around since the beginning of Shopify, and there are currently over 7 billion images on the platform.

    A requirement of this project was to support the legacy image infrastructure while adding the new capabilities for all our merchants. In a large system like Shopify, it is critical to have control over the code and safe database operations. For this reason, at Shopify, transactions are used to make database writes safe.

    One of the challenges we faced during the design of this project was deciding whether to use callbacks or a pipeline approach. The first approach we can take to implement the feature is a conventional Rails approach with callbacks and dependencies. Callbacks, as Rails describes them, allow you to trigger logic during an object's life cycle.

    “Callbacks are methods that get called at certain moments of an object's life cycle. With callbacks, it is possible to write code that will run whenever an Active Record object is created, saved, updated, deleted, validated, or loaded from the database.”

    Media via Callbacks

    Callbacks are quick and easy to set up, so it’s fast to hit the ground running. Let’s look at what this means for our project. For the simplest case of only one media type, i.e., image, we have an image object and a media object. The image object has a polymorphic association to media. Video and 3D models can be other media types. The code looks something like this:
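    The embedded model code is missing from this copy. As a rough plain-Ruby stand-in (class and association names are assumed; the real thing would be ActiveRecord models with has_one/belongs_to associations and an after_create callback):

```ruby
# Media points back at its concrete type (Image, Video, ...) through
# a polymorphic owner reference.
Media = Struct.new(:mediable_type, :mediable_id)

class Image
  attr_reader :id, :media

  def initialize(id)
    @id = id
  end

  # Stand-in for an `after_create` callback: saving an image
  # implicitly builds its media record.
  def save!
    @media = Media.new("Image", id)
    self
  end
end
```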

    With this setup, the creation process of an Image will look something like this:
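    The creation snippet is also missing here. In spirit, it wraps both writes in one database transaction; the sketch below fakes the transaction with a snapshot-and-restore block (hypothetical names, no ActiveRecord):

```ruby
ROWS = [] # stand-in for database rows

# Stand-in for ActiveRecord::Base.transaction: undo all writes
# made inside the block if anything raises.
def transaction
  snapshot = ROWS.dup
  yield
rescue
  ROWS.replace(snapshot)
  raise
end

# Both the image row and its media row must commit together.
def create_image_with_media!(filename)
  transaction do
    ROWS << [:image, filename]
    ROWS << [:media, filename] # a failure here would undo the image row too
  end
end
```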

    We have to add a transaction during creation since we need both image and media records to exist. To add a bit more complexity to this example, we can add another media type, video, and expand beyond create. Video has unique aspects, such as the concept of a thumbnail that doesn’t exist for an image. This adds some conditionals to the models.

    From the two previous examples, it is clear that getting started with callbacks is quick when the application is simple; however, as we add more logic, the code becomes more complex. We see that, even for a simple example like this one, our media object has granular information on the specifics of video and images. As we add more media types, we end up with more conditional statements and business logic spread across multiple models. In our case, using callbacks made it hard to keep track of the object’s lifecycle and the order in which callbacks are triggered during each state. As a result, it became challenging to maintain and debug the code.

    Media via Pipelines

    In an attempt to have better control over the code, we decided not to use callbacks and instead try an approach using pipelines. Pipeline is the concept where the output of one element is the input to the next. In our case, the output of one class is the input to the next. There is a separate class that is responsible for only one operation of a single media type.

    Let’s imagine the whole system as a restaurant kitchen. The entry point class is like the kitchen’s head chef. All the orders come to this person; in our case, the user input comes into this service (the product create media service). Next, the head chef assigns the orders to her sous chef; this is the media create handler. The sous chef looks at the input and decides which of the kitchen staff gets the order. She assigns the order for a key lime pie to the pastry chef and the order for roasted chicken to the roast chef. Similarly, the media create handler assigns the task of creating a video to the video create handler and the task of creating an image to the image create handler. Each of these individuals specializes in their tasks and is not aware of the details of the others. The video create handler is only responsible for creating media of type video; it has no information about images or 3D models.

    All of our individual classes have the same structure but are responsible for different media types. Each class has three methods:
    • before_transaction
    • during_transaction
    • after_transaction

    As the name suggests, these methods have to be called in that specific order. Going back to our analogy, before_transaction is responsible for all the prep work that happens before we create the food. during_transaction is everything involved in creating the dish, from cooking it to plating it; for Rails, this is the method responsible for persisting the data to the database. Finally, after_transaction is the cleanup required after each dish is created.

    Let's look at what the methods do in our case. For example, the video create handler will look like this:
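    The handler code didn’t survive extraction; a simplified sketch of the shape being described (names assumed, persistence reduced to recording steps) might look like:

```ruby
# Sketch of a pipeline step: one class, one media type, three phases.
class VideoCreateHandler
  attr_reader :steps

  def initialize(params)
    @params = params
    @steps = []
  end

  # Prep work that needs no database transaction (e.g. validating
  # input, preparing storage for the uploaded file).
  def before_transaction
    @steps << :validated
  end

  # The only phase that persists records; the caller wraps this in
  # a database transaction alongside the other handlers.
  def during_transaction
    @steps << :video_row_created
    @steps << :media_row_created
  end

  # Cleanup and follow-up work (e.g. enqueue thumbnail generation).
  def after_transaction
    @steps << :thumbnail_job_enqueued
  end
end
```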

    Similarly, if we move a step up, we can look at the media create handler. This handler will also follow a similar pattern with three methods. Each of these methods in turn calls the handler for the respective media type and creates a cascading effect.

    Media Create Handler

    The logic for each media type remains confined to its specific class. Each class is only aware of its own operation, like how the example above is only concerned with creating a video. This creates a separation of concerns. Let's take a look at the product create media service. The service is unaware of the media type, and its only responsibility is to call the media handler.

    Product Create Media Service

    The product create media service also has a public entry point, which is used to call the service.
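    Putting the pieces together, a condensed, self-contained sketch of the dispatching handler and the single public entry point (all names are hypothetical, and the transaction is reduced to a plain block):

```ruby
CREATED = [] # stand-in for database rows

# Minimal per-type handler; each media type gets its own class with
# the same three-phase interface.
class ImageCreateHandler
  def initialize(input)
    @input = input
  end

  def before_transaction; end

  def during_transaction
    CREATED << [:image, @input[:src]]
  end

  def after_transaction; end
end

# Picks the right handler for each input; knows the types, never
# their internals.
class MediaCreateHandler
  HANDLERS = { image: ImageCreateHandler }.freeze

  def initialize(inputs)
    @handlers = inputs.map { |input| HANDLERS.fetch(input[:type]).new(input) }
  end

  def before_transaction
    @handlers.each(&:before_transaction)
  end

  def during_transaction
    @handlers.each(&:during_transaction)
  end

  def after_transaction
    @handlers.each(&:after_transaction)
  end
end

# The single public entry point; callers never touch the media models.
class ProductCreateMediaService
  def self.call(inputs)
    handler = MediaCreateHandler.new(inputs)
    handler.before_transaction
    transaction { handler.during_transaction } # interdependent rows commit together
    handler.after_transaction
  end

  def self.transaction
    yield # stand-in for ActiveRecord::Base.transaction
  end
end
```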

    The caller of the service has a single entry point and is completely unaware of the specifics of how each media type is created. Like in a restaurant, we want to make sure that the food for an entire table is delivered together. Similarly, in our code, we can manage that interdependent objects are created together using a transaction. This approach gives us the following features:

    • Different components of the system can create media and manage their own transactions.
    • The system components no longer have access to the media models, but can interact with them using the service entry point.
    • The media callbacks don't get fired with those of the caller, making the code easier to follow. Callbacks require a lot of framework knowledge from developers new to Rails, and they hide away implementation details.
    • The approach makes the code easier to follow and debug: the cognitive load on the reader is low, everything is a plain Ruby object, and the flow is easy to understand.
    • It also gives us control over the order in which objects are created, updated, and deleted.

    From the code example, we see that implementing with callbacks is quick and easy to set up. Ruby on Rails can speed up the development process by abstracting away implementation details, and callbacks are a great feature when working with a simple use case. However, as the code evolves and grows more complex, a large production application built on callbacks can become hard to maintain. As we saw in the example above, we had conditionals spread across the ActiveRecord models.

    An alternate approach can better serve the long-term maintenance and understandability of the code. In our case, pipelines helped achieve this. We separated the business logic into separate files, enabling us to understand the implementation details better, and we avoided having conditionals spread across the ActiveRecord models. The most significant advantage of the approach is that it creates a clear separation of concerns: different parts of the application do not know the particulars of the media component.

    When designing a pipeline it is important to make sure that there is a single entry point that can be used by the consumer. The pipeline should only perform the actions it is expected to and not have side effects. For example, our pipeline is responsible for creating media and no other action, the client does not expect any other product attribute to be modified. Pipelines are designed to make it easy for the caller to perform certain tasks and so we hide away the implementation details of creating media from the caller. And finally having several steps that perform smaller subtasks can create a clear separation of concern within the pipeline.



    A First Look at Reanimated 2

    Last month, Software Mansion announced the alpha release of Reanimated 2. Version 2 is a complete paradigm shift in the way we build gestures and animations in React Native.

    Last January, at the React Native Community meetup in Kraków, Krzysztof Magiera (co-creator of React Native Android and creator of Gesture Handler and Reanimated) mentioned the idea of a new implementation of Reanimated based on a concept borrowed from an experimental web API, animation worklets: JavaScript functions that run on an isolated context to compute an animation frame. And a few days later Shopify announced its support for Software Mansion’s effort in the open source community, including backing the new implementation of Reanimated.

    When writing gestures and animations in React Native, the key to success is to run the complete user interaction on the UI thread. That way, you neither need to communicate with the JavaScript thread nor rely on that thread being free to compute an animation frame.

    In Reanimated 1, the strategy employed was to use a declarative domain-specific language to declare gestures and animations beforehand. This approach is powerful, but comes with drawbacks:

    • a steep learning curve
    • limitations in the available set of instructions
    • performance issues at initialization time

    To use the Reanimated v1 domain-specific language, you have to adopt a declarative mindset, which is challenging, and simple instructions could end up being quite verbose. Basic mathematical tools such as coordinate conversions, trigonometry, and bezier curves, just to name a few, had to be rewritten in the declarative DSL.

    And while the instruction set offered by v1 is large, there are still some use cases where you are forced to cross the React Native async bridge (see Understanding the React Native bridge concept), for example when doing date or number formatting.

    Finally, the declaration of the animations at initialization time has a performance cost: the more animation nodes are created, the more messages need to be exchanged between the JavaScript and UI thread. On top of that, you need to take care of memoization: making sure to not re-declare animation nodes more than necessary. Memoization in v1 proved to be challenging and a substantial source of potential bugs when writing animations.

    Enter Reanimated 2.

    Animation Worklets

    One of the interesting takeaways from the official announcement is that Software Mansion approached the problem from a different angle. Instead of starting from the main constraint, to not cross the React Native bridge, and offering a way to circumvent it, they asked: How would it look if there were no limitations when writing gestures and animations? They started the work on a solution from there.

    Reanimated 2 is based on a new API named animation worklets: JavaScript functions that run on the UI thread, independently from the JavaScript thread. A function is marked as a worklet via the worklet directive.

    Worklets can receive parameters, access constants from the JavaScript thread, invoke other worklet functions, and invoke functions from the JavaScript thread asynchronously. This new API might remind you of web workers which are also isolated JS functions that talk to the main thread via asynchronous messages. They may also remind you of OpenGL shaders which are also pieces of code to be compiled and executed in an independent manner.

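    The original snippet isn't reproduced in this excerpt; a minimal sketch illustrating these properties might look like the following (the function names and constant are invented, and `runOnJS` from react-native-reanimated is shown only in a comment since it needs the Reanimated runtime):

    ```javascript
    const SOME_CONSTANT = 100; // a constant captured from the JS thread

    function computeOffset(x) {
      'worklet'; // worklets can invoke other worklets synchronously
      return x * 2;
    }

    function myWorklet(value) {
      'worklet'; // declared via the 'worklet' directive; runs on the UI thread
      const offset = computeOffset(value);  // synchronous worklet-to-worklet call
      const total = offset + SOME_CONSTANT; // captured JS-thread constant
      // runOnJS(logResult)(total);         // async call back into the JS thread
      return total;
    }

    console.log(myWorklet(5)); // 110
    ```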
    To summarize the key properties, worklets:

    1. run on the UI thread.
    2. are declared via the ‘worklet’ directive.
    3. can be invoked from the JS thread and receive parameters.
    4. can invoke other worklet functions synchronously.
    5. can access constants from the JS thread.
    6. can asynchronously call functions from the JS thread.

    Reanimated 2 liquid-swipe example (animated demo in the original post)

    The team at Software Mansion offers a couple of great examples to showcase the new Reanimated API. An interesting one is the liquid-swipe example. It features advanced animation techniques such as Bézier curve interpolation, gesture handling, and physics-based animations, showing us that Reanimated 2 is ready for any kind of advanced gestures and animations.

    The API

    When writing gestures and animations, you need to do three things:

    1. create animation values
    2. handle gesture states
    3. assign animation values to component properties.

    The new Reanimated API offers five new hooks to perform these three tasks.

    Create Values

    There are two hooks available to create animation values. useSharedValue() creates a shared value, which is like an Animated.Value except that it exists on both the JavaScript and UI threads; hence the name.

    The useDerivedValue() hook creates a shared value based on some worklet execution. For instance, a theta value can be computed inside a worklet from other shared values.
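    The article's snippet isn't reproduced here; as an illustration, here is a framework-free sketch of the idea. The two one-line "hooks" below are miniature stand-ins invented for this example; the real hooks come from react-native-reanimated:

    ```javascript
    // Miniature stand-ins for Reanimated's hooks, for illustration only.
    const useSharedValue = (initial) => ({ value: initial });
    const useDerivedValue = (worklet) => ({ get value() { return worklet(); } });

    const x = useSharedValue(0);
    const y = useSharedValue(1);

    // theta is recomputed from the other shared values inside a worklet
    const theta = useDerivedValue(() => {
      'worklet';
      return Math.atan2(y.value, x.value);
    });

    console.log(theta.value); // Math.PI / 2, since x = 0 and y = 1
    ```

    Whenever x or y changes, the derived value reflects the new angle without any round trip to the JavaScript thread.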

    Handle Gesture States

    The useAnimatedGestureHandler() hook connects worklets to any gesture handler from react-native-gesture-handler. Each gesture state maps to a callback with two parameters: event, which contains the values of the gesture handler, and context, which you can conveniently use to keep state across gesture events.

    Assign Values to Properties

    useAnimatedStyle() returns a style object that can be assigned to an animated component.

    Finally, useAnimatedProps() is similar to useAnimatedStyle(), but for animated properties; animated props are now set via the animatedProps property.

    A Reanimated 2 Animation Example

    Let’s build our first example with Reanimated 2: an object that we want to drag around the screen. As in v1, we create two values to translate the card on the x and y axes, and we wrap the card component with a PanGesture handler. So far so good.

    The animation values are created using useSharedValue and assigned to an animated view via useAnimatedStyle. Now let’s create a gesture handler via useAnimatedGestureHandler. In our gesture handler, we want to do three things:

    1. When the gesture starts, we store the translate values into the context object. This lets us keep track of the accumulated translations across different gestures.
    2. While the gesture is active, we assign to translate the gesture translation plus its offset values.
    3. When the gesture is released, we add a decay animation based on its velocity to give the effect that the view moves like a real physical object.
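    The three steps can be modeled as plain functions. This is a framework-free sketch: in the real code these callbacks run as worklets inside useAnimatedGestureHandler, and the decay step would use Reanimated's withDecay; the context keys are invented for illustration:

    ```javascript
    const translateX = { value: 0 };
    const translateY = { value: 0 };

    const handler = {
      // 1. store the current translation in the context when the gesture starts
      onStart: (event, ctx) => {
        ctx.offsetX = translateX.value;
        ctx.offsetY = translateY.value;
      },
      // 2. while active, translation = gesture translation + stored offset
      onActive: (event, ctx) => {
        translateX.value = event.translationX + ctx.offsetX;
        translateY.value = event.translationY + ctx.offsetY;
      },
      // 3. on release, a decay animation based on velocity would take over,
      // e.g. withDecay({ velocity: event.velocityX }) in the real implementation
      onEnd: (event) => {},
    };

    const ctx = {};
    handler.onStart({}, ctx);
    handler.onActive({ translationX: 30, translationY: 10 }, ctx);
    console.log(translateX.value, translateY.value); // 30 10
    handler.onStart({}, ctx); // a second gesture starts from the new position
    handler.onActive({ translationX: 5, translationY: 5 }, ctx);
    console.log(translateX.value, translateY.value); // 35 15
    ```

    The context object is what makes the second drag continue from where the first one ended instead of snapping back to the origin.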

    The final example source can be found on GitHub. You can see the gesture in action below.

    Reanimated 2 PanGesture example (animated demo in the original post)

    Going Forward

    With Reanimated 2, the team at Software Mansion asked: “How would it look if there were no constraints when writing gestures and animations in React Native?” and started from there.

    This new version is based on the concept of animation worklets, JavaScript functions that are executed on the UI thread independently from the JavaScript thread. Worklets can receive parameters, access constants, and asynchronously invoke functions from your React code. The new Reanimated API offers five new hooks to create animation values, handle gesture states, and assign animated values to component properties. The GitHub repository contains many examples that showcase the power of the new Reanimated API.

    We hope that you are as excited about this new version as we are. Reanimated 2 dramatically lowers the barrier to entry for building complex user interactions in React Native. It enables new use cases where we previously had to cross the React Native bridge (to format values, for instance), and it dramatically improves performance at initialization time, which might have a substantial impact on tasks such as navigation transitions.

    We are looking forward to following the progress of this new exciting way to write gestures and animations.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Writing Better, Type-safe Code with Sorbet

    Hey, I’m Jay and I recently finished my first internship at Shopify as a Backend Developer Intern on the App Store Ads team. Out of all my contributions to the ad platform, I wanted to talk about one that has takeaways for any Ruby developer. Here are four reasons why we adopted Sorbet for static type checking in our repository.

    1. Type-safe Method Calls

    Let’s take a look at an example.
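    The original snippet isn't reproduced in this excerpt; a minimal reconstruction of the kind of code being described might look like this (Result, foo, and the data are invented for illustration):

    ```ruby
    class Result
      def value
        { status: "ok" }
      end
    end

    def foo
      [1] # imagine a lookup that can also come back empty
    end

    def action
      result = foo
      return nil if result.empty?
      Result.new
    end

    action.value.to_h # raises NoMethodError whenever foo comes back empty
    ```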

    On the last line, we call the method action and then call value.to_h on its return value. If action returns nil, calling value.to_h will cause an undefined method error.

    Without a unit test covering the case when action returns nil, such code could go undetected. To make matters worse, what if foo() is overridden by a child class to have a different return type? When types are inferred from the names of variables, as in the example, it’s hard for any new developer to know that their code needs to handle different return types. There’s no clue to suggest what result contains, so the developer would have to search the entire code base for what it could be.

    Let’s see the same example with method signatures.
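    The annotated snippet isn't shown in this excerpt either; a sketch of what it might look like, assuming the sorbet-runtime gem is available (the class wrapper and foo's element type are invented):

    ```ruby
    # typed: true
    require 'sorbet-runtime'

    class ActionExample
      extend T::Sig

      sig { returns(T::Array[T.untyped]) }
      def foo
        []
      end

      sig { returns(T.nilable(Result)) }
      def action
        result = foo
        return nil if result.empty?
        Result.new
      end
    end
    ```

    With this signature, the type checker rejects a bare action.value.to_h, since action may be nil, forcing the caller to handle the nil case (for example with action&.value).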

    In the revised example, it’s clear from the signature that `action` returns a Result object or nil. Sorbet type checking will raise an error to say that calling action.value.to_h is invalid because action can potentially return nil. If Sorbet doesn’t raise any errors regarding our method, we can deduce that foo() returns a Result object, as well as an object (most likely an array) that we can call empty? on. Overall, method annotations give us additional clarity and safety. Now, instead of writing trivial unit tests for each case, we let Sorbet check the output for us.

    2. Type-safety for Complex Data Types

    When passing complex data types around, it’s easy to use hashes such as the following:
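    The hash itself isn't shown in this excerpt; given the concerns listed below, it presumably looked something like the following (the field values are invented for illustration):

    ```ruby
    require 'date'

    ad = {
      id: nil,                          # not set until the record exists in the database
      state: "active",                  # meant to be an enum, but nothing enforces it
      score: nil,
      start_date: Date.new(2020, 6, 1),
      end_date: nil,
    }

    ad[:score] # => nil, which can be unexpected behavior in certain contexts
    ```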

    This approach has a few concerns:

    • :id and :score may not be defined until the object is created in the database. If they aren’t present, looking up ad[:id] or ad[:score] on the ad object will return nil, which is unexpected behavior in certain contexts.
    • :state may be intended to be an enum. There are no runtime checks ensuring that a value such as running isn’t accidentally put in the hash.
    • :start_date has a value, but :end_date is nil. Can they both be nil? Will :start_date always have a value? We don’t know without looking at the code that generated the object.

    Situations like this put a large onus on the developer to remember all the different variants of the hash and the contexts in which particular variants are used. It’s very easy for a developer to make a mistake by trying to access a key that doesn’t exist or assign the incorrect value to a key. Fortunately, Sorbet helps us solve these problems.

    Consider the example of creating an ad:

    Creating an ad

    Input data flows from an API request to the database through some layers of code. Once stored, a database record is returned.

    Here we define typed Sorbet structs for the input data and the output data. A Database::Ad extends an Input::Ad by additionally having an :id and :score.
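    A sketch of what such structs might look like, assuming the sorbet-runtime gem (State stands in for the Sorbet enum mentioned below; the two structs are written out independently here because subclassing one T::Struct from another is restricted in Sorbet, so the "extends" relationship is expressed by repeating the shared fields):

    ```ruby
    # typed: true
    require 'sorbet-runtime'

    module Input
      class Ad < T::Struct
        const :state, State              # a Sorbet enum
        const :start_date, Date
        const :end_date, T.nilable(Date) # end_date may be nil; start_date may not
      end
    end

    module Database
      class Ad < T::Struct
        const :id, Integer               # only present once the record exists
        const :score, Float
        const :state, State
        const :start_date, Date
        const :end_date, T.nilable(Date)
      end
    end
    ```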

    Each of the previous concerns has been addressed:

    • :id and :score clearly do not exist on ads being sent to the database as inputs, but definitely exist on ads being returned.
    • :state must be a State object (as an aside, we implement these using Sorbet enums), so invalid strings cannot be assigned to :state.
    • :end_date can be nil, but :start_date will never be nil.

    Any failure to obey these rules will raise errors during static type checking by Sorbet, and it is clear to developers what fields exist on our object when it’s being passed through our code.

    To extend beyond the scope of this article, we use GraphQL to specify type contracts between services. This lets us guarantee that ad data sent to our API will parse correctly into Input::Ad objects.

    3. Type-safe Polymorphism and Duck Typing

    Sorbet interfaces are integral to implementing the design patterns used in the Ad Platform repository. We’re committed to following a Hexagonal Structure with dependency injection:

    Hexagonal Structure with dependency injection

    When we get an incoming request, we first compose a task to execute some logic by injecting the necessary ports/adapters. Then we execute the task and return its result. This architecture makes it easy to work on components individually and isolate logic for testing. This leads to very organized code, fast unit tests, and high maintainability—however, this strategy relies on explicit interfaces to keep contracts between components.

    Let’s see an example where errors can easily occur:
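    The example isn't reproduced in this excerpt; a hedged reconstruction of the situation might look like this (the class bodies and return values are invented for illustration):

    ```ruby
    class SynchronousIndexer
      def index(keywords)
        { indexed: keywords.size } # writes to the database immediately
      end
    end

    class AsynchronousIndexer
      def index(keywords)
        "job-123" # enqueues a job and returns a job id: a different return type
      end
    end

    class Action
      def self.perform(indexer, keywords)
        # Nothing guarantees that `indexer` responds to `index`,
        # or that its result has the shape this code expects:
        indexer.index(keywords)
      end
    end

    Action.perform(SynchronousIndexer.new, ["shoes"])  # => { indexed: 1 }
    Action.perform(AsynchronousIndexer.new, ["shoes"]) # => "job-123", a surprise!
    ```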

    In the example method, we call Action.perform with either a SynchronousIndexer or an AsynchronousIndexer. Both implement the index method in a different manner. For example, the AsynchronousIndexer may enqueue a job via a job queue, whereas the SynchronousIndexer may store values in a database immediately. The problem is that there’s no way to know if both indexers have the index method or if they return the correct result type expected by Action.perform.

    In this situation, Sorbet interfaces are handy:
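    A sketch of the interface, assuming the sorbet-runtime gem (Result and the error class are hypothetical names standing in for the real ones):

    ```ruby
    # typed: true
    require 'sorbet-runtime'

    module Indexer
      extend T::Sig
      extend T::Helpers
      interface!

      sig { abstract.params(keywords: T::Array[String]).returns([Result, T::Array[IndexingError]]) }
      def index(keywords); end
    end

    class SynchronousIndexer
      extend T::Sig
      include Indexer

      sig { override.params(keywords: T::Array[String]).returns([Result, T::Array[IndexingError]]) }
      def index(keywords)
        # store values in the database immediately, then return the result tuple
      end
    end
    ```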

    We define a module called Indexer that serves as our interface. AsynchronousIndexer and SynchronousIndexer are classes that implement this interface, which means they both implement the index method. The index method must take an array of keyword strings and return a Result object along with a list of errors.

    Now we can modify action to take an Indexer as a parameter, so it’s guaranteed that any indexer provided implements the index method as expected. This makes it clear to a developer what types are being used, and it ensures that the code behaves as expected.

    4. Support for Gradual Refactoring

    One roadblock to adding Sorbet to an entire codebase is that it’s a lot of work to refactor every file to be typed. Fortunately, Sorbet supports gradual typing: it type checks your codebase on a file-by-file level, so you can refactor at your own pace. It also comes with five different typing strictness levels, letting you choose the level of granularity and adopt typing gradually across the files in a codebase.

    On the ads team, we decided to refactor using a namespace-by-namespace scheme. When a particular GitHub issue requires committing to a set of files in the same namespace, we upgrade those files to the minimum typed level of true, adding method signatures, interfaces, enums, and structs as needed.

    Enforcing Type Safety Catches Errors

    Typing our methods and data types with Sorbet encourages us to adhere to our design patterns more strictly. Sticking to our patterns keeps our code organized and friendly to developers while also discouraging duplication and bad practices. Enforcing type safety in our code saves us from shipping unsafe code to production and catches errors that our unit tests may not catch.

    We encourage everyone to try it in their projects!


    Shopify's Data Science & Engineering Foundations

    At Shopify, our mission is to make commerce better for everyone. With over one million businesses in more than 175 countries, Shopify is a mini-economy with merchants, partners, buyers, carriers, and payment providers all interacting. Careful and thoughtful planning helps us build products that positively impact the entire system.

    Commerce is a rapidly changing environment. Shopify’s Data Science & Engineering team supports our internal teams, merchants, and partners with high quality, daily insights so they can “Make great decisions quickly.” Here are the foundational approaches to data warehousing and analysis that empower us to deliver the best results for our ecosystem.

    1. Modelled Data

    One of the first things we do when we onboard (at least when I joined) is get a copy of The Data Warehouse Toolkit by Ralph Kimball. If you work in Data at Shopify, it’s required reading! Sadly it’s not about fancy deep neural nets or technologies and infrastructure. Instead, it focuses on data schemas and best practices for dimensional modelling. It answers questions like, “How should you design your tables so they can be easily joined together?” or “Which table makes the most sense to house a given column?” In essence, it explains how to take raw data and put it in a format that is queryable by anyone. 

    I’m not saying that this is the only good way to structure your data. For what it's worth, it could be the 10th best strategy. That doesn’t matter. What counts is that we agreed, as a Data Team, to use this modelling philosophy to build Shopify's data warehouse. Because of this agreed upon rule, I can very easily surf through data models produced by another team. I understand when to switch between dimension and fact tables. I know that I can safely join on dimensions because they handle unresolved rows in a standard way—with no sneaky nulls silently destroying rows after joining.

    The modelled data approach has a number of key benefits for working faster and more collaboratively. These are crucial as we continue to provide insights to our stakeholders and merchants in a rapidly changing environment.

    Key Benefits

    • No need to understand raw data’s structure
    • Data is compatible between teams

    2. Data Consistency and Open Access

    We have a single data modelling platform. It’s built on top of Spark in a single GitHub repo that everyone at Shopify can access, and everyone uses it. With everyone using the same tools as me, I can gather context quickly and independently: I know how to browse Ian's code, I can find where Ben has put the latest model, etc. I simply need to pick a table name and I can see 100% of the code that built that model.

    What’s more, all of our modelled data (except PII) sits on a Presto cluster that’s available to the whole company, not just data scientists. That’s right! Anyone at the company can query our data. We also have internal tools to discover these data sets. That openness and consistency makes things scalable.

    Key Benefits

    • Data is easily discoverable
    • Everyone can take advantage of existing data

    3. Rigorous ETL (Extract, Transform, Load)

    As a Data Team at a company focused on software, we’ve picked up skills from our developer friends. All of our data pipeline jobs are unit tested. We test every situation that we can think of: errors, edge cases, and so on. This may slow down development a bit, but it also prevents many pitfalls. It’s easy to lose track of a JOIN that occasionally doubles the number of rows under a specific scenario. Unit testing catches this kind of thing more often than you would expect.

    We also ensure that the data pipeline does not let jobs fail in silence. While it may be painful to receive a Slack message at 4 pm on Friday about a five-year-old dataset that just failed, the system ensures you can trust the data you play with to be consistently fresh and accurate.

    Key Benefits

    • Better data accuracy and quality
    • Trust in data across the company

    4. Vetted Dashboards

    Like our data pipeline, we have one main visualization engine. All finalized reports are centralized on an internal website. Before blindly jumping into the code like a university student three hours before a huge deadline, we can go see what others have already published. In most cases, a significant portion of the metrics you’re looking for are already accessible to everyone. In other cases, an existing dashboard is pretty close to what we’re looking for. Since the base code for every dashboard is centralized, this is a great starting point.

    Key Benefits

    • Better discovery speed
    • Reuse of work

    5. Vetted Data Points

    All data points that form the basis for major decisions, or that need to be published externally, are what we call vetted data points. They’re stored together with the context we need to understand them. This includes the original question, its answer, and the code that generated the results. One of the fundamentals of producing vetted data points is that the result shouldn’t change over time. For example, if I ask how many merchants were on the platform in Q1 2019, the answer should be the same today and four years from now. Sounds trivial, but it’s harder than it looks! By having it all in a single GitHub repo, it’s discoverable, reproducible, and easy to update each year.

    Key Benefits

    • Reproducibility of key metrics

    6. Everything is Peer Reviewed

    All of our work is peer reviewed, usually by at least two other data scientists. Even my boss and my boss's boss go through this. This is another practice we gleaned by working closely with developers. Dashboards, vetted data points, dimensional models, unit tests, data extraction, etc… it’s all reviewed. Knowing several people looked at a query invokes a high level of trust in the data across the company. When we do work that touches more than one team, we make sure to involve reviewers from both teams. When we touch raw data, we add developers as reviewers. These tactics really improve the overall quality of data outputs by ensuring pipeline code and analytics meet a high standard that is upheld across the team.

    Key Benefits

    • Better data accuracy and quality
    • Higher trust in data

    7. Deep Product Understanding

    Now for my favourite part: all analyses require a deep understanding of the product. At Shopify, we strive to fall in love with the problem, not the tools. Excellence doesn’t come from just looking at the data, but from understanding what it means for our merchants.

    One way we do this is to divide the Data Team into smaller sub-teams, each of which is associated with a product (or product area). A clear benefit is that sub-teams become experts about a specific product and its data. We know it inside and out! We truly understand what a value like enabled means in the status column of some table.

    Product knowledge allows us to slice and dice quickly at the right angles. This has allowed us to focus on metrics that are vital for our merchants. Deep product understanding also allows us to guide stakeholders to good questions, identify confounding factors to account for in analyses, and design experiments that will really influence the direction of Shopify’s products.

    Of course, there is a downside, which I call the specialist gap: sub-teams have less visibility into other products and data sources. I’ll explain how we address that soon.

    Key Benefits

    • Better quality analysis
    • Emphasis on substantial problems

    8. Communication

    What is the point of insights if you don’t share them? Our philosophy is that discovering an insight is only half the work. The other half is communicating the result to the right people in a way they can understand.

    We try to avoid throwing a solitary graph or a statistic at anyone. Instead, we write down the findings along with our opinions and recommendations. Many people are uncomfortable with this, but it’s crucial if you want a result to be interpreted correctly and spur the right actions. We can't expect non-experts to focus on a survival analysis. This may be the data scientist’s tool to understand the data, but don’t mistake it for the result.

    On my team, every time anyone wants to communicate something, the message is peer reviewed, preferably by someone without much background knowledge of the problem. If they cannot understand your message, it’s probably not ready yet. Intuitively, it might seem best to review the work with someone who understands the importance of the message. However, assumptions about the message become clear when you engage someone with limited visibility. We often forget how much context we have on a problem when we’ve just finished working on it, so what we think is obvious might not be so obvious for others.

    Key Benefits

    • Stakeholder engagement
    • Positive influence on decision making

    9. Collaboration Across Data Teams

    Since Shopify went Digital by Default, I have worked with many people I’ve never met, and they’ve all been incredible! Because we share the same assumptions about the data and underlying frameworks, we understand each other. This enables us to work collaboratively with no restrictions in order to tackle important challenges faced by our merchants. Take COVID-19 for example. We created a fully cross-functional task force with one champion per data sub-team to close the specialist gap I mentioned previously. We meet to share findings on a daily basis and collaborate on deep dives that may require or affect multiple products. Within hours of establishing this task force, the team was running at full speed. Everyone has been successfully working together towards one goal, making things better for our merchants, without being constrained to their specific product area.

    Key Benefits

    • Business-wide impact
    • Team spirit

    10. Positive Philosophy About Data

    If you share some game-changing insights with a big decision maker at your company, do they listen? At Shopify, leaders might not action every single recommendation from Data because there are other considerations to weigh, but they definitely listen. They’re keen to consider anything that could help our merchants.

    Shopify announced several features at Reunite to help merchants, like gift cards for all merchants and the launch of local delivery. The Data Team provided many insights that influenced these decisions.

    At the end of the day, it’s the data scientist’s job to make sure insights are understood by the key people. That being said, having leaders who listen helps a lot. Our company’s attitude towards data transforms our work from interesting to impactful.

    Key Benefits

    • Impactful data science

    No Team Member Starts from Scratch at Shopify

    Shopify isn’t perfect. However, our emphasis on foundations and building for the long term is paying off. No one on the Data Team needs to start from scratch. We leverage years of data work to uncover valuable insights. Some we get from existing dashboards and vetted data points. In other cases, modelled data allows us to calculate new metrics with fewer than 50 lines of SQL. Shopify’s culture of data sharing, collaboration, and informed decision making ensures these insights turn into action. I am proud that our investment in foundations is positively impacting the Data Team and our merchants.

    If you’re passionate about data at scale, and you’re eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Spark Joy by Running Fewer Tests

    Developers write tests to ensure correctness and allow future changes to be made safely. However, as the number of features grows, so does the number of tests. Tests are a double-edged sword: well-written ones catch bugs and maintain a program’s stability, but as the code base grows, a high number of tests impedes scalability because they take a long time to run and increase the likelihood of intermittently failing tests. Software projects often require all tests to pass before merging to the main branch, which adds overhead for every developer. Intermittently failing tests worsen the problem. Some causes of intermittently failing tests are:

    • timing
    • instability in the database
    • HTTP connections/mockings
    • random generators
    • tests that leak state to other tests: the test passes every single time by itself, but causes other tests to fail depending on the order.

    Unfortunately, one can’t fully eradicate intermittently failing tests, and the likelihood of them occurring increases as the codebase grows. They make already slow test suites even slower, because now you have to retry them until they pass.

    I’m not implying that one shouldn’t write tests. The benefits of quality assurance, performance monitoring, and speeding up development by catching bugs early instead of in production outweigh the downsides. However, improvements can be made. My team thus embarked on a journey of making our continuous integration (CI) more stable and faster. I’ll share the dynamic analysis system for test selection that we implemented, followed by other approaches we explored but decided against. Test selection sparks joy in my life, and I hope to bring the same joy to you.

    Problems with Tests at Shopify

    Tests impede developer productivity at Shopify. The test suite of our monolithic repository:

    • has over 150,000 tests
    • is growing by 20-30% in size annually
    • takes about 30-40 minutes to run on hundreds of Docker containers in parallel.

    Each pull request requires all tests to pass. Developers have to either wait for tests or pay the price of context switching. In our bi-annual survey, build stability and speed is a recurring topic, so this problem is clearly felt by our developers.

    Solving the Problem with Tests

    There’s an abundance of blog posts, articles, and research papers on optimizing code, but unfortunately few on optimizing tests. In fact, we learned that it’s unrealistic to optimize the tests themselves because of their sheer quantity and growth rate. We also learned that this is uncharted territory for many companies.

    As our research progressed, it became apparent that the right solution was to only run the tests related to the code change. This was challenging for a large, dynamically typed Ruby codebase that makes ample use of the language flexibility. Furthermore, the difficulty was exacerbated by metaprogramming in Rails as well as other non-Ruby files in the code base that affect how the application behaves, for example, YAML, JSON, and JavaScript.

    What Is Dynamic Analysis?

    Dynamic analysis, in essence, is logging every single method call. We run each test and track all the files in the call graph. Once we have the call graphs, we create a test mapping: for every file, we find what tests have that file in its call graph. By looking at what files have been modified, added, removed, or renamed, we can look up the tests we need to run.
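    The mapping step described above can be sketched in a few lines of Ruby. This is a toy model: real call graphs would come from tracing tools, and the file names here are invented for illustration:

    ```ruby
    # Per-test call graphs: every file touched while running each test.
    call_graphs = {
      "test/cart_test.rb"  => ["app/cart.rb", "app/money.rb"],
      "test/order_test.rb" => ["app/order.rb", "app/money.rb"],
    }

    # Invert the call graphs: for every file, which tests reach it?
    mapping = Hash.new { |h, k| h[k] = [] }
    call_graphs.each do |test, files|
      files.each { |file| mapping[file] << test }
    end

    # Given a code change, look up the tests that need to run.
    changed_files = ["app/money.rb"]
    tests_to_run = changed_files.flat_map { |f| mapping[f] }.uniq
    # => ["test/cart_test.rb", "test/order_test.rb"]
    ```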

    You can check out Rotoscope and TracePoint to record call graphs for Ruby applications.

    Why Use Dynamic Analysis?

    Because Ruby is a dynamically typed language, we can’t retrieve a dependency graph using static analysis, so we don’t know which tests correspond to which code.

    Downsides of Running Dynamic Analysis on Ruby on Rails

    1. It’s Slow.

    It’s computationally intensive to generate the call graphs, and we can’t run it for every single PR. Instead, we run dynamic analysis on every deployed commit.

    2. Mapping Can Lag Behind HEAD

    The generated mapping lags behind the head of the main branch because it runs asynchronously. To solve this problem, we run all tests that satisfy at least one of the following criteria:

    • added or modified on the current branch
    • added or modified between the head of the last generated mapping and the current branch head
    • mapped to the current branch’s code changes

    3. There Are Untraceable Files

    There are non-Ruby files, such as YAML and JSON, that can’t be traced in the call graph. We added custom patches to Rails to trace some of them. For example, we patched the I18n::Backend class to trace the translation files in YAML. For changes to files that haven’t been traced, we run every single test.

    4. Metaprogramming Obfuscates Call Paths

    We confined existing metaprogramming to known directories and added glob rules on file paths to determine which tests to run. We discourage new metaprogramming through Sorbet and other linters.

    5. Some Failing Tests Won’t Be Caught

    The generated mapping from dynamic analysis can get out of date with the latest main, and sometimes failing tests won’t get selected before being merged to main. To circumvent the issue, we run the full test suite on every deploy and automatically disable any failing test so other developers won’t be blocked from shipping their code. The full test suite runs asynchronously, and pull requests can get merged before it completes.

    Automatic disabling of failing tests sounds counterintuitive to many people. From what we observed, the percentage of pull requests with failing tests being merged to the main branch is about 0.06%. We also have other mechanisms to mitigate the risks, such as canary deploys and type checking using Sorbet. The code owners of the failing tests are notified. We expect developers to fix or remove the failures without blocking future deploys.

    How Was the Dynamic Analysis Rolled Out?

    In the experimentation phase of the new dynamic analysis system, the test-selection pipeline ran in parallel with the pipeline that runs the full test suite for each new PR, and we measured its recall: out of all the failing tests, did the new pipeline select the same ones? We didn’t care about the tests that pass, because only the failing tests cause trouble.

    We measured our results using three metrics.

    Failure Recall

    We define recall as the percentage of legitimately failing tests, excluding intermittently failing tests, that our system selected. We want this to be as close as possible to 100%. It’s hard to measure this accurately because of the occurrence of intermittently failing tests. Instead, we approximate the recall by looking at the number of consistently failing tests merged into main.

    In the first two months the project was active, out of the 8,360 commits that were merged, we failed to detect only five failing tests that landed on main. We also managed to resolve most of the root causes of those missed failures, so the same problems won’t repeat in the future.

    We achieved a 99.94% recall.

    Speed Improvement

The overall selection rate, the ratio of selected tests to the total number of tests, is about 60%:

Percentage of selected test files per build

    About 40% of builds run fewer than 20% of tests. This shows that many developers will significantly benefit from our test selection system:

Percentage of builds that selected fewer than 20% of all tests

    Compute Time

    In total, we’re saving about 25% compute time. This is measured by adding up the time spent preparing and running tests on every docker container in a build, and averaging that across all builds. It didn’t decrease more because a significant chunk of computing time is still used for setting up containers, databases, and pulling caches. Note that we’re also adding compute time by running the dynamic analysis for every deployed commit on main. We estimate that this will undo some, but not all of the infrastructure cost savings.

    Other Approaches We Explored

    Prior to choosing dynamic analysis, we explored other approaches but ultimately ruled them out.

    Static Analysis

To determine a dependency graph, we briefly explored using Sorbet, but this would only be possible if the entire code base were in strict Sorbet typing. Unfortunately, only part of the code base is, and it’s too big for my team to convert the rest.

    Machine Learning

    It’s possible to use machine learning to find the dependency graph. Facebook has an excellent research paper [PDF] on it. We chose dynamic analysis at Shopify because we’re not sure if we have enough data to make the prediction, and we want to choose an approach that’s deterministic and reproducible.

    More Machines for Tests

    We tried adding more machines for the test suite. Unfortunately, the performance didn’t increase linearly as we scaled horizontally. In fact, tests on average take longer as we increase the number of machines past a certain number. Increasing machines doesn’t reduce intermittently failing tests and it increases the possibility of failing connections to sidecars, thus increasing test retries.

    Benefits of Running Fewer Tests

    There are three major benefits of selectively running tests:

    1. Developers get faster feedback from tests.
2. The likelihood of encountering intermittently failing tests decreases, which speeds up developers even further.
    3. CI costs less.

    Skepticism about Dynamic Analysis/Test Selection

    Before the feature was rolled out, many developers were skeptical that it would work. Frankly, I was one of them. Many people voiced their doubts both privately and openly. However, much to my surprise, after it went live, people were silent. On average, developers request running the full test suites on under 2% of the pull requests.

    If you’re in a similar situation, I hope our experience helps you. It’s hard for developers to embrace the idea that some tests won’t be run when the importance of tests is ingrained in our heads.

    If this sounds like the kind of problems you'd enjoy solving, come work for us. Check out the Software Development at Shopify (Expression of Interest) career posting and apply specifying an interest in Developer Acceleration.  

Understanding Programs Using Graphs

A recording of our event, ShipIt! Presents Understanding Programs Using Graphs in TruffleRuby, is now available in the ShipIt! Presents section of this page.

    You may have heard that a compiler uses a data structure called an abstract-syntax-tree, or AST, when it parses your program, but it’s less common knowledge that an AST is normally used just at the very start of compilation, and a powerful compiler generally uses a more advanced type of data structure as its intermediate representation to optimize your program and translate it to machine code in the later phases of compilation. In the case of TruffleRuby, the just-in-time compiler for Ruby that we’re working on at Shopify, this data structure is something called a sea-of-nodes graph.

    I want to show you a little of what this sea-of-nodes graph data structure looks like and how it works, and I think it’s worth doing this for a couple of reasons. First of all, I just think it’s a really interesting and beautiful data structure, but also a practical one. There’s a lot of pressure to learn about data structures and algorithms in order to pass interviews in our industry, and it’s nice to show something really practical that we’re applying here at Shopify. Also, the graphs in sea-of-nodes are just really visually appealing to me and I wanted to share them.

    Secondly, knowing just a little about this data structure can give you some pretty deep insights into what your program really means and how the compiler understands it, so I think it can increase your understanding of how your programs run. 

    I should tell you at this point that I’m afraid I’m actually going to be using Java to show my examples, in order to keep them simpler and easier to understand. This is because compared to Ruby, Java has much simpler semantics—simpler rules for how the language works—and so much simpler graphs. For example, if you index an array in Java that’s pretty much all there is to it, simple indexing. If you index an array in Ruby you could have a positive index, a negative index, a range, conversion, coercion, and lots more—it’s just more complicated. But don’t worry, it’s all super-basic Java code, so you can just pretend it’s pseudo code if you’re coming from Ruby or any other language.

    Reading Sea-of-nodes Graphs

Let’s dive straight in by showing some code and the corresponding sea-of-nodes graph.

    Here’s a Java method. As I said, it’s not using any advanced Java features so you can just pretend it’s pseudo code if you want to. It returns a number from a mathematical sequence known as the Fibonacci sequence, using a simple recursive approach.
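For reference, a minimal recursive version matching that description looks like this (a sketch; the exact code the graphs were generated from may differ slightly):

```java
public class Fibonacci {
    // Simple recursive Fibonacci, as described in the text:
    // each value is the sum of the two preceding ones.
    static int fib(int n) {
        if (n <= 1) {
            return n;
        }
        return fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(10)); // prints 55
    }
}
```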

    Here’s the traditional AST data structure for this program. It’s a tree, and it directly maps to the textual source code, adding and removing nothing. To run it you’d follow a path in the tree, starting at the top and moving through it depth-first.

Abstract syntax tree for the Fibonacci sequence program

    And here’s the sea-of-nodes graph for the same program. This is a real dump of the data structure as used in practice to compile the Java method for the Fibonacci sequence we showed earlier.

Sea-of-nodes graph for the Fibonacci sequence program

    There’s quite a lot going on here, but I’ll break it down.

    We’ve got boxes and arrows, so it’s like a flowchart. The boxes are operations, and the arrows are connections between operations. An operation can only run after all the operations with arrows into it have run.

    A really important concept is that there are two main kinds of arrows that are very different. Thick red arrows show how the control flows in your program. Thin green arrows show how the data flows in your program. The dashed black arrows are meta-information. Some people draw the green arrows pointing upwards, but in our team we think it’s simpler to show data flowing downwards.

There are also two major kinds of boxes for operations. Square red boxes do something; they have a side-effect or are imperative in some way. Diamond green boxes compute something; they’re pure, or side-effect free, and safe to execute whenever.

    P(0) means parameter 0, or the first parameter. C(2) means a constant value of 2. Most of the other nodes should be understandable from their labels. Each node has a number for easy reference.

    To run the program in your head, start at the Start node at the top and move down thick red arrows towards one of the Return nodes at the bottom. If your square red box has an arrow into it from an oval or diamond green box, then you run that green box, and any other green boxes pointing into that green box, first.

    Here’s one major thing that I think is really great about this data structure. The red parts are an imperative program, and the green parts are mini functional programs. We’ve separated the two out of the single Java program. They’re joined where it matters, and not where it doesn’t. This will get useful later on.

    Understanding Through Graphs

    I said that I think you can learn some insights about your program using these graphs. Here’s what I mean by that.

    When you write a program in text, you’re writing in a linear format that implies lots of things that aren’t really there. When we get the program into a graph format, we can encode only the actual precise rules of the language and relax everything else.

    I know that’s a bit abstract, so here’s a concrete example.

This one does some arithmetic based on a three-way if-statement. Notice that a * b is common to two of the three branches.
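The source isn’t reproduced here, but a shape matching that description, with the shared multiplication appearing in two of the three branches, might look like this (the names and exact arithmetic are assumptions):

```java
public class ThreeWay {
    // A three-way branch where a * b appears in two of the three arms.
    static int compute(int condition, int a, int b) {
        if (condition == 0) {
            return a + b;
        } else if (condition == 1) {
            return a * b;
        } else {
            return a * b + a;
        }
    }

    public static void main(String[] args) {
        System.out.println(compute(1, 2, 3)); // prints 6
    }
}
```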

Sea-of-nodes graph for a three-way if-statement program

    When we look at the graph for this we can see a really clear division between the imperative parts of the program and the functional parts. Notice in particular that there is only one multiplication operation. The value of a * b is the same on whichever branch you compute it, so we have just one value node in the graph to compute it. It doesn’t matter that it appeared twice in the textual source code—it has been de-duplicated by a process known as global value numbering. Also, the multiplication node isn’t fixed in either of the branches, because it’s a functional operation and it could happen at any point and it makes no change to what the program achieves.

    When you look at the source code you think that you pick a branch and only then you may execute a * b, but looking at the graph we can see that the computation a * b is really free from which branch you pick. You can run it before the branch if you want to, and then just ignore it if you take the branch which doesn’t need it. Maybe doing that produces smaller machine code because you only have the code for the multiplication once, and maybe it’s faster to do the multiplication before the branch because your processor can then be busy doing the multiplication while it decides which branch to go to.

    As long as the multiplication node’s result is ready when we need it, we’re free to put it wherever we want it.

    You may look at the original code and say that you could refactor it to have the common expression pulled out in a local variable. We can see here that doing that makes no difference to how the compiler understands the code. It may still be worth it for readability, but the compiler sees through your variable names and moves the expression to where it thinks it makes sense. We would say that it floats the expression.

    Graphs With Loops

    Here’s another example. This one has a loop.

    It adds the parameter a to an accumulator n times.
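Again, the original source isn’t included here, but the description maps onto something like this (a sketch; names assumed):

```java
public class LoopExample {
    // Adds a to an accumulator n times, using a plain counted loop
    // as you'd write it in C or Java.
    static int addNTimes(int a, int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += a;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(addNTimes(3, 4)); // prints 12
    }
}
```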

Sea-of-nodes graph for a program with loops

This graph has something new: an extra thick red arrow pointing backward. That closes the loop; it’s the jump back to the start of the loop for a new iteration.

    The program is written in an imperative way, with a traditional iterative looping construct as you’d use in C, but if we look at the little isolated functional part, we can see the repeated addition on its own very clearly. There’s literally a little loop showing that the + 1 operation runs repeatedly on its own result.

Isolated functional part of sea-of-nodes graph

That phi node (the little circle with a line through it is the Greek letter φ) is a slightly complicated concept with a traditional name. It means that the value at that point may be one of multiple possibilities.

    Should We Program Using Graphs?

    Every few years someone writes a new PhD thesis on how we should all be programming graphically instead of using text. I think you can possibly see the potential benefits and the practical drawbacks of doing that by looking at these graphs.

    One benefit is that you’re free to reshape, restructure, and optimize your program by manipulating the graph. As long as you maintain a set of rules, the rules for the language you’re compiling, you can do whatever you want.

A drawback is that it’s not exactly compact. This is a 6-line method, but it takes a full screen to draw it as a graph, and it already has 21 nodes and 22 arrows. As graphs get bigger it becomes impossible to draw them without arrows crossing, and the arrows become so long that they lose context—you can’t see where they’re going to or coming from—and then it becomes much harder to understand.

    Using Sea-of-nodes Graphs at Shopify

    At Shopify we’re working on ways to understand these graphs at Shopify-scale. The graphs for idiomatic Ruby code in a codebase like our Storefront Renderer can get very large and very complicated—for example this is the Ruby equivalent to the Java Fibonacci example.

Sea-of-nodes graph for the Java Fibonacci example in Ruby (click for larger SVG version)

    One tool we’re building is the program to draw these graphs that I’ve been showing you. It takes compiler debug dumps and produces these illustrations. We’re also working on a tool to decompile the graphs back to Ruby code, so that we can understand how Ruby code is optimized, by printing the optimized Ruby code. That means that developers who just know Ruby can use Ruby to understand what the compiler is doing.


    In summary, this sea-of-nodes graph data structure allows us to represent a program in a way that relaxes what doesn’t matter and encodes the underlying connections between parts of the program. The compiler uses it to optimize your program. You may think of your program as a linear sequence of instructions, but really your compiler is able to see through that to something simpler and more pure, and in TruffleRuby it does that using sea-of-nodes.

    Sea-of-nodes graphs are an interesting and, for most people, novel way to look at your program.

    Additional Information

    ShipIt! Presents: Understanding Programs Using Graphs in TruffleRuby

    Watch Chris Seaton and learn about TruffleRuby’s sea-of-nodes. This beautiful data structure can reveal surprising in-depth insights into your program.


ShipIt! Presents: How Shopify Uses Nix

On May 25, 2020, ShipIt!, our monthly event series, presented How Shopify Uses Nix. Building upon my What is Nix post, I show how we rebuilt our developer tooling using Nix, and show off some of the tooling we actually use at Shopify on a day-to-day basis.

    I wasn't able to answer all the questions during the event, so I've included answers to those ones below.

    Would runix interop well with lorri if/when it's open sourced?

    Maybe. Not effortlessly, because our whole shadowenv strategy is similar but different. It could probably be made to work without too much effort, and as long as compatibility didn’t make some major tradeoff that I’m not able to guess at right now. We’d be super open to a PR to make it compatible.

    Do you use nix for CI/CD, and if you do, how is it set up?

    Not yet. Hoping to get to that late this year.

    For which Lisp was that Lisp code you showed earlier?

    It uses Ketos, a little Rust implementation, but it’s almost not important: we document the available functions, and there are very few. I like to think of it more as a DSL than even as a “real” Lisp.

    I'm curious about how everyone WFH affects this tooling? Is there some limit to how often you can update dependencies because it'll force people to re-download everything on a rebase over their home internet connections?

    Yeah, this is something we’re still puzzling through. We don’t bump our nixpkgs revision very often just as a matter of, I don’t know, laziness maybe, but we’ve definitely seen more people complaining about large downloads when we do since moving out of our offices with nice multi-Gbit fiber. Mainly, it’s going to be interesting to see the world struggle with trying to provide home-workers with better internet speeds over the next year. This is something Canada and the US do an abysmal job of right now.

    What's been the pain points with Nix for your use case?

The tooling is in general really optimized for “build” workflows, not development workflows. This is really clear in the bundleEnv/bundleApp workflows. I had to spend a lot of time putting together a solution that mapped better onto the way we actually want to use it. I’m going to try to upstream that at some point, but I anticipate more of these problems.

One of your dev.yml files had import <nixpkgs> in it. Do you pin nixpkgs versions in your system?

    Yes. I think I kind of touched on this probably later in the demo but that dev tool has a specific nixpkgs revision that it enforces each time dev up is run.

Is there more of an issue with gem dependencies, as they populate a global cache by default, vs something like node? Do you “node”-nix your node deps, or is having them in the project-local node_modules “enough” in this case?

    We don’t yet manage node modules with Nix but I’m looking at doing that whenever we get the (large amount of) time required to do that successfully. Profpatsch/yarn2nix seems promising. We started with gems just because they annoyed me more and I had more familiarity with how they work.

    If you’re looking for more Nix content, I’ve re-released a series of screencasts I recorded for developers at Shopify to the public. Check out Nixology on YouTube.

7 Ways to Make Your SQL Workshop Beginner-friendly

Data is key to making great decisions at scale, so it’s no surprise that all new hires in RnD (Research & Development) at Shopify are invited to take a 90-minute hands-on workshop to learn about Shopify’s data practices, architecture, and some SQL. Since RnD is a multi-disciplinary organization, participants come from a variety of technical backgrounds (engineering, data science, UX, project management) and sometimes non-technical disciplines that work closely with engineering teams. We split the workshop into two groups: a beginner-friendly group that assumes no prior knowledge of SQL or other data tools, and an intermediate group for those with familiarity.

    Beginners have little-to-no experience with SQL, although they may have experience with other programming languages. Still, many participants aren’t programmers or data analysts. They’ve never encountered databases, tables, data models, and have never seen a SQL query. The goal is to empower them to responsibly use the data available to them to discover insights and make better decisions in their work.

    That’s a lofty goal to achieve in 90 minutes, but creating an engaging, interactive workshop can help us get there. Here are some of the approaches we use to help participants walk away equipped and excited to start using data to power their work.

    1. Answer the Basics: Where Does Data Come From?

    Before learning to query and analyze data, you should first know some basics about where the data comes from and how to use it responsibly. Data Scientists at Shopify are responsible for acquiring and preprocessing data, and generating data models that can be leveraged by anyone looking to query the data. This means participants of the workshop, who aren’t data scientists, won’t be expected to know how to collect and clean big data, but should understand how the data available to them was collected. We take the first 30 minutes to walk participants through the basics of the data warehouse architecture:

    • Where does data come from?
    • How does it end up in the data warehouse?
    • What does “raw data” vs. “cleaned/modelled data” mean?

Data Warehouse Architecture

We also caution participants to use data ethically and responsibly. We warn them not to query production systems directly, to rely on modelled data when possible, and to respect data privacy. We also give them tools that will help them find and understand datasets (even if that “tool” is to ask a data scientist for help). By the time they’re ready to start querying, they should have an appreciation for how datasets have been prepared and what they can do with them.

    2. Use Real Data and Production Tools

    At Shopify, anyone in RnD has access to tools to help them query and analyze data extracted from the ecosystem of apps and services that power Shopify. Our workshop doesn’t play around with dummy data—participants go right into using real data from our production systems, and learn the tools they’ll be using throughout their career at Shopify.

    If you have multiple tools available, choose one that has a simple querying interface, easy chart-building capabilities, and is connected (or can easily be connected) to your data source. We use Mode Analytics, which has a clean console UI, drag-and-drop chart and report builders, clean error feedback, and a pane for searching and previewing data sources from our data lake.

    In addition to choosing the right tool, the dataset can make or break the workshop. Choosing a complex dataset with cryptic column names that is poorly documented, or one that requires extensive domain knowledge will draw the attention of the workshop away from learning to query. Instead, participants will be full of questions about what their data means. For these reasons, we often use our support tickets dataset. Most people understand the basic concepts of customer support: a customer who has an issue submits a ticket for support and an agent handles the question and closes the ticket once it’s been solved. That ticket’s information exists in a table where we have access to facts like when the ticket was created, when it was solved, who it was handled by, what customer it was linked to, and more. As a rule of thumb, if you can explain the domain and the structure of the data in 3 sentences or less, it’s a good candidate to use in a beginner exercise.

    3. Identify Your Objectives

To figure out what your workshop should cover, it’s often helpful to think about the types of questions you want your participants to be able to answer. At the very least, you should introduce the SELECT, FROM, WHERE, ORDER BY, and LIMIT clauses. These are foundational for almost any type of question participants will want to answer. Additional techniques can depend on your organization’s goals.

    Some of the most common questions we often answer with SQL include basic counts, trends over time, and segmenting metrics by certain user attributes. For example, a question we might get in Support is “How many tickets have we processed monthly for merchants with a Shopify Plus plan?”, or “What was the average handling time of tickets in January 2020?”

For our workshop, we teach participants foundations of SQL that include the keywords SELECT, DISTINCT, FROM, WHERE, JOIN, GROUP BY, and ORDER BY, along with functions like COUNT, AVG, and SUM. We believe this provides a solid foundation for answering almost any type of question someone outside of Data Science should be able to self-solve.

    4. Translate Objectives Into Real Questions

    Do you remember the most common question from your high school math classes? If it was anything like my classroom, it was probably, “When will we actually need to know this in the real world?” Linking the techniques to real questions helps all audiences grasp the “why” behind what we’re doing. It also helps participants identify real use cases of the queries they build and how they can apply queries to their own product areas, which motivates them to keep learning!

    Once you have an idea what keywords and functions you want to include in your workshop, you’ll need to figure out how you want to teach them to your participants. Using the dataset you chose for the workshop, construct some interesting questions and identify the workshop objectives required for each question.

Identifying Objectives and Questions


    It also helps to have one big goal question for participants to aim to answer. Ideally, answering the question should result in a valuable, actionable insight. For example, using the domain of support tickets, our goal question can be “How have wait times for customers with a premium plan changed over time?” Answers to this question can help identify trends and bottlenecks in our support service, so participants can walk away knowing their insights can solve real problems.

    A beginner-friendly exercise should start with the simplest question and work its way toward answering the goal question.

    5. Start With Exploration of Data Sources

    An important part of any analysis is the initial exploration. When we encounter new data sources, we should always spend time understanding the structure and quality of the data before trying to build insights off it. For example, we should know what the useful columns for our analysis will be, the ranges for any numerical or date columns, the unique values for any text columns, and the filtering we’ll need to apply later, such as filtering out test tickets created by the development team.

    The first query we run with participants is always a simple “SELECT * FROM {table}”. Not only does this introduce participants to their first keywords, but it gives us a chance to see what the data in the table looks like. We then learn how to select specific columns and apply functions like MIN, MAX, or DISTINCT to explore ranges.

    6. Build on Each Query to Answer More Complex Questions

Earlier, we talked about identifying real questions we want participants to answer with SQL. It turns out that answering increasingly difficult questions often requires only one or two additional keywords or functions on top of a simple query.

    We discussed how participants start with a simple “SELECT * FROM {tickets_table}” query to explore the data. This forms the foundational query used for the rest of the exercise. Next, we might start adding keywords or functions in this sequence:

    Question → Objective

    1. How many tickets were created last month? → Add COUNT and WHERE.
    2. What is the daily count of tickets created over the last month? → Add GROUP BY and date-parsing functions (e.g. Presto’s DATE_TRUNC function to convert a timestamp to a day).
    3. How do the metrics reported in #2 change by the customer’s plan type? → Add JOIN and GROUP BY multiple columns.

    Each question above can be built from the prior query and the addition of up to two new concepts, and already participants can start deriving some interesting insights from the data!
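To make the progression concrete, here’s a runnable sketch against an in-memory SQLite database (all table and column names are invented for illustration; the real warehouse uses Presto, so Presto’s DATE_TRUNC is approximated here with SQLite’s strftime):

```python
import sqlite3

# A hypothetical miniature of the support-tickets dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (id INTEGER, customer_id INTEGER, created_at TEXT);
    CREATE TABLE customers (id INTEGER, plan_type TEXT);
    INSERT INTO tickets VALUES
        (1, 10, '2020-01-05'), (2, 11, '2020-01-05'), (3, 10, '2020-01-06');
    INSERT INTO customers VALUES (10, 'plus'), (11, 'basic');
""")

# 1. How many tickets were created last month? (COUNT + WHERE)
count = conn.execute(
    "SELECT COUNT(*) FROM tickets WHERE created_at >= '2020-01-01'"
).fetchone()[0]
print(count)  # 3

# 2. Daily count of tickets (GROUP BY + a date-parsing function)
daily = conn.execute("""
    SELECT strftime('%Y-%m-%d', created_at) AS day, COUNT(*)
    FROM tickets GROUP BY day ORDER BY day
""").fetchall()
print(daily)  # [('2020-01-05', 2), ('2020-01-06', 1)]

# 3. The same metric segmented by plan type (JOIN + GROUP BY two columns)
by_plan = conn.execute("""
    SELECT c.plan_type, strftime('%Y-%m-%d', t.created_at) AS day, COUNT(*)
    FROM tickets t JOIN customers c ON t.customer_id = c.id
    GROUP BY c.plan_type, day ORDER BY c.plan_type, day
""").fetchall()
print(by_plan)
```

Each step reuses the previous query and adds at most two new concepts, which mirrors how the exercise unfolds in the workshop.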

    7. Provide Resources to Keep the Learning Journey Going

    Unfortunately, there’s only so far we can go in a single 90-minute workshop and we won’t be able to answer every question participants may have. However, there are many free resources out there for those who want to keep learning. Before we part, here’s a list of my favourite learning resources from beginner to advanced levels.

    Why Teach SQL to Non-Technical Users?

    I haven’t run into anyone in RnD at Shopify who hasn’t used SQL in some capacity. Those who invest time into learning it are often able to use the available data tools to 10x their work. When employees are empowered to self-serve their data, they have the ability to quickly gather information that helps them make better decisions.

    There are many benefits to learning SQL, regardless of discipline. SQL is a tool that can help us troubleshoot issues, aggregate data, or identify trends and anomalies. As a developer, you might use SQL to interact with your app’s databases. A UX designer can use SQL to get some basic information on the types of users using their feature and how they interact with the product. SQL helps technical writers explore pageview and interaction data of their documents to inform how they should spend their time updating documents that may be stale or filling resource gaps. The opportunities are endless when we empower everyone.

    Learn SQL!

Learn SQL from the ground up with these free resources! Don’t be discouraged by the number of resources shared below; pick and choose which resources best fit your learning style and goals. See the comments section to understand the key concepts you should learn. If you choose to skip a resource, make sure you understand any concepts listed before moving on.

    Database theory

Database management systems allow data to be stored and accessed in a computer system. At Shopify, many applications store their data in Relational Database Management Systems (RDBMS). It can be useful to understand the basics behind RDBMS since much of the raw data we encounter will come from these databases, which can be “spoken to” with SQL. Learning about RDBMS also exposes you to concepts like tables, keys, and relational operations that are helpful in learning SQL.




    Udacity - Intro to Relational Databases 

    This free course introduces databases for developers. Complete the 1st lesson to learn about data and tables. Complete the rest of the lessons if they interest you, or continue on with the rest of the resources

    Stanford Databases 1 MOOC

This free MOOC from Stanford University is a great resource for tailoring your learning path to your specific goals. Unfortunately, as of March 2020, the Lagunita online learning platform was retired and they are working on porting the course over to a new platform. Stay tuned for these resources to come back online!

    When they become available, check out the following videos: Introduction, Relational Model, Querying Relational Databases

    Some terms you should be familiar with at this point: data model, schema, DDL (Data Definition Language), DML (Data Manipulation Language), Columns/attributes, Rows, Keys (primary and foreign keys), Constraints, and Null values

    Coursera Relational Database Systems

    While the Stanford MOOC is being migrated, this course may be the next best thing. It covers many of the concepts discussed above

    Coursera - Database Management Essentials

    This is another comprehensive course that covers relational models and querying. This is part of a specialization focusing on business intelligence warehousing, so there’s also some exposure to data modelling and normalization (advanced concepts!)

     Querying with SQL

    Before learning the SQL syntax, it’s useful to familiarize yourself with some of the concepts listed in the prior section. You should understand the concept of a table (which has columns/attributes and rows), data types, and keys (primary and foreign keys).

    Beginner resources




    SQLBolt interactive lessons

    This series of interactive lessons will take you through the basics of SQL. Lessons 1-6 cover basic SQL exercises, and lessons 7-12 cover intermediate-level concepts. The remaining exercises are primarily for developers and focus on creating and manipulating tables.

    You should become familiar with SQL keywords like: SELECT, FROM, WHERE, LIMIT, ORDER BY, GROUP BY, and JOIN

    Intermediate-level concepts you should begin to understand include: aggregations, types of joins, unions

    Article: A beginner's guide to SQL

    This is a great read that breaks down all the keywords you’ve encountered so far in the most beginner-friendly way possible. If you’re still confused about what a keyword means, how to use it properly, or how to get the syntax right for the question you’re trying to answer (what order do these keywords go in?), this is a great resource to help you build your intuition.

    W3 Schools SQL Tutorial

    W3 has a comprehensive tutorial for learning SQL. I don’t recommend you start with w3, since it doesn’t organize the concepts in a beginner-friendly way as the other resources have done. But, even as an advanced SQL user, I still find their guides helpful as a reference. It’s also a good reference for understanding all the different JOIN types.

    YouTube - SQL Tutorial

    For the audio-visual learners, this video lesson covers the basics of database management and SQL

    Intermediate resources




    SQL Basics - Subqueries

    SQL Basics - Aggregations

    Despite its name, this resource covers some topics I’d consider beyond the basic level.

    Learn about subqueries and aggregations. This site also has good notes on joins if you’re still trying to wrap your head around them.

    SQL Zoo Interactive Lessons

    Complete the exercises here for some more hands-on practice

    Modern SQL - WITH Keyword

    Learn about the WITH keyword (also known as Common Table Expressions (CTE)) for writing more readable queries

    Mode's SQL Tutorial

    At Shopify, we use Mode to analyze data and create shareable reports. Mode has its own SQL tutorial that also shares some tips specific to working in the Mode environment. I find their section on the UNION operator more comprehensive than other sources. If you’re interested in advanced topics, you can also explore their advanced tutorials.

    Complete some of their exercises in the Intermediate and Advanced sections. Notably, at this point you should start to become comfortable with aggregate functions (COUNT, SUM, MIN, MAX, AVG), the GROUP BY and HAVING keywords, CASE statements, joins, and date formats.


    And finally, bookmark this SQL cheat sheet for future reference

    Advanced resources

    The journey to learning SQL can sometimes feel endless - there are many keywords, functions, or techniques that can help you craft readable, performant queries. Here are some advanced concepts that might be useful to add to your toolbelt.

    Note: database systems and querying engines may have different functions built-in. At Shopify, we use Presto to query data across several datastores - consult the Presto docs on available functions




    Window Functions 

    Window functions are powerful functions that allow you to perform a calculation across a set of rows in a “window” of your data.

    The Mode SQL tutorial also shares some examples and practice problems for working with window functions
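    To make the idea concrete, here's a minimal sketch using Python's built-in sqlite3 module (the orders table and its data are invented for illustration; window functions require SQLite 3.25 or newer, which ships with Python 3.8+):

```python
import sqlite3

# Hypothetical table for illustration: one row per order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('alice', 10), ('alice', 20), ('bob', 5);
""")

# SUM(...) OVER (...) computes a running total within each customer's
# "window" of rows, without collapsing the rows like GROUP BY would.
rows = conn.execute("""
    SELECT customer,
           amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total
    FROM orders
    ORDER BY customer, amount
""").fetchall()
```

    Note that every row keeps its identity; a plain GROUP BY would instead return one row per customer.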

    Common Table Expressions (CTE)

    Common Table Expressions, aka CTEs, allow you to create a query that can be reused throughout the context of a larger query. They are useful for enhancing the readability and performance of your query.
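    Here's a small sketch of a CTE, again runnable through Python's sqlite3 module (the sales table and threshold are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 50), ("west", 30)])

# The CTE gives the aggregation a name, so the outer query reads top-down
# instead of nesting a subquery inline.
rows = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > 40
    ORDER BY region
""").fetchall()
```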

    Explain (Presto)

    The EXPLAIN keyword in Presto’s SQL engine allows you to see a visual representation of the operations performed by the engine in order to return the data required by your query, which can be used to troubleshoot performance issues. Other SQL engines may have alternative methods for creating an execution plan

    Grouping Sets

    Grouping sets can be used in the GROUP BY clause to define multiple groupings in the same query


    Coalesce

    Coalesce is used to evaluate a number of arguments and return the first non-NULL argument seen. For example, it can be used to set default values for a column holding NULL values.
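    A quick sketch of the default-value pattern, using Python's sqlite3 module with an invented users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, nickname TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ada", None), ("Grace", "Gracie")])

# COALESCE returns its first non-NULL argument, so rows with a NULL
# nickname fall back to the name column.
rows = conn.execute(
    "SELECT name, COALESCE(nickname, name) FROM users ORDER BY name"
).fetchall()
```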

    String functions

    More string functions (Presto)

    Explore functions for working with string values

    Date functions

    More date functions (Presto)

    Explore functions for working with date and time values

    Best practices

    More best practices

    These “best practices” share tips for enhancing the performance of your queries

    Magic of SQL (YouTube)

    Chris Saxon shares tips and tricks for advancing your SQL knowledge and tuning performance

    If you’re passionate about data at scale, and you’re eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    What Is Nix


    Over the past year and a bit, Shopify has been progressively rebuilding parts of our developer tooling with Nix. I initially planned to write about how we're using Nix now, and what we're going to do with it in the future (spoiler: everything?). However, I realize that most of you won't have a clear handle on what Nix is, and I haven't found much introductory material that conveys a clear impression quickly, so this article is going to be a crash course in what Nix is, how to think about it, and why it's such a valuable and paradigm-shifting piece of technology.

    There are a few places in this post where I will lie to you in subtle ways to gloss over all of the small nuances and exceptions to rules. I'm not going to call these out. I'm just trying to build a general understanding. At the end of this post, you should have the basic conceptual scaffolding you need in order to think about Nix. Let's dive in!

    What is Nix?

    The most basic, fundamental idea behind Nix is this:

    Everything on your computer implicitly depends on a whole bunch of other things on your computer.

    • All software exists in a graph of dependencies.
    • Most of the time, this graph is implicit.
    • Nix makes this graph explicit.

    Four Building Blocks

    Let's get this out of the way up front: Nix is a hard thing to explain.

    There are a few components that you have to understand in order to really get it, and all of their explanations are somewhat interdependent; and, even after explaining all of these building blocks, it still takes a bit of mulling over the implications of how they compose in order for the magic of Nix to really click. Nevertheless, we'll try, one block at a time.

    The major building blocks, at least in my mental model of Nix, are:

    1. The Nix Store
    2. Derivations
    3. Sandboxing
    4. The Nix Language.

    The Nix Store

    The easiest place to start is the Nix Store. Once you've installed Nix, you'll wind up with a directory at /nix/store, containing a whole bunch of entries that look something like this:


    This directory, /nix/store, is a kind of Graph Database. Each entry (each file or directory directly under /nix/store) is a Node in that Graph Database, and the relationships between them constitute Edges.

    The only thing that's allowed to write directories and files into /nix/store is Nix itself, and after Nix writes a Node into this Graph Database, it's completely immutable forever after: Nix guarantees that the contents of a Node don't change after it's been created. Further, due to magic that we'll discuss later, the contents of a given Node are guaranteed to be functionally identical to a Node with the same name in some other Graph, regardless of where they're built.

    What, then, is a "relationship between them?" Put another way, what is an Edge? Well, the first part of a Store path (the 32-character-long alphanumeric blob) is a cryptographic hash (of what, we'll discuss later). If a file in some other Store path includes the literal text "h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3", that constitutes a graph Edge pointing from the Node containing that text to the Node referred to by that path. Nodes in the Nix store are immutable after they're created, and the Edges they originate are scanned and cached elsewhere when they're first created.

    To demonstrate this linkage, if you run otool -L (or ldd on Linux) on the nix binary, you'll see a number of libraries referenced, and these look like:


    That's extracted by otool or ldd, but ultimately comes from text embedded in the binary, and Nix sees this too when it determines the Edges directed from this Node.

    Highly astute readers may be skeptical that scanning for literal path references in a Node after it's created is a reliable way to determine a dependency. For now, just take it as given that this, surprisingly, works almost flawlessly in practice.
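    To sketch the idea (this is an illustration, not Nix's actual implementation), a store path reference is just a recognizable literal string, so finding the Edges of a Node amounts to pattern matching over its contents:

```python
import re

# A store path is 32 base-32 characters, a dash, and a name. Scanning a
# Node's contents for this pattern finds its outgoing Edges.
STORE_PATH = re.compile(r"/nix/store/([0-9a-z]{32}-[\w.+-]+)")

# Stand-in for the bytes of a built binary.
node_contents = (
    "...ELF...loads /nix/store/"
    "h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3/lib/libnixstore.so..."
)
edges = STORE_PATH.findall(node_contents)
```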

    To put this into practice, we can demonstrate just how much of a Graph Database this actually is using nix-store --query. nix-store is a tool built into Nix that interacts directly with the Nix Store, and its --query mode has a multitude of flags for asking different questions of the Graph Database that is the Store.

    Let's find all of the Nodes that <hash>-nix-2.3.3 has Edges pointing to:

    $ nix-store --query --references /nix/store/h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3/
    ...(and 21 more)...

    Similarly, we could ask for the Edges pointing to this node using --referrers, or we could ask for the full transitive closure of Nodes reachable from the starting Node using --requisites.

    The transitive closure is an important concept in Nix, but you don't really have to understand the graph theory: An Edge directed from a Node is logically a dependency: if a Node includes a reference to another Node, it depends on that Node. So, the transitive closure (--requisites) also includes those dependencies' dependencies, and so on recursively, to include the total set of things depended on by a given Node.

    For example, a Ruby application may depend on the result of bundling together all the rubygems specified in the Gemfile. That bundle may depend on the result of installing the Gem nokogiri, which may depend on libxml2 (which may depend on libc or libSystem). All of these things are present in the transitive closure of the application (--requisites), but only the gem bundle is a direct reference (--references).
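    The distinction between direct references and the transitive closure can be sketched in a few lines of Python, with a toy graph standing in for the Store (the real queries are the nix-store commands above):

```python
# Toy dependency graph mirroring the example:
# app -> bundle -> nokogiri -> libxml2 -> libc.
graph = {
    "app": ["bundle"],
    "bundle": ["nokogiri"],
    "nokogiri": ["libxml2"],
    "libxml2": ["libc"],
    "libc": [],
}

def references(node):
    """Direct edges only, like `nix-store --query --references`."""
    return graph[node]

def requisites(node):
    """Transitive closure, like `nix-store --query --requisites`:
    the node plus everything reachable from it."""
    seen = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(graph[current])
    return seen
```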

    Now here's the key thing: This transitive closure of dependencies always exists, even outside of Nix: these things are always dependencies of your application, but normally, your computer is just trusted to have acceptable versions of acceptable libraries in acceptable places. Nix removes these assumptions and makes the whole graph explicit.

    To really drive home the "graphiness" of software dependencies, we can install Ruby via nix (nix-env -iA nixpkgs.ruby) and then build a graph of all of its dependencies:

    nix-store --query --graph $(which ruby) \
    | nix run nixpkgs.graphviz -c dot > ruby.svg

    Graphiness of Software Dependencies



    Derivations

    The second building block is the Derivation. Above, I offhandedly mentioned that only Nix can write things into the Nix Store, but how does it know what to write? Derivations are the key.

    A Derivation is a special Node in the Nix store, which tells Nix how to build one or more other Nodes.

    If you list your /nix/store, you'll see a whole lot of items most likely, but some of them will end in .drv:


    This is a Derivation. It's a special format written and read by Nix, which gives build instructions for anything in the Nix store. Just about everything (except Derivations) in the Nix store is put there by building a Derivation.

    So what does a Derivation look like?

    $ cat /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
    Derive([("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo","","")],[],[],"x86_64-darwin","/bin/sh",["-c","echo 'hello world' > $out"],[("builder","/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23/bin/bash"),("name","demo"),("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo"),("system","x86_64-darwin")])

    That's not especially readable, but there's a couple of important concepts to communicate here:

    • Everything required to build this Derivation is explicitly listed in the file by path (you can see "bash" here, for example).
    • The hash component of the Derivation's path in the Nix Store is essentially a hash of the contents of the file.

    Since every direct dependency is mentioned in the contents, and the path is a hash of the contents, that means that if the dependencies and whatever other information the derivation contains don't change, the hash won't change, but if a different version of a dependency is used, the hash changes.

    There are a few different ways to build Derivations. Let's use nix-build:

    $ nix-build /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv

    This ran whatever the build instructions were and generated a new path in the Nix Store (a new Node in the Graph Database).

    Take a close look at the hash in the newly-created path. You'll see the same hash in the Derivation contents above. That output path was pre-defined, but not pre-generated. The output path is also a stable hash. You can essentially think of it as being a hash of the derivation and also the name of the output (in this case: "out"; the default output).

    So, if a dependency of the Derivation changes, that changes the hash of the Derivation. It also changes the hashes of all of that Derivation's outputs. This means that changing a dependency of a dependency of a dependency bubbles all the way down the tree, changing the hashes of every Derivation and all those Derivation's outputs that depend on the changed thing, directly or indirectly.
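    A toy model in Python shows why the change bubbles: if a path is a hash of contents, and contents embed dependency paths, then changing a dependency changes every downstream path. (This is a deliberate simplification that ignores the details of Nix's actual hashing scheme.)

```python
import hashlib

def store_path(name, content):
    """Toy content-addressed path: a hash of the content, like a store entry."""
    digest = hashlib.sha256(content.encode()).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}"

# A dependency's path is embedded in its dependent's content, so a new
# version of the dependency changes every hash downstream of it.
libc_v1 = store_path("libc", "libc source v1")
libxml2_v1 = store_path("libxml2", "libxml2, depends on " + libc_v1)

libc_v2 = store_path("libc", "libc source v2")
libxml2_v2 = store_path("libxml2", "libxml2, depends on " + libc_v2)
```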

    Let's break apart that unreadable blob of Derivation content from above a little bit.

    • outputs: What nodes can this build?
    • inputDrvs: Other Derivations that must be built before this one
    • inputSrcs: Things already in the store on which this build depends
    • platform: Is this for macOS? Linux?
    • builder: What program gets run to do the build?
    • args: Arguments to pass to that program
    • env: Environment variables to set for that program

    Or, to dissect that Derivation:



    outputs

    This Derivation has one output, named "out" (the default name), with some path that would be generated if we were to build it.


    inputDrvs

    [ ]

    This is a simple toy derivation, with no inputDrvs. What this really means is that there are no dependencies, other than the builder. Normally, you would see something more like:


    This indicates a dependency on the OpenSSH Derivation's default output.


    inputSrcs

    [ ]

    Again, we have a very simple toy Derivation! Commonly, you will see:


    It's not really critical to the mental model, but Nix can also copy static files into the Nix Store in some limited ways, and these aren't really constructed by Derivations. This field just lists any static files in the Nix store on which this Derivation depends.



    platform

    Nix runs on multiple platforms and CPU architectures, and often the output of compilers will only work on one of these, so the derivation needs to indicate which architecture it's intended for.

    There's actually an important point here: Nix Store entries can be copied around between machines without concern, because all of their dependencies are explicit. The CPU details are a dependency in many cases.



    builder

    This program is executed with args and env, and is expected to generate the output(s).


    args

    ["-c","echo 'hello world' > $out"]

    You can see that the output name ("out") is being used as a variable here. We're running, basically, bash -c "echo 'hello world' > $out". This should just be writing the text "hello world" into the Derivation output.



    env

    Each of these is set as an Environment Variable before calling the builder, so you can see how we got that $out variable above, and note that it's the same as the path given in outputs above.

    Derivation in Summary

    So, if we build that Derivation, let's see what the output is:

    $ nix-build /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
    $ cat /nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo
    hello world

    As we expected, it's "hello world".

    A Derivation is a recipe to build some other path in the Nix Store.


    Sandboxing

    After walking through that Derivation in the last section, you may be starting to develop a feel for how explicitly-declared dependencies make it into the build, and how that Graph structure comes together. But what prevents builds from referring to things at undeclared paths, or things that aren't in the Nix store at all?

    Nix does a lot of work to make sure that builds can only see the Nodes in the Graph which their Derivation has declared, and also, that they don't access things outside of the store.

    A Derivation build simply cannot access anything not declared by the Derivation. This is enforced in a few ways:

    • For the most part, Nix uses patched versions of compilers and linkers that don't try to look in the default locations (/usr/lib, and so on).
    • Nix typically builds Derivations in an actual sandbox that denies access to everything that the build isn't supposed to access.

    A Sandbox is created for a Derivation build that gives filesystem read access to—and only to—the paths explicitly mentioned in the Derivation.

    What this amounts to is that artifacts in the Nix Store essentially can't depend on anything outside of the Nix Store.

    The Nix Language

    And finally, the block that brings it all together: the Nix Language.

    Nix has a custom language used to construct derivations. There's a lot we could talk about here, but there are two major aspects of the language's design to draw attention to. The Nix Language is:

    1. lazy-evaluated
    2. (almost) free of side-effects.

    I'll try to explain these by example.

    Lazy Evaluation

    Take a look at this code:

    data = {
      a = 1;
      b = functionThatTakesMinutesToRun 1;
    };

    This is Nix code. You can probably figure out what's going on here: we're creating something like a hash table containing keys "a" and "b", and "b" is the result of calling an expensive function.

    In Nix, this code takes approximately no time to run, because the value of "b" isn't actually evaluated until it's needed. We could even:

    let
      data = {
        a = 1;
        b = functionThatTakesMinutesToRun 1;
      };
    in data.a

    Here, we're creating the table (technically called an Attribute Set in Nix), and extracting "a" from it.

    This evaluates to "1" almost instantly, without ever running the code that generates "b".
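    If the laziness feels unfamiliar, the same idea can be mimicked in Python by storing thunks (zero-argument functions) and only calling them on demand. This analogy is mine, not how Nix is implemented:

```python
# Track whether the expensive computation ever actually ran.
calls = []

def expensive():
    calls.append("expensive ran")
    return 42

# Like the Nix attribute set: values are thunks, not computed results.
data = {
    "a": lambda: 1,
    "b": expensive,  # never called unless someone asks for "b"
}

result = data["a"]()  # only "a" is forced; "b" never runs
```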

    Conspicuously absent in the code samples above is any sort of actual work getting done, other than just pushing data around within the Nix language. The reason for this is that the Nix language can’t actually do very much.

    Free of Side Effects (almost)

    The Nix language lacks a lot of features you will expect in normal programming languages. It has

    • No networking
    • No user input
    • No file writing
    • No output (except limited debug/tracing support).

    It doesn't actually do anything at all in terms of interacting with the world…well, except for when you call the derivation function.

    One Side Effect

    The Nix Language has precisely one function with a side effect. When you call derivation with the right set of arguments, Nix writes out a new <hash>-<name>.drv file into the Nix Store as a side effect of calling that function.

    For example:

    derivation {
      name = "demo";
      builder = "${bash}/bin/bash";
      args = [ "-c" "echo 'hello world' > $out" ];
      system = "x86_64-darwin";
    }

    If you evaluate this in nix repl, it will print something like:

    «derivation /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv»

    That returned object is just the object you passed in (with name, builder, args, and system keys), but with a few extra fields (including drvPath, which is what got printed after the call to derivation), but importantly, that path in the Nix store was actually created.

    It's worth emphasizing again: This is basically the only thing that the Nix Language can actually do. There's a whole lot of pushing data and functions around in Nix code, but it all boils down to calls to derivation.

    Note that we referred to ${bash} in that Derivation. This is actually the Derivation from earlier in this article, and that variable substitution is actually how Derivations depend on each other. The variable bash refers to another call to derivation, which generates instructions to build bash when it's evaluated.

    The Nix Language doesn't ever actually build anything. It creates Derivations, and later, other Nix tools read those derivations and build the outputs. The Nix Language is just a Domain Specific Language for creating Derivations.

    Nixpkgs: Derivation and Lazy Evaluation

    Nixpkgs is the global default package repository for Nix, but it's very unlike what you probably think of when you hear "package repository."

    Nixpkgs is a single Nix program. It makes use of the fact that the Nix Language is Lazy Evaluated, and includes many, many calls to derivation. The (simplified but) basic structure of Nixpkgs is something like:

    {
      ruby = derivation { ... };
      python = derivation { ... };
      nodejs = derivation { ... };
    }

    In order to build “ruby”, various tools just force Nix to evaluate the “ruby” attribute of that Attribute Set, which calls derivation, generating the Derivation for Ruby into the Nix Store and returning the path that was built. Then, the tool runs something like nix-build on that path to generate the output.

    Shipit! Presents: How Shopify Uses Nix

    Well, it takes a lot more words than I can write here—and probably some amount of hands-on experimentation—to let you really, viscerally, feel the paradigm shift that Nix enables, but hopefully I’ve given you a taste.

    If you’re looking for more Nix content, I’m currently re-releasing a series of screencasts I recorded for developers at Shopify to the public. Check out Nixology on YouTube.

    You can also join me for a discussion about how Shopify is using Nix to rebuild our developer tooling. I’ll cover some of this content again, and show off some of the tooling we actually use on a day-to-day basis.

    What: ShipIt! Presents: How Shopify Uses Nix

    Date: May 25, 2020 at 1:00 pm EST

    Please view the recording at

    If you want to work on Nix, come join my team! We're always hiring, so visit our Engineering career page to find out about our open positions. 

    Continue reading

    A Brief History of TLS Certificates at Shopify


    Transport Layer Security (TLS) encryption may be commonplace in 2020, but this wasn’t always the case. Back in 2014, our business owner storefront traffic wasn’t encrypted. We manually provisioned the few TLS certificates that were in production. In this post, we’ll cover Shopify’s journey from manually provisioning TLS certificates to the fully automated system that supports over 1M business owners today.

    In the Beginning

    Up to 2014, only business owner shop administration and checkout traffic were encrypted. All checkouts were on the domain. Secured shop administration functions used the * certificate and a single-domain certificate. Our Operations team renewed the certificates manually as needed. During this time, teams began research on what it would take for us to offer TLS encryption for all business owners in an automated fashion.

    Shopify Plus

    We launched Shopify Plus in early 2014. One of Plus’s earliest features was TLS encrypted storefronts. We manually provisioned certificates, adding new domains to the Subject Alternative Name (SAN) list as required. As our certificate authority placed a limit on the number of domains per certificate, certificates were added to support the new domains being onboarded. At the time, Internet Explorer on Windows XP was still used by a significant number of users, which prevented our use of the Server Name Indication (SNI) extension.

    While this addressed our immediate needs, there were several drawbacks:

    • Manual certificate updates and provisioning were labor-intensive and needed to be handled with care.
    • Additional IP addresses were needed to support new certificates.
    • Having domains for non-related shops in a single certificate wasn’t ideal.

    The pace of onboarding was manageable at first. As we onboarded more merchants, it was apparent that this process wasn’t sustainable. At this point, there were dozens of certificates that all had to be manually provisioned and renewed. For each Plus account onboarded, the new domains had to be manually added. This was labor-intensive and error-prone. We worked on a fully automated system during Shopify’s Hack Days, and it became a fully staffed project in May 2015.

    Shopify’s Notary System

    Automating TLS certificates had to address multiple facets of the process, including:

    • How are the certificates provisioned from the certificate authority?
    • How to serve the certificates at scale?
    • What other considerations are there for offering encrypted storefronts?

    Shopify's Notary System

    Provisioning Certificates

    Our Notary system provisions certificates. When a business owner adds a domain to their shop, the system receives a request for a certificate to be provisioned. The certificate provisioning is fully automated via Application Programming Interface (API) calls to the certificate authority. This includes the order request, domain ownership verification, and certificate/private key pair delivery. Certificate renewals are performed automatically in the same fashion.

    While it makes sense that we group domains from a shop to one certificate, the system handles all domains separately for simplicity. Each certificate has one domain with a unique private key. The certificate and private key are stored in a relational database. This relational database is accessible by the load balancers for terminating TLS connections.

    Scaling Up Certificate Provisioning

    At the time, we hosted our nginx load balancers at our datacenters. Storing the TLS certificates on disk and reloading nginx when certificates changed wasn’t feasible. In a past article, we talked about our use of nginx and OpenResty Lua modules. Using OpenResty allowed us to script nginx to serve dynamic content outside of the nginx configuration. In addition, browser support for the TLS SNI extension was almost universal. By leveraging the TLS SNI extension, we dynamically load TLS certificates from our database in a Lua middleware via the ssl_certificate_by_lua module. Certificates and private keys are directly accessible from the relational database via a single SQL query. An in-memory Least Recently Used (LRU) cache reduced the latency of TLS handshakes for frequently accessed domains.
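    The caching idea is easy to sketch in Python with functools.lru_cache (the certificate store and domain below are hypothetical stand-ins for the real database query performed by the Lua middleware):

```python
from functools import lru_cache

# Hypothetical certificate store: a dict standing in for the relational
# database of certificate/private key pairs.
CERT_DB = {
    "shop.example.com": ("CERT-PEM", "KEY-PEM"),
}
db_hits = []  # record how often we actually query the "database"

@lru_cache(maxsize=1024)
def cert_for_domain(domain):
    """Look up the cert/key pair for an SNI hostname, caching hot domains."""
    db_hits.append(domain)
    return CERT_DB.get(domain)

# The first handshake for a domain hits the database; repeats come from cache.
first = cert_for_domain("shop.example.com")
second = cert_for_domain("shop.example.com")
```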

    Solving Mixed Content Warnings

    With TLS certificates in place for business owner shop domains, we could offer encrypted storefronts for all shops. However, there was still a significant hurdle to overcome. Each shop’s theme could have images or assets referencing non-encrypted Uniform Resource Locators (URLs). Mixing of encrypted and unencrypted content would cause the browser to display a Mixed Content warning, denoting that some resources on the page are not encrypted. To resolve this problem, we had to process all the shop themes to replace references to HTTP with HTTPS.

    With all the infrastructure in place, we realized the goal of supporting encrypted storefronts for all merchants in February 2016. The same system is still in place and has scaled to provide TLS certificates for all of our 1M+ merchants.

    Let’s Encrypt!

    Let’s Encrypt is a non-profit certificate authority that provides TLS certificates at no charge. Shopify has been and is currently a sponsor. The service launched in April 2016, shortly after our Notary went into production. With the exception of Extended Verification (EV) certificates and other special cases, we’ve migrated away from our paid certificate authority in favor of Let’s Encrypt.

    Move to the Cloud

    In June 2019, our network edge moved from our datacenter to a cloud provider. Our requirement to support a very large number of TLS certificates drastically reduced the list of viable vendors. Once the cloud provider was selected, our TLS provisioning system had to be adapted to work with their system. There were two paths forward: using the cloud provider’s managed certificates, or continuing to provision Let’s Encrypt certificates and uploading them. The initial migration leveraged the provider’s certificate provisioning.

    Using managed certificates from the cloud provider has the advantage of being maintenance-free after they’ve been provisioned. There are no storage concerns for certificates and private keys. In addition, certificates are automatically renewed by the vendor. Administrative work was required during the migration to guide merchants to modify their domain’s Certification Authority Authorization (CAA) Domain Name System (DNS) records as needed. Backfilling the certificates for our 1M+ merchants took several weeks to complete.

    After the initial successful migration to our cloud provider, we revisited the certificate provisioning strategy. As we maintain an alternate edge network for contingency, the Notary infrastructure is still in place to provide certificates for that infrastructure. The intent of using provider-managed certificates was for them to be a stepping stone toward deprecating Notary in the future. While the cloud provider-provisioned certificates worked well for us, there were now two sets of certificates to keep synchronized. To simplify certificate state and reduce operational load, we now use the Notary-provisioned certificates for both edge networks. Instead of provisioning certificates on our cloud provider, certificates from Notary are uploaded as new ones are required.

    Outside of our business owner shop storefronts, we rely on nginx for other services that are part of our cloud infrastructure. Some of our Lua middleware, including the dynamic TLS certificate loading code, was contributed to the ingress-nginx Kubernetes project.

    Our TLS certificate journey took us from a handful of manually provisioned certificates to a fully automated system that can scale up to support over 1M merchants. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Learn about the actions we’re taking as we continue to hire during COVID‑19.

    Continue reading

    Dev Degree: Behind the Scenes


    On April 24th, we proudly celebrated the graduation of our first Dev Degree program cohort. This milestone holds a special place in Shopify history because it’s a day born out of trial and error, experimentation, iteration, and hustle. The 2016 cohort had the honor and challenge of being our first class, lived through the churn and pivots of a newly designed program, and completed their education during a worldwide pandemic, thrust into remote learning and work. The students’ success is a testament to their dedication, adaptability, and grit. It’s also the product of a thoughtfully-designed program and a high-functioning Dev Degree team.

    What does it take to create an environment where students can thrive and develop into work-ready employees in four years?

    The Dev Degree program is our answer. The key to its success is our learning structure and multidisciplinary team. With our model, students master development skills faster than with traditional methods.

    The Dev Degree Program Structure

    When we set out to shake-up software education in 2016, we had no prescriptive blueprint to guide us and no tried-and-true best practices. Still, we embraced the opportunity to forge a new path in partnership with trusted university advisors and experienced internal educators at Shopify.

    Our vision was to create an alternative to the traditional co-op model that alternates between university studies and work placements. In Dev Degree, students receive four years of continual hands-on developer experience at Shopify, through skills training and team placements, in parallel to attending university classes. This model accelerates understanding and allows students to apply classroom theory to real-life product development across a breadth of technology.

    Dev Degree timeline: Year 1 is developer skills training at Shopify; in years 2 through 4, students join a new development team with a new mentor every 8 or 12 months.

    While computer science and technology are at the core of our learning model, what elevates the program is the focus on personal growth, actionable feedback loops, and the opportunity to make an impact on the company, coworkers, and merchants.

    University Course Curriculum

    The Dev Degree program leads to an accredited Computer Science degree, which is a deciding factor for many students and their families exploring post-secondary education opportunities. All required core theoretical concepts, computer science courses, math courses, and electives are defined by and taught at the universities. Students take three university courses per semester while working 25 hours per week at Shopify throughout the four-year program. All formal assessments, grading, and final exams for university courses are carried out by the universities.

    Dev Degree structure: 20 hours/week at Carleton or York and 25 hours/week at Shopify

    While the universities set the requirements for the courses, we work collaboratively to define the course sequencing to ensure the students are exposed to computer science content as early as possible in their program before they start work on team placements at Shopify.

    In addition to the core university courses, there are internship courses that teach software development concepts applicable to the technology industry. The universities assess the learning outcomes of the internship courses through practicum reports and meetings with university supervisors or advisors.

    The courses and concepts taught at Shopify build on the university courses and teach students hands-on development skills, communication skills, developer tools training, and how to contribute to a real-world product development team effectively.

    Developer Skills Training: Building a Strong Foundation

    One of the lessons we learned early in the program was that students need a solid foundation of developer skills before being placed on development teams to feel confident and ready to contribute. The first year at Shopify sets the Dev Degree program apart from other work-integrated learning programs because we immerse the students in our Shopify-led developer skills training.

    In the first year, we introduce the students to skills and tools that form the foundation for how they work at Shopify and other companies. There are skills they need to develop before moving into teams, such as working with code repositories and committing code, using a command line, front-end development, working with data, and more.

    The breadth of technical skills that students learn in their first year in Dev Degree goes beyond the traditional university curriculum. This foundation allows students to confidently join their first placement team and have an immediate impact.

    We teach this way on purpose. Universities often choose a bottom-up learning model by front-loading theory and concepts. We designed our program to immerse students somewhere in the middle of top-down and bottom-up, allowing them to discover the fundamentals gradually after they develop base skills and code a bit every day.

    Due to the ever-evolving nature of software development, we update the developer skills training path often. Our current courses include the following technologies:

    • Command Line Interface (CLI)
    • Vim
    • Git & GitHub
    • Ruby
    • HTML, CSS, JavaScript
    • Databases
    • Ruby on Rails
    • React
    • TypeScript
    • GraphQL

    Team Placements: Working on Merchant-Facing Projects

    After they’ve completed their developer skills training courses, students spend the next three years on team placements. This is a big deal. On team placements, students get to apply what they learn in their developer skills training at Shopify and from their university courses to meaningful, real-world software development work. Our placements are purposefully-designed to expose students to a wide range of disciplines and teams to make them well-rounded developers, give them new perspectives, and introduce them to new people.

    Working with their placement specialist, students interview with teams from back-end, front-end, data, security, and production engineering disciplines.

    Over the course of the Dev Degree program, each student receives:

    • One 12-month team placement
    • Three 8-month team placements
    • Four different technical mentors
    • A dedicated Life@Shopify mentor
    • Twenty hours per week working on a Shopify team
    • Actionable feedback on a regular cadence
    • Evaluations from mentors and leads
    • High-touch support from the Dev Degree team
    • Access to new people and workplace culture

    By the time students complete the program, they’ve been on four back-to-back team placements in their final three years at Shopify. This experience makes them a valuable asset to their future company. It also allows students to launch their careers with the confidence that they are well-prepared to make a positive contribution.

    It Takes a Village: Building an Impactful Program Team

    Creating a successful work-integrated learning program requires a significant commitment of time and resources from a team that spans multiple disciplines and functions. While the Dev Degree program team is responsible for the bulk of the heavy-lifting, including logistics, mentorship, and support, the program doesn’t happen without expertise and time from other Shopify subject matter experts and university stakeholders.

    Dev Degree Program Team

    The Dev Degree team is the most actively involved in all aspects of the program and with the students, from onboarding to graduation. They are responsible for ensuring that the program meets the needs of the students, the university, and Shopify.

    Program Leads

    The Dev Degree program leads are the liaison between Shopify and our university partners. We have a program lead to represent each partnership, and they keep this ambitious program on the rails, including working with educators to define the curriculum and developer skills training courses. They are also responsible for hiring and evaluating student performance.

    Student Success Specialists

    Many of the Dev Degree students come to Shopify straight from high school, which can be daunting. In traditional co-op programs, students have a couple of years of university experience before starting their internships and being dropped into a professional workplace setting. To ease the transition to Shopify, our student success specialists are responsible for supporting students’ well-being, connecting them with other mentors, helping them learn how to become effective communicators, and being the voice of the students at Shopify. This nurturing environment helps protect first-year students from being overwhelmed and underprepared for team placements.

    Placement Specialists

    Team placements are an integral part of the applied learning in the Dev Degree program. Placement specialists are responsible for coordinating and overseeing all four placements for each cohort. This high-touch role requires extensive relationship-building with development teams and a deep understanding of the goals and interests of the students to ensure appropriate compatibility. To ensure that development teams get a return on their investment in mentorship, placement specialists place students on teams where they can be impactful. They also support the leads and mentors on the development teams and play an active role in advocating for an influential culture of mentorship within Shopify.


    Educators

    The courses we teach at Shopify are foundational in preparing students for their team placements. Dev Degree educators have an extensive background in education and computer science and are responsible for building out the curriculum for all the developer skills training courses taught at Shopify. They design and deliver the course material and evaluate the students on technical expertise and subject knowledge. The instructors create courses for a wide range of development skills relevant to development at Shopify and other companies.

    Recruitment Team

    As with all recruitment at Shopify, we aim to recruit a diverse mix of students to the Dev Degree program. Our talent team is actively involved in helping us create a recruitment strategy to engage and attract top talent from a variety of schools and programs, including university meetups and mentorship in youth programs like Technovation.


    Mentors

    After four years of having Dev Degree students on teams across twelve disciplines, the program is woven into the Shopify culture, and mentorship plays a big role.

    Development Team Mentors

    Development team mentors are critical to helping students build confidence, technical skills, and gain the experience needed to become an asset to the team. Mentors are responsible for guiding, evaluating, and providing actionable feedback to students throughout their 8-month placements. Mentorship requires a strong commitment and takes up about 10% of a developer’s time. We feel it’s worth the investment to build mentors and invest in a culture of mentorship. It’s a challenging but rewarding role, and especially helpful to developers looking to grow leadership skills and level up in their roles.

    Life@Shopify Mentors

    In addition to placement mentors, we also have experienced Shopify employees who volunteer to mentor students as they navigate through the program, their placements, and the company on the whole. These Life@Shopify mentors act as a trusted guide and help round out the mentorship experience at Shopify.

    University Stakeholders

    Close relationships between the universities and Shopify help integrate the theory and development practices and deepen both the understanding of concepts and work experience. We’re fortunate to have both Carleton and York University as part of the Dev Degree program and fully engaged in the model that we’ve built. The faculty advisors play an active role in working with the students to guide them on their course selections, navigate the program, and evaluate their internship courses and practicums. Without university buy-in and support, a program like Dev Degree doesn’t happen.

    Dev Degree is Worth the Investment

    Building a new work-integrated learning program requires a big commitment of company time, resources, and cost, but we are reaping the benefits of our gamble.

    • Graduates are well-rounded developers with a rich development experience across a range of teams and disciplines.
    • 88% of the 2016 cohort have been offered, and have accepted, full-time positions here at Shopify.
    • Students who accept positions at Shopify have already built four years of relationships and have acquired vast knowledge and skills that will help them make an immediate impact on their teams.
    • We are building future leaders through mentorship. 

    While we are excited about how far we’ve come, we still have room to grow. We are looking at metrics and data to help us quantify the success of the program and to drive program improvements to take computer science education to a new level. When we started this ambitious endeavor, we wanted to mature it to a point where we could create a blueprint of the Dev Degree program for other companies and universities to adopt it and evolve it. There’s interest in what we’re doing here. It’s just a matter of time before we help make the Dev Degree model of computer science education the norm rather than the exception.

    Additional Information

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions, and learn about the actions we’re taking as we continue to hire during COVID‑19.


    How to Fix Slow Code in Ruby

    By Jay Lim and Gannon McGibbon

    At Shopify, we believe in highly aligned, loosely coupled teams to help us move fast. Since we have many teams working independently on a large monolithic Rails application, inefficiencies in code are sometimes inadvertently added to our codebase. Over time, these problems can add up to serious performance regressions.

    By the time such performance regressions are noticeable, it might already be too late to track offending commits down. This can be exceedingly challenging on codebases with thousands of changes being committed each day. How do we effectively find out why our application is slow? Even if we have a fix for the slow code, how can we prove that our new code is faster?

    It all starts with profiling and benchmarking. Last year, we wrote about writing fast code in Ruby on Rails. Knowing how to write fast code is useful, but insufficient without knowing how to fix slow code. Let’s talk about the approaches that we can use to find slow code, fix it, and prove that our new solution is faster. We’ll also explore some case studies that feature real world examples on using profiling and benchmarking.


    Profiling

    Before we dive into fixing slow code, we need to find it first. Identifying the code that causes performance bottlenecks can be challenging in a large codebase. Profiling helps us do so.

    What is Profiling?

    Profiling is a type of program analysis that collects metrics about the program at runtime, such as the frequency and duration of method calls. It’s carried out using a tool known as a profiler, and a profiler’s output can be visualized in various ways. For example, flat profiles, call graphs, and flamegraphs.

    Why Should I Profile My Code?

    Some issues are challenging to detect by just looking at the code (static analysis, code reviews, etc.). One of the main goals of profiling is observability. By knowing what is going on under the hood during runtime, we gain a better understanding of what the program is doing and reason about why an application is slow. Profiling helps us to narrow down the scope of a performance bottleneck to a particular area.

    How Do I Profile?

    Before we figure out what to profile, we need to first figure out what we want to know: do we want to measure elapsed time for a specific code block, or do we want to measure object allocations in that code block? In terms of granularity, do we need elapsed time for every single method call in that code block, or do we just need the aggregated value? Elapsed time here can be further broken down into CPU time or wall time.

    For measuring elapsed time, a simple solution is to measure the start time and the end time of a particular code block, and report the difference. If we need a higher granularity, we do this for every single method. To do this, we use the TracePoint API in Ruby to hook into every single method call made by Ruby. Similarly, for object allocations, we use the ObjectSpace module to trace object allocations, or even dump the Ruby heap to observe its contents.
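    As a toy sketch of that idea (not our production tooling; the method names are illustrative), TracePoint can count every Ruby method call made inside a block:

```ruby
# Minimal sketch: count every Ruby-level method call made inside a block.
# A real profiler would aggregate timings or sample instead of counting.
def trace_calls
  calls = Hash.new(0)
  tp = TracePoint.new(:call) do |event|
    calls["#{event.defined_class}##{event.method_id}"] += 1
  end
  tp.enable
  yield
  tp.disable
  calls
end

# Illustrative subject method.
def greet(name)
  "Hello, #{name}!"
end

counts = trace_calls { 3.times { greet("world") } }
puts counts # method names mapped to call counts, e.g. {"Object#greet"=>3}
```

    Note that `:call` events only fire for Ruby-defined methods; C-implemented methods like `Integer#times` would need the `:c_call` event.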

    However, instead of building custom profiling solutions, we can use one of the available profilers out there, and each has its own advantages and disadvantages. Here are a few options:

    1. rbspy

    rbspy samples stack frames from a Ruby process over time. The main advantage is that it can be used as a standalone program without needing any instrumentation code.

    Once we know the Ruby Process Identifier (PID) that we want to profile, we start the profiling session like this:

    rbspy record --pid $PID

    2. stackprof

    Like rbspy, stackprof samples stack frames over time, but from a block of instrumented Ruby code. Stackprof is used as a profiling solution for custom code blocks:

    profile = StackProf.run(mode: :cpu) do
      # Code to profile
    end

    3. rack-mini-profiler

    The rack-mini-profiler gem is a fully-featured profiling solution for Rack-based applications. Unlike the other profilers described in this section, it includes a memory profiler in addition to call-stack sampling. The memory profiler collects data such as Garbage Collection (GC) statistics, number of allocations, etc. Under the hood, it uses the stackprof and memory_profiler gems.

    4. app_profiler

    app_profiler is a lightweight alternative to rack-mini-profiler. It contains a Rack-only middleware that supports call-stack profiling for web requests. In addition to that, block level profiling is also available to any Ruby application. These profiles can be stored in a configurable storage backend such as Google Cloud Storage, and can be visualized through a configurable viewer such as Speedscope, a browser-based flamegraph viewer.

    At Shopify, we collect performance profiles in our production environments. Rack Mini Profiler is a great gem, but it comes with a lot of extra features such as database and memory profiling, and it seemed too heavy for our use case. As a result, we built App Profiler that similarly uses Stackprof under the hood. Currently, this gem is used to support our on-demand remote profiling infrastructure for production requests.

    Case Study: Using App Profiler on Shopify

    An example of a performance problem that was identified in production was related to unnecessary GC cycles. Last year, we noticed that a cart item with a very large quantity used a ridiculous amount of CPU time and resulted in slow requests. It turns out, the issue was related to Ruby allocating too many objects, triggering the GC multiple times.

    The figure below illustrates a section of the flamegraph for a similar slow request, and the section corresponds to approximately 500ms of CPU time.

    A section of the flamegraph for a similar slow request

    The highlighted chunks correspond to the GC operations, and they interleave with the regular operations. From this section, we see that GC itself consumed about 35% of CPU time, which is a lot! We inferred that we were allocating too many Ruby objects. Without profiling, it’s difficult to identify these kinds of issues quickly.


    Benchmarking

    Now that we know how to identify performance problems, how do we fix them? While the right solution is largely context sensitive, validating the fix isn’t. Benchmarking helps us prove performance differences between two or more code paths.

    What is Benchmarking?

    Benchmarking is a way of measuring the performance of code. Often, it’s used to compare two or more similar code paths to see which code path is the fastest. Here’s what a simple ruby benchmark looks like:
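    A minimal version of such a snippet, using Benchmark.realtime from the standard library (the method name and its sleep are illustrative stand-ins for real work):

```ruby
require "benchmark"

# Illustrative subject method; the sleep stands in for real work.
def slow_method
  sleep(0.05)
end

# Benchmark.realtime returns the wall-clock seconds the block took.
elapsed = Benchmark.realtime { slow_method }
puts "#{elapsed.round(2)} seconds"
```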

    This code snippet is benchmarking at its simplest. We’re measuring how long a method takes to run in seconds. We could extend the example to measure a series of methods, a complex math equation, or anything else that fits into a block. This kind of instrumentation is useful because it can unveil regression or improvement in speed over time.

    While wall time is a pretty reliable measure of “performance”, there are other ways to measure code besides realtime. The Ruby standard library’s Benchmark module also includes bm and bmbm.

    The bm method shows a more detailed breakdown of timing measurements. Let’s take a look at a script with some output:
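    A sketch of such a script (the workloads are illustrative), along with the shape of its output:

```ruby
require "benchmark"

# Benchmark.bm prints a report per block and returns the timing objects.
reports = Benchmark.bm(10) do |x|
  x.report("sum:")     { (1..1_000_000).sum }
  x.report("strings:") { 100_000.times { "a" + "b" } }
end

# Example output (numbers will vary):
#                  user     system      total        real
# sum:         0.012000   0.000000   0.012000 (  0.012419)
# strings:     0.018000   0.000000   0.018000 (  0.018111)
```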

    User, system, and total are all different measurements of CPU time. User refers to time spent working in user space. Similarly, system denotes time spent working in kernel space. Total is the sum of CPU timings, and real is the same wall time measurement we saw from Benchmark.realtime.

    What about bmbm? Well, it is exactly the same as bm with one unique difference. Here’s what the output looks like:
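    A sketch with bmbm (the workload is illustrative); note the rehearsal section that precedes the measured run:

```ruby
require "benchmark"

# bmbm runs every block once as a rehearsal, then measures for real.
results = Benchmark.bmbm do |x|
  x.report("array push") { arr = []; 100_000.times { |i| arr << i } }
end

# Example output (numbers will vary):
# Rehearsal ----------------------------------------------
# array push   0.014000   0.001000   0.015000 (  0.015327)
# ------------------------------------- total: 0.015000sec
#
#                  user     system      total        real
# array push   0.013000   0.000000   0.013000 (  0.013120)
```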

    The rehearsal, or warmup, step is what makes bmbm useful. It runs each benchmark code block once before measuring to prime any caching or similar mechanisms, producing more stable, reproducible results.

    Lastly, let’s talk about the benchmark-ips gem. This is the most common method of benchmarking Ruby code, and you’ll see it a lot in the wild. Here’s what a simple script looks like:

    Here, we’re benchmarking the same method using familiar syntax with the ips method. Notice the inline bundler and gemfile code. We need this in a scripting context because benchmark-ips isn’t part of the standard library. In a normal project setup, we add gem entries to the Gemfile as usual.

    The output of this script is as follows:

    Ignoring the bundler output, we see the warmup score (iterations per 100 milliseconds, run for the default of 2 seconds) and how many times per second the code block was able to run during the default 5 seconds of measurement. It’ll become more apparent later why benchmark-ips is so popular.

    Why Should I Benchmark My Code?

    So, now we know what benchmarking is and some tools available to us. But why even bother benchmarking at all? It may not be immediately obvious why benchmarking is so valuable.

    Benchmarks are used to quantify the performance of one or more blocks of code. This becomes very useful when there are performance questions that need answers. Often, these questions boil down to “which is faster, A or B?”. Let’s look at an example:

    In this script, we’re doing most of what we did in the first benchmark-ips example. Pay attention to the addition of another method, and how it changes the benchmark block. When benchmarking more than one thing at once, simply add another report block. Additionally, the compare! method prints a comparison of all reports:

    Wow, that’s pretty snazzy! compare! is able to tell us which benchmark is slower and by how much. Given the amount of thread sleeping we’re doing in our benchmark subject methods, this aligns with our expectations.

    Benchmarking can be a means of proving how fast a given code path is. It’s not uncommon for developers to propose a code change that makes a code path faster without any evidence.

    Depending on the change, comparison can be challenging. As in the previous example, benchmark-ips may be used to benchmark individual code paths. Running the same single-report benchmark on both versions of the code easily tests pre- and post-patch performance.

    How Do I Benchmark My Code?

    Now we know what benchmarking is and why it is important. Great! But how do you get started benchmarking in an application? Trivial examples are easy to learn from but aren’t very relatable.

    When developing in a framework like Ruby on Rails, it can be difficult to understand how to set up and load framework code for benchmark scripts. Thankfully, one of the latest features of Ruby on Rails can generate benchmarks automatically. Let’s take a look:

    This benchmark can be generated by running bin/rails generate benchmark my_benchmark, placing a file in script/benchmarks/my_benchmark.rb. Note the inline gemfile isn’t required because we piggyback off of the Rails app’s Gemfile. The benchmark generator is slated for release in Rails 6.1.

    Now, let’s look at a real world example of a Rails benchmark:
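    The benchmark script itself isn’t reproduced here, but the change under test can be sketched in plain Ruby (in the real benchmark, Order is an Active Record model and the names differ):

```ruby
# Plain-Ruby sketch of the optimization being benchmarked; the real
# version subclasses an Active Record model. Names are illustrative.
class Order
  def initialize(line_items)
    @line_items = line_items
  end

  # Recomputes the total on every call.
  def total_price
    @line_items.sum { |item| item[:price] * item[:quantity] }
  end
end

class CachedOrder < Order
  # Memoizes the first result, so repeated calls skip the recomputation.
  def total_price
    @total_price ||= super
  end
end

order = CachedOrder.new([{ price: 5, quantity: 2 }, { price: 3, quantity: 1 }])
order.total_price # => 13
```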

    In this example, we’re subclassing Order and caching the calculation it does to find the total price of all line items. While it may seem obvious that this would be a beneficial code change, it isn’t obvious how much faster it is than the base implementation. An unabridged version of the script provides full context.

    Running the script reveals a ~50x improvement for a simple order of 4 line items. With larger orders, the payoff only gets better.

    One last thing to know about benchmarking effectively is to be aware of micro-optimizations: optimizations so small that the performance improvement isn’t worth the code change. While these are sometimes acceptable for hot code paths, it’s best to tackle larger-scale performance issues first.

    Case Study: Rails Contributions

    As with many open source projects, Ruby on Rails usually requires performance optimization pull requests to include benchmarks. The same is common for new features to performance sensitive areas like Active Record query building or Active Support’s cache stores. In the case of Rails, most benchmarks are made with benchmark-ips to simplify comparison.

    For example, one pull request changes how primary keys are accessed in Active Record instances, specifically refactoring class method calls to instance variable references. It includes before and after benchmark results with a clear explanation of why the change is necessary. Another changes model attribute assignment in Active Record so that key stringification of attribute hashes is no longer needed. A benchmark script with multiple scenarios is provided with results. This is a particularly hot code path because creating and updating records is at the heart of most Rails apps.

    Another example reduces object allocations in ActiveRecord#respond_to?. It provides a memory benchmark that compares total allocations before and after the patch, with a calculated diff. Reducing allocations delivers better performance because the less Ruby allocates, the less time Ruby spends assigning objects to blocks of memory.

    Final Thoughts

    Slow code is an inevitable facet of any codebase. It isn’t important who introduces performance regressions, but how they are fixed. As developers, it’s our job to leverage profiling and benchmarking to find and fix performance problems.

    At Shopify, we’ve written a lot of slow code, often for good reasons. Ruby itself is optimized for the developer, not the servers we run it on. As Rubyists, we write idiomatic, maintainable code that isn’t always performant, so profile and benchmark responsibly, and be wary of micro-optimizations!

    Additional Information

    If this sounds like the kind of problem you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions, and learn about the actions we’re taking as we continue to hire during COVID‑19.





    Categorizing Products at Scale

    By Jeet Mehta and Kathy Ge

    With over 1M business owners now on Shopify, there are billions of products being created and sold across the platform. Just like those business owners, the products that they sell are extremely diverse! Even when selling similar products, they tend to describe products very differently. One may describe their sock product as a “woolen long sock,” whereas another may have a similar sock product described as a “blue striped long sock.”

    How can we identify similar products, and why is that even useful?

    Applications of Product Categorization

    Business owners come to our platform for its multiple sales/marketing channels, app and partner ecosystem, brick and mortar support, and so much more. By understanding the types of products they sell, we provide personalized insights to help them capitalize on valuable business opportunities. For example, when business owners try to sell on other channels like Facebook Marketplace, we can leverage our product categorization engine to pre-fill category related information and save them time.

    In this blog post, we’re going to step through how we implemented a model to categorize all our products at Shopify, and in doing so, enabled cross-platform teams to deliver personalized insights to business owners. The system is used by 20+ teams across Shopify to power features like marketing recommendations for business owners (imagine: “t-shirts are trending, you should run an ad for your apparel products”), identification of brick-and-mortar stores for Shopify POS, market segmentation, and much more! We’ll also walk through problems, challenges, and technical tradeoffs made along the way.

    Why is Categorizing Products a Hard Problem?

    To start off, how do we even come up with a set of categories that represents all the products in the commerce space? Business owners are constantly coming up with new, creative ideas for products to sell! Luckily, Google has defined their own hierarchical Google Product Taxonomy (GPT) which we leveraged in our problem.

    The particular task of classifying over a large-scale hierarchical taxonomy presented two unique challenges:

    1. Scale: The GPT has over 5000 categories and is hierarchical. Binary classification or multi-class classification can be handled well with most simple classifiers. However, these approaches don’t scale well as the number of classes increases to the hundreds or thousands. We also have well over a billion products at Shopify and growing!
    2. Structure: Common classification tasks don’t share structure between classes (i.e. distinguishing between a dog and a cat is a flat classification problem). In this case, there’s an inherent tree-like hierarchy which adds a significant amount of complexity when classifying.

    Sample visualization of the GPT

    Representing our Products: Featurization 👕

    With all machine learning problems, the first step is featurization, the process of transforming the available data into a machine-understandable format.

    Before we begin, it’s worth answering the question: What attributes (or features) distinguish one product from another? Another way to think about this is if you, the human, were given the task of classifying products into a predefined set of categories: what would you want to look at?

    Some attributes that likely come to mind are:

    • Product title
    • Product image
    • Product description
    • Product tags.

    These are the same attributes that a machine learning model would need access to in order to perform classification successfully. With most problems of this nature though, it’s best to follow Occam’s Razor when determining viable solutions.

    Among competing hypotheses, the one with the fewest assumptions should be selected.

    In simpler language, Occam’s razor essentially states that the simplest solution or explanation is preferable to ones that are more complex. Based on the computational complexities that come with processing and featurizing images, we decided to err on the simpler side and stick with text-based data. Thus, our classification task included features like:

    • Product title
    • Product description
    • Product collection
    • Product tags
    • Product vendor
    • Merchant-provided product type.

    There are a variety of ways to vectorize text features like the above, including TF-IDF, Word2Vec, GloVe, etc. Optimizing for simplicity, we chose a simple term-frequency hashing featurizer using PySpark that works as follows:
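    As a toy illustration of the idea (PySpark’s HashingTF does this at scale and uses MurmurHash3; this pure-Python sketch uses the built-in hash as a stand-in):

```python
def hashing_tf(tokens, num_features=16):
    """Toy term-frequency hashing: hash each token to a bucket and count
    occurrences. Output length is fixed regardless of vocabulary size."""
    vec = [0] * num_features
    for token in tokens:
        # PySpark's HashingTF uses MurmurHash3; hash() stands in here.
        index = hash(token) % num_features
        vec[index] += 1
    return vec

# A product title becomes a fixed-length count vector.
features = hashing_tf(["blue", "striped", "long", "sock", "sock"])
```

    Because the output length is fixed, no in-memory vocabulary is needed, at the cost of occasional hash collisions between tokens.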

    HashingTF toy example

    Given the vast size of our data (and the resulting size of the vocabulary), advanced featurization methods like Word2Vec didn’t scale since they involved storing an in-memory vocabulary. In contrast, the HashingTF provided fixed-length numeric features which scaled to any vocabulary size. So although we’re potentially missing out on better semantic representations, the upside of being able to leverage all our training data significantly outweighed the downsides.
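To make the idea concrete, here’s a minimal pure-Python sketch of a term-frequency hashing featurizer. The bucket count and hash function are illustrative choices of ours; the production pipeline used PySpark’s HashingTF.

```python
import zlib

def hashing_tf(tokens, num_features=16):
    """Map a token list to a fixed-length term-frequency vector via hashing."""
    vec = [0] * num_features
    for tok in tokens:
        # A stable hash picks the bucket; collisions are the price of a fixed length
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return vec
```

No vocabulary is stored anywhere: the output length depends only on `num_features`, which is exactly why this approach scales to any vocabulary size.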

    Before performing the numeric featurization via HashingTF, we also performed a series of standard text pre-processing steps, such as:

    • Removing stop words (i.e. “the”, “a”, etc.), special characters, HTML, and URLs to reduce vocabulary size
    • Performing tokenization: splitting a string into an array of individual words or “tokens”.
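Those pre-processing steps can be sketched in plain Python as follows. The stop word set and regular expressions here are illustrative stand-ins, not our production rules.

```python
import re

# Illustrative stop word subset, not an exhaustive list
STOP_WORDS = {"the", "a", "an", "and", "of"}

def preprocess(text):
    """Strip HTML, URLs, and special characters, then tokenize and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)              # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)         # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # remove special characters
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

For example, `preprocess("Check out <b>the</b> socks! https://shop.example")` yields the clean token list `["check", "out", "socks"]`, ready for the hashing featurizer.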

    The Model 📖

    With our data featurized, we can now move towards modelling. Ensuring that we maintained a simple, interpretable solution while tackling the challenges of scale and structure mentioned earlier was difficult.

    Learning Product Categories

    Fortunately, during the process of solution discovery, we came across a method known as Kesler’s Construction [PDF]. This is a mathematical maneuver that enables the conversion of n one-vs-all classifiers into a single binary classifier. As shown in the figure below, this is achieved by exploding the training data with respect to the labels, and manipulating feature vectors with target labels to turn a multi-class training dataset into a binary training dataset.

    Figure 3: Kesler’s Construction formulation

    Applying this formulation to our problem meant prepending the target class to each token (word) in a given feature vector. This is repeated for each class in the output space, per feature vector. The pseudo-code below illustrates the process, and also showcases how the algorithm leads to a larger, binary-classification training dataset.

    1. Create a new empty dataset called modified_training_data
    2. For each feature_vector in the original_training_data:
      1. For each class in the taxonomy:
        1. Prepend the class to each token in the feature_vector, creating modified_feature_vector
        2. If the feature_vector is an example of the class, append (modified_feature_vector, 1) to modified_training_data
        3. If the feature_vector is not an example of the class, append (modified_feature_vector, 0) to modified_training_data
    3. Return modified_training_data

    Note: In the algorithm above, a vector can be an example of a class if its ground truth category belongs to a class that’s a descendant of the category being compared to. For example, a feature vector that has the label Clothing would be an example of the Apparel & Accessories class, and as a result would be assigned a binary label of 1. Meanwhile, a feature vector that has the label Cell Phones would not be an example of the Apparel & Accessories class, and as a result would be assigned a binary label of 0.
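The exploding step above can be sketched in Python. The function name, the `::` token separator, and the `is_example_of` callback are our own illustrative choices, not the production implementation.

```python
def keslerize(training_data, classes, is_example_of):
    """Explode a multi-class dataset into one binary dataset (Kesler's Construction).

    training_data: list of (tokens, label) pairs
    classes: every class in the taxonomy (the output space)
    is_example_of: (label, cls) -> bool; True when the ground-truth label is
                   cls itself or a descendant of cls in the taxonomy
    """
    modified = []
    for tokens, label in training_data:
        for cls in classes:
            # Prepend the target class to every token in the feature vector
            modified_tokens = [f"{cls}::{tok}" for tok in tokens]
            modified.append((modified_tokens, 1 if is_example_of(label, cls) else 0))
    return modified
```

Note that one original feature vector produces n binary rows, one per class, which is where the larger training dataset comes from.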

    Combining the above process with a simple Logistic Regression classifier allowed us to:

    • Solve the problem of scale - Kesler’s construction allowed a single model to scale to n classes (in this case, n was into the thousands)
    • Leverage taxonomy structure - By embedding target classes into feature vectors, we’re also able to leverage the structure of the taxonomy and allow information from parent categories to permeate into features for child categories. 
    • Reduce computational resource usage - Training a single model as opposed to n individual classifiers (albeit on a larger training data-set) ensured a lower computational load/cost.
    • Maintain simplicity - Logistic Regression is one of the simplest classification methods available. Its coefficients allow interpretability and reduce friction in hyperparameter tuning.

    Inference and Predictions 🔮

    Great, we now have a trained model, how do we then make predictions to all products on Shopify? Here’s an example to illustrate. Say we have a sample product, a pair of socks, below:

    Figure 4: Sample product entry for a pair of socks

    We aggregate all of its text (title, description, tags, etc.) and clean it up, resulting in the string:

    “Check out these socks”

    We take this sock product and compare it to all categories in the available taxonomy we trained on. To avoid computations on categories that will likely be low in relevance, we leverage the taxonomy structure and use a greedy approach in traversing the taxonomy.

    Figure 5: Sample traversal of taxonomy at inference time

    For each product, we prepend a target class to each token of the feature vector, and do so for every category in the taxonomy. We score the product against each root level category by multiplying this prepended feature vector against the trained model coefficients. We start at the root level and keep track of the category with the highest score. We then score the product against the children of the category with the highest score. We continue in this fashion until we’ve reached a leaf node. We output the full path from root to leaf node as a prediction for our sock product.
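A sketch of that greedy traversal is below. The `children` mapping and the `score` callback are illustrative stand-ins for the real taxonomy and the trained model’s scoring of the class-prepended feature vector.

```python
def greedy_classify(tokens, children, score):
    """Greedily descend the taxonomy from the root to a leaf.

    children: dict mapping a category to the list of its child categories
    score: (tokens, category) -> float, e.g. the trained model's score for
           the feature vector with that category prepended to each token
    """
    path = []
    candidates = children.get("root", [])
    while candidates:
        # Only the best-scoring category at each level is expanded further
        best = max(candidates, key=lambda cat: score(tokens, cat))
        path.append(best)
        candidates = children.get(best, [])
    return path
```

Because only the winning branch is expanded at each level, the number of scored categories grows with the depth of the taxonomy rather than its total size.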

    Evaluation Metrics & Performance ✅

    The model is built. How do we know if it’s any good? Luckily, the machine learning community has an established set of standards around evaluation metrics for models, and there are good practices around which metrics make the most sense for a given type of task.

    However, the uniqueness of hierarchical classification adds a twist to these best practices. For example, commonly used evaluation metrics for classification problems include accuracy, precision, recall, and F1 Score. These metrics work great for flat binary or multi-class problems, but there are several edge cases that show up when there’s a hierarchy of classes involved.

    Let’s take a look at an illustrative example. Suppose for a given product, our model predicts the following categorization: Apparel & Accessories > Clothing > Shirts & Tops. There are a few cases that can occur, based on what the product actually is:

    Figure 6. Product is a shirt - Model example

    1. Product is a Shirt: In this case, we’re correct! Everything is perfect.

    Figure 7. Product is a dress - Model example

    2. Product is a Dress: Clearly, our model is wrong here. But how wrong is it? It still correctly recognized that the item is a piece of apparel and is clothing.

    Figure 8. Product is a watch - Model example

    3. Product is a Watch: Again, the model is wrong here. It’s more wrong than the above answer, since it believes the product to be an accessory rather than apparel.

    Figure 9. Product is a phone - Model example

    4. Product is a Phone: In this instance, the model is the most incorrect, since the categorization is completely outside the realm of Apparel & Accessories.

    The flat metrics discussed above would punish each of the above predictions equally, even though they’re clearly not equally wrong. To rectify this, we leveraged work done by Costa et al. on hierarchical evaluation measures [PDF], which use the structure of the taxonomy (output space) to punish incorrect predictions accordingly. This includes:

    • Hierarchical accuracy
    • Hierarchical precision
    • Hierarchical recall
    • Hierarchical F1

    As shown below, the calculation of the metrics largely remains the same as their original flat form. The difference is that these metrics are regulated by the distance to the nearest common ancestor. In the examples provided, Dresses and Shirts & Tops are only a single level away from having a common ancestor (Clothing). In contrast, Phones and Shirts & Tops are in completely different sub-trees, and are four levels away from having a common ancestor.

    Example hierarchical metrics for “Dresses” vs. “Shirts & Tops”

    This distance is used as a proxy for the magnitude of incorrectness of our predictions, and allows us to better present and assess the performance of our models. The lesson here is to always question conventional evaluation metrics, and ensure that they indeed fit your use case and measure what matters.
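One common way to compute the hierarchical measures, in the style of Costa et al., is via the overlap between ancestor sets of the predicted and true paths. The path representation here is our own, for illustration.

```python
def ancestor_set(path):
    # A full category path like ("Apparel", "Clothing", "Shirts & Tops")
    # contributes every prefix: the node itself plus all of its ancestors.
    return {tuple(path[:i]) for i in range(1, len(path) + 1)}

def hierarchical_prf(pred_path, true_path):
    """Hierarchical precision, recall, and F1 from ancestor-set overlap."""
    pred, true = ancestor_set(pred_path), ancestor_set(true_path)
    overlap = len(pred & true)
    hp = overlap / len(pred)   # hierarchical precision
    hr = overlap / len(true)   # hierarchical recall
    hf1 = 0.0 if overlap == 0 else 2 * hp * hr / (hp + hr)
    return hp, hr, hf1
```

A prediction of Shirts & Tops for a Dress shares the Apparel > Clothing prefix and is only partially penalized, while a Phones prediction shares nothing and scores zero.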

    When Things Go Wrong: Incorrect Classifications ❌

    Like all probabilistic models, our model is bound to be incorrect on occasions. While the goal of model development is to reduce these misclassifications, it’s important to note that 100% accuracy will never be the case (and it shouldn’t be the gold standard that teams drive towards).

    Instead, given that the data product is delivering downstream impact to the business, it's best to determine feedback mechanisms for misclassification instances. This is exactly what we implemented through a unique setup of schematized Kafka events and an in-house annotation platform.

    Feedback system design

    This flexible human-in-the-loop setup provides a plug-in system that any downstream consumer can leverage, leading to reliable, accurate data additions to the model. It also extends beyond misclassifications to entirely new streams of data: new business owner-facing products and features that let owners provide category information can feed that information directly back into our models.

    Back to the Future: Potential Improvements 🚀

    Having established a baseline product categorization model, we’ve identified a number of possible improvements that can significantly improve the model’s performance, and therefore its downstream impact on the business.

    Data Imbalance ⚖️

    Much like other e-commerce platforms, Shopify has large sets of merchants selling certain types of products. As a result, our training dataset is skewed towards those product categories.

    At the same time, we don’t want that to preclude merchants in other industries from receiving strong, personalized insights. While we’ve taken some efforts to improve the data balance of each product category in our training data, there’s a lot of room for improvement. This includes experimenting with different re-balancing techniques, such as minority class oversampling (e.g. SMOTE [PDF]), majority class undersampling, or weighted re-balancing by class size.

    Translations 🌎

    As Shopify expands to international markets, it’s increasingly important to make sure we’re providing equal value to all business owners, regardless of their background. While our model currently only supports English language text (that being the primary source available in our training data), there’s a big opportunity here to capture products described and sold in other languages. One of the simplest ways we can tackle this is by leveraging multi-lingual pre-trained models such as Google’s Multilingual Sentence Embeddings.

    Images 📸

    Product images would be a great way to leverage a rich data source that provides a universal language in which products of all countries and types can be represented and categorized. This is something we’re looking to incorporate into our model in the future; however, images come with increased engineering costs. While training an image model from scratch is very expensive, one strategy we’ve experimented with is using pre-trained image embeddings like Inception v3 [PDF] and developing a CNN for this classification problem on top of them.

    Our simple model design gave us interpretability and kept computational resource usage low, enabling us to solve this problem at Shopify’s scale. Building out a shared language for products unlocked tons of opportunities for us to build better experiences for business owners and buyers. This includes things like identifying trending products, identifying product industries prone to fraud, or even improving storefront search experiences.

    If you’re passionate about building models at scale, and you’re eager to learn more - we’re always hiring! Reach out to us or apply on our careers page.

    Additional Information



    Continue reading

    Software Release Culture at Shopify

    Software Release Culture at Shopify

    A recording of the event and the additional questions are now available in the Release Culture @ Shopify Virtual Event section at the end of the post.

    By Jack Li, Kate Neely, and Jon Geiger

    At the end of last year, we shared the Merge Queue v2 in our blog post Successfully Merging the Work of 1000+ Developers. One question that we often get is, “why did you choose to build this yourself?” The short answer is that nothing we found could quite solve the problem in the way we wanted. The long answer is that it’s important for us to build an optimized experience for how our developers want to work and to continually shape our tooling and process around our “release culture”.

    Shopify defines culture as:

    “The sum of beliefs and behavior of everyone at Shopify.”

    We approach the culture around releasing software the exact same way. We have important goals, like making sure that bad changes don’t deploy to production and break for our users, and that our changes can make it into production without compromises in security. But there are many ways of getting there and a lot of right answers on how we can do things.

    As a team, we try to find the path to those goals that our developers want to take. We want to create experiences through tooling that can make our developers feel productive, and we want to do our best to make shipping feel like a celebration and not a chore.

    Measuring Release Culture at Shopify

    When we talk about measuring culture, we’re talking about a few things.

    • How do developers want to work?
    • What is important to them?
    • How do they expect the tools they use to support them?
    • How much do they want to know about what’s going on behind the scenes or under the hood of the tools they use?

    Often, there isn’t one single answer to these questions, especially given the number and variety of people who deploy every day at Shopify. There are a few active and passive ways we can get a sense of the culture around shipping code. One method isn’t more important than the others, but all of them together paint a clearer picture of what life is like for the people who use our tools.

    Passive and active methods of measurement

    The passive methods we use really don’t require much work from our team, except to manage and aggregate the information that comes in. The developer happiness survey is a biannual survey of developers across the company. Devs are asked to self-report on everything from their satisfaction with the tools they use to where they feel most of their time is wasted or lost.

    In addition, we have Slack channels dedicated to shipping that are open to anyone. Users can get support from our team or each other, and report problems they’re having. Our team is active in these channels to help foster a sense of community and encourage developers to share their experiences, but we don’t often use these channels to directly ask for feedback.

    That said, we do want to be proactive about identifying pain points, and we know we can’t rely too much on users to provide that direction, so there are also active things we do to make sure we’re solving the most important problems.

    The first thing is dogfooding. Just like other developers at Shopify, our team ships code every day using the same tools that we build and maintain. This helps us identify gaps in our service and empathize with users when things don’t go as planned.

    Another valuable resource is our internal support team. They take on the huge responsibility of helping users and supporting our evolving suite of internal tools. They diagnose issues and help users find the right team to direct their questions. And they are invaluable in terms of identifying common pain points that users experience in current workflows, as well as potential pitfalls in concepts and prototypes. We love them.

    Finally, especially when it comes to adding new features or changing existing workflows, we do UX research throughout our process:

    • to better understand user behavior and expectations
    • to test out concepts and prototypes as we develop them

    We shadow developers as they ship PRs to see what else they’re looking at and what they’re thinking about as they make decisions. We talk to people, like designers and copywriters, who might not ship code at other companies (but they often do at Shopify) and ask them to walk us through their processes and how they learned to use the tools they rely on. We ask interns and new hires to test out prototypes to get fresh perspectives and challenge our assumptions.

    All of this together ensures that, throughout the process of building and launching, we’re getting feedback from real users to make things better.

    Feedback is a Gift

    At Shopify, we often say feedback is a gift, but that doesn’t always make it less intimidating for users to share their frustrations, or easier for us to hear when things go wrong. Our goal with all measuring is to create a feedback loop where users feel comfortable talking about what’s not working for them (knowing that we care and will try to act on it), and we feel energized and inspired by what we learn from users instead of disheartened and bitter. We want them to know that their feedback is valuable and helpful for us to make both the tools and culture around shipping supportive of everyone.

    Shopify’s Release Process

    Let’s look at what Shopify’s actual release process looks like and how we’re working to improve it.

    Release Pipeline

    Happy path of the release pipeline

    This is what the release pipeline looks like on the happy path. We go from Pull Request (PR) to Continuous Integration (CI)/Merge to Canary deployment and finally Production.

    Release pipeline process starts with a PR and a /shipit command

    Developers start the process by creating a PR and then issue a /shipit command when ready to ship. From here, the Merge Queue system tries to integrate the PR with the trunk branch, Master.

    PR merged to Master and then deployed to Canary

    When the Merge Queue determines the changes can be integrated successfully, the PR is merged to Master and deployed to our Canary infrastructure. The Canary environment receives a random 5% of all incoming requests.

    Changes deployed to Production

    Developers have tooling allowing them to test their changes in the Canary environment for 10 minutes. If there’s no manual intervention and the automated canary analysis doesn’t trigger any alerts, the changes are deployed to Production.


    Developers want to be trusted and have autonomy over their work. Developers should be able to own the entire release process for their PRs.

    Developers own the whole process

    Developers own the whole process. There are no release managers, sign offs, or windows that developers are allowed to make releases in.

    We have infrastructure to limit the blast radius of bad changes

    Unfortunately, sometimes things will break, and that’s ok. We have built our infrastructure to limit the blast radius of bad changes. Most importantly, we trust each of our developers to be responsible and own the recovery if their change goes bad.

    Developers can fast track a fix using the /shipit --emergency command

    Once a fix has been prepared (either a fix-forward or revert), developers can fast track their fix to the front of the line with a single /shipit --emergency command. To help our developers make decisions quickly, we don’t have multiple recovery protocols, and instead, just have a single emergency feature that takes the quickest path to recovery.


    Developers want to ship fast.

    A quick release process allows us a quick path to recovery

    Speed of release is a crucial element to most apps at Shopify. It’s a big productivity boost for developers to ship their code multiple times a day and have it reach end-users immediately. But more importantly, having a quick-release process allows us a quick path to recovery.

    We’re willing to make tradeoffs in cost for a fast release process

    In order to invest in really making our release process fast, we’re willing to make tradeoffs in cost. In addition to dedicated infrastructure teams, we also manage our own CI cluster with multiple thousands of nodes at its daily peak.

    Automate as Much as Possible

    Developers don’t want to perform repetitive tasks. Computers are still better at doing things than humans—so we automate as much as possible. We use automation in places like continuous deployments and canary analysis.

    Developers don't have to press deploy, we automate that

    We automated out the need for developers to press Deploy—we automatically continuously deploy to Canary and Production.

    Developers can override automation

    It’s still important for developers to be able to override the machinery. Developers can lock the automatic deployments and deploy manually, in cases like emergencies.

    Release Culture @ Shopify Virtual Event

    We held Shipit! Presents a Q&A about Release Culture @ Shopify with our guests Jack Li, Kate Neely, and Jon Geiger on April 20, 2020.  We had a discussion about the culture around shipping code at Shopify and answered your questions. We weren't able to answer all the questions during the event, so we've included answers to all the questions that we didn't get to below.

    How has automation and velocity maintained uptime?

    Automation has helped maintain uptime by providing assurances on a level of quality for changes on the way out, such as the automated canary analysis guaranteeing that the changes meet a certain level of production quality. Velocity has helped maintain uptime by reducing downtime; when things break, the high velocity of release means that problems are resolved quicker.

    For an active monolith with many merges happening throughout the day, deploys to canary must be happening very frequently. How do you identify the "bad" merge, if there have been many recent merges on canary, and how do you ensure that bad merges don't block the release of other merges while there's a "bad" merge in the canary environment?

    Our process here is still relatively immature, so triaging failures is still a manual process. Our velocity helps us ensure smaller changelists, which makes triaging failures easier. As for reducing the impact of bad changes, I will defer to our blog post about the Merge Queue, which helps us ensure that we are not completely stalled when bad changes happen.

    How do you as a tooling organization handle sprawl? How do you balance enabling and controlling? That is, can teams choose to use their own languages and frameworks within those languages, or are they restricted to a set of things they're allowed to use?

    Generally, we are more restrictive in technology choices. This is mostly because we want to be strategic in the technologies that we use, and so we are open to experimentation, but we have technologies that are battle-tested at Shopify that we encourage as recommended defaults (e.g. Ruby, Rails). React Native is the Future of Mobile at Shopify is an interesting article that talks about a recent technology change we have made.

    What were you using before /shipit? How did that transition look? How did you measure its success?

    Successfully Merging the Work of 1000+ Developers tells the story of how we got to the current /shipit system. Our two measures of success around this has been through feedback from our developer happiness survey, and from metrics around average pull request time-to-production.

    How many different tools comprise the CI/CD pipeline and are they all developed in house, and are they owned by a specific team or does everyone contribute?

    We work with a variety of vendors! Our biggest partners are Buildkite, which we use for scheduling CI, and GitHub, which we build our development workflow around. The tools we build are developed and owned by our Developer Acceleration team, but everyone is free to contribute, and we get tons of contributions all the time. Our CD system, Shipit, is actually open source, and we see contributions from community members frequently as well.

    How is performance on production monitored after a feature release?

    Typically this is something that teams themselves will strategize around and monitor. Performance is a big deal to us, and we have an internal dashboard to help teams track their performance metrics. We trust each team to take this component of their product seriously. Aside from that, the Production Engineering team has monitors and dashboards around performance metrics of the entire system.

    How did you get into creating dev tooling for in-house teams. Are there languages/systems you would recommend learning to someone who is interested?

    (Note from Jack: Interpreting this as a kind of career question) Personally, I’ve always gravitated towards the more “meta” parts of software development, focusing on long-term productivity and maintainability of previous projects, so working on dev tooling full-time felt like a perfect fit. In my opinion, the most important skill for success in this problem space is adaptability, both to new technologies and to new ideas. Languages like Ruby and Python, which allow you to focus more on the ideas behind your code, can be good enablers for this. Docker and Kubernetes knowledge is valuable in this area as well.

    Is development done on feature branches and entire features merged all at once, or are partial/incomplete features merged into master, but guarded by a feature flag?

    Very good question, I think certain teams/features will do slightly different things, but typically releases happen via feature flags that we call “Beta Flags” in our system. This allows changes to be rolled out on a per-shop basis, or a percentage-of-shops basis.
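As a purely hypothetical sketch (not Shopify’s actual beta-flag system), a flag supporting both per-shop and percentage-of-shops rollout might look like this, using a stable hash so a given shop’s experience doesn’t flicker between requests:

```python
import zlib

def beta_enabled(flag, shop_id, percentage=0, enabled_shops=()):
    """Hypothetical beta-flag check: explicit shops, else a stable percentage."""
    if shop_id in enabled_shops:
        return True
    # Hash the (flag, shop) pair into one of 100 buckets; the result is
    # deterministic, so the same shop always lands in the same bucket.
    bucket = zlib.crc32(f"{flag}:{shop_id}".encode("utf-8")) % 100
    return bucket < percentage
```

Raising `percentage` gradually widens the rollout without ever toggling a shop that was already enabled back off.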

    Do you guys use Crystalball?

    We forwarded this question to our test infrastructure team. Their response was that we don’t use Crystalball; there was some brief exploration into it, but it wasn’t fast enough to trace through our codebase, and the test suite in our main monolith is written in minitest.

    Additional Information

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions. Learn about the actions we’re taking as we continue to hire during COVID‑19.



    Continue reading

    Building Arrive's Confetti in React Native with Reanimated

    Building Arrive's Confetti in React Native with Reanimated

    Shopify is investing in React Native as our primary choice of mobile technology moving forward. As a part of this we’ve rewritten our package tracking app Arrive with React Native and launched it on Android—an app that previously only had an iOS version.

    One of the most cherished features by the users of the Arrive iOS app is the confetti that rains down on the screen when an order is delivered. The effect was implemented using the built-in CAEmitterLayer class in iOS, producing waves of confetti bursting out with varying speeds and colors from a single point at the top of the screen.

    When we on the Arrive team started building the React Native version of the app, we included the same native code that produced the confetti effect through a Native Module wrapper. This would only work on iOS however, so to bring the same effect to Android we had two options before us:

    1. Write a counterpart to the iOS native code in Android with Java or Kotlin, and embed it as a Native Module.
    2. Implement the effect purely in JavaScript, allowing us to share the same code on both platforms.

    As you might have guessed from the title of this blog post, we decided to go with the second option. To keep the code as performant as the native implementation, the best option would be to write it in a declarative fashion with the help of the Reanimated library.

    I’ll walk you through, step by step, how we implemented the effect in React Native, while also explaining what it means to write an animation declaratively.

    When we worked on this implementation, we also decided to make some visual tweaks and improvements to the effect along the way. These changes make the confetti spread out more uniformly on the screen, and makes them behave more like paper by rotating along all three dimensions.

    Laying Out the Confetti

    To get our feet wet, the first step will be to render a number of confetti on the screen with different colors, positions and rotation angles.

    Initialize the view of 100 confetti

    We initialize the view of 100 confetti with a couple of randomized values and render them out on the screen. To prepare for animations further down the line, each confetto (singular form of confetti, naturally) is wrapped with Reanimated's Animated.View. This works just like the regular React Native View, but accepts declaratively animated style properties as well, which I’ll explain in the next section.

    Defining Animations Declaratively

    In React Native, you generally have two options for implementing an animation:

    1. Write a JavaScript function called by requestAnimationFrame on every frame to update the properties of a view.
    2. Use a declarative API, such as Animated or Reanimated, that allows you to declare instructions that are sent to the native UI-thread to be run on every frame.

    The first option might seem the most attractive at first for its simplicity, but there’s a big problem with the approach. You need to be able to calculate the new property values within 16 milliseconds every time to maintain a consistent 60 FPS animation. In a vacuum, this might seem like an easy goal to accomplish, but because of JavaScript’s single-threaded nature you’ll also be blocked by anything else that needs to be computed in JavaScript during the same time period. As an app grows and needs to do more things at once, it quickly becomes impractical to reliably finish the computation within the strict time limit.

    With the second option, you only rely on JavaScript at the beginning of the animation to set it all up, after which all computation happens on the native UI-thread. Instead of relying on a JavaScript function to answer where to move a view on each frame, you assemble a set of instructions that the UI-thread itself can execute on every frame to update the view. When using Reanimated these instructions can include conditionals, mathematical operations, string concatenation, and much more. These can be combined in a way that almost resembles its own programming language. With this language, you write a small program that can be sent down to the native layer, that is executed once every frame on the UI-thread.

    Animating the Confetti

    We are now ready to apply animations to the confetti we laid out in the previous step. Let's start by updating our createConfetti function:
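The original snippet isn’t reproduced here, but based on the description that follows, a plain-JavaScript sketch of createConfetti might look like this. The count, colors, and velocity ranges are assumptions; in the real component each of x, y, angle and the velocities would additionally be wrapped in Reanimated’s Animated.Value, alongside an Animated.Clock per confetto:

```javascript
// Hypothetical sketch: the count, colors, and velocity ranges are
// assumptions, not the article's exact values.
const NUM_CONFETTI = 100;
const COLORS = ['#00e4b2', '#09aec5', '#107ed5'];

function createConfetti(screenWidth) {
  return [...new Array(NUM_CONFETTI)].map((_, index) => ({
    key: index,
    // Every confetto starts at the same point: our imaginary cannon.
    x: screenWidth / 2,
    y: -30,
    angle: 0,
    // The velocities (change per second of animation) are randomized.
    xVel: Math.random() * 400 - 200,
    yVel: Math.random() * 150 + 150,
    angleVel: (Math.random() * 3 - 1.5) * Math.PI,
    color: COLORS[index % COLORS.length],
  }));
}
```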

    Instead of randomizing x, y and angle, we give all confetti the same initial values but randomize the velocities that we're going to apply to them. This creates the effect of all confetti starting out inside an imaginary confetti cannon and shooting out in different directions and at different speeds. Each velocity expresses how much a value changes over each full second of animation.

    We need to wrap each value that we're intending to animate with Animated.Value, to prepare them for declarative instructions. The Animated.Clock value is what's going to be the driver of all our animations. As the name implies it gives us access to the animation's time, which we'll use to decide how much to move each value forward on each update.

    Further down, next to where we’re mapping over and rendering the confetti, we add our instructions for how the values should be animated:

    Before anything else, we set up our dt (delta time) value that will express how much time has passed since the last update, in seconds. This decides the x, y, and angle delta values that we're going to apply.

    To get our animation going we need to start the clock if it's not already running. To do this, we wrap our instructions in a condition, cond, which checks the clock state and starts it if necessary. We also need to call our timeDiff (time difference) value once to set it up for future use, since the underlying diff function returns its value’s difference since the last frame it evaluated, and the first call will be used as the starting reference point.

    The declarative instructions above roughly translate to the following pseudo code, which runs on every frame of the animation:
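The article’s pseudo code isn’t shown here; as a runnable stand-in, the per-frame update could look like the following plain JavaScript (the real version is assembled from Reanimated nodes such as cond, set, add, and multiply, and derives dt from the clock via diff):

```javascript
// A runnable stand-in for the pseudo code. Timestamps are in
// milliseconds; velocities are per second of animation.
function updateConfetto(confetto, timestampMs) {
  if (confetto.lastTimestamp == null) {
    // First frame: only record a reference point, like the initial
    // timeDiff call described above.
    confetto.lastTimestamp = timestampMs;
    return confetto;
  }
  const dt = (timestampMs - confetto.lastTimestamp) / 1000; // seconds
  confetto.lastTimestamp = timestampMs;
  // Move each property forward proportionally to the elapsed time.
  confetto.x += confetto.xVel * dt;
  confetto.y += confetto.yVel * dt;
  confetto.angle += confetto.angleVel * dt;
  return confetto;
}
```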

    Considering the nature of confetti falling through the air, moving at constant speed makes sense here. If we were to simulate more solid objects that aren't slowed down by air resistance as much, we might want to add a yAcc (y-axis acceleration) variable that would also increase the yVel (y-axis velocity) within each frame.

    Everything put together, this is what we have now:

    Staggering Animations

    The confetti is starting to look like the original version, but our React Native version is blurting out all the confetti at once, instead of shooting them out in waves. Let's address this by staggering our animations:

    We add a delay property to our confetti, with increasing values for each group of 10 confetti. To wait for the given time delay, we update our animation code block to first subtract dt from delay until it reaches below 0, after which our previously written animation code kicks in.
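A hedged, plain-JavaScript sketch of that staggering logic (the group size of ten comes from the text; the 0.25-second step between groups is an assumption):

```javascript
// Groups of ten confetti share a starting delay; the step size between
// groups (0.25 s) is an assumption.
function delayForIndex(index) {
  return Math.floor(index / 10) * 0.25; // seconds
}

// Burn the delay down first; normal movement only starts afterwards.
function stepWithDelay(confetto, dt) {
  if (confetto.delay > 0) {
    confetto.delay -= dt;
    return confetto;
  }
  confetto.x += confetto.xVel * dt;
  confetto.y += confetto.yVel * dt;
  confetto.angle += confetto.angleVel * dt;
  return confetto;
}
```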

    Now we have something that pretty much looks like the original version. But isn’t it a bit sad that so many of our confetti shoot off the horizontal edges of the screen without ever travelling the full height of it? It seems like missed potential.

    Containing the Confetti

    Instead of letting our confetti escape the screen on the horizontal edges, let’s have them bounce back into action when that’s about to happen. To prevent this from making the confetti look like pieces of rubber macaroni bouncing back and forth, we need to use a good elasticity multiplier to determine how much of the initial velocity to keep after the collision.

    When an x value is about to go outside the bounds of the screen, we reset it to the edge’s position and reverse the direction of xVel while reducing it by the elasticity multiplier at the same time:
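A minimal sketch of that bounce in plain JavaScript, assuming an elasticity multiplier of 0.5 (the article’s actual value isn’t shown, and confetto width is ignored for brevity):

```javascript
// Keep a fraction of the velocity after a collision so the confetti
// don't bounce like rubber; 0.5 is an assumed elasticity value.
const ELASTICITY = 0.5;

function bounceOffEdges(confetto, screenWidth) {
  if (confetto.x < 0) {
    // Snap back to the edge and reverse (and dampen) the velocity.
    confetto.x = 0;
    confetto.xVel = -confetto.xVel * ELASTICITY;
  } else if (confetto.x > screenWidth) {
    confetto.x = screenWidth;
    confetto.xVel = -confetto.xVel * ELASTICITY;
  }
  return confetto;
}
```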

    Adding a Cannon and a Dimension

    We’re starting to feel done with our confetti, but let’s have a last bit of fun with it before shipping it off. What’s more fun than a confetti cannon shooting 2-dimensional confetti? The answer is obvious of course—it’s two confetti cannons shooting 3-dimensional confetti!

    We should also consider cleaning up by deleting the confetti images and stopping the animation once we reach the bottom of the screen, but that’s not nearly as fun as the two additions above so we’ll leave that out of this blog post.

    This is the result of adding the two effects above:

    The final full code for this component is available in this gist.

    Driving Native-level Animation with JavaScript

    While it can take some time to get used to Reanimated’s seemingly arcane API, once you’ve played around with it for a bit there should be nothing stopping you from implementing butter-smooth cross-platform animations in React Native, all without leaving the comfort of the JavaScript layer. The library has many more capabilities we haven’t touched on in this post, for example, the possibility of adding user interactivity by mapping animations to touch gestures. Keep a lookout for future posts on this subject!

    Continue reading

    Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization

    Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization

    Shopify's involvement with TruffleRuby began half a year ago, with the goal of furthering the success of the project and the Ruby community. TruffleRuby is an alternative implementation of the Ruby language (where the reference implementation is CRuby, or MRI) developed by Oracle Labs. TruffleRuby has high potential in speed, as it is nine times faster than CRuby on optcarrot, a NES emulator benchmark developed by the Ruby Core Team.

    I’ll walk you through a simple feature I investigated and implemented. It showcases many important aspects of TruffleRuby and serves as a great introduction to the project!

    Introduction to Ruby Lazy Initialization

    Ruby developers tend to use the double pipe equals operator ||= for lazy initialization, likely somewhat like this:
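A hypothetical example of the pattern (the class and method names are made up):

```ruby
# Hypothetical example: the query runs on the first call and the result
# is cached in @entries for every call after that.
class Report
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def entries
    @entries ||= expensive_database_query
  end

  private

  def expensive_database_query
    @query_count += 1
    [:row1, :row2]
  end
end
```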

    Syntactically, the double pipe equals operator evaluates and assigns the right-hand side only if the variable currently holds a falsy value (nil or false).

    The common use case of the operator isn’t so much “assign if” and more “assign once”.

    This idiomatic usage is a subset of the operator’s syntactical meaning so prioritizing that logic in the compiler can improve performance. For TruffleRuby, this would lead to less machine code being emitted as the logic flow is shortened.

    Analyzing Idiomatic Usage

    To confirm that this usage is common enough to be worth optimizing for, I ran static profiling on how many times this operator is used as lazy initialization.

    For a statement to count as a lazy initialization for these profiling purposes, we had it match one of the following requirements:

    • The value being assigned is a constant (uses only literals of int, string, symbol, hash, array or is a constant variable). An example would be a ||= [2 * PI].
    • The statement with the ||= operator is in a function, an instance or class variable is being assigned, and the name of the instance variable contains the name of the function or vice versa. The function must accept no params. An example would be def get_a; @a ||= func_call.

    These criteria are very conservative. Here are some examples of cases that won’t be considered a lazy initialization but probably still follow the pattern of “assign once”.
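For instance (hypothetical code):

```ruby
# Hypothetical near-misses: each assigns once in practice, but fails the
# conservative criteria above.

# The right-hand side takes an argument, so it isn't a simple constant,
# and @config doesn't share the method's name:
def settings
  @config ||= build_config(:production)
end

def build_config(env)
  { env: env }
end

# The memoizing method itself accepts a parameter, which disqualifies it
# even though the instance variable matches the method name:
def find_user(id)
  @user ||= lookup(id)
end

def lookup(id)
  { id: id }
end
```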

    After profiling 20 popular open-source projects, I found 2082 usages of the ||= operator, 64% of them being lazy initialization by this definition.

    Compiling Code with TruffleRuby

    Before we get into optimizing TruffleRuby for this behaviour, here’s some background on how TruffleRuby compiles your code.

    TruffleRuby is an implementation of Ruby that aims for higher performance through optimizing Just In Time (JIT) compilation (programs are compiled as they’re being executed). It’s built on top of GraalVM, a modified JVM built by Oracle that provides Truffle, a framework used by TruffleRuby to implement the language as an Abstract Syntax Tree (AST) interpreter. With Truffle, there’s no explicit step where JVM bytecode is created as with a conventional JVM language. Instead, Truffle just uses the interpreter and communicates with the JVM to create machine code directly, using profiling and a technique called partial evaluation. This means that GraalVM can be advertised as magic that converts interpreters into compilers!

    TruffleRuby also leverages deoptimization (more than other implementations of Ruby), a term for quickly moving from the fast JIT-compiled machine code back to the slow interpreter. One application of deoptimization is how the compiler handles monkey patching (replacing a class method at runtime). It’s unlikely that a method will be monkey patched, so we can deoptimize when it has been, to find and execute the new method. The path for handling the monkey patching won’t need to be compiled or appear in the machine code. In practice, this use case is even better: instead of constantly checking whether a function has been redefined, we can place the deoptimization at the redefinition itself and never need a check in compiled code.

    In this case with lazy initialization, we make the deoptimization case the uncommon one where the variable needs to be assigned a value more than once.

    Implementing the Deoptimization

    Previously, when TruffleRuby encountered the ||= operator, a Graal profiler would see that since both sides have been used, the entire statement should be compiled into machine code. Our knowledge of how Ruby is used in practice tells us that the right-hand side is unlikely to be run again, so it doesn’t need to be compiled into machine code if it’s never been executed or has been executed just once.

    TruffleRuby uses little objects called nodes to represent each part of a Ruby program. We use an OrNode to handle the ||= operator, with the left side being the condition and the right side being the action to execute if the left side is falsy (in this case the action is an assignment). The creation of these nodes is implemented in Java.

    To make this optimization, we swapped out the standard OrNode for an OrLazyValueDefinedNode in the BodyTranslator which translates the Ruby AST into nodes that Truffle can understand.

    The basic OrNode executes like this:

    The ConditionProfile is what counts how many times each branch is executed. With lazy initialization it counts both sides as used by default, so it compiles them both into the machine code.

    The OrLazyValueDefinedNode only changes the else block. What I'm doing here is counting the number of times the else part is executed, and turning it into a deoptimization if it’s less than twice.
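The actual node is written in Java against Truffle’s APIs; as a conceptual, plain-Ruby model of that counting logic (a deliberate simplification, not TruffleRuby’s real code, in which the threshold check triggers a transfer back to the interpreter):

```ruby
# Conceptual model only: the real OrLazyValueDefinedNode is Java, and
# "don't compile the branch yet" is expressed as a deoptimization
# (CompilerDirectives.transferToInterpreterAndInvalidate).
class LazyOrNodeModel
  attr_reader :times_else_executed

  def initialize(&right_side)
    @right_side = right_side
    @times_else_executed = 0
  end

  def execute(left_value)
    # Truthy left side: the ||= assignment is skipped entirely.
    return left_value if left_value

    @times_else_executed += 1
    @right_side.call
  end

  # The else branch is only worth compiling once it has run twice;
  # before that, we keep it out of the machine code.
  def compile_else_branch?
    @times_else_executed >= 2
  end
end
```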

    Benchmarking and Impact

    Benchmarking isn’t a perfect measure of how effective this change is (benchmarking is arguably never perfect, but that’s a different conversation), as the results would be too noisy to observe in a large project. However, I can still benchmark on some pieces of code to see the improvements. By doing the “transfer to interpreter and invalidate”, time and space is saved in creating machine code for everything related to the right side.
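The benchmarked snippet isn’t reproduced here; a plausible reconstruction of the kind of code being measured (calculate_foo is a stand-in name for some nontrivial right-hand side):

```ruby
# Hypothetical reconstruction of the benchmarked pattern.
def calculate_foo
  (1..100).reduce(:+)
end

def foo
  # Only the first call executes calculate_foo; with the optimization,
  # the right-hand side stays out of the compiled machine code.
  @foo ||= calculate_foo
end
```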

    With our new optimization, this piece of code compiles about 6% faster and produces about 63% less machine code by memory (about half the number of assembly instructions). Faster compilation means more time for your app to run, and smaller machine code means less use of memory and cache. Producing less machine code more quickly improves responsiveness and should in turn make the program run faster, though that’s difficult to prove.

    Function foo without optimization



    Above is a graph of the foo method from the sample code, without the optimization; it roughly represents the logic present in the machine code. I can look at the actual compiler graphs produced by Graal at various stages to understand exactly how our code is being compiled, but this is the overview.

    Each of the nodes in this graph expands to more control flow and memory access, which is why this optimization can impact the amount of machine code so much. This graph represents the uncommon case where the checks and call to the calculate_foo method are needed, so for lazy initialization it’ll only need this flow once or zero times.

    Function foo with optimization

    The graph that includes the optimization is a bit less complex. The control flow doesn’t need to know anything about variable assignment or anything related to calling and executing a method.

    What I've added is just an optimization, so if you:

    • aren’t using ||= to mean lazy initialization
    • need to run the right-hand-side of the expression multiple times
    • need it to be fast

    then the optimization goes away and the code is compiled as it would have been before (you can revisit the OrLazyValueDefinedNode source above to see the logic for this).

    This optimization shows the benefit of looking at codebases used in industry for patterns that aren’t visible in the language specifications. It’s also worth noting that none of the code changes here were very complicated and modified code in a very modular way—other than the creation of the new node, only one other line was touched!

    Truffle is actually named after the chocolates, partially in reference to the modularity of a box of chocolates. Apart from modularity, TruffleRuby is also easy to develop on as it's primarily written in Ruby and Java (there's some C in there for extensions).

    Shopify is leading the way in experimenting with TruffleRuby for production applications. TruffleRuby is currently mirroring storefront traffic. This has helped us work through some bugs and build better tooling for TruffleRuby, and it can lead to faster browsing for customers.

    We also contribute to CRuby/MRI and Sorbet as a part of our work on Ruby. We like desserts, so along with contributions to TruffleRuby and Sorbet, we maintain Tapioca! If you'd like to become a part of our dessert medley (or work on other amazing Shopify projects), send us an application!

    Additional Information

    Tangentially related things about Ruby and TruffleRuby

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    Refactoring Legacy Code with the Strangler Fig Pattern

    Refactoring Legacy Code with the Strangler Fig Pattern

    Large objects are a code smell: overloaded with responsibilities and dependencies, as they continue to grow, it becomes more difficult to define what exactly they’re responsible for. Large objects are harder to reuse and slower to test. Even worse, they cost developers additional time and mental effort to understand, increasing the chance of introducing bugs. Unchecked, large objects risk turning the rest of your codebase into a ball of mud, but fear not! There are strategies for reducing the size and responsibilities of large objects. Here’s one that worked for us at Shopify, an all-in-one commerce platform supporting over one million merchants across the globe. 

    As you can imagine, one of the most critical areas in Shopify’s Ruby on Rails codebase is the Shop model. Shop is a hefty class with well over 3000 lines of code, and its responsibilities are numerous. When Shopify was a smaller company with a smaller codebase, Shop’s purpose was clearer: it represented an online store hosted on our platform. Today, Shopify is far more complex, and the business intentions of the Shop model are murkier. It can be described as a God Object: a class that knows and does too much.

    My team, Kernel Architecture Patterns, is responsible for enforcing clean, efficient, scalable architecture in the Shopify codebase. Over the past few years, we invested a huge effort into componentizing Shopify’s monolithic codebase (see Deconstructing the Monolith) with the goal of establishing well-defined boundaries between different domains of the Shopify platform.

    Not only is creating boundaries at the component-level important, but establishing boundaries between objects within a component is critical as well. It’s important that the business subdomain modelled by an object is clearly defined. This ensures that classes have clear boundaries and well-defined sets of responsibilities.

    Shop’s definition is unclear, and its semantic boundaries are weak. Unfortunately, this makes it an easy target for the addition of new features and complexities. As advocates for clean, well-modelled code, it was evident that the team needed to start addressing the Shop model and move some of its business processes into more appropriate objects or components.

    Using the ABC Code Metric to Determine Code Quality

    Knowing where to start refactoring can be a challenge, especially with a large class like Shop. One way to find a starting point is to use a code metric tool. It doesn’t really matter which one you choose, as long as it makes sense for your codebase. Our team opted to use Flog, which uses a score based on the number of assignments, branches and calls in each area of the code to understand where code quality is suffering the most. Running Flog identified a particularly disordered portion in Shop: store settings, which contains numerous “global attributes” related to a Shopify store.

    Refactoring Shop with the Strangler Fig Pattern

    Extracting store settings into more appropriate components offered a number of benefits, notably better cohesion and comprehension in Shop and the decoupling of unrelated code from the Shop model. Refactoring Shop was a daunting task—most of these settings were referenced in various places throughout the codebase, often in components that the team was unfamiliar with. We knew we’d potentially make incorrect assumptions about where these settings should be moved to. We wanted to ensure that the extraction process was well laid out, and that any steps taken were easily reversible in case we changed our minds about a modelling decision or made a mistake. Guaranteeing no downtime for Shopify was also a critical requirement, and moving from a legacy system to an entirely new system in one go seemed like a recipe for disaster.

    What is the Strangler Fig Pattern?

    The solution? Martin Fowler’s Strangler Fig Pattern. Don’t let the name intimidate you! The Strangler Fig Pattern offers an incremental, reliable process for refactoring code. It describes a method whereby a new system slowly grows over top of an old system until the old system is “strangled” and can simply be removed. The great thing about this approach is that changes can be incremental, monitored at all times, and the chances of something breaking unexpectedly are fairly low. The old system remains in place until we’re confident that the new system is operating as expected, and then it’s a simple matter of removing all the legacy code.

    That’s a relatively vague description of the Strangler Fig Pattern, so let’s break down the 7-step process we created as we worked to extract settings from the Shop model. The following is a macro-level view of the refactor.

    Macro-level view of the Strangler Fig Pattern

    We’ll dive into exactly what is involved in each step, so don’t worry if this diagram is a bit overwhelming to begin with.

    Step 1: Define an Interface for the Thing That Needs to Be Extracted

    Define the public interface by adding methods to an existing class, or by defining a new model entirely.

    The first step in the refactoring process is to define the public interface for the thing being extracted. This might involve adding methods to an existing class, or it may involve defining a new model entirely. This first step is just about defining the new interface; we’ll depend on the existing interface for reading data during this step. In this example, we’ll be depending on an existing Shop object and will continue to access data from the shops database table.

    Let’s look at an example involving Shopify Capital, Shopify’s finance program. Shopify Capital offers cash advances and loans to merchants to help them kick-start their business or pursue their next big goal. When a merchant is approved for financing, a boolean attribute, locked_settings, is set to true on their store. This indicates that certain functionality on the store is locked while the merchant is taking advantage of a capital loan. The locked_settings attribute is being used by the following methods in the Shop class:
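Those methods aren’t reproduced here; a hedged sketch of what they might look like (the method names are assumptions, and this stand-in swaps the real Active Record model for a plain Ruby class so the example stays self-contained):

```ruby
# Hedged sketch: the real Shop is an Active Record model persisting to
# the shops table; update! here stands in for a database write.
class Shop
  attr_reader :locked_settings

  def initialize
    @locked_settings = false
  end

  def lock_settings!
    update!(locked_settings: true)
  end

  def unlock_settings!
    update!(locked_settings: false)
  end

  def settings_locked?
    locked_settings
  end

  private

  def update!(locked_settings:)
    @locked_settings = locked_settings
  end
end
```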

    We already have a pretty clear idea of the methods that need to be involved in the new interface based on the existing methods that are in the Shop class. Let’s define an interface in a new class, SettingsToLock, inside the Capital component.
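A plausible first version of that interface, as a plain Ruby object that still reads from and writes to the Shop (lock and unlock match the methods discussed later; the locked? reader name is an assumption):

```ruby
# Plausible first version: a plain old Ruby object that still delegates
# to the Shop object for persistence.
module Capital
  class SettingsToLock
    def initialize(shop)
      @shop = shop
    end

    def lock
      @shop.update!(locked_settings: true)
    end

    def unlock
      @shop.update!(locked_settings: false)
    end

    def locked?
      @shop.locked_settings
    end
  end
end
```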

    As previously mentioned, we’re still reading from and writing to a Shop object at this point. Of course, it’s critical that we supply tests for the new interface as well.

    We’ve clearly defined the interface for the new system. Now, clients can start using this new interface to interact with Capital settings rather than going through Shop.

    Step 2: Change Calls to the Old System to Use the New System Instead

    Replace calls to the existing “host” interface with calls to the new system instead

    Now that we have an interface to work with, the next step in the Strangler Fig Pattern is to replace calls to the existing “host” interface with calls to the new system instead. Any objects sending messages to Shop to ask about locked settings will now direct their messages to the methods we’ve defined in Capital::SettingsToLock.

    In a controller for the admin section of Shopify, we have the following method:

    This can be changed to:

    A simple change, but now this controller is making use of the new interface rather than going directly to the Shop object to lock settings.

    Step 3: Make a New Data Source for the New System If It Requires Writing

    New Data Source

    If data is written as a part of the new interface, it should be written to a more appropriate data source. This might be a new column in an existing table, or may require the creation of a new table entirely.

    Continuing on with our existing example, it seems like this data should belong in a new table. There are no existing tables in the Capital component relevant to locked settings, and we’ve created a new class to hold the business logic—these are both clues that we need a new data source.

    The shops table currently looks like this in db/schema.rb

    We create a new table, capital_shop_settings_locks, with a column locked_settings and a reference to a shop.
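A hypothetical migration for that table; the column types, options, and Rails version are assumptions based on the description, so this is a schema sketch rather than the real migration:

```ruby
# Hypothetical migration sketch (not the actual Shopify code).
class CreateCapitalShopSettingsLocks < ActiveRecord::Migration[6.0]
  def change
    create_table :capital_shop_settings_locks do |t|
      t.references :shop, null: false
      t.boolean :locked_settings, default: false, null: false
      t.timestamps
    end
  end
end
```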

    The creation of this new table marks the end of this step.

    Step 4: Implement Writers in the New Model to Write to the New Data Source

    Implement writers in the new model to write data to the new data source while also writing to the existing data source

    The next step in the Strangler Fig Pattern is a bit more involved. We need to implement writers in the new model to write data to the new data source while also writing to the existing data source.

    It’s important to note that while we now have both a new class, Capital::SettingsToLock, and a new table, capital_shop_settings_locks, the two aren’t connected at the moment. The class defining the new interface is a plain old Ruby object and solely houses business logic. We are aiming to create a separation between the business logic of store settings and the persistence (or infrastructure) logic. If you’re certain that your model’s business logic is going to stay small and uncomplicated, feel free to use a single Active Record model. However, you may find that starting with a Ruby class separate from your infrastructure is simpler and faster to test and change.

    At this point, we introduce a record object at the persistence layer. It will be used by the Capital::SettingsToLock class to read data from and write data to the new table. Note that the record class will effectively be kept private to the business logic class.

    We accomplish this by creating a subclass of ApplicationRecord. Its responsibility is to interact with the capital_shop_settings_locks table we’ve defined. We define a class Capital::SettingsToLockRecord, map it to the table we’ve created, and add some validations on the attributes.

    Let’s add some tests to ensure that the validations we’ve specified on the record model work as intended:

    Now that we have Capital::SettingsToLockRecord to read from and write to the table, we need to set up Capital::SettingsToLock to access the new data source via this record class. We can start by modifying the constructor to take a repository parameter that defaults to the record class:

    Next, let’s define a private getter, record. It performs find_or_initialize_by on the record model, Capital::SettingsToLockRecord, using shop_id as an argument to return an object for the specified shop.

    Now, we complete this step in the Strangler Fig Pattern by starting to write to the new table. Since we’re still reading data from the original data source, we’ll need to write to both sources in tandem until the new data source is written to and has been backfilled with the existing data. To ensure that the two data sources are always in sync, we’ll perform the writes within transactions. Let’s refresh our memories on the methods in Capital::SettingsToLock that are currently performing writes.

    After duplicating the writes and wrapping these double writes in transactions, we have the following:
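A hedged sketch of the resulting class (in the real code the double writes run inside ApplicationRecord.transaction; the transaction helper below simply yields so the sketch stays runnable without Rails):

```ruby
# Hedged sketch of the double-write step; method names mirror the text,
# other details are assumptions.
module Capital
  class SettingsToLock
    def initialize(shop, repository:)
      @shop = shop
      @repository = repository
    end

    def lock
      transaction do
        record.locked_settings = true
        record.save!
        @shop.update!(locked_settings: true) # legacy write, removed in step 7
      end
    end

    def unlock
      transaction do
        record.locked_settings = false
        record.save!
        @shop.update!(locked_settings: false) # legacy write, removed in step 7
      end
    end

    private

    def record
      @record ||= @repository.find_or_initialize_by(shop_id: @shop.id)
    end

    def transaction
      yield # stand-in for ApplicationRecord.transaction
    end
  end
end
```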

    The last thing to do is to add tests that ensure that lock and unlock are indeed persisting data to the new table. We control the output of SettingsToLockRecord’s find_or_initialize_by, stubbing the method call to return a mock record.

    At this point, we are successfully writing to both sources. That concludes the work for this step.

    Step 5: Backfill the New Data Source with Existing Data

    Backfill the data

    The next step in the Strangler Fig Pattern involves backfilling data to the new data source from the old data source. While we’re writing new data to the new table, we need to ensure that all of the existing data in the shops table for locked_settings is ported over to capital_shop_settings_locks.

    In order to backfill data to the new table, we’ll need a job that iterates over all shops and creates record objects from the data on each one. Shopify developed an open-source iteration API as an extension to Active Job. It offers safer iterations over collections of objects and is ideal for a scenario like this. There are two key methods in the iteration API: build_enumerator specifies the collection of items to be iterated over, and each_iteration defines the actions to be taken out on each object in the collection. In the backfill task, we specify that we’d like to iterate over every shop record, and each_iteration contains the logic for creating or updating a Capital::SettingsToLockRecord object given a store. The alternative is to make use of Rails’ Active Job framework and write a simple job that iterates over the Shop collection. 
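A framework-free sketch of the backfill’s core loop (the real task uses build_enumerator and each_iteration from the iteration API, takes a pessimistic lock on each shop, and logs persistence failures, as discussed below):

```ruby
# Simplified sketch of the backfill logic, stripped of the job
# framework, locking, and logging used in the real task.
def backfill_locked_settings(shops, repository)
  shops.each do |shop|
    record = repository.find_or_initialize_by(shop_id: shop.id)
    record.locked_settings = shop.locked_settings
    record.save!
  end
end
```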

    Some comments about the backfill task: the first is that we’re placing a pessimistic lock on the Shop object prior to updating the settings record object. This is done to ensure data consistency across the old and new tables in a scenario where a double write occurs at the same time as a row update in the backfill task. The second thing to note is the use of a logger to output information in the case of a persistence failure when updating the settings record object. Logging is extremely helpful in pinpointing the cause of persistence failures in a backfill task such as this one, should they occur.

    We include some tests for the job as well. The first tests the happy path, ensuring that we're creating and updating settings records for every Shop object. The other tests the unhappy path, in which a settings record update fails, ensuring that the appropriate logs are generated.

    After writing the backfill task, we enqueue it via a Rails migration:

    Once the task has run successfully, we celebrate that the old and new data sources are in sync. It’s wise to compare the data from both tables to ensure that the two data sources are indeed in sync and that the backfill hasn’t failed anywhere.

    Step 6: Change the Methods in the Newly Defined Interface to Read Data from the New Source

    Change the reader methods to use the new data source

    The remaining steps of the Strangler Fig Pattern are fairly straightforward. Now that we have a new data source that is up to date with the old data source and is being written to reliably, we can change the reader methods in the business logic class to use the new data source via the record object. With our existing example, we only have one reader method:

    It’s as simple as changing this method to go through the record object to access locked_settings:

    Step 7: Stop Writing to the Old Source and Delete Legacy Code

    Remove the now-unused, “strangled” code from the codebase

    We’ve made it to the final step in our code strangling! At this point, all objects are accessing locked_settings through the Capital::SettingsToLock interface, and this interface is reading from and writing to the new data source via the Capital::SettingsToLockRecord model. The only thing left to do is remove the now-unused, “strangled” code from the codebase.

    In Capital::SettingsToLock, we remove the writes to the old data source in lock and unlock and get rid of the getter for shop. Let’s review what Capital::SettingsToLock looks like.

    After the changes, it looks like this:
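A hedged sketch of that final shape, with all reads and writes going through the record (whether the constructor takes a shop or just its id at this point is an assumption):

```ruby
# Hypothetical final version: no writes to the shops table remain.
module Capital
  class SettingsToLock
    def initialize(shop_id, repository:)
      @shop_id = shop_id
      @repository = repository
    end

    def lock
      record.locked_settings = true
      record.save!
    end

    def unlock
      record.locked_settings = false
      record.save!
    end

    def locked?
      record.locked_settings
    end

    private

    def record
      @record ||= @repository.find_or_initialize_by(shop_id: @shop_id)
    end
  end
end
```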

    We can remove the tests in Capital::SettingsToLockTest that assert that lock and unlock write to the shops table as well.

    Last but not least, we remove the old code from the Shop model, and drop the column from the shops table.

    With that, we’ve successfully extracted a store settings column from the Shop model using the Strangler Fig Pattern! The new system is in place, and all remnants of the old system are gone.


    In summary, we’ve followed a clear 7-step process known as the Strangler Fig Pattern to extract a portion of business logic and data from one model and move it into another:

    1. We defined the interface for the new system.
    2. We incrementally replaced reads to the old system with reads to the new interface.
    3. We defined a new table to hold the data and created a record for the business logic model to use to interface with the database.
    4. We began writing to the new data source from the new system.
    5. We backfilled the new data source with existing data from the old data source.
    6. We changed the readers in the new business logic model to read data from the new table.
    7. Finally, we stopped writing to the old data source and deleted the remaining legacy code.

    The appeal of the Strangler Fig Pattern is evident. It reduces the complexity of the refactoring journey by offering an incremental, well-defined execution plan for replacing a legacy system with new code. This incremental migration to a new system allows for constant monitoring and minimizes the chances of something breaking mid-process. With each step, developers can confidently move towards a refactored architecture while ensuring that the application is still up and tests are green. We encourage you to try out the Strangler Fig Pattern with a small system that already has good test coverage in place. Best of luck in future code-strangling endeavors!

    The Evolution of Kit: Automating Marketing Using Machine Learning

    For many Shopify business owners, whether they’ve just started their entrepreneurial journey or already have an established business, marketing is one of the essential tactics for building an audience and driving sales to their stores. At Shopify, we offer various tools to help them do marketing. One of them is Kit, a virtual employee that can create easy Instagram and Facebook ads, perform email marketing automation, and offer many other skills through powerful app integrations. I’ll talk about the engineering decisions my team made to transform Kit from a rule-based system into an artificially-intelligent assistant that looks at a business owner’s products, visitors, and customers to make informed recommendations for the next best marketing move.

    As a virtual assistant, Kit interacts with business owners through messages over various interfaces, including Shopify Ping and SMS. Designing the user experience for messaging is challenging, especially when creating marketing campaigns that involve multiple steps, such as picking products, choosing an audience, and selecting a budget. Kit not only provides a user experience that reduces the friction of creating ads, but also goes a step further to help business owners create more effective and performant ads through marketing recommendations.

    Simplifying Marketing Using Heuristic Rules

    Marketing can be daunting, especially when the number of different configurations in the Facebook Ads Manager can easily overwhelm its users.

    Facebook Ads Manager screenshot

    There is a long list of settings that need configuring including objective, budget, schedule, audience, ad format and creative. For a lot of business owners who are new to marketing, understanding all these concepts is already time consuming, let alone making the correct decision at every decision point in order to create an effective marketing campaign.

    Kit simplifies the flow by only asking for the necessary information and configuring the rest behind the scenes. The following is a typical flow on how a business owner interacts with Kit to start a Facebook ad.

    Screenshot of the conversation flow on how a business owner interacts with Kit to start a Facebook ad

    Kit simplifies the workflow into two steps: 1) pick products as ad creative and 2) choose a budget. We use heuristic rules based on our domain knowledge and give business owners limited options to guide them through the workflow. For products, we identify several popular categories that they want to market. For budget, we offer a specific range based on the spending behavior of the business owners we want to help. For the rest of configurations, Kit defaults to best practices removing the need to make decisions based on expertise.
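    To make the two-step flow concrete, here’s a minimal sketch of a heuristic rules-based configuration. The default values and budget options below are illustrative placeholders, not Kit’s actual settings: only products and budget are asked of the business owner, and everything else falls back to defaults.

```python
# Hypothetical defaults standing in for the "best practices" Kit applies
# behind the scenes; only products and budget come from the business owner.
DEFAULTS = {"objective": "conversions", "schedule_days": 7, "ad_format": "single_image"}
BUDGET_OPTIONS = [10, 25, 50]  # a fixed range shown to every business owner

def build_campaign(products, budget):
    if budget not in BUDGET_OPTIONS:
        raise ValueError("budget must be one of the offered options")
    campaign = dict(DEFAULTS)
    campaign.update({"products": products, "budget": budget})
    return campaign

campaign = build_campaign(["new-arrival-tee"], 25)
```

    The limitation is visible right in the sketch: BUDGET_OPTIONS is the same static list for every business owner, which is exactly what the machine learning work below addresses.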

    The first version of Kit was a standalone application that communicated with Shopify to extract information such as orders and products to make product suggestions and interacted with different messaging channels to deliver recommendations conversationally.

    System interaction diagram for heuristic rules based recommendation. There are two major systems that Kit interacts with: Shopify for product suggestions; messaging channels for communication with business owners

    Building Machine Learning Driven Recommendation

    One of the major limitations of the heuristic rules-based implementation was that the budget range was hardcoded into the application, so every business owner saw the same options. A static budget range may not fit everyone’s needs: more established businesses with existing store traffic and sales may want to spend more. In addition, for the many business owners who don’t have much marketing experience, choosing the right amount to generate an optimal return is a tough decision.

    Kit strives to automate marketing by reducing steps when creating campaigns. We found that budgeting is one of the most impactful criteria in contributing to successful campaigns. By eliminating the decision from the configuration flow, we reduced the friction for business owners to get started. In addition, we eliminated the first step of picking products by generating a proactive recommendation for a specific category such as new products. Together, Kit can generate a recommendation similar to the following:

    Screenshot of a one-step marketing recommendation

    To generate this recommendation, there are two major decisions Kit has to make:

    1. How much is the business owner willing to spend?
    2. Given the current state of the business owner, will the budget be enough for them to successfully generate sales?

    Kit decided that business owner Cheryl should spend about $40 for the best chance of making sales given the new products marketing opportunity. From a data science perspective, this breaks down into two types of machine learning problems:

    1. Regression: given a business owner’s historic spending behavior, predict the budget range that they’re likely to spend.
    2. Classification: given the budget a business owner has with store attributes such as existing traffic and sales that can measure the state of their stores, predict the likelihood of making sales.
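    As a sketch (not Shopify’s actual pipeline), the two problems could be set up with scikit-learn, one of the frameworks the training flow below uses. The feature columns and toy data here are invented purely for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Toy feature rows: [avg 30d ad spend, 30d marketing sales, 30d visitors, 30d orders]
X = [[10.0, 50.0, 200, 5], [80.0, 400.0, 5000, 120], [5.0, 0.0, 40, 0]]
budgets = [15.0, 90.0, 8.0]   # regression target: what each owner actually spent
made_sale = [1, 1, 0]         # classification target: did the campaign make sales?

budget_model = GradientBoostingRegressor().fit(X, budgets)    # problem 1: regression
sale_model = GradientBoostingClassifier().fit(X, made_sale)   # problem 2: classification

# For a new store: predict a likely budget, then score its chance of success.
features = [[12.0, 60.0, 300, 8]]
suggested_budget = float(budget_model.predict(features)[0])
p_sale = float(sale_model.predict_proba(features)[0][1])
```

    In a recommendation like Cheryl’s, the regressor supplies the “about $40” figure and the classifier decides whether that spend is likely to generate sales for a store in her state.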

    The heuristic rules-based system allowed Kit to collect enough data to make solving these machine learning problems possible. Using what we learned from that data, Kit can generate actionable marketing recommendations that give business owners the best chance of making sales based on their budget range and the state of their stores.

    The second version of Kit marked its first major engineering revision: we implemented proactive marketing recommendations using a machine learning architecture on Google Cloud Platform:

    Flow diagram on generating proactive machine learning driven recommendation in Kit

    There are two distinct flows in this architecture:

    Training flow: Training is responsible for building the regression and classification models that are used in the prediction flow.

    1. Aggregate all relevant features. This includes the historic Facebook marketing campaigns created by business owners through Shopify, and the store state (e.g. traffic and sales) at the time when they create the marketing campaign.
    2. Perform feature engineering, a process using domain knowledge to extract useful features from the source data that are used to train the machine learning models. For historic marketing features, we derive features such as past 30 days average ad spend and past 30 days marketing sales. For shop state features, we derive features such as past 30 days unique visitors and past 30 days total orders. We take advantage of Apache Spark’s distributed computation capability to tackle the large scale Shopify dataset.
    3. Train the machine learning models using Google Cloud’s ML Engine. ML Engine allows us to train models using various popular frameworks including scikit-learn and TensorFlow.
    4. Monitor the model metrics. Model metrics evaluate the performance of a given machine learning model by comparing predicted values against the ground truth. Monitoring validates the integrity of the feature engineering and model training by comparing model metrics against their historic values. The source features in Step 1 can sometimes break, leading to inaccurate feature engineering results. Even when the feature pipeline is intact, it’s possible that the underlying data distribution changes due to unexpected new user behavior, leading to deteriorating model performance. A monitoring process is important to keep track of historic metrics and ensure the model performs as expected before making it available for use. We employed two types of monitoring strategies: 1) threshold: alert when a model metric goes beyond a defined threshold; 2) outlier detection: alert when a model metric deviates from its normal distribution. We use z-score to detect outliers.
    5. Persist the models for prediction flow.

    Prediction flow: Prediction is responsible for generating the marketing recommendation by optimizing for the budget and determining whether or not the ad will generate sales given the existing store state.

    1. Generate marketing recommendations by making predictions using the features and models prepared in the training flow.
    2. Send recommendations to Kit through Apache Kafka.
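    The two monitoring strategies from step 4 of the training flow fit in a few lines. The metric history below is illustrative (say, a daily model metric such as AUC); the threshold and z-score limit are example values.

```python
from statistics import mean, stdev

def threshold_alert(metric, limit):
    """Strategy 1: alert when the model metric is beyond a defined threshold."""
    return metric > limit

def zscore_alert(history, latest, max_z=3.0):
    """Strategy 2: alert when the metric deviates from its normal distribution."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > max_z

history = [0.81, 0.80, 0.82, 0.79, 0.81, 0.80]  # hypothetical daily metric values
print(zscore_alert(history, 0.80))  # False: within the normal distribution
print(zscore_alert(history, 0.55))  # True: a sudden drop is flagged as an outlier
```

    The z-score check is what catches the subtler failure mode described above, where the feature pipeline is intact but the underlying data distribution has drifted.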

    At Shopify, we have a data platform engineering team to maintain the data services required to implement both the training and prediction flows. This allows the product team to focus on building the domain specific machine learning pipelines, prove product values, and iterate quickly.

    Moving to Real Time Prediction Architecture

    Looking back at our example featuring business owner Cheryl, Kit decided that she should spend $40 for the best chance of making sales. In marketing, making sales is often not the first step in a business owner’s journey, especially when they don’t have any existing traffic to their store. Acquiring new visitors in order to build lookalike audiences (audiences more relevant to the business owner) is a crucial step toward expanding the audience size and creating more successful marketing campaigns afterward. For this type of business owner, Kit evaluates the budget against a different goal and suggests a more appropriate amount for acquiring enough new visitors to build the lookalike audience. This is how the recommendation looks:

    Screenshot of a recommendation to build lookalike audience

    To generate this recommendation, there are three major decisions Kit has to make:

    1. How many new visitors does the business owner need in order to create lookalike audiences?
    2. How much are they willing to spend?
    3. Given the current state of the business owner, will the budget be enough for them to acquire those visitors?

    Decisions two and three are solved using the same machine learning architecture described previously. However, this recommendation has a new complexity: decision one requires determining the number of new visitors needed to build lookalike audiences. Since traffic to a store can change in real time, the prediction flow needs to process the request at the time the recommendation is delivered to the business owner.

    One major limitation of the Spark-based prediction flow is that recommendations are optimized in batches rather than on demand, i.e., the prediction flow is triggered from Spark on a schedule rather than from Kit at the time the recommendation is delivered to business owners. In this batch setting, it’s possible that the budget recommendation is already stale by the time it reaches the business owner. To solve that problem, we built a real time prediction service to replace the Spark prediction flow.

    Flow diagram on generating real time recommendation in Kit

    One major distinction compared to the previous Spark-based prediction flow is that Kit is proactively calling into the real time prediction service to generate the recommendation.

    1. Based on the business owner’s store state, Kit decides that their marketing objective should be building lookalike audiences. Kit sends a request to the prediction service to generate a budget recommendation; the request reaches an HTTP API exposed through the web container component.
    2. Similar to the batch prediction flow in Spark, the web container generates marketing recommendations by making predictions using the features and models prepared in the training flow. However, there are several design considerations:
      1. We need to ensure efficient access to the features to minimize prediction request latency. Therefore, once features are generated during the feature engineering stage, they are immediately loaded into a key value store using Google Cloud’s Bigtable.
      2. Model prediction can be computationally expensive, especially when the model architecture is complex. We use Google’s TensorFlow Serving, a flexible, high-performance serving system for machine learning models designed for production environments. TensorFlow Serving also provides out-of-the-box integration with TensorFlow models, so it can directly consume the models generated by the training flow with minimal configuration.
      3. Since the heaviest CPU/GPU-bound model prediction operations are delegated to TensorFlow Serving, the web container remains a lightweight application holding the business logic that generates recommendations. We chose Tornado as the Python web framework. By using non-blocking network I/O, Tornado can scale to tens of thousands of open connections for model predictions.
    3. Model predictions are delegated to the TensorFlow Serving container.
    4. The TensorFlow Serving container preloads the machine learning models generated during the training flow and uses them to perform model predictions upon request.
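    For a rough idea of the interface between the web container and TensorFlow Serving: TensorFlow Serving’s REST predict endpoint accepts a JSON body of the form {"instances": [...]}. The feature vector, model name, and address below are illustrative, not Kit’s actual values.

```python
import json

def build_predict_request(features):
    # TensorFlow Serving's REST predict endpoint expects {"instances": [...]},
    # one entry per example to score.
    return json.dumps({"instances": [features]})

body = build_predict_request([12.0, 60.0, 300, 8])
# The web container would POST this body to something like:
#   http://tensorflow-serving:8501/v1/models/budget_model:predict
print(body)
```

    Because the request is just JSON over HTTP, the Tornado web container can issue it with non-blocking I/O and stay free to handle other recommendation requests while TensorFlow Serving does the heavy computation.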

    Powering One Third of All Kit Marketing Campaigns

    Kit started as a heuristic rules-based application that used common best practices to simplify and automate marketing for Shopify’s business owners. We progressively improved the user experience by building machine learning driven recommendations to further reduce user friction and to optimize budgets, giving business owners a better chance of creating a successful campaign. By first using an established Spark-based prediction process that’s well supported within Shopify, we proved the value of machine learning in driving user engagement and marketing results. This also allowed us to focus on productionizing an end-to-end machine learning pipeline, with both training and prediction flows, that serves tens of thousands of business owners.

    We learned that having a proper monitoring component in place is crucial to ensure the integrity of the overall machine learning system. We moved to an advanced real time prediction architecture to solve use cases that required time-sensitive recommendations. Although the real time prediction service introduced two additional containers (web and TensorFlow Serving) to maintain, we delegated the most heavy-lifting model prediction component to TensorFlow Serving, which is a well supported service by Google and integrates with Shopify’s existing cloud infrastructure easily. This ease of use allowed us to focus on defining and implementing the core business logic to generate marketing recommendation in the web container.

    Moving to a machine learning driven implementation has proven valuable: one third of the marketing campaigns in Kit are powered by machine learning driven recommendations. Kit will continue to improve its marketing automation skills by optimizing for different marketing tactics and objectives in order to support business owners’ diverse needs.

    We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Data Science & Engineering career page.

    Creating Native Components That Accept React Native Subviews

    React Native adoption has been steadily growing since its release in 2015, especially with its ability to quickly create cross-platform apps. A very strong open-source community has formed, producing great libraries like Reanimated and Gesture Handler that allow you to achieve native performance for animations and gestures while writing exclusively React Native code. At Shopify we are using React Native for many different types of applications, and are committed to giving back to the community.

    However, sometimes there’s a native component (one you made for another app, or one that already exists on the platform) that you want to quickly port to React Native and can’t build cross-platform using exclusively React Native code. The documentation for React Native has good examples of how to create a native module that exposes native methods or components, but what should you do if you want to use a component you already have and render React Native views inside of it? In this guide, I’ll show you how to make a native component that provides bottom sheet functionality to React Native and lets you render React views inside of it.

    A simple example is the bottom sheet pattern from Google’s Material Design. It’s a draggable view which peeks up from the bottom of the screen and is able to expand to take up the full screen. It renders subviews inside of the sheet, which can be interacted with when the sheet is expanded.

    This guide only focuses on an Android native implementation and assumes a basic knowledge of Kotlin. When creating an application, it’s best to make sure all platforms have the same feature parity.

    Bottom sheet functionality

    Setting Up Your Project

    If you already have a React Native project set up for Android with Kotlin and TypeScript, you’re ready to begin. If not, you can run react-native init NativeComponents --template react-native-template-typescript in your terminal to generate a project that is ready to go.

    As part of the initial setup, you’ll need to add some Gradle dependencies to your project.

    Modify the root build.gradle (android/build.gradle) to include these lines:

    Make sure to substitute your current Kotlin version in the place of 1.3.61.

    This will add all of the required libraries for the code used in the rest of this guide.

    You should use fixed version numbers instead of + for actual development.

    Creating a New Package Exposing the Native Component

    To start, you need to create a new package that will expose the native component. Create a file called NativeComponentsReactPackage.kt.

    Right now this doesn’t actually expose anything new, but you’ll add to the list of View Managers soon. After creating the new package, go to your Application class and add it to the list of packages.

    Creating The Main View

    A ViewGroupManager<T> can be thought of as a React Native version of ViewGroup from Android. It accepts any number of children provided, laying them out according to the constraints of the type T specified on the ViewGroupManager.

    Create a file called ReactNativeBottomSheet.kt containing a new ViewGroupManager.

    The basic methods you have to implement are getName() and createViewInstance().

    name is what you’ll use to reference the native class from React Native.

    createViewInstance is used to instantiate the native view and do initial setup.

    Inflating Layouts Using XML

    Before you create a real view to return, you need to set up a layout to inflate. You can set this up programmatically, but it’s much easier to inflate from an XML layout.

    Here’s a fairly basic layout file that sets up some CoordinatorLayouts with behaviours for interacting with gestures. Add this to android/app/src/main/res/layout/bottom_sheet.xml.

    The first child is where you’ll put all of the main content for the screen, and the second is where you’ll put the views you want inside BottomSheet. The behaviour is defined so that the second child can translate up from the bottom to cover the first child, making it appear like a bottom sheet.

    Now that there is a layout created, you can go back to the createViewInstance method in ReactNativeBottomSheet.kt.

    Referencing The New XML File

    First, inflate the layout using the context provided from React Native. Then save references to the children for later use.

    If you aren’t using Kotlin Synthetic Properties, you can do the same thing with container = findViewById(

    For now, this is all you need to initialize the view and have a fully functional bottom sheet.

    The only thing left to do in this class is to manage how the views passed from React Native are actually handled.

    Handling Views Passed from React Native To Native Android

    By overriding addView you can change where the views are placed in the native layout. The default implementation is to add any views provided as children to the main CoordinatorLayout. However, that won’t have the effect expected, as they’ll be siblings to the bottom sheet (the second child) you made in the layout.

    Instead, don’t make use of super.addView(parent, child, index) (the default implementation), but manually add the views to the layout’s children by using the references stored earlier.

    The basic idea followed is that the first child passed in is expected to be the main content of the screen, and the second child is the content that’s rendered inside of the bottom sheet. Do this by simply checking the current number of children on the container. If you already added a child, add the next child to the bottomSheet.

    The way this logic is written, any views passed after the first one will be added to the bottom sheet. You’re designing this class to only accept two children, so you’ll make some modifications later.

    This is all you need for the first version of our bottom sheet. At this point, you can run react-native run-android, successfully compile the APK, and install it.

    Referencing the New Native Component in React Native

    To use the new native component in React Native you need to require it and export a normal React component. Also set up the props here, so it will properly accept a style and children.

    Create a new component called BottomSheet.tsx in your React Native project and add the following:

    Now you can update your basic App.tsx to include the new component.

    This is all the code that is required to use the new native component. Notice that you're passing it two children. The first child is the content used for the main part of the screen, and the second child is rendered inside of our new native bottom sheet.

    Adding Gestures

    Now that there's a working native component that renders subviews from React Native, you can add some more functionality.

    Being able to interact with the bottom sheet through gestures is our main use case for this component, but what if you want to programmatically collapse/expand the bottom sheet?

    Since you’re using a CoordinatorLayout with a behaviour to make the bottom sheet in native code, you can make use of BottomSheetBehaviour. Going back to ReactNativeBottomSheet.kt, you’ll update the createViewInstance() method.

    By creating a BottomSheetBehaviour you can make more customizations to how the bottom sheet functions and when you’re informed about state changes.

    First, add a native method which specifies what the expanded state of the bottom sheet should be when it renders.

    This adds a prop to our component called sheetState which takes a string and sets the collapsed/expanded state of the bottom sheet based on the value sent. The string sent should be either collapsed or expanded.

    We can adapt our TypeScript to accept this new prop like so:

    Now, when you include the component, you can change whether it’s collapsed or expanded without touching it. Here’s an example of updating your App.tsx to add a button that updates the bottom sheet state.

    Now, when pressing the button, it expands the bottom sheet. However, when it’s expanded, the button disappears. If you drag the bottom sheet back down to a collapsed state, you'll notice that the button isn't updating its text. So you can set the state programmatically from React Native, but interacting with the native component isn't propagating the value of the bottom sheet's state back into React. To fix this you will add more to the BottomSheetBehaviour you created earlier.

    This code adds a state change listener to the bottom sheet, so that when its collapsed/expanded state changes, you emit a React Native event that you listen to in the React component. The event is called "BottomSheetStateChange" and has the same value as the states accepted in setSheetState().

    Back in the React component, you listen to the emitted event and call an optional listener prop to notify the parent that our state has changed due to a native interaction.

    Updating the App.tsx again

    Now when you drag the bottom sheet, the state of the button updates with its collapsed/expanded state.

    Native Code And Cross Platform Components

    When creating components in React Native our goal is always to make cross-platform components that don’t require native code to perform well, but sometimes that isn’t possible or easy to do. By creating ViewGroupManager classes, we are able to extend the functionality of our native components so that we can take full advantage of React Native’s flexible layouts, with very little code required.

    Additional Information

    All the code included in the guide can be found at the react-native-bottom-sheet-example repo.

    This guide is just an example of how to create native views that accept React Native subviews as children. If you want a complete implementation for bottom sheets on Android, check out the react-native wrapper for android BottomSheetBehavior.

    You can follow the Android guideline for CoordinatorLayout and BottomSheetBehaviour to better understand what is going on. You’re essentially creating a container with two children.

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Your Circuit Breaker is Misconfigured

    Circuit breakers are an incredibly powerful tool for making your application resilient to service failure. But they aren’t enough. Most people don’t know that a slightly misconfigured circuit is as bad as no circuit at all! Did you know that a change in 1 or 2 parameters can take your system from running smoothly to completely failing?

    I’ll show you how to predict how your application will behave in times of failure and how to configure every parameter for your circuit breaker.

    At Shopify, resilient fallbacks integrate into every part of the application. A fallback is a backup behavior that activates when a particular component or service is down. For example, when the Redis that stores Shopify’s sessions is down, the user doesn’t have to see an error. Instead, the problem is logged and the page renders with sessions soft-disabled. This results in a much better customer experience. This behaviour is achievable in many cases; however, it’s not as simple as catching exceptions raised by a failing service.

    Imagine Redis is down and every connection attempt is timing out. Each timeout is 2 seconds long. Response times will be incredibly slow, since requests are waiting for the service to timeout. Additionally, during that time the request is doing nothing useful and will keep the thread busy.

    Utilization, the percentage of a worker’s maximum available working capacity, increases indefinitely as the request queue builds up, resulting in a utilization graph like this:

    Utilization during service outage

    A worker that had a request processing rate of 5 requests per second can now only process half a request per second. That’s a tenfold decrease in throughput! With utilization this high, the service can be considered completely down. This is unacceptable by production-level standards.
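    The arithmetic behind that claim is worth making explicit; the numbers match the example above (2-second timeouts, a worker that normally serves 5 requests per second).

```python
baseline_rps = 5        # healthy worker: 5 requests per second
timeout_s = 2           # every call to the dead service blocks for 2 seconds

# Each request now pins the thread for the full timeout, so throughput
# collapses to one request per timeout period.
degraded_rps = 1 / timeout_s
print(degraded_rps)                 # 0.5 requests per second
print(baseline_rps / degraded_rps)  # 10.0: the tenfold decrease
```
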

    Semian Circuit Breaker

    At Shopify, this fallback utilization problem is solved by Semian Circuit Breaker. This is a circuit breaker implementation written by Shopify. The circuit breaker pattern is based on a simple observation: if a timeout is observed for a given service one or more times, it’s likely to continue to timeout until that service recovers. Instead of hitting the timeout repeatedly, the resource is marked as dead and an exception is raised instantly on any call to it.
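    The pattern can be sketched in a few dozen lines. This is a toy illustration of the behaviour just described, not Semian’s actual API, and it omits the half-open/success_threshold mechanics that the parameters below cover.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Toy sketch of the circuit breaker pattern -- not Semian's real API."""

    def __init__(self, error_threshold=3, error_timeout=10.0):
        self.error_threshold = error_threshold  # errors needed to open the circuit
        self.error_timeout = error_timeout      # error window / open duration, seconds
        self.errors = []                        # timestamps of recent failures
        self.opened_at = None

    def call(self, func):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.error_timeout:
                raise CircuitOpenError("resource marked dead")  # fail instantly
            self.opened_at = None               # window elapsed: let a request through
        try:
            return func()
        except Exception:
            # Keep only errors inside the window, record this one, maybe open.
            self.errors = [t for t in self.errors if now - t < self.error_timeout]
            self.errors.append(now)
            if len(self.errors) >= self.error_threshold:
                self.opened_at = now
            raise
```

    After error_threshold failures within the window, call() raises CircuitOpenError immediately instead of spending another two seconds waiting on a timeout, which is what keeps utilization bounded during an outage.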

    I'm looking at this from the configuration perspective of the Semian circuit breaker, but another notable circuit breaker library is Netflix’s Hystrix. The core functionality of their circuit breaker is the same; however, it has fewer parameters available for tuning, which means, as you’ll learn below, it can completely lose its effectiveness for capacity preservation.

    A circuit breaker can take the above utilization graph and turn it into something more stable.

    Utilization during service outage with a circuit breaker

    The utilization climbs for some time before the circuit breaker opens. Once open, the utilization stabilizes so the user may only experience some slight request delays which is much better.

    Semian Circuit Breaker Parameters

    Configuring a circuit breaker isn’t a trivial task. It seems trivial because there are just a few parameters to tune:

    • name
    • error_threshold
    • error_timeout
    • half_open_resource_timeout
    • success_threshold

    However, these parameters cannot just be assigned to arbitrary numbers or even best guesses without understanding how the system works in detail. Changes to any of these parameters can greatly affect the utilization of the worker during a service outage.

    At the end, I'll show you a configuration change that drops the utilization requirement from 263% to 4%. That’s the difference between a complete outage and a slight delay. But before I get to that, let’s dive into what each parameter does and how it affects the circuit breaker.


    Name

    The name identifies the resource being protected. Each name gets its own personal circuit breaker. Every different service type, such as MySQL, Redis, etc., should have its own unique name to ensure that excessive timeouts in a service only open the circuit for that service.

    There is an additional aspect to consider here. The worker may be configured with multiple service instances for a single type. In certain environments, there can be dozens of Redis instances that a single worker can talk to.

    We would never want a single Redis instance outage to cause all Redis connections to go down so we must give each instance a different name.

    For this example, see the diagram below. We will model a total of 3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”.

    3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”

    You must understand how many services your worker can talk to. Each failing service will have an aggregating effect on the overall utilization. When going through the examples below, the maximum number of failing services you would like to account for is defined by failing_services. For example, if you have 3 Redis instances, but you only need to know the utilization when 2 of those go down, failing_services should be 2.

    All examples and diagrams in this post are from the reference frame of a single worker. None of the circuit breaker state is shared across workers so we can simplify things this way.
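    As a sketch, per-instance naming could look like the following Semian registration. The instance names follow the “redis_cache_#{instance_number}” scheme above; the option values and exception list are illustrative placeholders, not Shopify’s production settings:

```ruby
require "semian"

# Illustrative only: register one circuit breaker per Redis instance so that
# an outage of redis_cache_1 does not open the circuits of the other instances.
3.times do |i|
  Semian.register(
    :"redis_cache_#{i + 1}",
    bulkhead: false,              # circuit breaker only, no bulkheading
    exceptions: [Timeout::Error], # placeholder; use your client's timeout error
    error_threshold: 3,
    error_timeout: 5,
    success_threshold: 1,
  )
end
```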

    Error Threshold

    The error_threshold defines the number of errors to encounter within a given timespan before opening the circuit and starting to reject requests instantly. If the circuit is closed and error_threshold number of errors occur within a window of error_timeout, the circuit will open.

    The larger the error_threshold, the longer the worker will be stuck waiting for input/output (I/O) before reaching the open state. The following diagram models a simple scenario where we have a single Redis instance failure.

    error_threshold = 3, failing_services = 1

    A single Redis instance failure

    3 timeouts happen one after the other for the failing service instance. After the third, the circuit becomes open and all further requests raise instantly.

    3 timeouts must occur during the timespan before the circuit becomes open. Simple enough, 3 times the timeout isn’t so bad. The utilization will spike, but the service will reach steady state soon after. This graph is a real world example of this spike at Shopify:

    A real world example of a utilization spike at Shopify

    The utilization begins to increase when the Redis service goes down. After a few minutes, the circuit begins opening for each failing service and the utilization lowers to a steady state.

    Furthermore, if there’s more than 1 failing service instance, the spike will be larger, last longer, and cause more delays for end users. Let’s come back to the example from the Name section with 3 separate Redis instances. Consider all 3 Redis instances being down. Suddenly the time until all circuits are open triples.

    error_threshold = 3, failing_services = 3

    3 failing services and each service has 3 timeouts before the circuit opens

    There are 3 failing services and each service has 3 timeouts before the circuit opens. All the circuits must become open before the worker will stop being blocked by I/O.

    Now, we have a longer time to reach steady state because each circuit breaker wastes utilization waiting for timeouts. Imagine 40 Redis instances instead of 3: with a timeout of 1 second and an error_threshold of 3, there’s a minimum of around 2 minutes to open all the circuits.

    The reason this estimate is a minimum is because the order that the requests come in cannot be guaranteed. The above diagram simplifies the scenario by assuming the requests come in a perfect order.
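    That back-of-the-envelope minimum can be sketched as a quick calculation (this is plain arithmetic, not Semian code):

```ruby
# Minimum time for every circuit to open, assuming requests arrive in the
# worst-case perfect order described above (single thread, sequential timeouts).
failing_services = 40 # Redis instances that are down
service_timeout  = 1  # seconds per timed-out request
error_threshold  = 3  # timeouts needed to open one circuit

min_seconds = failing_services * service_timeout * error_threshold
# 40 * 1 * 3 = 120 seconds, about 2 minutes before all circuits are open
```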

    To keep the initial utilization spike low, the error_threshold should be reduced as much as possible. However, the probability of false positives must be considered. Blips can cause the circuit to open despite the service not being down. The lower the error_threshold, the higher the probability of a false-positive circuit open.

    Assuming a steady-state timeout error rate of 0.1% within your error_timeout window, an error_threshold of 3 gives you a 0.0000001% chance of getting a false positive:

    100 × (probability_of_failure)^(number_of_failures) = 100 × (0.001)^3 = 0.0000001%

    You must balance this probability with your error_timeout to reduce the number of false-positive circuit opens. When the circuit opens, every request made during error_timeout raises instantly.
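    The same false-positive probability, expressed as a calculation (treating timeouts within the window as independent events, which is a simplification):

```ruby
# Chance that error_threshold timeouts occur within one error_timeout window
# purely by chance, given a 0.1% steady-state timeout rate.
probability_of_failure = 0.001 # 0.1% steady-state timeout rate
error_threshold        = 3

false_positive_pct = 100 * probability_of_failure**error_threshold
# => ~0.0000001%
```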

    Error Timeout

    The error_timeout is the amount of time until the circuit breaker will try to query the resource again. It also determines the period to measure the error_threshold count. The larger this value is, the longer the circuit will take to recover after an outage. The larger this value is, the longer a false positive circuit open will affect the system.

    error_threshold = 3, failing_services = 1

    The circuit will stay open for error_timeout amount of time

    After the failing service causes the circuit to open, the circuit stays open for error_timeout amount of time. The Redis instance comes back to life and error_timeout amount of time passes, so requests start flowing to Redis again.

    It’s important to consider the error_timeout in relation to half_open_resource_timeout. These 2 parameters are the most important for your configuration. Getting them right will determine the success of the circuit breaker’s resiliency mechanism in times of outage for your application.

    Generally we want to minimize the error_timeout because the higher it is, the higher the recovery time. However, the primary constraints come from its interaction with these parameters. I’ll show you that maximizing error_timeout will actually preserve worker utilization.

    Half Open Resource Timeout

    The circuit is in the half-open state when it’s checking to see if the service is back online. It does this by letting a real request through. The circuit becomes half-open after error_timeout amount of time has passed. When a service is completely down for an extended period of time, a steady-state behavior arises.

    failing_services = 1

    Circuit becomes half-open after error_timeout amount of time has passed

    The error_timeout expires but the service is still down. The circuit becomes half-open and a request is sent to Redis, which times out. The circuit opens again and the process repeats as long as Redis remains down.

    This flip flop between the open and half-open state is periodic which means we can deterministically predict how much time is wasted on timeouts.

    By this point, you may already be speculating on how to adjust wasted utilization. The error_timeout can be increased to reduce the total time wasted in the half-open state! Awesome — but the higher it goes, the slower your application will be to recover. Furthermore, false positives will keep the circuit open for longer. Not good, especially if we have many service instances. 40 Redis instances with a timeout of 1 second is 40 seconds every cycle wasted on timeouts!

    So how else do we minimize the time wasted on timeouts? The only other option is to reduce the service timeout. The lower the service timeout, the less time is wasted on waiting for timeouts. However, this cannot always be done. Adjusting this timeout is highly dependent on how long the service needs to provide the requested data. We have a fundamental problem here. We cannot reduce the service timeout because of application constraints and we cannot increase the error_timeout because the recovery time will be too slow.

    Enter half_open_resource_timeout, the timeout for the resource when the circuit is in the half-open state. It gets used instead of the original timeout. Simple enough! Now, we have another tunable parameter to help adjust utilization. To reduce wasted utilization, error_timeout and half_open_resource_timeout can be tuned. The smaller half_open_resource_timeout is relative to error_timeout, the better the utilization will be.

    If we have 3 failing services, our circuit diagram looks something like this:

    failing_services = 3

    A total of 3 timeouts before all the circuits are open

    In the half-open state, each service has 1 timeout before the circuit opens. With 3 failing services, that’s a total of 3 timeouts before all the circuits are open. All the circuits must become open before the worker will stop being blocked by I/O.

    Let’s solidify this example with the following timeout parameters:

    error_timeout = 5 seconds
    half_open_resource_timeout = 1 second

    The total steady-state period will be 8 seconds, with 5 of those seconds spent doing useful work and the other 3 wasted waiting for I/O. That’s 37.5% of total utilization wasted on I/O.
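    Those numbers can be checked with a quick calculation of the steady-state cycle (plain arithmetic, not Semian code):

```ruby
# One steady-state cycle: error_timeout of useful work, plus one half-open
# probe timeout per failing service.
error_timeout              = 5.0 # seconds of useful work per cycle
half_open_resource_timeout = 1.0 # seconds wasted per half-open probe
failing_services           = 3

wasted     = failing_services * half_open_resource_timeout # 3.0 seconds
period     = error_timeout + wasted                        # 8.0 seconds
wasted_pct = 100 * wasted / period                         # 37.5
```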

    Note: Hystrix does not have an equivalent parameter for half_open_resource_timeout which may make it impossible to tune a usable steady state for applications that have a high number of failing_services.

    Success Threshold

    The success_threshold is the number of consecutive successes required for the circuit to close again, that is, to start accepting all requests to the circuit.

    The success_threshold impacts the behavior during outages which have an error rate of less than 100%. Imagine a resource error rate of 90%: with a success_threshold of 1, the circuit will flip flop between open and closed quite often. In this case there’s a 10% chance of it closing when it shouldn’t. Flip flopping also adds additional strain on the system, since the circuit must spend time on I/O to re-close.

    Instead, if we increase the success_threshold to 3, the likelihood of an incorrect close becomes significantly lower. Now, 3 successes must happen in a row to close the circuit, reducing the chance of flip flop to 0.1% per cycle.
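    The 0.1% figure follows from treating each half-open probe as an independent trial (a simplifying assumption):

```ruby
# With a 90% error rate, each half-open probe succeeds 10% of the time.
# Closing the circuit requires success_threshold consecutive successes.
success_rate      = 0.10
success_threshold = 3

flip_flop_pct = 100 * success_rate**success_threshold
# => ~0.1% chance per cycle of closing when it shouldn't
```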

    Note: Hystrix does not have an equivalent parameter for success_threshold which may make it difficult to reduce the flip flopping in times of partial outage for certain applications.

    Lowering Wasted Utilization

    Each parameter affects wasted utilization in some way. Semian can easily be configured into a state where a service outage will consume more utilization than the capacity allows. To calculate the additional utilization required, I have put together an equation to model all of the parameters of the circuit breaker. Use it to plan your outage effectively.

    The Circuit Breaker Equation

    The Circuit Breaker Equation

    This equation applies to the steady state failure scenario in the last diagram where the circuit is continuously checking the half-open state. Additional threads reduce the time spent on blocking I/O, however, the equation doesn’t account for the time it takes to context switch a thread which could be significant depending on the application. The larger the context switch time, the lower the thread count should be.

    I ran a live test to validate the equation, and the observed utilization closely matched the utilization it predicted.

    Tuning Your Circuit

    Let’s run through an example and see how the parameters can be tuned to match the application needs. In this example, I’m integrating a circuit breaker for a Rails worker configured with 2 threads. We have 42 Redis instances, each configured with its own circuit and a service timeout of 0.25s.

    As a starting point, let’s go with the following parameters. failing_services is 42 because we are judging behaviour in the worst case, when all of the Redis instances are down.

    Parameter  Value
    service timeout  0.25 seconds
    error_timeout  2 seconds
    half_open_resource_timeout  0.25 seconds (same as service timeout)

    Plugging into The Circuit Breaker Equation, we require an extra utilization of 263%. Unacceptable! Ideally we should have something less than 30% to account for regular traffic variation.

    So what do we change to drop this number?

    From production observation metrics, I know 99% of Redis requests have a response time of less than 50ms. With a value this low, we can easily drop the half_open_resource_timeout to 50ms and still be confident that the circuit will close when Redis comes back up from an outage. Additionally, we can increase the error_timeout to 30 seconds. This means a slower recovery time, but it reduces the worst-case utilization.

    With these new numbers, the additional utilization required drops to 4%!
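    The Circuit Breaker Equation itself appears above only as an image, so here is an estimate reconstructed from the worked numbers in this post; treat it as a sketch consistent with the 263% and 4% figures, not the published equation:

```ruby
# Rough steady-state estimate: each cycle, every failing service wastes one
# half_open_resource_timeout of blocking I/O, spread across the worker's
# threads, relative to error_timeout of useful time.
def extra_utilization_pct(failing_services:, half_open_resource_timeout:,
                          error_timeout:, threads:)
  100.0 * failing_services * half_open_resource_timeout /
    (error_timeout * threads)
end

extra_utilization_pct(failing_services: 42, half_open_resource_timeout: 0.25,
                      error_timeout: 2, threads: 2)
# => 262.5, the ~263% starting point

extra_utilization_pct(failing_services: 42, half_open_resource_timeout: 0.05,
                      error_timeout: 30, threads: 2)
# => ~3.5, the ~4% after tuning
```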

    I use this equation as something concrete to relate back to when making tuning decisions. I hope this equation helps you with your circuit breaker configuration as it does with mine.

    Author's Edit: "I fixed an error with the original circuit breaker equation in this post. success_threshold does not have an impact on the steady state utilization because it only takes 1 error to keep the circuit open again."

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.


    Great Code Reviews—The Superpower Your Team Needs

    There is a general consensus that code reviews are an important aspect of highly effective teams. This research paper is one of many exploring this subject. Most organizations undergo code reviews of some form.

    However, it’s all too common to see code reviews that barely scratch the surface, or that offer feedback that is unclear or hard to act upon. This robs the team the opportunity to speed up learning, share knowledge and context, and raise the quality bar on the resulting code.

    At Shopify, we want to move fast while building for the long term. In our experience, having strong code review practices has a huge impact on the growth of our engineers and in the quality of the products we build.

    A Scary Scenario

    Imagine you join a new team and you’re given a coding task to work on. Since you’re new on the team, you really want to show what you’re made of. You want to perform. So, this is what you do:

    1. You work frantically on your task for 3 weeks.
    2. You submit a Pull Request for review with about 1000 new lines of code.
    3. You get a couple comments about code style and a question that shows the person has no clue what this work is about.
    4. You get approval from both reviewers after fixing the code style and answering the question.
    5. You merge your branch into master, eyes closed, shoulders tense, grinding your teeth. After a few minutes, CI completes. Master is not broken. Yet.
    6. You live in fear for 6 months, not knowing when and how your code will break.

    You may have lived through some of the situations above, and hopefully you’ve seen some of the red flags in that process.

    Let’s talk about how we can make it much better.

    Practical Code Review Practices

    At Shopify, we value the speed of shipping, learning, and building for the long term. These values - which sometimes conflict - lead us to experiment with many techniques and team dynamics. In this article, I have distilled a series of very practical techniques we use at Shopify to ship valuable code that can stand the test of time.

    A Note about terminology: We refer to Pull Requests (PR) as one unit of work that's put forth for review before merging into the base branch. Github and Bitbucket users will be familiar with this term.

    1. Keep Your Pull Requests Small

    As simple as this sounds, this is easily the most impactful technique you can follow to level up your code review workflow. There are 2 fundamental reasons why this works:

    • It’s mentally easier to start and complete a review for a small piece. Larger PRs will naturally make reviewers delay and procrastinate examining the work, and they are more likely to be interrupted mid-review.
    • As a reviewer, it’s exponentially harder to dive deep if the PR is long. The more code there is to examine, the bigger the mental map we need to build to understand the whole piece.

    Breaking up your work in smaller chunks increases your chances of getting faster and deeper reviews.

    Now, it’s impossible to set one universal standard that applies to all programming languages and all types of work. Internally, for our data engineering work, the guideline is around 200-300 lines of code affected. If we go above this threshold, we almost always break up the work into smaller blocks.

    Of course, we need to be careful about breaking up PRs into chunks that are too small, since this means reviewers may need to inspect several PRs to understand the overall picture.

    2. Use Draft PRs

    Have you heard the metaphor of building a car vs. drawing a car? It goes something like this:

    1. You’re asked to build a car.
    2. You go away for 6 months and build a beautiful Porsche.
    3. When you show it to your users, they ask about space for their 5 children and the surf boards.

    Clearly, the problem here is that the goal is poorly defined and the team jumped directly into the solution before gathering enough feedback. If after step 1 we had created a drawing of the car and shown it to our users, they would have asked the same questions, and we would have discovered their expectations and saved ourselves 6 months of work. Software is no different—we can make the same mistake and work for a long time on a feature or module that isn't what our users need.

    At Shopify, it’s common practice to use Work In Progress (WIP) PRs to elicit early feedback whose goal is validating direction (choice of algorithm, design, API, etc). Early changes mean less wasted effort on details, polish, documentation, etc.

    As an author, this means you need to be open to changing the direction of your work. At Shopify, we try to embrace the principle of strong opinions, loosely held. We want people to make decisions confidently, but also be open to learning new and better alternatives, given sufficient evidence. In practice, we use Github’s Draft PRs—they clearly signal the work is still in flux, and Github prevents you from merging a Draft PR. Other tools may have similar functionality, but at the very least you can create normal PRs with a clear WIP label to indicate the work is early stage. This will help your reviewers focus on offering the right type of feedback.

    3. One PR Per Concern

    In addition to line count, another dimension to consider is how many concerns your unit of work is trying to address. A concern may be a feature, a bugfix, a dependency upgrade, an API change, etc. Are you introducing a new feature while refactoring at the same time? Fixing two bugs in one shot? Introducing a library upgrade and a new service?

    Breaking down PRs into individual concerns has the following effects:

    • More independent review units and therefore better review quality
    • Fewer affected people, and therefore fewer domains of expertise to gather
    • Atomicity of rollbacks, the ability of rolling back a small commit or PR. This is valuable because if something goes wrong, it will be easier to identify where errors were introduced and what to roll back.
    • Separating easy stuff from hard stuff. Imagine a new feature that requires refactoring a frequently used API. You change the API, update a dozen call-sites, and then implement your feature. 80% of your changes are obvious and skimmable with no functional changes, while 20% are new code that needs careful attention to test coverage, intended behaviour, error handling, etc. and will likely go through multiple revisions. With each revision, the reviewer will need to skim through all of the changes to find the relevant bits. By splitting this in two PRs, it becomes easy to quickly land the majority of the work and to optimize the review effort applied to the harder work.

    If you end up with a PR that includes more than one concern, you can break it down into individual chunks. Doing so will accelerate the iteration cycle on each individual review, giving a faster review overall. Often part of the work can land quickly, avoiding code rot and merge conflicts.

    Breaking down PRs into individual concerns

    In the example above, we’ve taken a PR that covered three different concerns and broke it up. You can see how each reviewer has strictly less context to go over. Best of all, as soon as any of the reviews is complete, the author can begin addressing feedback while continuing to wait for the rest of the work. In the most extreme cases, instead of completing a first draft, waiting several days (and shifting focus), and then eventually returning to address feedback, the author can work almost continuously on their family of PRs as they receive the different reviews asynchronously.

    4. Focus on the Code, Not the Person

    The focus on the code, not the person practice refers to communication styles and relationships between people. Fundamentally, it’s about focusing on making the product better, and avoiding the author perceiving a review as personal criticism.

    Here are some tips you can follow:

    • As a reviewer, think, “This is our code, how can we improve on it?”
    • Offer positive remarks! If you see something done well, comment on it. This reinforces good work and helps the author balance suggestions for improvement.
    • As an author, assume best intention, and don’t take comments personally.

    Below are a few examples of not-so-great review comments, and a suggestion on how we can reword to emphasize the tips above.

    Less of These → More of These

    • “Move this to Markdown” → “How about moving this documentation into our Markdown README file? That way we can more easily share with other users.”
    • “Read the Google Python style guidelines” → “We should avoid single-character variables. How about board_size or size instead?”
    • “This feels too slow. Make it faster. Lightning fast.” → “This algorithm is very easy to read but I’m concerned about performance. Let’s test this with a large dataset to gauge its efficiency.”
    • “Bool or int?” → “Why did you choose a list of bool values instead of integers?”

    Ultimately, a code review is a learning and teaching opportunity and should be celebrated as such.

    5. Pick the Right People to Review

    It’s often challenging to decide who should review your work. Here are some questions you can use as guidance:

    • Who has context on the feature or component you’re building?
    • Who has strong skills in the language, framework, or tool you’re using?
    • Who has strong opinions on the subject?
    • Who cares about the result of what you’re doing?
    • Who should learn this stuff? Or if you’re a junior reviewing someone more senior, use this as an opportunity to ask questions and learn. Ask all the silly questions; a strong team will find the time to share knowledge.

    Whatever rules your team might have, remember that it is your responsibility as an author to seek and receive a high-quality code review from a person or people with the right context.

    6. Give Your Reviewers a Map

    Last but definitely not least, the description on your PR is crucial. Depending on who you picked for review, different people will have different context. The onus is on the author to help reviewers by providing key information or links to more context so they can produce meaningful feedback.

    Some questions you can include in your PR templates:

    • Why is this PR necessary?
    • Who benefits from this?
    • What could go wrong?
    • What other approaches did you consider? Why did you decide on this approach?
    • What other systems does this affect?

    Good code is not only bug-free; it is also useful! As an author, ensure that your PR description ties your code back to your team’s objectives, ideally with a link to a feature or bug description in your backlog. As a reviewer, start with the PR description; if it’s incomplete, send it back before attempting to judge the suitability of the code against undefined objectives. And remember, sometimes the best outcome of a code review is to realize that the code isn’t needed at all!

    What’s the Benefit?

    By adopting some of the techniques above, you can have a strong impact on the speed and quality of your software building process. But beyond that, there’s the potential for a cultural effect:

    • Teams will build a common understanding. The group understands your work better and you’re not the only person capable of evolving any one area of the codebase.
    • Teams will adopt a sense of shared responsibility. If something breaks, it’s not one person’s code that needs fixing. It’s the team’s work that needs fixing.

    Any one person in a team should be able to take a holiday and disconnect from work for a number of days without risking the business or stressing about checking email to make sure the world didn’t end.

    What Can I Do to Improve My Team’s Code Review Process?

    If you lead teams, start experimenting with these techniques and find what works for your team.

    If you’re an individual contributor, discuss with your lead why you think code review techniques are important, how they improve effectiveness, and how they help your team.

    Bring this up in your next 1:1 or your next team sync.

    The Importance of Code Reviews

    To close, I’ll share some words from my lead, which summarizes the importance of Code Reviews:

    “We could prioritize landing mediocre but working code in the short term, and we will write the same debt-ridden code forever, or we can prioritize making you a stronger contributor, and all of your future contributions will be better (and your career brighter).

    An enlightened author should be delighted to have this attention.”

    We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Data Science & Engineering career page.


    Bug Bounty Year in Review 2019

    For the third year in a row, we’ve taken time to reflect on our Bug Bounty program. This past year was an exciting one for us because we ran multiple experiments and made a number of process improvements to increase our program speed. 

    2020 Program Improvements

    Building on our program’s continued success in 2019, we’re excited to announce more improvements. 

    Bounties Paid in Full Within 7 Days

    As of today, we pay bounties in full within 7 days of a report being triaged. Paying our program minimum on triage has been a resounding success for us and our hackers. After having experimented with paying full bounties on triage in Shopify-Experiments (described below), we’ve decided to make the same change to our public program.

    Maximum Bounty is Now $50,000

    We are increasing our maximum bounty amount to $50,000. Beginning today, we are doubling the bounty amounts for valid reports of Arbitrary Code Execution, now $20K–$50K; SQL Injection, now $20K–$40K; and Privilege Escalation to Shop Owner, now $10K–$30K. Trust and security is our number one priority at Shopify and these new amounts demonstrate our commitment to both.

    Surfacing More Information About Duplicate Reports

    Finally, we know how important it is for hackers to trust the programs they choose to work with. We value that trust. So, beginning today, anyone who files a duplicate report to our program will be added to the original report, when it exists within HackerOne. We're continuing to explore ways to share information about internally known issues with hackers and hope to have a similar announcement later this year.

    Learning from Bug Bounty Peers

    Towards the end of 2018, we reached out to other bug bounty programs to share experiences and lessons learned. This was amazing. We learned so much chatting with our peers and those conversations gave us better insight into improving our data analytics and experimenting with a private program.

    Improving Our Analytics

    At Shopify, we make data-informed decisions and our bug bounty program is no exception. However, HackerOne platform data only gives us insight into what hackers are reporting and when; it doesn’t tell us who is testing what and how often. Discussing this problem with other programs revealed how some had already tackled this obstacle; they were leveraging provisioned accounts to understand their program funnel, from invitation, to registration, to account creation, and finally testing. Hearing this, we realized we could do the same.

    To participate in our bug bounty program, we have always required hackers to register for an account with a specific identifier (currently an email address). Historically, we used that registration requirement for investigating reports of suspicious platform activity. However, we realized that the same data could tell us how often people are testing our applications. Furthermore, with improvements to the HackerOne API and the ability to export all of our report data regularly, we have all the data necessary to create exciting activity reports and program trends. It’s also given us more knowledge to share in our monthly program recap tweets.

    Shopify-Experiments, A Private Bug Bounty Program

    Chatting with other programs, we also shared ideas about what is and isn’t working. We heard about some having success running additional private programs. Naturally, we launched a private bug bounty program to test the return on investment. We started Shopify-Experiments in mid-2019 and invited high signal, high impact hackers who have reported to our program previously or who have a proven track record on the HackerOne platform. The program allowed us to run controlled experiments aimed at improving our public program. For example, in 2019, we experimented with:

    • expanding the scope to help us better understand the workload implications
    • paying bounties in full after validating and triaging a report
    • making report disclosure mandatory and adding hackers to duplicate reports
    • allowing for self-closing reports that were submitted in good faith, but were false positives
    • increasing opportunities to collaborate with Shopify third party developers to test their apps.

    These experiments had immediate benefits for our Application Security Team and the Shopify public program. For instance, after running a controlled experiment with an expanded scope, we understood the workload it would entail in our public program. So, on September 11, 2019, we added all but a few Shopify-developed apps into the scope of our public program. Since then, we’ve received great reports about these new assets, such as Report 740989 from Vulnh0lic, which identified a misconfiguration in our OAuth implementation for the Shopify Stocky app. If you’re interested in being added to the program, all it takes is 3 resolved Shopify reports with an overall signal of 3.0 or more in our program.

    Improving Response Times with Automation

    In 2018, our average initial response time was 17 hours. In 2019, we wanted to do better. Since we use a dedicated Slack channel to manage incoming reports, it made sense to develop a chatbot and use the HackerOne API. In January last year, we implemented HackerOne API calls to change report states, assign reports, post public and private comments as well as suggest bounty amounts.

    Immediately, this gave us the ability to respond to reports from mobile devices. However, our chosen syntax was difficult to remember. For example, changing a report state was done via the command hackerone change_state <report_id> <state>. Responding with an auto response was hackerone auto_respond <report_id> <state> <response_id>. To make things easier, we introduced shorthands and emoji responses. Now, instead of typing hackerone change_state 123456 not-applicable, we can use h1 change_state 123456 na. For common invalid reports, we react with emojis, which post the appropriate common response and close the report as not applicable.
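    A hypothetical sketch of that shorthand expansion idea (the alias tables and function name here are invented for illustration; the text above only confirms the h1 and na shorthands):

```ruby
# Expand chat shorthands into the full chatbot command before dispatching it.
COMMAND_ALIASES = { "h1" => "hackerone" }.freeze
STATE_ALIASES   = { "na" => "not-applicable" }.freeze

def expand_command(line)
  words = line.split
  words[0]  = COMMAND_ALIASES.fetch(words[0], words[0])
  # Only the change_state subcommand takes a state as its final argument.
  words[-1] = STATE_ALIASES.fetch(words[-1], words[-1]) if words[1] == "change_state"
  words.join(" ")
end

expand_command("h1 change_state 123456 na")
# => "hackerone change_state 123456 not-applicable"
```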

    2019 Bug Bounty Statistics

    Knowing how important communication is to our hackers, we continue to pride ourselves on all of our response metrics being among the best on HackerOne. For another straight year, we reduced our communication times. Including weekends, our average time to first response was 16 hours compared to 1 day and 9 hours in 2018. This was largely a result of being able to quickly close invalid reports on weekends with Slack. We reduced our average time to triage from 3 days and 6 hours in 2018 to 2 days and 13 hours in 2019.

    We were quicker to pay bounties and resolve bugs; our average time to bounty from submission was 7 days and 1 hour in 2019 versus 14 days in 2018. Our average resolution time from time of triage was down to 20 days and 3 hours from 48 days and 15 hours in 2018. Lastly, we thanked 88 hackers in 2019, compared to 86 in 2018.

    Average Shopify Response Times - Hours vs. Years

    We continued to request disclosure on our resolved bugs. In 2019, we disclosed 74 bugs, up from 37 in 2018. We continue to believe it’s extremely important that we build a resource library to enable ethical hackers to grow in our program. We strongly encourage other companies to do the same.

    Reports Disclosed - Number of Reports vs. Year

    In contrast to our speed improvements and disclosures, our bounty-related statistics were down from 2018, largely a result of having hosted H1-514 in October 2018, which paid out over $130,000 to hackers. Our total amount paid to hackers was down to $126,100 versus $296,400 in 2018, despite receiving approximately the same number of reports: 1,379 in 2019 compared to 1,306 in 2018.

    Bounties Paid - Bounties Awarded vs. Year

    Number of Reports by Year - Number of Reports vs. Year

    Report States by Year - Number of Reports vs. Year

    Similarly, our average bounty awarded was down in 2019: $1,139 compared to $2,052 in 2018. This is partly attributable to the amazing bugs found at H1-514 in October 2018 and to our 2019 decision to merge the Shopify Scripts bounty program, which had a minimum bounty of $100, into our core bounty program. We rewarded bounties on fewer reports: 107 in 2019 versus 182 in 2018.

    After another successful year in 2019, we’re excited to work with more hackers in 2020. If you’re interested in helping to make commerce more secure, visit our HackerOne program page to start hacking, or check out the open Trust and Security positions on our careers page.

    Happy Hacking.
    - Shopify Trust and Security

    Continue reading

    React Native is the Future of Mobile at Shopify

    React Native is the Future of Mobile at Shopify

    After years of native mobile development, we’ve decided to go full steam ahead building all of our new mobile apps using React Native. As I’ll explain, that decision doesn’t come lightly.

    Each quarter, the majority of buyers purchase on mobile (with 71% of our buyers purchasing on mobile in Q3 of last year). Black Friday and Cyber Monday (together, BFCM) are the busiest time of year for our merchants, and buying activity during those days is a bellwether. During this year’s BFCM, Shopify merchants saw another 3% increase in purchases on mobile, an average of 69% of sales.

    So why the switch to React Native? And why now? How does this fit in with our native mobile development? It’s a complicated answer that’s best served with a little background.

    Mobile at Shopify Pre-2019

    We have an engineering culture at Shopify of making specific early technology bets that help us move fast.

    On the whole, we prefer to have few technologies as a foundation for engineering. This provides us multiple points of leverage:

    • we build extremely specific expertise in a small set of deep technologies (we often become core contributors)
    • every technology choice has quirks, but we learn them intimately
    • those outside of the initial team contribute, transfer and maintain code written by others
    • new people are onboarded more quickly.

    At the same time, there are always new technologies emerging that provide us with an opportunity for a step change in productivity or capability. We experiment a lot for the chance to unlock order-of-magnitude improvements, but ultimately we adopt few of these for our core engineering.

    When we do adopt these early languages or frameworks, we make a calculated bet. Instead of shying away from the risk, we meticulously research, explore and evaluate it based on our unique set of conditions. As is often the case, the unexplored opportunities lie hidden within risky areas. We think about how we can mitigate that risk:

    • what if a technology stops being supported by the core team?
    • what if we run into a bug we can’t fix?
    • what if the product goes in a direction against our interests?

    Ruby on Rails was a nascent and obscure framework when Tobi (our CEO) first got involved as a core contributor in 2004. For years, Ruby on Rails was seen as a non-serious, non-performant choice. But that early bet gave Shopify the momentum to outperform the competition even though it wasn’t a popular technology. By using Ruby on Rails, the team was able to build faster and attract a different set of talent by using something more modern and with a higher level of abstraction than traditional programming languages and frameworks. Paul Graham talks about his decision to use Lisp in building Viaweb to similar effect, and 6 of the 10 most valuable Y Combinator companies today use Ruby on Rails (even though it still remains largely unpopular). By contrast, none of the top 10 most valuable Y Combinator companies use Java, largely considered the battle-tested enterprise language.

    Similarly, two years ago, Shopify decided to make the jump to Google Cloud. Again, a scary proposition for the 3rd largest US retail eCommerce site in 2019: not only a cloud migration away from our own data centers, but also a bet on an early cloud contender. We saw the technology arc of value creation moving us to focus on what we’re good at, enabling entrepreneurship, and to let others (in this case Google Cloud) handle the undifferentiated heavy lifting of maintaining physical hardware, power, security, operating system updates, etc.

    What is React Native?

    In 2015, Facebook announced and open sourced React Native; it was already being used internally for their mobile engineering. React Native is a framework for building native mobile apps using React. This means you can use a best-in-class JavaScript library (React) to build your native mobile user interfaces.

    At Shopify, the idea had its skeptics at the time (and still does), but many saw its promise. At the company’s next Hackdays the entire company spent time on React Native. While the early team saw many benefits, they decided that we couldn’t ship an app we’d be proud of using React Native in 2015. For the most part, this had to do with performance and the absence of first-class Android support. What we did learn was that we liked the Reactive programming model and GraphQL. Also, we built and open-sourced a functional renderer for iOS after working with React Native. We adopted these technologies in 2015 for our native mobile stack, but not React Native for mobile development en masse. The Globe and Mail documented our aspirations in a comprehensive story about the first version of our mobile apps.

    Until now, the standard for all mobile development at Shopify was native mobile development. We built mobile tooling and foundations teams focused on iOS and Android helping accelerate our development efforts. While these teams and the resulting applications were all successful, there was a suspicion that we could be more effective as a team if we could:

    • bring the power of JavaScript and the web to mobile
    • adopt a reactive programming model across all client-side applications
    • consolidate our iOS and Android development onto a single stack.

    How React Native Works

    React Native provides a way to build native cross-platform mobile apps using JavaScript. React Native is similar to React in that it allows developers to create declarative user interfaces in JavaScript, for which it internally creates a hierarchy tree of UI elements or, in React terminology, a virtual DOM. Whereas the output of ReactJS targets a browser, React Native translates the virtual DOM into mobile native views using platform-native bindings that interface with application logic in JavaScript. For our purposes, the target platforms are Android and iOS, but community-driven efforts have brought React Native to other platforms such as Windows, macOS and Apple tvOS.

    ReactJS targets a browser, whereas React Native can target mobile APIs.

    When Will We Not Default to Using React Native?

    There are situations where React Native would not be the default option for building a mobile app at Shopify. For example, if we have a requirement of:

    • deploying on older hardware (CPU <1.5GHz)
    • extensive processing
    • ultra-high performance
    • many background threads.

    Reminder: Low-level libraries including many open sourced SDKs will remain purely native. And we can always create our own native modules when we need to be close to the metal.

    Why Move to React Native Now?

    There were 3 main reasons now is a great time to take this stance:

    1. we learned from our 2018 acquisition of Tictail (a mobile-first company that focused 100% on React Native) how far React Native had come, and we made 3 deep product investments in 2019
    2. Shopify uses React extensively on the web and that know-how is now transferable to mobile
    3. we see the performance curve bending upwards (think what’s now possible in Google Docs vs. desktop Microsoft Office) and we can long-term invest in React Native like we do in Ruby, Rails, Kubernetes and Rich Media.

    Mobile at Shopify in 2019

    We have many mobile surfaces at Shopify for buyers and merchants to interact, both over the web and with our mobile apps. We spent time over the last year experimenting with React Native with three separate teams over three apps: Arrive, Point of Sale, and Compass.

    From our experiments we learned that:

    • in rewriting the Arrive app in React Native, the team felt that they were twice as productive than using native development—even just on one mobile platform
    • testing our Point of Sale app on low-power configurations of Android hardware let us set a lower CPU threshold than previously imagined (1.5GHz vs. 2GHz)
    • we estimated ~80% code sharing between iOS and Android, and were surprised by the extremely high levels in practice—95% (Arrive) and 99% (Compass)

    As an aside, even though we’re making the decision to build all new apps using React Native, that doesn’t mean we’ll automatically start rewriting our old apps in React Native.


    Arrive

    At the end of 2018, we decided to rewrite one of our most popular consumer apps, Arrive, in React Native. Arrive is no slouch; it’s a highly rated, high-performing app with millions of downloads on iOS. It was a good candidate because we didn’t have an Android version, and our efforts would help us reach all of the Android users who were clamoring for Arrive. It’s now React Native on both iOS and Android and shares 95% of the same code. We’ll do a deep dive into Arrive in a future blog post.

    So far this rewrite resulted in:

    • fewer crashes on iOS than our native iOS app
    • an Android version launched
    • team composed of mobile + non-mobile developers.

    The team also came up with this cool way to instantly test work-in-progress pull requests. You simply scan a QR code from an automated GitHub comment on your phone and the JavaScript bundle is updated in your app and you’re now running the latest code from that pull request. JML, our CTO, shared the process on Twitter recently.

    Point of Sale

    At the beginning of 2019, we did a 6-week experiment on our flagship Point of Sale (POS) app to see if it would be a good candidate for a rewrite in React Native. We learned a lot, including that our retail merchants expect almost 2x the responsiveness in our POS due to the muscle memory of using our app while also talking to customers.

    In order to best serve our retail merchants and learn about React Native in a physical retail setting, we decided to build out the new POS natively for iOS and use React Native for Android.

    We went ahead with 2 teams for the following reasons:

    1. we already had a team ramped up with iOS expertise, including many of the folks who built the original POS apps
    2. we wanted to benchmark our React Native engineering velocity, as well as app performance, against the gold standard: native iOS
    3. to meet the high performance requirements of our merchants, we felt we’d need all of Facebook’s re-architecture updates to React Native before launch (as it turns out, they weren’t critical to our performance use cases). Having two teams on two platforms de-risked our ability to launch.

    We announced a complete rewrite of POS at Unite 2019. Look for both the native iOS and React Native Android apps to launch in 2020!


    Compass

    The Start team at Shopify is tasked with helping folks new to entrepreneurship. Before the company-wide decision to write all mobile apps in React Native, the team did a deep dive into native, Flutter, and React Native as possible technology choices. They chose React Native and now have iOS and Android apps (in beta) live in the app stores.

    The first versions of Compass (both iOS and Android) were launched within 3 months with ~99% of the code shared between iOS and Android.

    Mobile at Shopify 2020+

    We have lots in store for 2020.

    Will we rewrite our native apps? No. That’s a decision each app team makes independently.

    Will we continue to hire native engineers? Yes, LOTS!

    We want to contribute to core React Native, build platform-specific components, and continue to understand the subtleties of each platform. This requires deep native expertise. Does this sound like you?

    Partnering and Open Source

    We believe that building software is a team sport. We have a commitment to the open web, open source and open standards.

    We’re sponsoring Software Mansion and Krzysztof Magiera (co-founder of React Native for Android) in their open source efforts around React Native.

    We’re working with William Candillon (host of Can It Be Done in React Native) for architecture reviews and performance work.

    We’ll be partnering closely with the React Native team at Facebook on automation, 3rd party libraries and stewardship of some modules via Lean Core.

    We are working with Discord to accelerate the open sourcing of FastList for React Native (a library which only renders list items that are in the viewport) and optimizing for Android.

    Developer Tooling and Foundations for React Native

    When you make a bet and go deep into a technology, you want to gain maximum leverage from that choice. In order for us to build fast and get the most leverage, we have two types of teams that help the rest of Shopify build quickly. The first is a tooling team that helps with engineering setup, integration and deployment. The second is a foundations team that focuses on SDKs, code reuse and open source. We’ve already begun spinning up both of these teams in 2020 to focus on React Native.

    Our popular Shopify Ping app, which has enabled hundreds of thousands of customer conversations, is currently iOS only. In 2020, we’ll be building the Android version using React Native out of our San Francisco office, and we’re hiring.

    In 2019, Twitter released their desktop and mobile web apps using something called React Native Web. While this might seem confusing, it allows you to use the same React Native stack for your web app as well. As a result, Facebook promptly hired Nicolas Gallagher, the lead engineer on the project. At Shopify we’ll be doing some React Native Web experiments in 2020.

    Join Us

    Shopify is always hiring sharp folks in all disciplines. Given our particular stack (Ruby on Rails/React/React Native), we’ve always invested in people even if they don’t have this particular set of experiences coming into Shopify. In mobile engineering (btw, I love this video about engineering opinions) we’ll continue to write native mobile code and hire native engineers (iOS and Android).

    In addition we are looking for a Principal Mobile Developer to work with me directly across the mobile portfolio at Shopify. This person has a track record of excellence, can solve extremely complex technical challenges and can help Shopify to become an industry and technology leader in React Native. If this sounds like you, message me directly!

    Farhan Thawar is VP Engineering for Channels and Mobile at Shopify
    Twitter: @fnthawar

    Continue reading

    Scaling Mobile Development by Treating Apps as Services

    Scaling Mobile Development by Treating Apps as Services

    Scaling development without slowing down the delivery speed of new features is a problem that companies face when they grow. Speed can be achieved through better tooling, but the bigger the teams and projects, the more tooling they need. When projects and teams use different tools to solve similar problems, it gets harder for tooling teams to create one solution that works for everybody. Additionally, it complicates knowledge sharing and makes it difficult for developers to contribute to other projects. This lack of knowledge and developers is magnified during incident response because only a handful of people have enough context and understanding of the system to mitigate and fix issues.

    At Shopify, we believe in highly aligned, but loosely coupled teams—teams working independently from each other while sharing the same vision and goals—that move fast and minimize slowdowns in productivity. To continue working towards this goal, we designed tools to share processes and best practices that ease collaboration and code sharing. With tools, teams ship code fast while maintaining quality and productivity. Tooling worked efficiently for our web services, but we lacked something similar for mobile projects. Tools enforce processes that increase quality, reliability and reproducibility. A few examples include using

    • continuous integration (CI) and automated testing
    • continuous delivery to release new versions of the software
    • containers to ensure that the software runs in a controlled environment.

    Treating Apps as Services

    Last year, the Mobile Tooling Team shipped tools that helped mobile developers be more productive, but we couldn’t enforce the usage of those tools. Moreover, checking which tools mobile apps used required digging into configuration files and scripts spread across different project repositories. We have several mobile apps available for download across Google Play and the App Store, so this approach didn’t scale.

    Fortunately, Shopify has a tool that enforces tool usage and we extended it to our mobile projects. ServicesDB tracks all production services running at Shopify and has three major goals:

    1. keep track of all running services across Shopify
    2. define what it means to own a service, and what the expectations are for an owner
    3. provide tools for owners to improve the quality of the infrastructure around their services.

    ServicesDB allows us to treat apps as services with an owner, and for which we define a set of expectations. We specify, in a configuration file, the information we need to codify best practices, which allows us to check for things such as:

    • Service Ownership: each project must be owned by a team or an individual, and they must be responsible for its maintenance and development. A team is accountable for any issues or requests that might arise in regards to the app.
    • Contact Information: Slack channels people use if they need more information about a certain mobile app. We also use those channels to notify teams about their projects not meeting the required checks.
    • Testing and Deployment Configuration: CI and our mobile deployment tool, Shipit Mobile, are properly configured. This check is essential because we need to be able to release a new version of our apps at any time.
    • Versioning: Apps use the latest version of our internal tools. With this check we make sure that our dependencies don’t contain known security vulnerabilities.
    • Monitoring: Bug tracking services configured to check for errors and crashes that are happening in production.

    ServicesDB checks for one of our mobile apps

    ServicesDB defines a contract with each development team through automatic checks on tooling requirements for mobile projects. This mitigates the problem of understanding how projects are configured and which tools they use, keeping teams highly aligned but loosely coupled. Now the Mobile Tooling team can see whether a project can use our tooling. It also helps developers understand why some tools don’t work with their projects and instructs them on how to fix it, as every check provides a description of how to make it pass. Some common issues are using an outdated Ruby version or not having a bug tracking tool configured. If any check fails, we automatically create an issue on GitHub to notify the team that they aren’t meeting the contract.
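
    A ServicesDB-style check can be pictured as a predicate plus the fix instructions posted when it fails. This is an illustrative sketch with assumed names, not the actual ServicesDB schema:

    ```ruby
    # Each check pairs a predicate with the instructions surfaced in the
    # GitHub issue when the check fails. All names here are illustrative.
    Check = Struct.new(:id, :fix_instructions, :predicate) do
      def run(app)
        { id: id, passed: !!predicate.call(app), fix: fix_instructions }
      end
    end

    CHECKS = [
      Check.new(:ownership, "Assign an owning team in the project config",
                ->(app) { !app[:owner].nil? }),
      Check.new(:ci, "Configure CI and Shipit Mobile for the repository",
                ->(app) { app[:ci_configured] }),
      Check.new(:monitoring, "Set up a bug tracking service",
                ->(app) { app[:bug_tracker] }),
    ]

    def failing_checks(app)
      CHECKS.map { |check| check.run(app) }.reject { |result| result[:passed] }
    end

    app = { owner: "mobile-tooling", ci_configured: true, bug_tracker: false }
    failing_checks(app).map { |result| result[:id] }
    # => [:monitoring]
    ```

    Each failing result carries its fix instructions, which is what makes the automatically filed GitHub issue actionable rather than just a red X.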

    GitHub issue created when a check fails. It contains instructions to fix the failure.

    Abstracting Tooling and Configuration Away

    If you want to scale development efficiently, you need to be opinionated about the tools supported. Through ServicesDB we detect misconfigured projects, notify their owners, and help them to fix those issues. At the end of the day, we don’t want our mobile developers to think about tooling and configurations. Our goal is to make commerce better for everyone, so we want people to spend time solving commerce problems that provide a better experience to both buyers and entrepreneurs.

    At the moment, we’ve only implemented some basic checks, but in the future we plan to define service level objectives for mobile apps and develop better tools for easing the creation of new projects and reducing build times, all while being confident that they will work as long as the defined contract is satisfied.

    Intrigued? Shopify is hiring and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.

    Continue reading

    How to Implement a Secure Central Authentication Service in Six Steps

    How to Implement a Secure Central Authentication Service in Six Steps

    As Shopify merchants grow in scale, they often introduce multiple stores into their organization. Previously, this meant that staff members had to be invited to multiple stores to set up their accounts. This introduced administrative friction and more work for staff users, who had to manage multiple accounts just to do their jobs.

    We created a new service to handle centralized authentication and user identity management called, surprisingly enough, Identity. Having a central authentication service within Shopify was accomplished by building functionality on the OpenID Connect (OIDC) specification. Once we had this system in place, we built a solution to reliably and securely allow users to combine their accounts to get the benefit of single sign-on. Solving this specific problem involved a team comprising product management, user experience, engineering, and data science working together with members spread across three different cities: Ottawa, Montreal, and Waterloo.

    The Shop Model

    Shopify is built so that all the data belonging to a particular store (called a Shop in our data model) lives in a single database instance. The data includes core commerce objects like Products, Orders, Customers, and Users. The Users model represents the staff members who have access, with specific permissions, to the administration interface for a particular Shop.

    Shop Commerce Object Relationships

    User authentication and profile management belonged to the Shop itself, and this worked as long as your use of Shopify never went beyond a single store. As soon as a merchant organization expanded to using multiple stores, the experience for both the person managing store users and the individual users involved more overhead. You had to sign into each store independently, as there were no single sign-on (SSO) capabilities: Shops don’t share any data with each other. Users had to manage their profile data, password, and two-step authentication on each store they had access to.

    Shop isolation of users

    Modelling User Accounts Within Identity

    User accounts modelled within our Identity service come in two important types: Identity accounts and Legacy accounts. A service or application that a user can access via OIDC is modelled as a Destination within Identity. Examples of destinations within Shopify are stores, the Partners dashboard, and our Community discussion forums.

    A Legacy account only has access to a single store and an Identity account can be used to access multiple destinations.

    Legacy account model: one destination per account. Can only access Shops

    We ensured that new accounts are created as Identity accounts and that existing users with legacy accounts can be safely and securely upgraded to Identity accounts. The big problem was combining multiple legacy accounts. When a user used the same email to sign into several different Shopify stores, we needed to combine these accounts into a single Identity account without blocking their access to any of the stores they used.

    Combined account model: each account can have access to multiple destinations

    There were six steps needed to get us to a single account to rule them all.

    1. Synchronize data from existing user accounts into a central Identity service.
    2. Have all authentication go through the central Identity service via OpenID Connect.
    3. Prompt users to combine their accounts together.
    4. Prompt users to enable a second factor (2FA) to protect their account.
    5. Create the combined Identity account.
    6. Prevent new legacy accounts from being created.

    1. Synchronize Data From Existing User Accounts Into a Central Identity Service

    We ensured that all user profile and security credential information was synchronized from the stores, where it’s managed, into the centralized Identity service. This meant synchronizing data from the store to the Identity service every time one of the following user events occurred:

    • creation
    • deletion
    • profile data update
    • security data update (password or 2FA).
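
    The synchronization above can be sketched as an event publisher invoked on each user mutation. This is plain Ruby with assumed names; in the real system the equivalent hooks would live on the store's user model:

    ```ruby
    # Publish user lifecycle events from a shop to the central Identity
    # service. The client is injected so the sketch stays testable; a real
    # implementation would make an authenticated API call here.
    class UserSynchronizer
      EVENTS = %i[created deleted profile_updated security_updated].freeze

      def initialize(identity_client)
        @client = identity_client
      end

      def publish(event, user)
        raise ArgumentError, "unknown event: #{event}" unless EVENTS.include?(event)
        @client.sync({ event: event, uuid: user[:uuid], email: user[:email] })
      end
    end

    # A fake client that records payloads, standing in for the HTTP call.
    recorded = []
    client = Object.new
    client.define_singleton_method(:sync) { |payload| recorded << payload }

    sync = UserSynchronizer.new(client)
    sync.publish(:profile_updated, { uuid: "abc-123", email: "user@example.com" })
    recorded.last[:event] # => :profile_updated
    ```

    Rejecting unknown event names keeps the shop and Identity sides of the contract in lockstep as the event list evolves.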

    2. Have All Authentication Go Through the Central Identity Service Via OpenID Connect (OIDC)

    OpenID Connect is an identity layer built on top of the OAuth 2.0 protocol, and the method used to delegate authentication from the Shop to the Identity service. Prior to this step, all password and 2FA verification was done within the core Shop application runtime. Given that Shopify shards the database for the core platform by Shop, all of the data associated with a given Shop is available on a single database instance.
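
    At its core, the AuthRequest is a redirect of the user's browser to the Identity service's authorization endpoint. A minimal sketch of building that URL for the OIDC authorization code flow (the endpoint path and parameter values are placeholders, not Shopify's actual configuration):

    ```ruby
    require "uri"
    require "securerandom"

    # Build an OIDC authorization request URL for the code flow.
    def authorization_url(issuer:, client_id:, redirect_uri:)
      params = {
        response_type: "code",          # authorization code flow
        scope: "openid email profile",  # "openid" is mandatory for OIDC
        client_id: client_id,
        redirect_uri: redirect_uri,
        state: SecureRandom.hex(16),    # ties the callback to this browser session
        nonce: SecureRandom.hex(16),    # binds the ID token to this request
      }
      "#{issuer}/oauth/authorize?#{URI.encode_www_form(params)}"
    end

    authorization_url(issuer: "https://identity.example.com",
                      client_id: "shop-123",
                      redirect_uri: "https://shop.example.com/auth/callback")
    ```

    The round trip through this endpoint is what produces the extra delay (and the familiar loading spinner) on a user's first sign-in to a store.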

    One downside with having all authentication go through Identity is that when a user first signs into a Shopify service it requires sending the user’s browser to Identity to perform an OIDC authentication request (AuthRequest), so there is a longer delay on initial sign in to a particular store.

    Users signing into Shopify got familiar with this loading spinner

    3. Prompt Users to Combine Their Accounts Together

    Users with an email address that can sign into more than one Shopify service are prompted to combine their accounts into a single Identity account. When a legacy user signs into a Shopify product, we interrupt the OIDC AuthRequest flow, after verifying they are authenticated but before sending them to their destination, to check whether they have accounts that can be upgraded.

    There were two primary upgrade paths to an Identity account for a user: auto-upgrading a single legacy account or combining multiple accounts.

    Auto-upgrading a single legacy account occurs when a user’s email address has only a single store association. In this case, we convert the single account into an Identity account, retaining all of their profile, password, and 2FA settings. Accounts in the Identity service are modelled using single table inheritance, with a type attribute specifying which class a particular record uses. Upgrading a legacy account in this case was as simple as updating the value of this type attribute. This required no other changes anywhere else within the Shopify system because the universally unique identifier (UUID) for the account didn’t change, and this is the value used to identify an account in other systems.
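
    The auto-upgrade is small enough to sketch. Plain Ruby stands in for the Rails model here; the class and attribute names are illustrative, but the shape matches the single table inheritance flip described above:

    ```ruby
    # A legacy account becomes an Identity account by flipping the STI
    # type attribute. The UUID never changes, so every external reference
    # to the account keeps working.
    Account = Struct.new(:uuid, :type, :email)

    def auto_upgrade!(account)
      account.type = "IdentityAccount"
      account
    end

    legacy = Account.new("uuid-123", "LegacyAccount", "user@example.com")
    uuid_before = legacy.uuid
    auto_upgrade!(legacy)
    legacy.type == "IdentityAccount" && legacy.uuid == uuid_before
    # => true
    ```

    In Rails terms this is a one-column UPDATE, which is what makes the single-account path so cheap compared to the multi-account merge.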

    Combining multiple accounts is triggered when a user has more than one active account (legacy or Identity) that uses the same email address. We created a new session object, called a MergeSession, for this combining process to keep track of all the data required to create the Identity account. The MergeSession was associated with an individual AuthRequest, which means that when the AuthRequest completed, the session was no longer active. If a user went through more than one combining process, we generated a new MergeSession object for each one.

    The prompt users saw when they had multiple accounts that could be combined

    Shopify doesn’t require users to verify their email address when creating a new store, so it’s possible for someone to sign up for a trial using an email address they don’t have access to. Because of this, we need to verify that a user has access to their email address before we show them information about other accounts with the same email or allow them to take any action on those accounts. This verification involves requesting that an email with a confirmation link be sent to the address.

    If the user’s email address on the store they were signing in to was verified, we listed all of the other destinations where their email address was used. If a user hadn’t verified their email address for the account they were authenticating into, we would only indicate that there were other accounts, and they had to verify their email address before proceeding with combining them.

    The prompt users saw when they signed in with an unverified email address

    If any of the accounts being combined use 2FA, the user has to provide a valid code for each required account. Someone who uses SMS as a 2FA method can save time in this step if they use the same phone number across multiple accounts, because we only require a single code for all of the destinations that share that number. This was a secure convenience that reduced time spent on this step. Individuals using an authenticator app (e.g. Google Authenticator, Authy, 1Password, etc.), however, have to provide a code per destination, because an authenticator app is configured per user account and there’s nothing associating the accounts with one another.
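
    The number of codes required in this step reduces to a small counting rule. A sketch with assumed field names:

    ```ruby
    # Accounts that share an SMS number need only one code between them;
    # authenticator-app accounts each need their own.
    def required_code_count(accounts)
      sms, app_based = accounts.partition { |a| a[:method] == :sms }
      sms.map { |a| a[:phone] }.uniq.size + app_based.size
    end

    accounts = [
      { method: :sms,  phone: "+1-555-0100" },
      { method: :sms,  phone: "+1-555-0100" },  # same number: no extra code
      { method: :totp },                        # authenticator app: own code
    ]
    required_code_count(accounts) # => 2
    ```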

    If a user couldn’t provide a 2FA code for any accounts other than the account they are signing into, they are able to exclude that account from being combined. Legitimate reasons why a person may be unable to provide a code include if the account uses an old SMS phone number that the person no longer has access to or the person no longer has an authenticator app configured to generate a code for that account.

    The idea here is that any account which was excluded can be combined at a later date when the user re-gains access to the account.

    Once the 2FA requirements for all accounts are satisfied, we prompt the user to set up a new password for their combined account. We store the encrypted password hash on an object that keeps track of state for this session.

    4. Prompt Users to Enable a Second Factor to Protect Their Account

    Having a user engaged in performing account maintenance was an excellent opportunity to expose them to the benefits of protecting their account with a second factor of security. We displayed a different flow to users who already had 2FA enabled on at least one of the accounts being combined; the assumption was that they don’t require an explanation of what 2FA is, while someone who had never set it up most likely would.

    5. Create the Combined Identity Account

    Once a user had validated their 2FA configuration of choice, or opted out of setting it up, we performed the following actions:

    Attach 2FA setup, if present, to an object that keeps track of the specific account combination session (MergeSession).

    Merge session object with new password and 2FA configuration.

    Inside a single database transaction, create the complete new account, associate destinations from the legacy accounts to it, and delete the old accounts.

    We needed to do this inside a transaction after getting all of the information from a user to prevent the potential for reducing the security of their accounts. If a user was using 2FA before starting this process and we created the Identity account immediately after the new password was provided, there exists a small window of time when their new Identity account would be less secure than their old legacy accounts. As soon as the Identity account exists and has a password associated with it, it could be used to access destinations with only knowledge of the password. Deferring account creation until both password and 2FA are defined means that the new account can be as secure as the ones being combined were.
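The deferred, all-or-nothing creation could be sketched like this. It's illustrative only: the `with_transaction` stand-in represents a real database transaction, and the structs are not Shopify's actual models.

```ruby
IdentityAccount = Struct.new(:password_hash, :tfa_config, :destinations)
LegacyAccount   = Struct.new(:destinations)

# Stand-in for a real database transaction block.
def with_transaction
  yield
end

# Create the Identity account, move destinations over, and delete the
# legacy accounts as one atomic step, so the new account never exists
# without its full password and 2FA configuration.
def combine_accounts!(store, legacy_accounts, password_hash, tfa_config)
  with_transaction do
    identity = IdentityAccount.new(password_hash, tfa_config, [])
    legacy_accounts.each do |legacy|
      identity.destinations.concat(legacy.destinations)
      store.delete(legacy)
    end
    store << identity
    identity
  end
end
```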

    Final state of combined account

    Generate a session for the new account and use it to satisfy the AuthRequest that initiated this session in the first place.

    Some of the more complex pieces of logic for this process included finding all of the related accounts for a given email address and the information about the destinations they had access to, replacing the legacy accounts when creating the Identity account, and ensuring that the Identity account was set up correctly with all of the required data defined. For these parts of the solution we relied on a Ruby library called ActiveOperation. It’s a very small framework that allows you to isolate and model business logic within your application as operation classes. Traditionally in a Rails application you end up putting logic either in your controllers or your models; in this case we were able to keep both very small by defining the complex business logic as operations. These operations were easily testable given that they were isolated, with each class having one very specific responsibility.
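The operation pattern can be illustrated with a from-scratch sketch. This mimics the spirit of ActiveOperation (one isolated, testable class per piece of business logic) rather than reproducing the gem's exact API, and the operation name is hypothetical.

```ruby
# A small, isolated operation class: one specific responsibility,
# invoked through a single entry point.
class FindRelatedAccounts
  def self.call(**args)
    new(**args).execute
  end

  def initialize(email:, accounts:)
    @email = email
    @accounts = accounts
  end

  def execute
    @accounts.select { |account| account[:email].downcase == @email.downcase }
  end
end
```

Controllers stay thin by delegating to `FindRelatedAccounts.call(email: ..., accounts: ...)`, and the class can be unit tested in isolation.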

    There are other libraries for handling this kind of business logic process but we chose ActiveOperation because it was easy to use, made our code easier to understand, and had built-in support for the RSpec testing framework we were using.

    We added support for the new Web Authentication (WebAuthn) standard in our Identity service just as we were beginning to roll out the account combining flow to our users. This meant that we were able to allow users to use physical security keys as a second factor when securing their accounts rather than just the options of SMS or an authenticator app.

    6. Prevent New Legacy Accounts From Being Created

    We didn’t want any more legacy accounts created. There were two user scenarios that needed to be updated to use the Identity creation flow: signing up for a new trial store, and inviting new staff members to an existing store.

    When signing up for a new store, you would enter your email address as part of that process, and this email address was used for the primary owner of the new store. With legacy accounts, even if the email address already belonged to an account on another store, we’d still create a new legacy account for the newly created store.

    When inviting a new staff member to your store, you would enter the email address for the new user, and an invite would be sent to that email address with a link to accept the invite and finish setting up their account. Similarly to the store creation process, this would always create a new legacy account on each individual store.

    In both cases with the new process we determine whether the email address belongs to an Identity account already and, if so, require the user to be authenticated for the account belonging to that email address before they can proceed.

    Build New Experiences for Shopify Users That Rely on SSO Identity Accounts

    As of the time of this writing, over 75% of active user accounts have been auto-upgraded or combined into a single Identity account. Accounts that don’t require user interaction, such as those eligible for auto-upgrade, can be upgraded automatically without the user signing in. Accounts that require a user to prove ownership can only be combined when the user logs in. At some point in the future we will prevent users from signing into Shopify without having an Identity account.

    When product teams within Shopify can rely on our active users having Identity accounts we can start building new experiences for those users that delegate authentication and profile management to the Identity service. Authorization is still up to the service leveraging these Identity accounts as Identity specifically only handles authentication and knows nothing about the permissions within the services that the accounts can access.

    For our users, it means that they don’t have to create and manage a new account when Shopify launches a new service that utilizes Identity for user sign in.

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions. 

    Continue reading

    How Shopify Manages API Versioning and Breaking Changes

    Earlier this year I took the train from Ottawa to Toronto. While I was waiting in line in the main hall of the station, I noticed a police officer with a detection dog. The police officer was giving the dog plenty of time at each bag or person as they worked and weaved their way back and forth along the lines. The dog would look to his handler for direction, receiving it with the wave of a hand or gesture towards the next target. That’s about the moment I began asking myself a number of questions about dogs… and APIs.

    To understand why, you have to appreciate that the Canadian government recently legalized cannabis. Watching this incredibly well-trained dog work his way up and down the lines, it made me wonder, how did they “update” the dogs once the legislation changed? Can you really retrain or un-train a dog? How easy is it to implement this change, and how long does it take to roll out? So when the officer ended up next to me I couldn’t help but ask,

    ME: “Excuse me, I have a question about your dog if that’s alright with you?”

    OFFICER: “Sure, what’s on your mind?”

    ME: “How did you retrain the dogs after the legalization of cannabis?”

    OFFICER: “We didn’t. We had to retire them all and train new ones. You really can’t teach an old dog new tricks.“

    ME: “Wow, seriously? How long did that take?”

    OFFICER: “Yep, we needed a full THREE YEARS to retire the previous group and introduce a new generation. It was a ton of work.”

    I found myself sitting on the train thinking about how easily one layer of government, plotting out the changes, could have completely underestimated the downstream impact on the K9 unit of the police services. To anyone that didn’t understand the system (dogs), the change sounds simple: just detect substances in a set that is now n-1 in size. In reality, due to the way this dog-dependent system works, it required significant time and effort, and a three-year program to migrate from the old system to the new.

    How We Handle API Versioning

    At Shopify, we have tens of thousands of partners building on our APIs that depend on us to ensure our merchants can run their businesses every day. In April of this year, we released the first official version of our API. All consumers of our APIs require stability and predictability and our API versioning scheme at Shopify allows us to continue to develop the platform while providing apps with stable API behavior and predictable timelines for adopting changes.

    The increasing growth of our API RPM quarter over quarter since 2017 overlaid with growth in active API clients

    To ensure that we provide a stable and predictable API, Shopify releases a new API version every three months at the beginning of the quarter. Version names are date-based to be meaningful and semantically unambiguous (for example, 2020-01).

    Shopify API Versioning Schedule


    Although the Platform team is responsible for building the infrastructure, tooling, and systems that enforce our API versioning strategy at Shopify, there are 1,000+ engineers working across Shopify, each with the ability to ship code that can ultimately affect any of our APIs. So how do we think about versioning, and how do we help manage changes to our APIs at scale?

    Our general rule of thumb about versioning is that

    API versioning is a powerful tool that comes with added responsibility. Break the API contract with the ecosystem only when there are no alternatives or it’s uneconomical to do otherwise.

    API versions and changes are represented in our monolith through new frozen records: one file for versions, and one for changes. API changes are packaged together and shipped as part of a distinct version. API changes are initially introduced to the unstable version, and can optionally have a beta flag associated with them to prevent the change from being visible publicly. At runtime, our code can check whether a given change is in effect through an ApiChange.in_effect? construct. I’ll show you how this and other methods of the ApiChange module are used in an example later on.
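A minimal sketch of what such records and the runtime check might look like follows. This is illustrative only: Shopify's real frozen records and ApiChange module differ, and the change name here is invented.

```ruby
# Date-named versions sort lexically, which keeps comparisons simple.
API_VERSIONS = %w[2019-07 2019-10 2020-01 unstable].freeze

# One named change, shipped as part of a distinct version.
API_CHANGES = {
  "example_remove_ignored_field" => { version: "2020-01", beta: false },
}.freeze

module ApiChangeSketch
  def self.in_effect?(change_name, request_version:)
    change = API_CHANGES.fetch(change_name)
    return false if change[:beta]              # hidden behind a beta flag
    return true if request_version == "unstable"
    request_version >= change[:version]
  end
end
```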

    Dealing With Breaking and Non-breaking Changes

    As we continue to improve our platform, changes are necessary and can be split into two broad categories: breaking and non-breaking.

    Breaking changes are more problematic and require a great deal of planning, care, and go-to-market effort to ensure we support the ecosystem and provide a stable commerce platform for merchants. Ultimately, a breaking change is any change that requires a third-party developer to do migration work to maintain the existing functionality of their application. Some examples of breaking changes are:

    • adding a new validation to an existing resource, or modifying an existing one
    • requiring a parameter that wasn’t required before
    • changing existing error response codes/messages
    • modifying the expected payload of webhooks and async callbacks
    • changing the data type of an existing field
    • changing supported filtering on existing endpoints
    • renaming a field or endpoint
    • adding a new feature that will change the meaning of a field
    • removing an existing field or endpoint
    • changing the URL structure of an existing endpoint.

    Teams inside Shopify considering a breaking change conduct an impact analysis. They put themselves into the shoes of a third-party developer using the API and think through the changes that might be required. If there is ambiguity, our developer advocacy team can reach out to our partners to gain additional insight and gauge the impact of proposed changes. 

    On the other hand, to determine if a change is non-breaking, a change must pass our forward compatibility test. Forward compatible changes are those which can be adopted and used by any merchant, without limitation, regardless of whether shops have been migrated or any other additional conditions have been met.

    Forward compatible changes can be freely adopted without worrying about whether there is a new user experience, whether the merchant’s data has been adapted to work with the change, and so on. Teams keep these changes in the unstable API version, and if forward compatibility can’t be met, they keep access limited and managed by protecting the change with a beta flag.

    Every change is named in the changes frozen record mentioned above, to track and manage the change, and can be referenced by its name, for example,
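The original example did not survive extraction; an entry might look something like this hypothetical shape (not Shopify's actual record format):

```ruby
# A named change entry: the name is the handle used by the rest of the
# API change tooling (in_effect?, mark_breaking, and so on).
API_CHANGE = {
  name: "example_remove_ignored_field_from_products",
  version: "2020-01",
  breaking: true,
}.freeze
```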


    Analyzing the Impact of Breaking Changes

    If a proposed change is identified as a breaking change, and there is agreement amongst the stakeholders that it’s necessary, the next step is to enable our teams to figure out just how big the change’s impact is.

    Within the core monolith, teams make use of our API change tooling methods mark_breaking and mark_possibly_breaking to measure the impact of a potential breaking change. These methods work by capturing request metadata and context specific to the breaking code path, then emitting this into our event pipeline, Monorail, which places the events into our data warehouse.

    The mark_breaking method is called when the request would break if everything else was kept the same, while mark_possibly_breaking would be used when we aren’t sure whether the call would have an adverse effect on the calling application. An example would be the case where a property of the response has been renamed or removed entirely:
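The original code sample was not preserved here; a hedged sketch of the removed-property case might look like this. The stand-in methods just record events locally, where the real ones capture request metadata and emit it to Monorail, and the change and field names are invented.

```ruby
module ApiChangeSketch
  def self.emitted
    @emitted ||= []
  end

  # Stand-ins for the real instrumentation methods.
  def self.mark_breaking(change_name)
    emitted << [:breaking, change_name]
  end

  def self.mark_possibly_breaking(change_name)
    emitted << [:possibly_breaking, change_name]
  end
end

def serialize_product(product, requested_fields)
  if requested_fields.include?("ignored_field")
    # The caller explicitly asked for a removed property: this request
    # would break under the new version.
    ApiChangeSketch.mark_breaking("example_remove_ignored_field")
  end
  product.reject { |key, _| key == :ignored_field }
end
```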


    Once shipped to production, teams can use a prebuilt impact assessment report to see the potential impact of their changes across a number of dimensions.

    Measuring and Managing API Adoption

    Once the change has shipped as a part of an official API version, we’re able to make use of the data emitted from mark_breaking and mark_possibly_breaking to measure adoption and identify shops and apps that are still at risk. Our teams use the ApiChange.in_effect? method (made available by our API change tooling) to create conditionals and manage support for the old and new behaviour in our API. A trivial example might look something like this:
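The original snippet was lost in extraction; a hedged stand-in for the conditional shape might be (the version cutoff and field name are assumptions, and the real ApiChange.in_effect? resolves the request's API version internally):

```ruby
# Minimal stand-in for ApiChange.in_effect?: date-named versions compare
# lexically, so a change is in effect from its release version onward.
def change_in_effect?(api_version)
  api_version >= "2020-01"
end

def product_payload(product, api_version:)
  if change_in_effect?(api_version)
    product.reject { |key, _| key == :ignored_field } # new behaviour
  else
    product                                           # old behaviour
  end
end
```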

    The ApiChange module and the automated instrumentation it drives allow teams at Shopify to assess the current risk to the platform based on the proportion of API calls still on the breaking path, and assist in communicating these risks to affected developers.

    At Shopify, our ecosystem’s applications depend on the predictable nature of our APIs. The functionality these applications provide can be critical for the merchant’s businesses to function correctly on Shopify. In order to build and maintain trust with our ecosystem, we consider any proposed breaking change thoroughly and gauge the impact of our decisions. By providing the tooling to mark and analyze API calls, we empower teams at Shopify to assess the impact of proposed changes, and build a culture that respects the impact our decisions have on our ecosystem. There are real people out there building software for our merchants, and we want to avoid ever having to ask them to replace all the dogs at once!

    We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.

    Continue reading

    Successfully Merging the Work of 1000+ Developers

    Collaboration with a large team is challenging, and even more so if it’s on a single codebase, like the Shopify monolith. We deploy Shopify about 40 times a day. We follow a trunk-based development workflow and merge around 400 commits to master daily. Three rules govern how we deploy safely, but they were hard to maintain at our growing scale: soft conflicts broke master, slow deployments caused large drift between master and production, and the time to deploy emergency merges slowed due to a backlog of pull requests. To solve these issues, we upgraded the Merge Queue (our tool to automate and control the rate of merges going into master) so that it integrates with GitHub, runs continuous integration (CI) before merging to master (keeping it green), removes pull requests that fail CI, and maximizes the deployment throughput of pull requests.

    Our three essential rules about deploying safely and maintaining master:

    1. Master must always be green (passing CI). Important because we must be able to deploy from master at all times. If master is not green, our developers cannot merge, slowing all development across Shopify.
    2. Master must stay close to production. Drifting master too far ahead of what is deployed to production increases risk.
    3. Emergency merges must be fast. In case of emergencies, we must be able to quickly merge fixes intended to resolve the incident.

    Merge Queue v1

    Two years ago, we built the first iteration of the merge queue inside our open-source continuous deployment tool, Shipit. Our goal was to prevent master from drifting too far from production. Rather than merging directly to master, developers add pull requests to the merge queue which merges pull requests on their behalf.

    Merge Queue v1: developers add pull requests to the merge queue, which merges them on their behalf

    Pull requests build up in the queue rather than merging to master all at once. Merge Queue v1 controlled the batch size of each deployment and prevented merging when there were too many undeployed pull requests on master. It reduced the risk of failure and possible drift from production. During incidents, we locked the queue to prevent any further pull requests from merging to master, giving space for emergency fixes.

    Merge Queue v1 browser extension

    Merge Queue v1 used a browser extension allowing developers to send a pull request to the merge queue within the GitHub UI, but also allowed them to quickly merge fixes during emergencies by bypassing the queue.

    Problems with Merge Queue v1

    Merge Queue v1 kept track of pull requests, but we were not running CI on pull requests while they sat in the queue. On some unfortunate days—ones with production incidents requiring a halt to deploys—we would have upwards of 50 pull requests waiting to be merged. A queue of this size could take hours to merge and deploy. There was also no guarantee that a pull request in the queue would pass CI after it was merged, since there could be soft conflicts (two pull requests that pass CI independently, but fail when merged together) between pull requests in the queue.

    The browser extension was also a major pain point because it made for a poor developer experience. New developers sometimes forgot to install the extension, which resulted in accidental direct merges to master instead of going through the merge queue. That can be disruptive if the deploy backlog is already large, or if there is an ongoing incident.

    Merge Queue v2

    This year, we completed Merge Queue v2. We focused on optimizing our throughput by reducing the time that the queue is idle, and improving the user experience by replacing the browser extension with a more integrated experience. We also wanted to address the pieces that the first merge queue didn’t address: keeping master green and faster emergency merges. In addition, our solution needed to be resilient to flaky tests—tests that can fail nondeterministically.

    No More Browser Extension

    Merge Queue v2 came with a new user experience. We wanted an interface for our developers to interact with that felt native to GitHub. We drew inspiration from Atlantis, which we were already using for our Terraform setup, and went with a comment-based interface.

    Merge Queue v2 went with a comment-based interface

    A welcome message gets issued on every pull request with instructions on how to use the merge queue. Every merge now starts with a /shipit comment. This comment fires a webhook to our system to let us know that a merge request has been initiated. We check if Branch CI has passed and if the pull request has been approved by a reviewer before adding the pull request to the queue. If successful, we issue a thumbs up emoji reaction to the /shipit comment using the GitHub addReaction GraphQL mutation.

    In the case of errors, such as invalid base branch, or missing reviews, we surface the errors as additional comments on the pull request.
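The comment-triggered flow above could be sketched roughly like this. It's illustrative only: the real system talks to GitHub over webhooks and its API, reacting via the addReaction GraphQL mutation and posting error comments, where this stand-in just returns values.

```ruby
# Returns :thumbs_up and enqueues the PR when the checks pass; otherwise
# returns the error messages that would be posted back as comments.
def handle_comment(comment, pull_request, queue)
  return nil unless comment.strip == "/shipit"

  errors = []
  errors << "Branch CI has not passed" unless pull_request[:ci_passed]
  errors << "Missing an approved review" unless pull_request[:approved]

  if errors.empty?
    queue << pull_request
    :thumbs_up
  else
    errors
  end
end
```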

    Jumping the queue by merging directly to master is bad for overall throughput. To ensure that everyone uses the queue, we disable the ability to merge directly to master using GitHub branch protection programmatically as part of the merge queue onboarding process.
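Conceptually, the protection rule behaves like this sketch. The real enforcement is GitHub branch protection configured programmatically via GitHub's API, not application code; the actor names here are invented.

```ruby
# Only the merge queue's service account may push to master directly.
class ProtectedBranch
  def initialize(allowed_pushers)
    @allowed_pushers = allowed_pushers
  end

  def push(actor)
    unless @allowed_pushers.include?(actor)
      raise "direct pushes to master are disabled; use /shipit"
    end
    :pushed
  end
end
```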


    However, we still need to be able to bypass the queue in an emergency, like resolving a service disruption. For these cases, we added a separate /shipit --emergency command that skips any checks and merges directly to master. This helps communicate to developers that this action is reserved for emergencies only and gives us auditability into the cases where this gets used.

    Keeping Master Green

    In order to keep master green, we took another look at how and when we merged a change to master. If we run CI before merging to master, we ensure that only green changes merge. This improves the local development experience by eliminating the cases of pulling a broken master, and by speeding up the deploy process by not having to worry about delays due to a failing build.

    Our solution here is to have what we call a “predictive branch,” implemented as a git branch, onto which pull requests are merged, and CI is run. The predictive branch serves as a possible future version of master, but one where we are still free to manipulate it. We avoid maintaining a local checkout, which incurs the cost of running a stateful system that can easily be out of sync, and instead interact with this branch using the GraphQL GitHub API.

    To ensure that the predictive branch on GitHub is consistent with our desired state, we use a similar pattern as React’s “Virtual DOM.” The system constructs an in-memory representation of the desired state and runs a reconciliation algorithm we developed that performs the necessary mutations to the state on GitHub. The reconciliation algorithm synchronizes our desired state to GitHub by performing two main steps. The first step is to discard obsolete merge commits. These are commits that we may have created in the past, but are no longer needed for the desired state of the tree. The second step is to create the missing desired merge commits. Once these merge commits are created, a corresponding CI run will be triggered. This pattern allows us to alter our desired state freely when the queue changes and gives us a layer of resiliency in the case of desynchronization.
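At its core, the reconciliation step can be sketched as set arithmetic over merge commits. This is a simplification of the real algorithm, which performs its mutations through the GitHub GraphQL API.

```ruby
# Step 1: discard obsolete merge commits (present but no longer desired).
# Step 2: create the missing desired merge commits, which triggers CI.
def reconcile(actual_commits, desired_commits)
  {
    branch: desired_commits,
    discarded: actual_commits - desired_commits,
    created: desired_commits - actual_commits,
  }
end
```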

    Merge Queue v2 runs CI in the queue

    To ensure our goal of keeping master green, we need to also remove pull requests that fail CI from the queue to prevent them from cascading failures to all pull requests behind them. However, like many other large codebases, our core Shopify monolith suffers from flaky tests. The existence of these flaky tests makes removing pull requests from the queue difficult because we lack certainty about whether failed tests are legitimate or flaky. While we have work underway to clean up the test suite, we have to be resilient to the situation we have today.

    We added a failure-tolerance threshold, and only remove pull requests when the number of successive failures exceeds the failure tolerance. This is based on the idea that legitimate failures will propagate to all later CI runs, but flaky tests will not block later CI runs from passing. Larger failure tolerances will increase the accuracy, but at the tradeoff of taking longer to remove problematic changes from the queue. In order to calculate the best value, we can take a look at the flakiness rate. To illustrate, let’s assume a flakiness rate of 25%. These are the probabilities of a false positive based on how many successive failures we get.

    Failure tolerance    Probability of false positive
    0                    25%
    1                    6.25%
    2                    1.5%
    3                    0.39%
    4                    0.097%

    From these numbers, it’s clear that the probability decreases significantly with each increase to the failure tolerance. The probability will never reach exactly 0%, but in this case, a value of 3 brings us sufficiently close. This means that on the fourth consecutive failure, we remove the first pull request failing CI from the queue.
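These numbers follow from treating each flaky failure as independent, so a false positive requires failure_tolerance + 1 consecutive flaky failures:

```ruby
# P(false positive) = flakiness_rate ** (failure_tolerance + 1)
# e.g. with a 25% flakiness rate and a tolerance of 3: 0.25**4 ≈ 0.39%.
def false_positive_probability(flakiness_rate, failure_tolerance)
  flakiness_rate**(failure_tolerance + 1)
end
```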

    Increasing Throughput

    An important objective for Merge Queue v2 was to ensure we can maximize throughput. We should be continuously deploying and making sure that each deployment contains the maximum amount of pull requests we deem acceptable.

    To continuously deploy, we make sure that we have a constant flow of pull requests that are ready to go. Merge Queue v2 affords this by ensuring that CI is started for pull requests as soon as they are added to the queue. The impact is especially noticeable during incidents when we lock the queue. Since CI is running before merging to master, we will have pull requests passing and ready to deploy by the time the incident is resolved and the queue is unlocked. From the following graph, the number of queued pull requests rises as the queue gets locked, and then drops as the queue is unlocked and pull requests get merged immediately.

    The number of queued pull requests rises as the queue gets locked, and then drops as the queue is unlocked and pull requests get merged immediately

    To optimize the number of pull requests for each deploy, we split the pull requests in the merge queue up into batches. We define a batch as the maximum number of pull requests we can put in a single deploy. Larger batches result in higher theoretical throughput, but higher risk. In practice, the increased risk of larger batches impedes throughput by causing failures that are harder to isolate, and results in an increased number of rollbacks. In our application, we went with a batch size of 8 as a balance between throughput and risk.

    At any given time, we run CI on 3 batches worth of pull requests in the queue. Having a bounded number of batches ensures that we’re only using CI resources on what we will need soon, rather than the entire set of pull requests in the queue. This helps reduce cost and resource utilization.
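The batching and bounded-CI policy can be sketched as follows. The constants come from the text above; the scheduling details are simplified.

```ruby
BATCH_SIZE = 8 # pull requests per deploy
CI_BATCHES = 3 # batches' worth of queued PRs to run CI on at once

# Only spend CI resources on what we'll need soon.
def ci_candidates(queue)
  queue.first(BATCH_SIZE * CI_BATCHES)
end

# Each deploy ships at most one batch of ready pull requests.
def next_deploy_batch(ready_pull_requests)
  ready_pull_requests.first(BATCH_SIZE)
end
```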


    We improved the user experience, safety of deploying to production, and throughput of deploys through the introduction of the Merge Queue v2. While we accomplished our goals for our current level of scale, there will be patterns and assumptions that we’ll need to revisit as we grow. Our next steps will focus on the user experience and ensure developers have the context to make decisions every step of the way. Merge Queue v2 has given us flexibility to build for the future, and this is only the beginning of our plans to scale deploys.

    We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.

    Continue reading

    Four Steps to Creating Effective Game Day Tests

    At Shopify, we use Game Day tests to practice how we react to unpredictable situations. Game Day tests involve deliberately triggering failure modes within our production systems, and analyzing whether the systems handle these problems in the ways we expect. I’ll walk through a set of best practices that we use for our internal Shopify Game Day tests, and how you can apply these guidelines to your own testing.

    Shopify’s primary responsibility is to provide our merchants with a stable ecommerce platform. Even a small outage can have a dramatic impact on their businesses, so we put a lot of work into preventing them before they occur. We verify our code changes rigorously before they’re deployed, both through automated tests and manual verification. We also require code reviews from other developers who are aware of the context of these changes and their potential impact to the larger platform.

    But these upfront checks are only part of the equation. Inevitably, things will break in ways that we don’t expect, or due to forces that are outside our control. When this happens, we need to quickly respond to the issue, analyze the situation at hand, and restore the system back to a healthy state. This requires close coordination between humans and automated systems, and the only way to ensure that it goes smoothly is to practice it beforehand. Game Day tests are a great way of training your team to expect the unexpected.

    1. List All the Things That Could Break

    The first step to running a successful Game Day test is to compile a list of all the potential failure scenarios that you’re interested in analyzing. Collaborate with your team to take a detailed inventory of everything that could possibly cause your systems to go haywire. List all the problem areas you know about, but don’t stop there—stretch your imagination! 

    • What are the parts of your infrastructure that you think are 100% safe? 
    • Where are your blind spots?
    • What would happen if your servers started inexplicably running out of disk space? 
    • What would happen if you suffered a DNS outage or a DDOS attack? 
    • What would happen if all network calls to a host started timing out?
    • Can your systems support 20x their current load?

    You’ll likely end up with too many scenarios to reasonably test during a single Game Day testing session. Whittle down the list by comparing the estimated impact of each scenario against the difficulty you’d face in trying to reasonably simulate it. Try to avoid weighing particular scenarios based on your estimates of the likelihood that those scenarios will happen. Game Day testing is about insulating your systems against perfect storm incidents, which often hinge on failure points whose danger was initially underestimated.

    2. Create a Series of Experiments

    At Shopify, we’ve found that we get the best results from our Game Day tests when we run them as a series of controlled experiments. Once you’ve compiled a list of things that could break, you should start thinking about how they will break, as a list of discrete hypotheses. 

    • What are the side effects that you expect will be triggered when you simulate an outage during your test? 
    • Will the correct alerts be dispatched? 
    • Will downstream systems manifest the expected behaviors?
    • When you stop simulating a problem, will your systems recover back to their original state?

    If you express these expectations in the form of testable hypotheses, it becomes much easier to plan the actual Game Day session itself. Use a separate spreadsheet (in Google Sheets or a similar tool) to catalogue the prerequisite steps your team will walk through to simulate each failure scenario. Below those steps, list the behaviors you hypothesize will occur when you trigger the scenario, along with an indicator of whether each behavior actually occurred. Lastly, make sure to list the steps necessary to restore your system back to its original state.
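    To make the shape of such an experiment concrete, here's a sketch in Python (the scenario and hypothesis text are invented for illustration) that mirrors the spreadsheet's structure: setup steps, hypotheses with an observed/not-observed flag, and teardown steps:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    description: str        # e.g. "on-call developer is paged within 5 minutes"
    observed: bool = False  # filled in during the Game Day session

@dataclass
class GameDayExperiment:
    scenario: str                                       # the failure being simulated
    setup_steps: list                                   # steps to trigger the failure
    hypotheses: list = field(default_factory=list)
    teardown_steps: list = field(default_factory=list)  # steps to restore original state

    def gaps(self):
        """Hypotheses that did not hold in practice: behaviors to investigate."""
        return [h for h in self.hypotheses if not h.observed]

# Example: simulating an upstream service outage.
experiment = GameDayExperiment(
    scenario="Upstream payments service returns 503s",
    setup_steps=["Route payments traffic to a blackhole endpoint"],
    hypotheses=[
        Hypothesis("On-call developer is paged within 5 minutes"),
        Hypothesis("Checkout degrades gracefully with a retry banner"),
    ],
    teardown_steps=["Restore the original payments route"],
)
experiment.hypotheses[0].observed = True  # mark what actually happened
print([h.description for h in experiment.gaps()])
```

    Whether you track this in code or a spreadsheet matters less than the discipline it imposes: every scenario has explicit setup, explicit expectations, and explicit teardown.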

    Example spreadsheet for a Game Day test that simulates an upstream service outage. A link to this spreadsheet is available in the “Additional Resources” section below.

    3. Test Your Human Systems Too

    By this point, you’ve compiled a series of short experiments that describe how you expect your systems to react to a list of failure scenarios. Now it’s time to run your Game Day test and validate your experimental hypotheses. There are a lot of different ways to run a Game Day test, and one approach isn’t necessarily better than another. How you approach the testing should be tailored to the types of systems you’re testing, the way your team is structured and communicates, the impact your testing poses to production traffic, and so on. Whatever approach you take, just make sure that you track your experiment results as you go along!

    However, there is one common element that should be present regardless of the specifics of your particular testing setup: team involvement. Game Day tests aren’t just about seeing how your automated systems react to unexpected pressures—you should also use the opportunity to analyze how your team handles these situations on the people side. Good team communication under pressure can make a huge difference when it comes to mitigating the impact of a production incident. 

    • What are the types of interactions that need to happen among team members as an incident unfolds? 
    • Is there a protocol for how work is distributed among multiple people? 
    • Do you need to communicate with anyone from outside your immediate team?

    Make sure you have a basic system in place to prevent people from doing the same task twice, or incorrectly assuming that something is already being handled.
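    That "basic system" can be as simple as a shared checklist where each task gets exactly one owner. As an illustrative sketch (not a tool we're prescribing), here's a tiny in-memory task board in Python where the first responder to claim a task owns it and later claims are rejected:

```python
import threading

class TaskBoard:
    """A minimal incident task board: the first responder to claim a
    task owns it; subsequent claims are rejected so work isn't duplicated."""

    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def claim(self, task, responder):
        """Return True if the responder now owns the task, False if taken."""
        with self._lock:
            if task in self._owners:
                return False  # someone already owns this task
            self._owners[task] = responder
            return True

    def owner(self, task):
        return self._owners.get(task)

board = TaskBoard()
print(board.claim("failover-database", "alice"))  # first claim succeeds
print(board.claim("failover-database", "bob"))    # duplicate claim is rejected
print(board.owner("failover-database"))
```

    In practice this role is usually filled by an incident-management channel or tool rather than custom code, but the invariant is the same: one task, one visible owner.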

    4. Address Any Gaps Uncovered

    After running your Game Day test, it’s time to patch the holes that you uncovered. Your experiment spreadsheets should be annotated with whether each hypothesis held up in practice.

    • Did your off-hours alerting system page the on-call developer? 
    • Did you correctly switch over to reading from the backup database? 
    • Were you able to restore things back to their original healthy state?

    For any gaps you uncover, work with your team to determine why the expected behavior didn’t occur, then establish a plan for how to correct the failed behavior. After doing so, you should ideally run a new Game Day test to verify that your hypotheses are now valid with the new fixes in place.

    This is also the opportunity to analyze any gaps in communication between your team, or problems that you identified regarding how people distribute work among themselves when they’re under pressure. Set aside some time for a follow up discussion with the other Game Day participants to discuss the results of the test, and ask for their input on what they thought went well versus what could use some improvement. Finally, make any necessary changes to your team’s guidelines for how to respond to these incidents going forward.

    In Conclusion

    Using these best practices, you should be able to execute a successful Game Day test that gives you greater confidence in how your systems—and the humans that control them—will respond during unexpected incidents. And remember that a Game Day test isn’t a one-time event: you should periodically update your hypotheses and conduct new tests to make sure that your team remains prepared for the unexpected. Happy testing!

    Additional resources

