How Shopify Uses WebAssembly Outside of the Browser

How Shopify Uses WebAssembly Outside of the Browser

On February 24, 2021, Shipit!, our monthly event series, presented Making Commerce Extensible with WebAssembly. The video is now available.

At Shopify we aim to make what most merchants need easy, and the rest possible. We make the rest possible by exposing interfaces to query, extend and alter our Platform. These interfaces empower a rich ecosystem of Partners to solve a variety of problems. The primary mechanism of this ecosystem is an “App”, an independently hosted web service which communicates with Shopify over the network. This model is powerful, but comes with a host of technical issues. Partners are stretched beyond their available resources as they have to build a web service that can operate at Shopify’s scale. Even if Partners’ resources were unlimited, the network latency incurred when communicating with Shopify precludes the use of Apps for time sensitive use cases.

We want Partners to focus on using their domain knowledge to solve problems, and not on managing scalable web services. To make this a reality we’re keeping the flexibility of untrusted Partner code, but executing it on our own infrastructure. We choose a universal format for that code that ensures it’s performant, secure, and flexible: WebAssembly.


What is WebAssembly? According to

“WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.”

To learn more, see this series of illustrated articles written by Lin Clark of Mozilla with information on WebAssembly and its history.

Wasm is often presented as a performant language that runs alongside JavaScript from within the Browser. We, however, execute Wasm outside of the browser and with no Javascript involved. Wasm, far from being solely a Javascript replacement, is designed for Web and Non-Web Embeddings alike. It solves the more general problem of performant execution in untrusted environments, which exists in browsers and code execution engines alike. Wasm satisfies our three main technical requirements: security, performance, and flexibility.


Executing untrusted code is a dangerous thing—it's exceptionally difficult to predict by nature, and it has potential to cause harm to Shopify’s platform at large. While no application is entirely secure, we need to both prevent security flaws and mitigate their impacts when they occur.

Wasm executes within a sandboxed stack-based environment, relying upon explicit imports to allow communication with the host. Because of this, you cannot express anything malicious in Wasm. You can only express manipulations of the virtual environment and use provided imports. This differs from bytecodes which have references to the computers or operating systems they expect to run on built right into the syntax.

Wasm also hosts a number of features which protect the user from buggy code, including protected call stacks and runtime type checking. More details on the security model of Wasm can be found on


In ecommerce, speed is a competitive advantage that merchants need to drive sales. If a feature we deliver to merchants doesn’t come with the right tradeoff of load times to customization value, then we may as well not deliver it at all.

Wasm is designed to leverage common hardware capabilities that provide it near native performance on a wide variety of platforms. It’s used by a community of performance driven developers looking to optimize browser execution. As a result, Wasm and surrounding tooling was built, and continues to be built, with a performance focus.


A code execution service is only as useful as the developers using it are productive. This means providing first class development experiences in multiple languages they’re familiar with. As a bytecode format, Wasm is targeted by a number of different compilers. This allows us to support multiple languages for developer use without altering the underlying execution model.

Community Driven

We have a fundamental alignment in goals and design, which provides our “engineering reason” for using Wasm. But there’s more to it than that—it’s about the people as well as the technology. If nobody was working on the Wasm ecosystem, or even if it was just on life support in its current state, we wouldn’t use it. WebAssembly is an energized community that’s constantly building new things and has a lot of potential left to reach. By becoming a part of that community, Shopify stands to gain significantly from that enthusiasm.

We’re also contributing to that enthusiasm ourselves. We’re collecting user feedback, discussing feature gaps, and most importantly contributing to the open source tools we depend on. We think this is the start of a healthy reciprocal relationship between ourselves and the WebAssembly community, and we expect to expand these efforts in the future.

Architecture of our Code Execution Service

Now that we’ve covered WebAssembly and why we’re using it, let’s move onto how we’re executing it.

We use an open source tool called Lucet (originally written by Fastly). As a company, Fastly provides a programmable edge cloud platform. They’re trying to bring execution of high-volume, short-lived, and untrusted modules closer to where they’re being requested. This is the same as the problem we’re trying to solve with our Partner code, so it’s a natural fit to be using the same tools.


Lucet is both a runtime and a compiler for Wasm. Modules are represented in Wasm for the safety that representation provides. Recall that you can’t express anything malicious in Wasm. Lucet takes advantage of this and uses a validation of the Wasm module as a security check. After the validation, the module is compiled to an executable artifact with near bare metal performance. It also supports ahead of time compilation, allowing us to have these artifacts ready to execute at runtime. Lucet containers boast an impressive startup time of 35 μs. That’s because it’s a container that doesn’t need to do anything at all to start up.  If you want the full picture, Tyler McMullan, the CTO of Fastly, did a great talk which gives an overview of Lucet and how it works.

A flow diagram showing how Shopify uses our Wasm engine: Lucet wrapped within a Rust web service which manages the I/O and storage of modules
A flow diagram showing Shopify's Wasm engine

We wrap Lucet within a Rust web service which manages the I/O and storage of modules, which we call the Wasm Engine. This engine is called by Shopify during a runtime process, usually a web request, in order to satisfy some function. It then applies the output in context of the callsite. This application could involve the creation of a discount, the enforcement of a constraint, or any form of synchronous behaviour Merchants want to customize within the Platform.

Execution Performance

Here’s some metrics pulled from a recent performance test. During this test, 100k modules were executed per minute for approximately 5 min. These modules contained a trivial implementation of enforcing a limit on the number of items purchased in a cart. 

A line graph showcasing the time taken to execute a module. The x axis representing the time over the test was running and the y axis is the time represented in ms
Time taken to execute a module

This chart demonstrates a breakdown of the time taken to execute a module, including I/O with the container and the execution of the module. The y-axis is time in ms, the x-axis is the time over which the test was running.

The light purple bar shows the time taken to execute the module in Lucet, the width of which hovers around 100 μs. The remaining bars deal with I/O and engine specifics, and the total time of execution is around 4 ms. All times are 99th percentiles (p99).To put these times in perspective, let’s compare these times to the request times of Storefront Renderer, our performant Online Store rendering service:

A line graph showing Storefront Renderer Response time
Storefront Renderer response time

This chart demonstrates the request time to Storefront Renderer over time. The y-axis is request time in seconds. The x-axis is the time over which the values were retrieved. The light blue line representing the 99th percentile hovers around 700 ms.

Then if we consider the time taken by our module execution process to be generally under 5 ms, we can say that the performance impact of Lucet execution is negligible.

Generating WebAssembly

To get value out of our high performance execution engine, we’ll need to empower developers to create compatible Wasm modules. Wasm is primarily intended as a compilation target, rather than something you write by hand (though you can write Wasm by hand). This leaves us with the question of what languages we’ll support and to what extent.

Theoretically any language with a Wasm target can be supported, but the effort developers spend to conform to our API is better focused on solving problems for merchants. That’s why we’ve chosen to provide first class support to a single language that includes tools that get developers up and running quickly.At Shopify, our language of choice is Ruby. However, because Ruby is a dynamic language, we can’t compile it down to Wasm directly. We explored solutions involving compiling interpreters, but found that there was a steep performance penalty. Because of this, we decided to go with a statically compiled language and revisit the possibility of dynamic languages in the future.

Through our research we found that developers in our ecosystem were most familiar with Javascript. Unfortunately, Javascript was precluded as it’s a dynamic language like Ruby. Instead, we chose a language with familiar TypeScript-like syntax called AssemblyScript.

Using AssemblyScript

At first glance, there are a huge number of languages that support a WebAssembly target. Unfortunately, there are two broad categories of WebAssembly compilers which we can’t use:

  • Compilers that generate environment or language specific artifacts, namely node or the browser. (Examples: Asterius, Blazor)
  • Compilers that are designed to work only with a particular Runtime. The modules generated by these compilers rely upon special language specific imports. This is often done to support a language’s standard library, which expects certain system calls or runtime features to be available. Since we don’t want to be locked down to a certain language or tool, we don’t use these compilers. (Examples: Lumen)

These are powerful tools in the right conditions, but aren’t built for our use case. We need tools that produce WebAssembly, rather than tools which are powered by WebAssembly. AssemblyScript is one such tool.

AssemblyScript, like many tools in the WebAssembly space, is still under development. It’s missing a few key features, such as closure support, and it still has a number of edge case bugs. This is where the importance of the community comes in.

The language and the tooling around AssemblyScript has an active community of enthusiasts and maintainers who have supported Shopify since we first started using the language in 2019. We’ve supported the community through an OpenCollective donation and continuing code contributions. We’ve written a language server, made some progress towards implementing closures, and have written bug fixes for the compiler and surrounding tooling.

We’ve also integrated AssemblyScript into our own early stage tooling. We’ve built integrations into the Shopify CLI which will allow developers to create, test, and deploy modules from their command line. To improve developer ergonomics, we provide SDKs which handle the low level implementation concerns of Shopify defined objects like “Money”. In addition to these tools, we’re building out systems which allow Partners to monitor their modules and receive alerts when their modules fail. The end goal is to give Partners the ability to move their code onto our service without losing any of the flexibility or observability they had on their own platform.

New Capabilities, New Possibilities

As we tear down the boundaries between Partners and Merchants, we connect merchants with the entrepreneurs ready to solve their problems. If you have ideas on how our code execution could help you and the Apps you own or use, please tweet us at @ShopifyEng. To learn more about Apps at Shopify and how to get started, visit our developer page.

Duncan is a Senior Developer at Shopify. He is currently working on the Scripts team, a team dedicated to enabling and managing untrusted code execution within Shopify for Merchants and Developers alike.

Shipit! Presents: Making Commerce Extensible with WebAssembly


If you love working with open source tools, are passionate about API design and extensibility, and want to work remotely, we’re always hiring! Reach out to us or apply on our careers page.

Continue reading

Simplify, Batch, and Cache: How We Optimized Server-side Storefront Rendering

Simplify, Batch, and Cache: How We Optimized Server-side Storefront Rendering

On December 16, 2020 we held Shipit! presents: Performance Tips from the Storefront Renderer Team. A video for the event is now available for you to learn more about how the team optimized this Ruby application for the particular use case of serving storefront traffic. Click here to watch the video.

By Celso Dantas and Maxime Vaillancourt

In the previous post about our new storefront rendering engine, we described how we went about the rewrite process and smoothly transitioned to serve storefront requests with the new implementation. As a follow-up and based on readers’ comments and questions, this post dives deeper into the technical details of how we built the new storefront rendering engine to be faster than the previous implementation.

To set the table, let’s see how the new storefront rendering engine performs:

  • It generates a response in less than ~45ms for 75% of storefront requests;
  • It generates a response in less than ~230ms for 90% of storefront requests;
  • It generates a response in less than ~900ms for 99% of storefront requests.

Thanks to the new storefront rendering engine, the average storefront response is nearly 5x faster than with the previous implementation. Of course, how fast the rendering engine is able to process a request and spit out a response depends on two key factors: the shop’s Liquid theme implementation, and the number of resources needed to process the request. To get a better idea of where the storefront rendering engine spends its time when processing a request, try using the Shopify Theme Inspector: this tool will help you identify potential bottlenecks so you can work on improving performance in those areas.

A data scheme diagram showing that the Storefront Renderer and Redis instance are contained in a Kubernetes node. The Storefront Renderer sends Redis data. The Storefront Renderer sends data to two sharded data stores outside of the Kubernetes node: Sharded MySQL and Sharded Redis
A simplified data schema of the application

Before we cover each topic, let’s briefly describe our application stack. As mentioned in the previous post, the new storefront rendering engine is a Ruby application. It talks to a sharded MySQL database and uses Redis to store and retrieve cached data.

Optimizing how we load all that data is extremely important. As one of our requirements was to improve rendering time for Storefront requests. Here are some of the approaches that we took to accomplish that.

Using MySQL’s Multi-statement Feature to Reduce Round Trips

To reduce the number of network round trips to the database, we use MySQL’s multi-statement feature to allow sending multiple queries at once. With a single request to the database, we can load data from multiple tables at once. Here’s a simplified example:

This request is especially useful to batch-load a lot of data very early in the response lifecycle based on the incoming request. After identifying the type of request, we trigger a single multi-statement query to fetch the data we need for that particular request in one go, which we’ll discuss later in this blog post. For example, for a request for a product page, we’ll load data for the product, its variants, its images, and other product-related resources in addition to information about the shop and the storefront theme, all in a single round-trip to MySQL.

Implementing a Thin Data Mapping Layer

As shown above, the new storefront rendering engine uses handcrafted, optimized SQL queries. This allows us to easily write fine-tuned SQL queries to select only the columns we need for each resource and leverage JOINs and sub-SELECT statements to optimize data loading based on the resources to load which are sometimes less straightforward to implement with a full-service object-relational mapping (ORM) layer.

However, the main benefit of this approach is the tiny memory footprint of using a raw MySQL client compared to using an object-relational mapping (ORM) layer that’s unnecessarily complex for our needs. Since there’s no unnecessary abstraction, forgoing the use of an ORM drastically simplifies the flow of data. Once the raw rows come back from MySQL, we effectively use the simplest ORM possible: we create plain old Ruby objects from the raw rows to model the business domain. We then use these Ruby objects for the remainder of the request. Below is an example of how it’s done.

Of course, not using an ORM layer comes with a cost: if implemented poorly, this approach can lead to more complexity leaking into the application code. Creating thin model abstractions using plain old Ruby objects prevents this from happening, and makes it easier to interact with resources while meeting our performance criteria. Of course, this approach isn’t particularly common and has the potential to cause panic in software engineers who aren’t heavily involved in performance work, instead worrying about schema migrations and compatibility issues. However, when speed is critical, we accept to take on that complexity.

Book-keeping and Eager-loading Queries

An HTTP request for a Shopify storefront may end up requiring many different resources from data stores to render properly. For example, a request for a product page could lead to requiring information about other products, images, variants, inventory information, and a whole lot of other data not loaded on multi-statement select. The first time the storefront rendering engine loads this page, it needs to query the database, sometimes making multiple requests, to retrieve all the information it needs. This usually happens during the request at any given time.

A flow diagram showing the Storefront Renderer's requests from  the data stores and how it uses a Query Book Keeper Middlewear to eager-load data
Flow of a request with the Book-keeping solution

As it retrieves this data for the first time, the storefront rendering engine keeps track of the queries it performed on the database for that particular product page and stores that list of queries in a key-value store for later use. When an HTTP request for the same product page comes in later (which it knows when the cache key matches), the rendering engine looks up the list of queries it performed throughout the previous request of the same type and performs those queries all at once, at the very beginning of the current request, because we’re pretty confident we’ll need them for this request (since they were used in the previous request).

This book-keeping mechanism lets us eager-load data we’re pretty confident we’ll need. Of course, when a page changes, this may lead to over-fetching and/or under-fetching, which is expected, and the shape of the data we fetch stabilizes quickly over time as more requests come in.

On the other side, some liquid models of Shopify’s storefronts are not accessed as frequently, and we don’t need to eager-load data related to them. If we did, we’d increase I/O wait time for something that we probably wouldn’t use very often. What the new rendering engine does instead is lazy-load this data by default. Unless the book-keeping mechanism described above eager-loads it, we’ll defer retrieving data to only load it if it’s needed for a particular request.

Implementing Caching Layers

Much like a CPU’s caching architecture, the new rendering engine implements multiple layers of caching to accelerate responses.

A critical aside before we jump into this section: adding caching should never be the first step towards building performance-oriented software. Start by building a solution that’s extremely fast from the get go, even without caching. Once this is achieved, then consider adding caching to reduce load on the various components on the system while accelerating frequent use cases. Caching is like a sharp knife and can introduce hard to detect bugs.

In-Memory Cache

A data scheme diagram showing that the Storefront Renderer and Redis instance are contained in a Kubernetes node. Within the Storefront Renderer is an In-memory cache. The Storefront Renderer sends Redis data. The Storefront Renderer sends data to two sharded data stores outside of the Kubernetes node: Sharded MySQL and Sharded Redis
A simplified data schema of the application with an in-memory cache for the Storefront Renderer

At the frontline of our caching system is an in-memory cache that you can essentially think of as a global hash that’s shared across requests within each web worker. Much like the majority of our caching mechanisms, this caching layer uses the LRU caching algorithm. As a result, we use this caching layer for data that’s accessed very often. This layer is especially useful in high throughput scenarios such as flash sales.

Node-local Shared Caching

As a second layer on top of the in-memory cache, the new rendering engine leverages a node-local Redis store that’s shared across all server workers on the same node. Since the database is available on the same machine as the rendering engine process itself, this node-local data transfer prevents network overhead and improves response times. As a result, multiple Ruby processes benefit from sharing cached data with one another.

Full-page Caching

Once the rendering engine successfully renders a full storefront response for a particular type of request, we store the final output (most often an HTML or JSON string) into the local Redis for later retrieval for subsequent requests that match the same cache key. This full-page caching solution lets us prevent regenerating storefront responses if we can by using the output we previously computed.

Database Query Results Caching

In a scenario where the full-page output cache, the in-memory cache, and the node-local cache doesn’t have a valid entry for a given request, we need to reach all the way to the database. Once we get a result back from MySQL, we transparently cache the results in Redis for later retrieval based on the queries and their parameters. As long as the cache keys don’t change, running the same database queries over and over always hit Redis instead of reaching all the way to the database.

Liquid Object Memoizer

Thanks to the Liquid templating language, merchants and partners may build custom storefront themes. When loading a particular storefront page, it’s possible that the Liquid template to render includes multiple references to the same object. This is common on the product page for example, where the template will include many references to the product object:
{{ product.title }}, {{ product.description }}, {{ product.featured_media }}, and others.

Of course, when each of these are executed, we don’t fetch the product over and over again from the database—we fetch it once, then keep it in memory for later use throughout the request lifecycle. This means that if the same product object is required multiple times at different locations during the render process, we’ll always use the same one and only instance of it throughout the entire request lifecycle.

The Liquid object memoizer is especially useful when multiple different Liquid objects end up loading the same resource. For example, when loading multiple product objects on a collection page using {{ collection.products }} and then referring to a particular product using {{ all_products[‘cowboy-hat’] }} on a collection page, with the Liquid object memoizer we’ll load it from an external data store once, then store it in memory and fetch it from there if it’s needed later. On average, across all Shopify storefronts, we see that the Liquid object memoizer prevents between 16 and 20 accesses to Redis and/or MySQL for every single storefront request, where we leverage the in-memory cache instead. In some extreme cases, we see that the memoizer prevents up to 4,000 calls to data stores per request.

Reducing Memory Allocations

Writing Memory-aware Code

Garbage collection execution is expensive. So we write code that doesn’t generate unnecessary objects. Use of methods and algorithms that modify objects in place, instead of generating a new object. For example:

  • use map! instead of map when dealing with lists. It prevents a new Array object from being created.
  • Use string interpolation instead of string concatenation. Interpolation does not create intermediate unnecessary String objects.

This may not seem like much, but consider this: using #map! instead of #map could reduce your memory usage significantly, even when simply looping over an array of integers to double the values.

Let’s set up an following array of 1000 integers from 1 to 1000:

array = (1..1000).to_a

Then, let’s double each number in the array with Array#map: { |i| i * 2 }

The line above leads to one object allocated in memory, for a total of 8040 bytes.

Now let’s do the same thing with Array#map! instead:! { |i| i * 2 }

The line above leads to zero object allocated in memory, for a total of 0 bytes.

Even with this tiny example, using map! instead of map saves ~8 kilobytes of allocated memory, and considering the sheer scale of the Shopify platform and the storefront traffic throughput it receives, every little bit of memory optimization counts to help the garbage collector run less often and for smaller periods of time, thus improving server response times.

With that in mind, we use tracing and profiling tools extensively to dive deeper into areas in the rendering engine that are consuming too much memory and to make precise changes to reduce memory usage.

Method-specific Memory Benchmarking

To prevent accidentally increasing memory allocations, we built a test helper method that lets us benchmark a method or a block to know many memory allocations and allocated bytes it triggers. Here’s how we use it:

This benchmark test will succeed if calling Product.find_by_handle('cowboy-hat') matches the following criteria:

  • The call allocates between 48 and 52 objects in memory;
  • The call allocates between 5100 and 5200 bytes in memory.

We allow a range of allocations because they’re not deterministic on every test run. This depends on the order in which tests run and the way data is cached, which can affect the final number of allocations.

As such, these memory benchmarks help us keep an eye on memory usage for specific methods. In practice, they’ve prevented introducing inefficient third-party gems that bloat memory usage, and they’ve increased awareness of memory usage to developers when working on features.

We covered three main ways to improve server-side performance: batching up calls to external data stores to reduce roundtrips, caching data in multiple layers for specific use cases, and simplifying the amount of work required to fulfill a task by reducing memory allocations. When they’re all combined, these approaches lead to big time performance gains for merchants on the platform—the average response time with the new rendering engine is 5x faster than with the previous implementation. 

Those are just some of the techniques that we are using to make the new application faster. And we never stop exploring new ways to speed up merchant’s storefronts. Faster rendering times are in the DNA of our team!

- The Storefront Renderer Team

Celso Dantas is a Staff Developer on the Storefront Renderer team. He joined Shopify in 2013 and has worked on multiple projects since then. Lately specializing in making merchants storefront faster.

Maxime Vaillancourt is a Senior Developer on the Storefront Rendering team. He has been at Shopify for 3 years, starting on the Online Store Themes team, then specializing towards storefront performance with the storefront rendering engine rewrite.

Shipit! Presents: Performance Tips from the Storefront Renderer Team

The Storefront Renderer is a server-side application that loads a Shopify merchant's storefront Liquid theme, along with the data required to serve the request (for example product data, collection data, inventory information, and images), and returns the HTML response back to your browser. On average, server response times for the Storefront Renderer are four times faster than the implementation it replaced.

Our blog post, How Shopify Reduced Storefront Response Times with a Rewrite generated great discussions and questions. This event looks to answer those questions and dives deeper into the technical details of how we made the Storefront Renderer engine faster.

​​​​​​​During this event you will learn how we:
  • optimized data access
  • implemented caching layers
  • reduced memory allocations

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Resiliency Planning for High-Traffic Events

Resiliency Planning for High-Traffic Events

On January 27, 2021 Shipit!, our monthly event series, presented Building a Culture of Resiliency at Shopify. Learn about creating and maintaining resiliency plans for large development teams, testing and tooling, developing incident strategies, and incorporating and improving feedback loops. The video is now available.

Each year, Black Friday Cyber Monday weekend represents the peak of activity for Shopify. Not only is this the most traffic we see all year, but it’s also the time our merchants put the most trust in our team. Winning this weekend each year requires preparation, and it starts as soon as the weekend ends.

Load Testing & Stress Testing: How Does the System React?

When preparing for a high traffic event, load testing regularly is key. We have discussed some of the tools we use already, but I want to explain how we use these exercises to build towards a more resilient system.

While we use these tests to confirm that we can sustain required loads or probe for new system limits, we can also use regular testing to find potential regressions. By executing the same experiments on a regular basis, we can spot any trends at easily handled traffic levels that might spiral into an outage at higher peaks.

This same tool allows us to run similar loads against differently configured shops and look for differences caused by the theme, configuration, and any other dimensions we might want to use for comparison.

Resiliency Matrix: What are Our Failure Modes?

If you've read How Complex Systems Fail, you know that "Complex systems are heavily and successfully defended against failure" and "Catastrophe requires multiple failures - single point failures are not enough.” For that to be true, we need to understand our dependencies, their failure modes, and how those impact the end-user experience.

We ask teams to construct a user-centric resiliency matrix, documenting the expected user experience under various scenarios. For example:

This user-centric resiliency matrix shows the potential failures and their impact on user experience. For example, can a user browse (yes) or check out (no) if MySQL is down.
User-centric resiliency matrix documenting expected user experience and possible failures

The act of writing this matrix serves as a very basic tabletop chaos exercise. It forces teams to consider how well they understand their dependencies and what the expected behaviors are.

This exercise also provides a visual representation of the interactions between dependencies and their failure modes. Looking across rows and columns reveals areas where the system is most fragile. This provides the starting point for planning work to be done. In the above example, this matrix should start to trigger discussion around the ‘User can check out’ experience and what can be done to make this more resilient to a single dependency going ‘down’.

Game Days: Do Our Models Match?

So, we’ve written our resilience matrix. This is a representation of our mental model of the system, and when written, it's probably a pretty accurate representation. However, systems change and adapt over time, and this model can begin to diverge from reality.

This divergence is often unnoticed until something goes wrong, and you’re stuck in the middle of a production incident asking “Why?”. Running a game day exercise allows us to test the documented model against reality and adjust in a controlled setting.

The plan for the game day will derive from the resilience matrix. For the matrix above, we might formulate a plan like:

This game day exercise allows us to test the model against reality and adjust in a controlled setting. This plan lays out scenarios to be tested and how they will be accomplished.
Game day planning scenarios 

Here, we are laying out what scenarios are to be tested, how those will be accomplished, and what we expect to happen. 

We’re not only concerned with external effects (what works, what doesn’t), but internally do any expected alerts fire, are the appropriate on-call teams paged, and do those folks have the information available to understand what is happening?

If we refer back to How Complex Systems Fail, the defences against failure are technical, human, and organizational. On a good game day, we’re attempting to exercise all of these.

  • Do any automated systems engage?
  • Do the human operators have the knowledge, information and tools necessary to intervene?
  • Do the processes and procedures developed help or hinder responding to the outage scenario?

By tracking the actual observed behavior, we can then update the matrix as needed or make changes to the system in order to bring our mental model and reality back into alignment.

Incident Analysis: How Do We Get Better?

During the course of the year, incidents happen which disrupt service in some capacity. While the primary focus is always in restoring service as fast as possible, each incident also serves as a learning opportunity.

This article is not about why or how to run a post-incident review; there are more than enough well-written pieces by folks who are experts on the subject. But to refer back to How Complex Systems Fail, one of the core tenets in how we learn from incidents is “Post-accident attribution to a ‘root cause’ is fundamentally wrong.”

When focusing on a single root cause, we stop at easy, shallow actions to resolve the ‘obvious’ problem. However, this ignores deeper technical, organizational, and cultural issues that contributed to the issue and will again if uncorrected.

What’s Special About BFCM?

We’ve talked about the things we’re constantly doing, year-round to ensure we’re building for reliability and resiliency and creating an anti-fragile system that gets better after every disruption. So what do we do that’s special for the big weekend?

We’ve already mentioned How Complex Systems Fail several times, but to go back to that well once more, “Change introduces new forms of failure.” As we get closer to Black Friday, we slow down the rate of change.

This doesn’t mean we’re sitting on our hands and hoping for the best, but rather we start to shift where we’re investing our time. Fewer new services and features as we get closer, and more time spent dealing with issues of performance, reliability, and scale.

We review defined resilience matrices carefully, start running more frequent game days and load tests and working on any issues or bottlenecks those reveal. This means updating runbooks, refining internal tools, and shipping fixes for issues that this activity brings to light.

All of this comes together to provide a robust, reliable platform to power over $5.1 billion in sales.

Shipit! Presents: Building a Culture of Resiliency at Shopify

Watch Ryan talk about how we build a culture of resiliency at Shopify to ensure a robust, reliable platform powering over $5.1 billion in sales.


Ryan is a Senior Development Manager at Shopify. He currently leads the Resiliency team, a centralized globally distributed SRE team responsible for keeping commerce better for everyone.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

How to Reliably Scale Your Data Platform for High Volumes

How to Reliably Scale Your Data Platform for High Volumes

By Arbab Ahmed and Bruno Deszczynski

Black Friday and Cyber Monday—or as we like to call it, BFCM—is one of the largest sales events of the year. It’s also one of the most important moments for Shopify and our merchants. To put it into perspective, this year our merchants across more than 175 countries sold a record breaking $5.1+ billion over the sales weekend. 

That’s a lot of sales. That’s a lot of data, too.

This BFCM, the Shopify data platform saw an average throughput increase of 150 percent. Our mission as the Shopify Data Platform Engineering (DPE) team is to ensure that our merchants, partners, and internal teams have access to data quickly and reliably. It shouldn’t matter if a merchant made one sale per hour or a million; they need access to the most relevant and important information about their business, without interruption. While this is a must all year round, the stakes are raised during BFCM.

Creating a data platform that withstands the largest sales event of the year means our platform services need to be ready to handle the increase in load. In this post, we’ll outline the approach we took to reliably scale our data platform in preparation for this high-volume event. 

Data Platform Overview

Shopify’s data platform is an interdisciplinary mix of processes and systems that collect and transform data for use by our internal teams and merchants. It enables access to data through a familiar pipeline:

  • Ingesting data in any format, from any part of Shopify. “Raw” data (for example, pageviews, checkouts, and orders) is extracted from Shopify’s operational tables without any manipulation. Data is then conformed to an Apache Parquet format on disk.
  • Processing data, in either batches or streams, to form the foundations of business insights. Batches of data are “enriched” with models developed by data scientists, and processed within Apache Spark or dbt
  • Delivering data to our merchants, partners, and internal teams so they can use it to make great decisions quickly. We rely on an internal collection of streaming and serving applications, and libraries that power the merchant-facing analytics in Shopify. They’re backed by BigTable, GCS, and CloudSQL.

In an average month, the Shopify data platform processes about 880 billion MySQL records and 1.75 trillion Kafka messages.

Tiered Services

As engineers, we want to conquer every challenge right now. But that’s not always realistic or strategic, especially when not all data services require the same level of investment. At Shopify, a tiered services taxonomy helps us prioritize our reliability and infrastructure budgets in a broadly declarative way. It’s based on the potential impact to our merchants and looks like this:

Tier 1

This service is critical externally, for example. to a merchant’s ability to run their business

Tier 2

This service is critical internally to business functions, e.g. a operational monitoring/alerting service

Tier 3

This service is valuable internally, for example, internal documentation services

Tier 4

This service is an experiment, in very early development, or is otherwise disposable. For example, an emoji generator

The highest tiers are top priority. Our ingestion services, called Longboat and Speedboat, and our merchant-facing query service Reportify are examples of services in Tier 1.

The Challenge 

As we’ve mentioned, each BFCM the Shopify data platform receives an unprecedented volume of data and queries. Our data platform engineers did some forecasting work this year and predicted nearly two times the traffic of 2019. The challenge for DPE is ensuring our data platform is prepared to handle that volume. 

When it comes to BFCM, the primary risk to a system’s reliability is directly proportional to its throughput requirements. We call it throughput risk. It increases the closer you get to the front of the data pipeline, so the systems most impacted are our ingestion and processing systems.

With such a titillating forecast, the risk we faced was unprecedented throughput pressure on data services. In order to be BFCM ready, we had to prepare our platform for the tsunami of data coming our way.

The Game Plan

We tasked our Reliability Engineering team with Tier 1 and Tier 2 service preparations for our ingestion and processing systems. Here’s the steps we took to prepare our systems most impacted by BFCM volume:

1. Identify Primary Objectives of Services

A data ingestion service's main operational priority can be different from that of a batch processing or streaming service. We determine upfront what the service is optimizing for. For example, if we’re extracting messages from a limited-retention Kafka topic, we know that the ingestion system needs to ensure, above all else, that no messages are lost in the ether because they weren’t consumed fast enough. A batch processing service doesn’t have to worry about that, but it may need to prioritize the delivery of one dataset versus another.

In Longboat’s case, as a batch data ingestion service, its primary objective is to ensure that a raw dataset is available within the interval defined by its data freshness service level objective (SLO). That means Longboat is operating reliably so long as every dataset being extracted is no older than eight hours— the default freshness SLO. For Reportify, our main query serving service, its primary objective is to get query results out as fast as possible; its reliability is measured against a latency SLO.

2. Pinpoint Service Knobs and Levers

With primary objectives confirmed, you need to identify what you can “turn up or down” to sustain those objectives.

In Longboat’s case, extraction jobs are orchestrated with a batch scheduler, and so the first obvious lever is job frequency. If you discover a raw production dataset is stale, it could mean that the extraction job simply needs to run more often. This is a service-specific lever.

Another service-specific lever is Longboat’s “overlap interval” configuration, which configures an extraction job to redundantly ingest some overlapping span of records in an effort to catch late-arriving data. It’s specified in a number of hours.

Memory and CPU are universal compute levers that we ensure we have control of. Longboat and Reportify run on Google Kubernetes Engine, so it’s possible to demand that jobs request more raw compute to get their expected amount of work done within their scheduled interval (ignoring total compute constraints for the sake of this discussion).

So, in pursuit of data freshness in Longboat, we can manipulate:

  1. Job frequency
  2. Longboat overlap interval
  3. Kubernetes Engine Memory/CPU requests

In pursuit of latency in Reportify, we can turn knobs like its:

  1. BigTable node pool size 
  2. ProxySQL connection pool/queue size

3. Run Load Tests!

Now that we have some known controls, we can use them to deliberately constrain the service’s resources. As an example, to simulate an unrelenting N-times throughput increase, we can turn the infrastructure knobs so that we have 1/N the amount of compute headroom, so we’re at N-times nominal load.

For Longboat’s simulation, we manipulated its “overlap interval” configuration and tripled it. Every table suddenly looked like it had roughly three times more data to ingest within an unchanged job frequency; throughput was tripled.

For Reportify, we leveraged our load testing tools to simulate some truly haunting throughput scenarios, issuing an increasingly extreme volume of queries, as seen here:

A line graph showing streaming service queries per second by source. The graph shows increase in the volume of queries over time during a load test.
Streaming service queries per second metric after the load test

In this graph, the doom is shaded purple. 

Load testing answers a few questions immediately, among others:

  • Do infrastructure constraints affect service uptime? 
  • Does the service’s underlying code gracefully handle memory/CPU constraints?
  • Are the raised service alarms expected?
  • Do you know what to do in the event of every fired alarm?

If any of the answers to these questions leave us unsatisfied, the reliability roadmap writes itself: we need to engineer our way into satisfactory answers to those questions. That leads us to the next step. 

4. Confirm Mitigation Strategies Are Up-to-Date

A service’s reliability depends on the speed at which it can recover from interruption. Whether that recovery is performed by a machine or human doesn’t matter when your CTO is staring at a service’s reliability metrics! After deliberately constraining resources, the operations channel turns into a (controlled) hellscape and it's time to act as if it were a real production incident.

Talking about mitigation strategy could be a blog post on its own, but here are the tenets we found most important:

  1. Every alert must be directly actionable. Just saying “the curtains are on fire!” without mentioning “put it out with the extinguisher!” amounts to noise.
  2. Assume that mitigation instructions will be read by someone broken out of a deep sleep. Simple instructions are carried out the fastest.
  3. If there is any ambiguity or unexpected behavior during controlled load tests, you’ve identified new reliability risks. Your service is less reliable than you expected. For Tier 1 services, that means everything else drops and those risks should be addressed immediately.
  4. Plan another controlled load test and ensure you’re confident in your recovery.
  5. Always over-communicate, even if acting alone. Other engineers will devote their brain power to your struggle.

5. Turn the Knobs Back

Now that we know what can happen with an overburdened infrastructure, we can make an informed decision whether the service carries real throughput risk. If we absolutely hammered the service and it skipped along smiling without risking its primary objective, we can leave it alone (or even scale down, which will have the CFO smiling too).

If we don’t feel confident in our ability to recover, we’ve unearthed new risks. The service’s development team can use this information to plan resiliency projects, and we can collectively scale our infrastructure to minimize throughput risk in the interim.

In general, to be prepared infrastructure-wise to cover our capacity, we perform capacity planning. You can learn more about Shopify’s BFCM capacity planning efforts on the blog.

Overall, we concluded from our results that:

  • Our mitigation strategy for Longboat and Reportify was healthy, needing gentle tweaks to our load-balancing maneuvers.
  • We should scale up our clusters to handle the increased load, not only from shoppers, but also from some of our own fun stuff like the BFCM Live Map.
  • We needed to tune our systems to make sure our merchants could track their online store’s performance in real-time through the Live View in the analytics section of their admin.
  • Some jobs could use some tuning, and some of their internal queries could use optimization.

Most importantly, we refreshed our understanding of data service reliability. Ideally, it’s not any more exciting than that. Boring reliability studies are best.

We hope to perform these exercises more regularly in the future, so BFCM preparation isn’t particularly interesting. In this post we talked about throughput risk as one example, but there are other risks to data integrity, correctness, latency. We aim to get out in front of them too because data grows faster than engineering teams do. “Trillions of records every month” turns into “quadrillions” faster than you expect.

So, How’d It Go?

After months of rigorous preparation systematically improving our indices, schemas, query engines, infrastructure, dashboards, playbooks, SLOs, incident handling and alerts, we can proudly say BFCM 2020 went off without a hitch!

During the big moment we traced down every spike, kept our eyes glued to utilization graphs, and turned knobs from time to time, just to keep the margins fat. There were only a handful of minor incidents that didn’t impact merchants, buyers or internal teams - mainly self healing cases thanks to the nature of our platform and our spare capacity.

This success doesn’t happen by accident, it happens because of diligent planning, experience, curiosity and—most importantly—teamwork.

Arbab is a seven-year veteran at Shopify serving as Reliability Engineering lead. He's previously helped launch Shopify payments, some of the first Shopify public APIs, and Shopify's Retail offerings before joining the Data Platform. 99% of Shopifolk joined after him!
Bruno is a DPE TPM working with the Site Reliability Engineering team. He has a record of 100% successful BFCMs under his belt and plans to keep it that way.

Interested in helping us scale and tackle interesting problems? We’re planning to double our engineering team in 2021 by hiring 2,021 new technical roles. Learn more here!

Continue reading

The State of Ruby Static Typing at Shopify

The State of Ruby Static Typing at Shopify

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything works properly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

Shipit! Presents: The State of Ruby Static Typing at Shopify

On November 25, 2020, Shipit!, our monthly event series, presented The State of Ruby Static Typing at Shopify. Alexandre Terrasa and I talked about the history of static typing at Shopify and our adoption of Sorbet.

We weren't able to answer all the questions during the event, so we've included answers to them below.

What are some challenges with adopting Sorbet? What was some code you could not type?

So far most of our problems are in modules that use ActiveSupport::Concern (many layers deep, even) and modules that assume they will be included in a certain kind of class, but have no way of making that explicit. For example, a module that assumes it will be included in an ActiveRecord model could be calling before_save to add a hook, but Sorbet would have no idea where before_save is defined. We are also looking to make those kinds of dependencies between modules and include sites explicit in Sorbet.

Inclusion requirements is also something we’re trying to fix right now, mostly for our helpers. The problem is explained in the description of this pull-request:

If a method has an array in argument, do you have to specify it is an array of what type? And if not, how do Sorbet makes the method you call on the array's element exists?

It depends on the type of elements inside the array. For simple types like Integer or Foo, you can easily type it as T::Array[Integer] and Sorbet will be able to type check method calls. For more complex types like arrays containing hashes it depends, you may use T::Array[T.untyped] in which case Sorbet won’t be able to check the calls. Using T.untyped you can go as deep and precise you want it to be: T::Array[T::Hash[String, T.untyped]], T::Array[T::Hash[String, T::Array[T.untyped]]], T::Array[T::Hash[String, T::Array[Integer]]] and Sorbet will check the calls on what it knows about. Note that as your type becomes more and more complex, maybe you should start thinking about making a class about it so you can just use T::Array[MyNewClass].

How would you compare the benefits of Sorbet to Ruby relative to the benefits of Typescript to Javascript?

There are similar benefits, but Ruby is a much more dynamic language than JavaScript and Sorbet is a much younger project than TypeScript. So the coverage of Ruby features and expressiveness of the type system of Sorbet lags behind the same benefits that TypeScript brings to JavaScript.On the other hand, Sorbet annotations are pure Ruby. That means you don’t have to learn a new language and you can keep using your existing editors and tooling to work with it. There is also no compilation of Ruby code with types to plain Ruby, like how you need to compile TypeScript to JavaScript. Finally, Sorbet also has a runtime type-checker and it can verify your types and alert you if they don’t check when your application is running, which is a great additional safety that TypeScript does not have.

Could you quickly relate Sorbet with RBS and what is the future of sorbet after Ruby 3.0?

Stripe gave an interesting answer to this question: RBS is about the language to write the signatures, you still need a type checker to check those signatures against your code. We see Sorbet as one of the solution that can use those types, and currently it’s the fastest solution. One limitation of RBS is the lack of inline type annotations, for example there is no syntax to cast a variable to another type. So type checkers have to use additional syntax to make this possible. Even if Sorbet doesn’t support RBS at the moment, it might in the future. And in the case it never happens, remember that it’s easier to go from one type specification to another rather than an untyped codebase to a typed one. So all the efforts are not lost.

Does Tapioca support Enums etc?

Tapioca is able to generate RBI files for T::Enum definitions coming from gems. It can also generate method definitions for the ActiveRecord enums as DSL generators.

In which scenarios would you NOT use Sorbet?

I guess the one scenario where using Sorbet would be counterproductive is if there is no team buy-in for adopting Sorbet. If I were on such a team and I couldn’t convince the rest of the team of the utility of it, I would not push to use Sorbet.Other than that, I can only think of a code base that has a LOT of metaprogramming idioms to be a bad target for using Sorbet. You would still get some benefits from even running Sorbet at typed: false but it might not be worth the effort.

What editors do you personally use? Any standardization across the organization?

We have not standardized on a single editor across the company and we probably will not do so, since we believe in developers’ freedom to use the tools that make them the most productive. However, we also cannot build tooling for all editors, either. So, most of our developer acceleration team builds tooling primarily for VSCode, today. Ufuk personally uses VSCode and Alex uses VIM.

Is there a roadmap for RBI -> RBS conversion for when Ruby 3.0 comes out?

No official roadmap yet, we’re still experimenting with this on the side of our main project: 100% typed: true files in our monolith. We can already say that some features from RBS will not directly translate to RBI and vice versa. You can see a comparison of both specifications here: (might not be completely up-to-date with the latest version of RBS).

What are the major challenges you had or are having with Ruby GraphQL libraries?

Our team tried to marry the GraphQL typing system and the Sorbet types using RBI generation but we got stuck in some very dynamic usages of GraphQL resolvers, so we paused that work for now. On the other hand, there are teams within Shopify who have been using Sorbet and GraphQL together by changing the way they write GraphQL endpoints. You can read more about the technical details of that from the blog post of one of the Shopify engineers that has worked on that:

What would be the first step in getting started with typing in a Rails project? What are kind of files that should be checked in to a repo?

The fastest way to start is to use the steps listed on the Sorbet site to start running with Sorbet. After doing that, you can take a look at using sorbet-rails to generate Rails RBI files for you, or you can look at tapioca to generate gem RBIs. Since you can go gradual it’s totally up to you.

Our advice would be to first target all files at typed: false. If you use tapioca, the price is really low and already brings a lot of benefit. Then try to move the files to type: true where it does not create new type errors (you can use spoom for that:

When it comes to adding signatures, prefer the files that are the most reused or, if you track the errors from production, going first with the files that create the most errors might be a good choice. Files touched by many teams are also an interesting target as signatures make collaboration easier. Files with a lot of churn. Or files defining methods reused a lot across your codebase.

As for the files that you need to check-in to a repository, the best practice is to check-in all the files (mostly RBI files) generated by Sorbet and/or tapioca/sorbet-typed. Those files enable the code to be type checked, so should be available to all the developers that work on the code base.

Learn More About Ruby Static Typing at Shopify

Additional Information

Open Source

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Organizing 2000 Developers for BFCM in a Remote World

Organizing 2000 Developers for BFCM in a Remote World

Shopify is an all-in-one commerce platform that serves over 1M+ merchants in approximately 175 countries across the world. Many of our merchants prepare months in advance for their biggest shopping season of the year, and they trust us to help them get through it successfully. As our merchants grow and their numbers increase, we must scale our platform without compromising on our stability, performance, and quality.  

With Black Friday and Cyber Monday (BFCM) being the two biggest shopping events of the year and with other events on the horizon, there is a lot of preparation that Shopify needs to do on our platform. This effort needs a key driver to set expectations for many teams and hold them accountable to complete the work for their area in the platform. 

Lisa Vanderschuit getting hyped about making commerce better for everyone
Lisa Vanderschuit getting hyped about making commerce better for everyone

I’m an Engineering Program Manager (EPM) with a focus on platform quality and was one of the main program managers (PgM) tapped midway in the year to drive these efforts. For this initiative I worked with three Production Engineering leads (BFCM leads) and three other program managers (with a respective focus in resiliency, scale, and capacity) to:

  • understand opportunities for improvement
  • build out a program that’s effective at scale
  • create adjustments to the workflow specifically for BFCM
  • execute the program
  • start iterating on the program for next year.

Understanding our Opportunities

Each year, the BFCM leads start a large cross company push to get the platform ready for BFCM. They ask the teams responsible for critical areas of the platform to complete the following prep:

Looking at the past years, the BFCM leads chosen to champion this in spend a significant time on administrative, communication, and reporting activities when their time is better spent in the weeds of the problems. Our PgM group was assigned to take on these responsibilities so that these leads could focus on investigating the technical challenges and escalations.

Before jumping into solutions, our PgM group looked into the past to find lessons to inform the future. In looking at past retrospective documents we found some common themes over the years that we needed to keep in mind as we put together our plan:

  • Shopify needs to prepare in advance for supporting large merchants with lots of popular inventory to sell. 
  • Scaling trends weren’t just on the two main days. Sales were spreading out through the week, and there were pre sale and post sale workflows where we needed to be well tested for how much load we could sustain without performance issues. 
  • There were some parts of the platform tied to disruptions in the past that would require additional load testing to give us more confidence in their stability. 

With Shopify moving to Digital by Default and the increasing number of timezones to consider, there were more complexities to getting the company aligned to the same goals and schedule. Our PgM group wanted to create structure around coordinating a large scale effort, but we also wanted to start thinking about how maintenance and prep work can be done throughout the year, so we’re ready for any large shopping event regardless of the time of year. 

Building the Program Plan

Our PgM group listed all the platform preparation tasks the BFCM leads asked developers to do in the past. Then we highlighted items that had to happen this year and took note of when they needed to happen. After this, we asked the BFCM leads to highlight the important things critical for their participation and then we assigned the rest of the work for our PgM group to manage. 

Example of our communication plan calendar
Example of our communication plan calendar

Once we had those details documented, we created a communication plan calendar (a.k.a spreadsheet) to see what was next, week over week. We split the PgM work into workstreams then we each selected ones respective to our areas of focus. In my workstream I had two main responsibilities:

  • Put together a plan to get people assigned to do the platform preparation work that the BFCM leads wanted them to. 
  • Determine what kind of PRs should or should not be shipped to production in the month before and after BFCM.

For platform preparation work listed earlier, I asked teams to identify which areas of the platform that need prepping for the large shopping event. Even with a reduced set of areas to focus on there were still quite a bit of people that I would need to get this prep work assigned to. Instead of working directly with every single person, I used a distributed ownership model. I asked each GM or VP with critical areas to assign a champion from their department to work with. Then I reached out to the champions to let them know of the prep work that needed to be done. They then either assigned the work themselves or they assigned people from their team. To keep track of this ownership I built a tracking spreadsheet and set up a schedule to report on progress week over week.

In the past, our Deploys team would lock the ability to automatically deploy to production for a week. Since BFCM is becoming more spread out year after year, we realized we needed to adjust our culture around shipping in the last two months of the year to make sure we could confidently provide merchants with a resilient platform. Merchants were also needing to train up staff further in advance of the year so we also had to consider slowing down new features that could require extra training for their staff. To start tackling these challenges I asked:

  • Teams to take inventory of all of our platform areas and highlight which areas were considered critical for the merchant experience. 
  • That we set up a rule in a bot we call Caution Tape to comment a thorough risk to value assessment on any new PRs created between November to December in repos that had been flagged as critical to successful large shopping events. 

If the PRs were proposing a merchant facing feature the Caution Tape bot message asked that they document the risks vs the value to shipping around BFCM and that they only ship if approved by a director or GM in their area. In many cases the people creating these PRs either investigated a safer approach, got more thorough reviews, or decided to wait until next year to launch the feature. 

On the week of Black Friday 60% of the work in GitHub was code reviews
On the week of Black Friday 60% of the work in GitHub was code reviews

To artificially slow down the rate of items being shipped to production we planned to reduce the amount of PRs that could be shipped in a deploy and increase the amount of time they would spend in canaries (pre-production deploy test). On top of this we also planned to lock deploys for a week around the BFCM weekend. 

Executing the Program

How do you rally 2500+ people around a mission who are responsible for 1000+ deploys across all services? You state your expectations and then repeat many times in different ways. 

1. Our PGM and BFCM lead group had two main communication options where we started engagement with the rest of the engineering group working on platform prep:

  • Shared Slack channels for discussions, questions, and updates.
  • GitHub repos for assigning work.

2. Our PGM group put together and shared internal documentation on the program details to make it easier to onboard participants to the program.

3. Our PGM group shared high-level announcements, reminders and presentations throughout the year, increasing in frequency leading up to BFCM, to increase awareness and engagement. Some examples of this were:

  • progress reports on each department posted in Slack. 
  • live-Streamed and recorded presentations on our internal broadcasting channel to inform the company about our mission and where to go for help.
  • emails sent to targeted groups to remind people of their respective responsibilities and deadlines.
  • GitHub issues created and assigned to teams with a checklist of the prep work we had asked them to do.

To make sure our BFCM leads had the support they needed our PgM group had regular check-in meetings with them to get a pulse on how they were feeling things were going. To make sure our PgM group was on top of the allocated tasks each week we had meetings at the start and end of each week. Then we hosted office hours along with the BFCM leads for any developers that wanted facetime to flag any potential concerns about their area.

Celebrations and Lessons Learned

Overall I’d say our program was a success. We had a very successful BFCM with sales of $5.1+ billion from the more than one million Shopify-powered brands around the world. We found that our predictions for which areas would take the most load were on target and that the load testing and spinning up of resources paid off. 

A photo of 3 women and 2 men celebrating. Gold confetti showers down on them.
Celebrating BFCM

From our internal developer view we had success in the sense that shipping to the platform didn’t need to come to a full stop. PR reviews were at an all time high which meant that developers focus on quality was at an all time high. For the areas where we did have to slow down on shipping code for features we found that our developers had more time to work on the other important aspects of work that needs to be done in engineering. Teams were able to

  • focus more on clean up tasks
  • write blog posts
  • put together strategic roadmaps and architecture design docs
  • plan team building exercises. 

Overall we still did take a hit in developer productivity and we could have been a bit more relaxed on how long we enforced the extra risk to value assessment on our PRs and expectations on deploying to production. Our PGM team hopes to find a more balanced approach for this for next year's plan. 

From a communication standpoint, some of the messaging to developers was inconsistent on whether or not they could ship to critical areas of the platform during November and December. Our PGM group also ended up putting together some of the announcement drafts last minute so in future years we want to have this included in our communication plan from the start with templates ready to go. 

Our PGM group is hoping to have a retrospective meeting later this year with the BFCM leads to see how we can adjust the program plan for next year. We will be taking everything we learned and find opportunities where we can automate some of the work or distribute the work throughout the year so we can be always ready for any large shopping event in the year. 

If you have a large initiative at your company, consider creating a role for people technical enough to be dangerous that can help drive engineering initiatives forward and work with your top developers to maximise their time and expertise to solve the big complex problems and get shit done.

Lisa Vanderschuit is an Engineering Program Manager who manages the engineering theme of Code Quality. She has been at Shopify for 6 years, working on areas from editing and reviewing Online Store themes to helping our engineering teams raise the bar of code quality at Shopify.

How does your team leverage program managers at your company? What advice do you have for coordinating cross company engineering initiatives? We want to hear from you on Twitter at @ShopifyEng.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

A World Rendered Beautifully: The Making of the BFCM 3D Data Visualization

A World Rendered Beautifully: The Making of the BFCM 3D Data Visualization

By Mikko Haapoja and Stephan Leroux

2020 Black Friday Cyber Monday (BFCM) is over, and another BFCM Globe has shipped. We’re extremely proud of the globe, it focused on realism, performance, and the impact our merchants have on the world.

The Black Friday Cyber Monday Live Map

We knew we had a tall task in front of us this year, building something that could represent orders from our one million merchants in just two months. Not only that, we wanted to ship a data visualization for our merchants so they could have a similar experience to the BFCM globe every day in their Live View.

Prototypes for the 2020 BFCM Globe and Live View. **

With tight timelines and an ambitious initiative, we immediately jumped into prototypes with three.js and planned our architecture.

Working with a Layer Architecture

As we planned this project, we converged architecturally on the idea of layers. Each layer is similar to a React component where state is minimally shared with the rest of the application, and each layer encapsulates its own functionality. This allowed for code reuse and flexibility to build both the Live View Globe, BFCM Globe, and beyond.

A showcase of layers for the 2020 BFCM Globe. **

When realism is key, it’s always best to lean on fantastic artists, and that’s where Byron Delgado came in. We hoped that Byron would be able to use the 3D modeling tools he’s used to, and then we would incorporate his 3D models into our experience. This is where the EarthRealistic layer comes in.

EarthRealistic layer from the 2020 BFCM Globe. **

EarthRealistic uses a technique called physically based rendering, which most modern 3D modeling software supports. In three.js, physically based rendering is implemented via the MeshPhysicalMaterial or MeshStandardMaterial materials.

To achieve realistic lighting, EarthRealistic is lit by a 32bit EXR Environment Map. By using a 32bit EXR, it means we can have smooth image based lighting. Image based lighting is a technique where a “360 sphere” is created around the 3D scene, and pixels in that image are used to calculate how bright Triangles on 3D models should be. This allows for complex lighting setups without much effort from an artist. Traditionally images on the web such as JPGs and PNGs have a color depth of 8bits. If we were to use these formats and 8bit color depth, our globe lighting would have had horrible gradient banding, missing realism entirely.

Rendering and Lighting the Carbon Offset Visualization

Once we converged on physically based rendering and image based lighting, building the carbon offset layer became clearer. Literally!

Carbon Offset visualization layer from the 2020 BFCM Globe. **

Bubbles have an interesting phenomenon where they can be almost opaque at a certain angle and light intensity but in other areas completely transparent. To achieve this look, we created a custom material based on MeshStandardMaterial that reads in an Environment Map and simulates the bubble lighting phenomenon. The following is the easiest way to achieve this with three.js:

  1. Create a custom Material class that extends off of MeshStandardMaterial.
  2. Write a custom Vertex or Fragment Shader and define any Uniforms for that Shader Program.
  3. Override onBeforeCompile(shader: Shader, _renderer: WebGLRenderer): void on your custom Material and pass the custom Vertex or Fragment Shader and uniforms via the Shader instance.

Here’s our implementation of the above for the Carbon Offset Shield Material:

Let’s look at the above, starting with our Fragment shader. In shield.frag lines 94-97

These two lines are all that are needed to achieve a bubble effect in a fragment shader.

To calculate the brightness of an rgb pixel, you calculate the length or magnitude of the pixel using the GLSL length function. In three.js shaders, outgoingLight is an RGB vec3 representing the outgoing light or pixel to be rendered.

If you remember from earlier, the bubble’s brightness determines how transparent or opaque it should appear.  After calculating brightness, we can set the outgoing pixel’s alpha based on the brightness calculation. Here we use the GLSL mix function to go between the expected alpha of the pixel defined by diffuseColor.a and a new custom uniform defined as maxOpacity. By having the concept of min or expected opacity and max opacity, Byron and other artists can tweak visuals to their exact liking.

If you look at our shield.frag file, it may seem daunting! What on earth is all of this code?  three.js materials handle a lot of functionality, so it’s best to make small additions and not modify existing code. three.js materials all have their own shaders defined in the ShaderLib folder. To extend a three.js material, you can grab the original material shader code from the src/renderers/shaders/ShaderLib/ folder in the three.js repo and perform any custom calculations before setting gl_FragColor. An easier option to access three.js shader code is to simply console.log the shader.fragmentShader or shader.vertexShader strings, which are exposed in the onBeforeCompile function:

onBeforeCompile runs immediately before the Shader Program is created on the GPU. Here you can override shaders and uniforms. CustomMeshStandardMaterial.ts is an abstraction we wrote to make creating custom materials easier. It overrides the onBeforeCompile function and manages uniforms while your application runs via the setCustomUniform and getCustomUniform functions. You can see this in action in our custom Shield Material when getting and setting maxOpacity:

Using Particles to Display Orders

Displaying orders on Shopify from across the world using particles. **

One of the BFCM globe’s main features is the ability to view orders happening in real-time from our merchants and their buyers worldwide. Given Shopify’s scale and amount of orders happening during BFCM, it’s challenging to visually represent all of the orders happening at any given time. We wanted to find a way to showcase the sheer volume of orders our merchants receive over this time in both a visually compelling and performant way. 

In the past, we used visual “arcs” to display the connection between a buyer’s and a merchant’s location.

The BFCM Globe from 2018 showing orders using visual arcs.
The BFCM Globe from 2018 showing orders using visual arcs.

With thousands of orders happening every minute, using arcs alone to represent every order quickly became a visual mess along with a heavy decrease in framerate. One solution was to cap the number of arcs we display, but this would only allow us to display a small fraction of the orders we were processing. Instead, we investigated using a particle-based solution to help fill the gap.

With particles, we wanted to see if we could:

  • Handle thousands of orders at any given time on screen.
  • Maintain 60 frames per second on low-end devices.
  • Have the ability to customize style and animations per order, such as visualizing local and international orders.

From the start, we figured that rendering geometry per an order wouldn't scale well if we wanted to have thousands of orders on screen. Particles appear on the globe as highlights, so they don’t necessarily need to have a 3D perspective. Rather than using triangles for each particle, we began our investigation using three.js Points as a start, which allowed us to draw using dots instead. Next, we needed an efficient way to store data for each particle we wanted to render. Using BufferGeometry, we assigned custom attributes that contained all the information we needed for each particle/order.

To render the points and make use of our attributes, we created a ShaderMaterial, and custom vertex and fragment shaders. Most of the magic for rendering and animating the particles happens inside the vertex shader. Each particle defined in the attributes we pass to our BufferGeometry goes through a series of steps and transformations.

First, each particle has a starting and ending location described using latitude and longitude. Since we want the particle to travel along the surface and not through it, we use a geo interpolation function on our coordinates to find a path that goes along the surface.

A photo of a globe with an order represented as a particle traveling from New York City to London. The vertex shader uses each location’s latitude and longitude and determines the path it needs to travel.
An order represented as a particle traveling from New York City to London. The vertex shader uses each location’s latitude and longitude and determines the path it needs to travel. **

Next, to give the particle height along its path, we use high school geometry, a parabola equation based on time to alter the straight path to a curve.

A photo of a globe with particles that follow a curved path away from the earth’s surface using a parabola equation to determine its height.
Particles follow a curved path away from the earth’s surface using a parabola equation to determine its height. **

To render the particle to make it look 3D in its travels, we combine our height and projected path data then convert it to a vector position our shader uses as it’s gl_Position. With our particle now knowing where it needs to go, using a time uniform, we drive animations for other changes such as size and color. At the end of the vertex shader, we pass the position and point size to render onto the fragment shader that combines the calculated color and alpha at the time for each particle.

Once the vertex shader is complete, the vertex shader passes position and point size onto the fragment shader that combines the animated color and alpha for each particle.

Given that we wanted to support updating and animating thousands of particles at any moment, we wanted to be careful about how we access and update our attributes. For example, if we had 10000 particles in transit, we need to continue updating those and other data points that are coming in. Instead of updating all of our attributes every time, which can be processor-intensive, we made use of BufferAttribute’s updateRange to update a subset of the attributes we needed to change on each frame instead of the entire attribute set.

gl_Points enables us to render 150,000 particles flying around the globe at any given time without performance issues. **

Combining all of the above, we saw upwards of 150,000 particles animating to and from locations on the globe without noticing any performance degradation.

Optimizing Performance

In video games, you may have seen settings for different quality levels. These settings modify the render quality of the application. Most modern games will automatically scale performance. Most aggressively, the application may reduce texture quality or how many vertices are rendered per 3D object.

With the amount of development time we had for this project, we simply didn’t have time to be this aggressive. Yet, we still had to support old, low-power devices such as dated mobile phones. Here’s how we implemented an auto optimizer that could increase an iPhone 7+ render performance from 40 frames per second (fps) to a cool 60fps.

If your application isn’t performing well, you might see a graph like this:

Graph depicting the Globe application running at 40 frames per second on a low power device
Graph depicting the Globe application running at 40 frames per second on a low power device

Ideally, in modern applications, your application should be running at 60fps or more. You can also use this metric to determine when you should lower the quality of your application. Our initial implementation plan was to keep it simple and make every device with a low-resolution display run in low quality. However, this would mean new phones with low-resolution displays and extremely capable GPUs would receive a low-quality experience. Our final attempt monitors fps. If it’s lower than 55fps for over 2 seconds, we decrease the application’s quality. This adjustment allows phones such as the new iPhone 12 Pro Max to run in the highest quality possible while an iPhone 7+ can render at lower quality but consistent high framerate. Decreasing the quality of an application by reducing buffer sizes is optimal. However, in our aggressive timeline, this would have created many bugs and overall application instability.

Left side of the image depicts the application running in High-Quality mode, where the right side of the image depicts the application running in Low-Quality mode
Left side of the image depicts the application running in High-Quality mode, where the right side of the image depicts the application running in Low-Quality mode. **

What we opted for instead was simple and likely more effective. When our application retains a low frame rate, we simply reduce the size of the <canvas> HTML element, which means we’re rendering fewer pixels. After this, WebGL has to do far less work, in most cases, 2x or 3x less work. When our WebGLRenderer is created, we setPixelRatio based on window.devicePixelRatio. When we’ve retained a low frame rate, we simply drop the canvas pixel ratio back down to 1x. The visual differences are nominal and mainly noticeable in edge aliasing. This technique is simple but effective. We also reduce the resolution of our Environment Maps generated by PMREMGenerator, but most applications will be able to utilize the devicePixelRatio drop more effectively.

If you’re curious, this is what our graph looks like after the Auto Optimizer kicks in (red circled area)

Graph depicting the Globe application running at 60 frames per second on a low power device with a circle indicating when the application quality was reduced.
Graph depicting the Globe application running at 60 frames per second on a low power device with a circle indicating when the application quality was reduced

Globe 2021

We hope you enjoyed this behind the scenes look at the 2020 BFCM Globe and learned some tips and tricks along the way. We believe that by shipping two globes in a short amount of time, we were able to focus on the things that mattered most while still keeping a high degree of quality. However, the best part of all of this is that our globe implementation now lives on as a library internally that we can use to ship future globes. Onward to 2021!

*All data is unaudited and is subject to adjustment.
**Made with Natural Earth; textures from Visible Earth NASA

Mikko Haapoja is a development manager from Toronto. At Shopify he focuses on 3D, Augmented Reality, and Virtual Reality. On a sunny day if you’re in the beaches area you might see him flying around on his OneWheel or paddleboarding on the lake.

Stephan Leroux is a Staff Developer on Shopify's AR/VR team investigating the intersection of commerce and 3D. He has been at Shopify for 3 years working on bringing 3D experiences to the platform through product and prototypes.

Additional Information



We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM, and commerce isn't slowing down. Help us scale & make commerce better for everyone

Continue reading

Capacity Planning at Scale

Capacity Planning at Scale

By Kathryn Tang and Kir Shatrov

The fourth Thursday in November is Thanksgiving in the United States. The day after, Black Friday (coined in 1961), is the first day of the Christmas shopping season and since 2005 it’s the busiest shopping day of the year in North America. Cyber Monday is a more recent development. Getting its name in 2005, it refers to the Monday after the Thanksgiving weekend where retailers focus on sales offered online. At Shopify, we call the weekend including Black Friday and Cyber Monday BFCM.

From the engineering team’s point of view, every BFCM challenges the platform and all the things we’ve shipped throughout the year:

  • Would our clusters handle two times the number of virtual machines? 
  • Would we hit some sort of limitation on the new network design? 
  • Would the new logging pipeline handle such an increase in traffic? 
  • What’s going to be the next scalability bottleneck that we hit?

The other challenge is planning the capacity. We need to understand the magnitude of traffic ahead of us, and how many resources like CPUs and storage we’ll need to handle BFCM sales. On top of that, we need to have enough room in case of something unexpected, and we need to perform a regional failover.

Since 2017, we’ve partnered with Google Cloud Platform (GCP) as our main vendor for the cloud. Over these years, we’ve worked closely with their team on our capacity models, and prior to every BFCM that collaboration gets even closer.

In this post, we’ll cover our approaches to capacity planning, and how we rolled it out across the org and to dozens of teams. We’ll also share how we validated our capacity plans with scalability tests to make sure they work.

Capacity Planning 

Our Google Cloud resource needs depend on how much traffic our merchants see during BFCM. We worked with our data scientists to forecast traffic levels and set those levels as a bar for our platform to scale to. Additionally, we looked into historical numbers, applied a safety margin, and projected how many buyers would check out or view online stores.

A list of GCP projects for resource planning.  The list includes items like memcache, Kafka, MySQL, etc
A list of GCP projects for resource planning.

We created a master resourcing plan for our Google Cloud implementation and estimated how things like CPUs and storage would scale to BFCM traffic levels. Owners for our top 10 or so resource areas were tasked to estimate what they needed for BFCM. These estimates were detailed breakdowns of the machine types, geographic locations, and quantities of resources like CPUs. We also added buffers to our overall estimates to allow flexibility to change our resourcing needs, move machines across projects, or failover traffic to different regions if we needed to. What also helps is that we partition each component into a separate GCP project, which makes it a lot easier to think of quotas per every project.

A line graph showing the BFCM traffic forecasts over time. 4 different scenarios are shown in blue, red, yellow and green. The black line shows existing traffic patterns
A line graph showing the BFCM traffic forecasts over time

2020 is an exceptionally difficult year to plan for. Normally, we’d look at BFCM trends from years prior and predict BFCM traffic with a fairly high level of confidence. This year, COVID-19 lockdowns drove a rapid shift to selling online this spring, and we didn't know what to expect. Would we see a massive increase in online traffic this BFCM, or a global economic depression where consumers stopped buying much at all? To manage heightened uncertainty, we forecasted multiple scenarios and their respective needs for our cloud deployment.

From an investment perspective, planning for the largest scale scenario means spending a lot of money very quickly to handle sales that might not happen. Alternatively, not deploying enough machines means having too little computing power and putting our merchant storefronts at risk of outages. It was absolutely vital to avoid anything that would put our merchants at risk of downtime. We decided to scale to our more aggressive growth scenarios to ensure our platform is stable regardless of what happens. We’re transparent with our partners, finance teams, and internal teams about how we thought through these scenarios which helps them make their own operating decisions.

Scalability Testing

A sheet with a capacity plan is just a starting point. Once we start scaling to projected numbers, there’s a high chance that we’ll hit limits throughout our tech stack that need resolving. In a complex system, there’s always a limit like:

  • the number of VMs in a network
  • the number of packets that a busy Memcached server can accept 
  • the number of MB/s your logging pipeline can handle.

Historically, every BFCM brought us some scalability surprises, and what’s worse, we’d only notice them when fully scaled prior to BFCM. That left too little time to come up with mitigation plans.

Back in 2018, we decided that a “faux” BFCM in the middle of the year would increase our resilience as an organization and push us to find unknowns that we’d otherwise only discover during the real thing. As we started doing that, it allowed us to find problems at scale more often and created that mental muscle of preparing for critical events and finding unknowns. If you’re exercising and something feels hard, you train more and eventually your muscles get better. Shopify treats BFCM the same way.

We’ve started the practice of regular scale-up testing at Shopify, and of course we made sure to come up with fun names for each. We’ve had Mayday (2019), Spooky scale-up (2019), and Oktoberfest scale-up (2020). Another fun fact is that our Waterloo teams play a large part in running this testing, and the dates of our Oktoberfest matched the city of Kitchener-Waterloo’s Oktoberfest festivities (It’s the second-largest Oktoberfest in the world).

Oktoberfest scale-up’s goal was to simulate this year’s expected BFCM load based on the traffic forecasts from the data science team. And the fact that we run Shopify in cloud on Google Kubernetes Engine allowed us to grab extra compute capacity just for the window of the exercise, and only pay for those hours when we needed it.

Investment in our internal load testing tooling over the years is fundamental to our ability to run such large scale, platform-wide load tests. We’ve talked about go-lua, an open source project that powers our load testing tool. Thanks to embedded Lua, we feed it with a high-level set of steps for what we want to test: actions like browsing the storefront, adding a product to a card, proceeding to check out, and processing the transaction through a mock payment gateway.

Thanks to Oktoberfest scale-up, we identified and then fixed some bottlenecks that could have become an issue for the real BFCM. Doing the test in early October gave us time to address issues.

After addressing all the issues, we repeated the scale-up test to see how our mitigations helped. Seeing that going smoothly increased our confidence levels about the upcoming Black Friday and reduced stress levels for all teams.

We strive for a smooth BFCM and spend a lot of time preparing for it, from capacity planning, to setting the expectations for our vendors, to load testing, and failover simulations. Beyond delivering a smooth holiday season for our merchants, BFCM is time to reflect on the future. As Shopify continues to grow, BFCM traffic levels can become the normal everyday loads we see in the next year. Our job is to bring lessons from events like BFCM to make our systems even more automated, more dynamic, and more resilient. We relish this opportunity to think about where Shopify is going and to architect our platform to scale with it.

Kir Shatrov is an Engineering Lead who’s been with Production Engineering at Shopify for the past five years, working on areas from CI/CD infrastructure to sharding and capacity planning.

Kathryn Tang is an Engineering Program Manager who manages our Google Cloud relationship. She has been at Shopify for 4 years, working with a multitude of R&D and commercial teams to derive business insights and guide operating decisions to help us scale.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone

Continue reading

Pummelling the Platform–Performance Testing Shopify

Pummelling the Platform–Performance Testing Shopify

Developing a product or service at Shopify requires care and consideration. When we deploy new code at Shopify, it’s immediately available for merchants and their customers. When over 1 million merchants rely on Shopify for a successful Black Friday Cyber Monday (BFCM), it’s extremely important that all merchants—big and small—can run sales events without any surprises.

We typically classify a “sales event” as any single event that attracts a high amount of buyers to Shopify merchants. This could be a product launch, a promotion, or an event like BFCM. In fact, BFCM 2020 was the largest single sales event that Shopify has ever seen, and many of the largest merchants on the planet also saw some of the biggest flash sales ever observed before on Earth.

In order to ensure that all sales are successful, we regularly and repeatedly simulate large sales internally before they happen for real. We proactively identify and eliminate issues and bottlenecks in our systems using simulated customer traffic on representative test shops. We do this dozens of times per day using a few tools and internal processes.

I’ll give you some insight into the tools we use to raise confidence in our ability to serve large sales events. I’ll also cover our experimentation and regression framework we built to ensure that we’re getting better, week-over-week, at handling load.

We use “performance testing” as an umbrella term that covers different types of high-traffic testing including (but not limited to) two types of testing that happen regularly at Shopify: “load testing” and “stress testing”. 

Load testing verifies that a service under load can withstand a known level of traffic or specific number of requests. An example load test is when a team wants to confirm that their service can handle 1 million requests per minute for a sustained duration of 15 minutes. The load test will confirm (or disconfirm) this hypothesis.

Stress testing, on the other hand, is when we want to understand the upper limit of a particular service. We do this by increasing the amount of load—sometimes very quicklyto the service being tested until it crumbles under pressure. This gives us a good indication of how far individual services at Shopify can be pushed, in general.

We condition our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like BFCM. Through proactive load tests and stress tests, we have a really good picture of what to expect even before a flash sale kicks off. 

Enabling Performance Testing at Scale

A platform as big and complex as Shopify has many moving parts, and each component needs to be finely tuned and prepared for large sales events. Not unlike a sports car, each individual part needs to be tested under load repeatedly to understand performance and throughput capabilities before assembling all the parts together and taking the entire system for a test drive. 

Individual teams creating services at Shopify are responsible for their own performance testing on the services they build. These teams are best positioned to understand the inner workings of the services they own and potential bottlenecks or situations that may be overwhelmed under extreme load, like during a flash sale. To enable these teams, performance testing needs to be approachable, easy to use and understand, and well-supported across Shopify. The team I lead is called Platform Conditioning, and our mission is to constantly improve the tooling, process, and culture around performance testing at Shopify. Our team makes all aspects of the Shopify platform stronger by simulating large sales events and making high-load events a common and regular occurrence for all developers. Think of Platform Conditioning as the personal trainers of Shopify. It’s Platform Conditioning that can help teams develop individualized workout programs and set goals. We also provide teams with the tools they need in order to become stronger.

Generating Realistic Load

At the heart of all our performance testing, we create “load”. A service at Shopify will add load to to cause stresses that—in the end—make it stronger, by using requests that hit specific endpoints of the app or service.

Not all requests are equal though, and so stress testing and load testing are never as easy as tracking the sheer volume of requests received by a service. It’s wise to hit a variety of realistic endpoints when testing. Some requests may hit CDNs or layers of caching that make the response very lightweight to generate. Other requests, however, can be extremely costly and include multiple database writes, N+1 queries, or other buried treasures. It’s these goodies that we want to find and mitigate up front, before a sales event like BFCM 2020.

For example, a request to a static CSS file is served from a CDN node in 40ms without creating any load to our internal network. Comparatively, making a search query on a shop hits three different layers of caching and queries Redis, MySQL, and Elasticsearch with total round-trip time taking 1.5 seconds or longer.

Another important factor to generating load is considering the shape of the traffic as it appears at our load balancers. A typical flash sale is extremely spiky and can begin with a rush of customers all trying to purchase a limited product simultaneously. It’s very important to simulate this same traffic shape when generating load and to run our tests for the same duration that we would see in the wild.

A flow diagram showing how we generate load with go-lua
A systems diagram showing how we generate load with go-lua

When generating load we use a homegrown, internal tool that generates raw requests to other services and receives responses from them. There are two main pieces to this tool: the first is the coordinator, and the second is the group of workers that generate the load. Our load generator is written in Go and executes small scripts written in Lua called “flows”. Each worker is running a Go binary and uses a very fast and lightweight Lua VM for executing the flows. (The Go-Lua VM that we use is open source and can be found on Github) Through this, the steps of a flow can scale to issue tens of millions of requests per minute or more. This technique stresses (or overwhelms) specific endpoints of Shopify and allows us to conduct formal tests using the generated load.

We use our internal ChatOps tool, ‘Spy’, to enqueue tests directly from Slack, so everyone can quickly see when a load test has kicked off and is running. Spy will take care of issuing a request to the load generator and starting a new test. When a test is complete, some handy links to dashboards, logs, and overall results of the test are posted back in Slack.

Here’s a snippet of a flow, written in Lua, that browses a Shopify storefront and logs into a customer account—simulating a real buyer visiting a Shopify store:

Just like a web browser, when a flow is executing it sends and receives headers, makes requests, receives responses and simulates many browser actions like storing cookies for later requests. However, an executing flow won’t automatically make subsequent requests for assets on a page and can’t execute Javascript returned by the server. So our load tests don’t make any XMLHttpRequest (XHR) requests, Javascript redirects or reloads that can happen in a full web browser.

So our basic load generator is extremely powerful for generating a great deal of load, but in its purest form it only can hit very specific endpoints as defined by the author of a flow. What we create as “browsing sessions” are only a streamlined series of instructions and only include a few specific requests for each page. We want all our performance testing as realistic as possible, simulating real user behaviour in simulated sales events and generating all the same requests that actual browsers make. To accomplish this, we needed to bridge the gap between scripted load generation and realistic functionality provided by real web browsers.

Simulating Reality with HAR-based Load Testing

Our first attempt at simulating real customers and adding realism to our load tests was an excellent idea, but fairly naive when it came to how much computing power it would require. We spent a few weeks exploring browser-based load testing. We researched tools that were already available and created our own using headless browsers and tools like Puppeteer. My team succeeded in making realistic browsing sessions, but unfortunately the overhead of using real browsers dramatically increased both computing costs and real money costs. With browser-based solutions, we could only drive thousands of browsing sessions at a time, and Shopify needs something that can scale to tens of millions of sessions. Browsers provide a lot of functionality, but they come with a lot of overhead. 

After realizing that browser-based load generation didn’t suit our needs, my team pivoted. We were still driving to add more realism to our load tests, and we wanted to make all the same requests that a browser would. If you open up your browser’s Developer Tools, and look at the Network tab while you browse, you see hundreds of requests made on nearly every page you visit. This was the inspiration for how we came up with a way to test using HTTP Archive (HAR) files as a solution to our problems.

An image of chrome developer tools showing a small sample of requests made by a single product page
A small sample of requests made by a single product page

HAR files are detailed JSON representations of all of the network requests and responses made by most popular browsers. You can export HAR files easily from your browser, or web proxy tools like Charles Proxy. A single HAR file includes all of the requests made during a browsing session and are easy to save, examine, and share. We leveraged this concept and created a HAR-based load testing solution. We even gave it a tongue-and-cheek name: Hardy Har Har.

Hardy Har Har (or simply HHH for those who enjoy brevity) bridges the gap between simple, lightweight scripted load tests and full-fledged, browser-based load testing. HHH will take a HAR file as input and extract all of the requests independently, giving the test author the ability to pick and choose which hostnames can be targeted by their load test. For example, we nearly always remove requests to external hostnames like Google Analytics and requests to static assets on CDN endpoints (They only add complexity to our flows and don’t require load testing). The resulting output of HHH is a load testing flow, written in Lua and committed into our load testing repository in Git. Now—literally at the click of a button—we can replay any browsing session in its full completeness. We can watch all the same requests made by our browser, scaled up to millions of sessions.

Of course, there are some aspects of a browsing session that can’t be simply replayed as-is. Things like logging into customer accounts and creating unique checkouts on a Shopify store need dynamic handling that HHH recognizes and intelligently swaps out the static requests and inserts dynamic logic to perform the equivalent functionality. Everything else lives in Lua and can be ripped apart or edited manually giving the author complete control of the behaviour of their load test. 

Taking a Scientific Approach to Performance Testing

The final step to having great performance testing leading up to a sales event is clarity in observations and repeatability of experiments. At Shopify, we ship code frequently, and anyone can deploy changes to production at any point in time. Similarly, anyone can kick off a massive load test from Slack whenever they please. Given the tools we’ve created and the simplicity in using them, it’s in our best interest to ensure that performance testing follows the scientific method for experimentation.

A flow diagram showing the performance testing scientific method of experimentation
Applying the scientific method of experimentation to performance testing

Developers are encouraged to develop a clear hypothesis relating to their product or service, perform a variety of experiments, observe the results of various experiment runs, and formulate a conclusion that relates back to their hypothesis. 

All this formality in process can be a bit of a drag when you’re developing, so the Platform Conditioning team created a framework and tool for load test experimentation called Cronograma. Cronograma is an internal Rails app making it easy for anyone to set up an experiment and track repeated runs of a performance testing experiment. 

Cronograma enforces the formal use of experiments to track both stress tests and load tests. The Experiment model has several attributes, including a hypothesis and one or more orchestrations that are coordinated load tests executed simultaneously in different magnitudes and durations. Also, each experiment has references to the Shopify stores targeted during a test and links to relevant dashboards, tracing, and logs used to make observations.

Once an experiment is defined, it can be run repeatedly. The person running an experiment (the runner) starts the experiment from Slack with a command that creates a new experiment run. Cronograma kicks off the experiment and assigns a dedicated Slack channel for the tests allowing multiple people to participate. During the running of an experiment any number of things could happen including exceptions, elevated traffic levels, and in some cases, actual people may be paged. We want to record all of these things. It’s nice to have all of the details visible in Slack, especially when working with teams that are Digital by Default. Observations can be made by anyone and comments are captured from Slack and added to a timeline for the run. Once the experiment completes, the experiment runner terminates the run and logs a conclusion based on their observations that relates back to the original hypothesis.

We also included additional fanciness in Cronograma. The tool automatically detects whether any important monitors or alerts were triggered during the experiment from internal or third-party data monitoring applications. Whenever an alert is triggered, it is logged in the timeline for the experiment. We also retrieve metrics from our data warehouse automatically and consume these data in Cronograma allowing developers to track observed metrics between runs of the same experiment. For example:

  • the response times of the requests made
  • how many 5xx errors were observed
  • how many requests per minute (RPM) were generated

All of this information is automatically captured so that running an experiment is useful and it can be compared to any other run of the experiment. It’s imperative to understand whether a service is getting better or worse over time.

Cronograma is the home of formal performance testing experiments at Shopify. This application provides a place for all developers to conduct experiments and repeat past experiments. Hypotheses, observations, and conclusions are available for everyone to browse and compare to. All of the tools mentioned here have led to numerous performance improvements and optimizations across the platform, and they give us confidence that we can handle the next major sales event that comes our way.

The Best Things Go Unnoticed 

Our merchants count on Shopify being fast and stable for all their traffic—whether they’re making their first sale, or they’re processing tens of thousands of orders per hour. We prepare for the worst case scenarios by proactively testing the performance of our core product, services, and apps. We expose problems and fix them before they become a reality for our merchants using simulations. By building a culture of load testing across all teams at Shopify, we’re prepared to handle sales events like BFCM and flash sales. My team’s tools make performance testing approachable for every developer at Shopify, and by doing so, we create a stronger platform for all our merchants. It’s easy to go unnoticed when large sales events go smoothly. We quietly rejoice in our efforts and the realization that it’s through strength and conditioning that we make these things possible.


Performance Scale at Shopify

Join Chris as he chats with Anita about how Shopify conditions our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like flash sales and Black Friday Cyber Monday.  

Chris Inch is a technology leader and development manager living in Kitchener, Ontario, Canada. By day, he manages Engineering teams at Shopify, and by night, he can be found diving head first into a variety of hobbies, ranging from beekeeping to music to amateur mycology.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone

Continue reading

Vouching for Docker Images

Vouching for Docker Images

If you were using computers in the ‘90s and the early 2000s, you probably had the experience of installing a piece of software you downloaded from the internet, only to discover that someone put some nasty into it, and now you’re dragging your computer to IT to beg them to save your data. To remedy this, software developers started “signing” their software in a way that proved both who they were and that nobody tampered with the software after they released it. Every major operating system now supports code or application signature verification, and it’s a backbone of every app store.But what about Kubernetes? How do we know that our Docker images aren’t secret bitcoin miners, stealing traffic away from customers to make somebody rich? That’s where Binary Authorization comes in. It’s a way to apply the code signing and verification that modern systems now rely on to the cloud. Coupled with Voucher, an open source project started by my team at Shopify, we’ve created a way to prevent malicious software from being installed without making developers miserable or forcing them to learn cryptography.

Why Block Untrusted Applications?

Your personal or work computer getting compromised is a huge deal! Your personal data being stolen or your computer becoming unusable due to popup ads or background processes doing tasks you don’t know about is incredibly upsetting.

But imagine if you used a compromised service. Imagine if your email host ran in Docker containers in a cluster with a malicious service that wanted to access contents of the email databases? This isn’t just your data, but the data of everyone around you.

This is something we care about deeply at Shopify, since trust is a core component of our relationship with our merchants and our merchants’ relationships with their customers. This is why Binary Authorization has been a priority for Shopify since our move to Kubernetes.

What is Code Signing?

Code signing starts by taking a hash of your application. Hashes are made with hashing algorithms that take the contents of something (such as the binary code that makes up an application) and make a short, reproducible value that represents that version. A part of the appeal of hashing algorithms is that it takes an almost insurmountable amount of work (provided you’re using newer algorithms) to find two pieces of data that produce the same hash value.

For example, if you have a file that has the text:

Hello World

The hash representation of that (using the “sha256” hashing algorithm) is:


Adding an exclamation mark to the end of our file:

Hello World!

Results in a completely different hash:


Once you have a hash of an application, you can run the same hashing algorithm on it to ensure that it hasn’t changed. While this is an important part of the code signing, most signing applications will automatically create a hash of what you are signing, rather than requiring you to hash and then sign the hash separately. It makes the hash creation and verification transparent to the developers and their users.

Once the initial release is ready, the developer that’s signing the application creates a public and private key for signing it, and shares the public key with their future users. The developer then uses the private part of their signing key and the hash of the application to create a value that can be verified with the public part of the key.

For example, with Minisign, a tool for creating signatures quickly, first we create our signing key:

The public half of the key is now:


And the private half remains private, living in /Users/caj/.minisign/minisign.key.

Now, if our application was named “hello” we can create a signature with that private key:

And then your users could verify that “hello” hasn’t been tampered with by running:

Unless you’re a software developer or power user, you likely have never consciously verified a signature, but that’s what’s happening behind the scenes if you’re using a modern operating system. 

Where is Code Signing Used?

Two of the biggest users of code signing are Apple and Google. They use code signing to ensure that you don’t accidentally install software updates that have been tampered with or malicious apps from the internet. Signatures are usually verified in the background, and you only get notified if something is wrong. In Android, you can turn this off by allowing unknown apps in the phone's settings, whereas iOS requires the device be jailbroken to allow unsigned applications to be installed.

A macOS dialog window showing that Firefox is damaged and can't be opened. It gives users the option of moving it to the Trash.

A macOS dialog window showing that Firefox is damaged and can't be opened. It gives users the option of moving it to the Trash.

In macOS, applications that are missing their developer signatures or don’t have valid signatures are blocked by the operating system and advise users to move them to the Trash.

Most Linux package managers, (such as Apt/DPKG in Debian and Ubuntu, Pacman in ArchLinux) use code signing to ensure that you’re installing packages from the distribution maintainer, and verify those packages at install time.

Docker Hub showing a docker image created by the author.
Docker Hub showing a docker image created by the author.

Unfortunately, Kubernetes doesn’t have this by default. There are features that allow you to leverage code signing, but chances are you haven’t used them.

And at the end of the day, do you really trust some rando on the internet to actually give you a container that does what it says it does? Do you want to trust that for your organization? Your customers?

What is Binary Authorization?

Binary Authorization is a series of components that work together: 

  • A metadata service: a service that stores signatures and other image metadata
  • A Binary Authorization Enforcer: a service that blocks images that it can’t find valid signatures for
  • A signing service: a system that signs new images and stores those signatures in the metadata service.

Google provides the first two services for their Kubernetes servers, which Shopify uses, based on two open source projects:

  • Grafeas, a metadata service
  • Kritis, a Binary Authorization Enforcer

When using Kritis and Grafeas or the Binary Authorization feature in Google Kubernetes Engine (GKE), infrastructure developers will configure policies for their clusters, listing the keys (also referred to as attestors) that must have signed the container images before they can run.

When new resources are started in a Kubernetes cluster, the images they reference are sent to the Binary Authorization Enforcer. The Enforcer connects to the metadata service to verify the existence of valid signatures for the image in question and then compares those signatures to the policy for the cluster it runs in. If the image doesn’t have the required signatures, it’s blocked, and any containers that would use it won’t start.You can see how these two systems work together to provide the same security that one gets in one’s operating system! However, there’s one piece that wasn’t provided by Google until recently: the signing service.

Voucher: The Missing Piece

Voucher serves as the last piece for Binary Authorization, the signing service. Voucher allows Shopify to run security checks against our Docker images and sign them depending on how secure they are, without requiring that non-security teams manage their signing keys.

Using Voucher's client software to check an image with the 'is_shopify' check, which verifies if the image was from a Shopify owned repository.
Using Voucher's client software to check an image with the 'is_shopify' check, which verifies if the image was from a Shopify owned repository.

The way it works is simple:

  1. Voucher runs in Google Cloud Run or Kubernetes and is accessible as a REST endpoint
  2. Every build pipeline automatically calls to Voucher with the path to the image it built
  3. Voucher reviews the image, signs it, and pushes that signature to the metadata service

On top of the basic code signing workflow discussed previously, Voucher also supports validating more complicated requirements, using separate security checks and associated signing keys to mix and match required signatures on a per cluster basis to create distinct policies based on a cluster’s requirement.

For example, do you want to block images that weren’t built internally? Voucher has a distinct check for verifying that an image is associated with a Git commit in a Github repo you own, and signing those images with a separate key.

Alternatively, do you need to be able to prove that every change was approved by multiple people? Voucher can support that, creating signatures based on the existence of approvals in Github (with support for other code hosting services coming soon). This would allow you to use Binary Authorization to block images that would violate that requirement.

Voucher also has support for verifying the identity of the container builder, blocking images with a high number of vulnerabilities, and so on. And Voucher was designed to be extensible, allowing for the creation of new checks as need be.By combining Voucher’s checks and Binary Authorization policies, infrastructure teams can create a layered approach to securing their organization’s Kubernetes clusters. Compliance clusters can be configured to require approvals and block images with vulnerabilities, while clusters for experiments and staging can use less strict policies to allow developers to move faster, all with minimum work from non-security focused developers.

Voucher Joins Grafeas

As mentioned earlier, Voucher serves a need that hasn’t been provided by Google until recently. This is because Voucher has moved into the Grafeas organization and now is a service provided by Google to Google Kubernetes Engine users going forwards. 

Since our move to Kubernetes, Shopify’s security team has been working with Google’s Binary Authorization team to plan out how we’ll roll out Binary Authorization and design Voucher. We also released Voucher as an open source project in December 2018. This move to the Grafeas project simplifies things, putting it in the same place as the other open source Binary Authorization components.

Improving the security of the infrastructure we build makes everyone safer. And making Voucher a community project will put it in front of more teams which will be able to leverage it to further secure their Kubernetes clusters, and if we’re lucky, will result in a better, more powerful Voucher! Of course, Shopify’s Software Supply Chain Security team will continue our work on Voucher, and we want you to join us!

Please help us!

If you’re a developer or writer who has time and interest in helping out, please take a look at the open issues or fork the project and open a PR! We can always use more documentation changes, tutorials, and third party integrations!

And if you’re using Voucher, let us know! We’d love to hear how it’s going and how we can do a better job of making Kubernetes more secure for everyone!

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Build a Production Grade Workflow with SQL Modelling

How to Build a Production Grade Workflow with SQL Modelling

By Michelle Ark and Chris Wu

In January of 2014, Shopify built a data pipeline platform for the data science team called Starscream. Back then, we were a smaller team and needed a tool that could deal with everything from ad hoc explorations to machine learning models. We chose to build with PySpark to get the power of a generalized distributed computer platform, the backing of the industry standard, and the ability to tap into the Python talent market. 

Fast forward six years and our data needs have changed. Starscream now runs 76,000 jobs and writes 300 terabytes a day. As we grew, some types of work went away, but others (like simple reports) became so commonplace we do them every day. While our Python tool based on PySpark was computationally powerful, it wasn’t optimized for these commonplace tasks. If a product manager needed a simple rollup for a new feature by country, pulling it, and modeling it wasn’t a fast task.

Below, we’ll dive into how we created a robust, production-ready SQL modeling workflow for building straightforward pipelines by leveraging dbt. We'll walkthrough how we built this system—which we call Seamster—and created tooling for testing and documentation on top of it. 

The Problem

When we interviewed our users to understand their workflow on Starscream, there were two issues we discovered: development time and thinking.

Development time encompasses the time data scientists use to prototype the data model they’d like to build, run it, see the outcome, and iterate. The PySpark platform isn’t ideal for running straightforward reporting tasks, often forcing data scientists to write boilerplate code and yielding long runtimes. This leads to long iteration cycles when trying to build models on unfamiliar data.

The second issue, thinking, is more subtle and deals with the way the programming language forces you to look at the data. Many of our data scientists prefer SQL to python because its structure forces consistency in business metrics. When interviewing users, we found a majority would write out a query in SQL then translate it to Python when prototyping. Unfortunately, query translation is time consuming and doesn’t add value to the pipeline.

To understand how widespread these problems were, we audited the jobs run and surveyed our data science team for the use cases. We found that 70 percent or so of the PySpark jobs on Starscream were full batch queries that didn’t require generalized computing. We viewed this as an opportunity to make a kickass optimization for a painful workflow. 

Enter Seamster

Our goal was to create a SQL pipeline for reporting that enables data scientists to create simple reporting faster, while still being production ready. After exploring a few alternatives, we felt that the dbt library came closest to our needs. Their tagline “deploy analytics code faster with software engineering practices” was exactly what we were looking for in a workflow. We opted to pair it with Google BigQuery as our data store and dubbed the system and its tools, Seamster.

We knew that any off-the-shelf system wouldn’t be one-size-fits-all. In moving to dbt, we had to implement our own:

  • Source and model structure to modularize data model development
  • Unit testing to increase the types of testable errors
  • Continuous integration (CI) pipelines to provide safety and consistency guarantees

Source Independence and Safety

With dozens of data scientists making data models in a shared repository, a great user experience would:

  • Maximize focus on work 
  • Minimize the impact of model changes by other data scientists

By default, dbt declares raw sources in a central sources.yml. This quickly became a very large file as it included the schema for each source, in addition to the source name. This caused a huge bottleneck for teams editing the same file across multiple PRs. 

To mitigate the bottleneck, we leveraged the flexibility of dbt and created a top-level ‘sources’ directory to represent each raw source with its own source-formatted YAML file. This way, data scientists can parse only the source documentation that’s relevant for them and contribute to the sources.yml file without stepping on each other’s toes.

Base models are one-to-one interfaces to raw sources.

We also created a base layer of models using the staging’ concept from dbt to implement their best practice of limiting references to raw data. Our base models serve as a one-to-one interface to raw sources. They don’t change the grain of the raw source, but do apply renaming, recasting, or any other cleaning operation that relates to the source data collection system. 

The base layer serves to protect users from breaking changes in raw sources. Raw external sources are by definition out of the control of Seamster and can introduce breaking changes for any number of reasons, at any point in time. If and when this happens, users only need to apply the fix to the base model representing the raw source, as opposed to every individual downstream model that depends on the raw source. 

Model Ownership for Teams

We knew that the tooling improvements of Seamster would only be one part of a greater data platform at Shopify. We wanted to make sure we’re providing mechanisms to support good dimensional modelling practices and support data discovery.

In dbt, a model is simply a .sql file. We’ve extended this definition in Seamster to define a model as a directory consisting of four files: 

  • model_name.sql
  • schema.yml

Users can further organize models into directories that indicate a data science team at Shopify like Finance or Marketing. 

To support a clean data warehouse, we also organized data models into layers that differentiate between:

  • Base: data models that are one-to-one with raw data, but cleaned, recast, and renamed
  • Application-ready: data that isn’t dimensionally modelled but still transformed and clean for consumption by another tool (for example,  training data for a machine learning algorithm)
  • Presentation: shareable and reliable data models that follow dimensional modelling best practices and can be used by data scientists across different domains

With these two changes, a data consumer can quickly understand the data quality they can expect from a model and find the owner in case there is an issue. We also pass this metadata upstream to other tools to help with the data discovery workflow.

More Tests

dbt has native support for ‘schema tests’, which are encoded in a model’s schema.yml file. These tests run against production data to validate data invariants, such as the presence of null values or the uniqueness of a particular key. This feature in dbt serves its purpose well, but we also want to enable data scientists to write unit tests for models that run against fixed input data (as opposed to production data).

Testing on fixed inputs allows the user to test edge cases that may not be in production yet. In larger organizations, there can and will be frequent updates and many collaborators for a single model. Unit tests give users confidence that the changes they’re making won’t break existing behaviour or introduce regressions. 

Seamster provides a Python-based unit testing framework. Users can write their unit tests in the file in the model directory. The framework enables constructing ‘mock’ input models from fixed data. The central object in this framework is a ‘mock’ data model, which has an underlying representation of a Pandas dataframe. Users can pass fixed data to the mock constructor as either a csv-style string, Pandas dataframe, or a list of dictionaries to specify input data. 

Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation
Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation.

A constructor creates a test query where a common table expression (CTE) represents each input mock data model, and any references to production models (identified using dbt’s ‘ref’ macro) are replaced by references to the corresponding CTE. Once a user executes a query, they can compare the output to an expected result. In addition to an equality assertion, we extended our framework to support all expectations from the open-source Great Expectations library to provide more granular assertions and error messaging. 

The main downside to this framework is that it requires a roundtrip to the query engine to construct the test data model given a set of inputs. Even though the query itself is lightweight and processes only a handful of rows, these roundtrips to the engine add up. It becomes costly to run an entire test suite on each local or CI run. To solve this, we introduced tooling both in development and CI to run the minimal set of tests that could potentially break given the change. This was straightforward to implement with accuracy because of dbt’s lineage tracking support; we simply had to find all downstream models (direct and indirect) for each changed model and run their tests. 

Schema and DAG Validation on the Cheap

Our objective in Seamster’s CI is to give data scientists peace of mind that their changes won’t introduce production errors the next time the warehouse is built. They shouldn’t have to wonder whether removing a column will cause downstream dependencies to break, or whether they made a small typo in their SQL model definition.

To achieve this accurately, we needed to build and tear down the entire warehouse on every commit. This isn’t feasible from both a time and cost perspective. Instead, on every commit we materialize every model as a view in a temporary BigQuery dataset which is created at the start of the validation process and removed as soon as the validation finishes. If we can’t build a view because its upstream model doesn’t provide a certain column, or if the SQL is invalid for any reason, BigQuery fails to build the view and produces relevant error messaging. 

Currently, this validation step takes about two minutes. We reduce validation time further by only building the portion of the DAG affected by the changed models, as done in the unit testing approach. 

dbt’s schema.yml serves purely as metadata and can contain columns with invalid names or types (data_type). We employ the same view-based strategy to validate the contents of a model’s schema.yml file ensuring the schema.yml is an accurate depiction of the actual SQL model.

Data Warehouse Rules

Like many large organizations, we maintain a data warehouse for reporting where accuracy is key. To power our independent data science teams, Seamster helps by enforcing conformance rules on the layers mentioned earlier (base, application-ready, and presentation layers). Examples include naming rules or inheritance rules which help the user reason over the data when building their own dependent models.

Seamster CI runs a collection of such rules that ensure consistency of documentation and modelling practices across different data science teams. For example, one warehouse rule enforces that all columns in a schema conform to a prescribed nomenclature. Another warehouse rule enforces that only base models can reference raw sources (via the ‘source’ macro) directly. 

Some warehouse rules apply only to certain layers. In the presentation layer, we enforce that any column name needs a globally unique description to avoid divergence of definitions. Since everything in dbt is YAML, most of this rule enforcement is just simple parsing.

So, How Did It Go?

To ensure we got it right and worked out the kinks, we ran a multiweek beta of Seamster with some of our data scientists who tested the system out on real models. Since you’re reading about it, you can guess by now that it went well!

While productivity measures are always hard, the vast majority of users reported they were shipping models in a couple of days instead of a couple of weeks. In addition, documentation of models increased due to the fact that this is a feature built into the model spec.

Were there any negative results? Well, there's always going to be things we can improve. dbt’s current incremental support doesn’t provide safe and consistent methods to handle late arriving data, key resolution, and rebuilds. For this reason, a handful of models (like type 2 dimensions or models in the over 1.5 billion event territory) that required incremental semantics weren’t doable—for now. We’ve got big plans to address this in the future.

Where to Next?

We’re focusing on updating the tool to ensure it’s tailored to Shopify data scientists. The biggest hurdle for a new product (internal and external) is adoption. We know we still have work to do to ensure that our tool is top of mind when users have simple (but not easy) reporting work. We’re spending time with each team to identify upcoming work that we can speed up by using Seamster. Their questions and comments will be part of our tutorials and documentations for new users.

On the engineering front, an exciting next step is looking beyond batch data processing. Apache Beam and Beam SQL provide an opportunity to consider a single SQL-centric data modelling tool for both batch and streaming use cases.

We’re also big believers in open source at Shopify. Depending on the dbt’s community needs, we’d also like to explore contributing our validation strategy and a unit testing framework to the project. 

If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.

Continue reading

Adopting Sorbet at Scale

Adopting Sorbet at Scale

On November 25, 2020 we held ShipIt! Presents: The State of Ruby Static Typing at Shopify. The video of the event is now available.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version or our monolith 40 times a day. Shopify is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale, with a dynamic language, even with the most rigorous review process and over 150 000 tests, it’s a challenge to ensure that everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

In my first post, I talked about how we brought static typing to our core monolith. We adopted Sorbet in 2019, and the Ruby Infrastructure team continues to work on ways to make the development process safer, faster, and more enjoyable for Ruby developers. Currently, Sorbet is only enforced on our main monolith, but we have 60 internal projects using Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration (CI) platform for every pull request and fails builds if type checking errors are found. As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures.

In this second post, I’ll present how we got from no Sorbet in our monolith to almost full coverage in the span of a few months. I’ll explain the challenges we faced, the tools we built to solve them, and the preliminary results of our experiment to reduce production errors with static typing.

Our Open-Source Tooling for Sorbet Adoption

Currently, Sorbet can’t understand all the constructs available in Ruby. Furthermore, Shopify relies on a lot of gems and frameworks, including Rails, that bring their own set of idioms. Increasing type coverage in our monolith meant finding ways to make Sorbet understand all of this. These are the tools we created to make it possible. They are open sourced in our effort to share our work with the community and make typing adoption easier for everyone.

Making Code Sorbet-compatible with RuboCop Sorbet

Even with gradual typing, moving our monolith to Sorbet required a lot of changes to remove or replace Ruby constructs that Sorbet couldn’t understand, such as non-constant superclasses or accessing constants through meta-programming with const_get. For this, we created RuboCop Sorbet, a suite of RuboCop rules allowing us to:

  • forbid some of the constructs not recognized by Sorbet yet 
  • automatically correct those constructs to something Sorbet can understand.

We also use these cops to require a minimal typed level on all files of our monolith (at least typed: false for now, but soon typed: true) and enforce some styling conventions around the way we write signatures.

Creating RBI Files for Gems with Tapioca

One considerable piece missing when we started using Sorbet was Ruby Interface file (RBI) generation for gems. For Sorbet to understand code from required gems, we had two options: 

  1. pass the full code of the gems to Sorbet which would make it slower and require making all gems compatible with Sorbet too
  2. pass a light representation of the gem content through an Ruby Interface file called a RBI file.

Being before the birth of Sorbet’s srb tool, we created our own: Tapioca. Tapioca provides an automated way to generate the appropriate RBI file for a given gem with high accuracy. It generates the definitions for all statically defined types and most of the runtime defined ones exported from a Ruby gem. It loads all the gems declared in the dependency list from the Gemfile into memory, then performs runtime introspection on the loaded types to understand their structure, and finally generates a complete RBI file for each gem with a versioned filename.

Tapioca is the de facto RBI generation tool at Shopify and used by a few renowned projects including Homebrew.

Creating RBI Files for Domain Specific Languages

Understanding the content of the gems wasn’t enough to allow type checking our monolith. At Shopify we use a lot of internal Domain Specific Languages (DSLs), most of them coming directly from Rails and often based on meta-programming. For example, the Active Record association belongs_to ends up defining tens of methods at runtime, none of which are statically visible to Sorbet. To enhance Sorbet coverage on our codebase we needed it to “see” those methods.

To solve this problem, we added RBI generation for Rails DSLs directly into Tapioca. Again, using runtime introspection, Tapioca analyzes the code of our application to generate RBI files containing a static definition for all the runtime-generated methods from Rails and other libraries.

Today Tapioca provides RBI generation for a lot of DSLs we use at Shopify:

  • Active Record associations
  • Active Record columns
  • Active Record enums
  • Active Record scopes
  • Active Record typed store
  • Action Mailer
  • Active Resource
  • Action Controller helpers
  • Active Support current attributes
  • Rails URL helpers
  • FrozenRecord
  • IdentityCache
  • Google Protobuf definitions
  • SmartProperties
  • StateMachines
  • …and the list is growing everyday

Building Tooling on Top of Sorbet with Spoom

As we began using Sorbet, the need for tooling built on top of it was more and more apparent. For example, Tapioca itself depends on Sorbet to list the symbols for which we need to generate RBI definitions.

Sorbet is a really fast Ruby parser that can build an Abstract Syntax Tree (AST) of our monolith in a matter of seconds versus a few minutes for the Whitequark parser. We believe that in the future a lot of tools such as linters, cops, or static analyzers can benefit from this speed.

Sorbet also provides a Language Server Protocol (LSP) with the option --lsp. Using this option, Sorbet can act as a server that is interrogated by other tools programmatically. LSP scales much better than using the file output by Sorbet with the --print option (see for example parse-tree-json or symbol-table-json) that spits out GBs of JSON for our monolith. Using LSP, we get answers in a few milliseconds instead of parsing those gigantic JSON files. This is generally how the language plugins for IDEs are implemented.

To facilitate the development of external tools to Sorbet we created Spoom, our toolbox to use Sorbet programmatically. It provides a set of useful features to interact with Sorbet, parse the configuration files, list the type checked files, collect metrics, or automatically bump files to higher strictnesses and comes with a Ruby API to connect with Sorbet’s LSP mode.

Today, Spoom is at the heart of our typing coverage reporting and provides the beautiful visualizations used in our SorbetMetrics dashboard.

Sharing Lessons Learned

After more than a year using Sorbet on our codebases, we learned a lot. I’ll share some insights about what typing did for us, which benefits it brings, and some of the limitations it implies.

Build, Measure, Learn

There’s a very scientific way to approach building products, encapsulated in the Build-Measure-Learn loop pioneered by Eric Ries. Our team believes in intuition, but we still prefer empirical proofs when we have access to them. So when we started with static typing in Ruby, we all believed it would be useful for our developers, but wanted to measure its effects and have hard data. This allows us to decide what we should concentrate on next based on the outcome of our measurements.

I talked about observing metrics, surveying developer happiness, or getting feedback through interviews in part 1, but my team wanted to go further and correlate the impact of typing on production errors. So, we conducted a series of controlled experiments to validate our assumptions.

Since our monolith evolves very fast, it becomes hard to observe the direct impact of typing on production. New features are added every day which gives rise to new errors while we work to decrease errors in other areas. Moreover, our monolith has about 500 gem dependencies (including transitive dependencies), any of which could introduce new errors in a version bump.

For this reason, we decreased our scope and targeted a smaller codebase for our experiment. Our internal developer tool, aptly named dev, was an ideal candidate. It’s a mature codebase that changes slowly and by a few people. It’s a very opinionated codebase with no external dependencies (the few dependencies it has are vendored), so it could satisfy the performance requirements of a command-line tool. Additionally, dev almost uses no meta-programming, especially not the kind normally coming from external libraries. Finally, it’s a tool with heavy usage since it’s the main development tool used by all developers at Shopify for their day-to-day work. It runs thousands of times a day on hundreds of different computers, there’s no edge case—at this scale, if something can break, it will.

We started monitoring all errors raised by dev in production, categorized the errors, analyzed their root cause, and tried to understand how typing could avoid them.

typed: ignore means typed: debt

Our first realisation was to keep away from typed: ignore. Ignoring a file can cause errors to appear in other files because those other files may reference something defined in the ignored file.

For example, if we opt to ignore this file:

Sorbet will raise errors in this file:

Since Sorbet doesn't even parse the file a.rb, it won’t know where constant A was defined. The more files you ignore, the more this case arises, especially when ignoring library files. This makes it harder and harder for other developers to type their own code.

As a rule of thumb at Shopify, we aim to have all our application files at least at typed: true and our test files at least at typed: false. We reserve typed: ignore for some test files that are particularly hard to type (because of mocking, stubbing, and fixtures), or some very specific files such as Protobuf definition files (which we handle through DSLs RBI generation with Tapioca).

Benefits Realized, Even at typed: false

Even at typed: false, Sorbet provides safety in our codebase by checking that all the constants resolve. Thanks to this, we now avoid mistakes triggering NameErrors either in CI or production.

Enabling Sorbet on our monolith allowed us to find and fix a few mistakes such as:

  • StandardException instead of StandardError
  • NotImplemented instead of NotImplementedError

We found dead code referencing constants deleted months ago. Interestingly, while most of the main execution paths were covered by tests, code paths for error handling were the places where we found the most NameErrors.

A bar graph showing the decreasing amount of NameErrors in dev over time
NameErrors raised in production for the dev project

During our experiment, we started by moving all files from dev to typed: false without adding any signatures. As soon as Sorbet was enabled in October 2019 on this project, no more NameErrors were raised in production.

Stacktrace showing NameError raised in production after Sorbet was enabled because of meta-programming like const_get
NameError raised in production after Sorbet was enabled because of meta-programming

The same observation was made on multiple projects: enabling Sorbet on a codebase eradicates all NameErrors due to developers’ mistakes. Note that this doesn’t avoid NameErrors triggered through metaprogramming, for example, when using const_get.

While Sorbet is a bit more restrictive when it comes to resolve constants, this strictness can be beneficial for developers:

Example of constant resolution error raised by Sorbet

typed: true Brings More Benefits

A circular tree map showing the relationship between strictness level and helpers in dev
Files strictnesses in dev (the colored dots are the helpers)

With our next experiment on gradual typing, we wanted to observe the effects of moving parts of the dev application to typed: true. We moved a few of the typed: false files to typed: true by focusing on the most reused part of the application, called helpers (the blue dots).

A bar graph showing the decrease in NoMethodErrors for files typed: true over time
NoMethodErrors in production for dev (in red the errors raised from the helpers)

By typing only this part of the application (~20% of the files) and still without signatures, we observed a decrease in NoMethodErrors for files typed: true.

Those preliminary results gave us confidence that a stricter typing can impact other classes of errors. We’re now in the process of adding signatures to the typed: true files in dev so we can observe their effect on TypeErrors and ArgumentErrors in production.

The Road Ahead of Us

The team working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable, and safe Ruby language for Shopify. As part of that mandate, we believe there are more things we can do to increase Ruby performance when the types of variables are known ahead of time. This is an interesting area to explore for the team as our adoption of typing increases and we’re certainly thinking about investing in this in the near future.

Support for Ruby 2.7

We keep our monolith as close as possible to Ruby trunk. This means we moved to Ruby 2.7 months ago. Doing so required a lot of changes in Sorbet itself to support syntax changes such as beginless ranges and numbered parameters, as well as new behaviors like forbidding circular argument references. Some work is still in progress to support the new forwarding arguments syntax and pattern matching. Stripe is currently working on making the keyword arguments compatible with Ruby 2.7 behavior.

100% Files at typed: true

The next objective for our monolith is to move 100% of our files to at least typed: true and make it mandatory for all new files going forward. Doing so implies making both Sorbet and our tooling smarter to handle the last Ruby idioms and constructs we can’t type check yet.

We’re currently focusing on changing Sorbet to support Rails ActiveSupport::Concerns (Sorbet pull requests #3424, #3468, #3486) and providing a solution to inclusion requirements. As well as, improvement to Tapioca for better RBI generation for generics and GraphQL support.

Investing in the Future of Static Typing for Ruby

Sorbet isn’t the end of the road concerning Ruby type checking. Matz announced that Ruby 3 will ship with RBS, another approach to type check Ruby programs. While the solution isn’t yet mature enough for our needs (mainly because of speed and some other limitations explained by Stripe) we’re already collaborating with Ruby core developers, academics, and Stripe to make its specification better for everyone.

Notably, we have open-sourced RBS parser, a C++ parser for RBS capable of translating a subset of RBS to RBI, and are now working on making RBS partially compatible with Sorbet.

We believe that typing, whatever the solution used is greatly beneficial for Ruby, Shopify, and our merchants so we'll continue to invest heavily in it. We want to increase the typing for the whole community.

We’ll continue to work with collaborators to push typing in Ruby even further. As we lessen the effort needed to adopt Sorbet, specifically on Rails projects, we’ll start making gradual typing mandatory on more internal projects. We will help teams start adopting at typed: false and move to stricter typing gradually.

As our long term goal, we hope to bring Ruby on par with compiled and statically typed languages regarding safety, speed and tooling.

Do you want to be part of this effort? Feel free to contribute to Sorbet (there are a lot of good first issues to begin with), check our many open-source projects or take a look at how you can join our team.

Happy typing!

—The Ruby Infrastructure Team

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Static Typing for Ruby

Static Typing for Ruby

On November 25, 2020 we held ShipIt! Presents: The State of Ruby Static Typing at Shopify. The video of the event is now available.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

The Three Key Requirements for a Typing Solution in Ruby

Even in 2018, typing for Ruby wasn't a novelty. A few attempts were made to integrate type annotations directly into the language or through external tools (RDL, Steep), or as libraries (dry-types). 

Which solution would best fit Shopify considering its codebase and culture? For the Ruby Infrastructure team, the best match for a typing solution needs:

  • Gradual typing: Typing a monolith isn't a simple task and can’t be done in a day. Our code evolves fast, and we can’t ask developers to stop coding while we add types to the existing codebase. We need flexibility to add types without blocking the development process or limiting our ability to satisfy merchants needs.
  • Speed: Considering the size of Shopify’s codebase, speed is a concern. If our goal is to provide quick feedback on errors and remove pressure from continuous integration (CI), we need a solution that’s fast.
  • Full Ruby support: We use all of Ruby at Shopify. Our culture embraces the language and benefits from all features, even hard to type ones like metaprogramming, overloading, and class reopening. Support for Rails is also a must. From code elegance to developer happiness, the chosen solution needs to be compatible as much as possible with all Ruby features.

With such a list of requirements, none of the contenders at the time could satisfy our needs, especially the speed requirement. We started thinking about developing our own solution, but a perfectly timed meeting with Stripe, who were working on a solution to the problem, introduced us to Sorbet.

Sorbet was closed-source at the time and under heavy development but was already promising. It’s built for gradual typing with phenomenal performance (able to analyze 100,000 lines per second per core) making it significantly faster than running automated tests. It can handle hard to type things like metaprogramming, thanks to Ruby Interface files (RBI). This is how, at the start of 2019, Shopify began its journey toward static type checking for Ruby.

Treat Static Typing as a Product

With only a three-person team and a lot on our plate, fully typing our monolith with Sorbet was going to be an approach based on Shopify’s Get Shit Done (GSD) framework.

  1. We tested the viability of Sorbet on our core monolith by only typing a few files, to check if we could observe benefits from it while not impairing other developers’ work. Sorbets’ gradual approach proved to be working.
  2. We manually created RBI files to represent what Sorbet could not understand yet. We checked we supported Ruby’s most advanced features as well as Rails constructs.
  3. We added more and more files while keeping an eye on performance ensuring Sorbet would scale with our monolith.

This gave us confidence Sorbet was the right choice to solve our problem. Once we officially decided to go with Sorbet we reflected on how we can reach 100% adoption in the monolith. To determine our roadmap we looked at:

  • how many files needed to be typed
  • the content of the files
  • the Ruby features they used.

Track Static Typing Adoption

Type checking in Sorbet comes in different levels of strictness. The strictness is defined on a per file basis by adding a magic comment in the file, called a sigil, written # typed: LEVEL, where LEVEL can be one of the following: 

  • ignore: At this level, the file is not even read by Sorbet, and no errors are reported for this file at all.
  • false: Only errors related to syntax, constant resolution and correctness of sigs are reported. At this level sorbet doesn’t check the calls in the files even if the methods called don't exist anywhere in the codebase.
  • true: This is the level where Sorbet actually starts to type check your code. All methods called need to exist in the code base. For each call, Sorbet will check that the arguments count matches the method definition. If the method has a signature, Sorbet will also check their types.
  • strict: At this level all methods must have a signature, and all constants and instance variables must have explicitly annotated types.
  • strong: Sorbet no longer allows untyped variables. In practice, this level is actually unusable for most files because Sorbet can’t type everything yet and even Stripe advises against using it.

Once we were able to run Sorbet on our codebase, we needed a way to track our progress and identify which parts of our monolith were typed with which strictness or which parts needed more work. To do so we created SorbetMetrics, a tool able to collect and display metrics about typing coverage for all our internal projects. We started tracking three key metrics to measure Sorbet adoption :

  • Sigils: how many files are typed ignore, false, true, strict or strong
  • Calls: how many calls are sent to a method with a signature
  • Signatures: how many methods have a signature

Bar graph showing increased Sorbet usage in projects over time. Below the bar graph is a table showing the percentage of sigils, calls, signatures in each project.
SorbetMetrics dashboard homepage

Each day SorbetMetrics pulls the latest version of our monolith and other Shopify projects using Sorbet, computes those metrics and displays them in a dashboard internally available to all our developers.

A selection of charts from the SorbetMetrics Dashboard. 3 pie charts showing the percentage of sigils, calls, and signatures in the monolith. 3 line charts showing Sigils, calls, and signature percentage over time. A circular tree map showing the relationship between strictness level and components. 2 line charts showing Sorbet versions and typechecking time over time
SorbetMetrics dashboard for our monolith

Sorbet Support at Scale

If we treat typing as a product, we also need to focus on supporting and enabling our “customers” who are developers at Shopify. One of our goals was to have a strong support system in place to help with any problems that arise and slow developers down.

Initially, we supported developers with a dedicated Slack channel where Shopifolk could ask questions to the team. We’d answer these questions real-time and help Shopifolk with typing efforts where our input was important.

This white glove support model obviously didn't scale, but it was an excellent learning opportunity for our team—we now understood the biggest challenges and recurring problems. We ended up solving some problems over and over again, but it solidified the effort to understand the patterns and decide which features to work on next.

Using Slack meant our answers weren't discoverable forever. We moved most of the support and conversation to our internal Discourse platform, increasing discoverability and broader sharing of knowledge. This also allows us to record solutions in a single place and let developers self-serve as much as possible. As we onboard more and more projects with Sorbet, this solution scales better.

Understand Developer Happiness

Going further from unblocking our users, we also need to ensure their happiness. Sorbet and more generally static typing in Ruby wouldn’t be a good fit for us if it made our developers miserable. We’re aware that it introduces a bit more work, so the benefits need to balance with the inconvenience.

Our first tool to measure developers’ opinions of Sorbet is surveys. Twice a year, we send a “Typing @ Shopify” survey to all developers and collect their sentiments regarding Sorbet’s benefits and limitations, as well as what we should focus on in the future.

A bar graph showing the increasing strongly agree answer over time to the question I want Sorbet to be applied to other Shopify projects. Below that graph is a bar graph showing the increasing strongly agree answer over time to the question I want more code to be typed.
Some responses from our “Sorbet @ Shopify” surveys

We use simple questions (“yes” or “no”, or a “Strongly Disagree” (1) to “Strongly Agree” (5) scale) and then look at how the answers evolve over time. The survey results gave us interesting insights:

  • Sorbet catches more errors on developer’s pull requests (PR) as adoption increased
  • Signatures help with code understanding and give developers confidence to ship
  • Added confidence directly impacted the increasing positive opinion about static typing in Ruby
  • Over time developers wanted more code and more projects to be typed
  • Developers get used to Sorbet syntax over time
  • IDE integration with Sorbet is a feature developers are rooting for

Our main observation is that developers enjoy Sorbet more as the typing coverage increases. This is one reason that's increasing our motivation to reach 100% of files at typed: true and maximize the amount of methods with a signature.

The second tool is interviews with individual developers. We select a team working with Sorbet and meet each member to talk about their experience using Sorbet either in the monolith or on another project. We get a better understanding of what their likes and dislikes are, what we should be improving, but also how we can better support them when introducing Sorbet, so the team keeps Sorbet in their project.

The Current State of Sorbet at Shopify

Currently, Sorbet is only enforced on our main monolith and we have about 60 other internal projects that opted to use Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration platform (CI) for every PR and fails builds if type checking errors are found. We’re currently evaluating the idea of enforcing valid type checking on CI even before running the automated tests.

Three pie charts showing percentage of sigils, calls, and signatures in the monolith used to measure Sorbet adoption
Typing coverage metrics for Shopify’s monolith

As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures. All of this can be type checked under 15 seconds on our developers machines.

A circular tree map showing the relationship between strictness level and components
Files strictness map in Shopify’s monolith

The circle map shows which parts of our monolith are at which strictness level. Each dot represents a Ruby file (excluding tests). Each circle represents a component (a set of Ruby files serving the same application concern). Yes, it looks like a Petri dish and our goal is to eradicate the bad orange untyped cells.

A bar graph showing increased number of Shopify projects using Sorbet over time
Shopify projects using Sorbet

Outside of the core monolith, we’ve also observed a natural increase of Shopify projects, both internal and open-source, using Sorbet. As I write these lines, more than 60 projects now use Sorbet. Shopifolks like Sorbet and use it on their own without being forced to do so.

A bar graph showing manual dev tc runs from developers machine on our monolith in 2019
Manual dev tc runs from developers machine on our monolith in 2019

Finally, we track how many times our developers ran the command dev tc to typecheck a project with Sorbet on their development machine. This shows us that developers don’t wait for CI to use Sorbet—everyone enjoys a faster feedback loop.

The Benefits of Types for Ruby Developers 

Now that Sorbet is fully adopted in our core monolith, as well as in many internal projects, we’re starting to see the benefits of it on our codebases as well as on our developers. Our team that is working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable and safe Ruby language for Shopify. As part of that mandate, we believe that static typing has a lot to offer for Ruby developers, especially when working on big, complex codebases.

In this post I focused on the process we followed to ensure Sorbet was the right solution for our needs, treating static typing as a product and showed the benefits of this product on our customers: Shopifolk working on our monolith and outside. Are you curious to know how we got there? Then you’ll be interested in the second part: Adopting Sorbet at Scale where I present the tools we built to make adoption easier and faster, the projects we open-sourced to share with the community and the preliminary results of our experiment with static typing to reduce production errors.

Happy typing!

—The Ruby Infrastructure Team

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Introduce Composite Primary Keys in Rails

How to Introduce Composite Primary Keys in Rails

Databases are a key scalability bottleneck for many web applications. But what if you could make a small change to your database design that would unlock massively more efficient data access? At Shopify, we dusted off some old database principles and did exactly that with the primary Rails application that powers online stores for over a million merchants. In this post, we’ll walk you through how we did it, and how you can use the same trick to optimize your own applications.


A basic principle of database design is that data that is accessed together should be stored together. In a relational database, we see this principle at work in the design of individual records (rows), which are composed of bits of information that are often accessed and stored at the same time. When a query needs to access or update multiple records, this query will be faster if those rows are “near” to each other. In MySQL, the sequential ordering of rows on disk is dictated by the table’s primary key.

Active Record is the portion of the Rails application framework that abstracts and simplifies database access. This layer introduces database practices and conventions that greatly simplify application development. One such convention is that all tables have a simple automatically incrementing integer primary key, often called `id`. This means that, for a typical Rails application, most data is stored on disk strictly in the order the rows were created. For most tables in most Rails applications, this works just fine and is easy for application developers to understand.

Sometimes the pattern of row access in a table is quite different from the insertion pattern. In the case of Shopify’s core API server, it is usually quite different, due to Shopify’s multi-tenant architecture. Each database instance contains records from many shops. With a simple auto-incrementing primary key, table insertions interleave the insertion of records across many shops. On the other hand, most queries are only interested in the records for a single shop at a time.

Let’s take a look at how this plays out at the database storage level. We will use details from MySQL using the InnoDB storage engine, but the basic idea will hold true across many relational databases. Records are stored on disk in a data structure called a B+ tree. Here is an illustration of a table storing orders, with the integer order id shown, color-coded by shop:

Individual records are grouped into pages. When a record is queried, the entire page is loaded from disk into an in-memory structure called a buffer pool. Subsequent reads from the same page are much faster while it remains in the buffer pool. As we can see in the example above, if we want to retrieve all orders from the “yellow” shop, every page will need loading from disk. This is the worst-case scenario, but it turned out to be a prevalent scenario in Shopify’s main operational database. For some of our most important queries, we observed an average of 0.9 pages read per row in the final query result. This means we were loading an entire page into memory for nearly every row of data that we needed!

The fix for this problem is conceptually very simple. Instead of a simple primary key, we create a composite primary key [shop_id, order_id]. With this key structure, our disk layout looks quite different:

Records are now grouped into pages by shop. When retrieving orders for the “yellow” shop, we read from a much smaller set of pages (in this example it’s only one page less, but imagine extrapolating this to a table storing records for 10,000 shops and the result is more profound).

So far, so good. We have an obvious problem with the efficiency of data access and a simple solution. For the remainder of this article, we’ll go through some of the implementation details and challenges we came across with rolling out composite primary keys in our main operational database, along with the impact for our Ruby on Rails application and other systems directly coupled to our database. We will continue using the example of an “orders” table, both because it is conceptually simple to understand. It also turned out to be one of the critical table names that we applied this change to.

Introducing Composite Primary Keys

The first challenge we faced with introducing composite primary keys was at the application layer. Our framework and application code contained various assumptions about the table’s primary key. Active Record, in particular, assumes an integer primary key, and although there is a community gem to monkey-patch this, we didn’t have confidence that this approach would be sustainable and maintainable in the future. On deeper analysis, it turned out that nearly all such assumptions in application layer code continued to hold if we changed the `id` column to be an auto-incrementing secondary key. We can leave the application layer blissfully unaware of the underlying database schema by forcing Active Record to treat the `id` column as a primary key:

class Order < ApplicationRecord
  self.primary_key = :id
  .. remainder of order model ...

Here is the corresponding SQL table definition:

CREATE TABLE `orders` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `shop_id` bigint(20) NOT NULL,
  … other columns ...
  PRIMARY KEY (`shop_id`,`id`),
  KEY `id` (`id`)
  … other secondary keys ...

Note that we chose to leave the secondary index as a non-unique key here. There is some risk to this approach because it is possible to construct application code that results in duplicate models with the same id (but with different shop_id in our case). You can opt for safety here and make a unique secondary key on id. We took this approach because the method we use for live schema migrations is prone to deadlock on tables containing multiple unique constraints. Specifically, we use Large Hadron Migrator (LHM), which uses MySQL triggers to copy records into a shadow table during migrations. Unique constraints are enforced in InnoDB through an exclusive table-level write lock. Since there are two tables accepting writes, each containing two exclusive locks, all of the necessary deadlock conditions are present during migration. You may be able to keep a unique constraint on `id` if any of the following are true:

  • You don’t perform live migrations on your application.
  • Your migrations don’t use SQL triggers (such as the default Rails migrations).
  • The write throughput on the table is low enough that a low volume of deadlocks is acceptable for your application.
  • The code path for writing to this table is resilient to database transaction failures.

The remaining area of concern is any data infrastructure that directly accesses the MySQL database outside of the Rails application layer. In our case, we had three key technologies that fell into this category: 

  • Our database schema migration infrastructure, already discussed above.
  • Our live data migration system, called Ghostferry. Ghostferry moves data across different MySQL instances while the application is still running, enabling load-balancing of sharded data across multiple databases. We implemented support for composite primary keys in ghostferry as part of this work, by introducing the ability to specify an alternate column for pagination during migration.
  • Our data warehousing system does both bulk and incremental extraction of MySQL tables into long term storage. Since this system is proprietary to Shopify we won’t cover this area further, but if you have a similar data extraction system, you’ll need to ensure it can accommodate tables with composite primary keys.


Before we dig into specific results, a disclaimer: every table and corresponding application code is different, so the results you see in one table do not necessarily translate into another. You need to carefully consider your data’s access patterns to ensure that the primary key structure produces the optimal clustering for those access patterns. In our case of a sharded application, clustering the data by shop was often the right answer. However, if you have multiple closely connected data models, you may find another structure works better. To use a common example, if an application has “Blog” and “BlogPost” models, a suitable primary key for the blog_posts table may be (blog_id, blog_post_id). This is because typical data access patterns will tend to query posts for a single blog at once. In some cases, we found no overwhelming advantage to a composite primary key because there was no such singular data access pattern to optimize for. In one more subtle example, we found that associated records tended to be written within the same transaction, and so were already sequentially ordered, eliminating the advantage of a composite key. To extend the previous blog example, imagine if all posts for a single blog were always created in a single transaction, so that blog post records were never interleaved with insertion of posts from other blogs.

Returning to our leading example of an “orders” table, we measured a significant improvement in database efficiency:

  • The most common queries that consumed most database capacity had a 5-6x improvement in elapsed query time.
  • Performance gains corresponded linearly with a reduction in MySQL buffer pool page reads per query. Adding a composite key on our single most queried table reduced the median buffer pool reads per query from 1.8 to 1.2.
  • There was a dramatic improvement in tail latency for our slowest queries. We maintain a log of slow queries, which showed a roughly 80% reduction in distinct queries relating to the orders table.
  • Performance gains varied greatly across different kinds of queries. The most dramatic improvement was 500x on a particularly egregious query. Most queries involving joins saw much lower improvement due to the lack of similar data clustering in other tables (we expect this to improve as more tables adopt composite keys).
  • A useful measure of aggregate improvement is to measure the total elapsed database time per day, across all queries involving the changed table. This helps to add up the net benefit on database capacity across the system. We observed a reduction of roughly one hour per day, per shard, in elapsed query time from this change.

There is one notable downside on performance that is worth clearly calling out. A simple auto-incrementing primary key has optimal performance on insert statements because data is always clustered in insertion order. Changing to a composite primary key results in more expensive inserts, as more distinct database pages need to be both read and flushed to disk. We observed a roughly 10x performance degradation on inserts by changing to a composite primary key. Most data are queried and updated far more often than inserted, so this tradeoff is correct in most cases. However, it is worth keeping this in mind if insert performance is a critical bottleneck in your system for the table in question.

Wrapping Up

The benefits of data clustering and the use of composite primary keys are well-established techniques in the database world. Within the Rails ecosystem, established conventions around primary keys mean that many Rails applications lose out on these benefits. If you are operating a large Rails application, and either database performance or capacity are major concerns, it is worth exploring a move to composite primary keys in some of your tables. In our case, we faced a large upfront cost to introduce composite primary keys due to the complexity of our data infrastructure, but with that cost paid, we can now introduce additional composite keys with a small incremental effort. This has resulted in significant improvements to query performance and total database capacity in one of the world’s largest and oldest Rails applications.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

Building Mental Models of Ideas That Don’t Change

Building Mental Models of Ideas That Don’t Change

Editor's note: These ideas were also presented live and the video is available. If you find this useful, stay updated by following me on Twitter or my blog.

There’s always new stuff: new frameworks, new languages, and new platforms. All of this adds up. Sometimes it feels like you’re just treading water, and not actually getting better at what you do. I’ve tried spending more time learning this stuff, but that doesn’t work—there’s always more. I have found a better approach is learning things at a deeper level and using those lessons as a checklist. This checklist of core principles are called mental models. 

I learned this approach by studying how bright people think. You might have heard Richard Feynman describe the handful of algorithms that he applies to everything. Maybe you’ve  seen Elon Musk describe his approach as thinking by fundamental principles. Charlie Munger also credits most of his financial success to mental models. All of these people are amazing and you won’t get to their level with mental models alone, but mental models give you a nudge in the right direction.

So, how does one integrate mental models into their life and work? The first thing that you need is a method for prioritizing new concepts that you should learn. After that, you’ll need a good system for keeping track of what you have identified as important. With this process, you’ll identify mental models and use them to make more informed decisions. Below I start by describing some engineering and management mental models that I have found useful over the years.

Table of Contents

Engineering Mental Models

 Management Mental Models

 Engineering Mental Models

Avoid Silent Failures

When something breaks you should hear about it. This is important because small issues can help you find larger structural issues. Silent failures typically happen when exceptions are silenced—this may be in a networking library, or the code that handles exceptions. Failures can also be silent when one of your servers is down. You can prevent this by using a third party system that pings each of the critical components.

As your project gets more mature, set up a dashboard to track key metrics and create automated alerts. Generally, computers should tell you when something is wrong. Systems become more difficult to monitor as they grow. You want to measure and log everything at the beginning and not wait until something goes wrong. You can encourage other developers to do this by creating helper classes with a really simple APIs since things that are easy and obvious are more likely to be used. Once you are logging everything, create automated alerts. Post these alerts in shared communication channels, and automatically page the oncall developer for emergencies.

Do Minimal Upfront Work and Queue the Rest

A system is scalable when it handles unexpectedly large bursts of incoming requests. The faster your system handles a request, the faster it gets to the next one. Turns out, that in most cases, you don’t have to give a response to the request right away—just a response indicating you've started working on the task. In practice, you queue a background job after you receive a request. Once your job is in a queue, you have the added benefit of making your system fault tolerant since failed jobs can be tried again.

Scaling Reads with Caching and Denormalizing

Read-heavy systems mean some data is being read multiple times. This can be problematic because your database might not have enough capacity to deal with all of that work. The general approach of solving this is by pre-computing this data (called denormalizing) and storing it somewhere fast. In practice, instead of letting each request hit multiple tables in a database, you pre-compute the expected response and store it in a single place. Ideally, you store this information somewhere that’s really fast to read from (think RAM). In practice this means storing data in data stores like Memcached.

Scaling Writes with Sharding or Design Choices

Write-heavy systems tend to be difficult to deal with. Traditional relational databases can handle reads pretty well, but have trouble with writes. They take more time processing writes because relational databases spend more effort on durability and that can lock up writes and create timeout errors.

Consider the scenario where a relational database is at it’s write-capacity and you can’t scale up anymore. One solution is to write data to multiple databases. Sharding is the process where you split your database into multiple parts (known as shards). This process allows you to group related data into one database. Another method of dealing with a write heavy system is by writing to Non-relational (NoSQL) databases. These databases are optimized to handle writes, but there’s a tradeoff. Depending on the type of NoSQL database and its configuration, it gives up:

  • atomic transactions (they don’t wait for other transactions to fully finish), 
  • consistency across multiple clusters (they don’t wait for other clusters to have the same data),
  • durability (they don’t spend time writing to disk). 

It may seem like you are giving up a lot, but you mitigate some of these losses with design choices. 

Design choices help you cover some of the weaknesses of SQL databases. For example, consider that updating rows is much more expensive than creating new rows. Design your system so you avoid updating the same row in multiple flows—insert new rows to avoid lock contention. With all of that said, I recommend starting out with a SQL database, and evolving your setup depending on your needs.

Horizontal Scaling Is the Only Real Long Term Solution

Horizontal scaling refers to running your software on multiple small machines, while vertical scaling refers to running your software on one large machine. Horizontal scaling is more fault tolerant since failure of a machine doesn’t mean an outage. Instead, the work for the failed machine is routed to the other machines. In practice, horizontally scaling a system is the only long term approach to scaling. All systems that appear ‘infinitely-scalable’ are horizontally scaled under the hood: Cloud object stores like S3 and GCS; NoSQL databases like Bigtable and Dynamo DB; and stream processing systems like Kafka are all horizontally scaled. The cost for horizontally scaling systems is application and operational complexity. It takes significant time and potential complexity to horizontally scale your system, but you want to be in a situation where you can linearly scale your system by adding more computers.

Things That are Harder to Test Are More Likely to Break

Among competing approaches to a problem, you should pick the most testable solution (this is my variant of Occam’s Razor). If something is difficult to test, people tend to avoid testing it. This means that future programmers (or you) will be less likely to fully test this system, and each change will make the system more brittle. This model is important to remember when you first tackle a problem because good testability needs to be baked into the architecture. You’ll know when something is hard to test because your intuition will tell you.

Antifragility and Root Cause Analysis

Nassim Taleb uses the analogy of a hydra in Antifragile; they grow back a stronger head every time they are struck. The software industry championed this idea too. Instead of treating failures as shameful incidents that should be avoided at all costs, they’re now treated as opportunities to improve the system. Netflix’s engineering team is known for Chaos Monkey, a resiliency system that turns off random components. Once you anticipate random events, you can build a more resilient system. When failures do happen, they’re treated as an opportunity to learn.

Root cause analysis is a process where the people involved in a failure try to extract the root cause in a blameless way by starting off by what went right, and then diving into the failure without blaming anyone.

Big-O and Exponential Growth

The Big-O notation describes the growth in complexity of an algorithm. There’s a lot to this, but you’ll get very far if you just understand the difference between constant, linear, and exponential growth. In layman’s terms, algorithms that perform one task are better than algorithms that perform many tasks, and algorithms that perform many tasks are better than ones where the tasks are ever increasing with each iteration. I have found this issue visible at an architectural level as well.

Margin of Safety

Accounting for a margin of safety means you need to leave some room for errors or exceptional events. For example, you might be tempted to run each server at 90% of its capacity. While this saves money, it leaves your server vulnerable to spikes in traffic. You’ll have more confidence in your setup, if you have auto-scaling setup. There’s a problem with this too, your overworked server can cause cascading failures in the whole system. By the time auto-scaling kicks in, the new server may have a disk, connection pool or an assortment of other random fun issues. Expect the unexpected and give yourself some room to breathe. Margin of safety also applies to planning releases of new software. You should add a buffer of time because unexpected things will come up.

Protect the Public API

Be very careful when making changes to the public API. Once something is in the public API, it’s difficult to change or remove. In practice, this means having a very good reason for your changes, and being extremely careful with anything that affects external developers; mistakes in this type of work affect numerous people and are very difficult to revert.


Any system with many moving parts should be built to expect failures of individual parts. This means having backup providers for systems like Memcached or Redis. For permanent data-stores like SQL, fail-overs and backups are critical. Keep in mind that you shouldn’t consider something a backup unless you do regular drills to make sure that you can actually recover that data.

Loose Coupling and Isolation

Tight coupling means that different components of a system are closely interconnected. This has two major drawbacks. The first drawback is that these tightly coupled systems are more complex. Complex systems, in turn, are more difficult to maintain and more error prone. The second major drawback is that failure in one component propagates faster. When systems are loosely coupled, failures can be self contained and can be replaced by potential backups (see Redundancy). At a code level, reducing tight coupling means following the single responsibility principle which states that every class has a single responsibility and communicates with other classes with a minimal public API. At an architecture level, you improve tightly coupled systems by following the service oriented architecture. This architecture system suggests dividing components by their business services and only allows communication between these services with a strict API.

Be Serious About Configuration

Most failures in well-tested systems occur due to bad configuration; this can be changes like environmental variables updates or DNS settings. Configuration changes are particularly error prone because of the lack of tests and the difference between the development and production environment. In practice, add tests to cover different configurations, and make the dev and prod environment as similar as possible. If something works in development, but not production, spend some time thinking about why that’s the case.

Explicit Is Better than Implicit

The explicit is better than implicit model is one of the core tenants from the Zen of Python and it’s critical to improving code readability. It’s difficult to understand code that expects the reader to have all of the context of the original author. An engineer should be able to look at class and understand where all of the different components come from. I have found that simply having everything in one place is better than convoluted design patterns. Write code for people, not computers.

Code Review

Code review is one of the highest leverage activities a developer can perform. It improves code quality and transfers knowledge between developers. Great code reviewers change the culture and performance of an entire engineering organization. Have at least two other developers review your code before shipping it. Reviewers should give thorough feedback all at once, as it’s really inefficient to have multiple rounds of reviews. You’ll find that your code review quality will slip depending on your energy level. Here’s an approach to getting some consistency in reviews: 

  1. Why is this change being made? 
  2. How can this approach or code be wrong? 
  3. Do the tests cover this or do I need to run it locally? 

Perceived Performance

Based on UX research, 0.1 second (100 ms) is the gold standard of loading time. Slower applications risk losing the user’s attention. Accomplishing this load time for non-trivial apps is actually pretty difficult, so this is where you can take advantage of perceived performance. Perceived performance refers to how fast your product feels. The idea is that you show users placeholder content at load time and then add the actual content on the screen once it finishes loading. This is related to the Do Minimal Upfront Work and Queue the Rest model.

Never Trust User Input Without Validating it First

The internet works because we managed to create predictable and secure abstractions on top of unpredictable and insecure networks of computers. These abstractions are mostly invisible to users but there’s a lot happening in the background to make it work. As an engineer, you should be mindful of this and never trust input without validating it first. There are a few fundamental issues when receiving input from the user.

  1. You need to validate that the user is who they say they are (authentication).
  2. You need to ensure that the communication channel is secure, and no one else is snooping (confidentiality).
  3. You need to validate that the incoming data was not manipulated in the network (data integrity).
  4. You need to prevent replay attacks and ensure that the same data isn’t being sent multiple times.
  5. You could also have the case where a trusted entity is sending malicious data.

This is simplified, and there are more things that can go wrong, so you should always validate user input before trusting it.

Safety Valves

Building a system means accounting for all possibilities. In addition to worst case scenarios, you have to be prepared to deal with things that you cannot anticipate. The general approach for handling these scenarios is stopping the system to prevent any possible damage. In practice, this means having controls that let you reject additional requests while you diagnose a solution. One way to do this is adding an environment variable that can be toggled without deploying a new version of your code.

Automatic Cache Expiration

Your caching setup can be greatly simplified with automatic cache expiration. To illustrate why, consider the example where the server is rendering a product on a page. You want to expire the cache whenever this product changes. The manual method is by expiring the cache expiration code after the product is changed. This requires two separate steps, 1) Changing the product, and then 2) Expiring the cache. If you build your system with key-based caching, you avoid the second step all together. It’s typically done by using a combination of the product’s ID and it’s last_updated_at_timestamp as the key for the product’s cache. This means that when a product changes it’ll have a different last_updated_at_timestamp field. Since you’ll have a different key, you won’t find anything in the cache matching that key and fetch the product in it’s newest state. The downside of this approach is that your cache datastore (e.g., Memcached or Redis) will fill up with old caches. You can mitigate it by adding an expiry time to all caches so old caches automatically disappear. You can also configure Memcached so it evicts the oldest caches to make room for new ones.

Introducing New Tech Should Make an Impossible Task Possible or Something 10x Easier

Most companies eventually have to evaluate new technologies. In the tech industry, you have to do this to stay relevant. However, introducing a new technology has two negative consequences. First, it becomes more difficult for developers to move across teams. This is a problem because it creates knowledge silos within the company, and slows down career growth. The second consequence is that fewer libraries or insights can be shared across the company because of the tech fragmentation. Moving over to new tech might come up because of people’s tendency to want to start over and write things from scratch—it’s almost always a bad idea. On the other hand, there are a few cases where introducing a new technology makes sense like when it enables your company to take on previously impossible tasks. It makes sense when the technical limitation of your current stack is preventing you from reaching your product goals.

Failure Modes

Designers and product folks focus on the expected use cases. As an engineer you also have to think about the worst case scenarios because that’s where the majority of your time will go. At scale, all bad things that can happen do happen. Asking “What could go wrong” or “How can I be wrong” really helps; these questions also cancel out our bias towards confirmation of our existing ideas. Think about what happens when no data, or a lot of data is flowing through the system (Think “Min-Max”). You should expect computers to occasionally die and handle those cases gracefully, and expect network requests to be slow or stall all together.

Management Mental Models

The key insight here is that "Management" might be the wrong name for this discipline all together. What you are really doing is growing people. You’ll rarely have to manage others if you align your interests with the people that report to you. Compared to engineering, management is more fuzzy and subjective. This is why engineers struggle with it. It's really about calibration; you are calibrating approaches for yourself and your reports. What works for you, might not work for me because the world has different expectations from us. Likewise, just reading books on this stuff doesn't help because advice from the book is calibrated for the author.

With that said, I believe the following mental models are falsifiable. I also believe that doing the opposite of these will always be harmful. I find these particularly valuable while planning my week. Enjoy!

Create Motivation by Aligning Incentives

Incentives drive our behavior above all else. You probably feel this yourself when you procrastinate on tasks that you don't really want to do.  Work with your reports to identify the intersection of:

  1. What do they want to work on?
  2. What does the product need?
  3. What does that company need?  

Venn diagram of the intersection of the three incentives
The intersection of the three incentives

Magic happens when these questions produce themes that overlap with each other. The person will have intrinsic motivation for tasks that build their skills, improve their product, and their company. Working on the two intersecting themes to these questions can be fruitful too. You can replace 'product' with 'direct team' if appropriate.

 Occasionally you'll find someone focusing on a task that's only:

  • done because that's what the person wants (neglecting the product and the company)
  • what the product needs (neglecting the person's needs or the company)
  • what the company wants (neglecting the person's needs or their product). 

This is fine in the short term, but not a good long term strategy. You should nudge your reports towards these overlapping themes.

Create Clarity by Understanding the "Why" and Having a Vision for the Product

You should have a vision for where your product needs to go. This ends up being super helpful when deciding between competing tactical options and also helps clear up general confusion. You must communicate the vision with your team.  While being a visionary isn't included in your job description, aligning on a "why" often counteracts the negative effects of broken-telephone effect in communication and entropy in organizations.

Focus on High Leverage Activities

This is the central idea in High output management. The core idea is similar to the "Pareto principle" where you focus your energy on the 20% of the tasks that have 80% of the impact. If you don't do this, your team will spend a lot of time, but not accomplish much. So, take some time to plan your approach and focus on the activities that give you the most leverage. I found Donella Medow’s research to be a super user for understanding leverage. A few examples of this include:

Promote Growth Mindset in Your Team

If you had to start with zero, what you'd want is the ability to acquire new skills. You want to instil a mindset of growth in yourself and the rest of the team. Create  an environment where reflection and failures are talked about. Lessons that you truly learn are the ones that you have learnt by making mistakes. Create an environment where people obsess about the craft and consider failures a learning opportunity.

Align Your Team on the Common Vision

Aligning your team towards a common direction is one of the most important things you can do. It'll mean that people will go in the same general direction.

Build Self-organizing Teams

Creating self-sufficient teams is the only way to scale yourself. You can enable a team to do this by promoting a sense of ownership. You can give your input without taking authority away from others and offer suggestions without steamrolling leaders.

Communication and Structural Organization

You should focus on communication and organization tools that keep the whole team organized. Communication fragmentation leads to massive waste.

Get the Architecture Right

This is where your engineering chops will come in handy. From a technical perspective, getting the core pieces of the architecture right ends up being critical and defines the flow of information in the system.

Don’t Try to be Efficient with Relationships

As an engineer your brain is optimized to seek efficiency. Efficiency isn’t a good approach when it comes to relationships with people as you often have the opposite effect as to what you intended. I have found that 30 minute meetings are too fast for one-on-ones with your reports. You want to give some time for banter and a free flow of information. This eases people up, you have better conversations and they often end up sharing more critical information than they would otherwise. Of course, you don't want to spend a lot of time in meetings, so I prefer to have longer infrequent meetings instead of frequent short meetings.

This model also applies to pushing for changes or influencing others in any way. This is a long game, and you should be prepared for that. Permanent positive behavioral changes take time.

Hire Smart People Who Get Stuff Done and You Want to Be Around

Pick business partners with high intelligence, energy, and, above all, integrity
Pick business partners with high intelligence, energy, and, above all, integrity. -@Naval

Hiring, when done right, is one of the highest-leverage activities that you can work on. You are looking for three key signals when hiring:

When looking for the "smart" signal, be aware of the "halo effect" and be weary of charmers. "Get stuff done" is critical because you don't want to be around smart people who aren’t adding value to your company. Just like investing, your aim should be to identify people on a great trajectory. "Good to be around" is tricky because it's filled with personal bias. A good rule is to never hire assholes. Even if they are smart and get stuff done, they’ll do that at the expense of others and wreak havoc on the team. Avoid! Focus on hiring people that you would want long term relationships with. It is also important to differentiate between assholes and disagreeable people. A disagreeable and useful person is much better to be around than someone who is agreeable but not useful. 

Be Useful

You could behave in a number of different ways at any given time or interaction. What you want is to be useful and add value. There is a useful way of giving feedback to someone that reports you and a useful way to review code. Apply this to yourself and plan your day to be more useful to others. Our default instinct is to seek confirmation bias and think about how we are right. We don’t give others the same courtesy. The right approach is to reverse that default instinct: Think “how can I make this work” for other people, and “how can this be wrong” for your own ideas.

Don’t compete with your reports either. As a manager, this is particularly important because you want to grow your reports. Be aware of situations or cases where you might be competing, and default to being useful instead of pushing your own agenda.

Get the Requirements Right Early and Come Up with a Game Plan

Planning gets a bad rep in fast moving organizations, but it ends up being critical in the long term. Doing some planning almost always ends up being much better than no planning at all. What you want to do is plan until you have a general direction defined and start iterating towards that. There are a few questions can help getting these requirements right:

  • What are the things that we want in the near future? You want to pick the path that gives you the most options for the expected future.
  • How can this be wrong? Counteract your confirmation bias with this question by explicitly thinking about failure modes.
  • Where do you not want to go? Inversion ends up being really useful. It’s easier to avoid stupidity than seeking brilliance
  • What happens once you get there? Seek second order effects. What will one path unlock or limit?
  • What other paths could you take? Has your team settled on a local maxima instead and not the global maxima?

Once you have a decent idea of how to proceed with this, you are responsible for communicating this plan with the rest of the team too. Not getting the requirements right early on means that your team can potentially end up going in the wrong direction which ends up being a net negative.

Establish Rapport Before Getting to Work

You will be much more effective at work, if you connect with the other people before getting to work. This could mean banter, or just listening—there’s a reason why people small-talk.  Get in the circle before attempting to change the circle. This leads to numerous positive improvements in your workflow. Slack conversations will sound like conversations instead of arguments and you'll assume positive intent.  You’ll also find that getting alignment in meetings and nudging reports towards positive changes ends up being much more useful this way. Icebreakers in meetings and room for silliness helps here.

There Is No One-size-fits-all Approach to People. Personality Tests Are Good Defaults

Management is about calibration. You are calibrating  your general approach to others, while calibrating a specific approach to each person. This is really important because an approach that might work for one person won’t work on others. You might find that personality tests like the Enneagram serve as great defaults to approaching a person. Type-5, the investigators, work best when you give them autonomy and new ideas. Type-6, the loyalists, typically want frequent support and the feeling of being entrusted. The Last Dance miniseries on Netflix is a master class on this topic.

Get People to Lead with Their Strengths and Address Their Growth Areas as a Secondary Priority

There are multiple ways to approach a person's personal growth. I’ve found that what works best is first identifying their strengths and then areas of improvements. Find people’s strengths and obsessions then point them to that. You have to get people to lead with their strengths. It’s the right approach because it gives people confidence and momentum. Turns out, that’s also how they add the most value. Ideally, one should develop to be more well-rounded, so it’s also important to come up with a game plan for addressing any areas of improvements. 

Focus on the Positives and Don't Over Index on the Negatives

For whatever reasons, we tend to focus on the negatives more than we should. It might be related to "deprival super reaction syndrome" where we hate losing more than we like winning. In management, we might have the proclivity to focus on what people are doing poorly instead of what they’re doing well. People may not feel appreciated if you only focus on the negatives. I believe this also means that we end up focusing on improving low-performers more than amplifying high-performers. Amplifying high-performers may have an order of magnitude higher impact.

People Will Act Like Owners if You Give Them Control and Transparency

Be transparent and don’t do everything yourself. Talk to people and make them feel included. When people feel left out of the loop, they generally grow more anxious as they feel that they’re losing control. Your ideal case is that your reports act like owners. You can do this by being transparent about how decisions are made. You also have to give others control and autonomy. Expect some mistakes as they calibrate their judgement and nudge in the right direction instead of steamrolling them.

There are other times where you'll have to act like an owner and lead by example. One hint of this case will be when you have a nagging feeling about something that you don't want to do. Ultimately, the right thing here is to take full ownership and not ask others to do what you wouldn't want.

Have High Standards for Yourself and the People Around You

Tech markets and products are generally winner-take-most. This means that second place isn’t a viable option—winning, in tech, leads to disproportionately greater rewards. Aim to be the best in the world in your space and iterate towards that. There’s no point in doing things in a mediocre way.  Aiming high, and having high standards is what pushes everything forward.

To make this happen, you need to partner with people who expect more from you. You should also have high standards for your reports. One interesting outcome of this is that you get the positive effects of the Pygmalion effect: people will rise to your positive expectations.

Hold People Accountable

When things don't get delivered or when you see bad behavior, you have to have high standards and hold people accountable. If you don't, that's an implicit message that these bad outcomes are ok. There are many cases where not holding others accountable can have a spiraling effect.

Your approach to this has to be calibrated for your work style and the situation. Ultimately, you should enforce all deal-breaker rules. Set clear expectations early on. When something goes wrong, work with the other person or team to understand why. Was it a technical issue, does the tooling need to be improved, or was it an issue with leadership?

Bring Other People Up with You

We like working with great and ambitious people because they raise the bar for everyone else. We’re allergic to self-obsessed people who only care about their own growth. Your job as a manager is to bring other people up. Don’t take credit for work that you didn’t do and give recognition to those that did the work. What's interesting is that most people only really care about their own growth. So, being the person who actually spends time thinking about the growth of others differentiates you, making more people want to work with you.

Maintain Your Mental Health with Mindfulness, Rest, and Distance

A team's culture is top down and bottom up. This means people mimic the behavior of others in the position of authority—for better or for worse. Keeping this in mind, you have to be aware of your own actions. Generally, most people become less aware of their actions as fatigue builds up. Be mindful of your energy levels when entering meetings.  Energy and positivity is phenomenal, because it's something that you can give to others, and it doesn't cost you anything.

Stress management is another important skill to develop. Most people can manage problems with clear yes/no solutions. Trickier problems with nuances and unclear paths, or split decisions tend to bubble up. Ambiguity and conflicting signals are a source of stress for many people and treating this like a skill is really important. Dealing with stressful situations by adding more emotion generally doesn't help. Keep your cool in stressful situations.

Aim to Operate Two to Three Months Into the Future

As an engineer you typically operate in scope between days and weeks. As you expand your influence, you also have to start thinking in a greater time horizon. As a manager, your horizon will be longer than a typical engineer, but smaller than someone who focuses on high level strategy. This means that you need to project your team and reports a few months into the future and anticipate their challenges. Ideally, you can help resolve these issues before they even happen. This exercise also helps you be more proactive instead of reacting to daily events.

Give Feedback to People That You Want Long Term Relationships with

Would you give feedback to a random stranger doing something that's bad for them? Probably not. Now, imagine that you knew this person. You would try to reason with this person, and hopefully nudge them in the right direction. You give feedback to people that you care about and want a long term relationship with. I believe this is also true at work. Even if someone isn’t your report, it’s worth sharing your feedback if you can deliver it usefully.

Giving feedback is tricky since people often get defensive. There are different schools of thoughts on this, but I try to build a rapport with someone before giving them direct feedback. Once you convince someone that you are on their side, people are much more receptive to it. Get in the circle. While code review feedback is best when it's all at once, that isn't necessarily true for one-on-one feedback. Many people default to quick-feedback and I think that doesn't work for people you don't have good rapport with and that it only really works if you are in a position of authority. The shortest path is not always the path of least resistance, and so you should build rapport before getting to work.

ShipIt! Presents: Building Mental Models of Ideas That Don’t Change


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions.

Continue reading

How to Do an In-depth Liquid Render Analysis with Theme Inspector

How to Do an In-depth Liquid Render Analysis with Theme Inspector

Shopify’s Online Store provides greater flexibility to build themes representing the brand of a merchant’s online store. However, like with any programming language, one often writes code without being aware of all the performance impact it creates. Whether it be performance impact on Shopify’s servers or observable performance impact on the browser, ultimately, it’s the customers that experience the slowness. The speed of server-side rendering is one of the most important performance timings to optimize for. That’s because customers wait on a blank screen until server-side rendering completes. Even though we’re working hard to make server-side rendering as fast as possible, bottlenecks may originate from the Liquid source itself.

In this article, I’ll look at:  

  • how to interpret flame graphs generated by the Shopify Theme Inspector
  • what kind of flame graphs generate from unoptimized Liquid code patterns
  • tips for spotting and avoiding these performance issues.

Install the Shopify Theme Inspector

With a Google Chrome browser, install the Shopify Theme Inspector extension. Follow this article on Debug Liquid Render Performance with Shopify Theme Inspector for Chrome for how to start with the extension and get to a point where you can produce a flame graph on your store.

A flame graph example
A flame graph example

The flame graph produced by this tool is a data representation of the code path and the time it took to execute. With this tool, as a developer, you can find out how long a piece of code took to render.

Start with Clean Code

We often forget what a clean implementation looks like, and this is often how we, Shopify, envision this piece of liquid code will be used—it’s often not the reality as developers will find their own ways to achieve their goals. As time passes, code becomes complicated. We need to go back to the clean implementation to understand what makes it take the time to render.

The simple code above creates a flame graph that looks like this image below:

Flame graph for a 10 item paginated collection
Flame graph for a 10 item paginated collection

The template section took 13 ms to complete rendering. Let’s have a better understanding of what we are seeing here.

Highlighted flame graph for a 10 item paginated collection
Highlighted flame graph for a 10 item paginated collection

The area where the server took the time to render is where the code for the pagination loop is executed. In this case, we rendered 10 product titles. Then there’s a block of time that seems to disappear. It‘s actually the time spent on Shopify’s side collecting all the information that belongs to the products in the paginate collection.

Look at Inefficient Code

To know what’s an inefficient code, one must know what it looks like, why it is slow, and how to recognize it in the flame graph. This section walks through a side-by-side comparison of code and it’s flame graphs, and how a simple change results in bad performance.

Heavy Loop

Let’s take that clean code example and make it heavy.

What I’ve done here is accessed attributes in a product while iterating through a collection. Here’s the corresponding flame graph:

Flame graph for a 10 item paginated collection with accessing to its attributes
Flame graph for a 10 item paginated collection with accessing to its attributes

The total render time of this loop is now at 162 ms compared to 13 ms from the clean example. The product attributes access changes a less than 1 ms render time per tile to a 16 ms render time per tile. This produces exactly the same markup as the clean example but at the cost of 16 times more rendering time. If we increase the number of products to paginate from 10 to 50, it takes 800 ms to render.


  • Instead of focusing on how many 1 ms bars there are, focus on the total rendering time of each loop iteration
  • Clean up any attributes aren’t being used
  • Reduce the number of products in a paginated page (Potentially AJAX the next page of products)
  • Simplify the functionality of the rendered product

Nested Loops

Let’s take that clean code example and make it render with nested loops.

This code snippet is a typical example of iterating through the options and variations of a product. Here’s the corresponding flame graph:

Flame graph for two nested loop example
Flame graph for two nested loop example

This code snippet is a two-level nested loop rendering at 55 ms.

Nested loops are hard to notice when just looking at code because it’s separated by files. With the flame graph, we see the flame graph start to grow deeper.

Flame graph of a single loop on a product
Flame graph of a single loop on a product

As highlighted in the above screenshot, the two inner for-loops stacks side by side. This is okay if there are only one or two loops. However, each iteration rendering time will vary based on how many inner iterations it has.

Let’s look at what a three nested loop looks like.

Flame graph for three nested loop example
Flame graph for three nested loop example

This three level nested loop rendered at 72 ms. This can get out of hand really quickly if we aren’t careful. A small addition to the code inside the loop could blow your budget on server rendering time.


  • Look for a sawtooth shaped flame graph to target potential performance problem
  • Evaluate each flame graph layer and see if the nested loops are required

Mix Usage of Multiple Global Liquid Scope

Let’s now take that clean code example and add another global scoped liquid variable.

And here’s the corresponding flame graph:

Flame graph of when there’s one item in the cart with rendering time at 45 ms
Flame graph of when there’s one item in the cart with rendering time at 45 ms

Flame graph of when there’s 10 items in the cart with rendering time at 124 ms
Flame graph of when there’s 10 items in the cart with rendering time at 124 ms

This flame graph is an example of a badly nested loop where each variation is accessing the cart items. As more items are added to the cart, the page takes longer to render.


  • Look for hair comb or sawtooth shaped flame graph to target potential performance problem
  • Compare flame graphs between one item and multiple items in cart
  • Don’t mix global liquid variable usage. If you have to, use  AJAX to fetch for cart items instead

What is Fast Enough?

Try to aim for 200 ms but no more than 500 ms total page rendering time reported by the extension. We didn’t just pick a number out of the hat. It’s made with careful consideration of what other allocation of available time during a page render that we need to include to hit a performance goal. Google Web Vitals stated that a good score for Largest Content Paint (LCP) is less than 2.5 seconds. However, the largest content paint is dependent on many other metrics like time to first byte (TTFB) and first content paint (FCP). So, let’s make some time allocation! Also, let’s understand what each metric represents:

Flow diagram: Shopify Server to Browser to FCP to LCP

  • From Shopify’s server to a browser is the network overhead time required. It varies based on the network the browser is on. For example, navigating your store on 3G or Wi-Fi.
  • From a browser blank page (TTFB) to showing anything (FCP) is the time the browser needs to read and display the page.
  • From the FCP to the LCF is the time the browser needs to get all other resources (images, css, fonts, scripts, video, … etc.) to complete the page.

The goal is an LCP < 2.5 seconds to receive a good score

Server → Browser

300 ms for network overhead

Browser → FCP

200 ms for browser to do its work


1.5 sec for above the fold image and assets to download


Which leaves us 500 ms for total page render time.

Does this mean that as long as we keep server rendering below 500 ms, we can get a good LCP score? No, there’s other considerations like critical rendering path that aren’t addressed here, but we’re at least half way there.


  • Optimizing for critical rendering path on the theme level can bring the 200 ms requirement between the browser to FCP timing down to a lower number.

So, we have 500 ms for total page render time, but this doesn’t mean you have all 500 ms to spare. There’s some mandatory server render times that are dedicated to Shopify and others that the theme dedicates to rendering global sections like the header and footer. Depending how you want to allocate the rendering resources, the available rendering time you leave yourself for the page content varies. For example:


500 ms

Shopify (content for header)

50 ms

Header (with menu)

100 ms


25 ms


25 ms

Page Content

300 ms

I mentioned trying to aim for 200 ms total page rendering time—this is a stretch goal. By keeping ourselves mindful of a goal, it’s much easier to start recognizing when performance starts to degrade.

An Invitation to the Shopify Theme Developer Community

We couldn’t possibly know every possible combination of how the world is using Shopify. So, I invite you to share your experience with Shopify’s Theme Inspector and let us know how we can improve at or tweet us at @shopifydevs.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Build an Experiment Pipeline from Scratch

How to Build an Experiment Pipeline from Scratch

One of the most compelling ways to prove the value of any decision or intervention—to technical and non-technical audiences alike—is to run an A/B test. But what if that wasn’t an option on your current stack? That’s the challenge we faced at Shopify. Our amazing team of engineers built robust capabilities for experimentation on our web properties and Shopify admin experiences, but testing external channels like email was unexplored. When it came time to ship a new recommendation algorithm that generates personalized blog post suggestions, we had no way to measure its incremental benefit against the generic blog emails.

To address the problem I built an email experimentation pipeline from the ground up. This quick build helps the Marketing Data Science team solve challenges around experimentation for external channels, and it’s in use by various data teams across Shopify. Below is a guide that teaches you how to implement a similar pipeline with a relatively simple setup from scratch. 

The Problem

Experimentation is one of the most valuable tools for data teams, providing a systematic proof of concept mechanism for interface tweaks, product variations, and changes to the user experience. With our existing experimentation framework Verdict, we can randomize shops, as well as sessions for web properties that exist before the login gate. However, this didn’t extend to email experiments since the randomization isn’t triggered by a site visit and the intervention is in the user’s inbox, not our platform.

As a result, data scientists randomized emails themselves, shipped the experiment, and stored the results in a local text file. This was problematic for a number of reasons: 

  1. Local storage isn’t discoverable and can be lost or deleted. 
  2. The ad hoc randomization didn’t account for users that unsubscribed from our mailing list and didn’t resolve the many-to-many relationship of emails to shops, creating the risk for cross-contamination between the variations. Some shops have multiple staff each with an email address, and some people create multiple shops under the same email address.
  3. Two marketers can simultaneously test the same audience with no exclusion criteria, violating the assumption that all other variables are controlled. 

Toward the end of 2019, when email experimentation became even more popular among marketers as the marketing department grew at Shopify, it became clear that a solution was overdue and necessary.

Before You Start

There are few things I find more mortifying than shipping code just to ship more code to fix your old code, and I’m no stranger to this. My pull requests (PRs) were rigorously reviewed, but myself and the reviewers were in uncharted territory. Exhibit A: a selection of my failed attempts at building a pipeline through trial and error: 

Github PR montage, showing a series of bug fixes.
Github PR montage, showing a series of bug fixes

All that to say that requirements gathering isn’t fun, but it’s necessary. Here are some steps I’d recommend before you start.

1. Understanding the Problem

The basic goal is to create a pipeline that can pull a group constrained by eligibility criteria, randomly assign each person to one of many variations, and disseminate the randomized groups to the necessary endpoints to execute the experiment. The ideal output is repeatable and usable across many data teams. 

We define the problem as: given a list of visitors, we want to randomize so that each person is limited to one experiment at a time, and the experiment subjects can be fairly split among data scientists who want to test on a portion of the visitor pool. At this point, we won’t outline the how, we’re just trying to understand the what.

2. Draw a System Diagram

Get the lay of the land with a high-level map of how your solution will interact with its environment. It’s important to be abstract to prevent prescribing a solution; the goal is to understand the inputs and outputs of the system. This is what mine looked like:

Example of a system diagram for email experiment pipeline
Example of a system diagram for email experiment pipeline

In our case, the data come from two sources: our data warehouse and our email platform.

In a much simpler setup—say, with no ETL at all—you can replace the inputs in this diagram with locally-stored CSVs and the experiment pipeline can be a Jupyter notebook. Whatever your stack may be, this diagram is a great starting point.

3. Plan the Ideal Output

I anticipated the implementation portion to be complicated, so I started by whiteboarding my ideal production table and reverse-engineered the necessary steps. Some of the immediate decisions that arose as part of this exercise were:

  1. Choosing the grain of the table: subjects will get randomized at the shop grain, but the experience of the experiment variation is surfaced with the primary email associated with that shop.
  2. Considering necessary resolvers: each experiment is measured on its own success metric, meaning the experiment output table needs to be joined to other tables in our database.
  3. Compatibility with existing analysis framework: I didn’t want to reinvent the wheel; we already have an amazing experiment dashboard system, which can be leveraged if my output is designed with that in mind.

I built a table with one row per email, per shop, per experiment, and with some additional attributes detailing the timing and theme of the experiment. Once I had a rough idea of this ideal output, I created a mock version of an experiment with some dummy data in a CSV file that I uploaded as a temporary table in our data warehouse. With this, I brainstormed some common use cases and experiment KPIs and attempted to query my fake table. This allowed me to identify pitfalls of my first iteration; for example, I realized that in my first draft that my keys wouldn’t be compatible with the email engagement data we get from the email platform API, which is the platform that sends our emails.

I sat with some stakeholders that included my teammates, members of the experimentation team, and non-technical members of the marketing organization. I did a guided exercise where I asked them to query my temporary table and question whether the structure can support the analysis required for their last few email experiments. In these conversations, we nailed down several requirements: 

  • Exclude subjects from other experiments: all subjects in a current experiment should be excluded from other experiments for a minimum of 30 days, but the tool should support an override for longer exclusion periods for testing high risk variations, such as product pricing.
  • Identify missing experiment category tags: the version of the output table I had was missing the experiment category tags (ex. research, promotional, etc) which is helpful for experiment discoverability.
  • Exclude linked shops: if an email was linked to multiple shops that qualified for the same experiment, all shops linked to that email should be excluded altogether.
  • Enable on-going randomization of experiments: the new pipeline should allow experiments to randomize on an ongoing basis, assigning new users as they qualify over time (as opposed to a one-time batch randomization).
  • Backfill past experiments into the pipeline: all past email experiments needed to be backfilled into the pipeline, and if a data scientist inadvertently bypassed this new tool, the pipeline needs to support a way to backfill these experiments as well. 

After a few iterations and with stakeholders’ blessing, I was ready to move to technical planning.

4. Technical Planning

At Shopify, all major projects are drafted in a technical document that’s peer-reviewed by a panel of data scientists and relevant stakeholders. My document included the ideal output and system requirements I’d gathered in the planning phase, as well as expected use cases and sample queries. I also had to draw a blueprint for how I planned to structure the implementation in our ETL platform. After chatting with my lead and discussions on Slack, I decided to build the pipeline in three stages, demonstrated by the diagram below.

Proposed high-level ETL structure for the randomization pipeline
Proposed high-level ETL structure for the randomization pipeline

Data scientists may need to ship experiments simultaneously; therefore for the first phase, I needed to create an experiment definition file that defines the criteria for candidates in the form of a SQL query. For example, you may want to limit a promotional offer to shops that have been on the platform for at least a year, and only in a particular region. This also allows you to tag your experiment with the necessary categories and specify a maximum sample size, if applicable. All experiment definition files are validated on an output contract as they need to be in agreement to be unioned in the next phase.

Phase two contains a many-to-one transform stage that consolidates all incoming experiments into a single output. If an experiment produces randomizations over time, it continues to append new rows incrementally. 

In phase three, the table is filtered down in many ways. First, users that have been chosen for multiple experiments are only included in the first experiment to avoid cross-contamination of controlled variables. Additionally, users with multiple shops within the same experiment are excluded altogether. This is done by deduping a list at the email grain with a lookup at the shop grain. Finally, the job adds features such as date of randomization and indicators for whether the experiment included a promotional offer.

With this blueprint in hand, I scheduled a technical design review session and pitched my game plan to a panel of my peers. They challenged potential pitfalls, provided engineering feedback, and ultimately approved the decision to move into build.

5. Building the Pipeline

Given the detailed planning, the build phase follows as the incremental implementation of the steps described above. I built the jobs in PySpark and shipped in increments, small enough increments to be consumable by code reviewers since all of the PRs totalled several thousand lines of code.

6. Ship, Ship, Ship! 

Once all PRs were shipped into production, the tool was ready to use. I documented its use in our internal data handbook and shared it with the experimentation team. Over the next few weeks, we successfully shipped several email experiments using the pipeline, which allowed me to work out small kinks in the implementation as well. 

The biggest mistake I made in the shipping process is that I didn’t  share the tool enough across the data organization. I found that many data scientists didn’t know the tool existed, and continued to use local files as their workaround solution. Well, better late than never, I did a more thorough job of sending team-wide emails, posting in relevant Slack channels, and setting up GitHub alerts to notify me when other contributors edit experiment files.

As a result, the tool has not only been used by the Marketing Data Science, but across the data organization by teams that focus on shipping, retail and international growth, to ship email experiments for the past year. The table produced by this pipeline integrated seamlessly with our existing analysis framework, so no additional work was required to see statistical results once an experiment is defined.

Key Takeaways

To quickly summarize, the most important takeaways are:

  1. Don’t skip out on requirement gathering! Understand the problem you’re trying to solve, create a high-level map of how your solution will interact with its environment, and plan your ideal output before you start.
  2. Draft your project blueprint in a technical document and get it peer-reviewed before you build.
  3. When building the pipeline, keep PRs smaller where possible, so that reviewers can focus on detailed design recommendations and so production failures are easier to debug.
  4. Once shipped, make sure you share effectively across your organization.

Overall, this project was a great lesson that a simple solution, built with care for engineering design, can quickly solve for the long-term. In the absence of a pre-existing A/B testing framework, this type of project is a quick and resourceful way to unlock experimentation for any data science team with very few requirements from the data stack.

Are you passionate about experiments and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

Continue reading

Images as Code: Representing Localized and Evolving Products on Marketing Pages

Images as Code: Representing Localized and Evolving Products on Marketing Pages

Last year, our marketing team kicked off a large effort based on user research to revamp to better serve the needs of our site visitors. We recognized that visitors wanted to see screenshots and visuals of the product itself, however we found that most of the screenshots across our website were outdated.

The image was used to showcase our Shopify POS software on, however it misrepresented our product when our POS software was updated and rebranded.
Old Shopify POS software on

The above image was used to showcase our Shopify POS software on, however it misrepresented our product when our POS software was updated and rebranded. 

While we first experimented with a Scalable Vector Graphics (SVG) based solution to visuals, we found that it wouldn’t scale and forced us to restrict usage to only “high-value” pages. Still, other teams expressed interest in this approach, so we recreated these in HTML and JavaScript (JS) and compared the lift between them. The biggest question was around getting these to resize in a given container—with SVG all content, including text size, grows and shrinks proportionally with a width of 100%, appearing as an image to users. With CSS there’s no way to get font sizes to scale proportionally to a container, only the window. We created a solution that resizes all the contents of the element at the same rate in response to container size, and reused it to create a better

The Design Challenge

We wanted to create new visuals of our product that needed to be available and translated across more than 35 different localized domains. Many domains support different currencies, features, and languages. Re-capturing screenshots on each domain to keep in sync with all our product changes is extremely inefficient.

Screenshots of our product were simplified in order to highlight features relevant to the page or section.
Screenshots of our product were simplified in order to highlight features relevant to the page or section.

After a number of iterations and as part of a collaborative effort outlined in more detail by Robyn Larsen on our UX blog, our design team came up with simplified representations of our user interface, UI Illustrations as we called them, for the parts of the product that we wanted to showcase. This was a clever solution to drive user focus to the parts of the product that we’re highlighting in each situation, however it required that someone maintain translations and versions of the product as separate image assets. We had an automated process for updating translations in our code but not in the design editor. 

What Didn’t Work: The SVG Approach

As an experimental solution, we attempted to export these visuals as SVG code and added those SVGs inline in our HTML. Then we’d replace the text and numbers with translated and localized text.

SVGs don’t support word wrapping so visuals with long translations would look broken.
SVGs don’t support word wrapping so visuals with long translations would look broken.

Exported SVGs were cool, they actually worked to accomplish what we had set out to do, but they had a bunch of drawbacks. Certain effects like gaussian blur caused performance issues in Firefox, and SVG text doesn’t wrap when reaching a max-width like HTML can. This resulted in some very broken looking visuals (see above). Languages with longer word lengths, like German, had overflowing text. In addition, SVG export settings in our design tool needed to be consistent for every developer to avoid massive changes to the whole SVG structure every time someone else exported the same visual. Even with a consistent export process, the developer would have to go through the whole process of swapping out text with our own hooks for translated content again. It was a huge mess. We were writing a lot of documentation just to create consistency in the process, and new challenges kept popping up when new settings in Sketch were used. It felt like we had just replaced one arduous process with another.

Our strategy of using SVGs for these visuals was quickly becoming unmanageable, and that was just with a few simple visuals. A month in, we still saw a lot of value in creating visuals as code, but needed to find a better approach.

Our Solution: The HTML/JavaScript Approach

After toying around with using JS to resize the content, we ended up with a utility we call ScaleContentAsImage. It calculates how much size is available for the visual and then resizes it to fit in that space. Let’s break down the steps required to do this.

Starting with A Simple Class

We start by creating a simple class that accepts a reference to the element in the DOM that we want to scale, and initialize it by storing the computed width of the element in memory. This assumes that we assigned the element a fixed pixel width somewhere in our code already (this fixed pixel width matches the width of the visual in the design file). Then we override the width of the element to 100% so that it can fill the space available to it.I’ve purposely separated the initialization sequence into a separate method from the constructor. While not demonstrated in this post, that separation allows us to add lazy loading or conditional loading to save on performance.

Creating an Element That We Can Transform

Next we’ll need to create an element that will scale as needed using a CSS transform. We assign that wrapper the fixed width that its parent used to have. Note that we haven’t actually added this element anywhere in the DOM yet.

Moving the Content Over

We transfer all the contents of the visual out from where it is now and into the wrapper we just created, and put that back into the parent element. This method preserves any event bindings (such as lazy load listeners) that previously were bound to these elements. At this point the content might overflow the container, but we’ll apply the transform to resolve that.

Applying the Transformation

Now, we determine how much the wrapper should scale the contents by and apply that property. For example, if the visual was designed to be 200px wide but it’s rendered in an element that’s 100px wide, the wrapper would be assigned transform: scale(0.5);.

Preserving Space in the Page

A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right. The image is of the Shopify admin, with the Shopify logo, a search bar, and a user avatar next to “Helen B.” on the top of the screen. Below in a grid are summaries of emails delivered, purchase totals, and total growth represented in graphs.
A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right.

So now our visual is resizing correctly, however our page layout is now looking all wonky. Our text content and the visual are meant to display as equal width side-by-side, like the above.

So why does the page look like this? The colored highlight shows what’s happening with our CSS transform.

A screenshot of a webpage in a desktop viewport where text is pushed to the very left of the screen taking up one sixth of the width. The image of the Shopify admin to the right is only one third of the screen wide. The entire right half of the page is empty. The image is highlighted in blue, however a larger green box is also highlighted in the same position taking up more of the empty space but matching the same dimensions of the image. A diagonal line from the bottom right corner of the larger green box to the bottom right corner of the highlighted image hints at a relationship between both boxes.
A screenshot of a webpage in a desktop viewport after CSS transform.

CSS transforms don’t change the content flow, so even though our visual size is reduced correctly, the element still takes up its fixed width. We add some additional logic here to fix this problem by adding an empty element that takes up the correct amount of vertical space. Unless the visual contains images that are lazy loaded, or animates vertically in some way, we only need to make this calculation once since a simple CSS trick to maintain an aspect ratio will work just fine.

Removing the Transformed Element from the Document Flow

We also need to set the transformed element to be absolutely positioned, so that it doesn’t affect the document flow.

A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right. The image is of the Shopify admin, with the Shopify logo, a search bar, and a user avatar next to “Helen B.” on the top of the screen. Below in a grid are summaries of emails delivered, purchase totals, and total growth represented in graphs.
A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right.

Binding to Resize

Success! Looks good! Now we just add a bit of logic to update our calculations if the window is resized.

Finishing theCode

Our class is now complete.

Looking at the Detailed Changes in the DOM

1. Before JS is initialized, in this example the container width is 378px and the assigned width of the element is 757px. The available space is about 50% of the original size of the visual.

A screenshot of a page open in the browser with developer tools open. In the page, a UI Illustration is shown side-by-side with some text, and the highlighted element in the inspector matches that as described above. The override for the container width can be seen in the style inspector. In addition, the described element is assigned a property of “aria-hidden: true”, and its container is a `div` with `role: “img”` and an aria-label describing the visual as “View of the Shopify admin showing emails delivered, purchase totals, and total growth represented in graphs”
A screenshot of a page open in the browser with developer tools open

2. As seen in our HTML post-initialization, in JS we have overridden the size of the container to be 100%

3. We’ve also moved all the content of the visual inside of a new element that we created, to which we apply a scale of 0.5 (based on the 50% calculated in step 1).

4. We absolutely position the element that we scaled so that it doesn’t disturb the document flow.

5. We added a placeholder element to preserve the correct amount of space in the document flow.

A Solution for a React Project

For a project using React, the same thing is accomplished without any of the logic we wrote to create, move, or update the DOM elements. The result is a much simpler snippet that only needs to worry about determining how much space is available within its container. A project using CSS-in-JS benefits in that the fixed width is directly passed into the element.

Problems with Localization and Currency

Shopify Order Summary Page
Shopify admin order summary page

An interesting problem we ran into was displaying prices in local currencies for fictional products. For instance, we started off with a visual of a product, a checkout button, and an order summary. Shown in the order summary were two chairs, each priced at ~$200, which were made-up prices and products for demonstrative purposes only.

It didn’t occur to us that 200 Japanese Yen is the equivalent of under $1.89 USD (today), so when we just swapped the currency symbol the visual of the chair did not realistically match the price. We ended up creating a table of currency conversion rates pulled on that day. We don’t update those conversion values on a regular basis, since we don’t need accurate rates for our invented prices. We’re ok with fluctuations, even large ones, as long as the numbers look reasonable in context. We obviously don’t take this approach with real products and prices.

Comparing Approaches: SVG vs HTML/JavaScript

The HTML/JS approach took some time to build upfront, but its advantages clearly outweighed the developer lift required even from the start. The UI Illustrations were fairly quick to build out given how simply and consistently they were designed. We started finding that other projects were reusing and creating their own visuals using the same approach. We created a comparison chart between approaches, evaluating major considerations and the support for these between the two.

Text Support

While SVG resizes text automatically and in proportion to the visual resizing, it didn’t support word wrap which is available in HTML

Implementation and Maintenance

HTML/JS had a lot going for compared to the SVG approach when it came to implementation and maintenance. Using HTML and JS would mean that developers don’t need to have technical knowledge of SVGs, and they code these visuals with the help of our existing components. Code is easy to parse and tested using our existing testing framework. From an implementation standpoint, the only thing that SVG really had going for it was that it usually resulted in fewer lines of code, since styles are inline and elements are absolutely positioned relative to each other. That in itself isn’t reason to choose a less maintainable and human-readable solution.


While both would support animations—something we may want to add in the future—an HTML/JS approach allows us to easily use our existing play/pause buttons to control these animations.


The SVG approach works with JS disabled, however it’s less performant and caused a lot of jankiness on the page when certain properties like shadows were applied to it


Design is where HTML/JS really stood out against SVG. With our original SVG approach, designers needed to follow a specific process and use a specific design tool that worked with that process. For example, we started requiring that shadows applied to elements were consistent in order to prevent multiple versions of Gaussian Blur from being added to the page and creating jankiness. It also required our designers to design in a way that text would never break onto a new line because of the lack of support for word wrapping. Without introducing SVG, none of these concerns applied and designers had more flexibility to use any tools they wanted to build freely.

Documentation and Ramp-up

HTML/JS was a clear winner , as we did away with all of the documentation describing the SVG export process, design guidelines, and quirks we discovered. With HTML, all we’d need to document that wouldn’t apply to SVGs is how to apply the resize functionality to the content.

Scaling Our Solution

We started off with a set of named visuals, and designed our system around a single component that accepted a name (for example “Shopify admin dashboard” or “POS software”) and rendered the desired visual. We thought that having a single entry point would help us better track each visual and restrict us to a small, maintainable set of UI Illustrations. That single component was tested and documented and for each new visual we added respective tests and documentation.

We worried about overuse given that each UI Illustration needed to be maintained by a developer. But with this system, a good portion of that development effort ended up being the education of the structure, maintenance of documentation, and tests for basic HTML markup that’s only used in one place. We’ve since provided a more generic container that can be used to wrap any block of HTML for initialization with our ScaleContentLikeImage module and provides a consistent implementation of descriptive text for screen readers.

The Future of UI Illustrations

ScaleContentLikeImage and its application for our UI Illustrations is a powerful tool for our team to highlight our product in a very intentional and relevant way for our users. Jen Taylor dives deeper into our UX considerations and user-focused approach to UI Illustrations on the Shopify UX Blog. There are still performance and structural wins to be had, specifically around how we recalculate sizing for lazy loaded images, and how we document existing visuals for reuse. However, until there’s a CSS-only solution to handle this use case our HTML/JS approach seems to be the cleanest. Looking to the future, this could be an excellent application to explore with CSS Houdini once the layout API is made available (it’s not yet supported in any major browser).

Based on Anton Dosov’s CSS Houdini with Layout API demo, I can imagine a scenario where we can create a custom layout renderer and then apply this logic with a few lines of CSS.

We’ve all learned a lot in this process and like any system, its long term success relies on our team’s collaborative relationship in order to keep evolving and growing in a maintainable, scalable way. At Shopify one of our core values is to thrive on change, and this project certainly has done so.

If sounds this sounds like the kind of projects you want to be a part of please check out our open positions.

Continue reading

How to Use Quasi-experiments and Counterfactuals to Build Great Products

How to Use Quasi-experiments and Counterfactuals to Build Great Products

Descriptive statistics and correlations are every data scientists’ bread and butter, but they often come with the caveat that correlation isn’t causation. At Shopify, we believe that understanding causality is the key to unlocking maximum business value. We aim to identify insights that actually indicate why we see things in the data, since causal insights can validate (or invalidate) entire business strategies. Below I’ll discuss different causal inference methods and how to use them to build great products.

The Causal Inference “Levels of Evidence Ladder”

A data scientist can use various different methods to estimate the causal effects of a factor. The “levels of evidence ladder” is a great mental model that introduces the ideas of causal inference.

Levels of evidence ladder. First level (clearest evidence): A/B tests (a.k.a statistical experiments). Second level (reasonable level of evidence): Quasi-experiments (including Difference-in-differences, matching, controlled regression). Third level (weakest level of evidence): Full estimation of counterfactuals. Bottom of the chart: descriptive statistics—provides no direct evidence for causal relationship.
Levels of evidence ladder. First level (clearest evidence): A/B tests (a.k.a statistical experiments). Second level (reasonable level of evidence): Quasi-experiments (including difference-in-differences, matching, controlled regression). Third level (weakest level of evidence): Full estimation of counterfactuals. Bottom of the chart: descriptive statistics—provides no direct evidence for causal relationship.

The ladder isn’t a ranking of methods, instead it’s a loose indication of the level of proof each method will give you. The higher the method is on the ladder, the easier it is to compute estimates that constitute evidence of a strong causal relationship. Methods at the top of the ladder typically (but not always) require more focus on the experimentation setup. On the other end, methods at the bottom of the ladder use more observational data, but require more focus on robustness checks (more on this later).

The ladder does a good job of explaining that there is no free lunch in causal inference. To get a powerful causal analysis you either need a good experimental setup, or a good statistician and a lot of work. It’s also simple to follow. I’ve recently started sharing this model with my non-data stakeholders. Using it to illustrate your work process is a great way to get buy-in from your collaborators and stakeholders.

Causal Inference Methods

A/B tests

A/B tests, or randomized controlled trials, are the gold standard method for causal inference—they’re on rung one of the levels of evidence ladder! For A/B tests, group A and group B are randomly assigned. The environment both groups are placed in is identical except for one parameter: the treatment. Randomness ensures that both groups are clones “on average”. This enables you to deduce causal estimates from A/B tests, because the only way they differ is the treatment. Of course in practice, lots of caveats apply! For example, one of the frequent gotchas of A/B testing is when the units in your treatment and control groups self-select to participate in your experiment.

Setting up an A/B test for products is a lot of work. If you’re starting from scratch, you’ll need

  • A way to randomly assign units to the right group as they use your product.
  • A tracking mechanism to collect the data for all relevant metrics.
  • To analyze these metrics and their associated statistics to compute effect sizes and validate the causal effects you suspect.

And that only covers the basics! Sometimes you’ll need much more to be able to detect the right signals. At Shopify, we have the luxury of an experiments platform that does all the heavy work and allows data scientists to start experiments with just a few clicks.


Sometimes it’s just not possible to set up an experiment. Here are a few reasons why A/B tests won’t work in every situation:

  • Lack of tooling. For example, if your code can’t be modified in certain parts of the product.
  • Lack of time to implement the experiment.
  • Ethical concerns  for example, at Shopify, randomly leaving some merchants out of a new feature that could help them with their business is sometimes not an option).
  • Just plain oversight (for example, a request to study the data from a launch that happened in the past).

Fortunately, if you find yourself in one of the above situations, there are methods that exist  which enable you to obtain causal estimates.

A quasi-experiment (rung two) is an experiment where your treatment and control group are divided by a natural process that isn’t truly random, but are considered close enough to compute estimates. Quasi-experiments frequently occur in product companies, for example, when a feature rollout happens at different dates in different countries, or if eligibility for a new feature is dependent on the behaviour of other features (like in the case of a deprecation). In order to compute causal estimates when the control group is divided using a non-random criterion, you’ll use different methods that correspond to different assumptions on how “close” you are to the random situation.

I’d like to highlight two of the methods we use at Shopify. The first is linear regression with fixed effects. In this method, the assumption is that we’ve collected data on all factors that divide individuals between treatment and control group. If that is true, then a simple linear regression on the metric of interest, controlling for these factors, gives a good estimate of the causal effect of being in the treatment group.

The parallel trends assumption for differences-in-differences. In the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is a constant. Plotting both lines in a temporal graph like this can help check the validity of the assumption. Credits to Youcef Msaid.

The parallel trends assumption for differences-in-differences. In the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is a constant. Plotting both lines in a temporal graph like this can help check the validity of the assumption. Credits to Youcef Msaid.

The second is also a very popular method in causal inference: difference in difference. For this method to be applicable, you have to find a control group that shows a trend that’s parallel to your treatment group for the metric of interest, prior to any treatment being applied. Then, after treatment happens, you assume the break in the parallel trend is only due to the treatment itself. This is summed up in the above diagram.


Finally, there will be cases when you’ll want to try to detect causal factors from data that only consists of observations of the treatment. A classic example in tech is estimating the effect of a new feature that was released to all the user base at once: no A/B test was done and there’s absolutely no one that could be the control group. In this case, you can try making a counterfactual estimation (rung three).

The idea behind counterfactual estimation is to create a model that  allows you to compute a counterfactual control group. In other words, you estimate what would happen had this feature not existed. It isn’t always simple to compute an estimate. However, if you have a model of your users that you’re confident about, then you have enough material to start doing counterfactual causal analyses!

Example of time series counterfactual vs. observed data
Example of time series counterfactual vs. observed data

A good way to explain counterfactuals is with an example. A few months ago, my team faced a situation where we needed to assess the impact of a security update. The security update was important and it was rolled out to everyone, however it introduced friction for users. We wanted to see if this added friction caused a decrease in usage. Of course, we had no way of finding a control group among our users.

With no control group, we created a time-series model to get a robust counterfactual estimation of usage of the updated feature. We trained the model on data such as usage of other features not impacted by the security update and global trends describing the overall level of activity on Shopify. All of these variables were independent from the security update we were studying. When we compared our model’s prediction to actuals, we found that there was no lift. This was a great null result which showed that the new security feature did not negatively affect usage.

When using counterfactual methods, the quality of your prediction is key. If a confounding factor that’s independent from your newest rollout varies, you don’t want to attribute this change to your feature. For example, if you have a model that predicts daily usage of a certain feature, and a competitor launches a similar feature right after yours, your model won’t be able to account for this new factor. Domain expertise and rigorous testing are the best tools to do counterfactual causal inference. Let’s dive into that a bit more.

The Importance of Robustness

While quasi-experiments and counterfactuals are great methods when you can’t perform a full randomization, these methods come at a cost! The tradeoff is that it’s much harder to compute sensible confidence intervals, and you’ll generally have to deal with a lot more uncertainty—false positives are frequent. The key to avoiding falling into traps is robustness checks.

Robustness really isn't that complicated. It just means clearly stating assumptions your methods and data rely on, and gradually relaxing each of them to see if your results still hold. It acts as an efficient coherence check if you realize your findings can dramatically change due to a single variable, especially if that variable is subject to noise, error measurement, etc.

Direct Acyclic Graphs (DAGs) are a great tool for checking robustness. They help you clearly spell out assumptions and hypotheses in the context of causal inference. Popularized by the famous computer scientist, Judea Pearl, DAGs have gained a lot of traction recently in tech and academic circles.

At Shopify, we’re really fond of DAGs. We often use Dagitty, a handy browser-based tool. In a nutshell, when you draw an assumed chain of causal events in Dagitty, it provides you with robustness checks on your data, like certain conditional correlations that should vanish. I recommend you explore the tool

The Three Most Important Points About Causal Inference

Let’s quickly recap the most important points regarding causal inference:

  • A/B tests are awesome and should be a go to tool in every data science team’s toolbox.
  • However, it’s not always possible to set up an A/B test. Instead, look for natural experiments to replace true experiments. 
  • If no natural experiment can be found, counterfactual methods can be useful. However, you shouldn’t expect to be able to detect very weak signals using these methods. 

I love causal inference applications for business and I think there is a huge untapped potential in the industry. Just like generalizing A/B tests lead to building a very successful “Experimentation Culture” since the end of the 1990s, I hope the 2020s and beyond will be an era of the “Causal Culture” as a whole! I hope sharing how we do it at Shopify will help. If any of this sounds interesting to you, we’re looking for talented data scientists to join our team.

Continue reading

Enforcing Modularity in Rails Apps with Packwerk

Enforcing Modularity in Rails Apps with Packwerk

On September 30, 2020 we held ShipIt! presents: Packwerk by Shopify. A video for the event is now available for you to learn more about our latest open source tool for creating packages with enforced boundaries in Rails apps. Click here to watch the video.

The Shopify core codebase is large, complex, and growing by the day. To better understand these complex systems, we use software architecture to create structural boundaries. Ruby doesn't come with a lot of boundary enforcements out of the box. Ruby on Rails only provides a very basic layering structure, so it's hard to scale the application without any solid pattern for boundary enforcement. In comparison, other languages and frameworks have built-in mechanisms for vertical boundaries, like Elixir’s umbrella projects.

As Shopify grows, it’s crucial we establish a new architecture pattern so large scale domains within the monolith can interact with each other through well-defined boundaries, and in turn, increase developer productivity and happiness. 

So, we created an open source tool to build a package system that can be used to guide and enforce boundaries in large scale Rails applications. Packwerk is a static analysis tool used to enforce boundaries between groups of Ruby files we call packages.

High Cohesion and Low Coupling In Code

Ideally, we want to work on a codebase that feels small. One way to make a large codebase feel small is for it to have high cohesion and low coupling.

Cohesion refers to the measure of how much elements in a module or class belong together. For example, functional cohesion is when code is grouped together in a module because they all contribute to one single task. Code that is related changes together and therefore should be placed together.

On the other hand, coupling refers to the level of dependency between modules or classes. Elements that are independent of each other should also be independent in location of implementation. When a certain domain of code has a long list of dependencies of unrelated domains, there’s no separation of boundaries. 

Boundaries are barriers between code. An example of a code boundary is to have a separate repository and service. For the code to work together in this case, network calls have to be made. In our case, a code boundary refers to different domains of concern within the same codebase.

With that, there are two types of boundaries we’d like to enforce within our applications—dependency and privacy. A class can have a list of dependencies of constants from other classes. We want an intentional and ideally small list of dependencies for a group of relevant code. Classes shouldn’t rely on other classes that aren’t considered their dependencies. Privacy boundaries are violated when there’s external use of private constants in your module. Instead, external references should be made to public constants, where a public API is established.

A Common Problem with Large Rails Applications

If there are no code boundaries in the monolith, developers find it harder to make changes in their respective areas. You may remember making a straightforward change that shockingly resulted in the breaking of unrelated tests in a different part of the codebase, or digging around a codebase to find a class or module with more than 2,000 lines of code. 

Without any established code boundaries, we end up with anti-patterns such as spaghetti code and large classes that know too much. As a codebase with low cohesion and high coupling grows, it becomes harder to develop, maintain, and understand. Eventually, it’s hard to implement new features, scale and grow. This is frustrating to developers working on the codebase. Developer happiness and productivity when working on our codebase is important to Shopify.

Rails Is Like an Open-concept Living Space

Let’s think of a large Rails application as a living space within a house without any walls. An open-concept living space is like a codebase without architectural boundaries. In an effort to separate concerns of different types of living spaces, you can arrange the furniture in a strategic manner to indicate boundaries. This is exactly what we did with the componentization efforts in 2017. We moved code that made sense together into folders we call components. Each of the component folders at Shopify represent domains of commerce, such as orders and checkout.

In our open-concept analogy, imagine having a bathroom without walls—it’s clear where the bathroom is supposed to be, but we would like it to be separate from other living spaces with a wall. The componentization effort was a great first step towards modularity for the great Shopify monolith, but we are still far from a modular codebase—we need walls. Cross-component calls are still being made, and Active Record models are shared across domains. There’s no wall imposing those boundaries, just an agreed upon social contract that can be easily broken.

Boundary Enforcing Solutions We Researched

The goal is to find a solution for boundary enforcement. The Ruby we all know and love doesn't come with boundary enforcements out of the box. It allows specifying visibility on the class level only and loads all dependencies into the global namespace. There’s no differences between direct and indirect dependencies.

There are some existing ways of potentially enforcing boundaries in Ruby. We explored a combination of solutions: using the private_constant keyword to set private constants, creating gems to set boundaries, using tests to prevent cross-boundary associations, and testing out external gems such as Modulation.

Setting Private Constants

The private_constant keyword is a built-in Ruby method to make a constant private so it cannot be accessed outside of its namespace. A constant’s namespace is the modules or classes where it’s nested and defined. In other words, using private_constant provides visibility semantics for constants on a namespace level, which is desirable. We want to establish public and private constants for a class or a group of classes.

However, there are drawbacks of using the private_constant method of privacy enforcement. If a constant is privatized after it has been defined, the first reference to it will not be checked. It is therefore not a reliable method to use.

There’s no trivial way to tell if there’s a boundary violation using private_constants. When declaring a constant private to your class, it is hard to determine if the use of the constant is getting bypassed or used appropriately. Plus, this is just a solution for privacy issues and not dependency.

Overall, only using private_constant is insufficient to enforce boundaries across large domains. We want a tool that is flexible and can integrate into our current workflow. 

Establishing Boundaries Through Gems

The other method of creating a modular Rails application is through gems. Ruby gems are used to distribute and share Ruby libraries between Rails applications. People may place relevant code into an internal gem, separating concerns from the main application. The gem may also eventually be extracted from the application with little to no complications.

Gems provide a list of dependencies through the gemspec which is something we wanted, but we also wanted the list of dependencies to be enforced in some way. Our primary concern was that gems don't have visibility semantics. Gems make transitive dependencies available in the same way as direct dependencies in the application. The main application can use any dependency within the internal gem as it would its own dependency. Again, this doesn't help us with boundary enforcement.

We want a solution where we’re able to still group code that’s relevant together, but only expose certain parts of that group of code as public API. In other words, we want to control and enforce the privacy and dependency boundaries for a group of code—something we can’t do with Ruby gems.

Using Tests to Prevent Cross-component Associations

We added a test case that rejects any PRs that introduce Active Record associations across components, which is a pattern we’re trying to avoid. However, this solution is insufficient for several reasons. The test doesn’t account for the direction of the dependency. It also isn’t a complete test. It doesn’t cover use cases of Active Record objects that aren’t associations and generally doesn’t cover anything that isn’t Active Record.

The test was good enforcement, but lacked several key features. We wanted a solution that determined the direction of dependencies and accounted for different types of Active Record associations. Nonetheless, the test case still exists in our codebase as we still found it helpful in triggering developer thought and discussions to whether or not an association between components is truly needed.

Using the Modulation Ruby Gem

Modulation is a Ruby gem for file-level dependency management within the Ruby application that was experimental at the time of our exploration. Modulation works by overriding the default Ruby code loading, which is concerning, as we’d have to replace the whole autoloading system in our Rails application. The level of complexity added to the code and runtime application behaviour is because dependency introspection performed at runtime.

There are obvious risks that come with modifying how our monolith works for an experiment. If we went with Modulation as a solution and had to change our minds, we’d likely have to revert changes to hundreds of files, which is impractical in a production codebase. Plus, the gem works at file-level granularity which is too fine for the scale we were trying to solve.

Creating Microservices?

The idea of extracting components from the core monolith into microservices in order to create code boundaries is often brought up at Shopify. In our monolith’s case, creating more services in an attempt to decouple code is solving code design problems the wrong way.

Distributing code over multiple machines is a topology change, not an architectural change. If we try to extract components from our core codebase into separate services, we introduce the added concern of networked communication and create a distributed system. A poorly designed API within the monolith will still be a poorly designed API within a service, but now with additional complexities. These complexities can come in forms such as stateless network boundary and serialisation between the systems, and reliability issues with networked communications. Microservices are a great solution when the service is isolated and unique enough to reason the tradeoff of the network boundary and complexities that come with it.

The Shopify core codebase still stands as a majestic modular monolith, with all the code broken up into components and living in a singular codebase. Now, our goal is to advance our application’s modularity to the next step—by having clear and enforced boundaries.

Packwerk: Creating Our Own Solution

Taking our learnings from the exploration phase for the project, we created Packwerk. There are two violations that Packwerk enforces: dependency and privacy. Dependency violations occur when a package references a private constant from a package that hasn’t been declared as a dependency. Privacy violations occur when an external constant references a package’s private constants. However, constants within the public folder, app/public, can be accessed and won't be a violation.

How Packwerk Works 

Packwerk parses and resolves constants in the application statically with the help of an open-sourced Shopify Ruby gem called ConstantResolver. ConstantResolver uses the same assumptions as Zeitwerk, the Rails code loader, to infer the constant's file location. For example, Some::Nested::Model will be resolved to the constant defined in the file path, models/some/nested/model.rb. Packwerk then uses the file path to determine which package defines the constant.

Next, Packwerk will use the resolved constants to check against the configurations of the packages involved. If all the checks are enforced (i.e. dependency and privacy), references from Package A to Package B are valid if:

  1. Package A declares a dependency on Package B, and;
  2. The referenced constant is a public constant in Package B

Ensuring Application Validity

Before diving into further details, we have to make sure that the application is in a valid state for Packwerk to work correctly. To be considered valid, an application has to have a valid autoload path cache, package definition files and application folder structure. Packwerk comes with a command, packwerk validate, that runs on a continuous integration (CI) pipeline to ensure the application is always valid.

Packwerk also checks for any acyclic dependencies within the application. According to the Acyclic Dependency Principle, no cycles should be allowed in the component dependency graph. If packages depend on each other in a cycle, making a change to one package will create a domino effect and force a change on all packages in the cycle. This dependency cycle will be difficult to manage.

In practical terms, imagine working on a domain of the codebase concurrently with 100 other developers. If your codebase has cyclic dependencies, your change will impact the components that depend on your component. When you are done with your work, you want to merge it into the main branch, along with the changes of other developers. This code will create an integration nightmare because all the dependencies have to be modified in each iteration of the application.

An application with an acyclic dependency graph can be tested and released independently without having the entire application change at the same time.

Creating a Package 

A package is defined by a package.yml file at the root of the package folder. Within that file, specific configurations are set. Packwerk allows a package to declare the type of boundary enforcement that the package would like to adhere to. 

Additionally, other useful package-specific metadata can be specified, like team and contact information for the package. We’ve found that having granular, package-specific ownership makes it easier for cross-team collaboration compared to ownership of an entire domain.

Enforcing Boundaries Between Packages

Running packwerk check
Running packwerk check

Packwerk enforces boundaries between packages through a check that can be run both locally and on the CI pipeline. To perform a check, simply run the line packwerk check. We also included this in Shopify’s CI pipeline to prevent any new violations from being merged into the main branch of the codebase.

Enforcing Boundaries in Existing Codebases

Because of the lack of code structure in Rails apps, legacy large scale Rails apps tend to have existing dependency and privacy violations between packages. If this is the case, we want to stop the bleeding and prevent new violations from being added to the codebase.

Users can still enforce boundaries within the application despite existing violations, ensuring the list of violations doesn't continue to increase. This is done by generating a deprecated references list for the package.

We want to allow developers to continue with their workflow, but prevent any further violations. The list of deprecated references can be used to help a codebase transition to a cleaner architecture. It iteratively establishes boundaries in existing code as developers work to reduce the list.

List of deprecated references for components/online_store
List of deprecated references for components/online_store

The list of deprecated references contains some useful information about the violation within the package. In the example above, we can tell that there was a privacy violation in the following files that are referring to the ::RetailStore constant that was defined in the components/online_store package.

By surfacing the exact references where the package’s boundaries are being breached, we essentially have a to-do list that can be worked off.

Conventionally, the deprecated references list was meant for developers to start enforcing the boundaries of an application immediately despite existing violations, and use it to remove the technical debt. However, the Shipping team at Shopify found success using this list to extract a domain out of their main application into its own service. Also, the list can be used if the package were extracted into a gem. Ultimately, we make sure to let developers know that the list of deprecated references should be used to refactor the code and reduce the amount of violations in the list.

The purpose of Packwerk would be defeated if we merely added to the list of violations (though, we’ve made some exceptions to this rule). When a team is unable to add a dependency in the correct direction because the pattern doesn’t exist, we recommend adding the violation to the list of deprecated references. Doing so will ensure that when such a pattern exists, we eventually refactor the code and remove the violation from the list. This results in a better alternative than creating a dependency in the wrong direction.

Preventing New Violations 

After creating packages within your application and enforcing boundaries for those packages, Packwerk should be ready to go. Packwerk will display violations when packwerk check is run either locally or on the CI pipeline.

The error message as seen above displays the type of violation, location of violation, and provides actionable next steps for developers. The goal is to make developers aware of the changes they make and to be mindful of any boundary breaking changes they add to the code.

The Caveats 

Statically analyzing Ruby is complex. If a constant is not autoloaded, Packwerk ignores it. This ensures that the results produced by Packwerk won’t have any false positives, but it can create false negatives. If we get most of the references right, it’ll be enough to shift the code quality in a positive direction. The Packwerk team made this design decision as our strategy to handle the inaccuracy that comes with Ruby static analysis. 

How Shopify Is Using Packwerk

There was no formal push for the adoption of Packwerk within Shopify. Several teams were interested in the tool and volunteered to beta test before it was released. Since its release, many teams and developers are adopting Packwerk to enforce boundaries within their components.

Currently Packwerk runs in six Rails applications at Shopify, including the core monolith. Within the core codebase, we have 48 packages with 30 boundary enforcements within those packages. Packwerk integrates in the CI pipeline for all these applications and has commands that can run locally for packaging-related checks.

Since Packwerk was released for use within the company, new conversations related to software architecture have been sparked. As developers worked on removing technical debt and refactoring the code using Packwerk, we noticed there’s no established pattern for decoupling of code and creating single-direction dependencies. We’re currently researching and discussing inversion of control and establishing patterns for dependency inversion within Rails applications.

Start Using Packwerk. It’s Open Source!

Packwerk is now out in the wild and ready for you to try it out!

To get Packwerk installed in your Rails application, add it as a gem and simply run the command packwerk init. The command will generate the configuration files needed for you to use Packwerk.

The Packwerk team will be maintaining the gem and we’re stoked to see how you will be using the tool. You are also welcome to report bugs and open pull requests in accordance with our contribution guidelines.


Packwerk is inspired by Stripe’s internal Ruby packages solution with its idea adapted to the more complex world of Rails applications.

ShipIt! Presents: Packwerk by Shopify

Without code boundaries in a monolith, it’s difficult for developers to make changes in their respective areas. Like when you make a straightforward change that shockingly results in breaking unrelated tests in a different part of the codebase, or dig around a codebase to find a class or module with more than 2,000 lines of code!

You end up with anti-patterns like spaghetti code and large classes that know too much. The codebase is harder to develop, maintain and understand, leading to difficulty adding new features. It’s frustrating for developers working on the codebase. Developer happiness and productivity is important to us.

So, we created an open source tool to establish code boundaries in Rails applications. We call it Packwerk.

During this event you will

  • Learn more about the problems Packwerk solves.
  • See how we built Packwerk.
  • Understand how we use Packwerk at Shopify.
  • See a demo of Packwerk.
  • Learn how you can get started with Packwerk.

Additional Information 

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default

Continue reading

Under Deconstruction: The State of Shopify’s Monolith

Under Deconstruction: The State of Shopify’s Monolith

Ruby on Rails is a great framework for rapidly building beautiful web applications that users and developers love. But if an application is successful, there’s usually continued investment, resulting in additional features and increased overall system complexity.

Shopify’s core monolith has over 2.8 million lines of Ruby code and 500,000 commits. Rails doesn’t provide patterns or tooling for managing the inherent complexity and adding features in a structured, well-bounded way.

That’s why, over three years ago, Shopify founded a team to investigate how to make our Rails monoliths more modular. The goal was to help us scale towards ever increasing system capabilities and complexity by creating smaller, independent units of code we called components. The vision went like this:

  • We can more easily onboard new developers to just the parts immediately relevant to them, instead of the whole monolith.
  • Instead of running the test suite on the whole application, we can run it on the smaller subset of components affected by a change, making the test suite faster and more stable.
  • Instead of worrying about the impact on parts of the system we know less well, we can change a component freely as long as we’re keeping its existing contracts intact, cutting down on feature implementation time.

In summary, developers should feel like they are working on a much smaller app than they actually are.

It’s been 18 months since we last shared our efforts to make our Rails monoliths more modular. I’ve been working on this modularity effort for the last two and a half years, currently on a team called Architecture Patterns. I’ll lay out the current state of my team’s work, and some things we’d do differently if we started fresh right now.

The Status Quo

We generally stand by the original ideas as described in Deconstructing the Monolith, but almost all of the details have changed.  We make consistent progress, but it's important to note that making changes at this scale requires a significant shift in thinking for a critical mass of contributors, and that takes time.

While we’re far from finished, we already reap the benefits of our work. The added constraints on how we write our code trigger deep software design discussions throughout the organization. We see a mindset shift across our developers with a stronger focus on modular design. When making a change, developers are now more aware of the consequences on the design and quality of the monolith as a whole. That means instead of degrading the design of existing code, new feature implementations now more often improve it. Parts of the codebase that received heavy refactoring in recent years are now easier to understand because their relationship with the rest of the system is clearer.

We automatically triage exceptions to components, enabling teams to act on them without having to dig through the sometimes noisy exception stream for the whole monolith. And with each component explicitly owned by a team, whole-codebase chores like Rails upgrades are easily distributed and collaboratively solved. Shopify is running its main monolith on the newest, unreleased revisions of Rails. The clearly defined ownership for areas of the codebase is one of the factors enabling us to do that.

What We Learned so Far

Our main monolith is one of the oldest, largest Rails codebases on the planet, under continuous development since at least 2006, with hundreds of developers currently adding features.

A refactor on this scale needs to be approached completely differently from smaller efforts. We learned that all large scale changes start

  • with understanding and influencing developer behavior
  • at the grassroots
  • with a holistic perspective on architecture 
  • with careful application of tooling
  • with being aware of the tradeoffs involved

Understand Developer Behaviour

A single centralized team can’t make change happen by working against the momentum of hundreds of developers adding features.

Also, it can’t anticipate all the edge cases and have context on all domains of the application. A single team can make simple change happen on a large scale, or complex change on a small scale. To modularize a large monolith though, we need to make complex change happen on a large scale. Even if a centralized team could make it happen, the design would degrade once the team switches its focus to something else. 

That’s why making a fundamental architecture change to a system that’s being actively worked on is in large part a people problem. We need to change the behavior of the average developer on the codebase. We need to all iteratively evolve the system towards the envisioned future together. The developers are an integral part of the system.

Dr. B.J. Fogg, founder of the Behavior Design Lab at Stanford University, developed a model for thinking about behaviors that matches our experiences. The model suggests that for a behavior to occur, three things need to be in place: Ability, Motivation, and Prompt.

Fogg Behaviour Model by BJ Fogg, PhD
Fogg Behaviour Model by  BJ Fogg, PHD

In a nutshell, prompts are necessary for a desired behavior to happen, but they're ineffective unless there's enough motivation and ability. Exceptionally high motivation can, within reason, compensate for low ability and vice versa.

Automated tooling and targeted manual code reviews provide prompts. That’s the easy part. Creating ability and motivation to make positive change is harder. Especially when that goes against common Ruby on Rails community practices and requires a view of the system that’s much larger than the area that most individual developers are working on. Spreading an understanding of what we’re aiming for, and why, is critical.

For example, we invested quite a bit of time and energy into developing patterns to ensure some consistency in how component boundary interfaces are designed. Again and again we pondered: How should components call each other? We then pushed developers to use these patterns everywhere. In hindsight, this strategy didn’t increase developer ability or motivation. It didn’t solve the problems actually holding them back, and it didn’t explain the reasons or long term goals well enough. Pushing for consistency added rules, which always add some friction, because they have to be learned, remembered, and followed. It didn’t make any hard problem significantly easier to solve. In some cases, the patterns were helpful. In other cases, they lead developers to redefine their problem to fit the solution we provided, which degraded the overall state of the monolith.

Today, we’re still providing some general suggestions on interface consistency, but we have a lot less hard rules. We’re focusing on finding the areas where developers are hungry to make positive change, but don’t end up doing it because it’s too hard. Often, making our code more modular is hard because legacy code and tooling are based on assumptions that no longer hold true. One of the most problematic outdated assumptions is that all Active Record models are OK to access everywhere, when in this new componentized world we want to restrict their usage to the component that owns them. We can help developers overcome this problem.

So in the words of Dr. Fogg, these days we’re looking for areas where the prompt is easy, the motivation is present, and we just have to amp up the ability to make things happen.

Foster the Grassroots

As I mentioned, we, as a centralized team, can’t make this change happen by ourselves. So, we work to create a grassroots movement among the developers at Shopify. We aim to increase the number of people that have ability, motivation and prompt to move the system a tiny step further in the right direction.

We give internal talks, write documentation, share wins, embed in other teams, and pair with people all over the company. Embedding and pairing make sure we’re solving the problems that product developers are most struggling with in practice, avoiding what’s often called Ivory Tower Syndrome where the solutions don’t match the problems. It also lets us gain context on different areas of the codebase and the business while helping motivated people achieve goals that align with ours.

As an example, we have a group called the Architecture Guild. The guild has a slack channel for software architecture discussions and bi-weekly meetups. It’s an open forum, and a way to grow more architecture conscious mindsets while encouraging architectural thinking. The Architecture Patterns team provides some content that we think is useful, but we encourage other people to share their thoughts, and most of the contributions come from other teams. Currently, the Architecture Guild has ~400 members and 54 documented meetups with meeting notes and recordings that are shared with all developers at Shopify.

The Architecture Guild grew organically out of the first Componentization team at Shopify after the first year of Componentization. If I were to start a similar effort again today, I’d establish a forum like this from the beginning to get as many people on board with the change as early as possible. It’s also generally a great vehicle to spread software design knowledge that’s siloed in specific teams to other parts of the company.

Other methods we use to create fertile ground for ambitious architecture projects are

  • the Developer Handbook, an internal online resource documenting how we develop software at Shopify.
  • Developer Talks, our internal weekly livestreamed and recorded talks about software development at Shopify.

Build Holistic Architecture

Some properties of software are so closely related that they need to be approached in pairs. By working on one property and ignoring its “partner property,” you could end up degrading the system.

Balance Encapsulation With A Simple Dependency Graph

We started out by focusing our work on building a clean public interface around each component to hide the internals. The expectation was that this would allow reasoning about and understanding the behavior of a component in isolation. Changing internals of a component wouldn’t break other components—as long as the interface stays stable.

It’s not that straightforward though. The public interface is what other components depend on; if a lot of components depend on it, it’s hard to change. The interface needs to be designed with those dependencies in mind, and the more components depend on it, the more abstract it needs to be. It’s hard to change because it’s used everywhere, and it will have to change often if it contains knowledge about concrete parts of the business logic.

When we started analyzing the graph of dependencies between components, it was very dense, to the point that every component depended on over half of all the other components. We also had lots of circular dependencies.

Circular Dependancies
Circular Dependancies

Circular dependencies are situations where for example component A depends on component B but component B also depends on component A. But circular dependencies don’t have to be direct, the cycles can be longer than two. For example, A depends on B depends on C depends on A.

These properties of the dependency graph mean that the components can’t be reasoned about, or evolved, separately. Changes to any component in a cycle can break all other components in the cycle. Changes to a component that has almost all other components depend on can break almost all other components. So these changes require a lot of context. A dense, cyclical dependency graph undermines the whole idea of Componentization—it blocks us from making the system feel smaller.

When we ignored the dependency graph, in large parts of the codebase the public interface turned out to just be an added layer of indirection in the existing control flows. This made it harder to refactor these control flows because it added additional pieces that needed to be changed. It also didn’t make it a lot easier to reason about parts of the system in isolation.

The simplest possible way to introduce a public interface to a private implementation
The simplest possible way to introduce a public interface to a private implementation

The diagram shows that the simplest possible way to introduce a public interface could just mean that a previously problematic design is leaked into a separate interface class, making the underlying design problem harder to fix by spreading it into more files.

Discussions about the desirable direction of a dependency often surface these underlying design problems. We routinely discover objects with too many responsibilities and missing abstractions this way.

Perhaps not surprisingly, one of the central entities of the Shopify system is the Shop and so almost everything depends on the Shop class. That means that if we want to avoid circular dependencies, the Shop class can depend on almost nothing. 

Luckily, there are proven tools we can use to straighten out the dependency graph. We can make arrows point in different directions, by either moving responsibilities into the component that depends on them or applying inversion of control. Inversion of control means to invert a dependency in such a way that control flow and source code dependency are opposed. This can be done for example through a publish/subscribe mechanism like ActiveSupport::Notifications.

This strategy of eliminating circular dependencies naturally guides us towards removing concrete implementation from classes like Shop, moving it towards a mostly empty container holding only the identity of a shop and some abstract concepts.

If we apply the aforementioned techniques while building out the public interfaces, the result is therefore much more useful. The simplified graph allows us to reason about parts of the system in isolation, and it even lays out a path towards testing parts of the system in isolation.

Dependencies diagram between Platform, Supporting, and Frontend components
Dependencies diagram between Platform, Supporting, and Frontend components

If determining the desired direction of all the dependencies on a component ever feels overwhelming, we think about the components grouped into layers. This allows us to prioritize and focus on cleaning up dependencies across layers first. The diagram above sketches out an example. Here, we have platform components, Platform and Shop Identity, that purely provide functionality to other components. Supporting components, like Merchandising and Inventory, depend on the platform components but also provide functionality to others and often serve their own external APIs. Frontend components, like Online Store, are primarily externally facing. The dependencies crossing the dotted lines can be prioritized and cleaned up first, before we look at dependencies within a layer, for example between Merchandising and Inventory.

Balance Loose Coupling With High Cohesion

Tight coupling with low cohesion and loose coupling with high cohesion
Tight coupling with low cohesion and loose coupling with high cohesion

Meaningful boundaries like those we want around components require loose coupling and high cohesion. A good approximation for this is Change Locality: The degree to which code that changes together lives together.

At first, we solely focused on decoupling components from each other. This felt good because it was an easy, visible change, but it still left us with cohesive parts of the codebase that spanned across component boundaries. In some cases, we reinforced a broken state. The consequence is that often small changes to the functionality of the system still meant changes in code across multiple components, for which the developers involved needed to know and understand all of those components.

Change Locality is a sign of both low coupling and high cohesion and makes evolving the code easier. The codebase feels smaller, which is one of our stated goals. And Change Locality can also be made visible. For example, we are working on automation analyzing all pull requests on our codebase for which components they touch. The number of components touched should go down over time.

An interesting side note here is that different kinds of cohesion exist. We found that where our legacy code respects cohesion, it’s mostly informational cohesion—grouping code that operates on the same data. This arises from a design process that starts with database tables (very common in the Rails community). Change Locality can be hindered by that. To produce software that is easy to maintain, it makes more sense to focus on functional cohesion—grouping code that performs a task together. That’s also much closer to how we usually think about our system. 

Our focus on functional cohesion is already showing benefits by making our business logic, the heart of our software, easier to understand.

Create a SOLID foundation

There are ideas in software design that apply in a very similar way on different levels of abstraction—coupling and cohesion, for example. We started out applying these ideas on the level of components. But most of what applies to components, which are really large groups of classes, also applies on the level of individual classes and even methods.

On a class level, the most relevant software design ideas are commonly summarized as the SOLID principles. On a component level, the same ideas are called “package principles.” Here’s a SOLID refresher from Wikipedia:

Single-responsibility principle

A class should only have a single responsibility, that is, only changes to one part of the software's specification should be able to affect the specification of the class.

Open–closed principle

Software entities should be open for extension, but closed for modification.

Liskov substitution principle

Objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program.

Interface segregation principle

Many client-specific interfaces are better than one general-purpose interface.

Dependency inversion principle

Depend upon abstractions, not concretions.

The package principles express similar concerns on a different level, for example (source):

Common Closure Principle

Classes that change together are packaged together.

Stable Dependencies Principle

Depend in the direction of stability.

Stable Abstractions Principle

Abstractness increases with stability.

We found that it’s very hard to apply the principles on a component level if the code doesn’t follow the equivalent principles on a class and method level. Well designed classes enable well designed components. Also, people familiar with applying the SOLID principles on a class level can easily scale these ideas up to the component level.

So if you’re having trouble establishing components that have strong boundaries, it may make sense to take a step back and make sure your organization gets better at software design on a scale of methods and classes first.

This is again mostly a matter of changing people’s behavior that requires motivation and ability. Motivation and ability can be increased by spreading awareness of the problems and approaches to solving them.

In the Ruby world, Sandi Metz is great at teaching these concepts. I recommend her books, and we’re lucky enough to have her teach workshops at Shopify repeatedly. She really gets people excited about software design.

Apply Tooling Deliberately

To accelerate our progress towards the modular monolith, we’ve made a few major changes to our tooling based on our experience so far.

Use Rails Engines

While we started out with a lot of custom code, our components evolved to look more and more like Rails Engines. We’re doubling down on engines going forward. They are the one modularity mechanism that comes with Rails out of the box. They have the familiar looks and features of Rails applications, but other than apps, we can run multiple engines in the same process. And should we make the decision to extract a component from the monolith, an engine is easily transformed into a standalone application.

Engines don’t fit the use case perfectly though. Some of the roughest edges are related to libraries and tooling assuming a Rails application structure, not the slightly different structure of an engine. Others relate to the fact that each engine can (and probably should) specify its own external gem dependencies, and we need a predictable way to unify them into one set of gems for the host application. Thankfully, there are quite a few resources out there from other projects encountering similar problems. Our own explorations have yielded promising results with multiple production applications currently using engines for modularity, and we’re using engines everywhere going forward.

Define and Enforce Contracts

Strong boundaries require explicit contracts. Contracts in code and documentation allow developers to use a component without reading its implementation, making the system feel smaller.

Initially, we built a hash schema validation library called Component::Schema based on dry-schema. It served us well for a while, but we ran into problems keeping up with breaking changes and runtime performance for checking more complex contracts.

In 2019, Stripe released their static Ruby type checker, Sorbet. Shopify was involved in its development before that release and has a team contributing to Sorbet, as we are using it heavily. Now it’s our go-to tool for expressing input and output contracts on component boundaries. Configured correctly, it has barely any runtime performance impact, it’s more stable, and it provides advanced features like interfaces.

This is what an entrypoint into a component looks like using Component::Schema:

And this is what that entrypoint looks like today, using Sorbet:

Perform Static Dependency Analysis

As Kirsten laid out in the original blog post on Componentization at Shopify, we initially built a call graph analysis tool we called Wedge. It logged all method calls during test suite execution on CI to detect calls between components.

We found the results produced were often not useful. Call graph logging produces a lot of data, so it’s hard to separate the signal from the noise. Sometimes it’s not even clear which component a call is from or to. Consider a method defined in component A which is inherited by a class in component B. If this method is making a call to component C, which component is the call coming from? Also, because this analysis depended on the full test suite with added instrumentation, it took over an hour to run, which doesn’t make for a useful feedback cycle.

So, we developed a new tool called Packwerk to analyze static constant references. For example, the line Shop.first, contains a static reference to Shop and a method call to a method on that class that’s called first. Packwerk only analyzes the static constant reference to Shop. There’s less ambiguity in static references, and because they’re always explicitly introduced by developers, highlighting them is more actionable. Packwerk runs a full analysis on our largest codebase in a few minutes, so we’re able to integrate it with our Pull Request workflow. This allows us to reject changes that break the dependency graph or component encapsulation before they get merged into our main branch.

We’re planning to make Packwerk open source soon. Stay tuned!

Decide to Prioritize Ownership or Boundaries

There are two major ways to partition an existing monolith and create components from a big ball of mud. In my experience, all large architecture changes end up in an incomplete state. Maybe that’s a pessimistic view, but my experience tells me that the temporary incomplete state will at least last longer than you expect. So choose an approach based on which intermediary state is most useful for your specific situation.

One option is to draw lines through the monolith based on some vision of the future and strengthen those lines over time into full fledged boundaries. The other option is to spin off parts of it into tiny units with strong boundaries and then transition responsibilities over iteratively, growing the components over time.

For our main monolith, we took the first approach; our vision was guided by the ideas of Domain Driven Design. We defined components as implementations of subdomains of the domain of commerce, and moved the files into corresponding folders. The main advantage is that even though we’re not finished building out the boundaries, responsibilities are roughly grouped together, and every file has a stewardship team assigned. The disadvantage is that almost no component has a complete, strong boundary yet, because with the components containing large amounts of legacy code, it’s a huge amount of work to establish these. This vision of the future approach is good if well-defined ownership and a clearly visible partition of the app are most important for you—which they were for us because of the huge number of people working on the codebase.

On other large apps within Shopify, we’ve tried out the second approach. The advantage is that large parts of the codebase are in isolated and clean components. This creates good examples for people to work towards. The disadvantage of this approach is that we still have a considerable sized ball of mud within the app that has no structure whatsoever. This spin-off approach is good if clean boundaries are the priority for you.

What We’re Building Right Now

While feature development on the monolith is going on as fast as ever, many developers are making things more modular at the same time. We see an increase of people in a position to do this, and the number of good examples around the codebase is expanding.

We currently have 37 components in our main monolith, each with public entrypoints covering large parts of its responsibilities. Packwerk is used on about a third of the components to restrict their dependencies and protect the privacy of their internal implementation. We’re working on making Packwerk enticing enough that all components will adopt it.

Through increased adoption we’re progressively enforcing properties of the dependency graph. Total acyclicity is the long term goal, but the more edges we can remove from the graph in the short term the easier the system will be to reason about.

We have a few other monolithic apps going through similar processes of componentization right now; some with the goal of splitting into separate services long term, some aiming for the modular monolith. We are very deliberate about when to split functionality out into separate services, and we only do it for good reasons. That’s because splitting a single monolithic application into a distributed system of services increases the overall complexity considerably.

For example, we split out storefront rendering because it’s a read-only use case with very high throughput and it makes sense for us to scale and distribute it separately from the interface that our merchants use to manage their stores. Credit card vaulting is a separate service because it processes sensitive data that shouldn’t flow through other parts of the system.

In addition, we’re preparing to have all new Rails applications at Shopify componentized by default. The idea is to generate multiple separately tested engines out of the box when creating a Rails app, removing the top level app folder and setting up developers for a modular future from the start.

At the same time, we’re looking into some of the patterns necessary to unblock further adoption of Packwerk. First and foremost that means making the dependency graph easy to clean up. We want to encourage inversion of control and more generally dependency inversion, which will probably lead us to use a publish/subscribe mechanism instead of straightforward method calls in many cases.

The second big blocker is efficiently querying data across components without coupling them too tightly. The most interesting problems in this area are

  • Our GraphQL API exposes a partially circular graph to external consumers while we’d like the implementation in the components to be acyclic.
  • Our GraphQL query execution and ElasticSearch reindexing currently heavily rely on Active Record features, which defeats the “public interface, private implementation” idea.

The long term vision is to have separate, isolated test suites for most of the components of our main monolith.

Last But Not Least

I want to give a shout out to Josh Abernathy, Bryana Knight, Matt Todd, Matthew Clark, Mike Chlipala and Jakob Class at Github. This blog post is based on, and indirectly the result of a conversation I had with them. Thank you!

Anita Clarke, Edward Ocampo-Gooding, Gannon McGibbon, Jason Gedge, Martin LaRochelle, and Keyfer Mathewson contributed super valuable feedback on this article. Thank you BJ Fogg for the behavior model and use of your image.

If you’re interested in the kinds of challenges I described, you should join me at Shopify!

Further Reading


    Continue reading

    Tophatting in React Native

    Tophatting in React Native

    On average in 2019, Shopify handled billions of dollars of transactions per week. Therefore, it’s important to ensure new features are thoroughly tested before shipping them to our merchants. A vital part of the software quality process at Shopify is a practise called tophatting. Tophatting is manually testing your coworker’s changes and making sure everything is working properly before approving their pull request (PR).

    Earlier this year, we announced that React Native is the future of mobile development in the company. However, the workflow for tophatting a React Native app was quite tedious and time consuming. The reviewer had to 

    1. save their current work
    2. switch their development environment to the feature branch
    3. rebuild the app and load the new changes
    4. verify the changes inside the app.

    To provide a more convenient and painless experience, we built a tool enabling React Native developers to quickly load their peer’s work within seconds. I’ll explain how the tool works in detail.

    React Native Tophatting vs Native Tophatting

    About two years ago, the Mobile Tooling team developed a tool for tophatting native apps. The tool works by storing the app’s build artifacts in cloud storage, mobile developers can download the app and launch it in an emulator or a simulator on demand. However, the tool’s performance can be improved when there are only React Native code changes because we don’t need to rebuild and re-download the entire app. One major difference between React Native and native apps is that React Native apps produce an additional build artifact, the JavaScript bundle. If a developer only changes the React Native code and not native code, then the only build artifact needed to load the changes is the new JavaScript bundle. We leveraged this fact and developed a tool to store any newly built JavaScript bundles, so React Native apps can fetch any bundle and load the changes almost instantly.

    Storing the JavaScript Bundle

    The main idea behind the tool is to store the JavaScript bundle of any new builds in our cloud storage, so developers can simply download the artifact instead of building it on demand when tophatting.

    New PR on React Native project triggers a CI pipeline in Shopify Build

    New PR on React Native project triggers a CI pipeline in Shopify Build

    When a developer opens a new PR on GitHub or pushes a new commit in a React Native project, it triggers a CI pipeline in Shopify Build, our internal continuous integration/continuous delivery (CI/CD) platform then performs the following steps:

    1. The pipeline first builds the app’s JavaScript bundle.
    2. The pipeline compresses the bundle along with any assets that the app uses.
    3. The pipeline makes an API call to a backend service that writes the bundle’s metadata to a SQL database. The metadata includes information such as the app ID, the commit’s Secure Hash Algorithms (SHA) checksum, and the branch name.
    4. The backend service generates a unique bundle ID and a signed URL for uploading to cloud storage.
    5. The pipeline uploads the bundle to cloud storage using the signed URL.
    6. The pipeline makes an API call to the backend service to leave a comment on the PR.

    QR code that developers can scan on their mobile device

    QR code that developers can scan on their mobile device

    The PR comment records that the bundle upload is successful and gives developers three options to download the bundle, which include

    • A QR code that developers can scan on their mobile device, which opens the app on their device and downloads the bundle.
    • A bundle ID that developers can use to download the bundle without exiting the app using the Tophat screen. This is useful when developers are using a simulator/emulator.
    • A link that developers can use to download the bundle directly from a GitHub notification email. This allows developers to tophat without opening the PR on their computer.

    Loading the JavaScript Bundle

    Once the CI pipeline uploads the JavaScript bundle to cloud storage, developers need a way to easily download the bundle and load the changes in their app. We built a React Native component library providing a user interface (called the Tophat screen) for developers to load the changes.

    The Tophat Component Library 

    The component library registers the Tophat screen as a separate component and a URL listener that handles specific deep link events. All developers need to do is to inject the component into the root level of their application.

    The library also includes an action that shows the Tophat screen on demand. Developers open the Tophat screen to see the current bundle version or to reset the current bundle. In the example below, we use the action to construct a “Show Tophat” button, which opens the Tophat screen on press.

    The Tophat Screen

    The Tophat screen looks like a modal or an overlay in the app, but it’s separate from the app’s component tree, so it introduces a non-intrusive UI for React Native apps. 

    React Native tophat screen in action
    Tophat screen in action

    Here’s an example of using the tool to load a different commit in our Local Delivery app.

    The Typical Tophat Workflow

    Typical React Native tophat workflow

    Typical React Native tophat workflow

    The typical workflow using the Tophat library looks like:

    1. The developer scans the QR code or clicks the link in the GitHub PR comment that resolves to an URL in the format “{appId}://tophat_bundle/{bundle_id}”.
    2. The URL opens the app on the developer’s device and triggers a deep link event.
    3. The component library captures the event and parses the URL for the app ID and bundle ID.
    4. If the app ID in the URL matches the current app, then the library makes an API call to the backend service requesting a signed download URL and metadata for the corresponding JavaScript bundle.
    5. The Tophat screen displays the bundle’s metadata and asks the developer to confirm whether or not this is the bundle they wish to download.
    6. Upon confirmation, the library downloads the JavaScript bundle from cloud storage and saves the bundle’s metadata using local storage. Then it decompresses the bundle and restarts the app.
    7. When the app is restarting, it detects the new JavaScript bundle and starts the app using that bundle instead.
    8. Once the developer verifies the changes, they can reset the bundle in the Tophat screen.

    Managing Bundles Using a Backend Service

    In the native tophatting project, we didn’t use a backend service. However, we decided to use a backend service to handle most of the business logic in this tool. This creates additional maintenance and infrastructure cost to the project, but we believe its proven advantages outweigh its costs. There are two main reasons why we chose to use a backend service:

    1. It abstracts away authentication and implementation details with third-party services.
    2. It provides a scalable solution of storing metadata that enables better UI capabilities.

    Abstracting Implementation Details

    The tool requires the use of Google Cloud’s and GitHub’s SDKs, which means the client needs to have an authentication token for each of these services. If a backend service didn’t exist, then each app and its respective CI pipeline would need to configure their own tokens. The CI pipeline and the component library would also need to have consistent storage path formats. This introduces extra complexity and adds additional steps in the tool’s installation process.

    The backend service abstracts away the interaction with third party services such as authentication, uploading assets, and creating Github comments. The service also generates each bundle’s storage path, eliminating the issue of having inconsistent paths across different components. 

    Storing Metadata

    Each JavaScript bundle has important metadata that developers need to quickly retrieve along with the bundle. A solution used by the native tophatting project is to store the metadata in the filename of the build artifact. We could leverage the same technique to store the metadata in the JavaScript Bundle’s storage path. However, this isn’t scalable if we wish to include additional metadata. For example, if we want to add the author of the commit to the bundle’s metadata, it would introduce a change in the storage path format, which requires changes in every app’s CI pipeline and the component library.

    By using a backend service, we store more detailed metadata in a SQL database and decouple it from the bundle’s storage. This opens up the possibility of adding features like a confirmation step before downloading the bundle and querying bundles by app IDs or branch names.

    What’s Next?

    The first iteration of the tool is complete and React Native developers use the tool to tophat each other’s pull request by simply scanning a QR code or entering a bundle ID. There are improvements that we want to make in the future:

    • Building and uploading the JavaScript bundle directly from the command line.
    • Showing a list of available JavaScript bundles in the Tophat screen.
    • Detecting native code changes.
    • Designing a better UI in the Tophat screen.

    Almost all of the React Native projects at Shopify are now using the tool and my team keeps working to improve the tophatting experience for our React Native developers.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Continue reading

    5 Ways to Improve Your React Native Styling Workflow

    5 Ways to Improve Your React Native Styling Workflow

    In April, we announced Shop, our digital shopping assistant that brings together the best features of Arrive and Shop Pay. The Shop app started from our React Native codebase for our previous package tracking app Arrive, with every screen receiving a complete visual overhaul to fit the new branding.

    While our product designers worked on introducing a whole new design system that would decide the look and feel of the app, we on the engineering side took the initiative to evolve our thinking around how we work with styling of screens and components. The end-product became Restyle, our open source library that allowed us to move forward quickly and easily in our transformation from Arrive to Shop.

    I'll walk you through the styling best practices we learned through this process. They served as the guiding principles for the design of Restyle. However anyone working with a React app can benefit from applying these best practices, with or without using our library.

    The Questions We Needed to Answer

    We faced a number of problems with our current approach in Arrive, and these were the questions we needed to answer to take our styling workflow to the next level:

    • With a growing team working in different countries and time zones, how do we make sure that the app keeps a consistent style throughout all of its different screens?
    • What can we do to make it easy to style the app to look great on multiple different device sizes and formats?
    • How do we allow the app to dynamically adapt its theme according to the user’s preferences, to support for example dark mode?
    • Can we make working with styles in React Native a more enjoyable experience?

    With these questions in place, we came up with the following best practices that provided answers to them. 

    #1. Create a Design System

    A prerequisite for being able to write clean and consistent styling code is for the design of the app to be based on a clean and consistent design system. A design system is commonly defined as a set of rules, constraints and principles that lay the foundation for how the app should look and feel. Building a complete design system is a topic far too big to dig into here, but I want to point out three important areas that the system should define its rules for.


    Size and spacing are the two parameters used when defining the layout of an app. While sizes often vary greatly between different components presented on a screen, the spacing between them should often stay as consistent as possible to create a coherent look. This means that it’s preferred to stick to a small set of predefined spacing constants that’s used for all margins and paddings in the app.

    There are many conventions to choose between when deciding how to name your spacing constants, but I've found the t-shirt size scale (XS, S, M, L, XL, etc) work best. The order of sizes are easy to understand, and the system is extensible in both directions by prefixing with more X’s.


    When defining colors in a design system, it’s important not only to choose which colors to stick with, but also how and when they should be used. I like to split these definitions up into two layers:

    • The color palette - This is the set of colors that’s used. These can be named quite literally, e.g. “Blue”, “Light Orange”, “Dark Red”, “White”, “Black”.
    • The semantic colors - A set of names that map to and describe how the color palette should be applied, that is, what their functions are. Some examples are “Primary”, “Background”, “Danger”, “Failure”. Note that multiple semantic colors can be mapped to the same palette color, for example, both the “Danger” and “Failure” color could both map to “Dark Red”.

    When referring to a color in the app, it should be through the semantic color mapping. This makes it easy to later change, for example, the “Primary” color to be green instead of blue. It also allows you to easily swap out color schemes on the fly to, for example, easily accommodate a light and dark mode version of the app. As long as elements are using the “Background” semantic color, you can swap it between a light and dark color based on the chosen color scheme.


    Similar to spacing, it‘s best to stick to a limited set of font families, weights and sizes to achieve a coherent look throughout the app. A grouping of these typographic elements are defined together as a named text variant. Your “Header” text might be size 36, have a bold weight, and use the font family “Raleway”. Your “Body” text might use the “Merriweather” family with a regular font weight, and size 16.

    #2. Define a Theme Object

    A carefully put together design system following the spacing, colour, and typography practices above should be defined in the app‘s codebase as a theme object. Here‘s how a simple version might look:

    All values relating to your design system and all uses of these values in the app should be through this theme object. This makes it easy to tweak the system by only needing to edit values in a single source of truth.

    Notice how the palette is kept private to this file, and only the semantic color names are included in the theme. This enforces the best practice with colors in your design system.

    #3. Supply the Theme through React‘s Context API

    Now that you've defined your theme object, you might be tempted to start directly importing it in all the places where it's going to be used. While this might seem like a great approach at first, you’ll quickly find its limitations once you’re looking to work more dynamically with the theming. In the case of wanting to introduce a secondary theme for a dark mode version of the app, you would need to either:

    • Import both themes (light and dark mode), and in each component determine which one to use based on the current setting, or
    • Replace the values in the global theme definition when switching between modes. 

    The first option will introduce a large amount of tedious code repetition. The second option will only work if you force React to re-render the whole app when switching between light and dark modes, which is typically considered a bad practice. If you have a dynamic value that you want to make available to all components, you’re better off using React’s context API. Here’s how you would set this up with your theme:

    The theme in React’s context will make sure that whenever the app changes between light and dark mode, all components that access the theme will automatically re-render with the updated values. Another benefit of having the theme in context is being able to swap out themes on a sub-tree level. This allows you to have different color schemes for different screens in the app, which could, for example, allow users to customize the colors of their profile page in a social app.

    #4. Break the System into Components

    While it‘s entirely possible to keep reaching into the context to grab values from the theme for any view that needs to be styled, this will quickly become repetitious and overly verbose. A better way is to have components that directly map properties to values in the theme. There are two components that I find myself needing the most when working with themes this way, Box and Text.

    The Box component is similar to a View, but instead of accepting a style object property to do the styling, it directly accepts properties such as margin, padding, and backgroundColor. These properties are configured to only receive values available in the theme, like this:

    The “m” and “s” values here map to the spacings we‘ve defined in the theme, and “primary” maps to the corresponding color. This component is used in most places where we need to add some spacing and background colors, simply by wrapping it around other components.

    While the Box component is handy for creating layouts and adding background colors, the Text component comes into play when displaying text. Since React Native already requires you to use their Text component around any text in the app, this becomes a drop in replacement for it:

    The variant property applies all the properties that we defined in the theme for textVariant.header, and the color property follows the same principle as the Box component’s backgroundColor, but for the text color instead.

    Here’s how both of these components would be implemented:

    Styling directly through properties instead of keeping a separate style sheet might seem weird at first. I promise that once you start doing it you’ll quickly start to appreciate how much time and effort you save by not needing to jump back and forth between components and style sheets during your styling workflow.

    #5. Use Responsive Style Properties

    Responsive design is a common practice in web development where alternative styles are often specified for different screen sizes and device types. It seems that this practice has yet to become commonplace within the development of React Native apps. The need for responsive design is apparent in web apps where the device size can range from a small mobile phone to a widescreen desktop device. A React Native app only targeting mobile devices might not work with the same extreme device size differences, but the variance in potential screen dimensions is already big enough to make it hard to find a one-size-fits-all solution for your styling.

    An app onboarding screen that displays great on the latest iPhone Pro will most likely not work as well with the limited screen estate available on a first generation iPhone SE. Small tweaks to the layout, spacing and font size based on the available screen dimensions are often necessary to craft the best experience for all devices. In responsive design this work is done by categorizing devices into a set of predefined screen sizes defined by their breakpoints, for example:

    With these breakpoints we're saying that anything below 321 pixels in width should fall in the category of being a small phone, anything above that but below 768 is a regular phone size, and everything wider than that is a tablet.

    With these set, let's expand our previous Box component to also accept specific props for each screen size, in this manner:

    Here's roughly how you would go about implementing this functionality:

    In a complete implementation of the above you would ideally use a hook based approach to get the current screen dimensions that also refreshes on change (for example when changing device orientation), but I’ve left that out in the interest of brevity.

    #6. Enforce the System with TypeScript

    This final best practice requires you to be using TypeScript for your project.

    TypeScript and React pair incredibly well together, especially when using a modern code editor such as Visual Studio Code. Instead of relying on React’s PropTypes validation, which only happens when the component is rendered at run-time, TypeScript allows you to validate these types as you are writing the code. This means that if TypeScript isn’t displaying any errors in your project, you can rest assured that there are no invalid uses of the component anywhere in your app.

    Using the prop validation mechanisms of TypeScript

    Using the prop validation mechanisms of TypeScript

    TypeScript isn’t only there to tell you when you’ve done something wrong, it can also help you in using your React components correctly. Using the prop validation mechanisms of TypeScript, we can define our property types to only accept values available in the theme. With this, your editor will not only tell you if you're using an unavailable value, it will also autocomplete to one of the valid values for you.

    Here's how you need to define your types to set this up:

    Evolve Your Styling Workflow

    Following the best practices above via our Restyle library made a significant improvement to how we work with styles in our React Native app. Styling has become more enjoyable through the use of Restyle’s Box and Text components, and by restricting the options for colors, typography and spacing it’s now much easier to build a great-looking prototype for a new feature before needing to involve a designer in the process. The use of responsive style properties has also made it easy to tailor styles to specific screen sizes, so we can work more efficiently with crafting the best experience for any given device.

    Restyle’s configurability through theming allowed us to maintain a theme for Arrive while iterating on the theme for Shop. Once we were ready to flip the switch, we just needed to point Restyle to our new theme to complete the transformation. We also introduced dark mode into the app without it being a concrete part of our roadmap—we found it so easy to add we simply couldn't resist doing it.

    If you've asked some of the same questions we posed initially, you should consider adopting these best practices. And if you want a tool that helps you along the way, our Restyle library is there to guide you and make it an enjoyable experience.

    Wherever you are, your next journey starts here! Intrigued? We’d love to hear from you.

    Continue reading

    How to Track State with Type 2 Dimensional Models

    How to Track State with Type 2 Dimensional Models

    Application databases are generally designed to only track current state. For example, a typical user’s data model will store the current settings for each user. This is known as a Type 1 dimension. Each time they make a change, their corresponding record will be updated in place:







    2019-01-01 12:14:23

    2019-01-01 12:14:23



    2019-01-01 15:21:45

    2019-01-02 05:20:00


    This makes a lot of sense for applications. They need to be able to rapidly retrieve settings for a given user in order to determine how the application behaves. An indexed table at the user grain accomplishes this well.

    But, as analysts, we not only care about the current state (how many users are using feature “X” as of today), but also the historical state. How many users were using feature “X” 90 days ago? What is the 30 day retention rate of the feature? How often are users turning it off and on? To accomplish these use cases we need a data model that tracks historical state:








    2019-01-01 12:14:23

    2019-01-01 12:14:23




    2019-01-01 15:21:45

    2019-01-02 05:20:00




    2019-01-02 05:20:00




    This is known as a Type 2 dimensional model. I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).

    Implementing Type 2 Dimensional Models at Shopify

    I currently work as a data scientist in the International product line at Shopify. Our product line is focused on adapting and scaling our product around the world. One of the first major efforts we undertook was translating Shopify’s admin in order to make our software available to use in multiple languages.

    Shopify admin translatedShopify admin translated

    At Shopify, data scientists work across the full stack—from data extraction and instrumentation, to data modelling, dashboards, analytics, and machine learning powered products. As a product data scientist, I’m responsible for understanding how our translated versions of the product are performing. How many users are adopting them? How is adoption changing over time? Are they retaining the new language, or switching back to English? If we default a new user from Japan into Japanese, are they more likely to become a successful merchant than if they were first exposed to the product in English and given the option to switch? In order to answer these questions, we first had to figure out how our data could be sourced or instrumented, and then eventually modelled.

    The functionality that decides which language to render Shopify in is based on the language setting our engineers added to the users data model. 








    2019-01-01 12:14:23

    2019-06-01 07:15:03




    2019-02-02 11:00:35

    2019-02-02 11:00:35


    User 1 will experience the Shopify admin in English, User 2 in Japanese, etc... Like most data models powering Shopify’s software, the users model is a Type 1 dimension. Each time a user changes their language, or any other setting, the record gets updated in place. As I alluded to above, this data model doesn’t allow us to answer many of our questions as they involve knowing what language a given user is using at a particular point in time. Instead, we needed a data model that tracked user’s languages over time. There are several ways to approach this problem.

    Options For Tracking State

    Modify Core Application Model Design

    In an ideal world, the core application database model will be designed to track state. Rather than having a record be updated in place, the new settings are instead appended as a new record. Due to the fact that the data is tracked directly in the source of truth, you can fully trust its accuracy. If you’re working closely with engineers prior to the launch of a product or new feature, you can advocate for this data model design. However, you will often run into two challenges with this approach:

    1. Engineers will be very reluctant to change the data model design to support analytical use cases. They want the application to be as performant as possible (as should you), and having a data model which keeps all historical state is not conducive to that.
    2. Most of the time, new features or products are built on top of pre-existing data models. As a result, modifying an existing table design to track history will come with an expensive and risky migration process, along with the aforementioned performance concerns.

    In the case of rendering languages for the Shopify admin, the language field was added to the pre-existing users model, and updating this model design was out of the question.

    Stitch Together Database Snapshots

    System that extracts newly created or updated records from the application databases on a fixed schedule

    System that extracts newly created or updated records from the application databases on a fixed schedule

    At most technology companies, snapshots of application database tables are extracted into the data warehouse or data lake. At Shopify, we have a system that extracts newly created or updated records from the application databases on a fixed schedule.

    Using these snapshots, one can leverage them as an input source for building a Type 2 dimension. However, given the fixed schedule nature of the data extraction system, it is possible that you will miss updates happening between one extract and the next.

    If you are using dbt for your data modelling, you can leverage their nice built-in solution for building Type 2’s from snapshots!

    Add Database Event Logging

    Newly created or updated record is stored in this log stored in Kafka

    Newly created or updated record is stored in this log in Kafka

    Another alternative is to add a new event log. Each newly created or updated record is stored in this log. At Shopify, we rely heavily on Kafka as a pipeline for transferring real-time data between our applications and data land, which makes it an ideal candidate for implementing such a log.

    If you work closely with engineers, or are comfortable working in your application codebase, you can get new logging in place that will stream any new or updated record to Kafka. Shopify is built on the Ruby on Rails web framework. Rails has something called “Active Record Callbacks”, which allows you to trigger logic before or after an alternation of an object’s (read “database records”) state. For our use case, we can leverage the after_commit callback to log a record to Kafka after it has been successfully created or updated in the application database.

    While this option isn’t perfect, and comes with a host of other caveats I will discuss later, we ended up choosing it as it was the quickest and easiest solution to implement that provided the required granularity.

    Type 2 Modelling Recipes

    Below, I’ll walk through some recipes for building Type 2 dimensions from the event logging option discussed above. We’ll stick with our example of modelling user’s languages over time and work with the case where we’ve added event logging to our database model from day 1 (i.e. when the table was first created). Here’s an example of what our user_update event log would look like:







    2019-01-01 12:14:23

    2019-01-01 12:14:23



    2019-02-02 11:00:35

    2019-02-02 11:00:35



    2019-02-02 11:00:35

    2019-02-02 12:15:06



    2019-02-02 11:00:35

    2019-02-02 13:01:17



    2019-02-02 11:00:35

    2019-02-02 14:10:01


    This log describes the full history of the users data model.

    1. User 1 gets created at 2019-01-01 12:14:23 with English as the default language.
    2. User 2 gets created at 2019-02-02 11:00:35 with English as the default language.
    3. User 2 decides to switch to French at 2019-02-02 12:15:06.
    4. User 2 changes some other setting that is tracked in the users model at 2019-02-02 13:01:17.
    5. User 2 decides to switch back to English at 2019-02-02 14:10:01.

    Our goal is to transform this event log into a Type 2 dimension that looks like this:








    2019-01-01 12:14:23





    2019-02-02 11:00:35

    2019-02-02 12:15:06




    2019-02-02 12:15:06

    2019-02-02 14:10:01




    2019-02-02 14:10:01




    We can see that the current state for all users can easily be retrieved with a SQL query that filters for WHERE is_current. These records also have a null value for the valid_to column, since they are still in use. However, it is common practice to fill these nulls with something like the timestamp at which the job last ran, since the actual values may have changed since then.


    Due to Spark’s ability to scale to massive datasets, we use it at Shopify for building our data models that get loaded to our data warehouse. To avoid the mess that comes with installing Spark on your machine, you can leverage a pre-built docker image with PySpark and Jupyter notebook pre-installed. If you want to play around with these examples yourself, you can pull down this docker image with docker pull jupyter/pyspark-notebook:c76996e26e48 and then run docker run -p 8888:8888 jupyter/pyspark-notebook:c76996e26e48 to spin up a notebook where you can run PySpark locally.

    We’ll start with some boiler plate code to create a Spark dataframe containing our sample of user update events:

    With that out of the way, the first step is to filter our input log to only include records where the columns of interest were updated. With our event instrumentation, we log an event whenever any record in the users model is updated. For our use case, we only care about instances where the user’s language was updated (or created for the first time). It’s also possible that you will get duplicate records in your event logs, since Kafka clients typically support “at-least-once” delivery. The code below will also filter out these cases:

    We now have something that looks like this:






    2019-01-01 12:14:23



    2019-02-02 11:00:35



    2019-02-02 12:15:06



    2019-02-02 14:10:01


    The last step is fairly simple; we produce one record per period for which a given language was enabled:








    2019-01-01 12:14:23

    2020-05-23 00:56:49




    2019-02-02 11:00:35

    2019-02-02 12:15:06




    2019-02-02 12:15:06

    2019-02-02 14:10:01




    2019-02-02 14:10:01

    2020-05-23 00:56:49




    dbt is an open source tool that lets you build new data models in pure SQL. It’s a tool we are currently exploring using at Shopify to supplement modelling in PySpark, which I am really excited about. When writing PySpark jobs, you’re typically taking SQL in your head, and then figuring out how you can translate it to the PySpark API. Why not just build them in pure SQL? dbt lets you do exactly that:

    With this SQL, we have replicated the exact same steps done in the PySpark example and will produce the same output shown above.

    Gotchas, Lessons Learned, and The Path Forward

    I’ve leveraged the approaches outlined above with multiple data models now. Here are a few of the things I’ve learned along the way.

    1. It took us a few tries before we landed on the approach outlined above. 

    In some initial implementations, we were logging the record changes before they had been successfully committed to the database, which resulted in some mismatches in the downstream Type 2 models. Since then, we’ve been sure to always leverage the after_commit callback based approach.

    2. There are some pitfalls with logging changes from within the code:

    • Your event logging becomes susceptible to future code changes. For example, an engineer refactors some code and removes the after_commit call. These are rare, but can happen. A good safeguard against this is to leverage tooling like the CODEOWNERS file, which notifies you when a particular part of the codebase is being changed.
    • You may miss record updates that are not triggered from within the application code. Again, these are rare, but it is possible to have an external process that is not using the Rails User model when making changes to records in the database.

    3. It is possible to lose some events in the Kafka process.

    For example, if one of the Shopify servers running the Ruby code were to fail before the event was successfully emitted to Kafka, you would lose that update event. Same thing if Kafka itself were to go down. Again, rare, but nonetheless something you should be willing to live with. There are a few ways you can mitigate the impact of these events:

    • Have some continuous data quality checks running that compare the Type 2 dimensional model against the current state and checks for discrepancies.
    • If & when any discrepancies are detected, you could augment your event log using the current state snapshot.

    4. If deletes occur in a particular data model, you need to implement a way to handle this.

    Otherwise, the deleted events will be indistinguishable from normal create or update records with the logging setup I showed above. Here are some ways around this:

    • Have your engineers modify the table design to use soft deletes instead of hard deletes. 
    • Add a new field to your Kafka schema and log the type of event that triggered the change, i.e. (create, update, or delete), and then handle accordingly in your Type 2 model code.

    Implementing Type 2 dimensional models for Shopify’s admin languages was truly an iterative process and took investment from both data and engineering to successfully implement. With that said, we have found the analytical value of the resulting Type 2 models well worth the upfront effort.

    Looking ahead, there’s an ongoing project at Shopify by one of our data engineering teams to store the MySQL binary logs (binlogs) in data land. Binlogs are a much better source for a log of data modifications, as they are directly tied to the source of truth (the MySQL database), and are much less susceptible to data loss than the Kafka based approach. With binlog extractions in place, you don’t need to add separate Kafka event logging to every new model as changes will be automatically tracked for all tables. You don’t need to worry about code changes or other processes making updates to the data model since the binlogs will always reflect the changes made to each table. I am optimistic that with binlogs as a new, more promising source for logging data modifications, along with the recipes outlined above, we can produce Type 2s out of the box for all new models. Everybody gets a Type 2!

    Additional Information

    SQL Query Recipes

    Once we have our data modelled as a Type 2 dimension, there are a number of questions we can start easily answering:

    Are you passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    ShipIt! Presents: A Look at Shopify's API Health Report

    ShipIt! Presents: A Look at Shopify's API Health Report

    On July 17, 2020, ShipIt!, our monthly event series, presented A Look at Shopify's API Health Report. Our guests, Shuting Chang, Robert Saunders, Karen Xie, and Vrishti Dutta join us to talk about Shopify’s API Health Report, the tool, this multidisciplinary team, built to surface breaking changes affecting Shopify Partner apps. 

    Additional Information

    The links shared to the audience during the event:

    API Versioning at Shopify

    Shopify GraphQL

    API Support Channels

    Other Links

    If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions.

    Continue reading

    How Shopify Reduced Storefront Response Times with a Rewrite

    How Shopify Reduced Storefront Response Times with a Rewrite

    In January 2019, we set out to rewrite the critical software that powers all online storefronts on Shopify’s platform to offer the fastest online shopping experience possible, entirely from scratch and without downtime.

    The Storefront Renderer is a server-side application that loads a Shopify merchant's storefront Liquid theme, along with the data required to serve the request (for example product data, collection data, inventory information, and images), and returns the HTML response back to your browser. Shaving milliseconds off response time leads to big results for merchants on the platform as buyers increasingly expect pages to load quickly, and failing to deliver on performance can hinder sales, not to mention other important signals like SEO.

    The previous storefront implementation‘s development, started over 15 years ago when Tobi launched Snowdevil, lived within Shopify’s Ruby on Rails monolith. Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith: it has much stricter performance requirements and can accept more complexity implementation-wise to improve performance, whereas other components (such as payment processing) need to favour correctness and readability.

    In addition to this difference in paradigm, storefront requests progressively became slower to compute as we saw more storefront traffic on the platform. This performance decline led to a direct impact on our merchant storefronts’ performance, where time-to-first-byte metrics from Shopify servers slowly crept up as time went on.

    Here’s how the previous architecture looked:

    Old Storefront Implementation
    Old Storefront Implementation

    Before, the Rails monolith handled almost all kinds of traffic: checkout, admin, APIs, and storefront.

    With the new implementation, traffic routing looks like this:

    New Storefront Implementation
    New Storefront Implementation

    The Rails monolith still handles checkout, admin, and API traffic, but storefront traffic is handled by the new implementation.

    Designing the new storefront implementation from the ground up allowed us to think about the guarantees we could provide: we took the opportunity of this evergreen project to set us up on strong primitives that can be extended in the future, which would have been much more difficult to retrofit in the legacy implementation. An example of these foundations is the decision to design the new implementation on top of an active-active replication setup. As a result, the new implementation always reads from dedicated read replicas, improving performance and reducing load on the primary writers.

    Similarly, by rebuilding and extracting the storefront-related code in a dedicated application, we took the opportunity to think about building the best developer experience possible: great debugging tools, simple onboarding setup, welcoming documentation, and so on.

    Finally, with improving performance as a priority, we work to increase resilience and capacity in high load scenarios (think flash sales: events where a large number of buyers suddenly start shopping on a specific online storefront), and invest in the future of storefront development at Shopify. The end result is a fast, resilient, single-purpose application that serves high-throughput online storefront traffic for merchants on the Shopify platform as quickly as possible.

    Defining Our Success Criteria

    Once we clearly outlined the problem we’re trying to solve and scoped out the project, we defined three main success criteria:

    • Establishing feature parity: for a given input, both implementations generate the same output.
    • Improving performance: the new implementation runs on active-active replication setup and minimizes server response times.
    • Improving resilience and capacity: in high-load scenarios, the new implementation generally sustains traffic without causing errors.

    Building A Verifier Mechanism

    Before building the new implementation, we needed a way to make sure that whatever we built would behave the same way as the existing implementation. So, we built a verifier mechanism that compares the output of both implementations and returns a positive or negative result depending on the outcome of the comparison.

    This verification mechanism runs on storefront traffic in production, and it keeps track of verification results so we can identify differences in output that need fixing. Running the verifier mechanism on production traffic (in addition to comparing the implementations locally through a formal specification and a test suite) lets us identify the most impactful areas to work on when fixing issues, and keeps us focused on the prize: reaching feature parity as quickly as possible. It’s desirable for multiple reasons:

    • giving us an idea of progress and spreading the risk over a large amount of time
    • shortening the period of time that developers at Shopify work with two concurrent implementations at once
    • providing value to Shopify merchants as soon as possible.

    There are two parts to the entire verifier mechanism implementation:

    1. A verifier service (implemented in Ruby) compares the two responses we provide and returns a positive or negative result depending on the verification outcome. Similar to a `diff` tool, it lets us identify differences between the new and legacy implementations.
    2. A custom nginx routing module (implemented in Lua on top of OpenResty) sends a sample of production traffic to the verifier service for verification. This module acts as a router depending on the result of the verifications for subsequent requests.

    The following diagram shows how each part interacts with the rest of the architecture:

    Legacy implementation and new implementation at the same conceptual layer
    Legacy implementation and new implementation at the same conceptual layer

    The legacy implementation (the Rails monolith) still exists, and the new implementation (including the Verifier service) is introduced at the same conceptual layer. Both implementations are placed behind a custom routing module that decides where to route traffic based on the request attributes and the verification data for this request type. Let’s look at an example.

    When a buyer’s device sends an initial request for a given storefront page (for example, a product page from shop XYZ), the request is sent to Shopify’s infrastructure, at which point an nginx instance handles it. The routing module considers the request attributes to determine if other shop XYZ product page requests have previously passed verification.

    First request routed to Legacy implementation
    First request routed to Legacy implementation

    Since this is the first request of this kind in our example, the routing module sends the request to the legacy implementation to get a baseline reference that it will use for subsequent shop XYZ product page requests.

    Routing module sends original request and legacy implementation’s response to the new implementation
    Routing module sends original request and legacy implementation’s response to the new implementation

    Once the response comes back from the legacy implementation, the Lua routing module sends that response to the buyer. In the background, the Lua routing module also sends both the original request and the legacy implementation’s response to the new implementation. The new implementation computes a response to the original request and feeds both its response and the forwarded legacy implementation’s response to the verifier service. This is done asynchronously to make sure we’re not adding latency to responses we send to buyers, who don’t notice anything different.

    At this point, the verifier service received the responses from both the legacy and new implementations and is ready to compare them. Of course, the legacy implementation is assumed to be correct as it’s been running in production for years now (it acts as our reference point). We keep track of differences between the two implementations’ responses so we can debug and fix them later. The verifier service looks at both responses’ status code, headers, and body, ensuring they’re equivalent. This lets us identify any differences in the responses so we make sure our new implementation behaves like the legacy one.

    Time-related and randomness-related exceptions make it impossible to have exactly byte-equal responses, so we ignore certain patterns in the verifier service to relax the equivalence criteria. The verifier service uses a fixed time value during the comparison process and sets any random values to a known value so we reliably compare the outputs containing time-based and randomness-based differences.

    The verifier service sends comparison result back to the Lua module
    The verifier service sends comparison result back to the Lua module

    The verifier service sends the outcome of the comparison back to the Lua module, which keeps track of that comparison outcome for subsequent requests of the same kind.

    Dynamically Routing Requests To the New Implementation

    Once we had verified our new approach, we tested rendering a page using the new implementation instead of the legacy one. We iterated upon our verification mechanism to allow us to route traffic to the new implementation after a given number of successful verifications. Here’s how it works.

    Just like when we only verified traffic, a request arrives from a client device and hits Shopify’s architecture. The request is sent to both implementations, and both outputs are forwarded to the verifier service for comparison. The comparison result is sent back to the Lua routing module, which keeps track of it for future requests.

    When a subsequent storefront request arrives from a buyer and reaches the Lua routing module, it decides where to send it based on the previous verification results for requests similar to the current one (based on the request attributes

    For subsequent storefront requests, the Lua routing module decides where to send it
    For subsequent storefront requests, the Lua routing module decides where to send it

    If the request was verified multiple times in the past, and nearly all outcomes from the verifier service were “Pass”, then we consider the request safe to be served by the new implementation.

    If nearly all verifier service results are “Pass”, then it uses the new implementation
    If most verifier service results are “Pass”, then it uses the new implementation

    If, on the other hand, some verifications failed for this kind of request, we’ll play it safe and send the request to the legacy implementation.

    If most verifier service results are “Fail”, then it uses the old implementation
    If most verifier service results are “Fail”, then it uses the old implementation

    Successfully Rendering In Production

    With the verifier mechanism and the dynamic router in place, our first goal was to render one of the simplest storefront pages that exists on the Shopify platform: the password page that protects a storefront before the merchant makes it available to the public.

    Once we reached full parity for a single shop’s password page, we tested our implementation in production (for the first time) by routing traffic for this password page to the new implementation for a couple of minutes to test it out.

    Success! The new implementation worked in production. It was time to start implementing everything else.

    Increasing Feature Parity

    After our success with the password page, we tackled the most frequently accessed storefront pages on the platform (product pages, collection pages, etc). Diff by diff, endpoint by endpoint, we slowly increased the parity rate between the legacy and new implementations.

    Having both implementations running at the same time gave us a safety net to work with so that if we introduced a regression, requests would easily be routed to the legacy implementation instead. Conversely, whenever we shipped a change to the new implementation that would fix a gap in feature parity, the verifier service starts to report verification successes, and our custom routing module in nginx automatically starts sending traffic to the new implementation after a predetermined time threshold.

    Defining “Good” Performance with Apdex Scores

    We collected Apdex (Application Performance Index) scores on server-side processing time for both the new and legacy implementations to compare them.

    To calculate Apdex scores, we defined a parameter for a satisfactory threshold response time (this is the Apdex’s “T” parameter). Our threshold response time to define a frustrating experience would then be “above 4T” (defined by Apdex).

    We defined our “T” parameter as 200ms, which lines up with Google’s PageSpeed Insights recommendation for server response times. We consider server processing time below 200ms as satisfying and a server processing time of 800ms or more as frustrating. Anything in between is tolerated.

    From there, calculating the Apdex score for a given implementation consists of setting a time frame, and counting three values:

    • N, the total number of responses in the defined time frame
    • S, the number of satisfying responses (faster than 200ms) in the time frame
    • T, the number of tolerated responses (between 200ms and 800ms) in the time frame

    Then, we calculate the Apdex score: 

    By calculating Apdex scores for both the legacy and new implementations using the same T parameter, we had common ground to compare their performance.

    Methods to Improve Server-side Storefront Performance

    We want all Shopify storefronts to be fast, and this new implementation aims to speed up what a performance-conscious theme developer can’t by optimizing data access patterns, reducing memory allocations, and implementing efficient caching layers.

    Optimizing Data Access Patterns

    The new implementation uses optimized, handcrafted SQL multi-select statements maximizing the amount of data transferred in a single round trip. We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries.

    Reducing Memory Allocations

    We reduce the number of memory allocations as much as possible so Ruby spends less time in garbage collection. We use methods that apply modifications in place (such as #map!) rather than those that allocate more memory space (like #map). This kind of performance-oriented Ruby paradigm sometimes leads to code that’s not as simple as idiomatic Ruby, but paired with proper testing and verification, this tradeoff provides big performance gains. It may not seem like much, but those memory allocations add up quickly, and considering the amount of storefront traffic Shopify handles, every optimization counts.

    Implementing Efficient Caching Layers

    We implemented various layers of caching throughout the application to reduce expensive calls. Frequent database queries are partitioned and cached to optimize for subsequent reads in a key-value store, and in the case of extremely frequent queries, those are cached directly in application memory to reduce I/O latency. Finally, the results of full page renders are cached too, so we can simply serve a full HTTP response directly from cache if possible.

    Measuring Performance Improvement Successes

    Once we could measure the performance of both implementations and reach a high enough level of verified feature parity, we started migrating merchant shops. Here are some of the improvements we’re seeing with our new implementation:

    • Across all shops, average server response times for requests served by the new implementation are 4x to 6x faster than the legacy implementation. This is huge!
    • When migrating a storefront to the new implementation, we see that the Apdex score for server-side processing time improves by +0.11 on average.
    • When only considering cache misses (requests that can’t be served directly from the cache and need to be computed from scratch), the new implementation increases the Apdex score for server-side processing time by a full +0.20 on average compared to the previous implementation.
    • We heard back from merchants mentioning a 500ms improvement in time-to-first-byte metrics when the new implementation was rolled out to their storefront.

    So another success! We improved store performance in production.

    Now how do we make sure this translates to our third success criteria?

    Improving Resilience and Capacity

    While working on the new implementation, the Verifier service identified potential parity gaps, which helped tremendously. However, a few times we shipped code to production that broke in exceedingly rare edge cases that it couldn’t catch.

    As a safety mechanism, we made it so that whenever the new implementation would fail to successfully render a given request, we’d fall back to the legacy implementation. The response would be slower, but at least it was working properly. We used circuit breakers in our custom nginx routing module so that we’d open the circuit and start sending traffic to the legacy implementation if the new implementation was having trouble responding successfully. Read more on tuning circuit breakers in this blog post by my teammate Damian Polan.

    Increase Capacity in High-load Scenarios

    To ensure that the new implementation responds well to flash sales, we implemented and tweaked two mechanisms. The first one is an automatic scaling mechanism that adds or removes computing capacity in response to the amount of load on the current swarm of computers that serve traffic. If load increases as a result of an increase in traffic, the autoscaler will detect this increase and start provisioning more compute capacity to handle it.

    Additionally, we introduced in-memory cache to reduce load on external data stores for storefronts that put a lot of pressure on the platform’s resources. This provides a buffer that reduces load on very-high traffic shops.

    Failing Fast

    When an external data store isn’t available, we don’t want to serve buyers an error page. If possible, we’ll try to gracefully fall back to a safe way to serve the request. It may not be as fast, or as complete as a normal, healthy response, but it’s definitely better than serving a sad error page.

    We implemented circuit breakers on external datastores using Semian, a Shopify-developed Ruby gem that controls access to slow or unresponsive external services, avoiding cascading failures and making the new implementation more resilient to failure.

    Similarly, if a cache store isn’t available, we’ll quickly consider the timeout as a cache miss, so instead of failing the entire request because the cache store wasn’t available, we’ll simply fetch the data from the canonical data store instead. It may take longer, but at least there’s a successful response to serve back to the buyer.

    Testing Failure Scenarios and the Limits of the New Implementation

    Finally, as a way to identify potential resilience issues, the new implementation uses Toxiproxy to generate test cases where various resources are made available or not, on demand, to generate problematic scenarios.

    As we put these resilience and capacity mechanisms in place, we regularly ran load tests using internal tooling to see how the new implementation behaves in the face of a large amount of traffic. As time went on, we increased the new implementation’s resilience and capacity significantly, removing errors and exceptions almost completely even in high-load scenarios. With BFCM 2020 coming soon (which we consider as an organic, large-scale load test), we’re excited to see how the new implementation behaves.

    Where We’re at Currently

    We’re currently in the process of rolling out the new implementation to all online storefronts on the platform. This process happens automatically, without the need for any intervention from Shopify merchants. While we do this, we’re adding more features to the new implementation to bring it to full parity with the legacy implementation. The new implementation is currently at 90%+ feature parity with the legacy one, and we’re increasing that figure every day with the goal of reaching 100% parity to retire the legacy implementation.

    As we roll out the new implementation to storefronts we are continuing to see and measure performance improvements as well. On average, server response times for the new implementation are 4x faster than the legacy implementation. Rhone Apparel, a Shopify Plus merchant, started using the new implementation in April 2020 and saw dramatic improvements in server-side performance over the previous month.

    We learned a lot during the process of rewriting this critical piece of software. The strong foundations of this new implementation make it possible to deploy it around the world, closer to buyers everywhere, to reduce network latency involved in cross-continental networking, and we continue to explore ways to make it even faster while providing the best developer experience possible to set us up for the future.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    Building Reliable Mobile Applications

    Building Reliable Mobile Applications

    Merchants worldwide rely on Shopify's Point Of Sale (POS) app to operate their brick and mortar stores. Unlike many mobile apps, the POS app is mission-critical. Any downtime leads to long lineups, unhappy customers, and lost sales. The POS app must be exceptionally reliable, and any outages resolved quickly.

    Reliability engineering is a well-solved problem on the server-side. Back-end teams are able to push changes to production several times a day. So, when there's an outage, they can deploy fixes right away.

    This isn't possible in the case of mobile apps as app developers don’t own distribution. Any update to an app has to be submitted to Apple or Google for review. It's available to users for download only when they approve it. A review can take anywhere between a few hours to several days. Additionally, merchants may not install the update for weeks or even months.

    It's important to reduce the likelihood of bugs as much as possible and resolve issues in production as quickly as possible. In the following sections, we will detail the work we’ve done in both these areas over the last few years.


    We rely heavily on automation testing at Shopify. Every feature in the POS app has unit, integration, functional, and UI snapshot tests. Developers on the team write these simultaneously as they are adding new functionality to the code-base. Changes aren’t merged unless they include automated tests that cover them. These tests run for each push to the repo in our Continuous Integration environment. You can learn more about our testing strategy here.

    Besides automation testing, we also perform manual testing at various stages of development. Features like pairing a Bluetooth card reader or printing a receipt are difficult to test using automation. While we use mocks and stubs to test parts of such features, we manually test the full functionality.

    Sometimes tests that can be automated, inadvertently end up in the manual test suite. This causes us to spend time testing something manually when computers can do that for us. To avoid this, we audit the manual test suite every few months to weed out all such test cases.

    Code Reviews

    Changes made to the code-base aren’t merged until reviewed by other engineers on the team. These reviews allow us to spot and fix issues early in the life-cycle. This process works only if the reviewers are knowledgeable about that particular part of the code-base. As the team grew, finding the right people to do reviews became difficult.

    To overcome this, we have divided the code-base into components. Each team owns the component(s) that make up the feature that they are responsible for. Anyone can make changes to a component, but the team that owns it must review them before merging. We have set up Code Owners so that the right team gets added as reviewers automatically.

    Reviewers must test changes manually, or in Shopify speak, "tophat", before they approve them. This can be a very time-consuming process. They need to save their work, pull the changes, build them locally, and then deploy to a device or simulator. We have automated this process, so any Pull Request can be top-hatted by executing a single command:

    `dev android tophat <pull-request-url>`


    `dev ios tophat <pull-request-url>`


    You can learn more about mobile tophatting at Shopify here.

    Release Management

    Historically, updates to POS were shipped whenever the team was “ready.” When the team decided it was time to ship, a release candidate was created, and we spent a few hours testing it manually before pushing it to the app stores.

    These ad-hoc releases made sense when only a handful of engineers were working on the app. As the team grew, our release process started to break down. We decided to adopt the release train model and started shipping monthly.

    This method worked for a few months, but the team grew so fast that it wasn’t working anymore. During this time, we went from being a single engineering team to a large team of teams. Each of these teams is responsible for a particular area of the product. We started shipping large changes every month, so testing release candidates was taking several days.

    In 2018, we decided to switch to weekly releases. At first, this seemed counter-intuitive as we were doing the work to ship updates more often. In practice, it provided several benefits:

    • The number of changes that we had to test manually reduced significantly.
    • Teams weren’t as stressed about missing a release train as the next train left in a few days.
    • Non-critical bug fixes could be shipped in a few days instead of a month.

    We then made it easier for the team to ship updates every week by introducing Release Captain and ShipIt Mobile.

    Release Captain

    Initially, the engineering lead(s) were responsible for shipping updates, which included:
    • making sure all the changes are merged before the cut-off
    • incrementing the build and version numbers
    • updating the release notes
    • making sure the translations are complete
    • creating release candidates for manual testing
    • triaging bugs found during testing and getting them fixed
    • submitting the builds to app stores
    • updating the app store listings
    • monitoring the rollout for any major bugs or crashes

    As you can see, this is quite involved and can take a lot of time if done by the same person every week. Luckily, we had quite a large team, so we decided to make this a rotating responsibility.

    Each week, the engineer responsible for the release is called the Release Captain. They work on shipping the release so that the rest of the team can focus on testing, fixing bugs, or working on future releases.

    Each engineer on the team is the Release Captain for two weeks before the next engineer in the schedule takes over. We leverage PagerDuty to coordinate this, and it makes it very easy for everyone to know when they will be Release Captain next. It also simplifies planning around vacations, team offsites, etc.

    To simplify things even further, we configured our friendly chatbot, spy, to automatically announce when a new Release Captain shift begins.

    ShipIt Mobile

    We’ve automated most of the manual work involved in doing releases using ShipIt Mobile. With just a few clicks, the Release Captain can generate a new release candidate.

    Once ready, the rest of the team is automatically notified in Slack to start testing.

    After fixing all the bugs found, the update is submitted to the app store with just a single click. You can learn more about ShipIt Mobile here. These improvements not only make weekly releases easier, but they also make it significantly faster to ship hotfixes in case of a critical issue in production.

    Staged Rollouts

    Despite our best efforts, bugs sometimes slip into production. To reduce the surface area of a disruption, we make the updates available only to a small fraction of our user base at first. We then monitor the release to make sure there are no crashes or regressions. If everything goes well, we gradually increase the percentage of users the update is available to over the next few days. This is done using Phased Releases and Staged Rollouts in iOS AppStore and Google Play, respectively.

    The only exception to this approach is when a fix for a critical issue needs to go out immediately. In such cases, we make the update available to 100% of the users right away. We also can block users from using the app until they update to the latest version.

    We do this by having the POS app query the server for the minimum supported version that we set. If the current version is older than that, the app blocks the UI and provides update instructions. This is quite disruptive and can be annoying to merchants who are trying to make a sale. So we do it very rarely and only for critical security issues.

    Beta Flags

    Staged rollouts are useful for limiting how many users get the latest changes. But, they don’t provide a way to explicitly pick which users. When building new features, we often handpick a few merchants to take part in early-access. During this phase, they get to try the new features and give us feedback that we can work on before a final release.

    To do that, we put features, and even big refactors behind server-side beta flags. Only merchants whose stores we have explicitly set a beta flag will see the app’s new feature. This makes it easy to run closed betas with selected merchants. We also can do staged rollouts for beta flags, which gives us another layer of flexibility.

    Automated Monitoring and Alerts

    When something goes wrong in production, we want to be the first to know about it. The POS app and backend is instrumented with comprehensive metrics, reported in real-time. Using these metrics, we have dashboards set up to track the health of the product in production.

    Using these dashboards, we can check the health of any feature in a geography with just a few clicks. For example, the % of successful chip transactions made using a VISA credit card with the Tap, Chip & Swipe reader in the UK, or the % of successful tap transactions made using an Interac debit card with the Tap & Chip reader in Canada for a particular merchant.

    While this is handy, we didn’t want to have to keep checking these dashboards for anomalies all the time. Instead, we wanted to get notified when something goes wrong. This is important because while most of our engineering team is in North America, Shopify POS is used worldwide.

    This is harder to do than it may seem because the volume of commerce varies throughout the year. Time of day, day of the week, holidays, seasons, and even the ongoing pandemic affect how much merchants are able to sell. Setting manual thresholds to detect issues can cause a lot of false negatives and alert fatigue. To overcome this, we leverage Datadog’s Anomaly Detection. Once the selected algorithm has enough data to establish a baseline, alerts will only get fired if there’s an anomaly for that particular time of the year.

    We direct these alerts to Slack so that the right folks can investigate and fix them.

    Handling Outages

    Air Traffic Control

    In the early days of POS, bugs and outages were reported in the team Slack channel, and whoever on the team had the bandwidth, investigated them. This worked well when we had a handful of developers, but this approach didn’t scale as the team grew. Issues kept going to just a few folks who had the most context, and teams kept getting distracted from regular project work, causing delays.

    To fix this, we set up a rotating on-call schedule called Retail ATC (Air Traffic Control). Every week, there is a group of developers on the team dedicated to monitoring how things are working in production and handling outages. These developers are responsible only for this and are not expected to contribute to regular project work. When there are no outages, ATCs spend time tackling tech debt and helping our Technical Merchant Support team.

    Every developer on the team is on-call for two weeks at a time. The first week they are Primary ATC, and the next week they are Secondary ATC. Primary ATC is paged when something goes wrong, and they are responsible for triaging and investigating it. If they need help or are unavailable (commute time, connectivity issues, etc.), the Secondary ATC is paged. ATCs are not expected to fix all issues that arise by themselves, while often they can. They are instead responsible for working with the team that has the most context.


    Since we offer the POS app on both Android on iOS, we have ATC schedules for developers that work on each of those apps. Some areas, like payments, for instance, need a lot of domain knowledge to investigate issues. So we have dedicated ATCs for developers that work in those areas.

    Having folks dedicated to handling issues in production frees up the rest of the team to focus on regular project work. This approach has greatly reduced the amount of context switching teams had to do. It has also reduced the stress that comes with the responsibility of working on a mission-critical mobile application.

    Over the last couple of years, ATC has also become a great way for us to help new team members onboard faster. Investigating bugs and outages exposes them to various tools and parts of our codebase in a short amount of time. This allows them to become more self-sufficient quickly. However, being on-call can be stressful. So, we only add them to the schedule after they have been on the team for a few months and have undergone training. We also pair them with more experienced folks when they go on call.

    Incident Management

    When an outage occurs, it must be resolved as quickly as possible. To do this, we have a set of best practices that the team can follow so that we can spend more time investigating the issue vs figuring out how to do something.

    An incident is started by the ATC in response to an automated alert. ATCs use our ChatOps tools to start the incident in a dedicated Slack channel.


    Incidents are always started in the same channel, and all communication happens in it. This is to ensure that there is a single source of information for all stakeholders.

    As the investigation goes on, findings are documented by adding the 📝 emoji to messages. Our chatbot, spy automatically adds them to a service disruption document and confirms it by adding a  emoji to the same message.

    Once we identify the cause of the outage and verify that it has been resolved, the incident is stopped.



    The ATC then schedules a Root Cause Analysis (RCA) for the incident on the next working day. We have a no-blame culture, and the meeting is focused on determining what went wrong and how we can prevent it from happening in the future.

    At the end of the RCA, action items are identified and assigned owners. Keeping track of outages over time allows us to find areas that need more engineering investment to improve reliability.

    Thanks to these efforts, we've been able to take an app built for small stores and scale it for some of our largest merchants. Today, we support a large number of businesses to sell products worth billions of dollars each year. Along the way, we also scaled up our engineering team and can ship faster while improving reliability.

    We are far from done, though, as each year we are onboarding bigger and bigger merchants onto our platform. If these kinds of challenges sound interesting to you, come work with us! Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Continue reading

    Using DNS Traffic Management to Add Resiliency to Shopify’s Services

    Using DNS Traffic Management to Add Resiliency to Shopify’s Services

    If you are lacking understanding of what is DNS, traffic management, or why we would even use it, read Part 1: Introduction to DNS traffic management.

    Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management (TM) is a well-used approach to do so. In this second part of the two-part series, we’re sharing Shopify’s DNS traffic management journey from the numerous manually set-up, maintained, and updated traffic management approaches to the fully automated self-served system used by 40+ domains owned amongst 12+ different teams, handling 100M+ requests per day.

    Shopify’s Previous Approaches to DNS Traffic Management

    DNS traffic management isn’t entirely new at Shopify. A number of different teams had their own way of doing traffic management through DNS changes before we automated in 2019, which brought different sets of features and techniques to update records. 

    Traffic management through DNS changes

    Streaming Platform Team

    The team handling our Kafka pipelines used Kubernetes ConfigMaps to define target clusters. So, making changes required

    On top of the process duration which isn't ideal for failover time, using this manual approach doesn't open the door to any active/active configuration (where we share the traffic between two active clusters), since it would require to use two target clusters at once, while this only allows using the one defined in the ConfigMaps. At the time, being able to share traffic wasn't considered necessary for Kafka.

    Search Platform Team

    The team handling Elasticsearch set up the beginnings of DNS updates automation. They used our chatops bot, spy, to run commands requesting failovers. The bot creates a PR in our record-store repository (which uses the record_store open source project) with the requested DNS change, which then needs to be approved, merged, and deployed after passing the tests. Except for the automation, it’s a similar approach to the manual one, hence it has the same limitations with failover delays and active/active capabilities.

    Edgescale Team

    Part of the Edgescale team's responsibilities is to handle our assets (images and static files) and Content Delivery Networks (CDN), which bring the assets as close to the client as possible thanks to a network of distributed servers that store those files. To succeed in this mission, the team wanted more control and active/active capabilities. They used DNS providers allowing for weighted traffic management. They set weights to their records and define which share of traffic goes to which endpoint. It allowed them to share the traffic between their two CDN providers. To set this up, they used the DNS provider’s APIs with weights that could span from 0 (disabled) to 15, using DNS A records (hostname to IP address). To make their lives easier, they wrote a spy cdn command, responsible for making the API calls to the two DNS providers. It reduced the limitations of failover time, as well as provided the active/active capabilities. However, we couldn’t produce a stable, easy to reproduce, and version-controlled configuration of the providers. Adding endpoints to the traffic management was to be done manually, and thus prone to errors. The providers we use for our CDN needs don’t perform the same way in every region of the world and incur different costs in those regions. Using different traffic shares depending on the geographical location of the requesters is one of the traffic-management use-cases we presented in Part 1: Introduction to DNS traffic management(URL), but that wasn’t available here.

    Other Teams

    Finally, a few other teams decided to manually create records in one of the DNS providers used by the team handling our assets. They had fast failover and active/active capabilities but didn’t have a stable configuration, nor redundancy from using a single DNS provider.

    We had too many setups going all over the place, corresponding to maintainability problems. Also, any impactful changes needed a lot of coordination and moving pieces. All of these setups had similar use cases and needs. We started working on how to improve and consolidate our approach to DNS traffic management to

    • reduce the limitations encountered by teams
    • connect other teams with similar needs
    • improve maintainability
    • create ownership

    Consolidation and Ownership of DNS Traffic Management

    The first step of creating a better service used by many teams at Shopify was to define the requirements and goals. We wanted a reliable and redundant system that would provide

    • regionalized traffic
    • fine-grained traffic sharing
    • failover capabilities
    • easy setup and updating by the teams owning traffic-managed domains

    Consolidation and ownership of DNS traffic management

    The final state of our setup provides a unified way of setting up and handling our services. We use a git repository to store the domain's configuration that’s then deployed to two DNS providers. The configuration can be tweaked in a fast and easy manner for both providers through a set of spy commands, allowing for efficient failovers. Let's talk about those choices to build our system, and how we built it.

    Establish Reliability and Redundancy

    Each domain name has a set of nameservers, and when using a DNS client, one of those nameservers is selected and queried first, another one is used when a timeout occurs. Shopify used a single DNS provider until 2016, where a large DNS outage happened while our DNS provider was under a distributed denial of service (DDoS) attack, effectively dropping a large number of legitimate requests. We learned from our mistakes and increased our reliability and redundancy by using more than one DNS provider.

    When CDN traffic management was set up, it used two different DNS providers to follow in the steps of our static DNS records in the record_store. The decision for our new system was easy to make since it prevented being dependent on a single provider, we wanted to follow the same approach and adopt two providers to build our new standard.

    Define The Traffic Management Layers

    Two of our DNS providers allowed for regionalized and weighted traffic management, as well as multiple failover layers. It was just a matter of defining how we wanted things to work and build the equivalent approach for both providers.

    We defined our approach in layers of traffic management and considered that each layer had a decision to make that would reduce the set of options that the next layers can choose from.

    Layer 1 Geographical Fencing

    The layer supports globally matching endpoints, which are mandatoryThe layer supports globally matching endpoints, which are mandatory. We always have an answer to a DNS request for an existing traffic-managed domain, even if there is no region specifically matching the requester. We defined a global region that is selected when nothing more specific matches. The geographical fencing layer selects the endpoints that fit the region where the request originates from. This layer selects the best geographical match with the client’s request. For instance, we set a rule to have endpoint A answered for Canada and endpoint B for Quebec. When we get a request originating from Montreal, we return B. If the request originates from Ottawa, we return A.

    Layer 2 Endpoint Status

    Automatically setting endpoints status depends on a process called health checking or monitoringWe provided a way to enable and disable the use of endpoints depending on their status, which is manually or automatically set. Automatically setting endpoints status depends on a process called health checking or monitoring, where we try to reach the endpoint regularly in order to verify if it does (healthy) or does not (unhealthy) answer. We added a layer of traffic management based on the endpoint status aimed at selecting only the endpoints currently considered as healthy. However, we don’t want the requester to receive an empty answer for a domain that does exist, as it would trigger the negative TTL, most of the time higher than our traffic-managed domain TTL. If any of the endpoints is healthy, then only the healthy endpoints are returned by Layer 2. If none of the endpoints are healthy, all the endpoints are returned. The logic behind this is simple: returning something that doesn’t work is better than not returning anything, as it allows the client to start back using the service as soon as endpoints are back online.

    Layer 3 Endpoint Priority

    Endpoint priority. We allow users to define levels of priorities for the traffic shares of their domainsAnother aspect we want control over is the failover approach for our endpoints. We allow users to define levels of priorities for the traffic shares of their domains. For instance, they could define, as the highest priority, that three endpoints A, B, and C would receive 100%, 0%, and 0% of the traffic respectively. However, when A is unhealthy, instead of using B and C, we define, as a second priority, that B would receive 100% of the traffic. This can’t be done without a layer selecting endpoints based on their priority, as we’d be sending a share of the normal traffic to B, or we don’t have automated control over how B and C share the traffic in the case where A is failing. Endpoints of higher priority layers with a weight set to 0 (not receiving traffic) are also considered down for those layers. This means when the endpoints receiving traffic are unhealthy, through health checking, any higher-priority endpoints get discarded at Layer 2, allowing Layer 3 to keep only the highest priority endpoints left in the returned list.

    Layer 4 Weighted Selection

    This final layer deals with the weights defined for the endpointsThis final layer deals with the weights defined for the endpoints. Each endpoint E reaching this layer has a probability PE of being selected as the answer. PE is obtained through the formula <weight of E>/<sum of the weights of all endpoints reaching Layer 4>. Any 0-weighted endpoints will be automatically discarded unless there are only 0-weighted endpoints where all endpoints will have an equal chance of being selected and returned to the requester. 

    Deploying and Maintaining The Traffic-Managed Domains

    We try to build tooling in a self-service way. It creates a new standard requiring us to make tooling easily accessible for other teams to deploy their traffic-managed domains. Since we use Terraform with Atlantis for a number of our deployments, we built a Terraform module that receives only the required parameters for an application owner and hides most of the work happening behind the scenes to configure our providers.

    The above code represents the gist of what an application owner needs to provide to deploy their own traffic-managed domain.

    We work to keep our deployments organized, so we derive the zone and subdomain parameters from the path of the domain being terraformed. For example, the path to this file is terraform/ allows deriving that the zone is and the subdomain is test.

    When an application owner wants to make changes to their traffic-managed domain, they just need to update the file and apply the terraform change. There are a number of extra features that control

    • automated monitoring and failover for their domains 
    • monitoring configuration for domains
    • paging when a failover is automatically triggered or not.

    When we make changes to how traffic-managed domains are deployed, or add new features, we update the module and move domains to the new module version one by one. Everything stays transparent for the application owners and easy to maintain for us, the Edgescale team.

    Everyday Traffic Steering Operations

    We allowed the users of our new standard to make changes fast and easily applied to their traffic-managed domain. We built a new command in our chatops bot, spy endpoints, to perform operations on the traffic-managed domains.

    Those commands will operate on relative and absolute domains. Relative domains will automatically receive our default traffic-managed zone as a suffix. It’s also possible to specify in which region the change should apply by using square brackets; for example, cdn[us-*,na] would concern the cdn traffic-managed domain but only in region na and only ones starting with us-.

    The spy endpoints get command gets the current traffic shares between endpoints for a given domain

    The spy endpoints get command gets the current traffic shares between endpoints for a given domain. If all providers are holding the same data (which should be the case most of the time), then the command results without any mention of the DNS providers. When results are different (the providers went out of sync), the data will be shown with mentions of the providers to make sure that the current traffic shares are known.

    The spy endpoints set command changes the traffic shares using specific weight values we provide

    The spy endpoints set command changes the traffic shares using specific weight values we provide. It updates every provider and runs the spy endpoints set command to show the new traffic shares. Instead of specifying the weights for each endpoint, it’s possible to use a number of defined profiles that set predefined traffic shares. For example, our test domain with its two endpoints mostly[central-4] will define weights of 95 for central-4 and 5 for central-5.

    Our Success Stories

    Moving to ElasticSearch 7.0.0 - We talked about the process that was used by the ElasticSearch team and the fact that it was limited to failovers and didn’t allow traffic shares. When we moved our internal ElasticSearch clusters to ElasticSearch 7.0.0, the team was able to use the weighted load balancing provided by our tools to move the traffic chunk by chunk and ensure everything was working properly. It allowed them to keep the regular traffic going and mitigate any issue they might have encountered along the way, making the transition to ElasticSearch 7.0.0 seamless to the different systems using it.

    Recovering from Kafka overload during a flash sale - During a large flash sale, the Kafka brokers in one of our clusters started overloading from the traffic they had to handle. Once the problem was identified, it took a few minutes for the Kafka team to realize that they now had traffic share capabilities from our DNS traffic manager. They used it to divert half the traffic from the overloaded region and send it to another available region. Less than five minutes after making that change, the Kafka queues started recovering.

    Relieving on-call stress - Being on-call is stressful, especially when running errands and we don’t want to be stuck at home waiting for our phone to ring. Even with the great on-call culture at Shopify, and people always happy to override parts of your shift, being able to use the DNS traffic manager to steer traffic of an application to another cluster when something happens helps in so many cases. One aspect the different teams appreciated is that work can be done from the phone easily (thanks spy!). Another one is allowing Shopifolk to stay serene while solving the incidents which are mitigated thanks to traffic management and don’t impact our merchants and their clients. In summary, easy to use tooling and practical features together improve the experience of both our merchants and coworkers.

    Since the creation of our new DNS traffic management standard in the middle of 2019, we’ve onboarded more than 40 different domains across more than 12 different teams.

    Why Ownership Is Important. Demonstrated By Example

    A few months ago, while we moved teams to use the initial version of our DNS traffic manager, we got an email from one of our DNS providers letting us know that they would discontinue their service because it would be merged with the services of the company that bought them a few years prior. Of course, we weren’t so lucky, having their systems merged together would require action on our part. We needed to manually migrate our zones to the new provider.

    We launched a project to find our next DNS provider as a result. Since we needed to manually migrate our zones and consequently all of our tooling, we might as well evaluate our options. We looked at more than 40 providers, keeping in mind our needs for our static zones and traffic management requirements. We selected a few providers that fit our needs and decided on which one to sign a contract with.

    Once we chose the provider, the big migration happened. First, we updated our terraform module to support the new provider and deployed the traffic-managed domains in the three providers. Then we updated our spy endpoints tooling to update all providers when making changes so everything was ready and in sync. Next, we moved the nameservers of our traffic-managed zone one by one from the DNS provider we were leaving to the new DNS provider, making sure that in case of a problem only a controlled share of the traffic would be affected. We explained our migration plans to the different teams owning domains in the traffic manager, letting them know when it would happen, and that it should be transparent to them, but if anything unexpected seems to be happening, they should contact us. We also told the incident manager on-call of the changes happening and the timelines.

    Everything was in the plan for the change to happen. However, 30 minutes before change time, the provider we were leaving had an incident, preventing us from moving traffic and happening at the same time as one of the application owners having an incident that they wanted to mitigate with the traffic manager. It pushed our timeline forward, but we continued with the change without any issue and it was fully transparent to all the application owners.

    Looking back to how things were before we rolled out our new standard for DNS traffic management, we easily can say that moving to a new DNS provider wouldn’t have been that smooth. We would have had to

    • contact every team using their own approach to gather their needs and usage so we could find a good alternative to our current DNS provider (luckily this was done while preparing and building this project)
    • coordinate between those teams for the change to happen, and then chase after them to make sure they updated any tooling used

    The change couldn’t be handled for all of them as a whole as there wasn’t one product that one team handles, but many products that many teams handle.

    With our DNS traffic management system, we brought ownership to this aspect of our infrastructure because we understand the capabilities and requirements of teams, and how we can maintain and evolve as our teams’ needs evolve, improving the experience of our merchants and their customers.

    Our DNS traffic management journey took us from many manually setup, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Continue reading

    An Introduction to DNS Traffic Management

    An Introduction to DNS Traffic Management

    Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management is a well-used approach to do so. In this first part of a two-part series, we aim to give a broad overview of DNS and how it’s used for traffic management, as well as the different reasons why we want to use DNS traffic management.

    If you already have context on what is DNS, what is traffic management, and the reasons why you would need to use DNS traffic management, you can skip directly to where we share our journey and improvements made regarding DNS traffic management at Shopify in Part 2: Shopify’s DNS Traffic Management.

    A Summarized History of DNS

    Everything started with humans trying to communicate and a plain text file, even before the advent of the modern internet.

    The Advanced Research Projects Agency Network (ARPANET) was thought, in 1966, to enable access to remote computers. In 1969, the first computers were connected to ARPANET, followed by the implementation of the Network Control Program (NCP) in 1970. Guided by the need to connect more and more computers together, and as the work on the Transmission Control Protocol (TCP), started in 1974, evolved, TCP/IP was created in the late 1970s to provide the ability to join separate networks in a network of networks and replaced NCP in ARPANET on January 1st, 1983.

    At the beginning of ARPANET there were just a handful of computers from four different universities connected together, which was easier for people to remember the addresses. This became challenging with new computers joining the network. The Stanford Research Institute provided, through file sharing, a manually maintained file containing the hostnames and related addresses of hosts as provided by member organizations of ARPANET. This file, originally named HOSTS.TXT, is now also broadly known as the /etc/hosts file on Unix and Unix-like systems.

    A growing network with an increasingly large number of computers meant an increasingly large file to download and maintain. By the early 1980s, this process became slow and an automated naming system was required to address the technical and personnel issues of the current approach. The Domain Name System (DNS) was born, a protocol converting human-readable (and rememberable) domain names into Internet Protocol (IP) addresses.

    What is DNS?

    Let’s consider that DNS is a very large library where domains are organized from the least to the most meaningful parts of their names. For instance, if you (the client) want to resolve, you would consider that .com is the least meaningful part as it’s shared with many domain names, and shops the most meaningful part as it’s a specification on the subdomain you’re requesting. Finding in this DNS Library would thus mean going to the .com shelves and finding the myshopify book. Once the book in hand, we would then open it to the shops page, and see something that looks like the following:

    The image is telling us that corresponds to the IP address
    DNS Library Book

    The image is telling us that corresponds to the IP address Also, our DNS Library provides us with the equivalent of a Due Date, which is called Time To Live (TTL). It corresponds to the amount of time the association of hostname to IP address is valid. We remember or cache that information for the given amount of time. If we need this information after expiration, we have to “find that book” again to verify if the association is still valid.

    The opposite concept already exists: if you’re trying to find a page in the book and can’t find it, chances are that you won’t wait there until someone writes it down for you. In DNS, this concept is driven by the Negative TTL, which represents the duration we consider a NXDOMAIN (non-existing domain) answer can be cached. This means that the author of a new page in this book cannot consider their update is known by everyone until that period of time has elapsed.

    Another relevant element is that the DNS Library doesn't necessarily hold only one book but multiple identical ones, from different editors, enabling others to be consulted if one copy is unavailable.

    In DNS terms, the editors are DNS providers. The shelves contain multiple sections for each domain nameserver, the servers that provide DNS resolution as a source of truth for a domain. The books are zones in the domain nameservers, and the book pages are DNS records, the direct relation between the queried record and the value it should resolve to. The DNS Library is what we call root servers, a set of 13 nameservers (named from a to m) that hold the keys to the root of the hostnames. The root servers are responsible for helping to locate the shelves, the nameservers of the Top Level Domains (TLD), the domains at the highest level in the hierarchy for DNS.

    What is Traffic Management?

    Traffic management is a key branch of logistics that aims to plan and control everything required to provide for the safe, orderly, and efficient movement of persons and goods. Traffic management helps to manage situations such as congestion or roadblocks, by redirecting traffic or sharing traffic between multiple routes. For instance, some navigation applications use data they get from their users (current location, current speed, etc.) to know where congestion is happening and improve the situation by suggesting alternative routes instead of sending them to the already overloaded roads.

    A more generic description is that traffic management uses data to decide where to direct the traffic. We could have different paths depending on the country of origin (think country border waiting lines for the booths, where the checks are different depending on the passport you hold), different paths depending on vehicle size (bike lanes, directions for trucks vs. cars, etc.) or any other information we find relevant.

    DNS + Traffic Management = DNS Traffic Management

    Bringing the concept of traffic management to DNS means serving data-driven answers to DNS queries resulting in different answers depending on the location of the requester or for each request. For instance, we could have two clusters of servers and want to split the traffic between the two: we can decide to answer 50% of the requests with the first cluster and the other 50% with the second. The clients obtaining the answers would connect to the cluster they got directed to, without any other action on their part.

    DNS queries are cached to avoid overloading servers with queries.

    However, from the previous section, DNS queries are cached to avoid overloading servers with queries. Each time a query is cached by a resolver, it won’t be repeated by that resolver for the duration of its TTL. Using a low TTL will make sure that the information is kept around but not for too long. For example, returning a TTL of 15 seconds means that after 15 seconds the client needs to resolve the record again, and can get a different answer than before.

    A low TTL needs careful consideration, as the time it takes to obtain the DNS record’s content from the DNS servers, called DNS resolution time, sometimes can dominate the time it takes to retrieve a resource like a webpage. The connection performance and accuracy of the result are thus often at odds. For instance, if I want my changes to appear to users in at most 15 seconds (hence setting a 15 seconds TTL), but the DNS resolution time takes 1 second means that every 15 seconds the users will take 1 more second to reach the service they are connecting to. Over a day, this added resolution time adds up to 5760 seconds, or 1 hour and 36 minutes. If we slightly sacrifice the accuracy by moving the TTL to 60 seconds, the resolution time becomes 1440 seconds over a day, or only 24 minutes, improving the overall performance.

    The use of caching and TTL implies that doing DNS traffic management isn’t instant. There's a short delay in refreshing the record that should be at most the TTL that we configured. In practice, it can be slightly more as some DNS resolvers, unbeknownst to the client, might cache the results for a longer time than they see fit. The override of TTL shouldn’t happen often, however, but it’s something to be aware of when choosing DNS to do traffic management.

    Examining Four DNS Traffic Management Use Cases

    DNS traffic management is interesting when handling systems that don’t necessarily hold load balancer capabilities at the network level, either through an IP-level load balancer or any front-facing proxy, i.e. once already connected to the service we are trying to reach. There are many reasons to use DNS traffic management in front of services, and multiple reasons why we use it at Shopify.

    Easy Failover

    One of Shopify’s use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reasonOne of Shopify’s use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reason. Using DNS management and having it ready to target two clusters, but using one by default, simply redirects the traffic to the second cluster whenever the first one crashes, it then redirects back the traffic when it recovers. This is commonly called active-passive. If you’re able to identify the unavailability of your main cluster in a timely fashion, this approach makes it almost seamless (considering the TTL) to the clients using the service, as they’d use the still-working cluster while the issue is solved, either automatically or through the intervention of the responsible on-call team (as a last resort). The pressure is relieved on those on-call teams, as they know that clients can still use the service while they solve the issues, sometimes even pushing the work to be done to the next working day.

    Share traffic between endpoints

    Share traffic between endpoints

    Services inevitably grow and end up receiving requests from many clients. Now those requests need to be shared between available endpoints offering the exact same service. This is called active-active. Another motivation behind this approach is money related, when using external vendors (an external company contracted to provide your users a service) with minimum commitment, allowing you to share your traffic load between those vendors in a way that ensures reaching those commitments. You define the percent of traffic sent to each given endpoint corresponding to the percentage of DNS requests answered with that endpoint.

    Deploy a Change Progressively

    DNS traffic management can help by allowing movement of a small percentage of your traffic to a cluster that’s already updated

    When developing production services, sometimes making a potentially disruptive change (such as deploying a new feature, changing the behavior of an existing one, or updating a system to a new version) is needed. In such cases, deploying your change and crossing your fingers while hoping for success is, at best, risky. DNS traffic management can help by allowing movement of a small percentage of your traffic to a cluster that’s already updated, then move more and more chunks of traffic until all of the traffic has been moved to the cluster with the new feature. This approach is called green-blue deployment. You can then update the other cluster, which allows you to be ready for the next update or failover.

    Regionalize Traffic Decisions

    geolocation can be fine-grained to the country, state or province level, or applied to a broader region of the world

    You might find cases where some endpoints are more performant in some regions than others, which might happen when using external vendors. If performance is important for users, as it is for Shopify’s merchants and their customers, then you want to make sure the most performant endpoints are used for users in each region by allowing DNS answers based on the client’s location. Most of the time, geolocation can be fine-grained to the country, state or province level, or applied to a broader region of the world. Routing rules are defined to indicate what should be answered depending on the origin of requests. Once done, a client connecting from a location will get the answer that fits them. 

    Our DNS traffic management journey took us from many manually set-up, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Continue reading

    Migrating Large TypeScript Codebases To Project References

    Migrating Large TypeScript Codebases To Project References

    In 2017, we began migrating the merchant admin UI of Shopify from a traditional Ruby on Rails Embedded RuBy (ERB) based front-end to an entirely new codebase, TypeScript paired with React and GraphQL. Using TypeScript enabled our ever-growing Admin teams to leverage TypeScript’s compiler to catch potential bugs and errors well before they ship. The VSCode editor also provides useful TypeScript language-specific features, such as inline feedback from TypeScript’s type-checker.

    Many teams work independently but in parallel on the Shopify merchant admin UI. These teams need to stay highly aligned and loosely coupled to ensure they quickly ship reliable code. We chose TypeScript to empower our developers with great tools that help them ship with confidence. Pairing TypeScript with VSCode provides developers with actionable real-time feedback right in their editor without the need to run separate commands or push to CI. With these tools, teams leverage the benefits of TypeScript’s static type-checker right in the editor as they pull in the latest changes.

    The Problems with Large TypeScript Codebases

    Over the years as teams grew, many new features shipped, and the codebase dramatically increased in size. Teams had a poorer experience in the editor. Our codebase was taxing the editor tooling available to use. TS-Server (VSCode’s built-in language server for TypeScript) would seemingly halt, leaving developers with all the weight of TypeScript’s syntax and none of the editor tooling payoff. The editor took ~2-3 minutes to load up all of TypeScript’s great tooling, slowing how fast we ship great features for our merchants, and it also created a very frustrating development experience.

    We need to ship commerce features fast, so we investigated solutions to bring the developer experience in the editor up to speed.

    Understanding What VSCode Didn’t Like About Our Codebase

    First, we started by understanding why and where our codebase was taxing VSCode and TypeScript’s type-checker. Thankfully, the TypeScript team has an in-depth wiki page on Performance, and the VSCode team has an insightful wiki page on Performance Issues.

    Before implementing any solutions, we needed to understand what VSCode didn’t like about our codebase. To my pleasant surprise, around this time Taryn Musgrave, a developer from our Orders & Fulfillment team, had recently attended JSConfHi 2020. At the conference, she met a few folks from TC39 (the team that standardizes the JavaScript language under the “ECMAScript” specification) , one of them being Daniel Rosenwasser, the Program Manager for TypeScript at Microsoft. After the conference, we connected and shared our codebase diagnostics with the core TypeScript team from Microsoft.

    The TypeScript team had asked us for more insight into our codebase. We ran some initial diagnostics against the single TypeScript configuration (tsconfig.json) file for the project. The TypeScript compiler provides an --extendedDiagnostics flag that we can pass to get some useful diagnostic info

    Their team was surprised by the size of our codebase and mentioned that it was in the top 1% of codebase sizes they’d seen. After sharing some more context into our tools, libraries, and build setup they suggested we break up our codebase into smaller projects and leverage a new TypeScript feature released in 3.0 called project references.

    Measuring Improvements

    Before diving into migrating our entire admin codebase, we needed to take a step back to decide how we could measure and track our improvements. To confidently know we’re making the right improvements, we need tools that enable us to measure changes.

    At first, we used VSCode’s TSServer logs to measure and verify our changes. For our Admin codebase the TSServer would spit out ~80,000 lines of logs over the course of ~2m30s on every bootup of VSCode.

    We quickly realized that this approach wasn’t scalable for our teams. Expecting teams to parse through 80,000 lines to verify their improvement wasn’t feasible. So, for this reason we set out to build a VSCode plugin to help our teams at Shopify measure and track their editor initialization times over time. Internally we called this plugin TypeTrack. 

    TypeTrack in action
    TypeTrack in action

    TypeTrack is a lightweight plugin to measure VSCode's TypeScript language feature initialization time. This plugin is intended to be used by medium-large TypeScript codebases to track editor performance improvements as projects migrate their code to TypeScript project references.

    Migrating to Project References

    In large projects, migrating to project references in one go isn’t feasible. The migration would have to be done gradually over time. We found it best to start out by identifying leaf nodes in our project’s dependency graph. These leaf nodes generally include the most widely shared parts of our codebase and have little to no coupling with the rest of our codebase.

    Our Admin codebase’s structure loosely resembles this:

    In our case, a good place to start was by migrating the packages and tests folders. Both of these folders are meant to be treated as “isolated projects” already. Migrating them to leverage project references not only improves their initialization performance in VSCode, but also encourages a healthier codebase. Developers are encouraged to break up their codebase into smaller pieces that are more manageable and reusable.

    Starting At The Leaves

    Once we’ve identified our leaf nodes, we start by creating an entrypoint for the project references.

    Whenever a project is referenced, that respective folder is also migrated to a project reference, meaning it requires a project-level tsconfig specifying the dependencies that project references in turn. In the case of the above-mentioned project-references.tsconfig.json, the ../../package path reference looks for a packages/tsconfig.json.

    Once you’ve got a few folder with project references created, you run the TypeScript compiler to check if your project builds:

    > Protip: Use the --diagnostics flag to get insight into what the compiler is doing.

    At this point it’s very likely that you’ll get tons of errors, the most common being that the compiler can’t find a module that your project is referencing.

    To illustrate this, see the screenshot below. The compiler isn’t able to find the module @shopify/address-consts from within the @shopify/address package

    Compiler can’t find a module the project is referencing
    Compiler can’t find a module the project is referencing

    The solution here would be:

    1. Create a project-level tsconfig.json for the module being depended on. You can think of this as a child node in your project dependency graph. For our example error from above, we need to create a tsconfig.json in packages/address-consts.
    2. Reference the new project reference (the tsconfig.json created from 1) in the consuming parent tsconfig.json that requires the new module. In our example, we need to include the new project reference.

    Keep in mind that you'll have to respect the following restrictions for project references:

    • The include/files compiler option must include all the input files that the project reference relies on.
    • Any project that’s referenced must itself have a references array (which may be empty).
    • Any local namespaces you import must be listed in the config/typescript/tsconfig.base.json #compilerOptions#paths field.

    General Steps to Migrating Your Codebase to Project References

    Once the TypeScript compiler successfully builds the leaf nodes that you’ve migrated to project references, you can start working your way up your project dependency graph. Outlined below are general steps you’ll take to continue migrating your whole codebase:

    1. Identify the folder you plan to migrate. Add this to your project’s config/typescript/project-references.tsconfig.json
    2. Create a tsconfig.json for the folder you’ve identified.
    3. Run the TypeScript compiler
    4. Fix the errors. Determine what other dependencies outside of the identified folder need to be migrated. Migrate and reference those dependencies.
    5. Go to step 3 until the compiler succeeds.

    For large projects, you may see hundreds or even thousands of compiler errors in step 4. If this is the case, pipe out those errors to a log file and look for patterns in the errors.

    Here are some common solutions we’ve found when we’ve come across this roadblock:


    • The folder is just too large. At the root of the folder create a tsconfig.json that includes references to child folders. Migrate the child folders with the general steps mentioned above
    • This folder contains spaghetti code. It’s possible this folder could be reaching into other dependencies/folders that it shouldn’t. In our case we’ve found it best to move shared code into our packages folder consisting of shared isolated packages.

    We Saw Drastic Improvements in VSCode

    Once we migrated our packages and tests folders in our Admin codebase, we made great improvements in the amount of time it takes VSCode to initialize the editor’s TypeScript language features in those folders.

    File Before  After
    packages/@web-utilities/env/index.ts ~155s ~13s
    packages/@shopify/address-autocomplete/AddressAutocomplete.ts ~155s ~8s
    tests/utilities/environment/app.tsx ~155s ~10s

    How Does This Make Sense? Why Is This 10x Faster?

    When a TypeScript file is opened up in VSCode, the editor sends off a request to the TypeScript Server for analysis on that file. The TS Server then looks for the closest tsconfig.json relative to that file for it to understand the types associated with that project. In our case, that’s the project-level tsconfig.json which included our whole codebase.

    With the addition of project references (i.e tsconfig.json with references to its dependent projects) in the packages and tests folders, we’re explicitly separating the codebase into smaller blocks of code. Now, when VSCode loads a file that uses project references, the TS Server will only load up the files that the project depends on.

    Scaling Migration Duties to Other Teams

    In our case, given the size of our Admin codebase, my team couldn’t migrate all of it ourselves. For this reason we wrote documentation and provided tools (our internal VSCode plugin) to other Admin section teams to improve their section’s editor performance.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    ShipIt! Presents: AR/VR at Shopify

    ShipIt! Presents: AR/VR at Shopify

    On June 20, 2020, ShipIt!, our monthly event series, presented AR/VR at Shopify. Daniel Beauchamp, Head of AR/VR at Shopify talked about how we’re using this increasingly ubiquitous technology at Shopify.

    The missing Shopify AR Stroller in video in the AR/VR at Shopify presentation can be viewed at

    Additional Information

    Daniel Beauchamp



    Shopify AR



    Continue reading

    How We’re Solving Data Discovery Challenges at Shopify

    How We’re Solving Data Discovery Challenges at Shopify

    Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes (one trillion gigabytes) in 2018. The estimate for 2025 is 175 ZBs, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines.

    The nature of data usage is problem driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to help decision making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata that saves time for data professionals, increasing data recycling, and providing data consumers with more accessibility to an organization’s data assets.

    To make sense of all of these data assets at Shopify, we built a data discovery and management tool named Artifact. It aims to increase productivity, provide greater accessibility to data, and allow for a higher level of data governance.

    The Discovery Problem at Shopify

    Data discovery and management is applicable at every point of the data process:

    • Where is the data coming from?
    • What is the quality of this data?
    • Who owns the data source? 
    • What transformations are being applied?
    • How can this data be accessed?
    • How often does this process run?
    • What business logic is being applied?
    • Is the model stale or current?
    • Are there other similar models out there?
    • Who are the main stakeholders?
    • How is this data being applied
    • What is the provenance of these applications?

    High level data process

    The data discovery issues at Shopify can be categorized into three main challenges: curation, governance, and accessibility.


    “Is there an existing data asset I can utilize to solve my problem?”

    Before Artifact, finding the answer to this question at Shopify often involved asking team members in person, reaching out on Slack, digging through GitHub code, sifting through various job logs, etc. This game of information tag resulted in multiple sources of truth, lack of full context, duplication of effort, and a lot of frustration. When we talked to our Data team, 80% felt the pre-Artifact discovery process hindered their ability to deliver results. This sentiment dropped to 41% after Artifact was released.

    The current discovery process hinders my ability to deliver results survey answers
    The current discovery process hinders my ability to deliver results survey answers


    “Who is going to be impacted by the changes I am making to this data asset?”

    Data governance is a broad subject that encompasses many concepts, but our challenges at Shopify are related to lack of granular ownership information and change management. The two are related, but generally refer to the process of managing data assets through their life cycle. The Data team at Shopify spent a considerable amount of time understanding the downstream impact of their changes, with 16% of the team feeling they understood how their changes impacted other teams:

    I am able to easily understand how my changes impact other teams and downstream consumers survey answers
    I am able to easily understand how my changes impact other teams and downstream consumers survey answers

    Each data team at Shopify practices their own change management process, which makes data asset revisions and changes hard to track and understand across different teams. This leads to loss of context for teams looking to utilize new and unfamiliar data assets in their workflows. Artifact has helped each data team understand who their downstream consumers are, with 46% of teams now feeling they understand the impact their changes have on them.


    “How many merchants did we have in Canada as of January 2020?”

    Our challenge here is surfacing relevant, well documented data points our stakeholders can use to make decisions. Reporting data assets are a great way to derive insights, but those insights often get lost in Slack channels, private conversations, and archived powerpoint presentations. Lack of metadata surrounding these report/dashboard insights directly impacts decision making, causes duplication of effort for the Data team, and increases the stakeholders’ reliance on data as a service model that in turn inhibits our ability to scale our Data team.

    Our Solution: Artifact, Shopify’s Data Discovery Tool

    We spent a considerable amount of time talking to each data team and their stakeholders. On top of the higher level challenges described above, there were two deeper themes that came up in each discussion:

    • The data assets and their associated metadata is the context that informs the data discovery process.
    • There are many starting points to data discovery, and the entire process involves multiple iterations.

    Working off of these themes, we wanted to build a couple of different entry points to data discovery, enable our end users to quickly iterate through their discovery workflows, and provide all available metadata in an easily consumable and accessible manner.

    Artifact is a search and browse tool built on top of a data model that centralizes metadata across various data processes. Artifact allows all teams to discover data assets, their documentation, lineage, usage, ownership, and other metadata that helps users build the necessary data context. This tool helps teams leverage data more effectively in their roles.

    Artifact’s User Experience

    Artifact Landing Page

    Artifact Landing Page

    Artifact’s landing page offers a choice to either browse data assets from various teams, sources, and types, or perform a plain English search. The initial screen is preloaded with all data assets ordered by usage, providing users who aren’t sure what to search for a chance to build context before iterating with search. We include the usage and ownership information to give the users additional context: highly leveraged data assets garner more attention, while ownership provides an avenue for further discovery.

    Artifact Search Results

    Artifact Search Results

    Artifact leverages Elasticsearch to index and store a variety of objects: data asset titles, documentation, schema, descriptions, etc. The search results provide enough information for users to decide whether to explore further, without sacrificing the readability of the page. We accomplished this by providing the users with data asset names, descriptions, ownership, and total usage.

    Artifact Data Asset Details

    Artifact Data Asset Details

    Clicking on the data asset leads to the details page that contains a mix of user and system generated metadata organized across horizontal tabs, and a sticky vertical nav bar on the right hand side of the page.

    The lineage information is invaluable to our users as it:

    • Provides context on how a data asset is utilized by other teams.
    • Lets data asset owners know what downstream data assets might be impacted by changes.

    Artifact Data Asset Lineage

    Artifact Data Asset Lineage

    This lineage feature is powered by a graph database, and allows the users to search and filter the dependencies by source, direction (upstream vs. downstream), and lineage distance (direct vs. indirect).

    Artifact’s Architecture

    Before starting the build, we decided on these guiding principles:

    1. There are no perfect tools; instead solve the biggest user obstacles with the simplest possible solutions.
    2. The architecture design has to be generic enough to easily allow future integrations and limit technical debt.
    3. Quick iterations lead to smaller failures and clear, focused lessons.

    Artifact High Level Architecture
    Artifact High Level Architecture

    With these in mind, we started with a generic data model, and a simple metadata ingestion pipeline that pulls the information from various data stores and processes across Shopify. The metadata extractor also builds the dependency graph for our lineage feature. Once processed, the information is stored in Elasticsearch indexes, and GraphQL APIs expose the data via an Apollo client to the Artifact UI.


    Buy vs. Build

    The recent growth in data, and applications utilizing data, has given rise to data management and cataloguing tooling. We researched a couple of enterprise and open source solutions, but found the following challenges were common across all tools:

    • Every organization’s data stack is different. While some of the upstream processes can be standardized and catalogued appropriately, the business context of downstream processes creates a wide distribution of requirements that are near impossible to satisfy with a one-size-fits-all solution.
    • The tools didn’t capture a holistic view of data discovery and management. You are able to effectively catalogue some data assets. However, cataloguing the processes surrounding the data assets were lacking: usage information, communication & sharing, change management, etc.
    • At Shopify, we have a wide range of data assets, each requiring its own set of metadata, processes, and user interaction. The tooling available in the market doesn’t offer support for this type of variety without heavy customization work.

    With these factors in mind, the buy option would’ve required heavy customization, technical debt, and large efforts for future integrations. So, we went with the build option as it was:

    • The best use case fit.
    • Provided the most flexibility.
    • Left us with full control of how much technical debt we take on.

    Metadata Push vs. Pull

    The architecture diagram above shows the metadata sources our pipeline ingests. During the technical design phase of the build, we reached out to the teams responsible for maintaining the various data tools across the organization. The ideal solution was for each tool to expose a metadata API for us to consume. All of the teams understood the value in what we were building, but writing APIs was new incremental work to their already packed roadmaps. Since pulling the metadata was an acceptable workaround and speed to market was a key factor, we chose to write jobs that pull the metadata from their processes; with the understanding that a future optimization will include metadata APIs for each data service.

    Data Asset Scope

    Our data processes create a multitude of data assets: datasets, views, tables, streams, aliases, reports, models, jobs, notebooks, algorithms, experiments, dashboards, CSVs, etc. During the initial exploration and technical design, we realized we wouldn’t be able to support all of them with our initial release. To cut down the data assets, we evaluated each against the following criteria:

    • Frequency of use:how often are the data assets being used across the various data processes?
    • Impact to end users:what is the value of each data asset to the users and their stakeholders?
    • Ease of integration:what is the effort required to integrate the data asset in Artifact?

    Based on our analysis, we decided to integrate the top queryable data assets first, along with their downstream reports and dashboards. The end users would get the highest level of impact with the least amount of build time. The rest of the data assets were prioritized accordingly, and added to our roadmap.

    What’s Next for Artifact?

    Since its launch in early 2020, Artifact has been extremely well received by data and non-data teams across Shopify. In addition to the positive feedback and the improved sentiment, we are seeing over 30% of the Data team using the tool on a weekly basis, with a monthly retention rate of over 50%. This has exceeded our expectations of 20% of the Data team using the tool weekly, with a 33% monthly retention rate.

    Our short term roadmap is focused on rounding out the high impact data assets that didn’t make the cut in our initial release, and integrating with new data platform tooling. In the mid to long term, we are looking to tackle data asset stewardship, change management, introduce notification services, and provide APIs to serve metadata to other teams. The future vision for Artifact is one where all Shopify teams can get the data context they need to make great decisions.

    Artifact aims to be a well organized toolbox for our teams at Shopify, increasing productivity, reducing the business owners’ dependence on the Data team, and making data more accessible. 

    Are you passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    How We Built

    How We Built

    On July 20, 2020 we held ShipIt! Presents: AR/VR at Shopify. Daniel talked about what we’re doing with AR/VR at Shopify. The video of the event is available. is a free tool built by Shopify’s Augmented Reality (AR) team that lets anyone view the size of a product in the space around them using their smartphone camera.

    The tool came out of our April Hack Days focused on helping retailers impacted by Covid-19. Our idea was to create an easy way for merchants to show how big their products are—something that became increasingly important as retail stores were closed. While Shopify does support 3D models of products, it does take time and money to get 3D models made. We wanted to provide a quick stopgap solution.

    The ideal flow is for merchants to provide a link on their product page (e.g., that would open up an AR view when someone clicked on it. For this to be as seamless as possible, it had to be all done on the web (no app required!), fast, accurate, and work on iOS and Android.

    Let’s dive into how we pulled it off.

    AR on the Web

    3D on the web has been around for close to a decade. There are great WebGL libraries like Three.js that make it quick to display 3D content. For, all we needed to show was a 3D cube which is essentially the “Hello World” of computer graphics.

    The problem is that WebAR, the ability to power AR experiences on the web with JavaScript, isn’t supported on iOS. In order to build fully custom AR experiences on iOS, it needs to be in an app.

    Luckily, there’s a workaround. iOS has a feature called AR Quick Look, which is a native AR viewer built into the operating system. By opening a link to a 3D model file online, an AR viewer will launch right in your browser to let you view the model. Android has something similar called Scene Viewer. The functionality of both of these viewers is limited to only placing and moving a single 3D object in your space, but for that was all we needed.

    Example of a link to view a 3D model with iOS AR Quick Look:

    Example of a link to view a 3D model with Android Scene Viewer:

    You might have noticed that the file extensions in the above viewers are different between Android and iOS. Welcome to the wonderful world of 3D formats!

    A Quick Introduction to 3D File Formats

    Image files store pixels, and 3D files store information about an object’s shape (geometry) and what it’s made of (materials & textures). Let’s take a look at a simple object: a cube. To represent a cube as a 3D model we need to know the positions of each of its corners. These points are called vertices

    Representing a cube in 3D format.  Left: Each vertex of the cube.  Right: 3d-coordinates of each vertex
    Left: Each vertex of the cube.
    Right: 3d-coordinates of each vertex

    These can be represented in array like this:

    We also need to know how each side of the cube is connected. These are called the faces.

    A face made up of vertices 0, 1, 5, 4
    A face made up of vertices 0, 1, 5, 4

    The face array [0, 1, 5, 4] denotes that there’s a face made up by connecting vertices 0, 1, 5, and 4 together. Our full cube would look something like this:

    There is also a scale property that lets us resize the whole object instead of moving the vertices manually. A scale of (1,2,3) would scale by 1 along the x-axis, 2 along the y-axis, and 3 along the z-axis.

    A scale of (1,2,3) used to resize the whole object
    A scale of (1,2,3) used to resize the whole object

    So what file format can we use to store this information? Well, just like how images have different file formats (.jpeg, .png, .tiff, .bmp, etc), there’s more than one type of 3D format. And because the geometry data can get quite large, these formats are often binary instead of ASCII based.

    Android’s Scene Viewer uses .glTF, which is quickly becoming a 3D standard in the industry. Its aim is to be the jpeg of 3D. The binary version of a .glTF file is a .glb file. Apple on the other hand is using their own format called USDZ, which is based off of Pixar’s USD file format.

    For to work on both operating systems, we needed a way to create these files dynamically. When a user entered dimensions we’d have to serve them a 3D model of that exact size.

    Approach 1: Creating 3D Models Dynamically

    There are many libraries out there for creating and manipulating .glTF, and they’re very lightweight. You can use them client side or server side.

    USDZ is a whole other story. It’s a real pain to compile all the tools, and an even bigger pain to get running on a server. Apple distributes precompiled executables, but they only work on OSX. We definitely didn’t want to spend half of Hack Days wrestling with tooling, but assuming we could get it working, the idea was:

    1. Generate a cube as a .gltf file
    2. Use the usdzconvert tool to convert the cube.gltf to a .usdz

    The problem here is that this process could end up taking a non-trivial amount of time. We’ve seen usdzconvert take up to 3-4 seconds to convert, and if you add the time it takes to create the .gltf, users might be waiting 5 seconds or more before the AR view launches.

    There had to be a faster and easier way.

    Approach 2: Pre-generate All possible Models

    What if we generated every possible combination of box beforehand in both .gltf and .usdz formats? Then when someone entered in their dimensions, we would just serve up the relevant file.

    How many models would we need? Let’s say we limited sizes for width, length, and depth to be between 1cm and 1000cm. There’s likely not many products that are over 10 meters. We’d then have to pick how granular we could go. Likely people wouldn’t be able to visually tell the difference between 6.25cm and 6cm. So we’d go in increments of 25mm. That would require 4000 * 4000 * 4000 * 2, or 128,000,000,000 models. We didn’t really feel like generating that many models, nor did we have the time during Hack Days!

    How about if we went in increments of 1cm? That would need 1000 * 1000 * 1000 * 2, or 2 billion models. That’s a lot. At 3 KB a model, that’s roughly 6TB of models.

    This approach wasn’t going to work.

    Approach 3: Modify the Binary Data Directly

    All cubes have the same textures and materials, and the same numbers of vertices and faces. The only difference between them is their scale. It seemed inefficient to regenerate a new file every time just to change this one parameter, and seemed wasteful to pre-generate billions of almost identical files.

    But what if we took the binary data of a .usdz file, found which bytes correspond to the scale property, and swapped them out with new values? That way we could bypass all of the usdz tooling.

    The first challenge was to find the byte location of the scale values. The scale value (1, 1, 1) would be hard to look for because the value “1” likely comes up many times in the cube’s binary data. But if we scaled a cube with values that were unlikely to be elsewhere in the file, we could narrow down the location. We ended up creating a cube with scale = (1.7,1.7,1.7).

    By loading up our file in a hex editor, we’re able to look up and find the value. USDZ stores values as 32-bit floats, so 1.7 is represented as 0x9a99d93f. With a quick search we found at byte offset 1344 the values corresponding to scale along the x, y, and z axes.

    Identifying 0x9a99d93f within the USDZ binary
    Identifying 0x9a99d93f within the USDZ binary

    To test our assumption, the next step was to try changing these bytes and seeing that the scale would change.

    It worked! With this script we could generate .usdz files on the fly and it was really fast.The best part is that this could also run completely client side with a few modifications. We could modify the .usdz in the browser, encode the data in a URL, and pass that URL to our AR Quick Look link:

    Unfortunately, our dreams of running this script entirely on the client were dashed when it came to Android. Turns out you can’t launch Scene Viewer from a local data URL. The file has to be served somewhere, so we’re back to needing to do this on a server.

    But before we went about refactoring this script to run on a little server written in Ruby, we wanted to give our lovely cube a makeover. The default material was this grey colour that looked a bit boring.

    The new cube is semi-transparent and has a white outline that makes it easier to align with the room
    The new cube is semi-transparent and has a white outline that makes it easier to align with the room

    Achieving the transparency effect was as simple as setting the opacity value of the material. The white outline proved to be a bit more challenging because it could easily get distorted with scale.

    If one side of the cube is much longer than the other, the outline starts stretching

    The outline starts stretching on the box
    The outline starts stretching on the box

    You can see in the above image how the outlines are no longer consistent in thickness. The ones of the sides are now twice as thick the others.

    We needed a way to keep the outline thickness consistent regardless of the overall dimensions, and that meant that we couldn’t rely on scale. We’d have to modify the vertex positions individually.

    Vertices are highlighted in red
    Vertices are highlighted in red

    Above is the structure of the new cube we landed on with extra vertices for the outline. Since we couldn’t rely on the scale property anymore for resizing our cube, we needed to change the byte values of each vertex in order to reposition them. But it seemed tedious to maintain a list of all the vertex byte offsets, so we ended up taking a “find and replace” approach.

    Left: Outer vertices. Right: Inner outline vertices
    Left: Outer vertices. Right: Inner outline vertices

    We gave each x, y, z position of a vertex a unique value that we could search for in the byte array and replace. In this case the outer vertices for x, y, z were 51, 52, 53 respectively. We picked these numbers at random just like we picked the number 1.7 before. We wanted something unique.

    Vertex values that affected the outline were 41, 42, and 43. Vertex positions with the same value meant that they moved together. Also, since the cube is symmetrical, we gave opposite vertices the same value except negated.

    As an example, let’s say we wanted to make a cube 2m long, and 1m wide and tall. We’d first search for all the bytes with a float value of 51 (0x00004c42) and replace it with the value 2 * 0.5 = 1 (0x0000803f). We use 1 instead of 2 because of the vertices being symmetrical. If the left-most corner has a value of -1 and the right-most has a value of 1, then the distance between them is 2

    Distance between the vertices is 2
    Distance between the vertices is 2

    We’d then move the outline vertices by looking for all the bytes with value 41 (0x00002442) and replace them with 1.98 * 0.5 = 0.99 (0xa4707d3f) to keep the outline 2cm thick. We’d repeat this process for the width and the height

    The template cube is transformed to the proper dimensions with consistent outline thickness.
    The template cube is transformed to the proper dimensions with consistent outline thickness.

    Here’s what part of our server side Ruby code ended up looking like to do this:

    Et voilà! We now had a way to generate beautiful looking cubes on the fly. And most importantly, it’s blazingly fast.This new way can create usdz files in well under 1 millisecond, something much better than relying on the python USD toolset.

    All that remained was to write the .glb version, which we did using the same approach as above.

    What’s Next for

    We’re really happy with how turned out, and with the simplicity of its implementation. Now anytime you are shopping and want to see how big a product is, you can simply go to and visualize the dimensions.

    But why stop at showing 3D cubes? There are lots of standard-sized products that we could help visualize like rugs, posters, frames, mattresses, etc—Imagine being able to upload a product photo of a 6x9 rug and instantly load it up in front of you in AR. All we need to figure out is how to dynamically insert textures into .usdz and .glb files.

    Time to boot up the ol’ hex editor again…

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    Media at Scale: Callbacks vs pipelines

    Media at Scale: Callbacks vs pipelines

    The Shopify admin is a Ruby on Rails monolith that hundreds of developers work on, with a continuous deployment cycle used by millions of people.

    Shopify recently launched the ability to add videos and 3d models to products as a native feature within the Admin section. This feature expands a merchant's ability to showcase their products through media using videos and 3d models in addition to the already existing images. Images have been around since the beginning of Shopify, and there are currently over 7 billion images on the platform.

    A requirement of this project was to support the legacy image infrastructure while adding the new capabilities for all our merchants. In a large system like Shopify, it is critical to have control over the code and safe database operations. For this reason, at Shopify, transactions are used to make database writes safe.

    One of the challenges we faced during the design of this project was deciding whether to use callbacks or a pipeline approach. The first approach we can take to implement the feature is a complete rails approach with callbacks and dependencies. Callbacks, as rails describes them, allow you to trigger logic during an object's life cycle.

    “Callbacks are methods that get called at certain moments of an object's life cycle. With callbacks, it is possible to write code that will run whenever an Active Record object is created, saved, updated, deleted, validated, or loaded from the database.”

    Media via Callbacks

    Callbacks are quick and easy to set up. It is fast to hit the ground running. Let’s look at what this means for our project. For the simplest case of only one media type, i.e., image, we have an image object and a media object. The image object has a polymorphic association to media. Video and 3d models can be other media types. The code looks something like this:

    With this setup, the creation process of an Image will look something like this:

    We have to add a transaction during creation since we need both image and media records to exist. To add a bit more complexity to this example, we can add another media type, video, and expand beyond create. Video has unique aspects, such as the concept of a thumbnail that doesn’t exist for an image. This adds some conditionals to the models.

    From the two previous examples, it is clear that getting started with callbacks is quick when the application is simple; however, as we add more logic, the code becomes more complex. We see that, even for a simple example like this one, our media object has granular information on the specifics of video and images. As we add more media types, we end up with more conditional statements and business logic spread across multiple models. In our case, using callbacks made it hard to keep track of the object’s lifecycle and the order in which callbacks are triggered during each state. As a result, it became challenging to maintain and debug the code.

    Media via Pipelines

    In an attempt to have better control over the code, we decided not to use callbacks and instead try an approach using pipelines. Pipeline is the concept where the output of one element is the input to the next. In our case, the output of one class is the input to the next. There is a separate class that is responsible for only one operation of a single media type.

    Let’s imagine the whole system as a restaurant kitchen. The Entrypoint class is like the kitchen’s head chef. All the orders come to this person. In our case, the user input comes into this service (Product Create Media Service). Next, the head chef assigns the orders to her sous chef. This is the media create handler. The sous chef looks at the input and decides which of the kitchen staff get’s the order. She assigns the order for a key lime pie to the pastry chef and the order for roasted chicken to the roast chef. Similarly, the media create handler assigns the task to create a video to the video create handler and the task to create an image to the image create handler. Each of these individuals specializes in their tasks and are not aware of the details of others. The video create handler is only responsible for creating a media of type video. It has no information about image or 3d models.

    All of our individual classes have the same structure but are responsible for different media types. Each class has three methods:
    • before_transaction
    • during_transaction
    • after_transaction

    As the name suggests, these methods have to be called in that specific order. Going back to our analogy, the before transaction method is responsible for all the prep work that goes in before we create the food. The during transaction is everything involved in creating the dish, which involves cooking the dish and plating it. For rails, this is the method that is responsible for persisting the data to the database. Finally, after_transaction is the clean up that is required after each dish is created.

    Let's look at what the methods do in our case. For example, the video create handler will look like this:

    Similarly, if we move a step up, we can look at the media create handler. This handler will also follow a similar pattern with three methods. Each of these methods in turn calls the handler for the respective media type and creates a cascading effect.

    Media Create Handler

    The logic for each media type remains confined to its specific class. Each class is only aware of its operation, like how the example above is only concerned with creating a video. This creates a separation of concerns. Let's take a look at the product create media service. The service is unaware of the media type, and it’s only responsibility is to call the media handler.

    Product Create Media Service

    The product create media service also has a public entry point, which is used to call the service.

    The caller of the service has a single entry point and is completely unaware of the specifics of how each media type is created. Like in a restaurant, we want to make sure that the food for an entire table is delivered together. Similarly, in our code, we can manage that interdependent objects are created together using a transaction. This approach gives us the following features:

    • Different components of the system can create media and manage their own transactions.
    • The system components no longer have access to the media models but can interact with them using the service entry point.
    • The media callbacks don't get fired with those of the caller, making it easier to follow the code. When developers new to rails use callbacks, it requires a lot of knowledge of the framework and hides away the implementation details.
    • This approach makes it easier to follow and debug the code. The cognitive load on the reader is low, they are all ruby objects, and it is easy to understand.
    • It also gives us control over the order in which objects are created, deleted, updated.

    From the code example, we see that the methods of implementation using callbacks is quick and easy to set up. Ruby on rails can speed up the development process by abstracting away the implementation details, and it is a great feature to use when working with a simple use case. However, as the code evolves and grows more complex, It can be hard to maintain a large production application with callbacks. As we saw in the example above, we had conditionals spread across the active record models.

    An alternate approach can better serve the purpose of long-term maintenance and understandability of the code. In our case, pipelines helped achieve this. We separated the business logic in separate files, enabling us to understand the implementation details better. It also avoided having conditionals spread across the active record models. The most significant advantage of the approach is that it created a clear separation of concerns and different parts of the application do not know the particulars of the media component.

    When designing a pipeline it is important to make sure that there is a single entry point that can be used by the consumer. The pipeline should only perform the actions it is expected to and not have side effects. For example, our pipeline is responsible for creating media and no other action, the client does not expect any other product attribute to be modified. Pipelines are designed to make it easy for the caller to perform certain tasks and so we hide away the implementation details of creating media from the caller. And finally having several steps that perform smaller subtasks can create a clear separation of concern within the pipeline.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    A First Look at Reanimated 2

    A First Look at Reanimated 2

    Last month, Software Mansion announced the alpha release of Reanimated 2. Version 2 is a complete paradigm shift in the way we build gestures and animations in React Native.

    Last January, at the React Native Community meetup in Kraków, Krzysztof Magiera (co-creator of React Native Android and creator of Gesture Handler and Reanimated) mentioned the idea of a new implementation of Reanimated based on a concept borrowed from an experimental web API, animation worklets: JavaScript functions that run on an isolated context to compute an animation frame. And a few days later Shopify announced its support for Software Mansion’s effort in the open source community, including backing the new implementation of Reanimated.

    When writing gestures and animations in React Native, the key to success is to run the complete user interaction on the UI thread. This means that you don’t need to communicate with the JavaScript thread, nor expect this thread to be free to compute an animation frame.

    In Reanimated 1, the strategy employed was to use a declarative domain-specific language to declare gestures and animations beforehand. This approach is powerful, but came with drawbacks:

    • learning curve is steep
    • some limitations in the available set of instructions
    • performance issues at initialization time.

    To use the Reanimated v1 Domain Specific Language, you have to adopt a declarative mindset which is challenging, and simple instructions could end-up being quite verbose. Basic mathematical tools such as coordinate conversions, trigonometry, and bezier curves, just to name a few, had to be rewritten using the declarative DSL.

    And while the instruction set offered by v1 is large, there are still some use cases where you are forced to cross the React Native async bridge (see Understanding the React Native bridge concept), for example, when doing date or number formatting for instance.

    Finally, the declaration of the animations at initialization time has a performance cost: the more animation nodes are created, the more messages need to be exchanged between the JavaScript and UI thread. On top of that, you need to take care of memoization: making sure to not re-declare animation nodes more than necessary. Memoization in v1 proved to be challenging and a substantial source of potential bugs when writing animations.

    Enter Reanimated 2.

    Animation Worklets

    One of the interesting takeaways from the official announcement is that Software Mansion approached the problem from a different angle. Instead of starting from the main constraint, to not cross the React Native bridge, and offering a way to circumvent it, they asked: How would it look if there were no limitations when writing gestures and animations? They started the work on a solution from there.

    Reanimated 2 is based on a new API named animation worklets. These are JavaScript functions that run on the UI thread independently from the JavaScript thread. These functions can be declared as a worklet via the worklet directive.

    Worklets can receive parameters, access constants from the JavaScript thread, invoke other worklet functions, and invoke functions from the JavaScript thread asynchronously. This new API might remind you of web workers which are also isolated JS functions that talk to the main thread via asynchronous messages. They may also remind you of OpenGL shaders which are also pieces of code to be compiled and executed in an independent manner.

    The code snippet above showcases six properties from worklets. Worklets:

    1. run on the UI thread.
    2. are declared via the ‘worklet’ directive.
    3. can be invoked from the JS thread and receive parameters.
    4. can invoke other worklet functions synchronously.
    5. can access constants from the JS thread.
    6. can asynchronously call functions from the JS thread.

    Liquid-swipe example
    Reanimated 2 liquid-swipe example - click here to see animation

    The team at Software Mansion is offering us a couple of great examples to showcase the new Reanimated API. An interesting one is the liquid-swipe example. It features a couple of advanced animation techniques such as bezier curve interpolation, gesture handling, and physics-based animations, showing us that Reanimated 2 is ready for any kind of advanced gestures and animations

    The API

    When writing gestures and animations, you need to do three things:

    1. create animation values
    2. handle gesture states
    3. assign animation values to component properties.

    The new Reanimated API is offering us five new hooks to perform these three tasks.

    Create Values

    There are two hooks available to create animation values. useSharedValue() creates a shared value that are like Animated.Value but they exist in both the JavaScript and the UI thread. Hence the name.

    The useDerivedValue() hook creates a shared value based on some worklet execution. For instance, in the code snippet below, the theta value is computed in a worklet.

    Handle Gesture States

    useAnimatedGestureHandler() hook can connect worklets to any gesture handler from react-native-gesture-handler. Each event implements a callback with two parameters: event, which contains the values of the gesture handler, and context which you can conveniently use to keep some state across gesture events

    Assign Values to Properties

    useAnimatedStyle() returns a style object that can be assigned to an animated component.

    Finally, useAnimatedProps() is similar to useAnimatedStyle but for animated properties. Now animated props are set via the animatedProps property

    A Reanimated 2 Animation Example

    Let’s build our first example with Reanimated 2. We have an object that we want to drag around the screen. Like in v1, we create two values for us to translate the card on the x and y axis, and we wrap the card component with a PanGesture handler. So far so good.

    As seen in the above code, the animation values are created using useSharedValue and assigned to an animated view via useAnimatedStyle. Now let’s create a gesture handler via useAnimatedGestureHandler. In our gesture handler, we want to do three things:

    1. When the gesture starts, we store the translate values into the context object. This allows us to keep track of the cumulated translations across different gestures.
    2. When the gesture is active, we assign to translate the gesture translation plus its offset values.
    3. When the gesture is released, we want to add a decay animation based on its velocity to give the effect that the view moves like a real physical object

    The final example source can be found on GitHub. You can see the gesture in action below.

    Reanimated 2 PanGesture example
    Reanimated 2 PanGesture example - click here to see the animation

    Going Forward

    With Reanimated 2, the team at Software Mansion asked: “How would it look if there were no constraints when writing gestures and animations in React Native?” and started from there.

    This new version is based on the concept of animation worklets, JavaScript functions that are executed on the UI thread independently from the JavaScript thread. Worklets can receive parameters, access constants, and asynchronously invoke functions from your React code. The new Reanimated API offers five new hooks to create animation values, gesture handlers, and assign animated values to component properties. Part of the Github repository, are many examples that showcase the power of the new Reanimated API.

    We hope that you are as excited about this new version as we are. Reanimated 2 dramatically lowers the barrier to entry in building complex user-interactions in React Native. It also enables new use-cases where we previously had to cross the React Native bridge (to format values for instance). It also dramatically improves the performance at initialization time which in the future might have a substantial impact on particular tasks such as navigation transitions.

    We are looking forward to following the progress of this new exciting way to write gestures and animations.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    Writing Better, Type-safe Code with Sorbet

    Writing Better, Type-safe Code with Sorbet

    Hey, I’m Jay and I recently finished my first internship at Shopify as a Backend Developer Intern on the App Store Ads team. Out of all my contributions to the ad platform, I wanted to talk about one that has takeaways for any Ruby developer. Here are four reasons for why we adopted Sorbet for static type checking in our repository.

    1. Type-safe Method Calls

    Let’s take a look at an example.

    On the last line we call the method action and then call value.to_h on its return type. If action returns nil, calling value.to_h will cause an undefined method error.

    Without a unit test covering the case when action returns nil, such code could go by undetected. To make matters worse, what if foo() is overridden by a child class to have a different return type? When types are inferred from the names of variables such as in the example, it is hard for any new developer to know that their code needs to handle different return types. There is no clue to suggest what result contains, so the developer would have to search the entire code base for what it could be.

    Let’s see the same example with method signatures.

    In the revised example, it’s clear from the signature that `action` returns a Result object or nil. Sorbet type checking will raise an error to say that calling action.value.to_h is invalid because action can potentially return nil. If Sorbet doesn’t raise any errors regarding our method, we deduce that foo() returns a Result object, as well as an object (most likely an array) that we can call empty? on. Overall, method annotations give us additional clarity and safety. Now, instead of writing trivial unit tests for each case, we let Sorbet check the output for us.

    2. Type-safety for Complex Data Types

    When passing complex data types around, it’s easy to use hashes such as the following:

    This approach has a few concerns:

    • :id and :score may not be defined properties until the object is created in the database. If they’re not properties, calling or ad.score on the ad object will return nil, which is unexpected behavior in certain contexts.
    •  :state may be intended to be an enum. There are no runtime checks that ensure that a value such as running isn't accidentally put in the hash.
    •  :start_date has a value, but :end_date is nil. Can they both be nil? Will the :start_date always have a value? We don’t know without looking at the code that generated the object.

    Situations like this put a large onus on the developer to remember all the different variants of the hash and the contexts in which particular variants are used. It’s very easy for a developer to make a mistake by trying to access a key that doesn’t exist or assign the incorrect value to a key. Fortunately, Sorbet helps us solve these problems.

    Consider the example of creating an ad:

    Creating an ad
    Creating an ad

    Input data flows from an API request to the database through some layers of code. Once stored, a database record is returned.

    Here we define typed Sorbet structs for the input data and the output data. A Database::Ad extends an Input::Ad by additionally having an :id and :score.

    Each of the previous concerns have been addressed:

    • :id and :score clearly do not exist on ads being sent to the database as inputs, but definitely exist on ads being returned.
    • :state must be a State object (as an aside, we implement these using Sorbet enums), so invalid strings cannot be assigned to :state.
    • :end_date can be nil, but :start_date will never be nil.

    Any failure to obey these rules will raise errors during static type checking by Sorbet, and it is clear to developers what fields exist on our object when it’s being passed through our code.

    To extend beyond the scope of this article, we use GraphQL to specify type contracts between services. This lets us guarantee that ad data sent to our API will parse correctly into Input::Ad objects.

    3. Type-safe Polymorphism and Duck Typing

    Sorbet interfaces are integral to implementing the design patterns used in the Ad Platform repository. We’re committed to following a Hexagonal Structure with dependency injection:

    Hexagonal Structure with dependency injection
    Hexagonal Structure with dependency injection

    When we get an incoming request, we first compose a task to execute some logic by injecting the necessary ports/adapters. Then we execute the task and return its result. This architecture makes it easy to work on components individually and isolate logic for testing. This leads to very organized code, fast unit tests, and high maintainability—however, this strategy relies on explicit interfaces to keep contracts between components.

    Let’s see an example where errors can easily occur:

    In the example method, we call Action.perform with either a SynchronousIndexer or an AsynchronousIndexer. Both implement the index method in a different manner. For example, the AsynchronousIndexer may enqueue a job via a job queue, whereas the SynchronousIndexer may store values in a database immediately. The problem is that there’s no way to know if both indexers have the index method or if they return the correct result type expected by Action.perform.

    In this situation, Sorbet interfaces are handy:

    We define a module called Indexer that serves as our interface. AsynchronousIndexer and SynchronousIndexer as classes which implement this interface, which means that they both implement the index method. The index method must take in an array of keyword strings, and return a Result object as well as a list of errors.

    Now we can modify action to take an Indexer as a parameter so that it’s guaranteed that the indexer provided will implement the index method as expected. Now it’s clear to a developer what types are being used and it also ensures that the code behaves as expected.

    4. Support for Gradual Refactoring

    One roadblock to adding Sorbet to an entire codebase is that it’s a lot of work to refactor every file to be typed. Fortunately, Sorbet supports gradual typing. It statically types your codebase on a file-by-file level, so one can refactor at their own pace. A nice feature is that it comes with 5 different typing strictness levels, so one can choose the level of granularity. These levels also allow for gradual adoption across files in a codebase.

    On the ads team, we decided to refactor using a namespace-by-namespace scheme. When a particular Github issue requires committing to a set of files in the same namespace, we upgrade those to the minimum typed level of true, adding method signatures, interfaces, enums, and structs as needed.

    Enforcing Type Safety Catches Errors

    Typing our methods and data types with Sorbet encourages us to adhere to our design patterns more strictly. Sticking to our patterns keeps our code organized and friendly to developers while also discouraging duplication and bad practices. Enforcing type safety in our code saves us from shipping unsafe code to production and catches errors that our unit tests may not catch.

    We encourage everyone to try it in their projects!

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions. 

    Continue reading

    Shopify's Data Science & Engineering Foundations

    Shopify's Data Science & Engineering Foundations

    At Shopify, our mission is to make commerce better for everyone. With over one million businesses in more than 175 countries, Shopify is a mini-economy with merchants, partners, buyers, carriers, and payment providers all interacting. Careful and thoughtful planning helps us build products that positively impact the entire system.

    Commerce is a rapidly changing environment. Shopify’s Data Science & Engineering team supports our internal teams, merchants, and partners with high quality, daily insights so they can “Make great decisions quickly.” Here are the foundational approaches to data warehousing and analysis that empower us to deliver the best results for our ecosystem.

    1. Modelled Data

    One of the first things we do when we onboard (at least when I joined) is get a copy of The Data Warehouse Toolkit by Ralph Kimball. If you work in Data at Shopify, it’s required reading! Sadly it’s not about fancy deep neural nets or technologies and infrastructure. Instead, it focuses on data schemas and best practices for dimensional modelling. It answers questions like, “How should you design your tables so they can be easily joined together?” or “Which table makes the most sense to house a given column?” In essence, it explains how to take raw data and put it in a format that is queryable by anyone. 

    I’m not saying that this is the only good way to structure your data. For what it's worth, it could be the 10th best strategy. That doesn’t matter. What counts is that we agreed, as a Data Team, to use this modelling philosophy to build Shopify's data warehouse. Because of this agreed upon rule, I can very easily surf through data models produced by another team. I understand when to switch between dimension and fact tables. I know that I can safely join on dimensions because they handle unresolved rows in a standard way—with no sneaky nulls silently destroying rows after joining.

    The modelled data approach has a number of key benefits for working faster and more collaboratively. These are crucial as we continue to provide insights to our stakeholders and merchants in a rapidly changing environment.

    Key Benefits

    • No need to understand raw data’s structure
    • Data is compatible between teams

    2. Data Consistency and Open Access

    We have a single data modelling platform. It’s built on top of Spark in a single GitHub repo that everyone at Shopify can access, and everyone uses it. With everyone using the same tools as me, I can gather context quickly and independently: I know how to browse Ian's code, I can find where Ben has put the latest model, etc. I simply need to pick a table name and I can see 100% of the code that built that model.

    What is more, all of our modelled data sits on a Presto Cluster that’s available to the whole company, and not just data scientists (except PII information). That’s right! Anyone at the company can query our data. We also have internal tools to discover these data sets. That openness and consistency makes things scalable.

    Key Benefits

    • Data is easily discoverable
    • Everyone can take advantage of existing data

    3. Rigorous ETL (Extract, Transform, Load)

    As a company focused on software, the skills we’ve developed as a Data Team were influenced by our developer friends. All of our data pipeline jobs are unit tested. We test every situation that we can think of: errors, edge cases, and so on. This may slow down development a bit, but it also prevents many pitfalls. It’s easy to lose track of a JOIN that occasionally doubles the number of rows under a specific scenario. Unit testing catches this kind of thing more often than you would expect.

    We also ensure that the data pipeline does not let jobs fail in silence. While it may be painful to receive a Slack message at 4 pm on Friday about a five-year-old dataset that just failed, the system ensures you can trust the data you play with to be consistently fresh and accurate.

    Key Benefits

    • Better data accuracy and quality
    • Trust in data across the company

    4. Vetted Dashboards

    Like our data pipeline, we have one main visualization engine. All finalized reports are centralized on an internal website. Before blindly jumping into the code like a university student three hours before a huge deadline, we can go see what others have already published. In most cases, a significant portion of the metrics you’re looking for are already accessible to everyone. In other cases, an existing dashboard is pretty close to what we’re looking for. Since the base code for every dashboard is centralized, this is a great starting point.

    Key Benefits

    • Better discovery speed
    • Reuse of work

    5. Vetted data points

    All data points that form the basis for major decisions, or that need to be published externally are what we call vetted data points. They’re stored together with the context we need to understand them. This includes the original question, its answer, and the code that generated the results. One of the fundamentals in producing vetted data points is that the result shouldn’t change over time. For example, if I ask how many merchants were on the platform in Q1 2019, the answer should be the same today and in 4 years from now. Sounds trivial, but it’s harder than it looks! By having it all in a single GitHub repo, it's discoverable, reproducible, and easy to update each year

    Key Benefits

    • Reproducibility of key metrics

    6. Everything is Peer Reviewed

    All of our work is peer reviewed, usually by at least two other data scientists. Even my boss and my boss's boss go through this. This is another practice we gleaned by working closely with developers. Dashboards, vetted data points, dimensional models, unit tests, data extraction, etc… it’s all reviewed. Knowing several people looked at a query invokes a high level of trust in the data across the company. When we do work that touches more than one team, we make sure to involve reviewers from both teams. When we touch raw data, we add developers as reviewers. These tactics really improve the overall quality of data outputs by ensuring pipeline code and analytics meet a high standard that is upheld across the team.

    Key Benefits

    • Better data accuracy and quality
    • Higher trust in data

    7. Deep Product Understanding

    Now for my favourite part: all analyses require a deep understanding of the product. At Shopify, we strive to fall in love with the problem, not the tools. Excellence doesn’t come from just looking at the data, but from understanding what it means for our merchants.

    One way we do this is to divide the Data Team into smaller sub-teams, each of which is associated with a product (or product area). A clear benefit is that sub-teams become experts about a specific product and its data. We know it inside and out! We truly understand what enable means in the column status of some table.

    Product knowledge allows us to slice and dice quickly at the right angles. This has allowed us to focus on metrics that are vital for our merchants. Deep product understanding also allows us to guide stakeholders to good questions, identify confounding factors to account for in analyses, and design experiments that will really influence the direction of Shopify’s products.

    Of course, there is a downside, which I call the specialist gap: sub-teams have less visibility into other products and data sources. I’ll explain how we address that soon.

    Key Benefits

    • Better quality analysis
    • Emphasis on substantial problems

    8. Communication

    What is the point of insights if you don’t share them? Our philosophy is that discovering an insight is only half the work. The other half is communicating the result to the right people in a way they can understand.

    We try to avoid throwing a solitary graph or a statistic at anyone. Instead, we write down the findings along with our opinions and recommendations. Many people are uncomfortable with this, but it’s crucial if you want a result to be interpreted correctly and spur the right actions. We can't expect non-experts to focus on a survival analysis. This may be the data scientist’s tool to understand the data, but don’t mistake it for the result.

    On my team, every time anyone wants to communicate something, the message is peer reviewed, preferably by someone without much background knowledge of the problem. If they cannot understand your message, it’s probably not ready yet. Intuitively, it might seem best to review the work with someone who understands the importance of the message. However, assumptions about the message become clear when you engage someone with limited visibility. We often forget how much context we have on a problem when we’ve just finished working on it, so what we think is obvious might not be so obvious for others.

    Key Benefits

    • Stakeholder engagement
    • Positive influence on decision making

    9. Collaboration Across Data Teams

    Since Shopify went Digital by Default, I have worked with many people I’ve never met, and they’ve all been incredible! Because we share the same assumptions about the data and underlying frameworks, we understand each other. This enables us to work collaboratively with no restrictions in order to tackle important challenges faced by our merchants. Take COVID-19 for example. We created a fully cross-functional task force with one champion per data sub-team to close the specialist gap I mentioned previously. We meet to share findings on a daily basis and collaborate on deep dives that may require or affect multiple products. Within hours of establishing this task force, the team was running at full speed. Everyone has been successfully working together towards one goal, making things better for our merchants, without being constrained to their specific product area.

    Key Benefits

    • Business-wide impact
    • Team spirit

    10. Positive Philosophy About Data

    If you share some game-changing insights with a big decision maker at your company, do they listen? At Shopify, leaders might not action every single recommendation from Data because there are other considerations to weigh, but they definitely listen. They’re keen to consider anything that could help our merchants.

    Shopify announced several features at Reunite to help merchants like gift card features for all merchants and the launch of local deliveries. The Data Team provided many insights that influenced these decisions.

    At the end of the day, it is the data scientists job to make sure insights are understood by the key people. That being said, having leaders that listen helps a lot. Our company’s attitude towards data transforms our work from interesting to impactful.

    Key Benefits

    • Impactful data science

    No Team Member Starts from Scratch at Shopify

    Shopify isn’t perfect. However, our emphasis on foundations and building for the long term is paying off. No one on the Data Team needs to start from scratch. We leverage years of data work to uncover valuable insights. Some we get from existing dashboards and vetted data points. In other cases, modelled data allows us to calculate new metrics with fewer than 50 lines of SQL. Shopify’s culture of data sharing, collaboration, and informed decision making ensures these insights turn into action. I am proud that our investment in foundations is positively impacting the Data Team and our merchants.

    If you’re passionate about data at scale, and you’re eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    Spark Joy by Running Fewer Tests

    Spark Joy by Running Fewer Tests

    Developers write tests to ensure correctness and allow future changes to be made safely. However, as the number of features grows, so does the number of tests. Tests are a double-edged sword. On one hand, well-written ones catch bugs and maintain a program’s stability, but as the code base grows, a high number of tests impedes scalability because they take a long time to run and increase the likelihood of intermittently failing tests. Software projects often require all tests to pass before merging to the main branch. This adds overhead to all developers. Intermittently failing tests worsen the problem. Some causes of intermittently failing tests are

    • timing
    • instability in the database
    • HTTP connections/mockings
    • random generators
    • tests that leak state to other tests: the test passes every single time by itself, but fails other tests depending on the order.

    Unfortunately, one can’t fully eradicate intermittently failing tests, and the likelihood of them occurring increases as the codebase grows. They make already slow test suites even slower, because now you have to retry them until they pass.

    I’m not implying that one shouldn’t write tests. The benefits of quality assurance, performance monitoring, and speeding up development by catching bugs early instead of in production outweigh its downsides. However, improvements can be made. My team thus embarked on a journey of making our continuous integration (CI) more stable and faster. I’ll share the dynamic analysis system to select tests that we implemented, followed by other approaches we explored but decided against. Test selection sparks joy in my life. I wish that I can bring the same joy to you.

    Problems with Tests at Shopify

    Tests impede developers’ productivity here. The test suite of our monolithic repository:

    • has over 150,000 tests
    • is growing by 20-30% in size annually
    • takes about 30-40 min to run on hundreds of docker containers in parallel.

    Each pull request requires all tests to pass. Developers have to either wait for tests or pay for the price of context switching. In our bi-annual survey, build stability and speed is a recurring topic. So this problem is clearly felt by our developers.

    Solving the Problem with Tests

    There’s an abundance of blog posts/articles/research papers on optimizing code, unfortunately few on tests. In fact, we learned that it’s unrealistic to optimize tests because of the sheer quantity and growth. We also learned that this is an uncharted territory for many companies.

    As our research progressed, it became apparent that the right solution was to only run the tests related to the code change. This was challenging for a large, dynamically typed Ruby codebase that makes ample use of the language flexibility. Furthermore, the difficulty was exacerbated by metaprogramming in Rails as well as other non-Ruby files in the code base that affect how the application behaves, for example, YAML, JSON, and JavaScript.

    What Is Dynamic Analysis?

    Dynamic analysis, in essence, is logging every single method call. We run each test and track all the files in the call graph. Once we have the call graphs, we create a test mapping: for every file, we find what tests have that file in its call graph. By looking at what files have been modified, added, removed, or renamed, we can look up the tests we need to run.

    You can check out Rotoscope and Tracepoint to record call graphs for Ruby applications.

    Why Use Dynamic Analysis?

    Ruby is a dynamically typed language, we can’t retrieve a dependency graph using static analysis. Thus, we don’t know the corresponding tests for the code.

    Downsides of Running Dynamic Analysis on Ruby on Rails

    1. It’s Slow.

    It’s computationally intensive to generate the call graphs, and we can’t run it for every single PR. Instead, we run dynamic analysis on every deployed commit.

    2. Mapping Can Lag Behind HEAD

    The generated mapping lags behind the head of the main because it runs asynchronously. To solve this problem, we run all tests that satisfy at least one of the following criteria:

    • added or modified on the current branch
    • added or modified between the head of the last generated mapping and current branch head
    • mapped tests per current branch’s code change

    3. There Are Untraceable Files

    There are non-Ruby files such as YAML, JSON, etc. that can’t be traced on the call graph. We added custom patches to Rails to trace some of them. For example, we patched the I18n::Backend class to trace the translation files in YAML. For changes to files that haven’t been traced, we run every single test.

    4. Metaprogramming Obfuscates Call Paths

    We added existing metaprogramming in a known directory and added glob rules on the file path to determine which tests to run. We discourage new metaprogramming through Sorbet and other linters.

    5. Some Failing Tests Won’t Be Caught

    The generated mapping from dynamic analysis can get out of date with the latest main, and sometimes failing tests won’t get selected and get merged to main. To circumvent the issue, we run the full test suite every deploy and automatically disable any failing test, so other developers won’t be blocked from shipping their code. The full test suite runs asynchronously. Pull requests can get merged before the full test suite completes.

    Automatic disabling of failing tests sounds counterintuitive to many people. From what we observed, the percentage of pull requests with failing tests being merged to the main branch is about 0.06%. We also have other mechanisms to mitigate the risks, such as canary deploys and type checking using Sorbet. The code owners of the failing tests are notified. We expect developers to fix or remove the failures without blocking future deploys.

    How Was the Dynamic Analysis Rolled Out?

    In the experimentation phase of the new dynamic analysis system, the test-selection pipeline ran in parallel with the pipeline that runs the full test suite for each new PR. The recall of the new test selection pipeline was measured. Out of all the failing tests, we measured if the new pipeline selects the same failing tests. We didn’t care about the tests that pass because it’s only the failing tests that cause trouble.

    We measured our results using three metrics.

    Failure Recall

    We define recall as the percentage of legitimately failing tests, excluding intermittently failing tests, that our system selected. We want this to be as close as possible to 100%. It’s hard to measure this accurately because of the occurrence of intermittently failing tests. Instead, we approximate the recall by looking at the number of consistently failing tests merged into main.

    After two months that the project has been active, out of the 8,360 commits that were merged, we’ve only failed to detect five failing tests that landed on main. We also managed to resolve most of the root causes of those missed failures, so the same problems don’t repeat in the future.

    We achieved a 99.94% recall.

    Speed Improvement

    The overall selection rate, the ratio of selected tests over total number of tests, is about 60%:

    Percentage of selected test files per build
    Percentage of selected test files per build

    About 40% of builds run fewer than 20% of tests. This shows that many developers will significantly benefit from our test selection system:

    Percentage of builds that selected fewer than 20% of all tests
    Percentage of builds that selected fewer than 20% of all tests

    Compute Time

    In total, we’re saving about 25% compute time. This is measured by adding up the time spent preparing and running tests on every docker container in a build, and averaging that across all builds. It didn’t decrease more because a significant chunk of computing time is still used for setting up containers, databases, and pulling caches. Note that we’re also adding compute time by running the dynamic analysis for every deployed commit on main. We estimate that this will undo some, but not all of the infrastructure cost savings.

    Other Approaches We Explored

    Prior to choosing dynamic analysis, we explored other approaches but ultimately ruled them out.

    Static Analysis

    To determine a dependency graph, we briefly explored using Sorbet, but this would only be possible if the entire code base was converted to strict Sorbet type. Unfortunately, the code base is only partially in strict Sorbet type and too big for my team to convert the rest.

    Machine Learning

    It’s possible to use machine learning to find the dependency graph. Facebook has an excellent research paper [PDF] on it. We chose dynamic analysis at Shopify because we’re not sure if we have enough data to make the prediction, and we want to choose an approach that’s deterministic and reproducible.

    More Machines for Tests

    We tried adding more machines for the test suite. Unfortunately, the performance didn’t increase linearly as we scaled horizontally. In fact, tests on average take longer as we increase the number of machines past a certain number. Increasing machines doesn’t reduce intermittently failing tests and it increases the possibility of failing connections to sidecars, thus increasing test retries.

    Benefits of Running Fewer Tests

    There are three major benefits of selectively running tests:

    1. Developers get faster feedback from tests.
    2. The likelihood of encountering intermittently failing tests decreases, and. Thus, it increases the speed of developers further.
    3. CI costs less.

    Skepticism about Dynamic Analysis/Test Selection

    Before the feature was rolled out, many developers were skeptical that it would work. Frankly, I was one of them. Many people voiced their doubts both privately and openly. However, much to my surprise, after it went live, people were silent. On average, developers request running the full test suites on under 2% of the pull requests.

    If you’re in a similar situation, I hope our experience helps you. It’s hard for developers to embrace the idea that some tests won’t be run when the importance of tests is ingrained in our heads.

    If this sounds like the kind of problems you'd enjoy solving, come work for us. Check out the Software Development at Shopify (Expression of Interest) career posting and apply specifying an interest in Developer Acceleration.  

    Continue reading

    Understanding Programs Using Graphs

    Understanding Programs Using Graphs

    A recording for our event, ShipIt! presents Understanding Programs Using Graphs in TruffleRuby, is now available in the ShipIt! Presents section of this page.

    You may have heard that a compiler uses a data structure called an abstract-syntax-tree, or AST, when it parses your program, but it’s less common knowledge that an AST is normally used just at the very start of compilation, and a powerful compiler generally uses a more advanced type of data structure as its intermediate representation to optimize your program and translate it to machine code in the later phases of compilation. In the case of TruffleRuby, the just-in-time compiler for Ruby that we’re working on at Shopify, this data structure is something called a sea-of-nodes graph.

    I want to show you a little of what this sea-of-nodes graph data structure looks like and how it works, and I think it’s worth doing this for a couple of reasons. First of all, I just think it’s a really interesting and beautiful data structure, but also a practical one. There’s a lot of pressure to learn about data structures and algorithms in order to pass interviews in our industry, and it’s nice to show something really practical that we’re applying here at Shopify. Also, the graphs in sea-of-nodes are just really visually appealing to me and I wanted to share them.

    Secondly, knowing just a little about this data structure can give you some pretty deep insights into what your program really means and how the compiler understands it, so I think it can increase your understanding of how your programs run. 

    I should tell you at this point that I’m afraid I’m actually going to be using Java to show my examples, in order to keep them simpler and easier to understand. This is because compared to Ruby, Java has much simpler semantics—simpler rules for how the language works—and so much simpler graphs. For example, if you index an array in Java that’s pretty much all there is to it, simple indexing. If you index an array in Ruby you could have a positive index, a negative index, a range, conversion, coercion, and lots more—it’s just more complicated. But don’t worry, it’s all super-basic Java code, so you can just pretend it’s pseudo code if you’re coming from Ruby or any other language.

    Reading Sea-of-nodes Graphs

    Lets dive straight in by showing some code and the corresponding sea-of-nodes graph.

    Here’s a Java method. As I said, it’s not using any advanced Java features so you can just pretend it’s pseudo code if you want to. It returns a number from a mathematical sequence known as the Fibonacci sequence, using a simple recursive approach.

    Here’s the traditional AST data structure for this program. It’s a tree, and it directly maps to the textual source code, adding and removing nothing. To run it you’d follow a path in the tree, starting at the top and moving through it depth-first.

    Abstract syntax tree for the Fibonacci sequence program
    Abstract syntax tree for the Fibonacci sequence program

    And here’s the sea-of-nodes graph for the same program. This is a real dump of the data structure as used in practice to compile the Java method for the Fibonacci sequence we showed earlier.

    Sea-of-nodes graph for the Fibonacci sequence program
    Sea-of-nodes graph for the Fibonacci sequence program

    There’s quite a lot going on here, but I’ll break it down.

    We’ve got boxes and arrows, so it’s like a flowchart. The boxes are operations, and the arrows are connections between operations. An operation can only run after all the operations with arrows into it have run.

    A really important concept is that there are two main kinds of arrows that are very different. Thick red arrows show how the control flows in your program. Thin green arrows show how the data flows in your program. The dashed black arrows are meta-information. Some people draw the green arrows pointing upwards, but in our team we think it’s simpler to show data flowing downwards.

    There are also two major kinds of boxes for operations. Square red boxes do something; they have a side-effect or are imperative in some way. Diamond green boxes compute something (they’re pure, or side-effect free), and green for safe to execute whenever.

    P(0) means parameter 0, or the first parameter. C(2) means a constant value of 2. Most of the other nodes should be understandable from their labels. Each node has a number for easy reference.

    To run the program in your head, start at the Start node at the top and move down thick red arrows towards one of the Return nodes at the bottom. If your square red box has an arrow into it from an oval or diamond green box, then you run that green box, and any other green boxes pointing into that green box, first.

    Here’s one major thing that I think is really great about this data structure. The red parts are an imperative program, and the green parts are mini functional programs. We’ve separated the two out of the single Java program. They’re joined where it matters, and not where it doesn’t. This will get useful later on.

    Understanding Through Graphs

    I said that I think you can learn some insights about your program using these graphs. Here’s what I mean by that.

    When you write a program in text, you’re writing in a linear format that implies lots of things that aren’t really there. When we get the program into a graph format, we can encode only the actual precise rules of the language and relax everything else.

    I know that’s a bit abstract, so here’s a concrete example.

    Based on a three-way if-statement it does some arithmetic. Notice that b * c is common to two of the three branches.

    Sea-of-nodes graph for a three-way if-statement-program
    Sea-of-nodes graph for a three-way if-statement-program

    When we look at the graph for this we can see a really clear division between the imperative parts of the program and the functional parts. Notice in particular that there is only one multiplication operation. The value of a * b is the same on whichever branch you compute it, so we have just one value node in the graph to compute it. It doesn’t matter that it appeared twice in the textual source code—it has been de-duplicated by a process known as global value numbering. Also, the multiplication node isn’t fixed in either of the branches, because it’s a functional operation and it could happen at any point and it makes no change to what the program achieves.

    When you look at the source code you think that you pick a branch and only then you may execute a * b, but looking at the graph we can see that the computation a * b is really free from which branch you pick. You can run it before the branch if you want to, and then just ignore it if you take the branch which doesn’t need it. Maybe doing that produces smaller machine code because you only have the code for the multiplication once, and maybe it’s faster to do the multiplication before the branch because your processor can then be busy doing the multiplication while it decides which branch to go to.

    As long as the multiplication node’s result is ready when we need it, we’re free to put it wherever we want it.

    You may look at the original code and say that you could refactor it to have the common expression pulled out in a local variable. We can see here that doing that makes no difference to how the compiler understands the code. It may still be worth it for readability, but the compiler sees through your variable names and moves the expression to where it thinks it makes sense. We would say that it floats the expression.

    Graphs With Loops

    Here’s another example. This one has a loop.

    It adds the parameter a to an accumulator n times.

    Sea-of-nodes graph for a program with loops
    Sea-of-nodes graph for a program with loops

    This graph has something new, an extra thick red arrow backward now. That closes the loop, it’s the jump back to the start of the loop for a new iteration.

    The program is written in an imperative way, with a traditional iterative looping construct as you’d use in C, but if we look at the little isolated functional part, we can see the repeated addition on its own very clearly. There’s literally a little loop showing that the + 1 operation runs repeatedly on its own result.

    Isolated functional part of sea-of-nodes graph
    Isolated functional part of sea-of-nodes graph

    That phi node (the little circle with the line in it is a Greek letter) is a slightly complicated concept with a traditional name. It means that the value at that point may be one of multiple possibilities.

    Should We Program Using Graphs?

    Every few years someone writes a new PhD thesis on how we should all be programming graphically instead of using text. I think you can possibly see the potential benefits and the practical drawbacks of doing that by looking at these graphs.

    One benefit is that you’re free to reshape, restructure, and optimize your program by manipulating the graph. As long as you maintain a set of rules, the rules for the language you’re compiling, you can do whatever you want.

    A drawback is that it’s not exactly compact. This is a 6 line method but it’s a full-screen to draw it as a graph, and it already has 21 nodes and 22 arrows in it. As we get bigger graphs it becomes impossible to draw them without the arrows starting to cross and they become so long that they have no context—you can’t see where they’re going to or coming from, and then it becomes much harder to understand.

    Using Sea-of-nodes Graphs at Shopify

    At Shopify we’re working on ways to understand these graphs at Shopify-scale. The graphs for idiomatic Ruby code in a codebase like our Storefront Renderer can get very large and very complicated—for example this is the Ruby equivalent to the Java Fibonacci example.

    Sea-of-nodes graph for the Java Fibonacci example in Ruby
    Sea-of-nodes graph for the Java Fibonacci example in Ruby (click for larger SVG version)

    One tool we’re building is the program to draw these graphs that I’ve been showing you. It takes compiler debug dumps and produces these illustrations. We’re also working on a tool to decompile the graphs back to Ruby code, so that we can understand how Ruby code is optimized, by printing the optimized Ruby code. That means that developers who just know Ruby can use Ruby to understand what the compiler is doing.


    In summary, this sea-of-nodes graph data structure allows us to represent a program in a way that relaxes what doesn’t matter and encodes the underlying connections between parts of the program. The compiler uses it to optimize your program. You may think of your program as a linear sequence of instructions, but really your compiler is able to see through that to something simpler and more pure, and in TruffleRuby it does that using sea-of-nodes.

    Sea-of-nodes graphs are an interesting and, for most people, novel way to look at your program.

    Additional Information

    ShipIt! Presents: Understanding Programs Using Graphs in TruffleRuby

    Watch Chris Seaton and learn about TruffleRuby’s sea-of-nodes. This beautiful data structure can reveal surprising in-depth insights into your program.


    If this sounds like the kind of problems you'd enjoy solving, come work for us. Check out the Software Development at Shopify (Expression of Interest) career posting and apply specifying an interest in Developer Acceleration. 

    Continue reading

    ShipIt! Presents: How Shopify Uses Nix

    ShipIt! Presents: How Shopify Uses Nix

    On May 25, 2020,  ShipIt!, our monthly event series, presented How Shopify Uses Nix. Building upon on my What is Nix post, I show how we rebuilt our developer tooling using Nix, and show off some of the tooling we actually use at Shopify on a day-to-day basis.

    I wasn't able to answer all the questions during the event, so I've included answers to those ones below.

    Would runix interop well with lorri if/when it's open sourced?

    Maybe. Not effortlessly, because our whole shadowenv strategy is similar but different. It could probably be made to work without too much effort, and as long as compatibility didn’t make some major tradeoff that I’m not able to guess at right now. We’d be super open to a PR to make it compatible.

    Do you use nix for CI/CD, and if you do, how is it set up?

    Not yet. Hoping to get to that late this year.

    For which Lisp was that Lisp code you showed earlier?

    It uses Ketos, a little Rust implementation, but it’s almost not important: we document the available functions, and there are very few. I like to think of it more as a DSL than even as a “real” Lisp.

    I'm curious about how everyone WFH affects this tooling? Is there some limit to how often you can update dependencies because it'll force people to re-download everything on a rebase over their home internet connections?

    Yeah, this is something we’re still puzzling through. We don’t bump our nixpkgs revision very often just as a matter of, I don’t know, laziness maybe, but we’ve definitely seen more people complaining about large downloads when we do since moving out of our offices with nice multi-Gbit fiber. Mainly, it’s going to be interesting to see the world struggle with trying to provide home-workers with better internet speeds over the next year. This is something Canada and the US do an abysmal job of right now.

    What's been the pain points with Nix for your use case?

    The tooling is in general really optimized for “build” workflows, not development workflows. This is really clear in the bundleEnv/bundleApp workflows. I had to spend a lot of time putting together a solution for that that mapped into the way we actually want to use it, better. I’m going to try to upstream that at some point, but I anticipate more of these problem.

    One of your dev.yml files had import < nixpkgs > in it. Do you pin nixpkgs versions in your system?

    Yes. I think I kind of touched on this probably later in the demo but that dev tool has a specific nixpkgs revision that it enforces each time dev up is run.

    Is there more of an issue with gem dependencies as they populate a global cache by default vs something like node? Do you "node"-nix your node deps or are them being in the project-local node_modules "enough" in this case?

    We don’t yet manage node modules with Nix but I’m looking at doing that whenever we get the (large amount of) time required to do that successfully. Profpatsch/yarn2nix seems promising. We started with gems just because they annoyed me more and I had more familiarity with how they work.

    If you’re looking for more Nix content, I’ve re-released a series of screencasts I recorded for developers at Shopify to the public. Check out Nixology on YouTube.

    If this sounds like the kind of problems you'd enjoy solving, come work for us. Check out the Software Development at Shopify (Expression of Interest) career posting and apply specifying an interest in Developer Acceleration.

    Continue reading

    7 Ways to Make Your SQL Workshop Beginner-friendly

    7 Ways to Make Your SQL Workshop Beginner-friendly

    Data is key to making great decisions at scale, so it’s no surprise that all new hires in RnD (Research & Development) at Shopify are invited to take a 90-minute hands-on workshop to learn about Shopify’s data practices, architecture, and some SQL. Since RnD is a multi-disciplinary organization, participants come from a variety of technical backgrounds ( engineering, data science, UX, project management) and sometimes non-technical disciplines that work closely with engineering teams. We split the workshop into two groups: a beginner-friendly group that assumes no prior knowledge of SQL or other data tools, and an intermediate group for those with familiarity.

    Beginners have little-to-no experience with SQL, although they may have experience with other programming languages. Still, many participants aren’t programmers or data analysts. They’ve never encountered databases, tables, data models, and have never seen a SQL query. The goal is to empower them to responsibly use the data available to them to discover insights and make better decisions in their work.

    That’s a lofty goal to achieve in 90 minutes, but creating an engaging, interactive workshop can help us get there. Here are some of the approaches we use to help participants walk away equipped and excited to start using data to power their work.

    1. Answer the Basics: Where Does Data Come From?

    Before learning to query and analyze data, you should first know some basics about where the data comes from and how to use it responsibly. Data Scientists at Shopify are responsible for acquiring and preprocessing data, and generating data models that can be leveraged by anyone looking to query the data. This means participants of the workshop, who aren’t data scientists, won’t be expected to know how to collect and clean big data, but should understand how the data available to them was collected. We take the first 30 minutes to walk participants through the basics of the data warehouse architecture:

    • Where does data come from?
    • How does it end up in the data warehouse?
    • What does “raw data” vs. “cleaned/modelled data” mean?

    Data Warehouse Architecture

    Data Warehouse Architecture

    We also caution participants to use data ethically and responsibly. We warn them not to query production systems directly, rely on modelled data when possible, and respect data privacy. We also give them tools that will help them find and understand datasets (even if that “tool” is to ask a data scientist for help). By the time they are ready to start querying, they should have an appreciation for how datasets have been prepared and what they can do with them.

    2. Use Real Data and Production Tools

    At Shopify, anyone in RnD has access to tools to help them query and analyze data extracted from the ecosystem of apps and services that power Shopify. Our workshop doesn’t play around with dummy data—participants go right into using real data from our production systems, and learn the tools they’ll be using throughout their career at Shopify.

    If you have multiple tools available, choose one that has a simple querying interface, easy chart-building capabilities, and is connected (or can easily be connected) to your data source. We use Mode Analytics, which has a clean console UI, drag-and-drop chart and report builders, clean error feedback, and a pane for searching and previewing data sources from our data lake.

    In addition to choosing the right tool, the dataset can make or break the workshop. Choosing a complex dataset with cryptic column names that is poorly documented, or one that requires extensive domain knowledge will draw the attention of the workshop away from learning to query. Instead, participants will be full of questions about what their data means. For these reasons, we often use our support tickets dataset. Most people understand the basic concepts of customer support: a customer who has an issue submits a ticket for support and an agent handles the question and closes the ticket once it’s been solved. That ticket’s information exists in a table where we have access to facts like when the ticket was created, when it was solved, who it was handled by, what customer it was linked to, and more. As a rule of thumb, if you can explain the domain and the structure of the data in 3 sentences or less, it’s a good candidate to use in a beginner exercise.

    3. Identify Your Objectives

    To figure out what your workshop should touch, it’s often helpful to think about the types of questions you want your participants to be able to answer. At the very least, you should introduce the SELECT, FROM, WHERE, ORDER BY, and LIMIT clauses. These are foundational for almost any type of question participants will want to answer. Additional techniques can depend on your organization’s goals.

    Some of the most common questions we often answer with SQL include basic counts, trends over time, and segmenting metrics by certain user attributes. For example, a question we might get in Support is “How many tickets have we processed monthly for merchants with a Shopify Plus plan?”, or “What was the average handling time of tickets in January 2020?”

    For our workshop, we teach participants foundations of SQL that include the keywords SELECT, DISTINCT, FROM, WHERE, JOIN, GROUP BY, and ORDER BY, along with functions like COUNT, AVG, and SUM. We believe this provides a solid foundation to answer almost any type of question someone outside of Data Science could be able to self-solve.

    4. Translate Objectives Into Real Questions

    Do you remember the most common question from your high school math classes? If it was anything like my classroom, it was probably, “When will we actually need to know this in the real world?” Linking the techniques to real questions helps all audiences grasp the “why” behind what we’re doing. It also helps participants identify real use cases of the queries they build and how they can apply queries to their own product areas, which motivates them to keep learning!

    Once you have an idea what keywords and functions you want to include in your workshop, you’ll need to figure out how you want to teach them to your participants. Using the dataset you chose for the workshop, construct some interesting questions and identify the workshop objectives required for each question.

    Identifying Objectives and Questions

    Identifying Objectives and Questions


    It also helps to have one big goal question for participants to aim to answer. Ideally, answering the question should result in a valuable, actionable insight. For example, using the domain of support tickets, our goal question can be “How have wait times for customers with a premium plan changed over time?” Answers to this question can help identify trends and bottlenecks in our support service, so participants can walk away knowing their insights can solve real problems.

    A beginner-friendly exercise should start with the simplest question and work its way toward answering the goal question.

    5. Start With Exploration of Data Sources

    An important part of any analysis is the initial exploration. When we encounter new data sources, we should always spend time understanding the structure and quality of the data before trying to build insights off it. For example, we should know what the useful columns for our analysis will be, the ranges for any numerical or date columns, the unique values for any text columns, and the filtering we’ll need to apply later, such as filtering out test tickets created by the development team.

    The first query we run with participants is always a simple “SELECT * FROM {table}”. Not only does this introduce participants to their first keywords, but it gives us a chance to see what the data in the table looks like. We then learn how to select specific columns and apply functions like MIN, MAX, or DISTINCT to explore ranges.

    6. Build on Each Query to Answer More Complex Questions

    Earlier, we talked about identifying real questions we want to have participants answer with SQL. It turns out that it only really requires one or two additional keywords or functions to answer increasingly-difficult questions from a simple query.

    We discussed how participants start with a simple “SELECT * FROM {tickets_table}” query to explore the data. This forms the foundational query used for the rest of the exercise. Next, we might start adding keywords or functions in this sequence:

    Question → Objective

    1. How many tickets were created last month? → Add COUNT and WHERE.
    2. What is the daily count of tickets created over the last month? → Add GROUP BY and date-parsing functions (e.g. Presto’s DATE_TRUNC function to convert a timestamp to a day).
    3. How do the metrics reported in #2 change by the customer’s plan type? → Add JOIN and GROUP BY multiple columns.

    Each question above can be built from the prior query and the addition of up to two new concepts, and already participants can start deriving some interesting insights from the data!

    7. Provide Resources to Keep the Learning Journey Going

    Unfortunately, there’s only so far we can go in a single 90-minute workshop and we won’t be able to answer every question participants may have. However, there are many free resources out there for those who want to keep learning. Before we part, here’s a list of my favourite learning resources from beginner to advanced levels.

    Why Teach SQL to Non-Technical Users?

    I haven’t run into anyone in RnD at Shopify who hasn’t used SQL in some capacity. Those who invest time into learning it are often able to use the available data tools to 10x their work. When employees are empowered to self-serve their data, they have the ability to quickly gather information that helps them make better decisions.

    There are many benefits to learning SQL, regardless of discipline. SQL is a tool that can help us troubleshoot issues, aggregate data, or identify trends and anomalies. As a developer, you might use SQL to interact with your app’s databases. A UX designer can use SQL to get some basic information on the types of users using their feature and how they interact with the product. SQL helps technical writers explore pageview and interaction data of their documents to inform how they should spend their time updating documents that may be stale or filling resource gaps. The opportunities are endless when we empower everyone.

    Learn SQL!

    Learn SQL from the ground up with these free resources! Don’t be discouraged by the number of resources shared below - pick and choose which resources best fit your learning style and goals. See the comments section to understand the key concepts you should learn. If you choose to skip a resource, make sure you understand any concepts listed before moving on.

    Database theory

    Database management systems allow data to be stored and accessed in a computer system. At Shopify, many applications store their data inRelational Database Management Systems (RDMS). It can be useful to understand the basics behind RDMS since much of the raw data we encounter will come from these databases, which can be “spoken to” with SQL. Learning about RDMS also exposes you to concepts like tables, keys, and relational operations that are helpful in learning SQL.




    Udacity - Intro to Relational Databases 

    This free course introduces databases for developers. Complete the 1st lesson to learn about data and tables. Complete the rest of the lessons if they interest you, or continue on with the rest of the resources

    Stanford Databases 1 MOOC

    This free MOOC from Stanford University is a great resource for tailoring your learning path to your specific goals. Unfortunately as of March 2020, the Lagunita online learning platform was retired and they are working on porting the course over to Stay tuned for these resources to come back online!

    When they become available, check out the following videos: Introduction, Relational Model, Querying Relational Databases

    Some terms you should be familiar with at this point: data model, schema, DDL (Data Definition Language), DML (Data Manipulation Language), Columns/attributes, Rows, Keys (primary and foreign keys), Constraints, and Null values

    Coursera Relational Database Systems

    While the Stanford MOOC is being migrated, this course may be the next best thing. It covers many of the concepts discussed above

    Coursera - Database Management Essentials

    This is another comprehensive course that covers relational models and querying. This is part of a specialization focusing on business intelligence warehousing, so there’s also some exposure to data modelling and normalization (advanced concepts!)

     Querying with SQL

    Before learning the SQL syntax, it’s useful to familiarize yourself with some of the concepts listed in the prior section. You should understand the concept of a table (which has columns/attributes and rows), data types, and keys (primary and foreign keys).

    Beginner resources




    SQLBolt interactive lessons

    This series of interaction lessons will take you through the basics of SQL. Lessons 1-6 cover basic SQL exercises, and lessons 7-12 cover intermediate-level concepts. The remaining exercises are primarily for developers and focus on creating and manipulating tables.

    You should become familiar with SQL keywords like: SELECT, FROM, WHERE, LIMIT, ORDER BY, GROUP BY, and JOIN

    Intermediate-level concepts you should begin to understand include: aggregations, types of joins, unions

    Article: A beginner's guide to SQL

    This is a great read that breaks down all the keywords you’ve encountered so far in the most beginner-friendly way possible. If you’re still confused about what a keyword means, how to use it properly, or how to get the syntax right for the question you’re trying to answer (what order do these keywords go in?), this is a great resource to help you build your intuition.

    W3 Schools SQL Tutorial

    W3 has a comprehensive tutorial for learning SQL. I don’t recommend you start with w3, since it doesn’t organize the concepts in a beginner-friendly way as the other resources have done. But, even as an advanced SQL user, I still find their guides helpful as a reference. It’s also a good reference for understanding all the different JOIN types.

    YouTube - SQL Tutorial

    For the audio-visual learners, this video lesson covers the basics of database management and SQL

    Intermediate resources




    SQL Basics - Subqueries

    SQL Basics - Aggregations

    Despite its name, this resource covers some topics I’d consider beyond the basic level.

    Learn about subqueries and aggregations.This site also has good notes on joins if you’re still trying to wrap your head around them.

    SQL Zoo Interactive Lessons

    Complete the exercises here for some more hands-on practice

    Modern SQL - WITH Keyword

    Learn about the WITH keyword (also known as Common Table Expressions (CTE)) for writing more readable queries

    Mode's SQL Tutorial

    At Shopify, we use Mode to analyze data and create shareable reports. Mode has its own SQL tutorial that also shares some tips specific to working in the Mode environment. I find their section on the UNION operator more comprehensive than other sources. If you’re interested in advanced topics, you can also explore their advanced tutorials.>

    Complete some of their exercises in the Intermediate and Advanced sections. Notably, at this point you should start to become comfortable with aggregate functions - functions like COUNT, SUM, MIN, MAX, AVG, GROUP BY and HAVING keywords, CASE statements, joins, and date formats

    Here’s a full list of interactive exercises available:

    And finally, bookmark this SQL cheat sheet for future reference

    Advanced resources

    The journey to learning SQL can sometimes feel endless - there are many keywords, functions, or techniques that can help you craft readable, performant queries. Here are some advanced concepts that might be useful to add to your toolbelt.

    Note: database systems and querying engines may have different functions built-in. At Shopify, we use Presto to query data across several datastores - consult the Presto docs on available functions




    Window Functions 

    Window functions are powerful functions that allow you to perform a calculation across a set of rows in a “window” of your data.

    The Mode SQL tutorial also shares some examples and practice problems for working with window functions

    Common Table Expressions (CTE)

    Common Table Expressions, aka CTEs, allow you to create a query that can be reused throughout the context of a larger query. They are useful for enhancing the readability and performance of your query.

    Explain (Presto)

    The EXPLAIN keyword in Presto’s SQL engine allows you to see a visual representation of the operations performed by the engine in order to return the data required by your query, which can be used to troubleshoot performance issues. Other SQL engines may have alternative methods for creating an execution plan

    Grouping Sets

    Grouping sets can be used in the GROUP BY clause to define multiple groupings in the same query


    Coalesce is used to evaluate a number of arguments and return the first non-NULL argument seen. For example, it can be used to set default values for a column holding NULL values.

    String functions

    More string functions (Presto)

    Explore functions for working with string values

    Date functions

    More date functions (Presto)

    Explore functions for working with date and time values

    Best practices

    More best practices

    These “best practices” share tips for enhancing the performance of your queries

    Magic of SQL (YouTube)

    Chris Saxon shares tips and tricks for advancing your SQL knowledge and tuning performance

    If you’re passionate about data at scale, and you’re eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    What Is Nix

    What Is Nix

    Over the past year and a bit, Shopify has been progressively rebuilding parts of our developer tooling with Nix. I initially planned to write about how we're using Nix now, and what we're going to do with it in the future (spoiler: everything?). However, I realize that most of you won't have a really clear handle on what Nix is, and I haven't found a lot of the introductory material to convey a clear impression very quickly, so this article is going to be a crash course in what Nix is, how to think about it, and why it's such a valuable and paradigm-shifting piece of technology.

    There are a few places in this post where I will lie to you in subtle ways to gloss over all of the small nuances and exceptions to rules. I'm not going to call these out. I'm just trying to build a general understanding. At the end of this post, you should have the basic conceptual scaffolding you need in order to think about Nix. Let's dive in!

    What is Nix?

    The most basic, fundamental idea behind Nix is this:

    Everything on your computer implicitly depends on a whole bunch of other things on your computer.

    • All software exists in a graph of dependencies.
    • Most of the time, this graph is implicit.
    • Nix makes this graph explicit.

    Four Building Blocks

    Let's get this out of the way up front: Nix is a hard thing to explain.

    There are a few components that you have to understand in order to really get it, and all of their explanations are somewhat interdependent; and, even after explaining all of these building blocks, it still takes a bit of mulling over the implications of how they compose in order for the magic of Nix to really click. Nevertheless, we'll try, one block at a time.

    The major building blocks, at least in my mental model of Nix, are:

    1. The Nix Store
    2. Derivations
    3. Sandboxing
    4. The Nix Language.

    The Nix Store

    The easiest place to start is the Nix Store. Once you've installed Nix, you'll wind up with a directory at /nix/store, containing a whole bunch of entries that look something like this:


    This directory, /nix/store, is a kind of Graph Database. Each entry (each file or directory directly under /nix/store) is a Node in that Graph Database, and the relationships between them constitute Edges.

    The only thing that's allowed to write directories and files into /nix/store is Nix itself, and after Nix writes a Node into this Graph Database, it's completely immutable forever after: Nix guarantees that the contents of a Node doesn't change after it's been created. Further, due to magic that we'll discuss later, the contents of a given Node is guaranteed to be functionally identical to a Node with the same name in some other Graph, regardless of where they're built.

    What, then, is a "relationship between them?" Put another way, what is an Edge? Well, the first part of a Store path (the 32-character-long alphanumeric blob) is a cryptographic hash (of what, we'll discuss later). If a file in some other Store path includes the literal text "h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3", that constitutes a graph Edge pointing from the Node containing that text to the Node referred to by that path. Nodes in the Nix store are immutable after they're created, and the Edges they originate are scanned and cached elsewhere when they're first created.

    To demonstrate this linkage, if you run otool -L (or ldd on Linux) on the nix binary, you'll see a number of libraries referenced, and these look like:


    That's extracted by otool or ldd, but ultimately comes from text embedded in the binary, and Nix sees this too when it determines the Edges directed from this Node.

    Highly astute readers may be skeptical that scanning for literal path references in a Node after it's created is a reliable way to determine a dependency. For now, just take it as given that this, surprisingly, works almost flawlessly in practice.

    To put this into practice, we can demonstrate just how much of a Graph Database this actually is using nix-store --query. /nix/store is a tool built in to Nix that interacts directly with the Nix Store, and the --query mode has a multitude of flags for asking different questions of the Graph Database that is the Store.

    Let's find all of the Nodes that <hash>-nix-2.3.3 has Edges pointing to:

    $ nix-store --query --references /nix/store/h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3/
    ...(and 21 more)...

    Similarly, we could ask for the Edges pointing to this node using --referers, or we could ask for the full transitive closure of Nodes reachable from the starting Node using --requisites.

    The transitive closure is an important concept in Nix, but you don't really have to understand the graph theory: An Edge directed from a Node is logically a dependency: if a Node includes a reference to another Node, it depends on that Node. So, the transitive closure (--requisites) also includes those dependencies' dependencies, and so on recursively, to include the total set of things depended on by a given Node.

    For example, a Ruby application may depend on the result of bundling together all the rubygems specified in the Gemfile. That bundle may depend on the result of installing the Gem nokogiri, which may depend on libxml2 (which may depend on libc or libSystem). All of these things are present in the transitive closure of the application (--requisites), but only the gem bundle is a direct reference (--references).

    Now here's the key thing: This transitive closure of dependencies always exists, even outside of Nix: these things are always dependencies of your application, but normally, your computer is just trusted to have acceptable versions of acceptable libraries in acceptable places. Nix removes these assumptions and makes the whole graph explicit.

    To really drive home the "graphiness" of software dependencies, we can install Ruby via nix (nix-env -iA nixpkgs.ruby) and then build a graph of all of its dependencies:

    nix-store --query --graph $(which ruby) \
    | nix run nixpkgs.graphviz -c dot > ruby.svg

    Graphiness of Software Dependencies

    Graphiness of Software Dependencies


    The second building block is the Derivation. Above, I offhandedly mentioned that only Nix can write things into the Nix Store, but how does it know what to write? Derivations are the key.

    A Derivation is a special Node in the Nix store, which tells Nix how to build one or more other Nodes.

    If you list your /nix/store, you'll see a whole lot of items most likely, but some of them will end in .drv:


    This is a Derivation. It's a special format written and read by Nix, which gives build instructions for anything in the Nix store. Just about everything (except Derivations) in the Nix store is put there by building a Derivation.
    So what does a Derivation look like?

    $ cat /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
    Derive([("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo","","")],[],[],"x86_64-darwin","/bin/sh",["-c","echo 'hello world' > $out"],[("builder","/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23/bin/bash"),("name","demo"),("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo"),("system","x86_64-darwin")])

    That's not especially readable, but there's a couple of important concepts to communicate here:

    • Everything required to build this Derivation is explicitly listed in the file by path (you can see "bash" here, for example).
    • The hash component of the Derivation's path in the Nix Store is essentially a hash of the contents of the file.

    Since every direct dependency is mentioned in the contents, and the path is a hash of the contents, that means that if the dependencies and whatever other information the derivation contains don't change, the hash won't change, but if a different version of a dependency is used, the hash changes.

    There are a few different ways to build Derivations. Let's use nix-build:

    $ nix-build /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv

    This ran whatever the build instructions were and generated a new path in the Nix Store (a new Node in the Graph Database).

    Take a close look at the hash in the newly-created path. You'll see the same hash in the Derivation contents above. That output path was pre-defined, but not pre-generated. The output path is also a stable hash. You can essentially think of it as being a hash of the derivation and also the name of the output (in this case: "out"; the default output).

    So, if a dependency of the Derivation changes, that changes the hash of the Derivation. It also changes the hashes of all of that Derivation's outputs. This means that changing a dependency of a dependency of a dependency bubbles all the way down the tree, changing the hashes of every Derivation and all those Derivation's outputs that depend on the changed thing, directly or indirectly.

    Let's break apart that unreadable blob of Derivation content from above a little bit.

    • outputs: What nodes can this build?
    • inputDrvs: Other Derivations that must be built before this one
    • inputSrcs: Things already in the store on which this build depends
    • platform: Is this for macOS? Linux?
    • builder: What program gets run to do the build?
    • args: Arguments to pass to that program
    • env: Environment variables to set for that program

    Or, to dissect that Derivation:



    This Derivation has one output, named "out" (the default name), with some path that would be generated if we would build it.


    [ ]

    This is a simple toy derivation, with no inputDrvs. What this really means is that there are no dependencies, other than the builder. Normally, you would see something more like:


    This indicates a dependency on the OpenSSH Derivation's default output.


    [ ]

    Again, we have a very simple toy Derivation! Commonly, you will see:


    It's not really critical to the mental model, but Nix can also copy static files into the Nix Store in some limited ways, and these aren't really constructed by Derivations. This field just lists any static files in the Nix store on which this Derivation depends.



    Nix runs on multiple platforms and CPU architectures, and often the output of compilers will only work on one of these, so the derivation needs to indicate which architecture it's intended for.

    There's actually an important point here: Nix Store entries can be copied around between machines without concern, because all of their dependencies are explicit. The CPU details are a dependency in many cases.



    This program is executed with args and env, and is expected to generate the output(s).


    ["-c","echo 'hello world' > $out"]

    You can see that the output name ("out") is being used as a variable here. We're running, basically, bash -c "echo 'hello world' > $out". This should just be writing the text "hello world" into the Derivation output.



    Each of these is set as an Environment Variable before calling the builder, so you can see how we got that $out variable above, and note that it's the same as the path given in outputs above.

    Derivation in Summary

    So, if we build that Derivation, let's see what the output is:

    $ nix-build /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
    $ cat /nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo
    hello world

    As we expected, it's "hello world".

    A Derivation is a recipe to build some other path in the Nix Store.


    After walking through that Derivation in the last section, you may be starting to develop a feel for how explicitly-declared dependencies make it into the build, and how that Graph structure comes together—but what prevents builds from referring to things at undeclared paths, or things that aren't in the Nix store at all?

    Nix does a lot of work to make sure that builds can only see the Nodes in the Graph which their Derivation has declared, and also, that they don't access things outside of the store.

    A Derivation build simply cannot access anything not declared by the Derivation. This is enforced in a few ways:

    • For the most part, Nix uses patched versions of compilers and linkers that don't try to look in the default locations (/usr/lib, and so on).
    • Nix typically builds Derivations in an actual sandbox that denies access to everything that the build isn't supposed to access.

    A Sandbox is created for a Derivation build that gives filesystem read access to—and only to—the paths explicitly mentioned in the Derivation.

    What this amounts to is that artifacts in the Nix Store essentially can't depend on anything outside of the Nix Store.

    The Nix Language

    And finally, the block that brings it all together: the Nix Language.
    Nix has a custom language used to construct derivations. There's a lot we could talk about here, but there are two major aspects of the language's design to draw attention to. The Nix Language is:

    1. lazy-evaluated
    2. (almost) free of side-effects.

    I'll try to explain these by example.

    Lazy Evaluation

    Take a look at this code:

    data = {
      a = 1;
      b = functionThatTakesMinutesToRun 1;

    This is Nix code. You can probably figure out what's going on here: we're creating something like a hash table containing keys "a" and "b", and "b" is the result of calling an expensive function.

    In Nix, this code takes approximately no time to run, because the value of "b" isn't actually evaluated until it's needed. We could even:

      data = {
       a = 1;
       b = functionThatTakesMinutesToRun 1;
    in data.a

    Here, we're creating the table (technically called an Attribute Set in Nix), and extracting "a" from it.

    This evaluates to "1" almost instantly, without ever running the code that generates "b".

    Conspicuously absent in the code samples above is any sort of actual work getting done, other than just pushing data around within the Nix language. The reason for this is that the Nix language can’t actually do very much.

    Free of Side Effects (almost)

    The Nix language lacks a lot of features you will expect in normal programming languages. It has

    • No networking
    • No user input
    • No file writing
    • No output (except limited debug/tracing support).

    It doesn't actually do anything at all in terms of interacting with the world…well, except for when you call the derivation function.

    One Side Effect

    The Nix Language has precisely one function with a side effect. When you call derivation with the right set of arguments, Nix writes out a new <hash>-<name>.drv file into the Nix Store as a side effect of calling that function.

    For example:

    derivation {
      name = "demo";
      builder = "${bash}/bin/bash";
      args = [ "-c" "echo 'hello world' > $out" ];
      system = "x86_64-darwin";

    If you evaluate this in nix repl, it will print something like:

    «derivation /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv»

    That returned object is just the object you passed in (with name, builder, args, and system keys), but with a few extra fields (including drvPath, which is what got printed after the call to derivation), but importantly, that path in the Nix store was actually created.

    It's worth emphasizing again: This is basically the only thing that the Nix Language can actually do. There's a whole lot of pushing data and functions around in Nix code, but it all boils down to calls to derivation.

    Note that we referred to ${bash} in that Derivation. This is actually the Derivation from earlier in this article, and that variable substitution is actually how Derivations depend on each other. The variable bash refers to another call to derivation, which generates instructions to build bash when it's evaluated.

    The Nix Language doesn't ever actually build anything. It creates Derivations, and later, other Nix tools read those derivations and build the outputs. The Nix Language is just a Domain Specific Language for creating Derivations

    Nixpkgs: Derivation and Lazy Evaluation

    Nixpkgs is the global default package repository for Nix, but it's very unlike what you probably think of when you hear "package repository."

    Nixpkgs is a single Nix program. It makes use of the fact that the Nix Language is Lazy Evaluated, and includes many, many calls to derivation. The (simplified but) basic structure of Nixpkgs is something like:

      ruby = derivation { ... };
      python = derivation { ... };
      nodejs = derivation { ... };

    In order to build “ruby”, various tools just force Nix to evaluate the “ruby” attribute of that Attribute Set, which calls derivation, generating the Derivation for Ruby into the Nix Store and returning the path that was built. Then, the tool runs something like nix-build on that path to generate the output.

    Shipit! Presents: How Shopify Uses Nix

    Well, it takes a lot more words than I can write here—and probably some amount of hands-on experimentation—to let you really, viscerally, feel the paradigm shift that Nix enables, but hopefully I’ve given you a taste.

    If you’re looking for more Nix content, I’m currently re-releasing a series of screencasts I recorded for developers at Shopify to the public. Check out Nixology on YouTube.

    You can also join me for a discussion about how Shopify is using Nix to rebuild our developer tooling. I’ll cover some of this content again, and show off some of the tooling we actually use on a day-to-day basis.

    What: ShipIt! Presents: How Shopify Uses Nix

    Date: May 25, 2020 at 1:00 pm EST

    Please view the recording at

    If you want to work on Nix, come join my team! We're always hiring, so visit our Engineering career page to find out about our open positions. 

    Continue reading

    A Brief History of TLS Certificates at Shopify

    A Brief History of TLS Certificates at Shopify

    Transport Layer Security (TLS) encryption may be commonplace in 2020, but this wasn’t always the case. Back in 2014, our business owner storefront traffic wasn’t encrypted. We manually provisioned the few TLS certificates that were in production. In this post, we’ll cover Shopify’s journey from manually provisioning TLS certificates to the fully automated system that supports over 1M business owners today.

    In the Beginning

    Up to 2014, only business owner shop administration and checkout traffic were encrypted. All checkouts were on the domain. Secured shop administration functions used the * certificate and a single-domain certificate for Our Operations team renewed the certificates manually as needed. During this time, teams began research on what it would take for us to offer TLS encryption for all business owners in an automated fashion.

    Shopify Plus

    We launched Shopify Plus in early 2014. One of Plus’s earliest features was TLS encrypted storefronts. We manually provisioned certificates, adding new domains to the Subject Alternative Name (SAN) list as required. As our certificate authority placed a limit on the number of domains per certificate, certificates were added to support the new domains being onboarded. At the time, Internet Explorer on Windows XP was still used by a significant number of users, which prevented our use of the Server Name Indication (SNI) extension.

    While this addressed our immediate needs, there were several drawbacks:

    • Manual certificate updates and provisioning were labor-intensive and needed to be handled with care.
    • Additional IP addresses were needed to support new certificates.
    • Having domains for non-related shops in a single certificate wasn’t ideal.

    The pace of onboarding was manageable at first. As we onboarded more merchants, it was apparent that this process wasn’t sustainable. At this point, there were dozens of certificates that all had to be manually provisioned and renewed. For each Plus account onboarded, the new domains had to be manually added. This was labor-intensive and error-prone. We worked on a fully automated system during Shopify’s Hack Days, and it became a fully staffed project in May 2015.

    Shopify’s Notary System

    Automating TLS certificates had to address multiple facets of the process including

    • How are the certificates provisioned from the certificate authority?
    • How to serve the certificates at scale?
    • What other considerations are there for offering encrypted storefronts?

    Shopify's Notary System
    Shopify's Notary System

    Provisioning Certificates

    Our Notary system provisions certificates. When a business owner adds a domain to their shop, the system receives a request for a certificate to be provisioned. The certificate provisioning is fully automated via Application Programming Interface (API) calls to the certificate authority. This includes the order request, domain ownership verification, and certificate/private key pair delivery. Certificate renewals are performed automatically in the same fashion.

    While it makes sense that we group domains from a shop to one certificate, the system handles all domains separately for simplicity. Each certificate has one domain with a unique private key. The certificate and private key are stored in a relational database. This relational database is accessible by the load balancers for terminating TLS connections.

    Scaling Up Certificate Provisioning

    At the time, we hosted our nginx load balancers at our datacenters. Storing the TLS certificates on disk and reloading nginx when certificates changed wasn’t feasible. In a past article, we talked about our use of nginx and OpenResty Lua modules. Using OpenResty allowed us to script nginx to serve dynamic content outside of the nginx configuration. In addition, browser support for the TLS SNI extension was almost universal. By leveraging the TLS SNI extension, we dynamically load TLS certificates from our database in a Lua middleware via the ssl_certificate_by_lua module. Certificates and private keys are directly accessible from the relational database via a single SQL query. An in-memory Least Recently Used (LRU) cache reduced the latency of TLS handshakes for frequently accessed domains.

    Solving Mixed Content Warnings

    With TLS certificates in place for business owner shop domains, we could offer encrypted storefronts for all shops. However, there was still a significant hurdle to overcome. Each shop’s theme could have images or assets referencing non-encrypted Uniform Resource Locators (URLs). Mixing of encrypted and unencrypted content would cause the browser to display a Mixed Content warning, denoting that some resources on the page are not encrypted. To resolve this problem, we had to process all the shop themes to replace references to HTTP with HTTPS.

    With all the infrastructure in place, we realized the goal of supporting encrypted storefronts for all merchants in February 2016. The same system is still in place and has scaled to provide TLS certificates for all of our 1M+ merchants.

    Let’s Encrypt!

    Let’s Encrypt is a non-profit certificate authority that provides TLS certificates at no charge. Shopify has been and is currently a sponsor. The service launched in April 2016, shortly after our Notary went into production. With the exception of Extended Verification (EV) certificates and other special cases, we’ve migrated away from our paid certificate authority in favor of Let’s Encrypt.

    Move to the Cloud

    In June 2019, our network edge moved from our datacenter to a cloud provider. The number of TLS certificates in our requirements needing support drastically reduced the viable vendor list. Once the cloud provider was selected, our TLS provisioning system had to be adapted to work with their system. There were two paths forward, using the cloud provider’s managed certificates or continuing to provision Let’s Encrypt certificates and upload them. The initial migration leveraged the provider’s certificate provisioning.

    Using managed certificates from the cloud provider has the advantage of being maintenance-free after they’ve been provisioned. There are no storage concerns for certificates and private keys. In addition, certificates are automatically renewed by the vendor. Administrative work was required during the migration to guide merchants to modify their domain’s Certification Authority Authorization (CAA) Domain Name System (DNS) records as needed. Backfilling the certificates for our 1M+ merchants took several weeks to complete.

    After the initial successful migration to our cloud provider, we revisited the certificate provisioning strategy. As we maintain an alternate edge network for contingency, the Notary infrastructure is still in place to provide certificates for that infrastructure. The intent of using provider managed certificates is for it to be a stepping stone for deprecating Notary in the future. While the cloud provider-provisioned certificates worked well for us, there are now two sets of certificates to keep synchronized. To simplify certificate state and operation load, we now use the Notary provisioned certificates for both edge networks. Instead of provisioning certificates on our cloud provider, certificates from Notary are uploaded as new ones are required.

    Outside of our business owner shop storefronts, we rely on nginx for other services that are part of our cloud infrastructure. Some of our Lua middleware, including the dynamic TLS certificate loading code, was contributed to the ingress-nginx Kubernetes project.

    Our TLS certificate journey took us from a handful of manually provisioned certificates to a fully automated system that can scale up to support over 1M merchants. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Learn about the actions we’re taking as we continue to hire during COVID‑19

    Continue reading

    Dev Degree: Behind the Scenes

    Dev Degree: Behind the Scenes

    On April 24th, 2020, we proudly celebrated the graduation of our first Dev Degree program cohort. This milestone holds a special place in Shopify history because it’s a day born out of trial and error, experimentation, iteration, and hustle. The 2016 cohort had the honor and challenge of being our first class, lived through the churn and pivots of a newly designed program, and completed their education during a worldwide pandemic, thrust into remote learning and work. The students’ success is a testament to their dedication, adaptability, and grit. It’s also the product of a thoughtfully designed program and a high-functioning Dev Degree team.

    What does it take to create an environment where students can thrive and develop into work-ready employees in four years?

    We’ve achieved this mission with the Dev Degree program. The key to our success is our learning structure and multidisciplinary team. With our model, students master development skills faster than traditional methods.

    The Dev Degree Program Structure

    When we set out to shake-up software education in 2016, we had no prescriptive blueprint to guide us and no tried-and-true best practices. Still, we embraced the opportunity to forge a new path in partnership with trusted university advisors and experienced internal educators at Shopify.

    Our vision was to create an alternative to the traditional co-op model that alternates between university studies and work placements. In Dev Degree, students receive four years of continual hands-on developer experience at Shopify, through skills training and team placements, in parallel to attending university classes. This model accelerates understanding and allows students to apply classroom theory to real-life product development across a breadth of technology.

    Dev Degree Timeline - Year 1: Developer skill training at Shopify. Year 2, 3, 4: New development team and mentor every 8/12 months for 3 years
    Dev Degree Timeline

    While computer science and technology are at the core of our learning model, what elevates the program is the focus on personal growth, actionable feedback loops, and the opportunity to make an impact on the company, coworkers, and merchants.

    University Course Curriculum

    The Dev Degree program leads to an accredited Computer Science degree, which is a deciding factor for many students and their families exploring post-secondary education opportunities. All required core theoretical concepts, computer science courses, math courses, and electives are defined by and taught at the universities. Students take three university courses per semester while working 25 hours per week at Shopify throughout the four-year program. All formal assessments, grading, and final exams for university courses are carried out by the universities.

    Dev Degree Program - 20 Hrs/week at Carleton or York and 25hrs/week at Shopify
    Dev Degree Structure

    While the universities set the requirements for the courses, we work collaboratively to define the course sequencing to ensure the students are exposed to computer science content as early as possible in their program before they start work on team placements at Shopify.

    In addition to the core university courses, there are internship courses that teach software development concepts applicable to the technology industry. The universities assess the learning outcomes of the internship courses through practicum reports and meetings with university supervisors or advisors.

    The courses and concepts taught at Shopify build on the university courses and teach students hands-on development skills, communication skills, developer tools training, and how to contribute to a real-world product development team effectively.

    Developer Skills Training: Building a Strong Foundation

    One of the lessons we learned early in the program was that students need a solid foundation of developer skills before being placed on development teams to feel confident and ready to contribute. The first year at Shopify sets the Dev Degree program apart from other work-integrated learning programs because we immerse the students in our Shopify-led developer skills training.

    In the first year, we introduce the students to skills and tools that form the foundation for how they work at Shopify and other companies. They need to develop certain skills before moving into teams, such as working with code repositories and committing code, using a command line, front-end development, working with data, and more.

    The breadth of technical skills that students learn in their first year in Dev Degree goes beyond the traditional university curriculum. This foundation allows students to confidently join their first placement team and have an immediate impact.

    We teach this way on purpose. Universities often chose a bottom-up learning model by front-loading theory and concepts. We designed our program to immerse students somewhere in the middle of top-down and bottom-up, allowing them to gradually discover the fundamentals after they develop base skills and code a bit every day.

    Due to the ever-evolving nature of software development, we update the developer skills training path often. Our current courses include the following technologies:

    • Command Line Interface (CLI)
    • Git & GitHub
    • Ruby
    • HTML, CSS, JavaScript
    • Databases
    • Ruby on Rails
    • React
    • TypeScript
    • GraphQL

    Team Placements: Working on Merchant-Facing Projects

    After they’ve completed their developer skills training courses, students spend the next three years on team placements. This is a big deal. On team placements, students get to apply what they learn in their developer skills training at Shopify and from their university courses to meaningful, real-world software development work. Our placements are purposefully designed to expose students to a wide range of disciplines and teams to make them well-rounded developers, give them new perspectives, and introduce them to new people.

    Working with their placement specialist, students interview with teams from back-end, front-end, data, security, and production engineering disciplines.

    Over the course of the Dev Degree program, each student receives:

    • One 12-month team placement
    • Three 8-month team placements
    • Four different technical mentors
    • A dedicated Life@Shopify mentor
    • Twenty hours per week working on a Shopify team
    • Five hours per week of personal development
    • Actionable feedback on a regular cadence
    • Evaluations from mentors and leads
    • High-touch support from the Dev Degree team
    • Access to new people and workplace culture

    By the time students complete the program, they’ve been on four back-to-back team placements in their final three years at Shopify. This experience makes them a valuable asset to their future company. It also allows students to launch their careers confident that they are well-prepared to make a positive contribution.

    It Takes a Village: Building an Impactful Program Team

    Creating a successful work-integrated learning program requires a significant commitment of time and resources from a team that spans multiple disciplines and functions. While the Dev Degree program team is responsible for the bulk of the heavy-lifting, including logistics, mentorship, and support, the program doesn’t happen without expertise and time from other Shopify subject matter experts and university stakeholders.

    Dev Degree Program Team

    The Dev Degree team are the most actively involved in all aspects of the program and with the students from onboarding to graduation. They are responsible for ensuring that the program meets the needs of the students, the university, and Shopify.

    Student Success Specialists

    Many of the Dev Degree students come to Shopify straight from high school, which can be daunting. In traditional co-op programs, students have a couple of years of university experience before starting their internships and being dropped into a professional workplace setting. To ease the transition to Shopify, our student success specialists are responsible for supporting students’ well-being, connecting them with other mentors, helping them learn how to become effective communicators, and being the voice of the students at Shopify. This nurturing environment helps protect first-year students from being overwhelmed and underprepared for team placements.

    Placement Specialists

    Team placements are an integral part of the applied learning in the Dev Degree program. Placement specialists are responsible for coordinating and overseeing all placements for each cohort. This high-touch role requires extensive relationship-building with development teams and a deep understanding of the goals and interests of the students to ensure appropriate compatibility. To ensure that development teams get the return on investment (ROI) from investing in mentorship, placement specialists place students on teams where they can be impactful. They also support the leads and mentors on the development teams and play an active role in advocating for an influential culture of mentorship within Shopify.


    The courses we teach at Shopify are foundational for students to prepare them for their team placements. Dev Degree educators have an extensive background in education and computer science and are responsible for building out the curriculum for all the developer skills training courses taught at Shopify. They design and deliver the course material and evaluate the students on technical expertise and subject knowledge. The instructors create courses for a wide range of development skills relevant to development at Shopify and other companies.

    Recruitment Team

    As with all recruitment at Shopify, we aim to recruit a diverse mix of students to the Dev Degree program. Our talent team is actively involved in helping us create a recruitment strategy to engage and attract top talent from a variety of schools and programs, including university meetups and mentorship in youth programs like Technovation.


    After four years of having Dev Degree students on teams across twelve disciplines, the program is woven into the Shopify culture, and mentorship plays a big role.

    Development Team Mentors

    Development team mentors are critical to helping students build confidence, technical skills, and gain the experience needed to become an asset to the team. Mentors are responsible for guiding, evaluating, and providing actionable feedback to students throughout their 8-month placements. Mentorship requires a strong commitment and takes up about 10% of a developer’s time. We feel it’s worth the investment to build mentors and invest in a culture of mentorship. It’s a challenging but rewarding role, and especially helpful to developers looking to grow leadership skills and level up in their roles.

    Life@Shopify Mentors

    In addition to placement mentors, we also have experienced Shopify employees who volunteer to mentor students as they navigate through the program, their placements, and the company on the whole. These Life@Shopify mentors act as a trusted guide and help round out the mentorship experience at Shopify.

    University Stakeholders

    Close relationships between the universities and Shopify help integrate the theory and development practices and deepen both the understanding of concepts and work experience. We’re fortunate to have both Carleton and York University as part of the Dev Degree program and fully engaged in the model that we’ve built. The faculty advisors play an active role in working with the students to guide them on their course selections, navigate the program, and evaluate their internship courses and practicums. Without university buy-in and support, a program like Dev Degree doesn’t happen.

    Dev Degree is Worth the Investment

    Building a new work-integrated learning program requires a big commitment of company time, resources, and cost, but we are reaping the benefits of our gamble.

    • Graduates are well-rounded developers with a rich development experience across a range of teams and disciplines.
    • 100% of graduates have accepted full-time positions within six months of graduation.
    • Students who accept positions at Shopify have already built four years of relationships and have acquired vast knowledge and skills that will help them make an immediate impact on their teams.
    • We are building future leaders through mentorship. 

    While we are excited about how far we’ve come, we still have room to grow. We are looking at metrics and data to help us quantify the success of the program and to drive program improvements to take computer science education to a new level. When we started this ambitious endeavor, we wanted to mature it to a point where we could create a blueprint of the Dev Degree program for other companies and universities to adopt it and evolve it. There’s interest in what we’re doing here. It’s just a matter of time before we help make the Dev Degree model of computer science education the norm rather than the exception.

    The Future of Dev Degree

    In 2020, Shopify made the shift to become a fully remote company, opening the doors to potential software engineers from almost anywhere in the world. However, while Dev Degree students have been leveraging remote work for their Shopify-led developer skills training and team placements, they still need to live in the same city as their university campus for their in-person learning. So, starting in September 2021, we’re growing the Dev Degree program to welcome fully remote students from across North America in collaboration with our new educational partner, Make School. Our new remote 3-year program with Make School will provide students with the same hands-on learning and work experience at Shopify that our current students enjoy. Instead of attending in-person classes, the Make School students will attend classes remotely with Make School to attain a Bachelor's Degree in Applied Computer Science.

    In September of 2021, we are launching a small pilot program with Make School before opening the program to general admission for September 2022. The main differences for students enrolled in Make School are that they will earn a 3-year Applied Computer Science degree instead of a 4-year Bachelor of Science/Computer Science and will participate in three work placements instead of four. Students will have the same Developer Skills training, mentorships, and team support as those in the 4-year Dev Degree program, as support and mentorship are keys to student success. Stay tuned for more information later this summer on applying for Dev Degree programs for September 2022.

    Additional Information

    We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

    Continue reading

    How to Fix Slow Code in Ruby

    How to Fix Slow Code in Ruby

    By Jay Lim and Gannon McGibbon

    At Shopify, we believe in highly aligned, loosely coupled teams to help us move fast. Since we have many teams working independently on a large monolithic Rails application, inefficiencies in code are sometimes inadvertently added to our codebase. Over time, these problems can add up to serious performance regressions.

    By the time such performance regressions are noticeable, it might already be too late to track offending commits down. This can be exceedingly challenging on codebases with thousands of changes being committed each day. How do we effectively find out why our application is slow? Even if we have a fix for the slow code, how can we prove that our new code is faster?

    It all starts with profiling and benchmarking. Last year, we wrote about writing fast code in Ruby on Rails. Knowing how to write fast code is useful, but insufficient without knowing how to fix slow code. Let’s talk about the approaches that we can use to find slow code, fix it, and prove that our new solution is faster. We’ll also explore some case studies that feature real world examples on using profiling and benchmarking.


    Before we dive into fixing unperformant code, we need to find it first. Identifying code that causes performance bottlenecks can be challenging in a large codebase. Profiling helps us to do so easily.

    What is Profiling?

    Profiling is a type of program analysis that collects metrics about the program at runtime, such as the frequency and duration of method calls. It’s carried out using a tool known as a profiler, and a profiler’s output can be visualized in various ways. For example, flat profiles, call graphs, and flamegraphs.

    Why Should I Profile My Code?

    Some issues are challenging to detect by just looking at the code (static analysis, code reviews, etc.). One of the main goals of profiling is observability. By knowing what is going on under the hood during runtime, we gain a better understanding of what the program is doing and reason about why an application is slow. Profiling helps us to narrow down the scope of a performance bottleneck to a particular area.

    How Do I Profile?

    Before we figure out what to profile, we need to first figure out what we want to know: do we want to measure elapsed time for a specific code block, or do we want to measure object allocations in that code block? In terms of granularity, do we need elapsed time for every single method call in that code block, or do we just need the aggregated value? Elapsed time here can be further broken down into CPU time or wall time.

    For measuring elapsed time, a simple solution is to measure the start time and the end time of a particular code block, and report the difference. If we need a higher granularity, we do this for every single method. To do this, we use the TracePoint API in Ruby to hook into every single method call made by Ruby. Similarly, for object allocations, we use the ObjectSpace module to trace object allocations, or even dump the Ruby heap to observe its contents.

    However, instead of building custom profiling solutions, we can use one of the available profilers out there, and each has its own advantages and disadvantages. Here are a few options:

    1. rbspy

    rbspy samples stack frames from a Ruby process over time. The main advantage is that it can be used as a standalone program without needing any instrumentation code.

    Once we know the Ruby Process Identifier (PID) that we want to profile, we start the profiling session like this:

    rbspy record —pid $PID

    2. stackprof

    Like rbspy, stackprof samples stack frames over time, but from a block of instrumented Ruby code. Stackprof is used as a profiling solution for custom code blocks:

    profile = :cpu) do
      # Code to profile

    3. rack-mini-profiler

    The rack-mini-profiler gem is a fully-featured profiling solution for Rack-based applications. Unlike the other profilers described in this section, it includes a memory profiler in addition to call-stack sampling. The memory profiler collects data such as Garbage Collection (GC) statistics, number of allocations, etc. Under the hood, it uses the stackprof and memory_profiler gems.

    4. app_profiler

    app_profiler is a lightweight alternative to rack-mini-profiler. It contains a Rack-only middleware that supports call-stack profiling for web requests. In addition to that, block level profiling is also available to any Ruby application. These profiles can be stored in a configurable storage backend such as Google Cloud Storage, and can be visualized through a configurable viewer such as Speedscope, a browser-based flamegraph viewer.

    At Shopify, we collect performance profiles in our production environments. Rack Mini Profiler is a great gem, but it comes with a lot of extra features such as database and memory profiling, and it seemed too heavy for our use case. As a result, we built App Profiler that similarly uses Stackprof under the hood. Currently, this gem is used to support our on-demand remote profiling infrastructure for production requests.

    Case Study: Using App Profiler on Shopify

    An example of a performance problem that was identified in production was related to unnecessary GC cycles. Last year, we noticed that a cart item with a very large quantity used a ridiculous amount of CPU time and resulted in slow requests. It turns out, the issue was related to Ruby allocating too many objects, triggering the GC multiple times.

    The figure below illustrates a section of the flamegraph for a similar slow request, and the section corresponds to approximately 500ms of CPU time.

    A section of the flamegraph for a similar slow request

    A section of the flamegraph for a similar slow request

    The highlighted chunks correspond to the GC operations, and they interleave with the regular operations. From this section, we see that GC itself consumed about 35% of CPU time, which is a lot! We inferred that we were allocating too many Ruby objects. Without profiling, it’s difficult to identify these kinds of issues quickly.


    Now that we know how to identify performance problems, how do we fix them? While the right solution is largely context sensitive, validating the fix isn’t. Benchmarking helps us prove performance differences in two or more different code paths.

    What is Benchmarking?

    Benchmarking is a way of measuring the performance of code. Often, it’s used to compare two or more similar code paths to see which code path is the fastest. Here’s what a simple ruby benchmark looks like:

    This code snippet is benchmarking at its simplest. We’re measuring how long a method takes to run in seconds. We could extend the example to measure a series of methods, a complex math equation, or anything else that fits into a block. This kind of instrumentation is useful because it can unveil regression or improvement in speed over time.

    While wall time is a pretty reliable measurement of “performance”, there’s other methods one can measure code by besides realtime, the Ruby standard library’s Benchmark module includes bm and bmbm.

    The bm method shows a more detailed breakdown of timing measurements. Let’s take a look at a script with some output:

    User, system, and total are all different measurements of CPU time. User refers to time spent working in user space. Similarly, system denotes time spent working in kernel space. Total is the sum of CPU timings, and real is the same wall time measurement we saw from Benchmark.realtime.

    What about bmbm? Well, it is exactly the same as bm with one unique difference. Here’s what the output looks like:

    The rehearsal, or warmup step is what makes bmbm useful. It runs benchmark code blocks once before measuring to prime any caching or similar mechanism to produce more stable, reproducible results.

    Lastly, let’s talk about the benchmark-ips gem. This is the most common method of benchmarking Ruby code. You’ll see it a lot in the wild, this is what a simple script looks like:

    Here, we’re benchmarking the same method using familiar syntax with ips method. Notice the inline bundler and gemfile code. We need this in a scripting context because benchmark-ips isn’t part of the standard library. In a normal project setup, we add gem entries to the Gemfile as usual.

    The output of this script is as follows:

    Ignoring the bundler output, we see the warmup iteration score per 100 milliseconds ran for the default of 2 seconds, and how many times the code block was able to run in 5 seconds. It’ll become more apparent why benchmark-ips is so popular later.

    Why Should I Benchmark My Code?

    So, now we know what benchmarking is and some tools available to us. But why even bother benchmarking at all? It may not be immediately obvious why benchmarking is so valuable.

    Benchmarks are used to quantify the performance of one or more blocks of code. This becomes very useful when there are performance questions that need answers. Often, these questions boil down to “which is faster, A or B?”. Let’s look at an example:

    In this script, we’re doing most of what we did in the first benchmark-ips example. Pay attention to the addition of another method, and how it changes the benchmark block. When benchmarking more than one thing at once, simply add another report block. Additionally, the compare! method prints a comparison of all reports:

    Wow, that’s pretty snazzy! compare! is able to tell us which benchmark is slower and by how much. Given the amount of thread sleeping we’re doing in our benchmark subject methods, this aligns with our expectations.

    Benchmarking can be a means of proving how fast a given code path is. It’s not uncommon for developers to propose a code change that makes a code path faster without any evidence.

    Depending on the change, comparison can be challenging. As in the previous example, benchmark-ips may be used to benchmark individual code paths. Running the same single report benchmark on versions of code easily tests pre and post patch performance.

    How Do I Benchmark My Code?

    Now we know what benchmarking is and why it is important. Great! But how do you get started benchmarking in an application? Trivial examples are easy to learn from but aren’t very relatable.

    When developing in a framework like Ruby on Rails, it can be difficult to understand how to set up and load framework code for benchmark scripts. Thankfully, one of the latest features of Ruby on Rails can generate benchmarks automatically. Let’s take a look:

    This benchmark can be generated by running bin/rails generate benchmark my_benchmark, placing a file in script/benchmarks/my_benchmark.rb. Note the inline gemfile isn’t required because we piggyback off of the Rails app’s Gemfile. The benchmark generator is slated for release in Rails 6.1.

    Now, let’s look at a real world example of a Rails benchmark:

    In this example, we’re subclassing Order and caching the calculation it does to find the total price of all line items. While it may seem obvious that this would be a beneficial code change, it isn’t obvious how much faster it is compared to the base implementation. Here’s a more unabridged version of the script for full context.

    Running the script reveals a ~50x improvement to a simple order of 4 line items. With orders with more line items, the payoff only gets better.

    One last thing to know about benchmarking effectively is being aware of micro-optimization. These are optimizations that are so small, the performance improvement isn’t worth the code change. While these are sometimes acceptable for hot code paths, it’s best to tackle larger scale performance issues first.

    Case Study: Rails Contributions

    As with many open source projects, Ruby on Rails usually requires performance optimization pull requests to include benchmarks. The same is common for new features to performance sensitive areas like Active Record query building or Active Support’s cache stores. In the case of Rails, most benchmarks are made with benchmark-ips to simplify comparison.

    For example, changes how primary keys are accessed in Active Record instances. Specifically, refactoring class method calls to instance variable references. It includes before and after benchmark results with a clear explanation of why the change is necessary. changes model attribute assignment in Active Record so that key stringification of attribute hashes is no longer needed. A benchmark script with multiple scenarios is provided with results. This is a particularly hot codepath because creating and updating records is at the heart of most Rails apps.

    Another example, reduces object allocations in ActiveRecord#respond_to?. It provides a memory benchmark that compares total allocations before and after the patch, with a calculated diff. Reducing allocations delivers better performance because the less Ruby allocates, the less time Ruby spends assigning objects to blocks of memory.

    Final Thoughts

    Slow code is an inevitable facet of any codebase. It isn’t important who introduces performance regressions, but how they are fixed. As developers, it’s our job to leverage profiling and benchmarking to find and fix performance problems.

    At Shopify, we’ve written a lot of slow code, often for good reasons. Ruby itself is optimized for the developer, not the servers we run it on. As Rubyists, we write idiomatic, maintainable code that isn’t always performant, so profile and benchmark responsibly, and be wary of micro-optimizations!

    Additional Information

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions. Learn about the actions we’re taking as we continue to hire during COVID‑19




    Continue reading

    Categorizing Products at Scale

    Categorizing Products at Scale

    By: Jeet Mehta and Kathy Ge

    With over 1M business owners now on Shopify, there are billions of products being created and sold across the platform. Just like those business owners, the products that they sell are extremely diverse! Even when selling similar products, they tend to describe products very differently. One may describe their sock product as a “woolen long sock,” whereas another may have a similar sock product described as a “blue striped long sock.”

    How can we identify similar products, and why is that even useful?

    Applications of Product Categorization

    Business owners come to our platform for its multiple sales/marketing channels, app and partner ecosystem, brick and mortar support, and so much more. By understanding the types of products they sell, we provide personalized insights to help them capitalize on valuable business opportunities. For example, when business owners try to sell on other channels like Facebook Marketplace, we can leverage our product categorization engine to pre-fill category related information and save them time.

    In this blog post, we’re going to step through how we implemented a model to categorize all our products at Shopify, and in doing so, enabled cross-platform teams to deliver personalized insights to business owners. The system is used by 20+ teams across Shopify to power features like marketing recommendations for business owners (imagine: “t-shirts are trending, you should run an ad for your apparel products”), identification of brick-and-mortar stores for Shopify POS, market segmentation, and much more! We’ll also walk through problems, challenges, and technical tradeoffs made along the way.

    Why is Categorizing Products a Hard Problem?

    To start off, how do we even come up with a set of categories that represents all the products in the commerce space? Business owners are constantly coming up with new, creative ideas for products to sell! Luckily, Google has defined their own hierarchical Google Product Taxonomy (GPT) which we leveraged in our problem.

    The particular task of classifying over a large-scale hierarchical taxonomy presented two unique challenges:

    1. Scale: The GPT has over 5000 categories and is hierarchical. Binary classification or multi-class classification can be handled well with most simple classifiers. However, these approaches don’t scale well as the number of classes increases to the hundreds or thousands. We also have well over a billion products at Shopify and growing!
    2. Structure: Common classification tasks don’t share structure between classes (i.e. distinguishing between a dog and a cat is a flat classification problem). In this case, there’s an inherent tree-like hierarchy which adds a significant amount of complexity when classifying.

    Sample visualization of the GPT

    Sample visualization of the GPT

    Representing our Products: Featurization 👕

    With all machine learning problems, the first step is featurization, the process of transforming the available data into a machine-understandable format.

    Before we begin, it’s worth answering the question: What attributes (or features) distinguish one product from another? Another way to think about this is if you, the human, were given the task of classifying products into a predefined set of categories: what would you want to look at?

    Some attributes that likely come to mind are

    • Product title
    • Product image
    • Product description
    • Product tags.

    These are the same attributes that a machine learning model would need access to in order to perform classification successfully. With most problems of this nature though, it’s best to follow Occam’s Razor when determining viable solutions.

    Among competing hypotheses, the one with the fewest assumptions should be selected.

    In simpler language, Occam’s razor essentially states that the simplest solution or explanation is preferable to ones that are more complex. Based on the computational complexities that come with processing and featurizing images, we decided to err on the simpler side and stick with text-based data. Thus, our classification task included features like

    • Product title
    • Product description
    • Product collection
    • Product tags
    • Product vendor
    • Merchant-provided product type.

    There are a variety of ways to vectorize text features like the above, including TF-IDF, Word2Vec, GloVe, etc. Optimizing for simplicity, we chose a simple term-frequency hashing featurizer using PySpark that works as follows:

    HashingTF toy example

     HashingTF toy example

    Given the vast size of our data (and the resulting size of the vocabulary), advanced featurization methods like Word2Vec didn’t scale since they involved storing an in-memory vocabulary. In contrast, the HashingTF provided fixed-length numeric features which scaled to any vocabulary size. So although we’re potentially missing out on better semantic representations, the upside of being able to leverage all our training data significantly outweighed the downsides.

    Before performing the numeric featurization via HashingTF, we also performed a series of standard text pre-processing steps, such as:

    • Removing stop words (i.e. “the”, “a”, etc.), special characters, HTML, and URLs to reduce vocabulary size
    • Performing tokenization: splitting a string into an array of individual words or “tokens”.

    The Model 📖

    With our data featurized, we can now move towards modelling. Ensuring that we maintain a simple, interpretable, solution while tackling the earlier mentioned challenges of scale and structure was difficult.

    Learning Product Categories

    Fortunately, during the process of solution discovery, we came across a method known as Kesler’s Construction [PDF]. This is a mathematical maneuver that enables the conversion of n one-vs-all classifiers into a single binary classifier. As shown in the figure below, this is achieved by exploding the training data with respect to the labels, and manipulating feature vectors with target labels to turn a multi-class training dataset into a binary training dataset.

    Figure 3: Kesler’s Construction formulation

    Kesler’s Construction formulation

    Applying this formulation to our problem implied pre-pending the target class to each token (word) in a given feature vector. This is repeated for each class in the output space, per feature vector. The pseudo-code below illustrates the process, and also showcases how the algorithm leads to a larger, binary-classification training dataset.

    1. Create a new empty dataset called modified_training_data
    2. For each feature_vector in the original_training_data:
      1. For each class in the taxonomy:
        1. Prepend the class to each token in the feature_vector, called modified_feature_vector
        2. If the feature_vector is an example of the class, append (modified_feature_vector, 1) to modified_training_data
      2. If the feature vector is not an example of the class, append (modified_feature_vector, 0) to modified_training_data
    3. Return modified_training_data

    Note: In the algorithm above, a vector can be an example of a class if its ground truth category belongs to a class that’s a descendant of the category being compared to. For example, a feature vector that has the label Clothing would be an example of the Apparel & Accessories class, and as a result would be assigned a binary label of 1. Meanwhile, a feature vector that has the label Cell Phones would not be an example of the Apparel & Accessories class, and as a result would be assigned a binary label of 0.

    Combining the above process with a simple Logistic Regression classifier allowed us to:

    • Solve the problem of scale - Kesler’s construction allowed a single model to scale to n classes (in this case, n was into the thousands)
    • Leverage taxonomy structure - By embedding target classes into feature vectors, we’re also able to leverage the structure of the taxonomy and allow information from parent categories to permeate into features for child categories. 
    • Reduce computational resource usage - Training a single model as opposed to n individual classifiers (albeit on a larger training data-set) ensured a lower computational load/cost.
    • Maintain simplicity - Logistic Regression is one of the most simple classification methods available. It’s coefficients allow interpretability, and reduced friction with hyperparameter tuning.

    Inference and Predictions 🔮

    Great, we now have a trained model, how do we then make predictions to all products on Shopify? Here’s an example to illustrate. Say we have a sample product, a pair of socks, below:

    Figure 4: sample product entry for a pair of socks

    Sample product entry for a pair of socks

    We aggregate all of its text (title, description, tags, etc.) and clean it up using the Kesler’s Construction formulation resulting in the string:

    “Check out these socks”

    We take this sock product and compare it to all categories in the available taxonomy we trained on. To avoid computations on categories that will likely be low in relevance, we leverage the taxonomy structure and use a greedy approach in traversing the taxonomy.

    Figure 5: Sample traversal of taxonomy at inference time

    Sample traversal of taxonomy at inference time

    For each product, we prepend a target class to each token of the feature vector, and do so for every category in the taxonomy. We score the product against each root level category by multiplying this prepended feature vector against the trained model coefficients. We start at the root level and keep track of the category with the highest score. We then score the product against the children of the category with the highest score. We continue in this fashion until we’ve reached a leaf node. We output the full path from root to leaf node as a prediction for our sock product.

    Evaluation Metrics & Performance ✅

    The model is built. How do we know if it’s any good? Luckily, the machine learning community has an established set of standards around evaluation metrics for models, and there are good practices around which metrics make the most sense for a given type of task.

    However, the uniqueness of hierarchical classification adds a twist to these best practices. For example, commonly used evaluation metrics for classification problems include accuracy, precision, recall, and F1 Score. These metrics work great for flat binary or multi-class problems, but there are several edge cases that show up when there’s a hierarchy of classes involved.

    Let’s take a look at an illustrating example. Suppose for a given product, our model predicts the following categorization: Apparel & Accessories > Clothing > Shirts & Tops. There’s a few cases that can occur, based on what the product actually is:

    Product is a shirt - Model example

    Product is a shirt - Model example

    1. Product is a Shirt: In this case, we’re correct! Everything is perfect.

    Figure 7. Product is a dress - Model example

    Product is a dress - Model example

    2. Product is a Dress: Clearly, our model is wrong here. But how wrong is it? It still correctly recognized that the item is a piece of apparel and is clothing

    Figure 8. Product is a watch - Model example

    Product is a watch - Model example

    3. Product is a Watch: Again, the model is wrong here. It’s more wrong than the above answer, since it believes the product to be an accessory rather than apparel.

    Figure 9. Product is a phone - Model example

    Product is a phone - Model example

    4. Product is a Phone: In this instance, the model is the most incorrect, since the categorization is completely outside the realm of Apparel & Accessories.

    The flat metrics discussed above would punish each of the above predictions equally, when it’s clear that this isn’t the case. To rectify this, we leveraged work done by Costa et al. on hierarchical evaluation measures [PDF] which use the structure of the taxonomy (output space) to punish incorrect predictions accordingly. This includes:

    • Hierarchical accuracy
    • Hierarchical precision
    • Hierarchical recall
    • Hierarchical F1

    As shown below, the calculation of the metrics largely remains the same as their original flat form. The difference is that these metrics are regulated by the distance to the nearest common ancestor. In the examples provided, Dresses and Shirts & Tops are only a single level away from having a common ancestor (Clothing). In contrast, Phones and Shirts & Tops are in completely different sub-trees, and are four levels away from having a common ancestor

    Example hierarchical metrics for “Dresses” vs. “Shirts & Tops”

    Example hierarchical metrics for “Dresses” vs. “Shirts & Tops”

    This distance is used as a proxy to indicate the magnitude of incorrectness of our predictions, and allows us to present, and better assess the performance of our models. The lesson here is to always question conventional evaluation metrics, and ensure that they indeed fit your use-case, and measure what matters.

    When Things Go Wrong: Incorrect Classifications ❌

    Like all probabilistic models, our model is bound to be incorrect on occasions. While the goal of model development is to reduce these misclassifications, it’s important to note that 100% accuracy will never be the case (and it shouldn’t be the gold standard that teams drive towards).

    Instead, given that the data product is delivering downstream impact to the business, it's best to determine feedback mechanisms for misclassification instances. This is exactly what we implemented through a unique setup of schematized Kafka events and an in-house annotation platform.

    Feedback system design

    Feedback system design

    This flexible human-in-the-loop setup ensures a plug-in system that any downstream consumer can leverage, leading to reliable, accurate data additions to the model. It also extends beyond misclassifications to entire new streams of data, such that new business owner-facing products/features that allow them to provide category information can directly feed this information back into our models.

    Back to the Future: Potential Improvements 🚀

    Having established a baseline product categorization model, we’ve identified a number of possible improvements that can significantly improve the model’s performance, and therefore its downstream impact on the business.

    Data Imbalance ⚖️

    Much like other e-commerce platforms, Shopify has large sets of merchants selling certain types of products. As a result, our training dataset is skewed towards those product categories.

    At the same time, we don’t want that to preclude merchants in other industries from receiving strong, personalized insights. While we’ve taken some efforts to improve the data balance of each product category in our training data, there’s a lot of room for improvement. This includes experimenting with different re-balancing techniques, such as minority class oversampling (e.g. SMOTE [PDF]), majority class undersampling, or weighted re-balancing by class size.

    Translations 🌎

    As Shopify expands to international markets, it’s increasingly important to make sure we’re providing equal value to all business owners, regardless of their background. While our model currently only supports English language text (that being the primary source available in our training data), there’s a big opportunity here to capture products described and sold in other languages. One of the simplest ways we can tackle this is by leveraging multi-lingual pre-trained models such as Google’s Multilingual Sentence Embeddings.

    Images 📸

    Product images would be a great way to leverage a rich data source to provide a universal language in which products of all countries and types can be represented and categorized. This is something we’re looking to incorporate into our model in the future, however with images come increased engineering resources required. While very expensive to train images from scratch, one strategy we’ve experimented with is using pre-trained image embeddings like Inception v3 [PDF] and developing a CNN for this classification problem.

    Our simple model design allowed us interpretability and reduced computational resource usage, enabling us to solve this problem at Shopify’s scale. Building out a shared language for products unlocked tons of opportunities for us to build out better experiences for business owners and buyers. This includes things like being able to identify trending products or identifying product industries prone to fraud, or even improving storefront search experiences.

    If you’re passionate about building models at scale, and you’re eager to learn more - we’re always hiring! Reach out to us or apply on our careers page.

    Additional Information



    Continue reading

    Software Release Culture at Shopify

    Software Release Culture at Shopify

    A recording of the event and the additional questions are now available in the Release Culture @ Shopify Virtual Event section at the end of the post.

    By Jack Li, Kate Neely, and Jon Geiger

    At the end of last year, we shared the Merge Queue v2 in our blog post Successfully Merging the Work of 1000+ Developers. One question that we often get is, “why did you choose to build this yourself?” The short answer is that nothing we found could quite solve the problem in the way we wanted. The long answer is that it’s important for us to build an optimized experience for how our developers want to work and to continually shape our tooling and process around our “release culture”.

    Shopify defines culture as:

    “The sum of beliefs and behavior of everyone at Shopify.”

    We approach the culture around releasing software the exact same way. We have important goals, like making sure that bad changes don’t deploy to production and break for our users, and that our changes can make it into production without compromises in security. But there are many ways of getting there and a lot of right answers on how we can do things.

    As a team, we try to find the path to those goals that our developers want to take. We want to create experiences through tooling that can make our developers feel productive, and we want to do our best to make shipping feel like a celebration and not a chore.

    Measuring Release Culture at Shopify

    When we talk about measuring culture, we’re talking about a few things.

    • How do developers want to work?
    • What is important to them?
    • How do they expect the tools they use to support them?
    • How much do they want to know about what’s going on behind the scenes or under the hood of the tools they use?

    Often, there isn’t one single answer to these questions, especially given the number and variety of people who deploy every day at Shopify. There are a few active and passive ways we can get a sense of the culture around shipping code. One method isn’t more important than the others, but all of them together paint a clearer picture of what life is like for the people who use our tools.

    Passive and active methods of measurement
    Passive and active methods of measurement

    The passive methods we use really don’t require much work from our team, except to manage and aggregate information that comes in. The developer happiness survey is a biannual survey of developers across the company. Devs are asked to self-report about everything from their satisfaction with the tools they use or where they feel the most of their time is wasted or lost.

    In addition, we have Slack channels dedicated to shipping that are open to anyone. Users can get support from our team or each other, and report problems they’re having. Our team is active in these channels to help foster a sense of community and encourage developers to share their experiences, but we don’t often use these channels to directly ask for feedback.

    That said, we do want to be proactive about identifying pain points, and we know we can’t rely too much on users to provide that direction, so there are also active things we do to make sure we’re solving the most important problems.

    The first thing is dogfooding. Just like other developers at Shopify, our team ships code every day using the same tools that we build and maintain. This helps us identify gaps in our service and empathize with users when things don’t go as planned.

    Another valuable resource is our internal support team. They take on the huge responsibility of helping users and supporting our evolving suite of internal tools. They diagnose issues and help users find the right team to direct their questions. And they are invaluable in terms of identifying common pain points that users experience in current workflows, as well as potential pitfalls in concepts and prototypes. We love them.

    Finally, especially when it comes to adding new features or changing existing workflows, we do UX research throughout our process:

    • to better understand user behavior and expectations
    • to test out concepts and prototypes as we develop them

    We shadow developers as they ship PRs to see what else they’re looking at and what they’re thinking about as they make decisions. We talk to people, like designers and copywriters, who might not ship code at other companies (but they often do at Shopify) and ask them to walk us through their processes and how they learned to use the tools they rely on. We ask interns and new hires to test out prototypes to get fresh perspectives and challenge our assumptions.

    All of this together ensures that, throughout the process of building and launching, we’re getting feedback from real users to make things better.

    Feedback is a Gift

    At Shopify, we often say feedback is a gift, but that doesn’t always make it less intimidating for users to share their frustrations, or easier for us to hear when things go wrong. Our goal with all measuring is to create a feedback loop where users feel comfortable talking about what’s not working for them (knowing that we care and will try to act on it), and we feel energized and inspired by what we learn from users instead of disheartened and bitter. We want them to know that their feedback is valuable and helpful for us to make both the tools and culture around shipping supportive of everyone.

    Shopify’s Release Process

    Let’s look at what Shopify’s actual release process looks like and how we’re working to improve it.

    Release Pipeline

    Happy path of the release pipeline
    Happy path of the release pipeline

    This is what the release pipeline looks like on the happy path. We go from Pull Request (PR) to Continuous Integration (CI)/Merge to Canary deployment and finally Production.

    Release pipeline process starts with a PR and a /shipit command
    Release pipeline process starts with a PR and a /shipit command

    Developers start the process by creating a PR and then issue a /shipit command when ready to ship. From here, the Merge Queue system tries to integrate the PR with the trunk branch, Master.

    PR merged to Master and then deployed to Canary
    PR merged to Master and then deployed to Canary

    When the Merge Queue determines the changes can be integrated successfully, the PR is merged to Master and deployed to our Canary infrastructure. The Canary environment receives a random 5% of all incoming requests.

    Changes deployed to Production Changes deployed to Production 

    Developers have tooling allowing them to test their changes in the Canary environment for 10 minutes. If there’s no manual intervention and the automated canary analysis doesn’t trigger any alerts, the changes are deployed to Production


    Developers want to be trusted and have autonomy over their work. Developers should be able to own the entire release process for their PRs.

    Developers own the whole process
    Developers own the process

    Developers own the whole process. There are no release managers, sign offs, or windows that developers are allowed to make releases in.

    We have infrastructure to limit the blast radius of bad changes
    We have infrastructure to limit the blast radius of bad changes

    Unfortunately, sometimes things will break, and that’s ok. We have built our infrastructure to limit the blast radius of bad changes. Most importantly, we trust each of our developers to be responsible and own the recovery if their change goes bad.

    Developers can fast track a fix using /shipit --emergency command
    Developers can fast track a fix using /shipit --emergency

    Once a fix has been prepared (either a fix-forward or revert), developers can fast track their fix to the front of the line with a single /shipit --emergency command. To help our developers make decisions quickly, we don’t have multiple recovery protocols, and instead, just have a single emergency feature that takes the quickest path to recovery.


    Developers want to ship fast.

    A quick release process allows us a quick path to recovery
    A quick release process allows us a quick path to recovery

    Speed of release is a crucial element to most apps at Shopify. It’s a big productivity boost for developers to ship their code multiple times a day and have it reach end-users immediately. But more importantly, having a quick-release process allows us a quick path to recovery.

    We’re willing to make tradeoffs in cost for a fast release process
    We’re willing to make tradeoffs in cost for a fast release process

    In order to invest in really making our release process fast, we’re willing to make tradeoffs in cost. In addition to dedicated infrastructure teams, we also manage our own CI cluster with multiple thousands of nodes at its daily peak.

    Automate as Much as Possible

    Developers don’t want to perform repetitive tasks. Computers are still better at doing things than humans—so we automate as much as possible. We use automation in places like continuous deployments and canary analysis.

    Developers don't have to press deploy, we automate that
    Developers don't have to press deploy, we automate that

    We automated out the need for developers to press Deploy—we automatically continuously deploy to Canary and Production.

    Developers can override automation
    Developers can override automation

    It’s still important for developers to be able to override the machinery. Developers can lock the automatic deployments and deploy manually, in cases like emergencies.

    Release Culture @ Shopify Virtual Event

    We held Shipit! Presents a Q&A about Release Culture @ Shopify with our guests Jack Li, Kate Neely, and Jon Geiger on April 20, 2020.  We had a discussion about the culture around shipping code at Shopify and answered your questions. We weren't able to answer all the questions during the event, so we've included answers to all the questions that we didn't get to below.

    How has automation and velocity maintained uptime?

    Automation has helped maintain uptime by providing assurances on a level of quality for changes on the way out, such as the automated canary analysis guaranteeing that the changes meet a certain level of production quality. Velocity has helped maintain uptime by reducing downtime; when things break, the high velocity of release means that problems are resolved quicker.

    For an active monolith with many merges happening throughout the day, deploys to canary must be happening very frequently. How do you identify the "bad" merge, if there have been many recent merges on canary, and how do you ensure that bad merges don't block the release of other merges while there's a "bad" merge in the canary environment?

    Our process here is still relatively immature, so triaging failures is still a manual process. Our velocity helps us ensure smaller changelists which makes triaging failures easier. As for reducing the impact of bad changes, I will defer to our blog post about the Merge Queue, which helps us ensure that we are not completely stalled when bad changes happen. 

    How do you as a tooling organization handle sprawl? How do you balance enabling and controlling? That is, can teams choose to use their own languages and frameworks within those languages, or are they restricted to a set of things they're allowed to use?

    Generally, we are more restrictive in technology choices. This is mostly because we want to be strategic in the technologies that we use, and so we are open to experimentation, but we have technologies that are battle-tested at Shopify that we encourage as recommended defaults (e.g. Ruby, Rails). React Native is the Future of Mobile at Shopify is an interesting article that talks about a recent technology change we have made.

    What were you using before /shipit? How did that transition look? How did you measure its success?

    Successfully Merging the Work of 1000+ Developers tells the story of how we got to the current /shipit system. Our two measures of success around this has been through feedback from our developer happiness survey, and from metrics around average pull request time-to-production.

    How many different tools comprise the CI/CD pipeline and are they all developed in house, and are they owned by a specific team or does everyone contribute?

    We work with a variety of vendors! Our biggest partners are Buildkite, which we use for scheduling CI, and GitHub which we build our development workflow around. Some more info about our tooling can be found at The tools we build are developed and owned by our Developer Acceleration team, but everyone is free to contribute and we get tons of contributions all the time. Our CD system, Shipit, is actually open source and we see contributions by community members frequently as well.

    How is performance on production monitored after a feature release?

    Typically this is something that teams themselves will strategize around and monitor. Performance is a big deal to us, and we have an internal dashboard to help teams track their performance metrics. We trust each team to take this component of their product seriously. Aside from that, the Production Engineering team has monitors and dashboards around performance metrics of the entire system.

    How did you get into creating dev tooling for in-house teams. Are there languages/systems you would recommend learning to someone who is interested?

    (Note from Jack: Interpreting this as a kind of career question) Personally, I’ve always gravitated towards the more “meta” parts of software development, focusing on long-term productivity and maintainability of previous projects, so working on dev tooling full-time felt like a perfect fit. In my opinion, the most important skill to be successful in this problem space is to be adaptable, both in adapting to new technologies and to new ideas. Languages like Ruby, Python, that allow you to focus more on the ideas behind your code can be good enablers for this. Docker and Kubernetes knowledge is valuable in this area as well.

    Is development done on feature branches and entire features merged all at once, or are partial/incomplete features merged into master, but guarded by a feature flag?

    Very good question, I think certain teams/features will do slightly different things, but typically releases happen via feature flags that we call “Beta Flags” in our system. This allows changes to be rolled out on a per-shop basis, or a percentage-of-shops basis.

    Do you guys use Crystalball?

    We forwarded this question to our test infrastructure team, their response was that we don’t use Crystallball, there was some brief exploration into this, but it wasn’t fast enough to trace through our codebase, and the test suite in our main monolith is written in minitest.

    Additional Information

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.Learn about the actions we’re taking as we continue to hire during COVID‑19



    Continue reading

    Building Arrive's Confetti in React Native with Reanimated

    Building Arrive's Confetti in React Native with Reanimated

    Shopify is investing in React Native as our primary choice of mobile technology moving forward. As a part of this we’ve rewritten our package tracking app Arrive with React Native and launched it on Android—an app that previously only had an iOS version.

    One of the most cherished features by the users of the Arrive iOS app is the confetti that rains down on the screen when an order is delivered. The effect was implemented using the built-in CAEmitterLayer class in iOS, producing waves of confetti bursting out with varying speeds and colors from a single point at the top of the screen.

    When we on the Arrive team started building the React Native version of the app, we included the same native code that produced the confetti effect through a Native Module wrapper. This would only work on iOS however, so to bring the same effect to Android we had two options before us:

    1. Write a counterpart to the iOS native code in Android with Java or Kotlin, and embed it as a Native Module.
    2. Implement the effect purely in JavaScript, allowing us to share the same code on both platforms.

    As you might have guessed from the title of this blog post, we decided to go with the second option. To keep the code as performant as the native implementation, the best option would be to write it in a declarative fashion with the help of the Reanimated library.

    I’ll walk you through, step by step, how we implemented the effect in React Native, while also explaining what it means to write an animation declaratively.

    When we worked on this implementation, we also decided to make some visual tweaks and improvements to the effect along the way. These changes make the confetti spread out more uniformly on the screen, and makes them behave more like paper by rotating along all three dimensions.

    Laying Out the Confetti

    To get our feet wet, the first step will be to render a number of confetti on the screen with different colors, positions and rotation angles.

    Initialize the view of 100 confetti

    Initialize the view of 100 confetti

    We initialize the view of 100 confetti with a couple of randomized values and render them out on the screen. To prepare for animations further down the line, each confetto (singular form of confetti, naturally) is wrapped with Reanimated's Animated.View. This works just like the regular React Native View, but accepts declaratively animated style properties as well, which I’ll explain in the next section.

    Defining Animations Declaratively

    In React Native, you generally have two options for implementing an animation:

    1. Write a JavaScript function called by requestAnimationFrame on every frame to update the properties of a view.
    2. Use a declarative API, such as Animated or Reanimated, that allows you to declare instructions that are sent to the native UI-thread to be run on every frame.

    The first option might seem the most attractive at first for its simplicity, but there’s a big problem with the approach. You need to be able to calculate the new property values within 16 milliseconds every time to maintain a consistent 60 FPS animation. In a vacuum, this might seem like an easy goal to accomplish, but because of JavaScript's single threaded nature you’ll also be blocked by anything else that needs to be computed in JavaScript during the same time period. As an app grows and needs to be able to do more things at once, it quickly becomes implausible to always be able to finish the computation within the strict time limit.

    With the second option, you only rely on JavaScript at the beginning of the animation to set it all up, after which all computation happens on the native UI-thread. Instead of relying on a JavaScript function to answer where to move a view on each frame, you assemble a set of instructions that the UI-thread itself can execute on every frame to update the view. When using Reanimated these instructions can include conditionals, mathematical operations, string concatenation, and much more. These can be combined in a way that almost resembles its own programming language. With this language, you write a small program that can be sent down to the native layer, that is executed once every frame on the UI-thread.

    Animating the Confetti

    We are now ready to apply animations to the confetti that laid out in the previous step. Let's start by updating our createConfetti function:

    Instead of randomizing x, y and angle, we give all confetti the same initial values but instead randomize the velocities that we're going to be applying to them. This creates the effect of all confetti starting out inside an imaginary confetti cannon and shooting out in different directions and speeds. Each velocity expresses how much a value will be changing for each full second of animation.

    We need to wrap each value that we're intending to animate with Animated.Value, to prepare them for declarative instructions. The Animated.Clock value is what's going to be the driver of all our animations. As the name implies it gives us access to the animation's time, which we'll use to decide how much to move each value forward on each update.

    Further down, next to where we’re mapping over and rendering the confetti, we add our instructions for how the values should be animated:

    Before anything else, we set up our dt (delta time) value that will express how much time has passed since the last update, in seconds. This decides the x, y, and angle delta values that we're going to apply.

    To get our animation going we need to start the clock if it's not already running. To do this, we wrap our instructions in a condition, cond, which checks the clock state and starts it if necessary. We also need to call our timeDiff (time difference) value once to set it up for future use, since the underlying diff function returns its value’s difference since the last frame it evaluated, and the first call will be used as the starting reference point.

    The declarative instructions above roughly translate to the following pseudo code, which runs on every frame of the animation:

    Considering the nature of confetti falling through the air, moving at constant speed makes sense here. If we were to simulate more solid objects that aren't slowed down by air resistance as much, we might want to add a yAcc (y-axis acceleration) variable that would also increase the yVel (y-axis velocity) within each frame.

    Everything put together, this is what we have now:

    Staggering Animations

    The confetti is starting to look like the original version, but our React Native version is blurting out all the confetti at once, instead of shooting them out in waves. Let's address this by staggering our animations:

    We add a delay property to our confetti, with increasing values for each group of 10 confetti. To wait for the given time delay, we update our animation code block to first subtract dt from delay until it reaches below 0, after which our previously written animation code kicks in.

    Now we have something that pretty much looks like the original version. But isn’t it a bit sad that a big part of our confetti is shooting off the horizontal edges of the screen without having a chance to travel across the whole vertical screen estate? It seems like a missed potential.

    Containing the Confetti

    Instead of letting our confetti escape the screen on the horizontal edges, let’s have them bounce back into action when that’s about to happen. To prevent this from making the confetti look like pieces of rubber macaroni bouncing back and forth, we need to use a good elasticity multiplier to determine how much of the initial velocity to keep after the collision.

    When an x value is about to go outside the bounds of the screen, we reset it to the edge’s position and reverse the direction of xVel while reducing it by the elasticity multiplier at the same time:

    Adding a Cannon and a Dimension

    We’re starting to feel done with our confetti, but let’s have a last bit of fun with it before shipping it off. What’s more fun than a confetti cannon shooting 2-dimensional confetti? The answer is obvious of course—it’s two confetti cannons shooting 3-dimensional confetti!

    We should also consider cleaning up by deleting the confetti images and stopping the animation once we reach the bottom of the screen, but that’s not nearly as fun as the two additions above so we’ll leave that out of this blog post.

    This is the result of adding the two effects above:

    The final full code for this component is available in this gist.

    Driving Native-level Animation with JavaScript

    While it can take some time to get used to the Reanimated’s seemingly arcane API, once you’ve played around with it for a bit there should be nothing stopping you from implementing butter smooth cross-platform animations in React Native, all without leaving the comfort of the JavaScript layer. The library has many more capabilities we haven’t touched on in this post, for example, the possibility to add user interactivity by mapping animations to touch gestures. Keep a lookout for future posts on this subject!

    Continue reading

    Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization

    Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization

    Shopify's involvement with TruffleRuby began half a year ago, with the goal of furthering the success of the project and Ruby community. TruffleRuby is an alternative implementation of the Ruby language (where the reference implementation is CRuby, or MRI) developed by Oracle Labs. TruffleRuby has high potential in speed, as it is nine times faster than CRuby on optcarrot, a NES emulator benchmark developed by the Ruby Core Team.

    I’ll walk you through a simple feature I investigated and implemented. It showcases many important aspects of TruffleRuby and serves as a great introduction to the project!

    Introduction to Ruby Lazy Initialization

    Ruby developers tend to use the double pipe equals operator ||= for lazy initialization, likely somewhat like this:

    Syntactically, the meaning of the double pipe equals operator is that the value is assigned if the value of the variable is currently not set.

    The common use case of the operator isn’t so much “assign if” and more “assign once”.

    This idiomatic usage is a subset of the operator’s syntactical meaning so prioritizing that logic in the compiler can improve performance. For TruffleRuby, this would lead to less machine code being emitted as the logic flow is shortened.

    Analyzing Idiomatic Usage

    To confirm that this usage is common enough to be worth optimizing for, I ran static profiling on how many times this operator is used as lazy initialization.

    For a statement to count as a lazy initialization for these profiling purposes, we had it match one of the following requirements:

    • The value being assigned is a constant (uses only literals of int, string, symbol, hash, array or is a constant variable). An example would be a ||= [2 * PI].
    • The statement with the ||= operator is in a function, an instance or class variable is being assigned, and the name of the instance variable contains the name of the function or vice versa. The function must accept no params. An example would be def get_a; @a ||= func_call.

    These criteria are very conservative. Here are some examples of cases that won’t be considered a lazy initialization but probably still follow the pattern of “assign once”.

    After profiling 20 popular open-source projects, I found 2082 usages of the ||= operator, 64% of them being lazy initialization by this definition.

    Compiling Code with TruffleRuby

    Before we get into optimizing TruffleRuby for this behaviour, here’s some background on how TruffleRuby compiles your code.

    TruffleRuby is an implementation of Ruby that aims for higher performance through optimizing Just In Time (JIT) compilation (programs that are compiled as they're being executed). It’s built on top of GraalVM, a modified JVM built by Oracle that provides Truffle, a framework used by TruffleRuby for implementing languages through building Abstract Syntax Tree (AST) interpreters. With Truffle, there’s no explicit step where JVM bytecode is created as with a conventional JVM language, rather Truffle will just use the interpreter and communicate with the JVM to create machine code directly with profiling and a technique called partial evaluation. This means that GraalVM can be advertised as magic that converts interpreters into compilers!

    TruffleRuby also leverages deoptimization (more than other implementations of Ruby) which is a term for quickly moving between the fast JIT-compiled machine code to the slow interpreter. One application for deoptimization is how the compiler handles monkey patching (e.g. replacing a class method at runtime). It’s unlikely that a method will be monkey patched, so we can deoptimize if it has been monkey patched to find and execute the new method. The path for handling the monkey patching won't need to be compiled or appear in the machine code. In practice, this use case is even better—instead of constantly checking if a function has been redefined, we can just place the deoptimization where the redefinition is and never need a check in compiled code.

    In this case with lazy initialization, we make the deoptimization case the uncommon one where the variable needs to be assigned a value more than once.

    Implementing the Deoptimization

    Before when TruffleRuby encountered the ||= operator, a Graal profiler would see that since both sides have been used, the entire statement should be compiled into machine code. Our knowledge of how Ruby is used in practice tells us that the right hand side is unlikely to be run again, and so doesn’t need to be compiled into machine code if it’s never been executed or has been executed just once.

    TruffleRuby uses little objects called nodes to represent each part of a Ruby program. We use an OrNode to handle the ||= operator, with the left side being the condition and the right side being the action to execute if the left side is true (in this case the action is an assignment). The creation of these nodes are implemented in Java.

    To make this optimization, we swapped out the standard OrNode for an OrLazyValueDefinedNode in the BodyTranslator which translates the Ruby AST into nodes that Truffle can understand.

    The basic OrNode executes like this:

    The ConditionProfile is what counts how many times each branch is executed. With lazy initialization it counts both sides as used by default, so compiles them both into the machine code.

    The OrLazyValueDefinedNode only changes the else block. What I'm doing here is counting the number of times the else part is executed, and turning it into a deoptimization if it’s less than twice.

    Benchmarking and Impact

    Benchmarking isn’t a perfect measure of how effective this change is (benchmarking is arguably never perfect, but that’s a different conversation), as the results would be too noisy to observe in a large project. However, I can still benchmark on some pieces of code to see the improvements. By doing the “transfer to interpreter and invalidate”, time and space is saved in creating machine code for everything related to the right side.

    With our new optimization this piece of code compiles about 6% faster and produces about 63% fewer machine code by memory (about half the number of assembly instructions). Faster compilation means more time for your app to run, and smaller machine code means less usage of memory and cache. Producing less machine code more quickly improves responsiveness and should in turn make the program run faster, though it's difficult to prove.

    Function foo without optimization

    Function foo without optimization


    Above is a graph of the foo method in sample code above without the optimization that vaguely represents the logic present in the machine code. I can look at the actual compiler graphs produced by Graal at various stages to understand how exactly our code is being compiled, but this is the overview.

    Each of the nodes in this graph expands to more control flow and memory access, which is why this optimization can impact the amount of machine code so much. This graph represents the uncommon case where the checks and call to the calculate_foo method are needed, so for lazy initialization it’ll only need this flow once or zero times.

    Function foo with optimizationFunction foo with optimization

    The graph that includes the optimization is a bit less complex. The control flow doesn’t need to know anything about variable assignment or anything related to calling and executing a method.

    What I've added is just an optimization, so if you:

    • aren’t using ||= to mean lazy initialization
    • need to run the right-hand-side of the expression multiple times
    • need it to be fast

    then the optimization goes away and the code is compiled as it would have done before (you can revisit the OrLazyValueDefinedNode source above to see the logic for this).

    This optimization shows the benefit of looking at codebases used in industry for patterns that aren’t visible in the language specifications. It’s also worth noting that none of the code changes here were very complicated and modified code in a very modular way—other than the creation of the new node, only one other line was touched!

    Truffle is actually named after the chocolates, partially in reference to the modularity of a box of chocolates. Apart from modularity, TruffleRuby is also easy to develop on as it's primarily written in Ruby and Java (there's some C in there for extensions).

    Shopify is leading the way in experimenting with TruffleRuby for production applications. TruffleRuby is currently mirroring storefront traffic. This helped us work through some bugs, build better tooling for TruffleRuby and can lead to a faster browsing for customers.

    We also contribute to CRuby/MRI and Sorbet as a part of our work on Ruby. We like desserts, so along with contributions to TruffleRuby and Sorbet, we maintain Tapioca! If you'd like to become a part of our dessert medley (or work on other amazing Shopify projects), send us an application!

    Additional Information

    Tangentially related things about Ruby and TruffleRuby

    If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    Refactoring Legacy Code with the Strangler Fig Pattern

    Refactoring Legacy Code with the Strangler Fig Pattern

    Large objects are a code smell: overloaded with responsibilities and dependencies, as they continue to grow, it becomes more difficult to define what exactly they’re responsible for. Large objects are harder to reuse and slower to test. Even worse, they cost developers additional time and mental effort to understand, increasing the chance of introducing bugs. Unchecked, large objects risk turning the rest of your codebase into a ball of mud, but fear not! There are strategies for reducing the size and responsibilities of large objects. Here’s one that worked for us at Shopify, an all-in-one commerce platform supporting over one million merchants across the globe. 

    As you can imagine, one of the most critical areas in Shopify’s Ruby on Rails codebase is the Shop model. Shop is a hefty class with well over 3000 lines of code, and its responsibilities are numerous. When Shopify was a smaller company with a smaller codebase, Shop’s purpose was clearer: it represented an online store hosted on our platform. Today, Shopify is far more complex, and the business intentions of the Shop model are murkier. It can be described as a God Object: a class that knows and does too much.

    My team, Kernel Architecture Patterns, is responsible for enforcing clean, efficient, scalable architecture in the Shopify codebase. Over the past few years, we invested a huge effort into componentizing Shopify’s monolithic codebase (see Deconstructing the Monolith) with the goal of establishing well-defined boundaries between different domains of the Shopify platform.

    Not only is creating boundaries at the component-level important, but establishing boundaries between objects within a component is critical as well. It’s important that the business subdomain modelled by an object is clearly defined. This ensures that classes have clear boundaries and well-defined sets of responsibilities.

    Shop’s definition is unclear, and its semantic boundaries are weak. Unfortunately, this makes it an easy target for the addition of new features and complexities. As advocates for clean, well-modelled code, it was evident that the team needed to start addressing the Shop model and move some of its business processes into more appropriate objects or components.

    Using the ABC Code Metric to Determine Code Quality

    Knowing where to start refactoring can be a challenge, especially with a large class like Shop. One way to find a starting point is to use a code metric tool. It doesn’t really matter which one you choose, as long as it makes sense for your codebase. Our team opted to use Flog, which uses a score based on the number of assignments, branches and calls in each area of the code to understand where code quality is suffering the most. Running Flog identified a particularly disordered portion in Shop: store settings, which contains numerous “global attributes” related to a Shopify store.

    Refactoring Shop with the Strangler Fig Pattern

    Extracting store settings into more appropriate components offered a number of benefits, notably better cohesion and comprehension in Shop and the decoupling of unrelated code from the Shop model. Refactoring Shop was a daunting task—most of these settings were referenced in various places throughout the codebase, often in components that the team was unfamiliar with. We knew we’d potentially make incorrect assumptions about where these settings should be moved to. We wanted to ensure that the extraction process was well laid out, and that any steps taken were easily reversible in case we changed our minds about a modelling decision or made a mistake. Guaranteeing no downtime for Shopify was also a critical requirement, and moving from a legacy system to an entirely new system in one go seemed like a recipe for disaster.

    What is the Strangler Fig Pattern?

    The solution? Martin Fowler’s Strangler Fig Pattern. Don’t let the name intimidate you! The Strangler Fig Pattern offers an incremental, reliable process for refactoring code. It describes a method whereby a new system slowly grows over top of an old system until the old system is “strangled” and can simply be removed. The great thing about this approach is that changes can be incremental, monitored at all times, and the chances of something breaking unexpectedly are fairly low. The old system remains in place until we’re confident that the new system is operating as expected, and then it’s a simple matter of removing all the legacy code.

    That’s a relatively vague description of the Strangler Fig Pattern, so let’s break down the 7-step process we created as we worked to extract settings from the Shop model. The following is a macro-level view of the refactor.

    Macro-level view of the Strangler Fig Pattern
    Macro-level view of the Strangler Fig Pattern

    We’ll dive into exactly what is involved in each step, so don’t worry if this diagram is a bit overwhelming to begin with.

    Step 1: Define an Interface for the Thing That Needs to Be Extracted

    Define the public interface by adding methods to an existing class, or by defining a new model entirely.
    Define the public interface by adding methods to an existing class, or by defining a new model entirely

    The first step in the refactoring process is to define the public interface for the thing being extracted. This might involve adding methods to an existing class, or it may involve defining a new model entirely. This first step is just about defining the new interface; we’ll depend on the existing interface for reading data during this step. In this example, we’ll be depending on an existing Shop object and will continue to access data from the shops database table.

    Let’s look at an example involving Shopify Capital, Shopify’s finance program. Shopify Capital offers cash advances and loans to merchants to help them kick-start their business or pursue their next big goal. When a merchant is approved for financing, a boolean attribute, locked_settings, is set to true on their store. This indicates that certain functionality on the store is locked while the merchant is taking advantage of a capital loan. The locked_settings attribute is being used by the following methods in the Shop class:

    We already have a pretty clear idea of the methods that need to be involved in the new interface based on the existing methods that are in the Shop class. Let’s define an interface in a new class, SettingsToLock, inside the Capital component.

    As previously mentioned, we’re still reading from and writing to a Shop object at this point. Of course, it’s critical that we supply tests for the new interface as well.

    We’ve clearly defined the interface for the new system. Now, clients can start using this new interface to interact with Capital settings rather than going through Shop.

    Step 2: Change Calls to the Old System to Use the New System Instead

    Replace calls to the existing “host” interface with calls to the new system instead
    Replace calls to the existing “host” interface with calls to the new system instead

    Now that we have an interface to work with, the next step in the Strangler Fig Pattern is to replace calls to the existing “host” interface with calls to the new system instead. Any objects sending messages to Shop to ask about locked settings will now direct their messages to the methods we’ve defined in Capital::SettingsToLock.

    In a controller for the admin section of Shopify, we have the following method:

    This can be changed to:

    A simple change, but now this controller is making use of the new interface rather than going directly to the Shop object to lock settings.

    Step 3: Make a New Data Source for the New System If It Requires Writing

    New Data Source
    New data source

    If data is written as a part of the new interface, it should be written to a more appropriate data source. This might be a new column in an existing table, or may require the creation of a new table entirely.

    Continuing on with our existing example, it seems like this data should belong in a new table. There are no existing tables in the Capital component relevant to locked settings, and we’ve created a new class to hold the business logic—these are both clues that we need a new data source.

    The shops table currently looks like this in db/schema.rb

    We create a new table, capital_shop_settings_locks, with a column locked_settings and a reference to a shop.

    The creation of this new table marks the end of this step.

    Step 4: Implement Writers in the New Model to Write to the New Data Source

    implement writers in the new model to write data to the new data source while also writing to the existing data source
    Implement writers in the new model to write data to the new data source and existing data source

    The next step in the Strangler Fig Pattern is a bit more involved. We need to implement writers in the new model to write data to the new data source while also writing to the existing data source.

    It’s important to note that while we have a new class, Capital::SettingsToLock, and a new table, capital_shop_settings_locks, these aren’t connected at the moment. The class defining the new interface is a plain old Ruby object and solely houses business logic. We are aiming to create a separation between the business logic of store settings and the persistence (or infrastructure) logic. If you’re certain that your model’s business logic is going to stay small and uncomplicated, feel free to use a single Active Record. However, you may find that starting with a Ruby class separate from your infrastructure is simpler and faster to test and change.

    At this point, we introduce a record object at the persistence layer. It will be used by the Capital::SettingsToLock class to read data from and write data to the new table. Note that the record class will effectively be kept private to the business logic class.

    We accomplish this by creating a subclass of ApplicationRecord. Its responsibility is to interact with the capital_shop_settings_locks table we’ve defined. We define a class Capital::SettingsToLockRecord, map it to the table we’ve created, and add some validations on the attributes.

    Let’s add some tests to ensure that the validations we’ve specified on the record model work as intended:

    Now that we have Capital::SettingsToLockRecord to read from and write to the table, we need to set up Capital::SettingsToLock to access the new data source via this record class. We can start by modifying the constructor to take a repository parameter that defaults to the record class:

    Next, let’s define a private getter, record. It performs find_or_initialize_by on the record model, Capital::SettingsToLockRecord, using shop_id as an argument to return an object for the specified shop.

    Now, we complete this step in the Strangler Fig Pattern by starting to write to the new table. Since we’re still reading data from the original data source, we‘ll need to write to both sources in tandem until the new data source is written to and has been backfilled with the existing data. To ensure that the two data sources are always in sync, we’ll perform the writes within transactions. Let’s refresh our memories on the methods in Capital::SettingsToLock that are currently performing writes.

    After duplicating the writes and wrapping these double writes in transactions, we have the following:

    The last thing to do is to add tests that ensure that lock and unlock are indeed persisting data to the new table. We control the output of SettingsToLockRecord’s find_or_initialize_by, stubbing the method call to return a mock record.

    At this point, we are successfully writing to both sources. That concludes the work for this step.

    Step 5: Backfill the New Data Source with Existing Data

    Backfill the data
    Backfill the data

    The next step in the Strangler Fig Pattern involves backfilling data to the new data source from the old data source. While we’re writing new data to the new table, we need to ensure that all of the existing data in the shops table for locked_settings is ported over to capital_shop_settings_locks.

    In order to backfill data to the new table, we’ll need a job that iterates over all shops and creates record objects from the data on each one. Shopify developed an open-source iteration API as an extension to Active Job. It offers safer iterations over collections of objects and is ideal for a scenario like this. There are two key methods in the iteration API: build_enumerator specifies the collection of items to be iterated over, and each_iteration defines the actions to be taken out on each object in the collection. In the backfill task, we specify that we’d like to iterate over every shop record, and each_iteration contains the logic for creating or updating a Capital::SettingsToLockRecord object given a store. The alternative is to make use of Rails’ Active Job framework and write a simple job that iterates over the Shop collection. 

    Some comments about the backfill task: the first is that we’re placing a pessimistic lock on the Shop object prior to updating the settings record object. This is done to ensure data consistency across the old and new tables in a scenario where a double write occurs at the same time as a row update in the backfill task. The second thing to note is the use of a logger to output information in the case of a persistence failure when updating the settings record object. Logging is extremely helpful in pinpointing the cause of persistence failures in a backfill task such as this one, should they occur.

    We include some tests for the job as well. The first tests the happy path and ensures that we're creating and updating settings records for every Shop object. The other tests the unhappy path in which a settings record update fails and ensures that the appropriate logs are generated

    After writing the backfill task, we enqueue it via a Rails migration:

    Once the task has run successfully, we celebrate that the old and new data sources are in sync. It’s wise to compare the data from both tables to ensure that the two data sources are indeed in sync and that the backfill hasn’t failed anywhere.

    Step 6: Change the Methods in the Newly Defined Interface to Read Data from the New Source

    Change the reader methods to use the new data source
    Change the reader methods to use the new data source

    The remaining steps of the Strangler Fig Pattern are fairly straightforward. Now that we have a new data source that is up to date with the old data source and is being written to reliably, we can change the reader methods in the business logic class to use the new data source via the record object. With our existing example, we only have one reader method:

    It’s as simple as changing this method to go through the record object to access locked_settings:

    Step 7: Stop Writing to the Old Source and Delete Legacy Code

    Remove the now-unused, “strangled” code from the codebase
    Remove the now-unused, “strangled” code from the codebase

    We’ve made it to the final step in our code strangling! At this point, all objects are accessing locked_settings through the Capital::SettingsToLock interface, and this interface is reading from and writing to the new data source via the Capital::SettingsToLockRecord model. The only thing left to do is remove the now-unused, “strangled” code from the codebase.

    In Capital::SettingsToLock, we remove the writes to the old data source in lock and unlock and get rid of the getter for shop. Let’s review what Capital::SettingsToLock looks like.

    After the changes, it looks like this:

    We can remove the tests in Capital::SettingsToLockTest that assert that lock and unlock write to the shops table as well.

    Last but not least, we remove the old code from the Shop model, and drop the column from the shops table.

    With that, we’ve successfully extracted a store settings column from the Shop model using the Strangler Fig Pattern! The new system is in place, and all remnants of the old system are gone.


    In summary, we’ve followed a clear 7-step process known as the Strangler Fig Pattern to extract a portion of business logic and data from one model and move it into another:

    1. We defined the interface for the new system.
    2. We incrementally replaced reads to the old system with reads to the new interface.
    3. We defined a new table to hold the data and created a record for the business logic model to use to interface with the database.
    4. We began writing to the new data source from the new system.
    5. We backfilled the new data source with existing data from the old data source.
    6. We changed the readers in the new business logic model to read data from the new table.
    7. Finally, we stopped writing to the old data source and deleted the remaining legacy code.

    The appeal of the Strangler Fig Pattern is evident. It reduces the complexity of the refactoring journey by offering an incremental, well-defined execution plan for replacing a legacy system with new code. This incremental migration to a new system allows for constant monitoring and minimizes the chances of something breaking mid-process. With each step, developers can confidently move towards a refactored architecture while ensuring that the application is still up and tests are green. We encourage you to try out the Strangler Fig Pattern with a small system that already has good test coverage in place. Best of luck in future code-strangling endeavors!

    Continue reading

    The Evolution of Kit: Automating Marketing Using Machine Learning

    The Evolution of Kit: Automating Marketing Using Machine Learning

    For many Shopify business owners, whether they’ve just started their entrepreneur journey or already have an established business, marketing is one of the essential tactics to build audience and drive sales to their stores. At Shopify, we offer various tools to help them do marketing. One of them is Kit, a virtual employee that can create easy Instagram and Facebook ads, perform email marketing automation and offer many other skills through powerful app integrations. I’ll talk about the engineering decision my team made to transform Kit from a rule based system to an artificially-intelligent assistant that looks at a business owner’s products, visitors, and customers to make informed recommendations for the next best marketing move. 

    As a virtual assistant, Kit interacts with business owners through messages over various interfaces including Shopify Ping and SMS. Designing the user experience for messaging is challenging especially when creating marketing campaigns that can involve multiple steps, such as picking products, choosing audience and selecting budget. Kit not only builds a user experience to reduce the friction for business owners in creating ads, but also goes a step further to help them create more effective and performant ads through marketing recommendation.

    Simplifying Marketing Using Heuristic Rules

    Marketing can be daunting, especially when the number of different configurations in the Facebook Ads Manager can easily overwhelm its users.

    Facebook Ads Manager ScreenshotFacebook Ads Manager screenshot

    There is a long list of settings that need configuring including objective, budget, schedule, audience, ad format and creative. For a lot of business owners who are new to marketing, understanding all these concepts is already time consuming, let alone making the correct decision at every decision point in order to create an effective marketing campaign.

    Kit simplifies the flow by only asking for the necessary information and configuring the rest behind the scenes. The following is a typical flow on how a business owner interacts with Kit to start a Facebook ad.

    Screenshot of the conversation flow on how a business owner interacts with Kit to start a Facebook ad
    Screenshot of the conversation flow on how a business owner interacts with Kit to start a Facebook ad

    Kit simplifies the workflow into two steps: 1) pick products as ad creative and 2) choose a budget. We use heuristic rules based on our domain knowledge and give business owners limited options to guide them through the workflow. For products, we identify several popular categories that they want to market. For budget, we offer a specific range based on the spending behavior of the business owners we want to help. For the rest of configurations, Kit defaults to best practices removing the need to make decisions based on expertise.

    The first version of Kit was a standalone application that communicated with Shopify to extract information such as orders and products to make product suggestions and interacted with different messaging channels to deliver recommendations conversationally.

    System interaction diagram for heuristic rules based recommendation. There are two major systems that Kit interacts with: Shopify for product suggestions; messaging channels for communication with business owners
    System interaction diagram for heuristic rules based recommendation. There are two major systems that Kit interacts with: Shopify for product suggestions; messaging channels for communication with business owners

    Building Machine Learning Driven Recommendation

    One of the major limitations in the existing heuristic rules-based implementations is that the range of budget is hardcoded into the application where every business owner has the same option to choose from. The static list of budget range may not fit their needs, where the more established ones with store traffic and sales may want to spend more. In addition, for many of the business owners who don’t have enough marketing experience, it’s a tough decision to choose the right amount in order to generate the optimal return.

    Kit strives to automate marketing by reducing steps when creating campaigns. We found that budgeting is one of the most impactful criteria in contributing to successful campaigns. By eliminating the decision from the configuration flow, we reduced the friction for business owners to get started. In addition, we eliminated the first step of picking products by generating a proactive recommendation for a specific category such as new products. Together, Kit can generate a recommendation similar to the following:

    Screenshot of a one-step marketing recommendation
    Screenshot of a one-step marketing recommendation

    To generate this recommendation, there are two major decisions Kit has to make:

    1. How much is the business owner willing to spend?
    2. Given the current state of the business owner, will the budget be enough for the them to successfully generate sales?

    Kit decided that for business owner Cheryl, she should spend about $40 for the best chance to make sales given the new products marketing opportunity. From a data science perspective, it’s broken down into two types of machine learning problems:

    1. Regression: given a business owner’s historic spending behavior, predict the budget range that they’re likely to spend.
    2. Classification: given the budget a business owner has with store attributes such as existing traffic and sales that can measure the state of their stores, predict the likelihood of making sales.

    The heuristic rules-based system allowed Kit to collect enough data to make solving the machine learning problem possible. Kit can generate actionable marketing recommendation that gives the business owners the best chance of making sales based on their budget range and the state of their stores using the data we learnt.

    The second version of Kit had its first major engineering revision by implementing the proactive marketing recommendation in the app through the machine learning architecture in Google Cloud Platform:

    Flow diagram on generating proactive machine learning driven recommendation in Kit
    Flow diagram on generating proactive machine learning driven recommendation in Kit

    There are two distinct flows in this architecture:

    Training flow: Training is responsible for building the regression and classification models that are used in the prediction flow.

    1. Aggregate all relevant features. This includes the historic Facebook marketing campaigns created by business owners through Shopify, and the store state (e.g. traffic and sales) at the time when they create the marketing campaign.
    2. Perform feature engineering, a process using domain knowledge to extract useful features from the source data that are used to train the machine learning models. For historic marketing features, we derive features such as past 30 days average ad spend and past 30 days marketing sales. For shop state features, we derive features such as past 30 days unique visitors and past 30 days total orders. We take advantage of Apache Spark’s distributed computation capability to tackle the large scale Shopify dataset.
    3. Train the machine learning models using Google Cloud’s ML Engine. ML Engine allows us to train models using various popular frameworks including scikit-learn and TensorFlow.
    4. Monitor the model metrics. Model metrics are methods to evaluate the performance of a given machine learning model by comparing the predicted values against the ground truth. Monitoring is the process to validate the integrity of the feature engineering and model training by comparing the model metrics against its historic values. The source features in Step 1 can sometimes be broken leading to inaccurate feature engineering results. Even when feature pipeline is intact, it’s possible that the underlying data distribution changes due to unexpected new user behavior leading to deteriorating model performance. A monitoring process is important to keep track of historic metrics and ensure the model performs as expected before making it available for use. We employed two types of monitoring strategies: 1) threshold: alert when the model metric is beyond a defined threshold; 2) outlier detection: alert when the model metrics deviates from its normal distribution. We use z-score to detect outliers.
    5. Persist the models for prediction flow.
    Prediction flow: Prediction is responsible for generating the marketing recommendation by optimizing for the budget and determining whether or not the ad will generate sales given the existing store state.
    1. Generate marketing recommendations by making predictions using the features and models prepared in the training flow.
    2. Send recommendations to Kit through Apache Kafka.

    At Shopify, we have a data platform engineering team to maintain the data services required to implement both the training and prediction flows. This allows the product team to focus on building the domain specific machine learning pipelines, prove product values, and iterate quickly.

    Moving to Real Time Prediction Architecture

    Looking back at our example featuring business owner Cheryl, Kit decided that she can spend $40 for the best chance of making sales. In marketing, making sales is often not the first step in a business owner’s journey, especially when they don’t have any existing traffic to their store. Acquiring new visitors to the store in order to build lookalike audiences that are more relevant to the business owner is a crucial step to expand the audience size in order to create more successful marketing campaigns afterward. For this type of business owner, Kit evaluates the budget based on a different goal and suggests a more appropriate amount to acquire enough new visitors in order to build the lookalike audience. This is how the recommendation looks:

    [Screenshot of a recommendation to build lookalike audience
    Screenshot of a recommendation to build lookalike audience

    To generate this recommendation, there are three major decisions Kit has to make:

    1. How many new visitors does the business owner need in order to create lookalike audiences?
    2. How much are they willing to spend?
    3. Given the current state of the business owner, will the budget be enough for them to acquire those visitors?

    Decision two and three are solved using the same machine learning architecture as described previously. However, there’s a new complexity in this recommendation that step one needs to determine the required number of new visitors in order to build lookalike audiences. Since the traffic to a store can change in real time, the prediction flow needs to process the request at the time when the recommendation is delivered to the business owner.

    One major limitation for the Spark-based prediction flow is that recommendations are optimized in batch manner rather than on demand, i.e., the prediction flow is triggered from Spark on schedule basis rather than from Kit at the time when the recommendation is delivered to business owners. With the Spark batch setting, it’s possible that the budget recommendation is already stale by the time it’s delivered to the business owner. To solve that problem, we built a real time prediction service to replace the Spark prediction flow.

    Flow diagram on generating real time recommendation in Kit
    Flow diagram on generating real time recommendation in Kit

    One major distinction compared to the previous Spark-based prediction flow is that Kit is proactively calling into the real time prediction service to generate the recommendation.

    1. Based on the business owner’s store state, Kit decides that their marketing objective should be building lookalike audiences. Kit sends a request to the prediction service to generate budget recommendation from which the request reaches an HTTP API exposed through the web container component.
    2. Similar to the batch prediction flow in Spark, the web container generates marketing recommendations by making predictions using the features and models prepared in the training flow. However, there are several design considerations:
      1. We need to ensure efficient access to the features to minimize prediction request latency. Therefore, once features are generated during the feature engineering stage, they are immediately loaded into a key value store using Google Cloud’s Bigtable.
      2. Model prediction can be computationally expensive especially when the model architecture is complex. We use Google’s TensorFlow Serving which is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving also provides out-of-the-box integration with TensorFlow models from which it can directly consume the models generated from the training flow with minimal configurations.
      3. Since the most heavy-lifting CPU/GPU-bound model prediction operations are dedicated to TensorFlow Serving, the web container remains a light-weight application that holds the business logic to generate recommendations. We chose Tornado as the Python web framework. By using non-blocking network I/O, Tornado can scale to tens of thousands open connections for model predictions.
    3. Model predictions are delegated to the TensorFlow Serving container.
    4. TensorFlow Serving container preloads the machine learning models generated during the training flow and uses them to perform model predictions upon requests.

    Powering One Third of All Kit Marketing Campaigns

    Kit started as a heuristic rules-based application that uses common best practices to simplify and automate marketing for Shopify’s business owners. We progressively improved the user experience by building machine learning driven recommendations to further reduce user friction and to optimize budgets giving business owners a higher chance of creating a more successful campaign. By first using a well established Spark-based prediction process (that’s well supported within Shopify) we showed the value of machine learning in driving user engagement and marketing results. This also allows us to focus on productionalizing an end-to-end machine learning pipeline with both training and prediction flows that serve tens of thousands of business owners. 

    We learned that having a proper monitoring component in place is crucial to ensure the integrity of the overall machine learning system. We moved to an advanced real time prediction architecture to solve use cases that required time-sensitive recommendations. Although the real time prediction service introduced two additional containers (web and TensorFlow Serving) to maintain, we delegated the most heavy-lifting model prediction component to TensorFlow Serving, which is a well supported service by Google and integrates with Shopify’s existing cloud infrastructure easily. This ease of use allowed us to focus on defining and implementing the core business logic to generate marketing recommendation in the web container.

    Moving to machine learning driven implementation has been proven valuable. One third of the marketing campaigns in Kit are powered by machine learning driven recommendations. Kit will continue to improve its marketing automation skills by optimizing for different marketing tactics and objectives in order to support their diverse needs.

    Disclaimer: Kit was a free virtual employee that helped Shopify merchants with marketing. As of September 1st, 2021, Shopify no longer actively maintains Kit.

    We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Data Science & Engineering career page.

    Continue reading

    Start your free 14-day trial of Shopify