Stop trying to do it all alone, add Kit to your team. Learn more.
How I Define My Boundaries to Prevent Burnout

How I Define My Boundaries to Prevent Burnout

There’s a way to build a model for your life and the other demands on you. It doesn’t have to be 100 hours a week. It might not even be 40 hours a week. Whatever sustainable model you come up with, prioritize fiercely and communicate expectations accordingly. I’ve worked this way for three decades through different phases of my life. The hours have changed, but the basic model and principles remain the same: Define the time, prioritize the work, and communicate those two effectively.

Continue reading

A Five-Step Guide for Conducting Exploratory Data Analysis

A Five-Step Guide for Conducting Exploratory Data Analysis

Have you ever been handed a dataset and then been asked to describe it? When I was first starting out in data science, this question confused me. My first thought was “What do you mean?” followed by “Can you be more specific?” The reality is that exploratory data analysis (EDA) is a critical tool in every data scientist’s kit, and the results are invaluable for answering important business questions.

Simply put, an EDA refers to performing visualizations and identifying significant patterns, such as correlated features, missing data, and outliers. EDA’s are also essential for providing hypotheses for why these patterns occur. It most likely won’t appear in your data product, data highlight, or dashboard, but it will help to inform all of these things.

Below, I’ll walk you through key tips for performing an effective EDA. For the more seasoned data scientists, the contents of this post may not come as a surprise (rather a good reminder!), but for the new data scientists, I’ll provide a potential framework that can help you get started on your journey. We’ll use a synthetic dataset for illustrative purposes. The data below does not reflect actual data from Shopify merchants. As we go step-by-step, I encourage you to take notes as you progress through each section!

Before You Start

Before you start exploring the data, you should try to understand the data at a high level. Speak to leadership and product to try to gain as much context as possible to help inform where to focus your efforts. Are you interested in performing a prediction task? Is the task purely for exploratory purposes? Depending on the intended outcome, you might point out very different things in your EDA.

With that context, it’s now time to look at your dataset. It’s important to identify how many samples (rows) and how many features (columns) are in your dataset. The size of your data helps inform any computational bottlenecks that may occur down the road. For instance, computing a correlation matrix on large datasets can take quite a bit of time. If your dataset is too big to work within a Jupyter notebook, I suggest subsampling so you have something that represents your data, but isn’t too big to work with.

The first 5 rows of our synthetic dataset. The dataset above does not reflect actual data from Shopify merchants.
The first 5 rows of our synthetic dataset. The dataset above does not reflect actual data from Shopify merchants.

Once you have your data in a suitable working environment, it’s usually a good idea to look at the first couple rows. The above image shows an example dataset we can use for our EDA. This dataset is used to analyze merchant behaviour. Here are a few details about the features:

  • Shop Cohort: the month and year a merchant joined Shopify
  • GMV (Gross Merchandise Volume): total value of merchandise sold.
  • AOV (Average Order Value): the average value of customers' orders since their first order.
  • Conversion Rate: the percentage of sessions that resulted in a purchase.
  • Sessions: the total number of sessions on your online store.
  • Fulfilled Orders: the number of orders that have been packaged and shipped.
  • Delivered Orders: the number of orders that have been received by the customer.

One question to address is “What is the unique identifier of each row in the data?” A unique identifier can be a column or set of columns that is guaranteed to be unique across rows in your dataset. This is key for distinguishing rows and referencing them in our EDA.

Now, if you’ve been taking notes, here’s what they may look like so far:

  • The dataset is about merchant behaviour. It consists of historic information about a set of merchants collected on a daily basis
  • The dataset contains 1500 samples and 13 features. This is a reasonable size and will allow me to work in Jupyter notebooks
  • Each row of my data is uniquely identified by the “Snapshot Date” and “Shop ID” columns. In other words, my data contains one row per shop per day
  • There are 100 unique Shop IDs and 15 unique Snapshot Dates
  • Snapshot Dates range from ‘2020-01-01’ to ‘2020-01-15’ for a total of 15 days

1. Check For Missing Data

Now that we’ve decided how we’re going to work with the data, we begin to look at the data itself. Checking your data for missing values is usually a good place to start. For this analysis, and future analysis, I suggest analyzing features one at a time and ranking them with respect to your specific analysis. For example, if we look at the below missing values analysis, we’d simply count the number of missing values for each feature, and then rank the features by largest amount of missing values to smallest. This is especially useful if there are a large amount of features.

Feature ranking by missing value counts
Feature ranking by missing value counts

Let’s look at an example of something that might occur in your data. Suppose a feature has 70 percent of its values missing. As a result of such a high amount of missing data, some may suggest to just remove this feature entirely from the data. However, before we do anything, we try to understand what this feature represents and why we’re seeing this behaviour. After further analysis, we may discover that this feature represents a response to a survey question, and in most cases it was left blank. A possible hypothesis is that a large proportion of the population didn’t feel comfortable providing an answer. If we simply remove this feature, we introduce bias into our data. Therefore, this missing data is a feature in its own right and should be treated as such.

Now for each feature, I suggest trying to understand why the data is missing and what it can mean. Unfortunately, this isn’t always so simple and an answer might not exist. That’s why an entire area of statistics, Imputation, is devoted to this problem and offers several solutions. What approach you choose depends entirely on the type of data. For time series data without seasonality or trend, you can replace missing values with the mean or median. If the time series does contain a trend but not seasonality, then you can apply a linear interpolation. If it contains both, then you should adjust for the seasonality and then apply a linear interpolation. In the survey example I discussed above, I handle missing values by creating a new category “Not Answered” for the survey question feature. I won’t go into detail about all the various methods here, however, I suggest reading How to Handle Missing Data for more details on Imputation.

Great! We’ve now identified the missing values in our data—let’s update our summary notes:

  • ...
  • 10 features contain missing values
  • “Fulfilled Orders” contains the most missing values at 8% and “Shop Currency” contains the least at 6%

2. Provide Basic Descriptions of Your Sample and Features

At this point in our EDA, we’ve identified features with missing values, but we still know very little about our data. So let’s try to fill in some of the blanks. Let’s categorize our features as either:

Continuous: A feature that is continuous can assume an infinite number of values in a given range. An example of a continuous feature is a merchant’s Gross Merchandise Value (GMV).

Discrete: A feature that is discrete can assume a countable number of values and is always numeric. An example of a discrete feature is a merchant’s Sessions.

Categorical: A feature that is discrete can only assume a finite number of values. An example of a discrete feature is a merchant’s Shopify plan type.

The goal is to classify all your features into one of these three categories.

GMV AOV Conversion Rate
62894.85 628.95 0.17
NaN 1390.86 0.07
25890.06  258.90 0.04
6446.36 64.46 0.20
47432.44 258.90 0.10

 Example of continuous features

Sessions Products Sold Fulfilled Orders
11 52 108
119 46 96
182 47 98
147 44 99
45 65 125

 Example of discrete features

Shop Cohort
Advanced Shopify
Advanced Shopify
Advanced Shopify

 Example of categorical features

You might be asking yourself, how does classifying features help us? This categorization helps us decide what visualizations to choose in our EDA, and what statistical methods we can apply to this data. Some visualizations won’t work on all continuous, discrete, and categorical features. This means we have to treat groups of each type of feature differently. We will see how this works in later sections.

Let’s focus on continuous features first. Record any characteristics that you think are important, such as the maximum and minimum values for that feature. Do the same thing for discrete features. For categorical features, some things I like to check for are the number of unique values and the number of occurrences of each unique value. Let’s add our findings to our summary notes:

  •  ...
  • There are 3 continuous features, 4 discrete features, and 4 categorical features
  • GMV:
    • Continuous Feature
    •  Values are between $12.07 and $814468.03
    • Data is missing for one day…I should check to make sure this isn’t a data collection error”
  •  Plan:
    • Categorical Feature
    • Assumes four values: “Basic Shopify”, “Shopify”, “Advanced Shopify”, and “Plus”.
    • The value counts of “Basic Shopify”, “Shopify”, “Advanced Shopify”, and “Plus” are 255, 420, 450, and 375, respectively
    • There seems to be more merchants on the “Advanced Shopify” plan than on the “Basic” plan. Does this make sense?

3. Identify The Shape of Your Data

The shape of a feature can tell you so much about it. What do I mean by shape? I’m referring to the distribution of your data, and how it can change over time. Let’s plot a few features from our dataset:

GMV and Sessions behaviour across samples
GMV and Sessions behaviour across samples

If the dataset is a time series, then we investigate how the feature changes over time. Perhaps there’s a seasonality to the feature or a positive/negative linear trend over time. These are all important things to consider in your EDA. In the graphs above, we can see that AOV and Sessions have positive linear trends and Sessions emits a seasonality (a distinct behaviour that occurs in intervals). Recall that Snapshot Data and Shop ID uniquely define our data, so the seasonality we observe can be due to particular shops having more sessions than other shops in the data. In the line graph below, we see that the Sessions seasonality was a result of two specific shops: Shop 1 and Shop 51. Perhaps these shops have a higher GMV or AOV?

In line graph below we have Snapshot date on the x-axis and we see that the Sessions (y-axis) seasonality was a result of two specific shops: Shop 1 and Shop 51

Next, you’ll calculate the mean and variance of each feature. Does the feature hardly change at all? Is it constantly changing? Try to hypothesize about the behaviour you see. A feature that has a very low or very high variance may require additional investigation.

Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are your friends. To understand the shape of your features, PMFs are used for discrete features and PDFs for continuous features.

Example of feature density functions
Example of feature density functions

Here a few things that PMFs and PDFs can tell you about your data:

  • Skewness
  • Is the feature heterogeneous (multimodal)?
  • If the PDF has a gap in it, the feature may be disconnected.
  • Is it bounded?

We can see from the example feature density functions above that all three features are skewed. Skewness measures the asymmetry of your data. This might deter us from using the mean as a measure of central tendency. The median is more robust, but it comes with additional computational cost.

Overall, there are a lot of things you can consider when visualizing the distribution of your data. For more great ideas, I recommend reading Exploratory Data Analysis. Don’t forget to update your summary notes:

  • ...
  • AOV:
    • Continuous Feature
    • Values are between $38.69 and $8994.58
    • 8% of values missing
    • Observe a large skewness in the data.
    • Observe a positive linear trend across samples
  • Sessions:
    • Discrete Feature
    • Contains count data (can assume non-negative integer data)
    • Values are between 2 and 2257
    • 7.7% of values missing
    • Observe a large skewness in the data
    • Observe a positive linear trend across samples
    • Shops 1 and 51 have larger daily sessions and show a more rapid growth of sessions compared to other shops in the data
  • Conversion Rate:
    • Continuous Feature
    • Values are between 0.0 and 0.86
    • 7.7% of values missing
    • Observe a large skewness in the data

4. Identify Significant Correlations

Correlation measures the relationship between two variable quantities. Suppose we focus on the correlation between two continuous features: Delivered Orders and Fulfilled Orders. The easiest way to visualize correlation is by plotting a scatter plot with Delivered Orders on the y axis and Fulfilled Orders on the x axis. As expected, there’s a positive relationship between these two features.

Scatter plot showing positive correlation between features “Delivered Orders” and “Fulfilled Orders”
Scatter plot showing positive correlation between features “Delivered Orders” and “Fulfilled Orders”

If you have a high number of features in your dataset, then you can’t create this plot for all of them—it takes too long. So, I recommend computing the Pearson correlation matrix for your dataset. It measures the linear correlation between features in your dataset and assigns a value between -1 and 1 to each feature pair. A positive value indicates a positive relationship and a negative value indicates a negative relationship.

Correlation matrix for continuous and discrete features
Correlation matrix for continuous and discrete features

It’s important to take note of all significant correlation between features. It’s possible that you might observe many relationships between features in your dataset, but you might also observe very little. Every dataset is different! Try to form hypotheses around why features might be correlated with each other.

In the correlation matrix above, we see a few interesting things. First of all, we see a large positive correlation between Fulfilled Orders and Delivered Orders, and between GMV and AOV. We also see a slight positive correlation between Sessions, GMV, and AOV. These are significant patterns that you should take note of.

Since our data is a time series, we also look at the autocorrelation of shops. Computing the autocorrelation reveals relationships between a signal’s current value and its previous values. For instance, there could be a positive correlation between a shop’s GMV today and its GMV from 7 days ago. In other words, customers like to shop more on Saturdays compared to Mondays. I won’t go into any more detail here since Time Series Analysis is a very large area of statistics, but I suggest reading A Very Short Course on Time Series Analysis.

Shop Cohort 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
Adv. Shopify 71 102 27 71 87 73
Basic Shopify 45 44 42 59 0 57
Plus 29 55 69 44 71 86
Shopify 53 72 108 72 60 28

Contingency table for discrete features “Shop Cohort” an “Plan”

The methodology outlined above only applies to continuous and discrete features. There are a few ways to compute correlation between categorical features, however, I’ll only discuss one, the Pearson chi-square test. This involves taking pairs of discrete features and computing their contingency table. Each cell in the contingency table shows the frequency of observations. In the Pearson chi-square test, the null hypothesis is that the categorical variables in question are independent and, therefore, not related. In the table above, we compute the contingency table for two categorical features from our dataset: Shop Cohort and Plan. After that, we perform a hypothesis test using the chi-square distribution with a specified alpha level of significance. We then determine whether the categorical features are independent or dependent.

5. Spot Outliers in the Dataset

Last, but certainly not least, spotting outliers in your dataset is a crucial step in EDA. Outliers are significantly different from other samples in your dataset and can lead to major problems when performing statistical tasks following your EDA. There are many reasons why an outlier might occur. Perhaps there was a measurement error for that sample and feature, but in many cases outliers occur naturally.

Continuous feature box plots
Continuous feature box plots

The box plot visualization is extremely useful for identifying outliers. In the above figure, we observe that all features contain quite a few outliers because we see data points that are distant from the majority of the data.

There are many ways to identify outliers. It doesn’t make sense to plot all of our features one at a time to spot outliers, so how do we systematically accomplish this? One way is to compute the 1st and 99th percentile for each feature, then classify data points that are greater than the 99th percentile or less than the 1st percentile. For each feature, count the number of outliers, then rank them from most outliers to least outliers. Focus on the features that have the most outliers and try to understand why that might be. Take note of your findings.

Unfortunately, the aforementioned approach doesn’t work for discrete features since there needs to be an ordering to compute percentiles. An outlier can mean many things. Suppose our discrete feature can assume one of three values: apple, orange, or pear. For 99 percent of samples, the value is either apple or orange, and only 1 percent for pear. This is one way we might classify an outlier for this feature. For more advanced methods on detecting anomalies in categorical data, check out Outlier Analysis.

What’s Next?

We’re at the finish line and completed our EDA. Let’s review the main takeaways:

  • Missing values can plague your data. Make sure to understand why they are there and how you plan to deal with them.
  • Provide a basic description of your features and categorize them. This will drastically change the visualizations you use and the statistical methods you apply.
  • Understand your data by visualizing its distribution. You never know what you will find! Get comfortable with how your data changes across samples and over time.
  • Your features have relationships! Make note of them. These relationships can come in handy down the road.
  • Outliers can dampen your fun only if you don’t know about them. Make known the unknowns!

But what do we do next? Well that all depends on the problem. It’s useful to present a summary of your findings to leadership and product. By performing an EDA, you might answer some of those crucial business questions we alluded to earlier. Going forward, does your team want to perform a regression or classification on the dataset? Do they want it to power a KPI dashboard? So many wonderful options and opportunities to explore!

It’s important to note that an EDA is a very large area of focus. The steps I suggest are by no means exhaustive and if I had the time I could write a book on the subject! In this post, I share some of the most common approaches, but there’s so much more that can be added to your own EDA.

If you’re passionate about data at scale, and you’re eager to learn more, we’re hiring! Reach out to us or apply on our careers page.

Cody Mazza-Anthony is a Data Scientist on the Insights team. He is currently working on building a Product Analytics framework for Insights. Cody enjoys building intelligent systems to enable merchants to grow their business. If you’d like to connect with Cody, reach out on LinkedIn.

Continue reading

Dynamic ProxySQL Query Rules

Dynamic ProxySQL Query Rules

ProxySQL comes with a powerful feature called query rules. The main use of these rules at Shopify is to reroute, rewrite, or reject queries matching a specified regex. However, with great power comes great responsibility. These rules are powerful and can have unexpected consequences if used incorrectly. At Shopify’s scale, we’re running thousands of ProxySQL instances, so applying query rules to each one is a painful and time consuming process, especially during an incident. We’ve built a tool to help us address these challenges and make deploying new rules safe and scalable.

Continue reading

How Shopify Dynamically Routes Storefront Traffic

How Shopify Dynamically Routes Storefront Traffic

In 2019 we set out to rewrite the Shopify storefront implementation. Our goal was to make it faster. We talked about the strategies we used to achieve that goal in a previous post about optimizing server-side rendering and implementing efficient caching. To build on that post, I’ll go into detail on how the Storefront Renderer team tweaked our load balancers to shift traffic between the legacy storefront and the new storefront.

First, let's take a look at the technologies we used. For our load balancer, we’re running nginx with OpenResty. We previously discussed how scriptable load balancers are our secret weapon for surviving spikes of high traffic. We built our storefront verification system with Lua modules and used that system to ensure our new storefront achieved parity with our legacy storefront. The system to permanently shift traffic to the new storefront, once parity was achieved, was also built with Lua. Our chatbot, spy, is our front end for interacting with the load balancers via our control plane.

At the beginning of the project, we predicted the need to constantly update which requests were supported by the new storefront as we continued to migrate features. We decided to build a rule system that allows us to add new routing rules easily.

Starting out, we kept the rules in a Lua file in our nginx repository, and kept the enabled/disabled status in our control plane. This allowed us to quickly disable a rule without having to wait for a deploy if something went wrong. It proved successful, and at this point in the project, enabling and disabling rules was a breeze. However, our workflow for changing the rules was cumbersome, and we wanted this process to be even faster. We decided to store the whole rule as a JSON payload in our control plane. We used spy to create, update, and delete rules in addition to the previous functionality of enabling and disabling the rules. We only needed to deploy nginx to add new functionality.

The Power of Dynamic Rules

Fast continuous integration (CI) time and deployments are great ways to increase the velocity of getting changes into production. However, for time-critical use cases like ours removing the CI time and deployment altogether is even better. Moving the rule system into the control plane and using spy to manipulate the rules removed the entire CI and deployment process. We still require a “code review” on enabled spy commands or before enabling a new command, but that’s a trivial amount of time compared to the full deploy process used prior.

Before diving into the different options available for configuration, let’s look at what it’s like to create a rule with spy. Below are three images showing creating a rule, inspecting it, and then deleting it. The rule was never enabled, as it was an example, but that process requires getting approval from another member of the team. We are affecting a large share of real traffic on the Shopify platform when running the command spy storefront_renderer enable example-rule, so the rules to good code reviews still apply.

An example of how to create a
routing rule with spy via slack.
Adding a rule with spy
An example of how to describe an
existing rule with spy via slack.
Displaying a rule with spy
An example of how to describe an
existing rule with spy via slack.
Removing a rule with spy

Configuring New Rules

Now let’s review the different options available when creating new rules.

Option Name
Description Default  Example
The identifier for the rule.
A comma-separated list of filters.
A comma-separated list of shop ids to which the rule applies.

The rule_name is the identifier we use. It can be any string, but it’s usually descriptive of the type of request it matches.

The shop_ids option lets us choose to have a rule target all shops or target a specific shop for testing. For example, test shops allow us to test changes without affecting real production data. This is useful to test rendering live requests with the new storefront because verification requests happen in the background and don’t directly affect client requests.

Next, the filters option determines which requests would match that rule. This allows us to partition the traffic into smaller subsets and target individual controller actions from our legacy Ruby on Rails implementation. A change to the filters list does require us to go through the full deployment process. They are kept in a Lua file, and the filters option is a comma-separated list of function names to apply to the request in a functional style. If all filters return true, then the rule will match that request.

Above is an example of a filter, is_product_list_path, that lets us target HTTP GET requests to the storefront products JSON API implemented in Lua.

Option Name
The rate at which we render allowed requests.
The rate at which we verify requests.
The rate at which requests are reverse-verified when rendering from the new storefront.

Both render_rate and verify_rate allow us to target a percentage of traffic that matches a rule. This is useful for doing gradual rollouts of rendering a new endpoint or verifying a small sample of production traffic.

The reverse_verify_rate is the rate used when a request is already being rendered with the new storefront. It lets us first render the request with the new storefront and then sends the request to the legacy implementation asynchronously for verification. We call this scenario a reverse-verification, as it’s the opposite or reverse of the original flow where the request was rendered by the legacy storefront then sent to the new storefront for verification. We call the opposite flow forward-verification. We use forward-verification to find issues as we implement new endpoints and reverse-verifications to help detect and track down regressions.

Option Name





The rate at which we verify requests in the nearest region.




Now is a good time to introduce self-verification and the associated self_verify_rate. One limitation of the legacy storefront implementation was due to how our architecture for a Shopify pod meant that only one region had access to the MySQL writer at any given time. This meant that all requests had to go to the active region of a pod. With the new storefront, we decoupled the storefront rendering service from the database writer and now serve storefront requests from any region where a MySQL replica is present.

However, as we started decoupling dependencies on the active region, we found ourselves wanting to verify requests not only against the legacy storefront but also against the active and passive regions with the new storefront. This led us to add the self_verify_rate that allows us to sample requests bound for the active region to be verified against the storefront deployment in the local region.

We have found the routing rules flexible, and it made it easy to add new features or prototypes that are usually quite difficult to roll out. You might be familiar with how we generate load for testing the system's limits. However, these load tests will often fall victim to our load shedder if the system gets overwhelmed. In this case, we drop any request coming from our load generator to avoid negatively affecting a real client experience. Before BFCM 2020 we wondered how the application behaved if the connections to our dependencies, primarily Redis, went down. We wanted to be as resilient as possible to those types of failures. This isn’t quite the same as testing with a load generation tool because these tests could affect real traffic. The team decided to stand up a whole new storefront deployment, and instead of routing any traffic to it, we used the verifier mechanism to send duplicate requests to it. We then disconnected the new deployment from Redis and turned our load generator on max. Now we had data about how the application performed under partial outages and were able to dig into and improve resiliency of the system before BFCM. These are just some of the ways we leveraged our flexible routing system to quickly and transparently change the underlying storefront implementation.


I’d like to walk you through the main entry point for the storefront Lua module to show more of the technical implementation. First, here is a diagram showing where each nginx directive is executed during the request processing.

A flow chart showing the order
different Lua callbacks are run in the nginx request lifecycle.
Order in which nginx directives are run - source:

During the rewrite phase, before the request is proxied upstream to the rendering service, we check the routing rules to determine which storefront implementation the request should be routed to. After the check during the header filter phase, we check if the request should be forward-verified (if the request went to the legacy storefront) or reverse-verified (if it went to the new storefront). Finally, if we’re verifying the request (regardless of forward or reverse) in the log phase, we queue a copy of the original request to be made to the opposite upstream after the request cycle has completed.

In the above code snippet, the renderer module referenced in the rewrite phase and the header filter phase and the verifier module reference in the header filter phase and log phase, use the same function find_matching_rule from the storefront rules module below to get the matching rule from the control plane. The routing_method parameter is passed in to determine whether we’re looking for a rule to match for rendering or for verifying the current request.

Lastly, the verifier module uses nginx timers to send the verification request out of band of the original request so we don’t leave the client waiting for both upstreams. The send_verification_request_in_background function shown below is responsible for queuing the verification request to be sent. To duplicate the request and verify it, we need to keep track of the original request arguments and the response state from either the legacy storefront (in the case of a forward verification request) or the new storefront (in the case of a reverse verification request). This will then pass them as arguments to the timer since we won’t have access to this information in the context of the timer.

The Future of Routing Rules

At this point, we're starting to simplify this system because the new storefront implementation is serving almost all storefront traffic. We’ll no longer need to verify or render traffic with the legacy storefront implementation once the migration is complete, so we'll be undoing the work we’ve done and going back to the hardcoded rules approach of the early days of the project. Although that doesn’t mean the routing rules are going away completely, the flexibility provided by the routing rules allowed us to build the verification system and stand up a separate deployment for load testing. These features weren’t possible before with the legacy storefront implementation. While we won’t be changing the routing between storefront implementations, the rule system will evolve to support new features.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Derek Stride is a Senior Developer on the Storefront Renderer team. He joined Shopify in 2016 as an intern and transitioned to a full-time role in 2018. He has worked on multiple projects and has been involved with the Storefront Renderer rewrite since the project began in 2019.

Continue reading

Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms

Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms

By Jodi Sloan and Selina Li

Over 2 million users visit Shopify’s Help Center every month to find help solving a problem. They come to the Help Center with different motives: learn how to set up their store, find tips on how to market, or get advice on troubleshooting a technical issue. Our search product helps users narrow down what they’re looking for by surfacing what’s most relevant for them. Algorithms empower search products to surface the most suitable results, but how do you know if they’re succeeding at this mission?

Below, we’ll share the three-step framework we built for evaluating new search algorithms. From collecting data using Kafka and annotation, to conducting offline and online A/B tests, we’ll share how we measure the effectiveness of a search algorithm.

The Challenge

Search is a difficult product to build. When you input a search query, the search product sifts through its entire collection of documents to find suitable matches. This is no easy feat as a search product’s collection and matching result set might be extensive. For example, within the Shopify Help Center’s search product lives thousands of articles, and a search for “shipping” could return hundreds. 

We use an algorithm we call Vanilla Pagerank to power our search. It boosts results by the total number of views an article has across all search queries. The problem is that it also surfaces non-relevant results. For example, if you conducted a search for “delete discounts” the algorithm may prefer results with the keywords “delete” or “discounts”, but not necessarily results on “deleting discounts”. 

We’re always trying to improve our users’s experience by making our search algorithms smarter. That’s why we built a new algorithm, Query-specific Pagerank, which aims to boost results with high click frequencies (a popularity metric) from historic searches containing the search term. It basically boosts the most frequently clicked-on results from similar searches. 

The challenge is any change to the search algorithms might improve the results for some queries, but worsen others. So to have confidence in our new algorithm, we use data to evaluate its impact and performance against the existing algorithm. We implemented a simple three-step framework for evaluating experimental search algorithms.

1. Collect Data

Before we evaluate our search algorithms, we need a “ground truth” dataset telling us which articles are relevant for various intents. For Shopify’s Help Center search, we use two sources: Kafka events and annotated datasets.

Events-based Data with Kafka

Users interact with search products by entering queries, clicking on results, or leaving feedback. By using a messaging system like Kafka, we collect all live user interactions in schematized streams and model these events in an ETL pipeline. This culminates in a search fact table that aggregates facts about a search session based on behaviours we’re interested in analyzing.

A simplified model of a search fact generated by piecing together raw Kafka events. The model shows that 3 Kafka events make up a Search fact: 1. search query, 2. search result click, and 3. contacted support.
A simplified model of a search fact generated by piecing together raw Kafka events

With data generated from Kafka events, we can continuously evaluate our online metrics and see near real-time change. This helps us to:

  1. Monitor product changes that may be having an adverse impact on the user experience.
  2. Assign users to A/B tests in real time. We can set some users’s experiences to be powered by search algorithm A and others by algorithm B.
  3. Ingest interactions in a streaming pipeline for real-time feedback to the search product.


Annotation is a powerful tool for generating labeled datasets. To ensure high-quality datasets within the Shopify Help Center, we leverage the expertise of the Shopify Support team to annotate the relevance of our search results to the input query. Manual annotation can provide high-quality and trustworthy labels, but can be an expensive and slow process, and may not be feasible for all search problems. Alternatively, click models can build judgments using data from user interactions. However, for the Shopify Help Center, human judgments are preferred since we value the expert ratings of our experienced Support team.

The process for annotation is simple:

  1. A query is paired with a document or a result set that might represent a relative match.
  2. An annotator judges the relevance of the document to the question and assigns the pair a rating.
  3. The labels are combined with the inputs to produce a labelled dataset.
A diagram showing the 3 steps in the annotation process
Annotation Process

There are numerous ways we annotate search results:

  • Binary ratings: Given a user’s query and a document, an annotator can answer the question Is this document relevant to the query? by providing a binary answer (1 = yes, 0 = no).
  • Scale ratings: Given a user’s query and a document, a rater can answer the question How relevant is this document to the query? on a scale (1 to 5, where 5 is the most relevant). This provides interval data that you can turn into categories, where a rating of 4 or 5 represents a hit and anything lower represents a miss.
  • Ranked lists: Given a user’s query and a set of documents, a rater can answer the question Rank the documents from most relevant to least relevant to the query. This provides ranked data, where each query-document pair has an associated rank.

The design of your annotation options depends on the evaluation measures you want to employ. We used Scale Ratings with a worded list (bad, ok, good, and great) that provides explicit descriptions on each label to simplify human judgment. These labels are then quantified and used for calculating performance scores. For example, when conducting a search for “shipping”, our Query-specific Pagerank algorithm may return documents that are more relevant to the query where the majority of them are labelled “great” or “good”.

One thing to keep in mind with annotation is dataset staleness. Our document set in the Shopify Help Center changes rapidly, so datasets can become stale over time. If your search product is similar, we recommend re-running annotation projects on a regular basis or augmenting existing datasets that are used for search algorithm training with unseen data points.

2. Evaluate Offline Metrics

After collecting our data, we wanted to know whether our new search algorithm, Query-specific Pagerank, had a positive impact on our ranking effectiveness without impacting our users. We did this by running an offline evaluation that uses relevance ratings from curated datasets to judge the effectiveness of search results before launching potentially-risky changes to production. By using offline metrics, we run thousands of historical search queries against our algorithm variants and use the labels to evaluate statistical measures that tell us how well our algorithm performs. This enables us to iterate quickly since we simulate thousands of searches per minute.

There are a number of measures we can use, but we’ve found Mean Average Precision and Normalized Discounted Cumulative Gain to be the most useful and approachable for our stakeholders.

Mean Average Precision

Mean Average Precision (MAP) measures the relevance of each item in the results list to the user’s query with a specific cutoff N. As queries can return thousands of results, and only a few users will read all of them, only top N returned results need to be examined. The top N number is usually chosen arbitrarily or based on the number of paginated results. Precision@N is the percentage of relevant items among the first N recommendations. MAP is calculated by averaging the AP scores for each query in our dataset. The result is a measure that penalizes returning irrelevant documents before relevant ones. Here is an example of how MAP is calculated:

An example of how MAP is calculated for an algorithm, given 2 search queries as inputs
Example of MAP Calculation

Normalized Discounted Cumulative Gain

To compute MAP, you need to classify if a document is a good recommendation by determining a cutoff score. For example, document and search query pairs that have ok, good, and great labels (that is scores greater than 0) will be categorized as relevant. But the differences in the relevancy of ok and great pairs will be neglected.

Discounted Cumulative Gain (DCG) addresses this drawback by maintaining the non-binary score while adding a logarithmic reduction factor—the denominator of the DCG function to penalize the recommended items with lower positions in the list. DCG is a measure of ranking quality that takes into account the position of documents in the results set.

An example of calculating and comparing DCG of two search algorithms. The scale ratings of each query and document pair are determined by annotators
Example of DCG Calculation

One issue with DCG is that the length of search results differ depending on the query provided. This is problematic because the more results a query set has, the better the DCG score, even if the ranking doesn’t make sense. Normalized Discounted Cumulative Gain Scores (NDCG) solves this problem. NDCG is calculated by dividing DCG by the maximum possible DCG score—the score calculated from the sorted search results.

Comparing the numeric values of offline metrics is great when the differences are significant. The higher value, the more successful the algorithm. However, this only tells us the ending of the story. When the results aren’t significantly different, the insights we get from the metrics comparison are limited. Therefore, understanding how we got to the ending is also important for future model improvements and developments. You gather these insights by breaking down the composition of queries to look at:

  • Frequency: How often do our algorithms return worse results than annotation data?
  • Velocity: How far off are our rankings in different algorithms?
  • Commonalities: Understanding the queries that consistently have positive impacts on the algorithm performance, and finding the commonality among these queries, can help you determine the limitations on an algorithm.

Offline Metrics in Action

We conducted a deep dive analysis on evaluating offline metrics using MAP and NDCG to assess the success of our new Query-specific Pagerank algorithm. We found that our new algorithm returned higher scored documents more frequently and had slightly better scores in both offline metrics.

3. Evaluate Online Metrics

Next, we wanted to see how our users interact with our algorithms by looking at online metrics. Online metrics use search logs to determine how real search events perform. They’re based on understanding users’ behaviour with the search product and are commonly used to evaluate A/B tests.

The metrics you choose to evaluate the success of your algorithms depends on the goals of your product. Since the Shopify Help Center aims to provide relevant information to users looking for assistance with our platform, metrics that determine success include:

  • How often users interact with search results
  • How far they had to scroll to find the best result for their needs
  • Whether they had to contact our Support team to solve their problem.

When running an A/B test, we need to define these measures and determine how we expect the new algorithm to move the needle. Below are some common online metrics, as well as product success metrics, that we used to evaluate how search events performed:

  • Click-through rate (CTR): The portion of users who clicked on a search result when surfaced. Relevant results should be clicked, so we want a high CTR.
  • Average rank: The average rank of clicked results. Since we aim for the most relevant results to be surfaced first, and clicked, we want to have a low average rank.
  • Abandonment:When a user has intent to find help but they didn’t interact with search results, didn’t contact us, and wasn’t seen again. We always expect some level of abandonment due to bots and spam messages (yes, search products get spam too!), so we want this to be moderately low.
  • Deflection: Our search is a success when users don’t have to contact support to find what they’re looking for. In other words, the user was deflected from contacting us. We want this metric to be high, but in reality we want people to contact us when it’s what is best for their situation, so deflection is a bit of a nuanced metric.

We use the Kafka data collected in our first step to calculate these metrics and see how successful our search product is over time, across user segments, or between different search topics. For example, we study CTR and deflection rates between users in different languages. We also use A/B tests to assign users to different versions of our product to see if our new version significantly outperforms the old.

Flow of Evaluation Online Metrics
Flow of Evaluation Online Metrics

A/B testing search is similar to other A/B tests you may be familiar with. When a user visits the Help Center, they are assigned to one of our experiment groups, which determines which algorithm their subsequent searches will be powered by. Over many visits, we can evaluate our metrics to determine which algorithm outperforms the other (for example, whether algorithm A has a significantly higher click-through rate with search results than algorithm B).

Online Metrics in Action

We conducted an online A/B test to see how our new Query-specific Pagerank algorithm measured against our existing algorithm, with half of our search users assigned to group A (powered by Vanilla Pagerank) and half assigned to group B (powered by Query-specific Pagerank). Our results showed that users in group B were:

  • Less likely to click past the first page of results
  • Less likely to conduct a follow-up search
  • More likely to click results
  • More likely to click the first result shown
  • Had a lower average rank of clicked results

Essentially, group B users were able to find helpful articles with less effort when compared to group A users.

The Results

After using our evaluation framework to measure our new algorithm against our existing algorithm, it was clear that our new algorithm outperformed the former in a way that was meaningful for our product. Our metrics showed our experiment was a success, and we were able to replace our Vanilla Pagerank algorithm with our new and improved Query-specific Pagerank algorithm.

If you’re using this framework to evaluate your search algorithms, it’s important to note that even a failed experiment can help you learn and identify new opportunities. Did your offline or online testing show a decrease in performance? Is a certain subset of queries driving the performance down? Are some users better served by the changes than other segments? However your analysis points, don’t be afraid to dive deeper to understand what’s happening. You’ll want to document your findings for future iterations.

Key Takeaways for Evaluating a Search Algorithm

Algorithms are the secret sauce of any search product. The efficacy of a search algorithm has a huge impact on a users’ experience, so having the right process to evaluate a search algorithm’s performance ensures you’re making confident decisions with your users in mind.

To quickly summarize, the most important takeaways are:

  • A high-quality and reliable labelled dataset is key for a successful, unbiased evaluation of a search algorithm.
  • Online metrics provide valuable insights on user behaviours, in addition to algorithm evaluation, even if they’re resource-intensive and risky to implement.
  • Offline metrics are helpful for iterating on new algorithms quickly and mitigating the risks of launching new algorithms into production.

Jodi Sloan is a Senior Engineer on the Discovery Experiences team. Over the past 2 years, she has used her passion for data science and engineering to craft performant, intelligent search experiences across Shopify. If you’d like to connect with Jodi, reach out on LinkedIn.

Selina Li is a Data Scientist on the Messaging team. She is currently working to build intelligence in conversation and improve merchant experiences in chat. Previously, she was with the Self Help team where she contributed to deliver better search experiences for users in Shopify Help Center and Concierge. If you would like to connect with Selina, reach out on Linkedin.

If you’re a data scientist or engineer who’s passionate about building great search products, we’re hiring! Reach out to us or apply on our careers page.

Continue reading

How to Build a Web App with and without Rails Libraries

How to Build a Web App with and without Rails Libraries

How would we build a web application only using the standard Ruby libraries? In this article, I’ll break down the key foundational concepts of how a web application works while building one from the ground up. If we can build a web application only using Ruby libraries, why would we need web server interfaces like Rack and web applications like Ruby on Rails? By the end of this article, you’ll gain a new appreciation for Rails and its magic.

Continue reading

Remove Circular Dependencies by Using the Repository Pattern in Ruby

Remove Circular Dependencies by Using the Repository Pattern in Ruby

There are dependencies between gems and the platforms that use them. In scenarios where the platforms have the data and the gem has the knowledge, there is a direct circular dependency between the two and both need to talk to each other. I’ll show you how we used the Repository pattern in Ruby to remove that circular dependency and help us make gems thin and stateless. Plus, I’ll show you how using Sorbet in the implementation made our code typed and cleaner.

Continue reading

Updates on Shopify’s Bug Bounty Program

Updates on Shopify’s Bug Bounty Program

For three years we, Shopify’s Application Security team, have set aside time to reflect on our bug bounty program and share recent insights. This past year has been quite a ride, as our program has been busier than ever! We’re excited to share what we have learned, and share some of the great things we have planned!

Continue reading

4 Tips for Shipping Data Products Fast

4 Tips for Shipping Data Products Fast

Shipping a new product is hard. Doing so under tight time constraints is even harder. It’s no different for data-centric products. Whether it’s a forecast, a classification tool, or a dashboard, you may find yourself in a situation where you need to ship a new data product in a seemingly impossible timeline. 

Shopify’s Data Science team has certainly found itself in this situation more than a few times over the years. Our team focuses on creating data products that support our merchants’ entrepreneurial journeys, from their initial interaction with Shopify, to their first sale, and throughout their growth journey on the platform. Commerce is a fast changing industry, which means we have to build and ship fast to ensure we’re providing our merchants with the best tools to help them succeed.

Along the way, our team learned a few key lessons for shipping data products quickly, while maintaining focus and getting things done efficiently—but also done right. Below are four tips that are proven to help you ship new products fast. 

1. Utilize Design Sprints 

Investing time in a design sprint pays off down the line as you approach deadlines. The design sprint (created by Google Ventures) is “a process for answering critical business questions through design, prototyping and testing ideas with customers.” Sprints are great for getting a new data product off the ground quickly because they carve out specific time blocks and resources for you and your team to work on a problem. The Shopify Data Science teams make sprints a common practice, especially when we’re under a tight deadline. When setting up new sprints, here are the steps we like to take:

  1. Choose an impactful problem to tackle. We focus on solving problems for our merchants, but in order to do that, we first have to uncover what those problems are by asking questions. What is the problem we’re trying to solve? Why are we solving this problem? Asking questions empowers you to find a problem worth tackling, identify the right technical solution and ultimately drive impact.
  2. Assemble a small sprint team: Critical to the success of any sprint is assembling a small team (no more than 6 or 7) of highly motivated individuals. Why a small team? It’s easier to stay aligned in a smaller group due to better communication and transparency, which means it’s easier to move fast.
  3. Choose a sprint Champion: This individual should be responsible for driving the direction of the project and making decisions when needed (should we use solution A or B?). Assigning a Champion helps remove ambiguity and allow the rest of the team to focus their energy on solving the problem in front of them.
  4. Set your sprint dates: Timeboxing is one of the main reasons why sprints are so effective. By setting fixed dates, you're committing your team to focus on shipping on a precise timeline. Typically, a sprint lasts up to five days. However, the timeline can be shortened based on the size of the project (for example, three days is likely enough time for creating the first version of a dashboard that follows the impact of COVID-19 on the business’s acquisition funnel).

With your problem identified, your team set up, and your dates blocked off, it’s now time to sprint. Keep in mind while exploring solutions that solving a data-centric problem with a non-data focused approach can sometimes be simple and time efficient. For instance, asking a user for its preferred location rather than inferring it using a complex heuristic.

2. Don’t Skip Out on Prototyping

Speed is critical! The first iterations of a brand new product often go through many changes. Prototypes allow for quick and cheap learning cycles. They also help prevent the sunk cost fallacy (when a past investment becomes a rationale for continuing). 

In the data world, a good rule of thumb is to leverage spreadsheets for building a prototype. Spreadsheets are a versatile tool that help accelerate build times, yet are often underutilized by data scientists and engineers. By design, spreadsheets allow the user to make sense of data in messy contexts, with just a few clicks. The built-in functions cover most basic use cases: 

  • cleaning data by hand rapidly
  • displaying graphs
  • computing basic ranking indices
  • formatting output data.

While creating a robust system is desirable, it often comes at the expense of longer building times. When releasing a brand new data product under a tight timeline, the focus should be on developing prototypes fast. 

A sample Google Sheet dashboard evaluating Inbound Leads.  The dashboard consists of 6 charts.  The 3 line charts on the left measure Lead Count, Qualification Rate %, and Time to Qualification in Minutes.  The 3 bar charts on the right  measure Leads by Channel, Leads by Country, and Leads by Last Source Touched.
An example of a dashboard prototype created within Google Sheets.

Despite a strong emphasis on speed, a prototype should still look and feel professional. For example, the first iteration of the Marketing attribution tool developed for Shopify’s Revenue team was a collection of SQL queries automated by a bash script. The output was then formatted in a spreadsheet. This allowed us to quickly make changes to the prototype and compare it to out-of-the-box tools. We avoided any wasted effort spinning up dashboards, production code, as well as any sentimental attachment to the work, which made it easier for the best solution to win.

3. Avoid Machine Learning (on First Iterations)

When building a new data product, it’s tempting to spend lots of time on a flashy machine learning algorithm. This is especially true if the product is supposed to be “smart”. Building a machine learning model for your first iteration can cost a lot of time. For example, when sprinting to build a lead scoring system for our Sales Representatives, our team spent 80% of the sprint gathering features and training a model. This left little time to integrate the product with the existing customer relationship management (CRM) infrastructure, polish it, and ask for feedback. A simple ranking using a proxy metric would be much faster to implement for the first iteration. The time gained would allow for more conversations with the users about the impact, use and engagement with the tool. 

We took that lesson to heart in our next project when we built a sales forecasting tool. We started with a linear regression using only two input variables that allowed us to have a prototype ready in a couple of hours. Using a simple model allowed us to ship fast and quickly learn whether it solved our user’s problem. Knowing we were on the right track, we then built a more complex model using machine learning.

Focus on building models that solve problems and can be shipped quickly. Once you’ve proven that your product is effective and delivers impact, then you can focus your time and resources on building more complex models. 

4. Talk to Your Users

Shipping fast also means shipping the right product. In order to stay on track, gathering feedback from users is invaluable! It allows you to build the right solution for the problem you’re tackling. Take the time to talk to your users, before, during, and after each build iteration. Shadowing them, or even doing the task yourself is a great return on investment.

Gathering feedback is an art. Entire books and research papers are dedicated to it. Here are the two tips we use at Shopify that increased the value of feedback we’ve received:

  • Ask specific questions. Asking, “Do you have any feedback?” doesn’t help the user direct their thoughts. Questions like, “How do you feel about the speed at which the dashboard loads?” or “Are you able to locate the insights you need on this dashboard to report on top of funnel performance?” are more targeted and will yield richer feedback.
  • Select a diverse group of users for feedback. Let’s suppose that you are building a dashboard that’s going to be used by three regional teams. It’s more effective to send a request for feedback to one person in each team rather than five people in a single team.
A sample Google Form that measures Prototype A's Scoring.  The form consists of 2 questions. The first question is "Is the score easy to parse and interpret? It is scored using a ranking from 1 - 5 with 1 = Very Hard and 5 = Very Easy. The 2nd question is "Additional Comments" and has a text field for the answer.
Feedback our team asked for the scoring system we created. When asking for feedback, you want to ask specific questions so you can yield better feedback.

We implemented the two tips when requesting feedback from users of the sales forecasting tool highlighted in the previous section. We asked a diverse group specific questions about our product, and learned that displaying a numerical score (0 - 100) was confusing. The difference between scores wasn’t clear to the users. Instead, it was suggested to display grades (A, B, C) which turned out to be much quicker to interpret and led to a better user experience.

At Shopify, following these tips has provided the team with a clearer path for launching brand new data products under tight time constraints. More importantly, it helped us avoid common pitfalls like getting stuck during neverending design phases, overengineering complex machine learning systems, or building data products that users don’t use. 

Next time you’re under a tight timeline to ship a new data product, remember to:

  1. Utilize design sprints to help focus your team’s efforts and remove the stress of the ticking clock
  2. Don’t skip on prototyping, it’s a great way to fail early
  3. Avoid machine learning (for first iterations) to avoid being slowed down by unnecessary complexity
  4. Talk to your users so you can get a better sense of what problem they’re facing and what they need in a product

If you’d like to read more about shipping new products fast, we recommend checking out The Design Sprint book, by Jake Knapp et al. which provides a complete framework for testing new ideas.

If you’re interested in helping us ship great data products, quickly, we’re looking for talented data scientists to join our team.

Continue reading

Read Consistency with Database Replicas

Read Consistency with Database Replicas

At Shopify, we’ve long used database replication for redundancy and failure recovery, but only fairly recently started to explore the potential of replicas as an alternative read-only data source for applications. Using read replicas holds the promise of enhanced performance for read-heavy applications, while alleviating pressure on primary servers that are needed for time-sensitive read/write operations.

There's one unavoidable factor that can cause problems with this approach: replication lag. In other words, applications reading from replicas might be reading stale data—maybe seconds or even minutes old. Depending on the specific needs of the application, this isn’t necessarily a problem. A query may return data from a lagging replica, but any application using replicated data has to accept that it will be somewhat out of date; it’s all a matter of degree. However, this assumes that the reads in question are atomic operations.

In contrast, consider a case where related pieces of data are assembled from the results of multiple queries. If these queries are routed to various read replicas with differing amounts of replication lag and the data changes in midstream, the results could be unpredictable.


Reading from replicas with varying replication lag produces unpredictable results

Reading from replicas with varying replication lag produces unpredictable results

For example, a row returned by an initial query could be absent from a related table if the second query hits a more lagging replica. Obviously, this kind of situation could negatively impact the user experience and, if these mangled datasets are used to inform future writes, then we’re really in trouble. In this post, I’ll show you the solution the Database Connection Management team at Shopify chose to solve variable lag and how we solved the issues we ran into.

Tight Consistency

One way out of variable lag is by enforcing tight consistency, meaning that all replicas are guaranteed to be up to date with the primary server before any other operations are allowed. This solution is expensive and negates the performance benefits of using replicas. Although we can still lighten the load on the primary server, it’s at the cost of delayed reads from replicas.

Causal Consistency

Another approach we considered (and even began to implement) is causal consistency based on global transaction identifier (GTID). This means that each transaction in the primary server has a GTID associated with it, and this GTID is preserved as data is replicated. This allows requests to be made conditional upon the presence of a certain GTID in the replica, so we can ensure replicated data is at least up to date with a certain known state in the primary server (or a replica), based on a previous write (or read) that the application has performed. This isn’t the absolute consistency provided by tight consistency, but for practical purposes it can be equivalent.

The main disadvantage to this approach is the need to implement software running on each replica which would report its current GTID back to the proxy so that it can make the appropriate server selection based on the desired minimum GTID. Ultimately, we decided that our use cases didn’t require this level of guarantee, and that the added level of complexity was unnecessary.

Our Solution to Read Consistency

Other models of consistency in replicated data necessarily involve some kind of compromise. In our case, we settled on a form of monotonic read consistency: successive reads will follow a consistent timeline, though not necessarily reading the latest data in real time. The most direct way to ensure this is for any series of related reads to be routed to the same server, so successive reads will always represent the state of the primary server at the same time or later, compared to previous reads in that series.

Reading repeatedly from a single replica produces a coherent data timeline
Reading repeatedly from a single replica produces a coherent data timeline

In order to simplify implementation and avoid unnecessary overhead, we wanted to offer this functionality on an opt-in basis, while at the same time avoiding any need for applications to be aware of database topology and manage their own use of read replicas. To see how we did this, let’s first take a step back.

Application access to our MySQL database servers is through a proxy layer provided by ProxySQL using the concept of hostgroups: essentially pools of interchangeable servers which look like a single data source from the application’s point of view.

A modified illustration from the ProxySQL website shows its place in our architecture
A modified illustration from the ProxySQL website shows its place in our architecture

When a client application connects with a user identity assigned to a given hostgroup, the proxy routes its individual requests to any randomly selected server within that hostgroup. (This is somewhat oversimplified in that ProxySQL incorporates considerations of latency, load balancing, and such into its selection algorithm, but for purposes of this discussion we can consider the selection process to be random). In order to provide read consistency, we modified this server selection algorithm in our fork of ProxySQL.

Any application which requires read consistency within a series of requests can supply a unique identifier common to those requests. This identifier is passed within query comments as a key-value pair:

/* consistent_read_id:<some unique ID> */ SELECT <fields> FROM <table>

The ID we use to identify requests is always a universally unique identifier (UUID) representing a job or any other series of related requests. This consistent_read_id forms the basis for a pseudorandom but repeatable index into the list of servers that replaces the default random indexing taking place in the absence of this identifier. Let’s see how.

First, a hashing algorithm is applied to the consistent_read_id to yield an integer value. We calculate the modulo of this number and the number of servers that becomes our pseudorandom index into the list of available servers. Repeated application of this algorithm yields the same pseudorandom result, thus maintaining read consistency over a series of requests specifying the same consistent_read_id. This explanation is simplified in that it ignores the server weighting which is configurable in ProxySQL. Here’s what an example looks like, including the server weighting:

The <code>consistent_read_id</code> is used to generate a hash that yields an index into a weighted list of servers.  In this example, Every time we receive this particular consistent_ read_ id, server 1 will be selected.
The consistent_read_id is used to generate a hash that yields an index into a weighted list of servers. In this example, every time we receive this particular consistent_ read_ id, server 1 will be selected.

A Couple of Bumps in the Road

I’ve covered the basics of our consistent-read algorithm, but there were a couple of issues to be addressed before the team got it working perfectly.

The first one surfaced during code review and relates to situations where a server becomes unavailable between successive consistent read requests. If the unavailable server is the one that was previously selected (and therefore would’ve been selected again), a data inconsistency is possible—this is a built-in limitation of our approach. However, even if the unavailable server isn’t the one that would’ve been selected, applying the selection algorithm directly to the list of available servers (as ProxySQL does with random server selection) could also lead to inconsistency, but in this case unnecessarily. To address this issue, we index into the entire list of configured servers in the host group first, then disqualify the selected server and reselect if necessary. This way, the outcome is affected only if the selected server is down, rather than having the indexing potentially affected for others in the list. Discussion of this issue in a different context can be found on

Indexing into configured servers rather than available servers provides a better chance of consistency in case of server failures

Indexing into configured servers rather than available servers provides a better chance of consistency in case of server failures

The second issue was discovered as an intermittent bug that led to inconsistent reads in a small percentage of cases. It turned out that ProxySQL was doing an additional round of load balancing after initial server selection. For example, in a case where the target server weighting was 1:1 and the actual distribution of server connections drifted to 3:1, any request would be forcibly rerouted to the underweighted server, overriding our hash-based server selection. By disabling the additional rebalancing in cases of consistent-read requests, these sneaky inconsistencies were eliminated.

Currently, we're evaluating strategies for incorporating flexible use of replication lag measurements as a tuneable factor that we can use to modify our approach to read consistency. Hopefully, this feature will continue to appeal to our application developers and improve database performance for everyone.

Our approach to consistent reads has the advantage of relative simplicity and low overhead. Its main drawback is that server outages (especially intermittent ones) will tend to introduce read inconsistencies that may be difficult to detect. If your application is tolerant of occasional failures in read consistency, this hash-based approach to implementing monotonic read consistency may be right for you. On the other hand, if your consistency requirements are more strict, GTID-based causal consistency may be worth exploring. For more information, see this blog post on the ProxySQL website.

Thomas has been a software developer, a professional actor, a personal trainer, and a software developer again. Currently, his work with the Database Connection Management team at Shopify keeps him learning and growing every day.

We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

Continue reading

Bound to Round: 8 Tips for Dealing with Hanging Pennies

Bound to Round: 8 Tips for Dealing with Hanging Pennies

Rounding is used to simplify the use of numbers that contain more decimal places than required. The perfect example is representing cash, money, dough. In the USA and Canada, the cent represents the smallest fraction of money. The US and Canadian dollar can’t be transacted with more than 2 decimal places. When numbers represent money, we use rounding to replace an un-representable, un-transactable money amount with one that represents a cash tender.

The best way to introduce this blog is by asking you to watch a scene from one of my favorite movies, Office Space:

In this scene, Peter describes to his girlfriend a program that compounds interest using high precision amounts. He explains that they simplify the calculations by rounding the amounts down and by doing that they’re left with hanging pennies that they transfer into their personal accounts.

This is exactly what we want to avoid—we want to avoid having one developer aware of hanging pennies. We also want to avoid having many hanging pennies. And when faced with such a situation, we want to identify such calculations and put a plan in place on who to notify and what to do with them. 

Before I explain this further, I want to tell you this story first. My father introduced banking software systems in the Middle East in the late 70’s. Rest assured he was bound to round. He faced the same issue Peter faced. He resolved it by accepting that he can’t resolve it. So, he created an account where the extra pennies accumulated and later were given as bonuses to the IT team at the bank. It was a way of getting back at the rest of the employees at the bank that didn’t want to move to using a software system and preferred pen and paper.

The Rounding Dilemma

Okay, let’s get back to breaking this problem down further with another example.

Let’s assume we can only charge 1 total amount, even if this 1 amount consists of a summation of multiple rates.

Rate 1 is 2.4%
Rate 2 is 2.9%
Amount $10.10

When rounding individual rate amounts:
Rate 1 total = (rate /100) * $10.10 = 0.2424 = rounded = 0.24
Rate 2 total = (rate /100) * $10.10 = 0.2929 = rounded = 0.29
Total = 0.24 + 0.29 = 0.53

When rounding total of the rate amounts:
Rate 1 total = (rate /100) * $10.10 = 0.2424
Rate 2 total = (rate /100) * $10.10 = 0.2929

Total = 0.2424 + 0.2929 = 0.5353 = rounded = 0.54

The example above makes it clear that deciding when to round can either make you more money by collecting the loose penny or lose money by deciding to let go of it.

Rounding at different stages in the example above has more impact if there are currency conversions involved. As a rule of thumb, the more currency conversions (which also involve rounding) and more rounding, the more we lose precision along the way. 

Rational numbers are natural products of various banking calculations: distributed payments, shared liabilities, and rates applied. So, you’ll face other rounding encounters in many other places in financial software, most notably while calculating taxes or discounts and, just like in Office Space, while calculating interest. 

Did it make cents? I hope you have a grasp on the problem. Now, is this avoidable? No, it’s not. If you’re working on financial software you’ll eventually be bound to round. But, we can control where and how to handle the precision loss. I’m sharing 8 tips to make your precision obsessive compulsiveness a bit less troubling to you as a developer and to the company as a business.

1. Notify Stakeholders

Show and tell where the rounding happens within your calculations to the stakeholders of your project. Explain the impact of the rounding, document it, and keep talking about it until all leaders on your team and within your department are aware. You, as a developer, don’t have to take the full burden of knowing that the company is making less than 1 cent on some transactions because of the calculations you put in place. Is a problem really a problem if it’s everyone’s problem?!

2. Use Banker’s Rounding

There are many types of rounding. There are rounding methods that increase bias and rounding methods that decrease bias. Banker’s rounding is the method proven to decrease rounding bias within calculations. Banking rounding deliberately distorts some of the rounded values to bring rounding totals of rounded numbers as close to the totals of the original numbers. Talking about why regular rounding taught in schools can’t meet our needs and why Banker’s rounding is mostly used for financial calculations would turn this blog into a math lesson, and as much as I would love to do that, I’d probably lose many readers.

3. Use Data Types That Hold the Most Precision

Within your calculations, ensure that all variables used are data types that can hold as much precision as possible (can hold enough decimal points). For example, using a double instead of a float. It’s important to keep the precision wherever there isn’t rounding involved as it reduces the amount of hanging pennies. 

4. Be Consistent

I mean, this applies to a lot of things in life. When you and your team decide on which rounding methods to use, ensure that the same rounding method is used throughout your code. 

5. Be Explicit About Rounding

When rounding within your calculation make it explicit by either adding comments or prefix rounded variables with “rounded_”. This ensures that anyone reading your code understands where precision loss is happening. Link to documentation about rounding strategies within your code documentation.

6. Refer to Government Rounding Standards

A photo of the 1040 U.S. Individual Income Tax Return form on a desk.
The 1040 U.S. Individual Income Tax Return form

Losing precision is a universal problem and not only suffered by mathematicians. Refer to your government’s ruling around rounding. When it comes to tax calculations, governments might have different rules. Refer to them and educate yourself and your team.

7. Round Only When You Absolutely Have To

Remember, only tender money amounts need to be rounded. Whenever you can avoid rounding, do so!

8. Tell Your Users

Please don’t hide what rounding methods you use to your users. Many users will try to reverse engineer calculations on their own, and as a company you don’t want to end up explaining this several times. Ensure rounding rules are explicitly written in your documentation and easily accessible. 

A circular logo with a Shopify shopping bag above the words "Be Merchant Obsessed. What Shopify Values"
Be Merchant Obsessed

At Shopify, we are, of course, bound to round. If you are a Shopify merchant reading this post I want to assure you that in all our calculations, developers are biased towards benefiting our merchants. Not only are our support teams merchant obsessed, all Shopify developers are too.

Dana is a senior developer on the Money team at Shopify. She’s been in software engineering since 2007. Her primary interests are back-end development, database design, and software quality management. She's contributed to a variety of products, and since joining Shopify she's been on the Shopify Payments and Balance teams. She recently switched to data development to deliver impactful money insights to our merchants.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Using Betas to Deploy New Features Safely

Using Betas to Deploy New Features Safely

For companies like Shopify that practice continuous deployment, our code is changing multiple times every day. We have to de-risk new features to ship safely and confidently without impacting the million+ merchants using our platform. Beta flags are one approach to feature development that gives us a number of notable advantages.

Continue reading

Technical Mentorship Reimagined: Time-bound and No Awkward Asks Necessary

Technical Mentorship Reimagined: Time-bound and No Awkward Asks Necessary

Authors: Sarah Naqvi and Steve Lounsbury

Struggling with a concept and frantically trying to find the answers online? Are you thinking: I should just ping a teammate or Is this something I should already know? That’s fine, I’ll ask anyway. And then you do ask and get an answer and are feeling pretty darn proud of yourself and move forward. But wait, now I have about 25 follow-up questions. Sound familiar?

Or how about the career growth questions that can be sometimes too uncomfortable asking your lead or manager? Oh, I know, I’ll take them to reddit. But wait, they have no company context. Yeah, also familiar. I know. I should get a mentor! But darn. I’ve heard so many stories… Who do I approach? Will they be interested? What if they reject me? And what if it’s not working out?

Shopify is a collaborative place. We traditionally pair with other developers and conduct code reviews to level up. This approach is great for just-in-time feedback and unblocking us on immediate problems. We wanted to continue this trend and also look at how we can support developers in growing themselves through longer term conversations.

We surveyed our developers at Shopify and learned that they:

  • Love to learn from others at Shopify
  • Are busy people and find it challenging to make dedicated time for learning
  • Want to grow in their technical and leadership skills.

These findings birthed the idea of building a mentorship program targeted at solving these exact problems. Enter Shopify’s Engineering Mentorship Program. Shopify’s RnD Learning Team partnered with developers across the company to design a unique mentorship program and this guide walks readers through the structure, components, value-adds, and lessons we’ve had over the past year.

Gathering Participants

Once a quarter developers get an email inviting them to join the upcoming cycle of the program and sign up to be a mentee, mentor, or both. In addition to the email that is sent out, updates are posted in a few prominent Slack channels to remind folks that the signup window for the upcoming cycle is now open.

When signing up to participate in the program, mentees are asked to select their areas of interest and mentors are asked to select their areas of expertise. The current selections include:

  • Back-end
  • Data
  • Datastores
  • Front-end
  • Infrastructure
  • Mobile - Android
  • Mobile - iOS
  • Mobile - React Native
  • Non-technical (leadership, management).

Mentors are also prompted to choose if they are interested in supporting one or two mentees for the current cycle.

Matching Mentors and Mentees

Once the signup period wraps up, we run an automated matching script to pair mentees and mentors. Pairs are matched based on a few criteria:

  • Mentor isn’t mentee's current lead
  • Mentor and mentee don’t report to same lead
  • Aligned areas of interest and expertise based on selections in sign-up forms
  • Mentee job level is less than or equal to mentor job level.

The matching system intentionally avoids matching based upon criteria such as geographic location or department to broaden the developer’s network and gain company context they would have otherwise not been exposed to.

Pairs are notified by email of their match and invited to a kickoff meeting where organizers welcome participants, explain the program model and value that they will receive as a mentor or mentee, and answer any questions.

Running the Six Week Program Cycle

Each program cycle runs for six weeks and pairs are expected to meet for approximately one hour per week.

Shopify’s Engineering Mentorship Program overview
Shopify’s Engineering Mentorship Program overview

The time bound nature of the program enables developers to try out the program and see if it’s a good fit for them. They can connect with someone new and still feel comfortable knowing that they can walk away, no strings attached.

The voluntary signup process ensures that developers who sign up to be a mentor are committed to supporting a mentee for the six week duration and mentees can rest assured that their mentor is keen on supporting them with their professional growth. The sign-up emails as well as the sign-up form reiterate the importance of only signing up as a mentor or mentee if you can dedicate at minimum an hour per week for the six week cycle.

Setting Goals

In advance of the first meeting, mentees are asked to identify technical skills gaps they want to improve. During their first meeting, mentees and mentors work together building a tangible goal that they can work towards over the course of the six weeks. Goals often change and that’s expected.

Through the initial kickoff meeting and weekly check-ins via Slack, we reinforce and reiterate throughout the program that the goal itself is never the goal, but an opportunity to work towards a moving target.

Defining the goal is often the most challenging part for mentees. Mentors take an active role in supporting them craft this—the program team also provides tons of real examples from past mentees.

Staying Connected as a Group

Outside of the one on one weekly meetings between each pairing of mentor and mentee, the broader mentorship community stays connected on Slack. Two Slack channels are used to manage the program and connect participants with one another and with the program team.

The first Slack channel is a public space for all participants as well as anyone at the company who is curious about the program. This channel serves the purpose to advertise the program and keep participants connected to each other as well as to the program team. This is done by regularly asking questions and continuously sharing their experiences of what’s working well, how they’ve pivoted (or not) from their initial goals, and any general tips to support fellow mentors and mentees throughout the program.

The second Slack channel is a private space that is used exclusively by the program team and mentors. This channel is a space for mentors to be vulnerable and lean on fellow mentors for suggestions and resources.

Preparing the Participants with a Guidebook

Beyond the Slack channels, the other primary resource our participants use is a mentorship guidebook that curates tips and resources for added structure. The program team felt that a guidebook was an important aspect to include for participants who were craving more support. While it is entirely an optional resource to use, many first time mentors and mentees find it to be a helpful tool in navigating an otherwise open ended relationship between themselves and their match. It includes tips on sample agendas for your first meeting, example goals, and ways to support your mentee. For example, invite them to one of your meetings and debrief afterwards, pair, or do a code review.

Growing Mentor’s Skills Too

Naturally teaching someone else a technical concept helps reinforce it in our own minds. Our mentors constantly share how they’ve found the program helps refine their craft skills as well:

“Taking a step back from my day-to-day work to meet with [them] and chatting about career goals at a higher level, gave me more insight into what I personally want from my career path as well.”

The ability to mentor others in their technical acumen and craft is a skill that’s valued at Shopify. As engineers progress in their career here, being an effective mentor becomes a bigger piece of what’s expected in the role. The program gives folks a place to improve both their mentorship and leadership skills through iteration.

Throughout the program, mentors receive tips and resources from engineering leaders at the company that are curated by the program team and shared via Slack, but the most valuable piece ends up being the support they provide each other through a closed channel dedicated to mentors.

Here’s an actual example of how mentors help unblock each other:

Mentor 1: Hey! Im curious to know how y’all are supporting your mentees in coming up with a measurable goal? My mentee’s original goal was “learn how Shopify core works” and we’ve scoped that down to “learn how jobs are run in core” but we still don’t have it being something that’s measurable and can clearly be ticked off by the end of the 6 weeks. They aren’t the most receptive to refining the goal so I’m curious how any of you have approached that as well?

Mentor 2: Hmmm, I’d try to get to the “why” when they say they want to learn how Shopify core works. Is it so they can find things easier? Make better decisions by having more context? Are they interested in why it’s structured the way it is to inform them on future architecture decisions? Maybe that could help in finding something they can measure. Or if they’re just curious, could the goal be something like being able to explain to someone new to Shopify what the different components of the system are and how they interact? Or they’re able to create a new job in core in x amount of time?

Mentor 3: If you've learned how something works, you should be able to tell someone else. So I turn these learning goals into a goal to write a wiki page or report, make a presentation, or teach someone else one on one.

Mentor 1: Thanks for all the replies! I surfaced adapting the learning goal to have an outcome so they've decided on building an example that can be used as documentation and shared with their team. They're writing this example in the component their team maintains as well which will help in "learn how Shopify works" as they weren't currently familiar with the component.

Gathering Program Feedback

At the end of the six weeks mentees and mentors are asked to provide constructive feedback to one another and the program officially comes to a close. 

Program participants receive a feedback survey that helps organizers understand what’s working well and what to revise for future cycles. Participants share

  • Whether they would recommend the program to someone else or not
  • What the best part of the program was for them
  • What they would like to see improved for future cycles.

Tweaks are made within a short month or so and the next cycle begins. A new opportunity to connect with someone else, grow your skills, and do it in a time-bound and supportive environment.

What We’ve Learned in 2020

Overall, it’s been working well. The type of feedback we receive from participants is definitely share-worthy:

“was phenomenal to learn more about Production Engineering and our infrastructure. Answered hundreds of mind-numbing questions with extreme patience and detail; he put in so much time to write and share resources—and we wrapped up with a live exercise too! I always look forward to our sessions and it was like walking into a different, fantasy-like universe each time. Hands down the best mentoring experience of my professional career thus far.” - Senior Developer

  • We’ve had 300+ developers participate as mentees and 200+ as mentors.
  • 98% of survey respondents indicated that they would recommend the program to someone else.
  • The demand is there. Each cycle of the program we have to turn away potential mentees because we can’t meet the demand due to limited mentors. We are working on strategies to attract more mentors to better support the program in 2021.

The themes that emerged in terms of where participants found the most value are around:

  • Building relationships: getting to know people is hard. Getting to know colleagues in a fully remote world is near impossible. This program helps.
  • Having dedicated time for learning set aside each week: we’ve all got a lot to do. We know that making time for learning is important, but it can easily fall on the back burner. This program helps with that too.
  • Developing technical craft skills: growing your technical chops? Say no more.
  • Developing skills as a teacher and mentor: getting better at supporting peers can be tricky. You need experience and a safe space to learn with real people.
  • Gaining broader Shopify context: being a T-shaped developer is an asset. By T-shaped we are referring to the vertical bar of the “T” as the areas the developer is an expert in, while the horizontal bar refers to areas where the developer has some breadth and less depth of knowledge. Getting outside of our silos and learning about how different parts of Shopify work helps build stronger developers and a stronger team.

Reinvesting in the developers at Shopify is one way that we help individuals grow in their careers and increase the effectiveness of our teams.

Some great resources that inspired us:

Sarah is a Senior Program Manager who manages Engineering learning programs at Shopify. For the past five years, she has been working on areas such as designing and building the initial version of the Dev Degree program to now designing and delivering the Engineering Mentorship Program. She is focused on helping our engineers develop their professional skills and creating a strong Engineering community at Shopify.

Steve Lounsbury is a Sr Staff Developer at Shopify. He has been with Shopify since 2013 and is currently working to improve the checkout experience.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

How to Make Dashboards Using a Product Thinking Approach

How to Make Dashboards Using a Product Thinking Approach

It’s no secret that communicating results to your team is a big part of the data science craft. This is where we drive home the value of our work, allowing our stakeholders to understand, monitor, and ultimately make decisions informed by data. One tool we frequently use at Shopify to help us is the data dashboard. This post is a step-by-step guide to how you can create dashboards that are user-centred and impact-driven.

People use the word dashboard to mean one of several different things. In this post I narrow my definition to mean an automatically updated collection of data visualisations or metrics giving insight into a set of business questions. Popular dashboard-building tools for product analytics include Tableau, Shiny, or Mode.

Unfortunately, unless you’re intentional about your process, it can be easy to put a lot of work into building a dashboard that has no real value. A dashboard that no one ever looks at is about as useful as a chocolate teapot. So, how can you make sure your dashboards meet your users’ needs every time? 

The key is taking a product thinking approach. Product thinking is an integral part of data science at Shopify. Similar to the way we always build products with our merchants in mind, data scientists build dashboards that are impact-focused, and give a great user experience for their audience.

When to Use a Dashboard

Before we dive into how to build a dashboard, the first thing you should ask yourself is whether this is the right tool for your situation. There are many ways to communicate data, including longform documents, presentations, and slidedocs. Dashboards can be time consuming to create and maintain, so we don’t want to put in the effort unnecessarily.

Some questions to ask when deciding whether to build a dashboard are

  • Will the data be dynamically updated?
  • Do you want the exploration to be interactive?
  • Is the goal to monitor something or answer data-related questions?
  • Do users need to continuously refer back to this data as it changes over time?

If most or all of the answers are “Yes”, then a dashboard is a good choice. 

If your goal is to persuade your audience to take a specific action, then a dashboard isn’t the best solution. Dashboards are convenient because they automatically serve updated metrics and visualisations in response to changing data. However, this convenience requires handing off some amount of interpretation to your users. If you want to tell a curated story to influence the audience, you’re better off working with historic, static data in a data report or presentation.

1. Understand the Problem and Your Audience

Once you have decided to build a dashboard, it’s imperative to start with a clear goal and audience in mind—that way you know from the get-go that you’re creating something of value. For example, at Shopify these might be



Your data team

Decide whether to ship an experimental feature to all our merchants.

Senior leadership

Monitor the effect of COVID-19 on merchants with retail stores to help inform our response.

Your interdisciplinary product team

Detect changes in user behaviour caused by shipping a new feature.

If you find that you have more than one stated goal (for example, both monitoring for anomalies and tracking performance), this is a flag that you need more than one dashboard.

Having clearly identified your audience and the reason for your dashboard, you’ll need to figure out what metrics to include that best serve their needs. A lot of the time this isn’t obvious and can turn into a lengthy back and forth with your users, which is ok! Time spent upfront will pay dividends later on. 

Good metrics to include are ones carefully chosen to reflect your stated goals. If your goal is to monitor for anomalies, you need to include a broad range of metrics with tripwire thresholds. If you want a dashboard to tell how successful your product is, you need to think deeply about a small number of KPIs and north stars that are proxies for real value created. 

An example of how you could whiteboard a dashboard plan to show your stakeholders. Visuals are helpful to get everyone aligned.
An example of how you could whiteboard a dashboard plan to show your stakeholders. Visuals are helpful to get everyone aligned.

Once you’ve decided on your metrics and data visualisations, make a rough plan of how they are presented; this could be a spreadsheet, something more visual like a whiteboard sketch, or even post-it notes. Present this to your stakeholders before you write any code: you’ll refine it as you go, but the important thing is to make sure what you’re proposing will help to solve their problem.

Now that you have a plan, you’re ready to start building.

2. Build a Dashboard with Your Users In Mind

The tricky thing about creating a dashboard is that the data presented must be accurate and easy to understand for your audience. So, how do you ensure both of these attributes while you’re building? 

When it comes to accuracy and efficiency, depending on what dashboard software you’re using, you’ll probably have to write some code or queries to create the metrics or visualisations from your data. Like any code we write at Shopify, we follow software best practices. 

  • Use clean code conventions such as Common Table Expressions to make queries more readable 
  • Use query optimisations to make them as run as efficiently as possible
  • Use version control to keep track of code changes during the development process 
  • Get the dashboard peer reviewed for quality assurance and to share context

The way that you present your data will have a huge impact on how easily your users understand and use it. This involves thinking about the layout, the content included or excluded, and the context given.

Use Your Layout to Focus Your Users’ Attention

Like the front page of a newspaper, your users need to know the most important information in the first few seconds.

One way you can do this is to structure your dashboard like an inverted pyramid, with the juicy headlines (key metrics) at the top, important details (analysis and visualisations) in the middle, and more general background info (interesting but less vital analyses) at the bottom. 

Above is an inverted pyramid demonstrating  how you can think about the hierarchy of the information you present in your dashboard.
Above is an inverted pyramid demonstrating  how you can think about the hierarchy of the information you present in your dashboard.

Remember to use the original goals decided on in step one to inform the hierarchy.

Keep the layout logical and simple. Guide the eye of your reader over the page by using a consistent visual hierarchy of headings and sections where related metrics are grouped together to make them easy to find.

Visual hierarchy, grouped sections and whitespace will make your dashboard easier to read.
Visual hierarchy, grouped sections and whitespace will make your dashboard easier to read.

Similarly, don’t be afraid to add whitespace, it gives your users a breather, and when used properly it increases comprehension of information.

Keep Your Content Sparing But Targeted

The visualizations you choose to display your data can make or break the dashboard. There’s a myriad of resources on this so I won’t go in-depth, but it’s worth becoming familiar with the theory and experimenting with what works best for your situation. 

Be brave and remove any visualisations or metrics that aren’t directly relevant to stated goals. Unnecessary details bury the important facts under clutter. If you need to include them, consider creating a separate dashboard for secondary analyses.

Ensure Your Dashboard Includes Business and Data Context

Provide enough business context so that someone discovering your dashboard from another team can understand it at a high level, such as:

  • Why this dashboard exists 
  • Who it’s for
  • When it was built, and if and when it’s set to expire 
  • What features it’s tracking via links to team repositories, project briefs, screenshots, or video walkthroughs

Data context is also important for the metrics on your dashboard as it allows the user to anchor what they are seeing to a baseline. For example, instead of just showing the value for new users this week, add an arrow showing the direction and percentage change since the same time last week.

The statistic on the right has more value than the one on the left because it is given with context.
The statistic on the right has more value than the one on the left because it is given with context.

You also can provide data context by segmenting or filtering your data. Different segmentations of data can give results with completely opposite interpretations. Leveraging domain-specific knowledge is the key to choosing appropriate segments that are likely to represent the truth.

Before You Ship, Consider Data Freshness

Your dashboard is only as fresh as the last time it was run, so think about how frequently the data is refreshed. This might be a compromise between how often your users want to look at the dashboard, and how resource-intensive it is to run the queries.

Finally, it’s best practice to get at least two technical reviews before you ship. It’s also worth getting sign-off from your intended users. After all, if they don’t understand or see value in it, they won’t use it.

3. Follow Up and Iterate

You’ve put in a lot of work to understand the problem and audience, and you’ve built a killer dashboard. However, it’s important to remember the dashboard is a tool. The ultimate goal is to make sure it gets used and delivers value. For that you’ll need to do some follow-up.

Market It

It’s up to you to spread awareness and make sure the dashboard gets into the right hands. The way you choose to ‘market’ your dashboard depends on the audience and team culture, but in general, it’s a good idea to think about both how to launch it, and how to make it discoverable longer-term. 

For the launch, think about how you can present what you’ve made in a way that will maximise uptake and understandability. You might only get one chance to do this, so it’s worth being intentional. For example, you might choose to make a well-crafted announcement via a team email and provide an accompanying guide for how to use the dashboard, such as a short walk-through video. 

In the long-term make sure that after launching your dashboard that it's easily discoverable by whoever might need it. This might mean making it available through an internal data portal and using a title and tags tailored to common search terms. You might think about ways to re-market the dashboard once time has passed. Don’t be afraid to resurface or make noise about the dashboard when you find organic moments to do so. 

Use It and Improve It

Return to the goals  identified in step one and think about how to make sure these are reached. 

For example, if the dashboard was created to help decide whether to ship a specific feature, set a date for when this should happen and be prepared to give your opinion to the team based on the data at this point.

Monitor usage of the dashboard to find out how often it’s being used, shared, or quoted. It gives insight into how much impact you’ve had with it.

If the dashboard didn’t have the intended outcome, figure out why not. Is there something you could change to make it more useful or would do differently next time? Use this research to help improve the next dashboard.

Maintain It

Finally, as with any data product, without proper maintenance a dashboard will fall into disrepair. Assign a data scientist or team to answer questions or fix any issues that arise.

Key Takeaways

I’ve shown you how to break down the process of building a dashboard using product thinking. A product-thinking approach is the key to maximising the business impact created proportional to the time and effort put in. 

You can take a product-thinking approach to building a impact-driving dashboard by following these steps:

  • Understand your problem and your audience; design a dashboard that does one thing really well, for a clear set of users 
  • Build the dashboard with your users in mind, ensuring it is accurate and easy to understand 
  • Follow up and iterate on your work by marketing, improving and maintaining it into the future. 

By following these three steps, you will create dashboards that put your users front and centre. 

If you’re interested in using data to drive impact, we’re looking for talented data scientists to join our team.

Lin has been a Data Scientist at Shopify for 2 years and is currently working on Merchandising, a team dedicated to helping our merchants be as successful as possible at branding and selling their products. She has a PhD in Molecular Genetics and used to wear a lab coat to work.

Continue reading

Using GraphQL for High-Performing Mobile Applications

Using GraphQL for High-Performing Mobile Applications

GraphQL is the syntax that describes data that a client asks from a server. The client, in this case, is a mobile application. GraphQL is usually compared with REST API, a common syntax that most mobile application developers use. We will share how GraphQL can solve some of the pain points of REST API in mobile application development and discuss tips and best practices that we learned at Shopify by using GraphQL in our mobile applications.

Why Use GraphQL?

A mobile application generally has four basic layers in the codebase:
  1. Network layer: defines the connection and the server to connect to send/receive data.
  2. Data model layer: translates data coming from the network layer to understandable data for local app models.
  3. View models layer: translates data models to understandable models for the user interface.
  4. User interface layer: presents/receives data to/from the user.
Four layers in a mobile application: Network layer, Data model layer, View models layer, User interface layer

A network layer and data model layer are needed for an app to talk to a server and translate that information to view layers. GraphQL can fit into these two layers and base a data graph layer and solve most of the pain points mobile developers used to have when using REST APIs.

One of the pain points when using REST APIs is that the data coming from the server should be mapped many times to different object types in order to be presented on the screen or vice versa from input in the screen to be sent to the server. Simpler apps might have fewer of these mappings depending on if the app has a local database to store data or if the app is online only. But mobile apps surely have the mapping to convert the JSON data coming from an API to a class object (for example, Swift objects ).

When working with REST endpoints these mappings are basically matching statically typed code with the unstructured JSON responses. In other words, mobile developers are asked to hard code the type of a field and cast the JSON value to the assumed type. Sometimes developers validate and assert the type. These castings or validations might fail as we know the server is always changing and deprecating fields and objects. If that happens, we cannot fix the mobile application that is already released in the market without replacing those hard codes and assumptions. This is one of the bug-prone parts of the mobile application when working with REST endpoints. These changes will happen again and again during the lifetime of an application. The mobile developer’s job is to maintain those hard codes and keep the parity between the APIs response and the application mapping logic. Any change on server APIs has to be announced and that forces the mobile developers to update their code.

The problem described above can be somewhat alleviated by adding frameworks to control the flow and providing more API documentation, such as The OpenAPI Specification (OAS). However, this does not actually solve the problem as part of the endpoint itself, and adds a workaround or dependencies on different frameworks.

On the other side, GraphQL addresses the aforementioned concerns. GraphQL APIs are strongly typed and a self-documented contract between server and clients. Strongly typed means each type of data is predefined as part of the language. This makes it easy for clients to be always in sync with server data types. There are no more statical types in your mobile application and no JSON mapping with the static data types in the codebase. Mobile apps’ objects will always be synced with the server objects and developers will get the updates and deprecations at compile time.

GraphQL endpoints are defined by schemas. Introspection is the system in GraphQL that enables tooling systems to generate code for different languages and platforms. Deprecation is a good example of describing introspection. It can be added so that each field would have a isDeprecated boolean and a replicationReason. This GraphQL tool become very useful as it shows warnings and feedback on compile-time in a mobile project.

As an example, the below JSON is the response from an API endpoint:

The price field on product is received as a String type and the client has the mapping below to convert the JSON data in to a swift model:

Let price = product[“price”] as? String


This type casting is how a mobile application transfers the JSON data to understandable data for UI layers. Basically, mobile clients have to statically define the type of each field and this is independent of server’s objects.

On the other side, GraphQL removes these static type castings. Client and server will always be tightly coupled and in sync. In the example above, Product type will be in the schema in GraphQL documentation as a contract between client and server, and price will always be the type that is defined in that contract. So the client is no longer keeping static types for each field.


Note that customization comes with a cost. It is the client developer's responsibility to keep the performance high while taking advantage of the customization. The choice between using REST API vs GraphQL is up to the developer based on the project but in general REST API endpoint is defined in a more optimized way. That means each endpoint only receives a defined input and it returns a defined output and no more than that. GraphQL endpoints can be customized and clients can ask for anything in a single request. But clients also need to be careful about the costs of this flexibility. We are going to talk about GraphQL query costs later but having cost doesn't mean we can't reach the same optimization as REST API with GraphQL. Query cost should be considered when taking advantage of the customization feature.

Tips and Best Practices

To use GraphQL queries in a mobile project, you need to have a code generator tool to generate the client-side files representing your GraphQL queries, mutations, and responses. The tool we use at Shopify is called Syrup. Syrup is open source and generates strongly-typed Swift and Kotlin codes based on the GraphQL queries used in your app. Let's look at some examples of GraphQL queries in a mobile application and learn some tips. The examples and screenshots are from Shopify POS application.

Fragments and Screens in Mobile Apps

Defining fragments usually depends on the application UI. In this example, the order details screen in Shopify POS application shows lineItems on an order but it also has a sub screen which shows an event on order with related lineItems. For example, order details on the top image and return event screen with the lineItems that are returned on the bottom.

Fragments and Screens in Mobile Apps


Fragments: return event screen with the lineItems that are returned on the bottom

In this example lineItem rows in both screens are exactly the same and the view to create that row receives exactly the same information to create the view. Assuming each screen calls a query to get the information they need. They both need the same fields on the lineItem object. So, OrderLineItem object is basically a shared object between more than one screen and also between more than one query in the app. With GraphQL query we define orderLineItem as a fragment so it can be reusable and it guarantees that the lineItem view gets all the fields it needs every time the app fetches lineItem using this fragment. See query examples below:

Fragments with Common Fields but Different Names

Fragments can be customized on the client side and usually in mobile applications very much depends on the UI. Defining more fragments does not affect query cost so it's free and it gives your query a good structure. A good tip about using fragments is that not only you can break down the fields into multiple fragments but also you can put the same fields in multiple fragments and again it does not add cost to the query. For example, sometimes applications present repetitive data in more than one screen. In our OrderDetails screen example, the POS app presents high-level payment information about the order in the orderDetails screen (such as: subtotal, discount, total, etc.), but order can have a longer payment history (including change, failed transactions, etc.). Order history is presented in sub screens if the user selects to see that information. Assuming we only call one query to get all the information, we can have two fragments: OrderPayments, OrderHistory.

See fragments below:

Defining these fragments makes it easier to pass the data around and it does not affect the performance or cost of query. We are going to talk more about query cost later.

Customize Query Response to Benefit your App’s UX

With GraphQL you are able to customize your query/mutation response for the benefit of your application UI. If you have used REST API for a mobile application before you will appreciate the power that GraphQL can bring into your app. For example, after calling a mutation on an Order object, you can define the response of the mutation call with the fields you need to build your next screen. If the mutation is adding a lineItem to an order object and your next screen is to show the total price of the order, you can define the response object to include the totalPrice field on order so you can easily build your UI without having to fetch the updated order object. See mutation example below:

This flexibility is not possible with REST API without asking the server team to change the response object for the specific REST API endpoint.

Use Aliases to Have Readable Data Models Based on your App’s UI

If you are building the UI based on directly using the GraphQL objects, you can use aliases to rename the fields anything you want. A small tip about using aliases is that you can use aliases to rename a field but also if you add an extension to the object you can have the original field’s name as a new variable with added logic. See example below:

Use Directives and Add Logic to Your Query

Directives are mentioned in GraphQL documentation as a way to avoid string manipulation for server side code, but it also has advantages for a mobile client. For example, for the Order details screen, POS needs different fields on order based on the type of an order. If order is a pickup order, the OrderDetails screen needs more information about fulfillments and does not need information about shipping details. With directives you can pass boolean variables from your UI to the query to include or skip fields. See below query example:

We can add directives on fragments or fields. This enables mobile applications to fetch only the data that the UI needs and not more than that. This flexibility isn’t possible with REST API endpoints without having two different endpoints and having code in the mobile app codebase to switch between endpoints based on the boolean variable.

GraphQL Performance

GraphQL gives all the power and simplicity to your mobile application and some work is now transferred to the server-side to give clients the flexibility. On the client side, we have to consider the costs of a query we build. The cost of the query affects performance directly as it affects the responsiveness of your application and the resources on the server. This is not something that is usually mentioned when talking about GraphQL, but at Shopify we care about performance on both client-side and server-side.

Different GraphQL servers might have different API rate limiting methods. At Shopify, calls to GraphQL APIs are limited based on calculated query cost, which means the cost of query per minute and is more important than the number of query calls per minute. Each field in the schema has an integer cost value, and the sum of all these costs will be the cost of the query we build on the client side.

In simple words, each user has a bucket of maximum query cost per minute and each second the bucket will be refilled after each query execution. Obviously, complex queries will take up a proportionally larger amount of that bucket. To be able to start an execution of a query bucket app should have enough room for the complexity of the request query. That is the reason why on the client side we should care about our calculated query cost. There are tips and ways to improve the query cost in general, as described here.

Future of GraphQL

GraphQL is more than just a graph query language. It’s language-independent and flexible to serve any platform’s needs. It is built to serve clients where network bandwidth, latency and UX is critical. We mentioned the pain points when using REST in mobile applications and how GraphQL can address many of those concerns. GraphQL allows you to build whatever you need for the client and fulfill it in your own way. GraphQL is already an immense move forward from REST API design, addressing directly the models of data that need to be transferred between each client and server to do the job. At Shopify, we believe in the future of GraphQL and that is why Shopify has offered APIs in GraphQL since 2018.

Mary is a senior developer in Retail at Shopify. She has tons of experience in Swift and iOS development in general, and has been coding Swift since 2014. She's contributed to a variety of apps and since joining Shopify she's been on the Point Of Sale (POS) app team. She recently switched to React Native and started learning JavaScript and React. If you want to connect with Mary, check her out on Twitter.

Continue reading

Apache Beam for Search: Getting Started by Hacking Time

Apache Beam for Search: Getting Started by Hacking Time

To create relevant search, processing clickstream data is key: you frequently want to promote search results that are being clicked on and purchased, and demote those things users don’t love.

Typically search systems think of processing clickstream data as a batch job run over historical data, perhaps using a system like Spark. But on Shopify’s Discovery team, we ask the question: What if we could auto-tune relevance in real-time as users interact with search results—not having to wait days for a large batch job to run?

At Shopify—this is what we’re doing! We’re using streaming data processing systems that can process both real-time and historic data to enable real-time use cases ranging from simple auto boosting or down boosting of documents, to computing aggregate click popularity statistics, building offline search evaluation sets, and on to more complex reinforcement learning tasks.

But this article is introducing you to the streaming system themselves. In particular, to Apache Beam. And the most important thing to think about is time with those streaming systems. So let’s get started!

What Exactly is Apache Beam?

Apache Beam is a unified batch and stream processing system. This lets us potentially unify historic and real-time views of user search behaviors in one system. Instead of a batch system, like Spark, to churn over months of old data, and a separate streaming system, like Apache Storm, to process the live user traffic, Beam hopes to keep these workflows together.

For search, this is rather exciting. It means we can build search systems that both rely on historic search logs while perhaps being able to live-tune the system for our users’ needs in various ways.

Let’s walk through an early challenge everyone faces with Beam: that of time! Beam is a kind of time machine that has to reorder events in their right spot after getting annoyingly delayed by lots of intermediate processing and storage step. This is one of the core complications of a streaming system - how long do we wait? How do we deal with late or out of order data?

So to get started with Beam, the first thing you’ll need to do is Hack Time!

The Beam Time Problem

At the core of Apache Beam are pipelines. They connect a source through various processing steps to finally a sink.  

Data flowing through a pipeline is timestamped. When you consider a streaming system, this makes sense. We have various delays as events flow from browsers, through APIs, and other data systems. Finally the events arrive at our Beam pipeline. They can easily be out-of-order or delayed. Beam source APIs, like the one for Kafka, maintain a moving view of the event data to emit well-ordered events known as a watermark.

If we don’t give our Beam source good information on how to build a timestamp, we’ll drop events or receive them in the wrong order. But even more importantly for search, we likely must combine different streams of data to build a single view on a search session or query, like below:

combine different streams of data to build a single view on a search session or query, like below

Joining (a Beam topic for another day!) needs to look back over each source’s watermark and ensure they’re aligned in time before deciding that sufficient time has elapsed before moving on. But before you get to the complexities of streaming joins, replaying with accurate timestamps is the first milestone on your Beam-for-clickstream journey.

Configuring the Timestamp Right at the Source

Let’s set up a simple Beam pipeline to explore Beam. Here we’ll use Kafka in Java as an example. You can see the full source code in this gist.

Here we’ll set up a Kafka source, the start of a pipeline producing a custom SearchQueryEvent stored in a search_queries_topic.

You’ll notice we have information on the topic/servers to retrieve the data, along with how to deserialize the underlying binary data. We might add further processing steps to transform or process our SearchQueryEvents, eventually sending the final output to another system.

But nothing about time yet. By default, the produced SearchQueryEvents will use Kafka processing time. That is, when they’re read from Kafka. This is the least interesting for our purposes. We care about when users actually searched and clicked on results.

More interesting is when the event was created in a Kafka client. Which we can add here:


You’ll notice above, when we use create time below, we need to give the source’s Watermark a tip for how out of order event times might be. For example, below we instruct the Kafka source to use create time, but with a possible 5 minutes of discrepancy. 

Appreciating The Beam Time Machine

Let’s reflect on what such a 5 minute possible delay actually means from the last snippet. Beam is kind of a time machine… How Beam bends space-time is where your mind can begin to hurt.

As you might be picking up, event time  is quite different from processing time! So in the code snippet above, we’re *not* telling the computer to wait for 5 minutes of execution time for more data. No, the event time might be replayed from historical data, where 5 minutes of event time is replayed through our pipeline in mere milliseconds. Or it could be event time is really now, and we’re actively streaming live data for processing. So we DO indeed wait 5 real minutes! 

Let’s take a step back and use a silly example to understand this. It’s really crucial to your Beam journey. 

Imagine we’re super-robot androids that can watch a movie at 1000X speed. Maybe like Star Trek The Next Generation’s Lt Commander Data. If you’re unfamiliar, he could process input as fast as a screen could display! Data might say “Hey look, I want to watch the classic 80s movie, The Goonies, so I can be a cultural reference for the crew of the Enterprise.” 

Beam is like watching a movie in super-fast forward mode with chunks of the video appearing possibly delayed or out of order relative to other chunks in movie time. In this context we have two senses of time:

  • Event Time: the timestamp in the actual 1h 55 minute runtime of The Goonies aka movie time.
  • Processing Time: the time we actually experience The Goonies (perhaps just a few minutes if we’re super-robot androids like Data).

So Data tells the Enterprise computer “Look, play me The Goonies as fast as you can recall it from your memory banks.” And the computer has various hiccups where certain frames of the movie aren’t quite getting to Data’s screen to keep the movie in order. 

Commander Data can tolerate missing these frames. So Data says “Look, don’t wait more than 5 minutes in *movie time* (aka event time) before just showing me what you have so far of that part of the movie. This lets Data watch the full movie in a short amount of time, dropping a tolerable number of movie frames.

This is just what Beam is doing with our search query data. Sometimes it’s replaying days worth of historic search data in milliseconds, and other times we’re streaming live data where we truly must wait 5 minutes for reality to be processed. Of course, the right delay might not be 5 minutes, it might be something else appropriate to our needs. 

Beam has other primitives such as windows which further inform, beyond the source, how data should be buffered or collected in units of time. Should we collect our search data in daily windows? Should we tolerate late data? What does subsequent processing expect to work over? Windows also work with the same time machine concepts that must be appreciated deeply to work with Beam.

Incorporating A Timestamp Policy

Beam might know a little about Kafka, but it really doesn’t know anything about our data model. Sometimes we need even more control over the definition of time in the Beam time machine.

For example, in our previous movie example, movie frames perhaps have some field informing us of how they should be arranged in movie time. If we examine our SearchQueryEvent, we also see a specific timestamp embedded in the data itself:

public class SearchQueryEvent {

   public final String queryString;

   public final Instant searchTimestamp;


Well Beam sources can often be configured to use a custom event time like our searchTimestamp. We just need to make a TimestampPolicy. We simply provide a simple function-class that takes in our record (A key-value of Long->SearchQueryEvent) and returns a timestamp:

We can use this to create our own timestamp policy:

Here, we’ve passed in our own function, and we’ve given the same allowed delay (5 minutes). This is all wrapped up in a factory class TimestampPolicyFactory SearchQueryTimestampPolicyFactory (now if that doesn’t sound like a Java class name, I don’t know what does ;) )

We can add our timestamp policy to the builder:

.withTimestampPolicyFactory(new SearchQueryTimestampPolicyFactory())

Hacking Time!

Beam is about hacking time, I hope you’ve appreciated this walkthrough of some of Beam’s capabilities. If you’re interested in joining me on building Shopify’s future in search and discovery, please check out these great job postings!

Doug Turnbull is a Sr. Staff Engineer in Search Relevance at Shopify. He is known for writing the book “Relevant Search”, contributing to “AI Powered Search”, and creating relevance tooling for Solr and Elasticsearch like Splainer, Quepid, and the Elasticsearch Learning to Rank plugin. Doug’s team at Shopify helps Merchants make their products and brands more discoverable. If you’d like to work with Doug, send him a Tweet at @softwaredoug!

Continue reading

How Shopify Uses WebAssembly Outside of the Browser

How Shopify Uses WebAssembly Outside of the Browser

On February 24, 2021, Shipit!, our monthly event series, presented Making Commerce Extensible with WebAssembly. The video is now available.

At Shopify we aim to make what most merchants need easy, and the rest possible. We make the rest possible by exposing interfaces to query, extend and alter our Platform. These interfaces empower a rich ecosystem of Partners to solve a variety of problems. The primary mechanism of this ecosystem is an “App”, an independently hosted web service which communicates with Shopify over the network. This model is powerful, but comes with a host of technical issues. Partners are stretched beyond their available resources as they have to build a web service that can operate at Shopify’s scale. Even if Partners’ resources were unlimited, the network latency incurred when communicating with Shopify precludes the use of Apps for time sensitive use cases.

We want Partners to focus on using their domain knowledge to solve problems, and not on managing scalable web services. To make this a reality we’re keeping the flexibility of untrusted Partner code, but executing it on our own infrastructure. We choose a universal format for that code that ensures it’s performant, secure, and flexible: WebAssembly.


What is WebAssembly? According to

“WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.”

To learn more, see this series of illustrated articles written by Lin Clark of Mozilla with information on WebAssembly and its history.

Wasm is often presented as a performant language that runs alongside JavaScript from within the Browser. We, however, execute Wasm outside of the browser and with no Javascript involved. Wasm, far from being solely a Javascript replacement, is designed for Web and Non-Web Embeddings alike. It solves the more general problem of performant execution in untrusted environments, which exists in browsers and code execution engines alike. Wasm satisfies our three main technical requirements: security, performance, and flexibility.


Executing untrusted code is a dangerous thing—it's exceptionally difficult to predict by nature, and it has potential to cause harm to Shopify’s platform at large. While no application is entirely secure, we need to both prevent security flaws and mitigate their impacts when they occur.

Wasm executes within a sandboxed stack-based environment, relying upon explicit imports to allow communication with the host. Because of this, you cannot express anything malicious in Wasm. You can only express manipulations of the virtual environment and use provided imports. This differs from bytecodes which have references to the computers or operating systems they expect to run on built right into the syntax.

Wasm also hosts a number of features which protect the user from buggy code, including protected call stacks and runtime type checking. More details on the security model of Wasm can be found on


In ecommerce, speed is a competitive advantage that merchants need to drive sales. If a feature we deliver to merchants doesn’t come with the right tradeoff of load times to customization value, then we may as well not deliver it at all.

Wasm is designed to leverage common hardware capabilities that provide it near native performance on a wide variety of platforms. It’s used by a community of performance driven developers looking to optimize browser execution. As a result, Wasm and surrounding tooling was built, and continues to be built, with a performance focus.


A code execution service is only as useful as the developers using it are productive. This means providing first class development experiences in multiple languages they’re familiar with. As a bytecode format, Wasm is targeted by a number of different compilers. This allows us to support multiple languages for developer use without altering the underlying execution model.

Community Driven

We have a fundamental alignment in goals and design, which provides our “engineering reason” for using Wasm. But there’s more to it than that—it’s about the people as well as the technology. If nobody was working on the Wasm ecosystem, or even if it was just on life support in its current state, we wouldn’t use it. WebAssembly is an energized community that’s constantly building new things and has a lot of potential left to reach. By becoming a part of that community, Shopify stands to gain significantly from that enthusiasm.

We’re also contributing to that enthusiasm ourselves. We’re collecting user feedback, discussing feature gaps, and most importantly contributing to the open source tools we depend on. We think this is the start of a healthy reciprocal relationship between ourselves and the WebAssembly community, and we expect to expand these efforts in the future.

Architecture of our Code Execution Service

Now that we’ve covered WebAssembly and why we’re using it, let’s move onto how we’re executing it.

We use an open source tool called Lucet (originally written by Fastly). As a company, Fastly provides a programmable edge cloud platform. They’re trying to bring execution of high-volume, short-lived, and untrusted modules closer to where they’re being requested. This is the same as the problem we’re trying to solve with our Partner code, so it’s a natural fit to be using the same tools.


Lucet is both a runtime and a compiler for Wasm. Modules are represented in Wasm for the safety that representation provides. Recall that you can’t express anything malicious in Wasm. Lucet takes advantage of this and uses a validation of the Wasm module as a security check. After the validation, the module is compiled to an executable artifact with near bare metal performance. It also supports ahead of time compilation, allowing us to have these artifacts ready to execute at runtime. Lucet containers boast an impressive startup time of 35 μs. That’s because it’s a container that doesn’t need to do anything at all to start up.  If you want the full picture, Tyler McMullan, the CTO of Fastly, did a great talk which gives an overview of Lucet and how it works.

A flow diagram showing how Shopify uses our Wasm engine: Lucet wrapped within a Rust web service which manages the I/O and storage of modules
A flow diagram showing Shopify's Wasm engine

We wrap Lucet within a Rust web service which manages the I/O and storage of modules, which we call the Wasm Engine. This engine is called by Shopify during a runtime process, usually a web request, in order to satisfy some function. It then applies the output in context of the callsite. This application could involve the creation of a discount, the enforcement of a constraint, or any form of synchronous behaviour Merchants want to customize within the Platform.

Execution Performance

Here’s some metrics pulled from a recent performance test. During this test, 100k modules were executed per minute for approximately 5 min. These modules contained a trivial implementation of enforcing a limit on the number of items purchased in a cart. 

A line graph showcasing the time taken to execute a module. The x axis representing the time over the test was running and the y axis is the time represented in ms
Time taken to execute a module

This chart demonstrates a breakdown of the time taken to execute a module, including I/O with the container and the execution of the module. The y-axis is time in ms, the x-axis is the time over which the test was running.

The light purple bar shows the time taken to execute the module in Lucet, the width of which hovers around 100 μs. The remaining bars deal with I/O and engine specifics, and the total time of execution is around 4 ms. All times are 99th percentiles (p99).To put these times in perspective, let’s compare these times to the request times of Storefront Renderer, our performant Online Store rendering service:

A line graph showing Storefront Renderer Response time
Storefront Renderer response time

This chart demonstrates the request time to Storefront Renderer over time. The y-axis is request time in seconds. The x-axis is the time over which the values were retrieved. The light blue line representing the 99th percentile hovers around 700 ms.

Then if we consider the time taken by our module execution process to be generally under 5 ms, we can say that the performance impact of Lucet execution is negligible.

Generating WebAssembly

To get value out of our high performance execution engine, we’ll need to empower developers to create compatible Wasm modules. Wasm is primarily intended as a compilation target, rather than something you write by hand (though you can write Wasm by hand). This leaves us with the question of what languages we’ll support and to what extent.

Theoretically any language with a Wasm target can be supported, but the effort developers spend to conform to our API is better focused on solving problems for merchants. That’s why we’ve chosen to provide first class support to a single language that includes tools that get developers up and running quickly.At Shopify, our language of choice is Ruby. However, because Ruby is a dynamic language, we can’t compile it down to Wasm directly. We explored solutions involving compiling interpreters, but found that there was a steep performance penalty. Because of this, we decided to go with a statically compiled language and revisit the possibility of dynamic languages in the future.

Through our research we found that developers in our ecosystem were most familiar with Javascript. Unfortunately, Javascript was precluded as it’s a dynamic language like Ruby. Instead, we chose a language with familiar TypeScript-like syntax called AssemblyScript.

Using AssemblyScript

At first glance, there are a huge number of languages that support a WebAssembly target. Unfortunately, there are two broad categories of WebAssembly compilers which we can’t use:

  • Compilers that generate environment or language specific artifacts, namely node or the browser. (Examples: Asterius, Blazor)
  • Compilers that are designed to work only with a particular Runtime. The modules generated by these compilers rely upon special language specific imports. This is often done to support a language’s standard library, which expects certain system calls or runtime features to be available. Since we don’t want to be locked down to a certain language or tool, we don’t use these compilers. (Examples: Lumen)

These are powerful tools in the right conditions, but aren’t built for our use case. We need tools that produce WebAssembly, rather than tools which are powered by WebAssembly. AssemblyScript is one such tool.

AssemblyScript, like many tools in the WebAssembly space, is still under development. It’s missing a few key features, such as closure support, and it still has a number of edge case bugs. This is where the importance of the community comes in.

The language and the tooling around AssemblyScript has an active community of enthusiasts and maintainers who have supported Shopify since we first started using the language in 2019. We’ve supported the community through an OpenCollective donation and continuing code contributions. We’ve written a language server, made some progress towards implementing closures, and have written bug fixes for the compiler and surrounding tooling.

We’ve also integrated AssemblyScript into our own early stage tooling. We’ve built integrations into the Shopify CLI which will allow developers to create, test, and deploy modules from their command line. To improve developer ergonomics, we provide SDKs which handle the low level implementation concerns of Shopify defined objects like “Money”. In addition to these tools, we’re building out systems which allow Partners to monitor their modules and receive alerts when their modules fail. The end goal is to give Partners the ability to move their code onto our service without losing any of the flexibility or observability they had on their own platform.

New Capabilities, New Possibilities

As we tear down the boundaries between Partners and Merchants, we connect merchants with the entrepreneurs ready to solve their problems. If you have ideas on how our code execution could help you and the Apps you own or use, please tweet us at @ShopifyEng. To learn more about Apps at Shopify and how to get started, visit our developer page.

Duncan is a Senior Developer at Shopify. He is currently working on the Scripts team, a team dedicated to enabling and managing untrusted code execution within Shopify for Merchants and Developers alike.

Shipit! Presents: Making Commerce Extensible with WebAssembly


If you love working with open source tools, are passionate about API design and extensibility, and want to work remotely, we’re always hiring! Reach out to us or apply on our careers page.

Continue reading

Simplify, Batch, and Cache: How We Optimized Server-side Storefront Rendering

Simplify, Batch, and Cache: How We Optimized Server-side Storefront Rendering

On December 16, 2020 we held Shipit! presents: Performance Tips from the Storefront Renderer Team. A video for the event is now available for you to learn more about how the team optimized this Ruby application for the particular use case of serving storefront traffic. Click here to watch the video.

By Celso Dantas and Maxime Vaillancourt

In the previous post about our new storefront rendering engine, we described how we went about the rewrite process and smoothly transitioned to serve storefront requests with the new implementation. As a follow-up and based on readers’ comments and questions, this post dives deeper into the technical details of how we built the new storefront rendering engine to be faster than the previous implementation.

To set the table, let’s see how the new storefront rendering engine performs:

  • It generates a response in less than ~45ms for 75% of storefront requests;
  • It generates a response in less than ~230ms for 90% of storefront requests;
  • It generates a response in less than ~900ms for 99% of storefront requests.

Thanks to the new storefront rendering engine, the average storefront response is nearly 5x faster than with the previous implementation. Of course, how fast the rendering engine is able to process a request and spit out a response depends on two key factors: the shop’s Liquid theme implementation, and the number of resources needed to process the request. To get a better idea of where the storefront rendering engine spends its time when processing a request, try using the Shopify Theme Inspector: this tool will help you identify potential bottlenecks so you can work on improving performance in those areas.

A data scheme diagram showing that the Storefront Renderer and Redis instance are contained in a Kubernetes node. The Storefront Renderer sends Redis data. The Storefront Renderer sends data to two sharded data stores outside of the Kubernetes node: Sharded MySQL and Sharded Redis
A simplified data schema of the application

Before we cover each topic, let’s briefly describe our application stack. As mentioned in the previous post, the new storefront rendering engine is a Ruby application. It talks to a sharded MySQL database and uses Redis to store and retrieve cached data.

Optimizing how we load all that data is extremely important. As one of our requirements was to improve rendering time for Storefront requests. Here are some of the approaches that we took to accomplish that.

Using MySQL’s Multi-statement Feature to Reduce Round Trips

To reduce the number of network round trips to the database, we use MySQL’s multi-statement feature to allow sending multiple queries at once. With a single request to the database, we can load data from multiple tables at once. Here’s a simplified example:

This request is especially useful to batch-load a lot of data very early in the response lifecycle based on the incoming request. After identifying the type of request, we trigger a single multi-statement query to fetch the data we need for that particular request in one go, which we’ll discuss later in this blog post. For example, for a request for a product page, we’ll load data for the product, its variants, its images, and other product-related resources in addition to information about the shop and the storefront theme, all in a single round-trip to MySQL.

Implementing a Thin Data Mapping Layer

As shown above, the new storefront rendering engine uses handcrafted, optimized SQL queries. This allows us to easily write fine-tuned SQL queries to select only the columns we need for each resource and leverage JOINs and sub-SELECT statements to optimize data loading based on the resources to load which are sometimes less straightforward to implement with a full-service object-relational mapping (ORM) layer.

However, the main benefit of this approach is the tiny memory footprint of using a raw MySQL client compared to using an object-relational mapping (ORM) layer that’s unnecessarily complex for our needs. Since there’s no unnecessary abstraction, forgoing the use of an ORM drastically simplifies the flow of data. Once the raw rows come back from MySQL, we effectively use the simplest ORM possible: we create plain old Ruby objects from the raw rows to model the business domain. We then use these Ruby objects for the remainder of the request. Below is an example of how it’s done.

Of course, not using an ORM layer comes with a cost: if implemented poorly, this approach can lead to more complexity leaking into the application code. Creating thin model abstractions using plain old Ruby objects prevents this from happening, and makes it easier to interact with resources while meeting our performance criteria. Of course, this approach isn’t particularly common and has the potential to cause panic in software engineers who aren’t heavily involved in performance work, instead worrying about schema migrations and compatibility issues. However, when speed is critical, we accept to take on that complexity.

Book-keeping and Eager-loading Queries

An HTTP request for a Shopify storefront may end up requiring many different resources from data stores to render properly. For example, a request for a product page could lead to requiring information about other products, images, variants, inventory information, and a whole lot of other data not loaded on multi-statement select. The first time the storefront rendering engine loads this page, it needs to query the database, sometimes making multiple requests, to retrieve all the information it needs. This usually happens during the request at any given time.

A flow diagram showing the Storefront Renderer's requests from  the data stores and how it uses a Query Book Keeper Middlewear to eager-load data
Flow of a request with the Book-keeping solution

As it retrieves this data for the first time, the storefront rendering engine keeps track of the queries it performed on the database for that particular product page and stores that list of queries in a key-value store for later use. When an HTTP request for the same product page comes in later (which it knows when the cache key matches), the rendering engine looks up the list of queries it performed throughout the previous request of the same type and performs those queries all at once, at the very beginning of the current request, because we’re pretty confident we’ll need them for this request (since they were used in the previous request).

This book-keeping mechanism lets us eager-load data we’re pretty confident we’ll need. Of course, when a page changes, this may lead to over-fetching and/or under-fetching, which is expected, and the shape of the data we fetch stabilizes quickly over time as more requests come in.

On the other side, some liquid models of Shopify’s storefronts are not accessed as frequently, and we don’t need to eager-load data related to them. If we did, we’d increase I/O wait time for something that we probably wouldn’t use very often. What the new rendering engine does instead is lazy-load this data by default. Unless the book-keeping mechanism described above eager-loads it, we’ll defer retrieving data to only load it if it’s needed for a particular request.

Implementing Caching Layers

Much like a CPU’s caching architecture, the new rendering engine implements multiple layers of caching to accelerate responses.

A critical aside before we jump into this section: adding caching should never be the first step towards building performance-oriented software. Start by building a solution that’s extremely fast from the get go, even without caching. Once this is achieved, then consider adding caching to reduce load on the various components on the system while accelerating frequent use cases. Caching is like a sharp knife and can introduce hard to detect bugs.

In-Memory Cache

A data scheme diagram showing that the Storefront Renderer and Redis instance are contained in a Kubernetes node. Within the Storefront Renderer is an In-memory cache. The Storefront Renderer sends Redis data. The Storefront Renderer sends data to two sharded data stores outside of the Kubernetes node: Sharded MySQL and Sharded Redis
A simplified data schema of the application with an in-memory cache for the Storefront Renderer

At the frontline of our caching system is an in-memory cache that you can essentially think of as a global hash that’s shared across requests within each web worker. Much like the majority of our caching mechanisms, this caching layer uses the LRU caching algorithm. As a result, we use this caching layer for data that’s accessed very often. This layer is especially useful in high throughput scenarios such as flash sales.

Node-local Shared Caching

As a second layer on top of the in-memory cache, the new rendering engine leverages a node-local Redis store that’s shared across all server workers on the same node. Since the database is available on the same machine as the rendering engine process itself, this node-local data transfer prevents network overhead and improves response times. As a result, multiple Ruby processes benefit from sharing cached data with one another.

Full-page Caching

Once the rendering engine successfully renders a full storefront response for a particular type of request, we store the final output (most often an HTML or JSON string) into the local Redis for later retrieval for subsequent requests that match the same cache key. This full-page caching solution lets us prevent regenerating storefront responses if we can by using the output we previously computed.

Database Query Results Caching

In a scenario where the full-page output cache, the in-memory cache, and the node-local cache doesn’t have a valid entry for a given request, we need to reach all the way to the database. Once we get a result back from MySQL, we transparently cache the results in Redis for later retrieval based on the queries and their parameters. As long as the cache keys don’t change, running the same database queries over and over always hit Redis instead of reaching all the way to the database.

Liquid Object Memoizer

Thanks to the Liquid templating language, merchants and partners may build custom storefront themes. When loading a particular storefront page, it’s possible that the Liquid template to render includes multiple references to the same object. This is common on the product page for example, where the template will include many references to the product object:
{{ product.title }}, {{ product.description }}, {{ product.featured_media }}, and others.

Of course, when each of these are executed, we don’t fetch the product over and over again from the database—we fetch it once, then keep it in memory for later use throughout the request lifecycle. This means that if the same product object is required multiple times at different locations during the render process, we’ll always use the same one and only instance of it throughout the entire request lifecycle.

The Liquid object memoizer is especially useful when multiple different Liquid objects end up loading the same resource. For example, when loading multiple product objects on a collection page using {{ collection.products }} and then referring to a particular product using {{ all_products[‘cowboy-hat’] }} on a collection page, with the Liquid object memoizer we’ll load it from an external data store once, then store it in memory and fetch it from there if it’s needed later. On average, across all Shopify storefronts, we see that the Liquid object memoizer prevents between 16 and 20 accesses to Redis and/or MySQL for every single storefront request, where we leverage the in-memory cache instead. In some extreme cases, we see that the memoizer prevents up to 4,000 calls to data stores per request.

Reducing Memory Allocations

Writing Memory-aware Code

Garbage collection execution is expensive. So we write code that doesn’t generate unnecessary objects. Use of methods and algorithms that modify objects in place, instead of generating a new object. For example:

  • use map! instead of map when dealing with lists. It prevents a new Array object from being created.
  • Use string interpolation instead of string concatenation. Interpolation does not create intermediate unnecessary String objects.

This may not seem like much, but consider this: using #map! instead of #map could reduce your memory usage significantly, even when simply looping over an array of integers to double the values.

Let’s set up an following array of 1000 integers from 1 to 1000:

array = (1..1000).to_a

Then, let’s double each number in the array with Array#map: { |i| i * 2 }

The line above leads to one object allocated in memory, for a total of 8040 bytes.

Now let’s do the same thing with Array#map! instead:! { |i| i * 2 }

The line above leads to zero object allocated in memory, for a total of 0 bytes.

Even with this tiny example, using map! instead of map saves ~8 kilobytes of allocated memory, and considering the sheer scale of the Shopify platform and the storefront traffic throughput it receives, every little bit of memory optimization counts to help the garbage collector run less often and for smaller periods of time, thus improving server response times.

With that in mind, we use tracing and profiling tools extensively to dive deeper into areas in the rendering engine that are consuming too much memory and to make precise changes to reduce memory usage.

Method-specific Memory Benchmarking

To prevent accidentally increasing memory allocations, we built a test helper method that lets us benchmark a method or a block to know many memory allocations and allocated bytes it triggers. Here’s how we use it:

This benchmark test will succeed if calling Product.find_by_handle('cowboy-hat') matches the following criteria:

  • The call allocates between 48 and 52 objects in memory;
  • The call allocates between 5100 and 5200 bytes in memory.

We allow a range of allocations because they’re not deterministic on every test run. This depends on the order in which tests run and the way data is cached, which can affect the final number of allocations.

As such, these memory benchmarks help us keep an eye on memory usage for specific methods. In practice, they’ve prevented introducing inefficient third-party gems that bloat memory usage, and they’ve increased awareness of memory usage to developers when working on features.

We covered three main ways to improve server-side performance: batching up calls to external data stores to reduce roundtrips, caching data in multiple layers for specific use cases, and simplifying the amount of work required to fulfill a task by reducing memory allocations. When they’re all combined, these approaches lead to big time performance gains for merchants on the platform—the average response time with the new rendering engine is 5x faster than with the previous implementation. 

Those are just some of the techniques that we are using to make the new application faster. And we never stop exploring new ways to speed up merchant’s storefronts. Faster rendering times are in the DNA of our team!

- The Storefront Renderer Team

Celso Dantas is a Staff Developer on the Storefront Renderer team. He joined Shopify in 2013 and has worked on multiple projects since then. Lately specializing in making merchants storefront faster.

Maxime Vaillancourt is a Senior Developer on the Storefront Rendering team. He has been at Shopify for 3 years, starting on the Online Store Themes team, then specializing towards storefront performance with the storefront rendering engine rewrite.

Shipit! Presents: Performance Tips from the Storefront Renderer Team

The Storefront Renderer is a server-side application that loads a Shopify merchant's storefront Liquid theme, along with the data required to serve the request (for example product data, collection data, inventory information, and images), and returns the HTML response back to your browser. On average, server response times for the Storefront Renderer are four times faster than the implementation it replaced.

Our blog post, How Shopify Reduced Storefront Response Times with a Rewrite generated great discussions and questions. This event looks to answer those questions and dives deeper into the technical details of how we made the Storefront Renderer engine faster.

​​​​​​​During this event you will learn how we:
  • optimized data access
  • implemented caching layers
  • reduced memory allocations

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Resiliency Planning for High-Traffic Events

Resiliency Planning for High-Traffic Events

On January 27, 2021 Shipit!, our monthly event series, presented Building a Culture of Resiliency at Shopify. Learn about creating and maintaining resiliency plans for large development teams, testing and tooling, developing incident strategies, and incorporating and improving feedback loops. The video is now available.

Each year, Black Friday Cyber Monday weekend represents the peak of activity for Shopify. Not only is this the most traffic we see all year, but it’s also the time our merchants put the most trust in our team. Winning this weekend each year requires preparation, and it starts as soon as the weekend ends.

Load Testing & Stress Testing: How Does the System React?

When preparing for a high traffic event, load testing regularly is key. We have discussed some of the tools we use already, but I want to explain how we use these exercises to build towards a more resilient system.

While we use these tests to confirm that we can sustain required loads or probe for new system limits, we can also use regular testing to find potential regressions. By executing the same experiments on a regular basis, we can spot any trends at easily handled traffic levels that might spiral into an outage at higher peaks.

This same tool allows us to run similar loads against differently configured shops and look for differences caused by the theme, configuration, and any other dimensions we might want to use for comparison.

Resiliency Matrix: What are Our Failure Modes?

If you've read How Complex Systems Fail, you know that "Complex systems are heavily and successfully defended against failure" and "Catastrophe requires multiple failures - single point failures are not enough.” For that to be true, we need to understand our dependencies, their failure modes, and how those impact the end-user experience.

We ask teams to construct a user-centric resiliency matrix, documenting the expected user experience under various scenarios. For example:

This user-centric resiliency matrix shows the potential failures and their impact on user experience. For example, can a user browse (yes) or check out (no) if MySQL is down.
User-centric resiliency matrix documenting expected user experience and possible failures

The act of writing this matrix serves as a very basic tabletop chaos exercise. It forces teams to consider how well they understand their dependencies and what the expected behaviors are.

This exercise also provides a visual representation of the interactions between dependencies and their failure modes. Looking across rows and columns reveals areas where the system is most fragile. This provides the starting point for planning work to be done. In the above example, this matrix should start to trigger discussion around the ‘User can check out’ experience and what can be done to make this more resilient to a single dependency going ‘down’.

Game Days: Do Our Models Match?

So, we’ve written our resilience matrix. This is a representation of our mental model of the system, and when written, it's probably a pretty accurate representation. However, systems change and adapt over time, and this model can begin to diverge from reality.

This divergence is often unnoticed until something goes wrong, and you’re stuck in the middle of a production incident asking “Why?”. Running a game day exercise allows us to test the documented model against reality and adjust in a controlled setting.

The plan for the game day will derive from the resilience matrix. For the matrix above, we might formulate a plan like:

This game day exercise allows us to test the model against reality and adjust in a controlled setting. This plan lays out scenarios to be tested and how they will be accomplished.
Game day planning scenarios 

Here, we are laying out what scenarios are to be tested, how those will be accomplished, and what we expect to happen. 

We’re not only concerned with external effects (what works, what doesn’t), but internally do any expected alerts fire, are the appropriate on-call teams paged, and do those folks have the information available to understand what is happening?

If we refer back to How Complex Systems Fail, the defences against failure are technical, human, and organizational. On a good game day, we’re attempting to exercise all of these.

  • Do any automated systems engage?
  • Do the human operators have the knowledge, information and tools necessary to intervene?
  • Do the processes and procedures developed help or hinder responding to the outage scenario?

By tracking the actual observed behavior, we can then update the matrix as needed or make changes to the system in order to bring our mental model and reality back into alignment.

Incident Analysis: How Do We Get Better?

During the course of the year, incidents happen which disrupt service in some capacity. While the primary focus is always in restoring service as fast as possible, each incident also serves as a learning opportunity.

This article is not about why or how to run a post-incident review; there are more than enough well-written pieces by folks who are experts on the subject. But to refer back to How Complex Systems Fail, one of the core tenets in how we learn from incidents is “Post-accident attribution to a ‘root cause’ is fundamentally wrong.”

When focusing on a single root cause, we stop at easy, shallow actions to resolve the ‘obvious’ problem. However, this ignores deeper technical, organizational, and cultural issues that contributed to the issue and will again if uncorrected.

What’s Special About BFCM?

We’ve talked about the things we’re constantly doing, year-round to ensure we’re building for reliability and resiliency and creating an anti-fragile system that gets better after every disruption. So what do we do that’s special for the big weekend?

We’ve already mentioned How Complex Systems Fail several times, but to go back to that well once more, “Change introduces new forms of failure.” As we get closer to Black Friday, we slow down the rate of change.

This doesn’t mean we’re sitting on our hands and hoping for the best, but rather we start to shift where we’re investing our time. Fewer new services and features as we get closer, and more time spent dealing with issues of performance, reliability, and scale.

We review defined resilience matrices carefully, start running more frequent game days and load tests and working on any issues or bottlenecks those reveal. This means updating runbooks, refining internal tools, and shipping fixes for issues that this activity brings to light.

All of this comes together to provide a robust, reliable platform to power over $5.1 billion in sales.

Shipit! Presents: Building a Culture of Resiliency at Shopify

Watch Ryan talk about how we build a culture of resiliency at Shopify to ensure a robust, reliable platform powering over $5.1 billion in sales.


Ryan is a Senior Development Manager at Shopify. He currently leads the Resiliency team, a centralized globally distributed SRE team responsible for keeping commerce better for everyone.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

How to Reliably Scale Your Data Platform for High Volumes

How to Reliably Scale Your Data Platform for High Volumes

By Arbab Ahmed and Bruno Deszczynski

Black Friday and Cyber Monday—or as we like to call it, BFCM—is one of the largest sales events of the year. It’s also one of the most important moments for Shopify and our merchants. To put it into perspective, this year our merchants across more than 175 countries sold a record breaking $5.1+ billion over the sales weekend. 

That’s a lot of sales. That’s a lot of data, too.

This BFCM, the Shopify data platform saw an average throughput increase of 150 percent. Our mission as the Shopify Data Platform Engineering (DPE) team is to ensure that our merchants, partners, and internal teams have access to data quickly and reliably. It shouldn’t matter if a merchant made one sale per hour or a million; they need access to the most relevant and important information about their business, without interruption. While this is a must all year round, the stakes are raised during BFCM.

Creating a data platform that withstands the largest sales event of the year means our platform services need to be ready to handle the increase in load. In this post, we’ll outline the approach we took to reliably scale our data platform in preparation for this high-volume event. 

Data Platform Overview

Shopify’s data platform is an interdisciplinary mix of processes and systems that collect and transform data for use by our internal teams and merchants. It enables access to data through a familiar pipeline:

  • Ingesting data in any format, from any part of Shopify. “Raw” data (for example, pageviews, checkouts, and orders) is extracted from Shopify’s operational tables without any manipulation. Data is then conformed to an Apache Parquet format on disk.
  • Processing data, in either batches or streams, to form the foundations of business insights. Batches of data are “enriched” with models developed by data scientists, and processed within Apache Spark or dbt
  • Delivering data to our merchants, partners, and internal teams so they can use it to make great decisions quickly. We rely on an internal collection of streaming and serving applications, and libraries that power the merchant-facing analytics in Shopify. They’re backed by BigTable, GCS, and CloudSQL.

In an average month, the Shopify data platform processes about 880 billion MySQL records and 1.75 trillion Kafka messages.

Tiered Services

As engineers, we want to conquer every challenge right now. But that’s not always realistic or strategic, especially when not all data services require the same level of investment. At Shopify, a tiered services taxonomy helps us prioritize our reliability and infrastructure budgets in a broadly declarative way. It’s based on the potential impact to our merchants and looks like this:

Tier 1

This service is critical externally, for example. to a merchant’s ability to run their business

Tier 2

This service is critical internally to business functions, e.g. a operational monitoring/alerting service

Tier 3

This service is valuable internally, for example, internal documentation services

Tier 4

This service is an experiment, in very early development, or is otherwise disposable. For example, an emoji generator

The highest tiers are top priority. Our ingestion services, called Longboat and Speedboat, and our merchant-facing query service Reportify are examples of services in Tier 1.

The Challenge 

As we’ve mentioned, each BFCM the Shopify data platform receives an unprecedented volume of data and queries. Our data platform engineers did some forecasting work this year and predicted nearly two times the traffic of 2019. The challenge for DPE is ensuring our data platform is prepared to handle that volume. 

When it comes to BFCM, the primary risk to a system’s reliability is directly proportional to its throughput requirements. We call it throughput risk. It increases the closer you get to the front of the data pipeline, so the systems most impacted are our ingestion and processing systems.

With such a titillating forecast, the risk we faced was unprecedented throughput pressure on data services. In order to be BFCM ready, we had to prepare our platform for the tsunami of data coming our way.

The Game Plan

We tasked our Reliability Engineering team with Tier 1 and Tier 2 service preparations for our ingestion and processing systems. Here’s the steps we took to prepare our systems most impacted by BFCM volume:

1. Identify Primary Objectives of Services

A data ingestion service's main operational priority can be different from that of a batch processing or streaming service. We determine upfront what the service is optimizing for. For example, if we’re extracting messages from a limited-retention Kafka topic, we know that the ingestion system needs to ensure, above all else, that no messages are lost in the ether because they weren’t consumed fast enough. A batch processing service doesn’t have to worry about that, but it may need to prioritize the delivery of one dataset versus another.

In Longboat’s case, as a batch data ingestion service, its primary objective is to ensure that a raw dataset is available within the interval defined by its data freshness service level objective (SLO). That means Longboat is operating reliably so long as every dataset being extracted is no older than eight hours— the default freshness SLO. For Reportify, our main query serving service, its primary objective is to get query results out as fast as possible; its reliability is measured against a latency SLO.

2. Pinpoint Service Knobs and Levers

With primary objectives confirmed, you need to identify what you can “turn up or down” to sustain those objectives.

In Longboat’s case, extraction jobs are orchestrated with a batch scheduler, and so the first obvious lever is job frequency. If you discover a raw production dataset is stale, it could mean that the extraction job simply needs to run more often. This is a service-specific lever.

Another service-specific lever is Longboat’s “overlap interval” configuration, which configures an extraction job to redundantly ingest some overlapping span of records in an effort to catch late-arriving data. It’s specified in a number of hours.

Memory and CPU are universal compute levers that we ensure we have control of. Longboat and Reportify run on Google Kubernetes Engine, so it’s possible to demand that jobs request more raw compute to get their expected amount of work done within their scheduled interval (ignoring total compute constraints for the sake of this discussion).

So, in pursuit of data freshness in Longboat, we can manipulate:

  1. Job frequency
  2. Longboat overlap interval
  3. Kubernetes Engine Memory/CPU requests

In pursuit of latency in Reportify, we can turn knobs like its:

  1. BigTable node pool size 
  2. ProxySQL connection pool/queue size

3. Run Load Tests!

Now that we have some known controls, we can use them to deliberately constrain the service’s resources. As an example, to simulate an unrelenting N-times throughput increase, we can turn the infrastructure knobs so that we have 1/N the amount of compute headroom, so we’re at N-times nominal load.

For Longboat’s simulation, we manipulated its “overlap interval” configuration and tripled it. Every table suddenly looked like it had roughly three times more data to ingest within an unchanged job frequency; throughput was tripled.

For Reportify, we leveraged our load testing tools to simulate some truly haunting throughput scenarios, issuing an increasingly extreme volume of queries, as seen here:

A line graph showing streaming service queries per second by source. The graph shows increase in the volume of queries over time during a load test.
Streaming service queries per second metric after the load test

In this graph, the doom is shaded purple. 

Load testing answers a few questions immediately, among others:

  • Do infrastructure constraints affect service uptime? 
  • Does the service’s underlying code gracefully handle memory/CPU constraints?
  • Are the raised service alarms expected?
  • Do you know what to do in the event of every fired alarm?

If any of the answers to these questions leave us unsatisfied, the reliability roadmap writes itself: we need to engineer our way into satisfactory answers to those questions. That leads us to the next step. 

4. Confirm Mitigation Strategies Are Up-to-Date

A service’s reliability depends on the speed at which it can recover from interruption. Whether that recovery is performed by a machine or human doesn’t matter when your CTO is staring at a service’s reliability metrics! After deliberately constraining resources, the operations channel turns into a (controlled) hellscape and it's time to act as if it were a real production incident.

Talking about mitigation strategy could be a blog post on its own, but here are the tenets we found most important:

  1. Every alert must be directly actionable. Just saying “the curtains are on fire!” without mentioning “put it out with the extinguisher!” amounts to noise.
  2. Assume that mitigation instructions will be read by someone broken out of a deep sleep. Simple instructions are carried out the fastest.
  3. If there is any ambiguity or unexpected behavior during controlled load tests, you’ve identified new reliability risks. Your service is less reliable than you expected. For Tier 1 services, that means everything else drops and those risks should be addressed immediately.
  4. Plan another controlled load test and ensure you’re confident in your recovery.
  5. Always over-communicate, even if acting alone. Other engineers will devote their brain power to your struggle.

5. Turn the Knobs Back

Now that we know what can happen with an overburdened infrastructure, we can make an informed decision whether the service carries real throughput risk. If we absolutely hammered the service and it skipped along smiling without risking its primary objective, we can leave it alone (or even scale down, which will have the CFO smiling too).

If we don’t feel confident in our ability to recover, we’ve unearthed new risks. The service’s development team can use this information to plan resiliency projects, and we can collectively scale our infrastructure to minimize throughput risk in the interim.

In general, to be prepared infrastructure-wise to cover our capacity, we perform capacity planning. You can learn more about Shopify’s BFCM capacity planning efforts on the blog.

Overall, we concluded from our results that:

  • Our mitigation strategy for Longboat and Reportify was healthy, needing gentle tweaks to our load-balancing maneuvers.
  • We should scale up our clusters to handle the increased load, not only from shoppers, but also from some of our own fun stuff like the BFCM Live Map.
  • We needed to tune our systems to make sure our merchants could track their online store’s performance in real-time through the Live View in the analytics section of their admin.
  • Some jobs could use some tuning, and some of their internal queries could use optimization.

Most importantly, we refreshed our understanding of data service reliability. Ideally, it’s not any more exciting than that. Boring reliability studies are best.

We hope to perform these exercises more regularly in the future, so BFCM preparation isn’t particularly interesting. In this post we talked about throughput risk as one example, but there are other risks to data integrity, correctness, latency. We aim to get out in front of them too because data grows faster than engineering teams do. “Trillions of records every month” turns into “quadrillions” faster than you expect.

So, How’d It Go?

After months of rigorous preparation systematically improving our indices, schemas, query engines, infrastructure, dashboards, playbooks, SLOs, incident handling and alerts, we can proudly say BFCM 2020 went off without a hitch!

During the big moment we traced down every spike, kept our eyes glued to utilization graphs, and turned knobs from time to time, just to keep the margins fat. There were only a handful of minor incidents that didn’t impact merchants, buyers or internal teams - mainly self healing cases thanks to the nature of our platform and our spare capacity.

This success doesn’t happen by accident, it happens because of diligent planning, experience, curiosity and—most importantly—teamwork.

Arbab is a seven-year veteran at Shopify serving as Reliability Engineering lead. He's previously helped launch Shopify payments, some of the first Shopify public APIs, and Shopify's Retail offerings before joining the Data Platform. 99% of Shopifolk joined after him!
Bruno is a DPE TPM working with the Site Reliability Engineering team. He has a record of 100% successful BFCMs under his belt and plans to keep it that way.

Interested in helping us scale and tackle interesting problems? We’re planning to double our engineering team in 2021 by hiring 2,021 new technical roles. Learn more here!

Continue reading

The State of Ruby Static Typing at Shopify

The State of Ruby Static Typing at Shopify

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything works properly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

Shipit! Presents: The State of Ruby Static Typing at Shopify

On November 25, 2020, Shipit!, our monthly event series, presented The State of Ruby Static Typing at Shopify. Alexandre Terrasa and I talked about the history of static typing at Shopify and our adoption of Sorbet.

We weren't able to answer all the questions during the event, so we've included answers to them below.

What are some challenges with adopting Sorbet? What was some code you could not type?

So far most of our problems are in modules that use ActiveSupport::Concern (many layers deep, even) and modules that assume they will be included in a certain kind of class, but have no way of making that explicit. For example, a module that assumes it will be included in an ActiveRecord model could be calling before_save to add a hook, but Sorbet would have no idea where before_save is defined. We are also looking to make those kinds of dependencies between modules and include sites explicit in Sorbet.

Inclusion requirements is also something we’re trying to fix right now, mostly for our helpers. The problem is explained in the description of this pull-request:

If a method has an array in argument, do you have to specify it is an array of what type? And if not, how do Sorbet makes the method you call on the array's element exists?

It depends on the type of elements inside the array. For simple types like Integer or Foo, you can easily type it as T::Array[Integer] and Sorbet will be able to type check method calls. For more complex types like arrays containing hashes it depends, you may use T::Array[T.untyped] in which case Sorbet won’t be able to check the calls. Using T.untyped you can go as deep and precise you want it to be: T::Array[T::Hash[String, T.untyped]], T::Array[T::Hash[String, T::Array[T.untyped]]], T::Array[T::Hash[String, T::Array[Integer]]] and Sorbet will check the calls on what it knows about. Note that as your type becomes more and more complex, maybe you should start thinking about making a class about it so you can just use T::Array[MyNewClass].

How would you compare the benefits of Sorbet to Ruby relative to the benefits of Typescript to Javascript?

There are similar benefits, but Ruby is a much more dynamic language than JavaScript and Sorbet is a much younger project than TypeScript. So the coverage of Ruby features and expressiveness of the type system of Sorbet lags behind the same benefits that TypeScript brings to JavaScript.On the other hand, Sorbet annotations are pure Ruby. That means you don’t have to learn a new language and you can keep using your existing editors and tooling to work with it. There is also no compilation of Ruby code with types to plain Ruby, like how you need to compile TypeScript to JavaScript. Finally, Sorbet also has a runtime type-checker and it can verify your types and alert you if they don’t check when your application is running, which is a great additional safety that TypeScript does not have.

Could you quickly relate Sorbet with RBS and what is the future of sorbet after Ruby 3.0?

Stripe gave an interesting answer to this question: RBS is about the language to write the signatures, you still need a type checker to check those signatures against your code. We see Sorbet as one of the solution that can use those types, and currently it’s the fastest solution. One limitation of RBS is the lack of inline type annotations, for example there is no syntax to cast a variable to another type. So type checkers have to use additional syntax to make this possible. Even if Sorbet doesn’t support RBS at the moment, it might in the future. And in the case it never happens, remember that it’s easier to go from one type specification to another rather than an untyped codebase to a typed one. So all the efforts are not lost.

Does Tapioca support Enums etc?

Tapioca is able to generate RBI files for T::Enum definitions coming from gems. It can also generate method definitions for the ActiveRecord enums as DSL generators.

In which scenarios would you NOT use Sorbet?

I guess the one scenario where using Sorbet would be counterproductive is if there is no team buy-in for adopting Sorbet. If I were on such a team and I couldn’t convince the rest of the team of the utility of it, I would not push to use Sorbet.Other than that, I can only think of a code base that has a LOT of metaprogramming idioms to be a bad target for using Sorbet. You would still get some benefits from even running Sorbet at typed: false but it might not be worth the effort.

What editors do you personally use? Any standardization across the organization?

We have not standardized on a single editor across the company and we probably will not do so, since we believe in developers’ freedom to use the tools that make them the most productive. However, we also cannot build tooling for all editors, either. So, most of our developer acceleration team builds tooling primarily for VSCode, today. Ufuk personally uses VSCode and Alex uses VIM.

Is there a roadmap for RBI -> RBS conversion for when Ruby 3.0 comes out?

No official roadmap yet, we’re still experimenting with this on the side of our main project: 100% typed: true files in our monolith. We can already say that some features from RBS will not directly translate to RBI and vice versa. You can see a comparison of both specifications here: (might not be completely up-to-date with the latest version of RBS).

What are the major challenges you had or are having with Ruby GraphQL libraries?

Our team tried to marry the GraphQL typing system and the Sorbet types using RBI generation but we got stuck in some very dynamic usages of GraphQL resolvers, so we paused that work for now. On the other hand, there are teams within Shopify who have been using Sorbet and GraphQL together by changing the way they write GraphQL endpoints. You can read more about the technical details of that from the blog post of one of the Shopify engineers that has worked on that:

What would be the first step in getting started with typing in a Rails project? What are kind of files that should be checked in to a repo?

The fastest way to start is to use the steps listed on the Sorbet site to start running with Sorbet. After doing that, you can take a look at using sorbet-rails to generate Rails RBI files for you, or you can look at tapioca to generate gem RBIs. Since you can go gradual it’s totally up to you.

Our advice would be to first target all files at typed: false. If you use tapioca, the price is really low and already brings a lot of benefit. Then try to move the files to type: true where it does not create new type errors (you can use spoom for that:

When it comes to adding signatures, prefer the files that are the most reused or, if you track the errors from production, going first with the files that create the most errors might be a good choice. Files touched by many teams are also an interesting target as signatures make collaboration easier. Files with a lot of churn. Or files defining methods reused a lot across your codebase.

As for the files that you need to check-in to a repository, the best practice is to check-in all the files (mostly RBI files) generated by Sorbet and/or tapioca/sorbet-typed. Those files enable the code to be type checked, so should be available to all the developers that work on the code base.

Learn More About Ruby Static Typing at Shopify

Additional Information

Open Source

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Organizing 2000 Developers for BFCM in a Remote World

Organizing 2000 Developers for BFCM in a Remote World

Shopify is an all-in-one commerce platform that serves over 1M+ merchants in approximately 175 countries across the world. Many of our merchants prepare months in advance for their biggest shopping season of the year, and they trust us to help them get through it successfully. As our merchants grow and their numbers increase, we must scale our platform without compromising on our stability, performance, and quality.  

With Black Friday and Cyber Monday (BFCM) being the two biggest shopping events of the year and with other events on the horizon, there is a lot of preparation that Shopify needs to do on our platform. This effort needs a key driver to set expectations for many teams and hold them accountable to complete the work for their area in the platform. 

Lisa Vanderschuit getting hyped about making commerce better for everyone
Lisa Vanderschuit getting hyped about making commerce better for everyone

I’m an Engineering Program Manager (EPM) with a focus on platform quality and was one of the main program managers (PgM) tapped midway in the year to drive these efforts. For this initiative I worked with three Production Engineering leads (BFCM leads) and three other program managers (with a respective focus in resiliency, scale, and capacity) to:

  • understand opportunities for improvement
  • build out a program that’s effective at scale
  • create adjustments to the workflow specifically for BFCM
  • execute the program
  • start iterating on the program for next year.

Understanding our Opportunities

Each year, the BFCM leads start a large cross company push to get the platform ready for BFCM. They ask the teams responsible for critical areas of the platform to complete the following prep:

Looking at the past years, the BFCM leads chosen to champion this in spend a significant time on administrative, communication, and reporting activities when their time is better spent in the weeds of the problems. Our PgM group was assigned to take on these responsibilities so that these leads could focus on investigating the technical challenges and escalations.

Before jumping into solutions, our PgM group looked into the past to find lessons to inform the future. In looking at past retrospective documents we found some common themes over the years that we needed to keep in mind as we put together our plan:

  • Shopify needs to prepare in advance for supporting large merchants with lots of popular inventory to sell. 
  • Scaling trends weren’t just on the two main days. Sales were spreading out through the week, and there were pre sale and post sale workflows where we needed to be well tested for how much load we could sustain without performance issues. 
  • There were some parts of the platform tied to disruptions in the past that would require additional load testing to give us more confidence in their stability. 

With Shopify moving to Digital by Default and the increasing number of timezones to consider, there were more complexities to getting the company aligned to the same goals and schedule. Our PgM group wanted to create structure around coordinating a large scale effort, but we also wanted to start thinking about how maintenance and prep work can be done throughout the year, so we’re ready for any large shopping event regardless of the time of year. 

Building the Program Plan

Our PgM group listed all the platform preparation tasks the BFCM leads asked developers to do in the past. Then we highlighted items that had to happen this year and took note of when they needed to happen. After this, we asked the BFCM leads to highlight the important things critical for their participation and then we assigned the rest of the work for our PgM group to manage. 

Example of our communication plan calendar
Example of our communication plan calendar

Once we had those details documented, we created a communication plan calendar (a.k.a spreadsheet) to see what was next, week over week. We split the PgM work into workstreams then we each selected ones respective to our areas of focus. In my workstream I had two main responsibilities:

  • Put together a plan to get people assigned to do the platform preparation work that the BFCM leads wanted them to. 
  • Determine what kind of PRs should or should not be shipped to production in the month before and after BFCM.

For platform preparation work listed earlier, I asked teams to identify which areas of the platform that need prepping for the large shopping event. Even with a reduced set of areas to focus on there were still quite a bit of people that I would need to get this prep work assigned to. Instead of working directly with every single person, I used a distributed ownership model. I asked each GM or VP with critical areas to assign a champion from their department to work with. Then I reached out to the champions to let them know of the prep work that needed to be done. They then either assigned the work themselves or they assigned people from their team. To keep track of this ownership I built a tracking spreadsheet and set up a schedule to report on progress week over week.

In the past, our Deploys team would lock the ability to automatically deploy to production for a week. Since BFCM is becoming more spread out year after year, we realized we needed to adjust our culture around shipping in the last two months of the year to make sure we could confidently provide merchants with a resilient platform. Merchants were also needing to train up staff further in advance of the year so we also had to consider slowing down new features that could require extra training for their staff. To start tackling these challenges I asked:

  • Teams to take inventory of all of our platform areas and highlight which areas were considered critical for the merchant experience. 
  • That we set up a rule in a bot we call Caution Tape to comment a thorough risk to value assessment on any new PRs created between November to December in repos that had been flagged as critical to successful large shopping events. 

If the PRs were proposing a merchant facing feature the Caution Tape bot message asked that they document the risks vs the value to shipping around BFCM and that they only ship if approved by a director or GM in their area. In many cases the people creating these PRs either investigated a safer approach, got more thorough reviews, or decided to wait until next year to launch the feature. 

On the week of Black Friday 60% of the work in GitHub was code reviews
On the week of Black Friday 60% of the work in GitHub was code reviews

To artificially slow down the rate of items being shipped to production we planned to reduce the amount of PRs that could be shipped in a deploy and increase the amount of time they would spend in canaries (pre-production deploy test). On top of this we also planned to lock deploys for a week around the BFCM weekend. 

Executing the Program

How do you rally 2500+ people around a mission who are responsible for 1000+ deploys across all services? You state your expectations and then repeat many times in different ways. 

1. Our PGM and BFCM lead group had two main communication options where we started engagement with the rest of the engineering group working on platform prep:

  • Shared Slack channels for discussions, questions, and updates.
  • GitHub repos for assigning work.

2. Our PGM group put together and shared internal documentation on the program details to make it easier to onboard participants to the program.

3. Our PGM group shared high-level announcements, reminders and presentations throughout the year, increasing in frequency leading up to BFCM, to increase awareness and engagement. Some examples of this were:

  • progress reports on each department posted in Slack. 
  • live-Streamed and recorded presentations on our internal broadcasting channel to inform the company about our mission and where to go for help.
  • emails sent to targeted groups to remind people of their respective responsibilities and deadlines.
  • GitHub issues created and assigned to teams with a checklist of the prep work we had asked them to do.

To make sure our BFCM leads had the support they needed our PgM group had regular check-in meetings with them to get a pulse on how they were feeling things were going. To make sure our PgM group was on top of the allocated tasks each week we had meetings at the start and end of each week. Then we hosted office hours along with the BFCM leads for any developers that wanted facetime to flag any potential concerns about their area.

Celebrations and Lessons Learned

Overall I’d say our program was a success. We had a very successful BFCM with sales of $5.1+ billion from the more than one million Shopify-powered brands around the world. We found that our predictions for which areas would take the most load were on target and that the load testing and spinning up of resources paid off. 

A photo of 3 women and 2 men celebrating. Gold confetti showers down on them.
Celebrating BFCM

From our internal developer view we had success in the sense that shipping to the platform didn’t need to come to a full stop. PR reviews were at an all time high which meant that developers focus on quality was at an all time high. For the areas where we did have to slow down on shipping code for features we found that our developers had more time to work on the other important aspects of work that needs to be done in engineering. Teams were able to

  • focus more on clean up tasks
  • write blog posts
  • put together strategic roadmaps and architecture design docs
  • plan team building exercises. 

Overall we still did take a hit in developer productivity and we could have been a bit more relaxed on how long we enforced the extra risk to value assessment on our PRs and expectations on deploying to production. Our PGM team hopes to find a more balanced approach for this for next year's plan. 

From a communication standpoint, some of the messaging to developers was inconsistent on whether or not they could ship to critical areas of the platform during November and December. Our PGM group also ended up putting together some of the announcement drafts last minute so in future years we want to have this included in our communication plan from the start with templates ready to go. 

Our PGM group is hoping to have a retrospective meeting later this year with the BFCM leads to see how we can adjust the program plan for next year. We will be taking everything we learned and find opportunities where we can automate some of the work or distribute the work throughout the year so we can be always ready for any large shopping event in the year. 

If you have a large initiative at your company, consider creating a role for people technical enough to be dangerous that can help drive engineering initiatives forward and work with your top developers to maximise their time and expertise to solve the big complex problems and get shit done.

Lisa Vanderschuit is an Engineering Program Manager who manages the engineering theme of Code Quality. She has been at Shopify for 6 years, working on areas from editing and reviewing Online Store themes to helping our engineering teams raise the bar of code quality at Shopify.

How does your team leverage program managers at your company? What advice do you have for coordinating cross company engineering initiatives? We want to hear from you on Twitter at @ShopifyEng.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

A World Rendered Beautifully: The Making of the BFCM 3D Data Visualization

A World Rendered Beautifully: The Making of the BFCM 3D Data Visualization

By Mikko Haapoja and Stephan Leroux

2020 Black Friday Cyber Monday (BFCM) is over, and another BFCM Globe has shipped. We’re extremely proud of the globe, it focused on realism, performance, and the impact our merchants have on the world.

The Black Friday Cyber Monday Live Map

We knew we had a tall task in front of us this year, building something that could represent orders from our one million merchants in just two months. Not only that, we wanted to ship a data visualization for our merchants so they could have a similar experience to the BFCM globe every day in their Live View.

Prototypes for the 2020 BFCM Globe and Live View. **

With tight timelines and an ambitious initiative, we immediately jumped into prototypes with three.js and planned our architecture.

Working with a Layer Architecture

As we planned this project, we converged architecturally on the idea of layers. Each layer is similar to a React component where state is minimally shared with the rest of the application, and each layer encapsulates its own functionality. This allowed for code reuse and flexibility to build both the Live View Globe, BFCM Globe, and beyond.

A showcase of layers for the 2020 BFCM Globe. **

When realism is key, it’s always best to lean on fantastic artists, and that’s where Byron Delgado came in. We hoped that Byron would be able to use the 3D modeling tools he’s used to, and then we would incorporate his 3D models into our experience. This is where the EarthRealistic layer comes in.

EarthRealistic layer from the 2020 BFCM Globe. **

EarthRealistic uses a technique called physically based rendering, which most modern 3D modeling software supports. In three.js, physically based rendering is implemented via the MeshPhysicalMaterial or MeshStandardMaterial materials.

To achieve realistic lighting, EarthRealistic is lit by a 32bit EXR Environment Map. By using a 32bit EXR, it means we can have smooth image based lighting. Image based lighting is a technique where a “360 sphere” is created around the 3D scene, and pixels in that image are used to calculate how bright Triangles on 3D models should be. This allows for complex lighting setups without much effort from an artist. Traditionally images on the web such as JPGs and PNGs have a color depth of 8bits. If we were to use these formats and 8bit color depth, our globe lighting would have had horrible gradient banding, missing realism entirely.

Rendering and Lighting the Carbon Offset Visualization

Once we converged on physically based rendering and image based lighting, building the carbon offset layer became clearer. Literally!

Carbon Offset visualization layer from the 2020 BFCM Globe. **

Bubbles have an interesting phenomenon where they can be almost opaque at a certain angle and light intensity but in other areas completely transparent. To achieve this look, we created a custom material based on MeshStandardMaterial that reads in an Environment Map and simulates the bubble lighting phenomenon. The following is the easiest way to achieve this with three.js:

  1. Create a custom Material class that extends off of MeshStandardMaterial.
  2. Write a custom Vertex or Fragment Shader and define any Uniforms for that Shader Program.
  3. Override onBeforeCompile(shader: Shader, _renderer: WebGLRenderer): void on your custom Material and pass the custom Vertex or Fragment Shader and uniforms via the Shader instance.

Here’s our implementation of the above for the Carbon Offset Shield Material:

Let’s look at the above, starting with our Fragment shader. In shield.frag lines 94-97

These two lines are all that are needed to achieve a bubble effect in a fragment shader.

To calculate the brightness of an rgb pixel, you calculate the length or magnitude of the pixel using the GLSL length function. In three.js shaders, outgoingLight is an RGB vec3 representing the outgoing light or pixel to be rendered.

If you remember from earlier, the bubble’s brightness determines how transparent or opaque it should appear.  After calculating brightness, we can set the outgoing pixel’s alpha based on the brightness calculation. Here we use the GLSL mix function to go between the expected alpha of the pixel defined by diffuseColor.a and a new custom uniform defined as maxOpacity. By having the concept of min or expected opacity and max opacity, Byron and other artists can tweak visuals to their exact liking.

If you look at our shield.frag file, it may seem daunting! What on earth is all of this code?  three.js materials handle a lot of functionality, so it’s best to make small additions and not modify existing code. three.js materials all have their own shaders defined in the ShaderLib folder. To extend a three.js material, you can grab the original material shader code from the src/renderers/shaders/ShaderLib/ folder in the three.js repo and perform any custom calculations before setting gl_FragColor. An easier option to access three.js shader code is to simply console.log the shader.fragmentShader or shader.vertexShader strings, which are exposed in the onBeforeCompile function:

onBeforeCompile runs immediately before the Shader Program is created on the GPU. Here you can override shaders and uniforms. CustomMeshStandardMaterial.ts is an abstraction we wrote to make creating custom materials easier. It overrides the onBeforeCompile function and manages uniforms while your application runs via the setCustomUniform and getCustomUniform functions. You can see this in action in our custom Shield Material when getting and setting maxOpacity:

Using Particles to Display Orders

Displaying orders on Shopify from across the world using particles. **

One of the BFCM globe’s main features is the ability to view orders happening in real-time from our merchants and their buyers worldwide. Given Shopify’s scale and amount of orders happening during BFCM, it’s challenging to visually represent all of the orders happening at any given time. We wanted to find a way to showcase the sheer volume of orders our merchants receive over this time in both a visually compelling and performant way. 

In the past, we used visual “arcs” to display the connection between a buyer’s and a merchant’s location.

The BFCM Globe from 2018 showing orders using visual arcs.
The BFCM Globe from 2018 showing orders using visual arcs.

With thousands of orders happening every minute, using arcs alone to represent every order quickly became a visual mess along with a heavy decrease in framerate. One solution was to cap the number of arcs we display, but this would only allow us to display a small fraction of the orders we were processing. Instead, we investigated using a particle-based solution to help fill the gap.

With particles, we wanted to see if we could:

  • Handle thousands of orders at any given time on screen.
  • Maintain 60 frames per second on low-end devices.
  • Have the ability to customize style and animations per order, such as visualizing local and international orders.

From the start, we figured that rendering geometry per an order wouldn't scale well if we wanted to have thousands of orders on screen. Particles appear on the globe as highlights, so they don’t necessarily need to have a 3D perspective. Rather than using triangles for each particle, we began our investigation using three.js Points as a start, which allowed us to draw using dots instead. Next, we needed an efficient way to store data for each particle we wanted to render. Using BufferGeometry, we assigned custom attributes that contained all the information we needed for each particle/order.

To render the points and make use of our attributes, we created a ShaderMaterial, and custom vertex and fragment shaders. Most of the magic for rendering and animating the particles happens inside the vertex shader. Each particle defined in the attributes we pass to our BufferGeometry goes through a series of steps and transformations.

First, each particle has a starting and ending location described using latitude and longitude. Since we want the particle to travel along the surface and not through it, we use a geo interpolation function on our coordinates to find a path that goes along the surface.

A photo of a globe with an order represented as a particle traveling from New York City to London. The vertex shader uses each location’s latitude and longitude and determines the path it needs to travel.
An order represented as a particle traveling from New York City to London. The vertex shader uses each location’s latitude and longitude and determines the path it needs to travel. **

Next, to give the particle height along its path, we use high school geometry, a parabola equation based on time to alter the straight path to a curve.

A photo of a globe with particles that follow a curved path away from the earth’s surface using a parabola equation to determine its height.
Particles follow a curved path away from the earth’s surface using a parabola equation to determine its height. **

To render the particle to make it look 3D in its travels, we combine our height and projected path data then convert it to a vector position our shader uses as it’s gl_Position. With our particle now knowing where it needs to go, using a time uniform, we drive animations for other changes such as size and color. At the end of the vertex shader, we pass the position and point size to render onto the fragment shader that combines the calculated color and alpha at the time for each particle.

Once the vertex shader is complete, the vertex shader passes position and point size onto the fragment shader that combines the animated color and alpha for each particle.

Given that we wanted to support updating and animating thousands of particles at any moment, we wanted to be careful about how we access and update our attributes. For example, if we had 10000 particles in transit, we need to continue updating those and other data points that are coming in. Instead of updating all of our attributes every time, which can be processor-intensive, we made use of BufferAttribute’s updateRange to update a subset of the attributes we needed to change on each frame instead of the entire attribute set.

gl_Points enables us to render 150,000 particles flying around the globe at any given time without performance issues. **

Combining all of the above, we saw upwards of 150,000 particles animating to and from locations on the globe without noticing any performance degradation.

Optimizing Performance

In video games, you may have seen settings for different quality levels. These settings modify the render quality of the application. Most modern games will automatically scale performance. Most aggressively, the application may reduce texture quality or how many vertices are rendered per 3D object.

With the amount of development time we had for this project, we simply didn’t have time to be this aggressive. Yet, we still had to support old, low-power devices such as dated mobile phones. Here’s how we implemented an auto optimizer that could increase an iPhone 7+ render performance from 40 frames per second (fps) to a cool 60fps.

If your application isn’t performing well, you might see a graph like this:

Graph depicting the Globe application running at 40 frames per second on a low power device
Graph depicting the Globe application running at 40 frames per second on a low power device

Ideally, in modern applications, your application should be running at 60fps or more. You can also use this metric to determine when you should lower the quality of your application. Our initial implementation plan was to keep it simple and make every device with a low-resolution display run in low quality. However, this would mean new phones with low-resolution displays and extremely capable GPUs would receive a low-quality experience. Our final attempt monitors fps. If it’s lower than 55fps for over 2 seconds, we decrease the application’s quality. This adjustment allows phones such as the new iPhone 12 Pro Max to run in the highest quality possible while an iPhone 7+ can render at lower quality but consistent high framerate. Decreasing the quality of an application by reducing buffer sizes is optimal. However, in our aggressive timeline, this would have created many bugs and overall application instability.

Left side of the image depicts the application running in High-Quality mode, where the right side of the image depicts the application running in Low-Quality mode
Left side of the image depicts the application running in High-Quality mode, where the right side of the image depicts the application running in Low-Quality mode. **

What we opted for instead was simple and likely more effective. When our application retains a low frame rate, we simply reduce the size of the <canvas> HTML element, which means we’re rendering fewer pixels. After this, WebGL has to do far less work, in most cases, 2x or 3x less work. When our WebGLRenderer is created, we setPixelRatio based on window.devicePixelRatio. When we’ve retained a low frame rate, we simply drop the canvas pixel ratio back down to 1x. The visual differences are nominal and mainly noticeable in edge aliasing. This technique is simple but effective. We also reduce the resolution of our Environment Maps generated by PMREMGenerator, but most applications will be able to utilize the devicePixelRatio drop more effectively.

If you’re curious, this is what our graph looks like after the Auto Optimizer kicks in (red circled area)

Graph depicting the Globe application running at 60 frames per second on a low power device with a circle indicating when the application quality was reduced.
Graph depicting the Globe application running at 60 frames per second on a low power device with a circle indicating when the application quality was reduced

Globe 2021

We hope you enjoyed this behind the scenes look at the 2020 BFCM Globe and learned some tips and tricks along the way. We believe that by shipping two globes in a short amount of time, we were able to focus on the things that mattered most while still keeping a high degree of quality. However, the best part of all of this is that our globe implementation now lives on as a library internally that we can use to ship future globes. Onward to 2021!

*All data is unaudited and is subject to adjustment.
**Made with Natural Earth; textures from Visible Earth NASA

Mikko Haapoja is a development manager from Toronto. At Shopify he focuses on 3D, Augmented Reality, and Virtual Reality. On a sunny day if you’re in the beaches area you might see him flying around on his OneWheel or paddleboarding on the lake.

Stephan Leroux is a Staff Developer on Shopify's AR/VR team investigating the intersection of commerce and 3D. He has been at Shopify for 3 years working on bringing 3D experiences to the platform through product and prototypes.

Additional Information



We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM, and commerce isn't slowing down. Help us scale & make commerce better for everyone

Continue reading

Capacity Planning at Scale

Capacity Planning at Scale

By Kathryn Tang and Kir Shatrov

The fourth Thursday in November is Thanksgiving in the United States. The day after, Black Friday (coined in 1961), is the first day of the Christmas shopping season and since 2005 it’s the busiest shopping day of the year in North America. Cyber Monday is a more recent development. Getting its name in 2005, it refers to the Monday after the Thanksgiving weekend where retailers focus on sales offered online. At Shopify, we call the weekend including Black Friday and Cyber Monday BFCM.

From the engineering team’s point of view, every BFCM challenges the platform and all the things we’ve shipped throughout the year:

  • Would our clusters handle two times the number of virtual machines? 
  • Would we hit some sort of limitation on the new network design? 
  • Would the new logging pipeline handle such an increase in traffic? 
  • What’s going to be the next scalability bottleneck that we hit?

The other challenge is planning the capacity. We need to understand the magnitude of traffic ahead of us, and how many resources like CPUs and storage we’ll need to handle BFCM sales. On top of that, we need to have enough room in case of something unexpected, and we need to perform a regional failover.

Since 2017, we’ve partnered with Google Cloud Platform (GCP) as our main vendor for the cloud. Over these years, we’ve worked closely with their team on our capacity models, and prior to every BFCM that collaboration gets even closer.

In this post, we’ll cover our approaches to capacity planning, and how we rolled it out across the org and to dozens of teams. We’ll also share how we validated our capacity plans with scalability tests to make sure they work.

Capacity Planning 

Our Google Cloud resource needs depend on how much traffic our merchants see during BFCM. We worked with our data scientists to forecast traffic levels and set those levels as a bar for our platform to scale to. Additionally, we looked into historical numbers, applied a safety margin, and projected how many buyers would check out or view online stores.

A list of GCP projects for resource planning.  The list includes items like memcache, Kafka, MySQL, etc
A list of GCP projects for resource planning.

We created a master resourcing plan for our Google Cloud implementation and estimated how things like CPUs and storage would scale to BFCM traffic levels. Owners for our top 10 or so resource areas were tasked to estimate what they needed for BFCM. These estimates were detailed breakdowns of the machine types, geographic locations, and quantities of resources like CPUs. We also added buffers to our overall estimates to allow flexibility to change our resourcing needs, move machines across projects, or failover traffic to different regions if we needed to. What also helps is that we partition each component into a separate GCP project, which makes it a lot easier to think of quotas per every project.

A line graph showing the BFCM traffic forecasts over time. 4 different scenarios are shown in blue, red, yellow and green. The black line shows existing traffic patterns
A line graph showing the BFCM traffic forecasts over time

2020 is an exceptionally difficult year to plan for. Normally, we’d look at BFCM trends from years prior and predict BFCM traffic with a fairly high level of confidence. This year, COVID-19 lockdowns drove a rapid shift to selling online this spring, and we didn't know what to expect. Would we see a massive increase in online traffic this BFCM, or a global economic depression where consumers stopped buying much at all? To manage heightened uncertainty, we forecasted multiple scenarios and their respective needs for our cloud deployment.

From an investment perspective, planning for the largest scale scenario means spending a lot of money very quickly to handle sales that might not happen. Alternatively, not deploying enough machines means having too little computing power and putting our merchant storefronts at risk of outages. It was absolutely vital to avoid anything that would put our merchants at risk of downtime. We decided to scale to our more aggressive growth scenarios to ensure our platform is stable regardless of what happens. We’re transparent with our partners, finance teams, and internal teams about how we thought through these scenarios which helps them make their own operating decisions.

Scalability Testing

A sheet with a capacity plan is just a starting point. Once we start scaling to projected numbers, there’s a high chance that we’ll hit limits throughout our tech stack that need resolving. In a complex system, there’s always a limit like:

  • the number of VMs in a network
  • the number of packets that a busy Memcached server can accept 
  • the number of MB/s your logging pipeline can handle.

Historically, every BFCM brought us some scalability surprises, and what’s worse, we’d only notice them when fully scaled prior to BFCM. That left too little time to come up with mitigation plans.

Back in 2018, we decided that a “faux” BFCM in the middle of the year would increase our resilience as an organization and push us to find unknowns that we’d otherwise only discover during the real thing. As we started doing that, it allowed us to find problems at scale more often and created that mental muscle of preparing for critical events and finding unknowns. If you’re exercising and something feels hard, you train more and eventually your muscles get better. Shopify treats BFCM the same way.

We’ve started the practice of regular scale-up testing at Shopify, and of course we made sure to come up with fun names for each. We’ve had Mayday (2019), Spooky scale-up (2019), and Oktoberfest scale-up (2020). Another fun fact is that our Waterloo teams play a large part in running this testing, and the dates of our Oktoberfest matched the city of Kitchener-Waterloo’s Oktoberfest festivities (It’s the second-largest Oktoberfest in the world).

Oktoberfest scale-up’s goal was to simulate this year’s expected BFCM load based on the traffic forecasts from the data science team. And the fact that we run Shopify in cloud on Google Kubernetes Engine allowed us to grab extra compute capacity just for the window of the exercise, and only pay for those hours when we needed it.

Investment in our internal load testing tooling over the years is fundamental to our ability to run such large scale, platform-wide load tests. We’ve talked about go-lua, an open source project that powers our load testing tool. Thanks to embedded Lua, we feed it with a high-level set of steps for what we want to test: actions like browsing the storefront, adding a product to a card, proceeding to check out, and processing the transaction through a mock payment gateway.

Thanks to Oktoberfest scale-up, we identified and then fixed some bottlenecks that could have become an issue for the real BFCM. Doing the test in early October gave us time to address issues.

After addressing all the issues, we repeated the scale-up test to see how our mitigations helped. Seeing that going smoothly increased our confidence levels about the upcoming Black Friday and reduced stress levels for all teams.

We strive for a smooth BFCM and spend a lot of time preparing for it, from capacity planning, to setting the expectations for our vendors, to load testing, and failover simulations. Beyond delivering a smooth holiday season for our merchants, BFCM is time to reflect on the future. As Shopify continues to grow, BFCM traffic levels can become the normal everyday loads we see in the next year. Our job is to bring lessons from events like BFCM to make our systems even more automated, more dynamic, and more resilient. We relish this opportunity to think about where Shopify is going and to architect our platform to scale with it.

Kir Shatrov is an Engineering Lead who’s been with Production Engineering at Shopify for the past five years, working on areas from CI/CD infrastructure to sharding and capacity planning.

Kathryn Tang is an Engineering Program Manager who manages our Google Cloud relationship. She has been at Shopify for 4 years, working with a multitude of R&D and commercial teams to derive business insights and guide operating decisions to help us scale.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone

Continue reading

Pummelling the Platform–Performance Testing Shopify

Pummelling the Platform–Performance Testing Shopify

Developing a product or service at Shopify requires care and consideration. When we deploy new code at Shopify, it’s immediately available for merchants and their customers. When over 1 million merchants rely on Shopify for a successful Black Friday Cyber Monday (BFCM), it’s extremely important that all merchants—big and small—can run sales events without any surprises.

We typically classify a “sales event” as any single event that attracts a high amount of buyers to Shopify merchants. This could be a product launch, a promotion, or an event like BFCM. In fact, BFCM 2020 was the largest single sales event that Shopify has ever seen, and many of the largest merchants on the planet also saw some of the biggest flash sales ever observed before on Earth.

In order to ensure that all sales are successful, we regularly and repeatedly simulate large sales internally before they happen for real. We proactively identify and eliminate issues and bottlenecks in our systems using simulated customer traffic on representative test shops. We do this dozens of times per day using a few tools and internal processes.

I’ll give you some insight into the tools we use to raise confidence in our ability to serve large sales events. I’ll also cover our experimentation and regression framework we built to ensure that we’re getting better, week-over-week, at handling load.

We use “performance testing” as an umbrella term that covers different types of high-traffic testing including (but not limited to) two types of testing that happen regularly at Shopify: “load testing” and “stress testing”. 

Load testing verifies that a service under load can withstand a known level of traffic or specific number of requests. An example load test is when a team wants to confirm that their service can handle 1 million requests per minute for a sustained duration of 15 minutes. The load test will confirm (or disconfirm) this hypothesis.

Stress testing, on the other hand, is when we want to understand the upper limit of a particular service. We do this by increasing the amount of load—sometimes very quicklyto the service being tested until it crumbles under pressure. This gives us a good indication of how far individual services at Shopify can be pushed, in general.

We condition our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like BFCM. Through proactive load tests and stress tests, we have a really good picture of what to expect even before a flash sale kicks off. 

Enabling Performance Testing at Scale

A platform as big and complex as Shopify has many moving parts, and each component needs to be finely tuned and prepared for large sales events. Not unlike a sports car, each individual part needs to be tested under load repeatedly to understand performance and throughput capabilities before assembling all the parts together and taking the entire system for a test drive. 

Individual teams creating services at Shopify are responsible for their own performance testing on the services they build. These teams are best positioned to understand the inner workings of the services they own and potential bottlenecks or situations that may be overwhelmed under extreme load, like during a flash sale. To enable these teams, performance testing needs to be approachable, easy to use and understand, and well-supported across Shopify. The team I lead is called Platform Conditioning, and our mission is to constantly improve the tooling, process, and culture around performance testing at Shopify. Our team makes all aspects of the Shopify platform stronger by simulating large sales events and making high-load events a common and regular occurrence for all developers. Think of Platform Conditioning as the personal trainers of Shopify. It’s Platform Conditioning that can help teams develop individualized workout programs and set goals. We also provide teams with the tools they need in order to become stronger.

Generating Realistic Load

At the heart of all our performance testing, we create “load”. A service at Shopify will add load to to cause stresses that—in the end—make it stronger, by using requests that hit specific endpoints of the app or service.

Not all requests are equal though, and so stress testing and load testing are never as easy as tracking the sheer volume of requests received by a service. It’s wise to hit a variety of realistic endpoints when testing. Some requests may hit CDNs or layers of caching that make the response very lightweight to generate. Other requests, however, can be extremely costly and include multiple database writes, N+1 queries, or other buried treasures. It’s these goodies that we want to find and mitigate up front, before a sales event like BFCM 2020.

For example, a request to a static CSS file is served from a CDN node in 40ms without creating any load to our internal network. Comparatively, making a search query on a shop hits three different layers of caching and queries Redis, MySQL, and Elasticsearch with total round-trip time taking 1.5 seconds or longer.

Another important factor to generating load is considering the shape of the traffic as it appears at our load balancers. A typical flash sale is extremely spiky and can begin with a rush of customers all trying to purchase a limited product simultaneously. It’s very important to simulate this same traffic shape when generating load and to run our tests for the same duration that we would see in the wild.

A flow diagram showing how we generate load with go-lua
A systems diagram showing how we generate load with go-lua

When generating load we use a homegrown, internal tool that generates raw requests to other services and receives responses from them. There are two main pieces to this tool: the first is the coordinator, and the second is the group of workers that generate the load. Our load generator is written in Go and executes small scripts written in Lua called “flows”. Each worker is running a Go binary and uses a very fast and lightweight Lua VM for executing the flows. (The Go-Lua VM that we use is open source and can be found on Github) Through this, the steps of a flow can scale to issue tens of millions of requests per minute or more. This technique stresses (or overwhelms) specific endpoints of Shopify and allows us to conduct formal tests using the generated load.

We use our internal ChatOps tool, ‘Spy’, to enqueue tests directly from Slack, so everyone can quickly see when a load test has kicked off and is running. Spy will take care of issuing a request to the load generator and starting a new test. When a test is complete, some handy links to dashboards, logs, and overall results of the test are posted back in Slack.

Here’s a snippet of a flow, written in Lua, that browses a Shopify storefront and logs into a customer account—simulating a real buyer visiting a Shopify store:

Just like a web browser, when a flow is executing it sends and receives headers, makes requests, receives responses and simulates many browser actions like storing cookies for later requests. However, an executing flow won’t automatically make subsequent requests for assets on a page and can’t execute Javascript returned by the server. So our load tests don’t make any XMLHttpRequest (XHR) requests, Javascript redirects or reloads that can happen in a full web browser.

So our basic load generator is extremely powerful for generating a great deal of load, but in its purest form it only can hit very specific endpoints as defined by the author of a flow. What we create as “browsing sessions” are only a streamlined series of instructions and only include a few specific requests for each page. We want all our performance testing as realistic as possible, simulating real user behaviour in simulated sales events and generating all the same requests that actual browsers make. To accomplish this, we needed to bridge the gap between scripted load generation and realistic functionality provided by real web browsers.

Simulating Reality with HAR-based Load Testing

Our first attempt at simulating real customers and adding realism to our load tests was an excellent idea, but fairly naive when it came to how much computing power it would require. We spent a few weeks exploring browser-based load testing. We researched tools that were already available and created our own using headless browsers and tools like Puppeteer. My team succeeded in making realistic browsing sessions, but unfortunately the overhead of using real browsers dramatically increased both computing costs and real money costs. With browser-based solutions, we could only drive thousands of browsing sessions at a time, and Shopify needs something that can scale to tens of millions of sessions. Browsers provide a lot of functionality, but they come with a lot of overhead. 

After realizing that browser-based load generation didn’t suit our needs, my team pivoted. We were still driving to add more realism to our load tests, and we wanted to make all the same requests that a browser would. If you open up your browser’s Developer Tools, and look at the Network tab while you browse, you see hundreds of requests made on nearly every page you visit. This was the inspiration for how we came up with a way to test using HTTP Archive (HAR) files as a solution to our problems.

An image of chrome developer tools showing a small sample of requests made by a single product page
A small sample of requests made by a single product page

HAR files are detailed JSON representations of all of the network requests and responses made by most popular browsers. You can export HAR files easily from your browser, or web proxy tools like Charles Proxy. A single HAR file includes all of the requests made during a browsing session and are easy to save, examine, and share. We leveraged this concept and created a HAR-based load testing solution. We even gave it a tongue-and-cheek name: Hardy Har Har.

Hardy Har Har (or simply HHH for those who enjoy brevity) bridges the gap between simple, lightweight scripted load tests and full-fledged, browser-based load testing. HHH will take a HAR file as input and extract all of the requests independently, giving the test author the ability to pick and choose which hostnames can be targeted by their load test. For example, we nearly always remove requests to external hostnames like Google Analytics and requests to static assets on CDN endpoints (They only add complexity to our flows and don’t require load testing). The resulting output of HHH is a load testing flow, written in Lua and committed into our load testing repository in Git. Now—literally at the click of a button—we can replay any browsing session in its full completeness. We can watch all the same requests made by our browser, scaled up to millions of sessions.

Of course, there are some aspects of a browsing session that can’t be simply replayed as-is. Things like logging into customer accounts and creating unique checkouts on a Shopify store need dynamic handling that HHH recognizes and intelligently swaps out the static requests and inserts dynamic logic to perform the equivalent functionality. Everything else lives in Lua and can be ripped apart or edited manually giving the author complete control of the behaviour of their load test. 

Taking a Scientific Approach to Performance Testing

The final step to having great performance testing leading up to a sales event is clarity in observations and repeatability of experiments. At Shopify, we ship code frequently, and anyone can deploy changes to production at any point in time. Similarly, anyone can kick off a massive load test from Slack whenever they please. Given the tools we’ve created and the simplicity in using them, it’s in our best interest to ensure that performance testing follows the scientific method for experimentation.

A flow diagram showing the performance testing scientific method of experimentation
Applying the scientific method of experimentation to performance testing

Developers are encouraged to develop a clear hypothesis relating to their product or service, perform a variety of experiments, observe the results of various experiment runs, and formulate a conclusion that relates back to their hypothesis. 

All this formality in process can be a bit of a drag when you’re developing, so the Platform Conditioning team created a framework and tool for load test experimentation called Cronograma. Cronograma is an internal Rails app making it easy for anyone to set up an experiment and track repeated runs of a performance testing experiment. 

Cronograma enforces the formal use of experiments to track both stress tests and load tests. The Experiment model has several attributes, including a hypothesis and one or more orchestrations that are coordinated load tests executed simultaneously in different magnitudes and durations. Also, each experiment has references to the Shopify stores targeted during a test and links to relevant dashboards, tracing, and logs used to make observations.

Once an experiment is defined, it can be run repeatedly. The person running an experiment (the runner) starts the experiment from Slack with a command that creates a new experiment run. Cronograma kicks off the experiment and assigns a dedicated Slack channel for the tests allowing multiple people to participate. During the running of an experiment any number of things could happen including exceptions, elevated traffic levels, and in some cases, actual people may be paged. We want to record all of these things. It’s nice to have all of the details visible in Slack, especially when working with teams that are Digital by Default. Observations can be made by anyone and comments are captured from Slack and added to a timeline for the run. Once the experiment completes, the experiment runner terminates the run and logs a conclusion based on their observations that relates back to the original hypothesis.

We also included additional fanciness in Cronograma. The tool automatically detects whether any important monitors or alerts were triggered during the experiment from internal or third-party data monitoring applications. Whenever an alert is triggered, it is logged in the timeline for the experiment. We also retrieve metrics from our data warehouse automatically and consume these data in Cronograma allowing developers to track observed metrics between runs of the same experiment. For example:

  • the response times of the requests made
  • how many 5xx errors were observed
  • how many requests per minute (RPM) were generated

All of this information is automatically captured so that running an experiment is useful and it can be compared to any other run of the experiment. It’s imperative to understand whether a service is getting better or worse over time.

Cronograma is the home of formal performance testing experiments at Shopify. This application provides a place for all developers to conduct experiments and repeat past experiments. Hypotheses, observations, and conclusions are available for everyone to browse and compare to. All of the tools mentioned here have led to numerous performance improvements and optimizations across the platform, and they give us confidence that we can handle the next major sales event that comes our way.

The Best Things Go Unnoticed 

Our merchants count on Shopify being fast and stable for all their traffic—whether they’re making their first sale, or they’re processing tens of thousands of orders per hour. We prepare for the worst case scenarios by proactively testing the performance of our core product, services, and apps. We expose problems and fix them before they become a reality for our merchants using simulations. By building a culture of load testing across all teams at Shopify, we’re prepared to handle sales events like BFCM and flash sales. My team’s tools make performance testing approachable for every developer at Shopify, and by doing so, we create a stronger platform for all our merchants. It’s easy to go unnoticed when large sales events go smoothly. We quietly rejoice in our efforts and the realization that it’s through strength and conditioning that we make these things possible.


Performance Scale at Shopify

Join Chris as he chats with Anita about how Shopify conditions our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like flash sales and Black Friday Cyber Monday.  

Chris Inch is a technology leader and development manager living in Kitchener, Ontario, Canada. By day, he manages Engineering teams at Shopify, and by night, he can be found diving head first into a variety of hobbies, ranging from beekeeping to music to amateur mycology.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone

Continue reading

Vouching for Docker Images

Vouching for Docker Images

If you were using computers in the ‘90s and the early 2000s, you probably had the experience of installing a piece of software you downloaded from the internet, only to discover that someone put some nasty into it, and now you’re dragging your computer to IT to beg them to save your data. To remedy this, software developers started “signing” their software in a way that proved both who they were and that nobody tampered with the software after they released it. Every major operating system now supports code or application signature verification, and it’s a backbone of every app store.But what about Kubernetes? How do we know that our Docker images aren’t secret bitcoin miners, stealing traffic away from customers to make somebody rich? That’s where Binary Authorization comes in. It’s a way to apply the code signing and verification that modern systems now rely on to the cloud. Coupled with Voucher, an open source project started by my team at Shopify, we’ve created a way to prevent malicious software from being installed without making developers miserable or forcing them to learn cryptography.

Why Block Untrusted Applications?

Your personal or work computer getting compromised is a huge deal! Your personal data being stolen or your computer becoming unusable due to popup ads or background processes doing tasks you don’t know about is incredibly upsetting.

But imagine if you used a compromised service. Imagine if your email host ran in Docker containers in a cluster with a malicious service that wanted to access contents of the email databases? This isn’t just your data, but the data of everyone around you.

This is something we care about deeply at Shopify, since trust is a core component of our relationship with our merchants and our merchants’ relationships with their customers. This is why Binary Authorization has been a priority for Shopify since our move to Kubernetes.

What is Code Signing?

Code signing starts by taking a hash of your application. Hashes are made with hashing algorithms that take the contents of something (such as the binary code that makes up an application) and make a short, reproducible value that represents that version. A part of the appeal of hashing algorithms is that it takes an almost insurmountable amount of work (provided you’re using newer algorithms) to find two pieces of data that produce the same hash value.

For example, if you have a file that has the text:

Hello World

The hash representation of that (using the “sha256” hashing algorithm) is:


Adding an exclamation mark to the end of our file:

Hello World!

Results in a completely different hash:


Once you have a hash of an application, you can run the same hashing algorithm on it to ensure that it hasn’t changed. While this is an important part of the code signing, most signing applications will automatically create a hash of what you are signing, rather than requiring you to hash and then sign the hash separately. It makes the hash creation and verification transparent to the developers and their users.

Once the initial release is ready, the developer that’s signing the application creates a public and private key for signing it, and shares the public key with their future users. The developer then uses the private part of their signing key and the hash of the application to create a value that can be verified with the public part of the key.

For example, with Minisign, a tool for creating signatures quickly, first we create our signing key:

The public half of the key is now:


And the private half remains private, living in /Users/caj/.minisign/minisign.key.

Now, if our application was named “hello” we can create a signature with that private key:

And then your users could verify that “hello” hasn’t been tampered with by running:

Unless you’re a software developer or power user, you likely have never consciously verified a signature, but that’s what’s happening behind the scenes if you’re using a modern operating system. 

Where is Code Signing Used?

Two of the biggest users of code signing are Apple and Google. They use code signing to ensure that you don’t accidentally install software updates that have been tampered with or malicious apps from the internet. Signatures are usually verified in the background, and you only get notified if something is wrong. In Android, you can turn this off by allowing unknown apps in the phone's settings, whereas iOS requires the device be jailbroken to allow unsigned applications to be installed.

A macOS dialog window showing that Firefox is damaged and can't be opened. It gives users the option of moving it to the Trash.

A macOS dialog window showing that Firefox is damaged and can't be opened. It gives users the option of moving it to the Trash.

In macOS, applications that are missing their developer signatures or don’t have valid signatures are blocked by the operating system and advise users to move them to the Trash.

Most Linux package managers, (such as Apt/DPKG in Debian and Ubuntu, Pacman in ArchLinux) use code signing to ensure that you’re installing packages from the distribution maintainer, and verify those packages at install time.

Docker Hub showing a docker image created by the author.
Docker Hub showing a docker image created by the author.

Unfortunately, Kubernetes doesn’t have this by default. There are features that allow you to leverage code signing, but chances are you haven’t used them.

And at the end of the day, do you really trust some rando on the internet to actually give you a container that does what it says it does? Do you want to trust that for your organization? Your customers?

What is Binary Authorization?

Binary Authorization is a series of components that work together: 

  • A metadata service: a service that stores signatures and other image metadata
  • A Binary Authorization Enforcer: a service that blocks images that it can’t find valid signatures for
  • A signing service: a system that signs new images and stores those signatures in the metadata service.

Google provides the first two services for their Kubernetes servers, which Shopify uses, based on two open source projects:

  • Grafeas, a metadata service
  • Kritis, a Binary Authorization Enforcer

When using Kritis and Grafeas or the Binary Authorization feature in Google Kubernetes Engine (GKE), infrastructure developers will configure policies for their clusters, listing the keys (also referred to as attestors) that must have signed the container images before they can run.

When new resources are started in a Kubernetes cluster, the images they reference are sent to the Binary Authorization Enforcer. The Enforcer connects to the metadata service to verify the existence of valid signatures for the image in question and then compares those signatures to the policy for the cluster it runs in. If the image doesn’t have the required signatures, it’s blocked, and any containers that would use it won’t start.You can see how these two systems work together to provide the same security that one gets in one’s operating system! However, there’s one piece that wasn’t provided by Google until recently: the signing service.

Voucher: The Missing Piece

Voucher serves as the last piece for Binary Authorization, the signing service. Voucher allows Shopify to run security checks against our Docker images and sign them depending on how secure they are, without requiring that non-security teams manage their signing keys.

Using Voucher's client software to check an image with the 'is_shopify' check, which verifies if the image was from a Shopify owned repository.
Using Voucher's client software to check an image with the 'is_shopify' check, which verifies if the image was from a Shopify owned repository.

The way it works is simple:

  1. Voucher runs in Google Cloud Run or Kubernetes and is accessible as a REST endpoint
  2. Every build pipeline automatically calls to Voucher with the path to the image it built
  3. Voucher reviews the image, signs it, and pushes that signature to the metadata service

On top of the basic code signing workflow discussed previously, Voucher also supports validating more complicated requirements, using separate security checks and associated signing keys to mix and match required signatures on a per cluster basis to create distinct policies based on a cluster’s requirement.

For example, do you want to block images that weren’t built internally? Voucher has a distinct check for verifying that an image is associated with a Git commit in a Github repo you own, and signing those images with a separate key.

Alternatively, do you need to be able to prove that every change was approved by multiple people? Voucher can support that, creating signatures based on the existence of approvals in Github (with support for other code hosting services coming soon). This would allow you to use Binary Authorization to block images that would violate that requirement.

Voucher also has support for verifying the identity of the container builder, blocking images with a high number of vulnerabilities, and so on. And Voucher was designed to be extensible, allowing for the creation of new checks as need be.By combining Voucher’s checks and Binary Authorization policies, infrastructure teams can create a layered approach to securing their organization’s Kubernetes clusters. Compliance clusters can be configured to require approvals and block images with vulnerabilities, while clusters for experiments and staging can use less strict policies to allow developers to move faster, all with minimum work from non-security focused developers.

Voucher Joins Grafeas

As mentioned earlier, Voucher serves a need that hasn’t been provided by Google until recently. This is because Voucher has moved into the Grafeas organization and now is a service provided by Google to Google Kubernetes Engine users going forwards. 

Since our move to Kubernetes, Shopify’s security team has been working with Google’s Binary Authorization team to plan out how we’ll roll out Binary Authorization and design Voucher. We also released Voucher as an open source project in December 2018. This move to the Grafeas project simplifies things, putting it in the same place as the other open source Binary Authorization components.

Improving the security of the infrastructure we build makes everyone safer. And making Voucher a community project will put it in front of more teams which will be able to leverage it to further secure their Kubernetes clusters, and if we’re lucky, will result in a better, more powerful Voucher! Of course, Shopify’s Software Supply Chain Security team will continue our work on Voucher, and we want you to join us!

Please help us!

If you’re a developer or writer who has time and interest in helping out, please take a look at the open issues or fork the project and open a PR! We can always use more documentation changes, tutorials, and third party integrations!

And if you’re using Voucher, let us know! We’d love to hear how it’s going and how we can do a better job of making Kubernetes more secure for everyone!

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Build a Production Grade Workflow with SQL Modelling

How to Build a Production Grade Workflow with SQL Modelling

By Michelle Ark and Chris Wu

In January of 2014, Shopify built a data pipeline platform for the data science team called Starscream. Back then, we were a smaller team and needed a tool that could deal with everything from ad hoc explorations to machine learning models. We chose to build with PySpark to get the power of a generalized distributed computer platform, the backing of the industry standard, and the ability to tap into the Python talent market. 

Fast forward six years and our data needs have changed. Starscream now runs 76,000 jobs and writes 300 terabytes a day! As we grew, some types of work went away, but others (like simple reports) became so commonplace we do them every day. While our Python tool based on PySpark was computationally powerful, it wasn’t optimized for these commonplace tasks. If a product manager needed a simple rollup for a new feature by country, pulling it, and modeling it wasn’t a fast task.

We’ll show you how we moved to a SQL modelling workflow by leveraging dbt (data build tool) and created tooling for testing and documentation on top of it. All together, these features provide Shopify’s data scientists with a robust, production-ready workflow to quickly build straightforward pipelines.

The Problem

When we interviewed our users to understand their workflow on Starscream, there were two issues we discovered: development time and thinking.

Development time encompasses the time data scientists use to prototype the data model they’d like to build, run it, see the outcome,and iterate. The PySpark platform isn’t ideal for running straightforward reporting tasks, often forcing data scientists to write boilerplate and it yields long runtimes. This led to long iteration cycles when trying to build models on unfamiliar data.

The second issue, thinking, is more subtle and deals with the way the programming language forces you to look at the data. Many of our data scientists prefer SQL to python because its structure forces consistency in business metrics. When interviewing users, we found a majority would write out a query in SQL then translate it to Python when prototyping. Unfortunately, query translation is time consuming and doesn’t add value to the pipeline.

To understand how widespread these problems were, we audited the jobs run and surveyed our data science team for the use cases. We found that 70% or so of the PySpark jobs on Starscream were full batch queries that didn’t require generalized computing. We viewed this as an opportunity to make a kickass optimization for a painful workflow. 

Enter Seamster

Our goal was to create a SQL pipeline for reporting that enables data scientists to create simple reporting data faster, while still being production ready. After exploring a few alternatives, we felt that the dbt library came closest to our needs. Their tagline “deploy analytics code faster with software engineering practices” was exactly what we were looking for in a workflow. We opted to pair it with Google BigQuery as our data store and dubbed the system and its tools, Seamster.

We knew that any off-the-shelf system wouldn’t be one size fits all. In moving to dbt, we had to implement our own:

  • source and model structure to modularize data model development
  • unit testing to increase the types of testable errors
  • continuous integration (CI) pipelines to provide safety and consistency guarantees.

Source Independence and Safety

With dozens of data scientists making data models in a shared repository, a great user experience would

  • maximize focus on work 
  • minimize the impact of model changes by other data scientists.

By default, dbt declares raw sources in a central sources.yml. This quickly became a very large file as it included the schema for each source, in addition to the source name. It creates a huge bottleneck for teams editing the same file across multiple PRs. 

To mitigate the bottleneck, we leveraged the flexibility of dbt and created a top-level ‘sources’ directory to represent each raw source with its own source-formatted yaml file. This way, data scientists can parse only the source documentation that’s relevant for them and contribute to the sources.yml file without stepping on each other’s toes.

Base models are one-to-one interfaces to raw sources.

We also created a Base layer of models using the staging’ concept from dbt to implement their best practice of limiting references to raw data. Our Base models serve as a one-to-one interface to raw sources. They don’t change the grain of the raw source, but do apply renaming, recasting, or any other cleaning operation that relates to the source data collection system. 

The Base layer serves to protect users from breaking changes in raw sources. Raw external sources are by definition out of the control of Seamster and can introduce breaking changes for any number of reasons at any point in time. If and when this happens, you only need to apply the fix to the Base model representing the raw source, as opposed to every individual downstream model that depends on the raw source. 

Model Ownership for Teams

We knew that the tooling improvements of Seamster would be only one part of a greater data platform at Shopify. We wanted to make sure we’re providing mechanisms to support good dimensional modelling practices and support data discovery.

In dbt, a model is simply a .sql file. We’ve extended this definition in Seamster to define a model as a directory consisting of four files: 

  • model_name.sql
  • schema.yml

You can further organize models into directories that indicate a data science team at Shopify like ‘finance’ or ‘marketing’. 

To support a clean data warehouse we’ve also organized data models into these rough layers that differentiate between:

  • base: data models that are one-to-one with raw data, but cleaned, recast and renamed
  • application-ready: data that isn’t dimensionally modelled but still transformed and clean for consumption by another tool (for example,  training data for a machine learning algorithm)
  • presentation: shareable and reliable data models that follow dimensional modelling best practices and can be used by data scientists across different domains.

With these two changes, a data consumer can quickly understand the data quality they can expect from a model and find the owner in case there is an issue. We also pass this metadata upstream to other tools to help with the data discovery workflow.

More Tests

dbt has native support for ‘schema tests’, which are encoded in a model’s schema.yml file. These tests run against production data to validate data invariants, such as the presence of null values or the uniqueness of a particular key. This feature in dbt serves its purpose well, but we also want to enable data scientists to write unit tests for models that run against fixed input data (as opposed to production data).

Testing on fixed inputs allows the user to test edge cases that may not be in production yet. In larger organizations, there can and will be frequent updates and many collaborators for a single model. Unit tests give users confidence that the changes they’re making won’t break existing behaviour or introduce regressions. 

Seamster provides a Python-based unit testing framework. Data scientists write their unit tests in the file in the model directory. The framework enables constructing ‘mock’ input models from fixed data. The central object in this framework is a ‘mock’ data model, which has an underlying representation of a Pandas dataframe. You can pass fixed data to the mock constructor as either a csv-style string, Pandas dataframe, or a list of dictionaries to specify input data. 

Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation
Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation.

A constructor creates a test query where a common table expression (CTE) represents each input mock data model, and any references to production models (identified using dbt’s ‘ref’ macro) are replaced by references to the corresponding CTE. Once you execute a query, you can compare the output to an expected result. In addition to an equality assertion, we extended our framework to support all expectations from the open-source Great Expectations library to provide more granular assertions and error messaging. 

The main downside to this framework is that it requires a roundtrip to the query engine to construct the test data model given a set of inputs. Even though the query itself is lightweight and processes only a handful of rows, these roundtrips to the engine add up. It becomes costly to run an entire test suite on each local or CI run. To solve this, we introduced tooling both in development and CI to run the minimal set of tests that could potentially break given the change. This was straightforward to implement with accuracy because of dbt’s lineage tracking support; we simply had to find all downstream models (direct and indirect) for each changed model and run their tests. 

Schema and Directed Acyclic Graph Validation on the Cheap

Our objective in Seamster’s CI is to give data scientists peace of mind that their changes won’t introduce production errors the next time the warehouse is built. They shouldn’t have to wonder whether removing a column will cause downstream dependencies to break, or whether they made a small typo in their SQL model definition.

To achieve this accurately, we would need to build and tear down the entire warehouse on every commit. This isn’t feasible from both a time and cost perspective. Instead, on every commit we materialize every model as a view in a temporary BigQuery dataset which is created at the start of the validation process and removed as soon as the validation finishes. If we can’t build a view because its upstream model doesn’t provide a certain column, or if the SQL is invalid for any reason, BigQuery fails to build the view and produces relevant error messaging. 

Currently, We have a warehouse consisting of over 100 models, and this validation step takes about two minutes. We reduce validation time further by only building the portion of the directed acyclic graph (DAG) affected by the changed models, as done in the unit testing approach. 

dbt’s schema.yml serves purely as metadata and can contain columns with invalid names or types (data_type). We employ the same view-based strategy to validate the contents of a model’s schema.yml file ensuring the schema.yml is an accurate depiction of the actual SQL model.

Data Warehouse Rules

Like many large organizations, we maintain a data warehouse for reporting where accuracy is key. To power our independent data science teams, Seamster helps by enforcing conformance rules on the layers mentioned earlier (base, application-ready, and presentation layers). Examples include naming rules or inheritance rules which help the user reason over the data when building their own dependent models.

Seamster CI runs a collection of such rules that ensure consistency of documentation and modelling practices across different data science teams. For example, one warehouse rule enforces that all columns in a schema conform to a prescribed nomenclature. Another warehouse rule enforces that only base models can reference raw sources (via the ‘source’ macro) directly. 

Some warehouse rules apply only to certain layers. In the presentation layer, we enforce that any column name needs a globally unique description to avoid divergence of definitions. Since everything in dbt is YAML, most of this rule enforcement is just simple parsing.

So, How Did It Go?

To ensure we got it right and worked out the kinks, we ran a multiweek beta of Seamster with some of our data scientists who tested the system out on real models. Since you’re reading about it, you can guess by now that it went well!

While productivity measures are always hard, the vast majority of users reported they were shipping models in a couple of days instead of a couple of weeks. In addition, documentation of models increased because this is a feature built into the model spec.

Were there any negative results? Of course. dbt’s current incremental support doesn’t provide safe and consistent methods to handle late arriving data, key resolution, and rebuilds. For this reason, a handful of models (Type  2 dimensions or models in the 1.5B+ event territory) that required incremental semantics weren’t doable—for now. We’ve got big plans though!

Where to Next?

We’re focusing on updating the tool to ensure it’s tailored to Shopify’s data scientists. The biggest hurdle for a new product (internal and external) is adoption. We know we still have work to do to ensure that our tool is top of mind when users have simple (but not easy) reporting work. We’re spending time with each team to identify upcoming work that we can speed up by using Seamster. Their questions and comments will be part of our tutorials and documentations for new data scientists.

On the engineering front, an exciting next step is looking beyond batch data processing. Apache Beam and Beam SQL provide an opportunity to consider a single SQL-centric data modelling tool for both batch and streaming use cases.

We’re also big believers in open source at Shopify. Depending on the dbt’s community needs we’d also like to explore contributing our validation strategy and a unit testing framework to the project. 

If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.

Continue reading

Adopting Sorbet at Scale

Adopting Sorbet at Scale

On November 25, 2020 we held ShipIt! Presents: The State of Ruby Static Typing at Shopify. The video of the event is now available.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version or our monolith 40 times a day. Shopify is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale, with a dynamic language, even with the most rigorous review process and over 150 000 tests, it’s a challenge to ensure that everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

In my first post, I talked about how we brought static typing to our core monolith. We adopted Sorbet in 2019, and the Ruby Infrastructure team continues to work on ways to make the development process safer, faster, and more enjoyable for Ruby developers. Currently, Sorbet is only enforced on our main monolith, but we have 60 internal projects using Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration (CI) platform for every pull request and fails builds if type checking errors are found. As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures.

In this second post, I’ll present how we got from no Sorbet in our monolith to almost full coverage in the span of a few months. I’ll explain the challenges we faced, the tools we built to solve them, and the preliminary results of our experiment to reduce production errors with static typing.

Our Open-Source Tooling for Sorbet Adoption

Currently, Sorbet can’t understand all the constructs available in Ruby. Furthermore, Shopify relies on a lot of gems and frameworks, including Rails, that bring their own set of idioms. Increasing type coverage in our monolith meant finding ways to make Sorbet understand all of this. These are the tools we created to make it possible. They are open sourced in our effort to share our work with the community and make typing adoption easier for everyone.

Making Code Sorbet-compatible with RuboCop Sorbet

Even with gradual typing, moving our monolith to Sorbet required a lot of changes to remove or replace Ruby constructs that Sorbet couldn’t understand, such as non-constant superclasses or accessing constants through meta-programming with const_get. For this, we created RuboCop Sorbet, a suite of RuboCop rules allowing us to:

  • forbid some of the constructs not recognized by Sorbet yet 
  • automatically correct those constructs to something Sorbet can understand.

We also use these cops to require a minimal typed level on all files of our monolith (at least typed: false for now, but soon typed: true) and enforce some styling conventions around the way we write signatures.

Creating RBI Files for Gems with Tapioca

One considerable piece missing when we started using Sorbet was Ruby Interface file (RBI) generation for gems. For Sorbet to understand code from required gems, we had two options: 

  1. pass the full code of the gems to Sorbet which would make it slower and require making all gems compatible with Sorbet too
  2. pass a light representation of the gem content through an Ruby Interface file called a RBI file.

Being before the birth of Sorbet’s srb tool, we created our own: Tapioca. Tapioca provides an automated way to generate the appropriate RBI file for a given gem with high accuracy. It generates the definitions for all statically defined types and most of the runtime defined ones exported from a Ruby gem. It loads all the gems declared in the dependency list from the Gemfile into memory, then performs runtime introspection on the loaded types to understand their structure, and finally generates a complete RBI file for each gem with a versioned filename.

Tapioca is the de facto RBI generation tool at Shopify and used by a few renowned projects including Homebrew.

Creating RBI Files for Domain Specific Languages

Understanding the content of the gems wasn’t enough to allow type checking our monolith. At Shopify we use a lot of internal Domain Specific Languages (DSLs), most of them coming directly from Rails and often based on meta-programming. For example, the Active Record association belongs_to ends up defining tens of methods at runtime, none of which are statically visible to Sorbet. To enhance Sorbet coverage on our codebase we needed it to “see” those methods.

To solve this problem, we added RBI generation for Rails DSLs directly into Tapioca. Again, using runtime introspection, Tapioca analyzes the code of our application to generate RBI files containing a static definition for all the runtime-generated methods from Rails and other libraries.

Today Tapioca provides RBI generation for a lot of DSLs we use at Shopify:

  • Active Record associations
  • Active Record columns
  • Active Record enums
  • Active Record scopes
  • Active Record typed store
  • Action Mailer
  • Active Resource
  • Action Controller helpers
  • Active Support current attributes
  • Rails URL helpers
  • FrozenRecord
  • IdentityCache
  • Google Protobuf definitions
  • SmartProperties
  • StateMachines
  • …and the list is growing everyday

Building Tooling on Top of Sorbet with Spoom

As we began using Sorbet, the need for tooling built on top of it was more and more apparent. For example, Tapioca itself depends on Sorbet to list the symbols for which we need to generate RBI definitions.

Sorbet is a really fast Ruby parser that can build an Abstract Syntax Tree (AST) of our monolith in a matter of seconds versus a few minutes for the Whitequark parser. We believe that in the future a lot of tools such as linters, cops, or static analyzers can benefit from this speed.

Sorbet also provides a Language Server Protocol (LSP) with the option --lsp. Using this option, Sorbet can act as a server that is interrogated by other tools programmatically. LSP scales much better than using the file output by Sorbet with the --print option (see for example parse-tree-json or symbol-table-json) that spits out GBs of JSON for our monolith. Using LSP, we get answers in a few milliseconds instead of parsing those gigantic JSON files. This is generally how the language plugins for IDEs are implemented.

To facilitate the development of external tools to Sorbet we created Spoom, our toolbox to use Sorbet programmatically. It provides a set of useful features to interact with Sorbet, parse the configuration files, list the type checked files, collect metrics, or automatically bump files to higher strictnesses and comes with a Ruby API to connect with Sorbet’s LSP mode.

Today, Spoom is at the heart of our typing coverage reporting and provides the beautiful visualizations used in our SorbetMetrics dashboard.

Sharing Lessons Learned

After more than a year using Sorbet on our codebases, we learned a lot. I’ll share some insights about what typing did for us, which benefits it brings, and some of the limitations it implies.

Build, Measure, Learn

There’s a very scientific way to approach building products, encapsulated in the Build-Measure-Learn loop pioneered by Eric Ries. Our team believes in intuition, but we still prefer empirical proofs when we have access to them. So when we started with static typing in Ruby, we all believed it would be useful for our developers, but wanted to measure its effects and have hard data. This allows us to decide what we should concentrate on next based on the outcome of our measurements.

I talked about observing metrics, surveying developer happiness, or getting feedback through interviews in part 1, but my team wanted to go further and correlate the impact of typing on production errors. So, we conducted a series of controlled experiments to validate our assumptions.

Since our monolith evolves very fast, it becomes hard to observe the direct impact of typing on production. New features are added every day which gives rise to new errors while we work to decrease errors in other areas. Moreover, our monolith has about 500 gem dependencies (including transitive dependencies), any of which could introduce new errors in a version bump.

For this reason, we decreased our scope and targeted a smaller codebase for our experiment. Our internal developer tool, aptly named dev, was an ideal candidate. It’s a mature codebase that changes slowly and by a few people. It’s a very opinionated codebase with no external dependencies (the few dependencies it has are vendored), so it could satisfy the performance requirements of a command-line tool. Additionally, dev almost uses no meta-programming, especially not the kind normally coming from external libraries. Finally, it’s a tool with heavy usage since it’s the main development tool used by all developers at Shopify for their day-to-day work. It runs thousands of times a day on hundreds of different computers, there’s no edge case—at this scale, if something can break, it will.

We started monitoring all errors raised by dev in production, categorized the errors, analyzed their root cause, and tried to understand how typing could avoid them.

typed: ignore means typed: debt

Our first realisation was to keep away from typed: ignore. Ignoring a file can cause errors to appear in other files because those other files may reference something defined in the ignored file.

For example, if we opt to ignore this file:

Sorbet will raise errors in this file:

Since Sorbet doesn't even parse the file a.rb, it won’t know where constant A was defined. The more files you ignore, the more this case arises, especially when ignoring library files. This makes it harder and harder for other developers to type their own code.

As a rule of thumb at Shopify, we aim to have all our application files at least at typed: true and our test files at least at typed: false. We reserve typed: ignore for some test files that are particularly hard to type (because of mocking, stubbing, and fixtures), or some very specific files such as Protobuf definition files (which we handle through DSLs RBI generation with Tapioca).

Benefits Realized, Even at typed: false

Even at typed: false, Sorbet provides safety in our codebase by checking that all the constants resolve. Thanks to this, we now avoid mistakes triggering NameErrors either in CI or production.

Enabling Sorbet on our monolith allowed us to find and fix a few mistakes such as:

  • StandardException instead of StandardError
  • NotImplemented instead of NotImplementedError

We found dead code referencing constants deleted months ago. Interestingly, while most of the main execution paths were covered by tests, code paths for error handling were the places where we found the most NameErrors.

A bar graph showing the decreasing amount of NameErrors in dev over time
NameErrors raised in production for the dev project

During our experiment, we started by moving all files from dev to typed: false without adding any signatures. As soon as Sorbet was enabled in October 2019 on this project, no more NameErrors were raised in production.

Stacktrace showing NameError raised in production after Sorbet was enabled because of meta-programming like const_get
NameError raised in production after Sorbet was enabled because of meta-programming

The same observation was made on multiple projects: enabling Sorbet on a codebase eradicates all NameErrors due to developers’ mistakes. Note that this doesn’t avoid NameErrors triggered through metaprogramming, for example, when using const_get.

While Sorbet is a bit more restrictive when it comes to resolve constants, this strictness can be beneficial for developers:

Example of constant resolution error raised by Sorbet

typed: true Brings More Benefits

A circular tree map showing the relationship between strictness level and helpers in dev
Files strictnesses in dev (the colored dots are the helpers)

With our next experiment on gradual typing, we wanted to observe the effects of moving parts of the dev application to typed: true. We moved a few of the typed: false files to typed: true by focusing on the most reused part of the application, called helpers (the blue dots).

A bar graph showing the decrease in NoMethodErrors for files typed: true over time
NoMethodErrors in production for dev (in red the errors raised from the helpers)

By typing only this part of the application (~20% of the files) and still without signatures, we observed a decrease in NoMethodErrors for files typed: true.

Those preliminary results gave us confidence that a stricter typing can impact other classes of errors. We’re now in the process of adding signatures to the typed: true files in dev so we can observe their effect on TypeErrors and ArgumentErrors in production.

The Road Ahead of Us

The team working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable, and safe Ruby language for Shopify. As part of that mandate, we believe there are more things we can do to increase Ruby performance when the types of variables are known ahead of time. This is an interesting area to explore for the team as our adoption of typing increases and we’re certainly thinking about investing in this in the near future.

Support for Ruby 2.7

We keep our monolith as close as possible to Ruby trunk. This means we moved to Ruby 2.7 months ago. Doing so required a lot of changes in Sorbet itself to support syntax changes such as beginless ranges and numbered parameters, as well as new behaviors like forbidding circular argument references. Some work is still in progress to support the new forwarding arguments syntax and pattern matching. Stripe is currently working on making the keyword arguments compatible with Ruby 2.7 behavior.

100% Files at typed: true

The next objective for our monolith is to move 100% of our files to at least typed: true and make it mandatory for all new files going forward. Doing so implies making both Sorbet and our tooling smarter to handle the last Ruby idioms and constructs we can’t type check yet.

We’re currently focusing on changing Sorbet to support Rails ActiveSupport::Concerns (Sorbet pull requests #3424, #3468, #3486) and providing a solution to inclusion requirements. As well as, improvement to Tapioca for better RBI generation for generics and GraphQL support.

Investing in the Future of Static Typing for Ruby

Sorbet isn’t the end of the road concerning Ruby type checking. Matz announced that Ruby 3 will ship with RBS, another approach to type check Ruby programs. While the solution isn’t yet mature enough for our needs (mainly because of speed and some other limitations explained by Stripe) we’re already collaborating with Ruby core developers, academics, and Stripe to make its specification better for everyone.

Notably, we have open-sourced RBS parser, a C++ parser for RBS capable of translating a subset of RBS to RBI, and are now working on making RBS partially compatible with Sorbet.

We believe that typing, whatever the solution used is greatly beneficial for Ruby, Shopify, and our merchants so we'll continue to invest heavily in it. We want to increase the typing for the whole community.

We’ll continue to work with collaborators to push typing in Ruby even further. As we lessen the effort needed to adopt Sorbet, specifically on Rails projects, we’ll start making gradual typing mandatory on more internal projects. We will help teams start adopting at typed: false and move to stricter typing gradually.

As our long term goal, we hope to bring Ruby on par with compiled and statically typed languages regarding safety, speed and tooling.

Do you want to be part of this effort? Feel free to contribute to Sorbet (there are a lot of good first issues to begin with), check our many open-source projects or take a look at how you can join our team.

Happy typing!

—The Ruby Infrastructure Team

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Continue reading

Static Typing for Ruby

Static Typing for Ruby

On November 25, 2020 we held ShipIt! Presents: The State of Ruby Static Typing at Shopify. The video of the event is now available.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

The Three Key Requirements for a Typing Solution in Ruby

Even in 2018, typing for Ruby wasn't a novelty. A few attempts were made to integrate type annotations directly into the language or through external tools (RDL, Steep), or as libraries (dry-types). 

Which solution would best fit Shopify considering its codebase and culture? For the Ruby Infrastructure team, the best match for a typing solution needs:

  • Gradual typing: Typing a monolith isn't a simple task and can’t be done in a day. Our code evolves fast, and we can’t ask developers to stop coding while we add types to the existing codebase. We need flexibility to add types without blocking the development process or limiting our ability to satisfy merchants needs.
  • Speed: Considering the size of Shopify’s codebase, speed is a concern. If our goal is to provide quick feedback on errors and remove pressure from continuous integration (CI), we need a solution that’s fast.
  • Full Ruby support: We use all of Ruby at Shopify. Our culture embraces the language and benefits from all features, even hard to type ones like metaprogramming, overloading, and class reopening. Support for Rails is also a must. From code elegance to developer happiness, the chosen solution needs to be compatible as much as possible with all Ruby features.

With such a list of requirements, none of the contenders at the time could satisfy our needs, especially the speed requirement. We started thinking about developing our own solution, but a perfectly timed meeting with Stripe, who were working on a solution to the problem, introduced us to Sorbet.

Sorbet was closed-source at the time and under heavy development but was already promising. It’s built for gradual typing with phenomenal performance (able to analyze 100,000 lines per second per core) making it significantly faster than running automated tests. It can handle hard to type things like metaprogramming, thanks to Ruby Interface files (RBI). This is how, at the start of 2019, Shopify began its journey toward static type checking for Ruby.

Treat Static Typing as a Product

With only a three-person team and a lot on our plate, fully typing our monolith with Sorbet was going to be an approach based on Shopify’s Get Shit Done (GSD) framework.

  1. We tested the viability of Sorbet on our core monolith by only typing a few files, to check if we could observe benefits from it while not impairing other developers’ work. Sorbets’ gradual approach proved to be working.
  2. We manually created RBI files to represent what Sorbet could not understand yet. We checked we supported Ruby’s most advanced features as well as Rails constructs.
  3. We added more and more files while keeping an eye on performance ensuring Sorbet would scale with our monolith.

This gave us confidence Sorbet was the right choice to solve our problem. Once we officially decided to go with Sorbet we reflected on how we can reach 100% adoption in the monolith. To determine our roadmap we looked at:

  • how many files needed to be typed
  • the content of the files
  • the Ruby features they used.

Track Static Typing Adoption

Type checking in Sorbet comes in different levels of strictness. The strictness is defined on a per file basis by adding a magic comment in the file, called a sigil, written # typed: LEVEL, where LEVEL can be one of the following: 

  • ignore: At this level, the file is not even read by Sorbet, and no errors are reported for this file at all.
  • false: Only errors related to syntax, constant resolution and correctness of sigs are reported. At this level sorbet doesn’t check the calls in the files even if the methods called don't exist anywhere in the codebase.
  • true: This is the level where Sorbet actually starts to type check your code. All methods called need to exist in the code base. For each call, Sorbet will check that the arguments count matches the method definition. If the method has a signature, Sorbet will also check their types.
  • strict: At this level all methods must have a signature, and all constants and instance variables must have explicitly annotated types.
  • strong: Sorbet no longer allows untyped variables. In practice, this level is actually unusable for most files because Sorbet can’t type everything yet and even Stripe advises against using it.

Once we were able to run Sorbet on our codebase, we needed a way to track our progress and identify which parts of our monolith were typed with which strictness or which parts needed more work. To do so we created SorbetMetrics, a tool able to collect and display metrics about typing coverage for all our internal projects. We started tracking three key metrics to measure Sorbet adoption :

  • Sigils: how many files are typed ignore, false, true, strict or strong
  • Calls: how many calls are sent to a method with a signature
  • Signatures: how many methods have a signature

Bar graph showing increased Sorbet usage in projects over time. Below the bar graph is a table showing the percentage of sigils, calls, signatures in each project.
SorbetMetrics dashboard homepage

Each day SorbetMetrics pulls the latest version of our monolith and other Shopify projects using Sorbet, computes those metrics and displays them in a dashboard internally available to all our developers.

A selection of charts from the SorbetMetrics Dashboard. 3 pie charts showing the percentage of sigils, calls, and signatures in the monolith. 3 line charts showing Sigils, calls, and signature percentage over time. A circular tree map showing the relationship between strictness level and components. 2 line charts showing Sorbet versions and typechecking time over time
SorbetMetrics dashboard for our monolith

Sorbet Support at Scale

If we treat typing as a product, we also need to focus on supporting and enabling our “customers” who are developers at Shopify. One of our goals was to have a strong support system in place to help with any problems that arise and slow developers down.

Initially, we supported developers with a dedicated Slack channel where Shopifolk could ask questions to the team. We’d answer these questions real-time and help Shopifolk with typing efforts where our input was important.

This white glove support model obviously didn't scale, but it was an excellent learning opportunity for our team—we now understood the biggest challenges and recurring problems. We ended up solving some problems over and over again, but it solidified the effort to understand the patterns and decide which features to work on next.

Using Slack meant our answers weren't discoverable forever. We moved most of the support and conversation to our internal Discourse platform, increasing discoverability and broader sharing of knowledge. This also allows us to record solutions in a single place and let developers self-serve as much as possible. As we onboard more and more projects with Sorbet, this solution scales better.

Understand Developer Happiness

Going further from unblocking our users, we also need to ensure their happiness. Sorbet and more generally static typing in Ruby wouldn’t be a good fit for us if it made our developers miserable. We’re aware that it introduces a bit more work, so the benefits need to balance with the inconvenience.

Our first tool to measure developers’ opinions of Sorbet is surveys. Twice a year, we send a “Typing @ Shopify” survey to all developers and collect their sentiments regarding Sorbet’s benefits and limitations, as well as what we should focus on in the future.

A bar graph showing the increasing strongly agree answer over time to the question I want Sorbet to be applied to other Shopify projects. Below that graph is a bar graph showing the increasing strongly agree answer over time to the question I want more code to be typed.
Some responses from our “Sorbet @ Shopify” surveys

We use simple questions (“yes” or “no”, or a “Strongly Disagree” (1) to “Strongly Agree” (5) scale) and then look at how the answers evolve over time. The survey results gave us interesting insights:

  • Sorbet catches more errors on developer’s pull requests (PR) as adoption increased
  • Signatures help with code understanding and give developers confidence to ship
  • Added confidence directly impacted the increasing positive opinion about static typing in Ruby
  • Over time developers wanted more code and more projects to be typed
  • Developers get used to Sorbet syntax over time
  • IDE integration with Sorbet is a feature developers are rooting for

Our main observation is that developers enjoy Sorbet more as the typing coverage increases. This is one reason that's increasing our motivation to reach 100% of files at typed: true and maximize the amount of methods with a signature.

The second tool is interviews with individual developers. We select a team working with Sorbet and meet each member to talk about their experience using Sorbet either in the monolith or on another project. We get a better understanding of what their likes and dislikes are, what we should be improving, but also how we can better support them when introducing Sorbet, so the team keeps Sorbet in their project.

The Current State of Sorbet at Shopify

Currently, Sorbet is only enforced on our main monolith and we have about 60 other internal projects that opted to use Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration platform (CI) for every PR and fails builds if type checking errors are found. We’re currently evaluating the idea of enforcing valid type checking on CI even before running the automated tests.

Three pie charts showing percentage of sigils, calls, and signatures in the monolith used to measure Sorbet adoption
Typing coverage metrics for Shopify’s monolith

As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures. All of this can be type checked under 15 seconds on our developers machines.

A circular tree map showing the relationship between strictness level and components
Files strictness map in Shopify’s monolith

The circle map shows which parts of our monolith are at which strictness level. Each dot represents a Ruby file (excluding tests). Each circle represents a component (a set of Ruby files serving the same application concern). Yes, it looks like a Petri dish and our goal is to eradicate the bad orange untyped cells.

A bar graph showing increased number of Shopify projects using Sorbet over time
Shopify projects using Sorbet

Outside of the core monolith, we’ve also observed a natural increase of Shopify projects, both internal and open-source, using Sorbet. As I write these lines, more than 60 projects now use Sorbet. Shopifolks like Sorbet and use it on their own without being forced to do so.

A bar graph showing manual dev tc runs from developers machine on our monolith in 2019
Manual dev tc runs from developers machine on our monolith in 2019

Finally, we track how many times our developers ran the command dev tc to typecheck a project with Sorbet on their development machine. This shows us that developers don’t wait for CI to use Sorbet—everyone enjoys a faster feedback loop.

The Benefits of Types for Ruby Developers 

Now that Sorbet is fully adopted in our core monolith, as well as in many internal projects, we’re starting to see the benefits of it on our codebases as well as on our developers. Our team that is working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable and safe Ruby language for Shopify. As part of that mandate, we believe that static typing has a lot to offer for Ruby developers, especially when working on big, complex codebases.

In this post I focused on the process we followed to ensure Sorbet was the right solution for our needs, treating static typing as a product and showed the benefits of this product on our customers: Shopifolk working on our monolith and outside. Are you curious to know how we got there? Then you’ll be interested in the second part: Adopting Sorbet at Scale where I present the tools we built to make adoption easier and faster, the projects we open-sourced to share with the community and the preliminary results of our experiment with static typing to reduce production errors.

Happy typing!

—The Ruby Infrastructure Team

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Introduce Composite Primary Keys in Rails

How to Introduce Composite Primary Keys in Rails

Databases are a key scalability bottleneck for many web applications. But what if you could make a small change to your database design that would unlock massively more efficient data access? At Shopify, we dusted off some old database principles and did exactly that with the primary Rails application that powers online stores for over a million merchants. In this post, we’ll walk you through how we did it, and how you can use the same trick to optimize your own applications.


A basic principle of database design is that data that is accessed together should be stored together. In a relational database, we see this principle at work in the design of individual records (rows), which are composed of bits of information that are often accessed and stored at the same time. When a query needs to access or update multiple records, this query will be faster if those rows are “near” to each other. In MySQL, the sequential ordering of rows on disk is dictated by the table’s primary key.

Active Record is the portion of the Rails application framework that abstracts and simplifies database access. This layer introduces database practices and conventions that greatly simplify application development. One such convention is that all tables have a simple automatically incrementing integer primary key, often called `id`. This means that, for a typical Rails application, most data is stored on disk strictly in the order the rows were created. For most tables in most Rails applications, this works just fine and is easy for application developers to understand.

Sometimes the pattern of row access in a table is quite different from the insertion pattern. In the case of Shopify’s core API server, it is usually quite different, due to Shopify’s multi-tenant architecture. Each database instance contains records from many shops. With a simple auto-incrementing primary key, table insertions interleave the insertion of records across many shops. On the other hand, most queries are only interested in the records for a single shop at a time.

Let’s take a look at how this plays out at the database storage level. We will use details from MySQL using the InnoDB storage engine, but the basic idea will hold true across many relational databases. Records are stored on disk in a data structure called a B+ tree. Here is an illustration of a table storing orders, with the integer order id shown, color-coded by shop:

Individual records are grouped into pages. When a record is queried, the entire page is loaded from disk into an in-memory structure called a buffer pool. Subsequent reads from the same page are much faster while it remains in the buffer pool. As we can see in the example above, if we want to retrieve all orders from the “yellow” shop, every page will need loading from disk. This is the worst-case scenario, but it turned out to be a prevalent scenario in Shopify’s main operational database. For some of our most important queries, we observed an average of 0.9 pages read per row in the final query result. This means we were loading an entire page into memory for nearly every row of data that we needed!

The fix for this problem is conceptually very simple. Instead of a simple primary key, we create a composite primary key [shop_id, order_id]. With this key structure, our disk layout looks quite different:

Records are now grouped into pages by shop. When retrieving orders for the “yellow” shop, we read from a much smaller set of pages (in this example it’s only one page less, but imagine extrapolating this to a table storing records for 10,000 shops and the result is more profound).

So far, so good. We have an obvious problem with the efficiency of data access and a simple solution. For the remainder of this article, we’ll go through some of the implementation details and challenges we came across with rolling out composite primary keys in our main operational database, along with the impact for our Ruby on Rails application and other systems directly coupled to our database. We will continue using the example of an “orders” table, both because it is conceptually simple to understand. It also turned out to be one of the critical table names that we applied this change to.

Introducing Composite Primary Keys

The first challenge we faced with introducing composite primary keys was at the application layer. Our framework and application code contained various assumptions about the table’s primary key. Active Record, in particular, assumes an integer primary key, and although there is a community gem to monkey-patch this, we didn’t have confidence that this approach would be sustainable and maintainable in the future. On deeper analysis, it turned out that nearly all such assumptions in application layer code continued to hold if we changed the `id` column to be an auto-incrementing secondary key. We can leave the application layer blissfully unaware of the underlying database schema by forcing Active Record to treat the `id` column as a primary key:

class Order < ApplicationRecord
  self.primary_key = :id
  .. remainder of order model ...

Here is the corresponding SQL table definition:

CREATE TABLE `orders` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `shop_id` bigint(20) NOT NULL,
  … other columns ...
  PRIMARY KEY (`shop_id`,`id`),
  KEY `id` (`id`)
  … other secondary keys ...

Note that we chose to leave the secondary index as a non-unique key here. There is some risk to this approach because it is possible to construct application code that results in duplicate models with the same id (but with different shop_id in our case). You can opt for safety here and make a unique secondary key on id. We took this approach because the method we use for live schema migrations is prone to deadlock on tables containing multiple unique constraints. Specifically, we use Large Hadron Migrator (LHM), which uses MySQL triggers to copy records into a shadow table during migrations. Unique constraints are enforced in InnoDB through an exclusive table-level write lock. Since there are two tables accepting writes, each containing two exclusive locks, all of the necessary deadlock conditions are present during migration. You may be able to keep a unique constraint on `id` if any of the following are true:

  • You don’t perform live migrations on your application.
  • Your migrations don’t use SQL triggers (such as the default Rails migrations).
  • The write throughput on the table is low enough that a low volume of deadlocks is acceptable for your application.
  • The code path for writing to this table is resilient to database transaction failures.

The remaining area of concern is any data infrastructure that directly accesses the MySQL database outside of the Rails application layer. In our case, we had three key technologies that fell into this category: 

  • Our database schema migration infrastructure, already discussed above.
  • Our live data migration system, called Ghostferry. Ghostferry moves data across different MySQL instances while the application is still running, enabling load-balancing of sharded data across multiple databases. We implemented support for composite primary keys in ghostferry as part of this work, by introducing the ability to specify an alternate column for pagination during migration.
  • Our data warehousing system does both bulk and incremental extraction of MySQL tables into long term storage. Since this system is proprietary to Shopify we won’t cover this area further, but if you have a similar data extraction system, you’ll need to ensure it can accommodate tables with composite primary keys.


Before we dig into specific results, a disclaimer: every table and corresponding application code is different, so the results you see in one table do not necessarily translate into another. You need to carefully consider your data’s access patterns to ensure that the primary key structure produces the optimal clustering for those access patterns. In our case of a sharded application, clustering the data by shop was often the right answer. However, if you have multiple closely connected data models, you may find another structure works better. To use a common example, if an application has “Blog” and “BlogPost” models, a suitable primary key for the blog_posts table may be (blog_id, blog_post_id). This is because typical data access patterns will tend to query posts for a single blog at once. In some cases, we found no overwhelming advantage to a composite primary key because there was no such singular data access pattern to optimize for. In one more subtle example, we found that associated records tended to be written within the same transaction, and so were already sequentially ordered, eliminating the advantage of a composite key. To extend the previous blog example, imagine if all posts for a single blog were always created in a single transaction, so that blog post records were never interleaved with insertion of posts from other blogs.

Returning to our leading example of an “orders” table, we measured a significant improvement in database efficiency:

  • The most common queries that consumed most database capacity had a 5-6x improvement in elapsed query time.
  • Performance gains corresponded linearly with a reduction in MySQL buffer pool page reads per query. Adding a composite key on our single most queried table reduced the median buffer pool reads per query from 1.8 to 1.2.
  • There was a dramatic improvement in tail latency for our slowest queries. We maintain a log of slow queries, which showed a roughly 80% reduction in distinct queries relating to the orders table.
  • Performance gains varied greatly across different kinds of queries. The most dramatic improvement was 500x on a particularly egregious query. Most queries involving joins saw much lower improvement due to the lack of similar data clustering in other tables (we expect this to improve as more tables adopt composite keys).
  • A useful measure of aggregate improvement is to measure the total elapsed database time per day, across all queries involving the changed table. This helps to add up the net benefit on database capacity across the system. We observed a reduction of roughly one hour per day, per shard, in elapsed query time from this change.

There is one notable downside on performance that is worth clearly calling out. A simple auto-incrementing primary key has optimal performance on insert statements because data is always clustered in insertion order. Changing to a composite primary key results in more expensive inserts, as more distinct database pages need to be both read and flushed to disk. We observed a roughly 10x performance degradation on inserts by changing to a composite primary key. Most data are queried and updated far more often than inserted, so this tradeoff is correct in most cases. However, it is worth keeping this in mind if insert performance is a critical bottleneck in your system for the table in question.

Wrapping Up

The benefits of data clustering and the use of composite primary keys are well-established techniques in the database world. Within the Rails ecosystem, established conventions around primary keys mean that many Rails applications lose out on these benefits. If you are operating a large Rails application, and either database performance or capacity are major concerns, it is worth exploring a move to composite primary keys in some of your tables. In our case, we faced a large upfront cost to introduce composite primary keys due to the complexity of our data infrastructure, but with that cost paid, we can now introduce additional composite keys with a small incremental effort. This has resulted in significant improvements to query performance and total database capacity in one of the world’s largest and oldest Rails applications.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

Building Mental Models of Ideas That Don’t Change

Building Mental Models of Ideas That Don’t Change

I hope these mental models are as valuable for you as they are for me. I presented these ideas at ShipIt! Presents: Building Mental Models of Ideas That Don’t Change on October 28, 2020 and the video is available. I went over the process of prioritizing new ideas and coming up with a system of models for yourself to organize these ideas. If you find this useful, stay updated by following me on Twitter or my blog.

There’s always new stuff: new frameworks, new languages, and new platforms. All of this adds up. Sometimes it feels like you’re just treading water, and not actually getting better at what you do. I’ve tried spending more time learning this stuff, but that doesn’t work—there’s always more. I have found a better approach is learning things at a deeper level and using those lessons as a checklist. This checklist of core principles are called mental models. 

I learned this approach by studying how bright people think. You might have heard Richard Feynman describe the handful of algorithms that he applies to everything. Maybe you’ve  seen Elon Musk describe his approach as thinking by fundamental principles. Charlie Munger also credits most of his financial success to mental models. All of these people are amazing and you won’t get to their level with mental models alone, but mental models give you a nudge in the right direction.

So, how does one integrate mental models into their life and work? The first thing that you need is a method for prioritizing new concepts that you should learn. After that, you’ll need a good system for keeping track of what you have identified as important. With this process, you’ll identify mental models and use them to make more informed decisions. Below I start by describing some engineering and management mental models that I have found useful over the years.

Table of Contents

Engineering Mental Models

 Management Mental Models

 Engineering Mental Models

Avoid Silent Failures

When something breaks you should hear about it. This is important because small issues can help you find larger structural issues. Silent failures typically happen when exceptions are silenced—this may be in a networking library, or the code that handles exceptions. Failures can also be silent when one of your servers is down. You can prevent this by using a third party system that pings each of the critical components.

As your project gets more mature, set up a dashboard to track key metrics and create automated alerts. Generally, computers should tell you when something is wrong. Systems become more difficult to monitor as they grow. You want to measure and log everything at the beginning and not wait until something goes wrong. You can encourage other developers to do this by creating helper classes with a really simple APIs since things that are easy and obvious are more likely to be used. Once you are logging everything, create automated alerts. Post these alerts in shared communication channels, and automatically page the oncall developer for emergencies.

Do Minimal Upfront Work and Queue the Rest

A system is scalable when it handles unexpectedly large bursts of incoming requests. The faster your system handles a request, the faster it gets to the next one. Turns out, that in most cases, you don’t have to give a response to the request right away—just a response indicating you've started working on the task. In practice, you queue a background job after you receive a request. Once your job is in a queue, you have the added benefit of making your system fault tolerant since failed jobs can be tried again.

Scaling Reads with Caching and Denormalizing

Read-heavy systems mean some data is being read multiple times. This can be problematic because your database might not have enough capacity to deal with all of that work. The general approach of solving this is by pre-computing this data (called denormalizing) and storing it somewhere fast. In practice, instead of letting each request hit multiple tables in a database, you pre-compute the expected response and store it in a single place. Ideally, you store this information somewhere that’s really fast to read from (think RAM). In practice this means storing data in data stores like Memcached.

Scaling Writes with Sharding, NoSQL Datastore, or Design Choices

Write-heavy systems tend to be difficult to deal with. Traditional relational databases can handle reads pretty well, but have trouble with writes. They take more time processing writes because relational databases spend more effort on durability and that can lock up writes and create timeout errors.

Consider the scenario where a relational database is at it’s write-capacity and you can’t scale up anymore. One solution is to write data to multiple databases. Sharding is the process where you split your database into multiple parts (known as shards). This process allows you to group related data into one database. Another method of dealing with a write heavy system is by writing to Non-relational (NoSQL) databases. These databases are optimized to handle writes, but there’s a tradeoff. Depending on the type of NoSQL database and its configuration, it gives up:

  • atomic transactions (they don’t wait for other transactions to fully finish), 
  • consistency across multiple clusters (they don’t wait for other clusters to have the same data),
  • durability (they don’t spend time writing to disk). 

It may seem like you are giving up a lot, but you mitigate some of these losses with design choices. 

Design choices help you cover some of the weaknesses of SQL databases. For example, consider that updating rows is much more expensive than creating new rows. Design your system so you avoid updating the same row in multiple flows—insert new rows to avoid lock contention. With all of that said, I recommend starting out with a SQL database, and evolving your setup depending on your needs.

Horizontal Scaling Is the Only Real Long Term Solution

Horizontal scaling refers to running your software on multiple small machines, while vertical scaling refers to running your software on one large machine. Horizontal scaling is more fault tolerant since failure of a machine doesn’t mean an outage. Instead, the work for the failed machine is routed to the other machines. In practice, horizontally scaling a system is the only long term approach to scaling. All systems that appear ‘infinitely-scalable’ are horizontally scaled under the hood: Cloud object stores like S3 and GCS; NoSQL databases like Bigtable and Dynamo DB; and stream processing systems like Kafka are all horizontally scaled. The cost for horizontally scaling systems is application and operational complexity. It takes significant time and potential complexity to horizontally scale your system, but you want to be in a situation where you can linearly scale your system by adding more computers.

Things That are Harder to Test Are More Likely to Break

Among competing approaches to a problem, you should pick the most testable solution (this is my variant of Occam’s Razor). If something is difficult to test, people tend to avoid testing it. This means that future programmers (or you) will be less likely to fully test this system, and each change will make the system more brittle. This model is important to remember when you first tackle a problem because good testability needs to be baked into the architecture. You’ll know when something is hard to test because your intuition will tell you.

Antifragility and Root Cause Analysis

Nassim Taleb uses the analogy of a hydra in Antifragile; they grow back a stronger head every time they are struck. The software industry championed this idea too. Instead of treating failures as shameful incidents that should be avoided at all costs, they’re now treated as opportunities to improve the system. Netflix’s engineering team is known for Chaos Monkey, a resiliency system that turns off random components. Once you anticipate random events, you can build a more resilient system. When failures do happen, they’re treated as an opportunity to learn.

Root cause analysis is a process where the people involved in a failure try to extract the root cause in a blameless way by starting off by what went right, and then diving into the failure without blaming anyone.

Big-O and Exponential Growth

The Big-O notation describes the growth in complexity of an algorithm. There’s a lot to this, but you’ll get very far if you just understand the difference between constant, linear, and exponential growth. In layman’s terms, algorithms that perform one task are better than algorithms that perform many tasks, and algorithms that perform many tasks are better than ones where the tasks are ever increasing with each iteration. I have found this issue visible at an architectural level as well.

Margin of Safety

Accounting for a margin of safety means you need to leave some room for errors or exceptional events. For example, you might be tempted to run each server at 90% of its capacity. While this saves money, it leaves your server vulnerable to spikes in traffic. You’ll have more confidence in your setup, if you have auto-scaling setup. There’s a problem with this too, your overworked server can cause cascading failures in the whole system. By the time auto-scaling kicks in, the new server may have a disk, connection pool or an assortment of other random fun issues. Expect the unexpected and give yourself some room to breathe. Margin of safety also applies to planning releases of new software. You should add a buffer of time because unexpected things will come up.

Protect the Public API

Be very careful when making changes to the public API. Once something is in the public API, it’s difficult to change or remove. In practice, this means having a very good reason for your changes, and being extremely careful with anything that affects external developers; mistakes in this type of work affect numerous people and are very difficult to revert.


Any system with many moving parts should be built to expect failures of individual parts. This means having backup providers for systems like Memcached or Redis. For permanent data-stores like SQL, fail-overs and backups are critical. Keep in mind that you shouldn’t consider something a backup unless you do regular drills to make sure that you can actually recover that data.

Loose Coupling and Isolation

Tight coupling means that different components of a system are closely interconnected. This has two major drawbacks. The first drawback is that these tightly coupled systems are more complex. Complex systems, in turn, are more difficult to maintain and more error prone. The second major drawback is that failure in one component propagates faster. When systems are loosely coupled, failures can be self contained and can be replaced by potential backups (see Redundancy). At a code level, reducing tight coupling means following the single responsibility principle which states that every class has a single responsibility and communicates with other classes with a minimal public API. At an architecture level, you improve tightly coupled systems by following the service oriented architecture. This architecture system suggests dividing components by their business services and only allows communication between these services with a strict API.

Be Serious About Configuration

Most failures in well-tested systems occur due to bad configuration; this can be changes like environmental variables updates or DNS settings. Configuration changes are particularly error prone because of the lack of tests and the difference between the development and production environment. In practice, add tests to cover different configurations, and make the dev and prod environment as similar as possible. If something works in development, but not production, spend some time thinking about why that’s the case.

Explicit Is Better than Implicit

The explicit is better than implicit model is one of the core tenants from the Zen of Python and it’s critical to improving code readability. It’s difficult to understand code that expects the reader to have all of the context of the original author. An engineer should be able to look at class and understand where all of the different components come from. I have found that simply having everything in one place is better than convoluted design patterns. Write code for people, not computers.

Code Review

Code review is one of the highest leverage activities a developer can perform. It improves code quality and transfers knowledge between developers. Great code reviewers change the culture and performance of an entire engineering organization. Have at least two other developers review your code before shipping it. Reviewers should give thorough feedback all at once, as it’s really inefficient to have multiple rounds of reviews. You’ll find that your code review quality will slip depending on your energy level. Here’s an approach to getting some consistency in reviews: 

  1. Why is this change being made? 
  2. How can this approach or code be wrong? 
  3. Do the tests cover this or do I need to run it locally? 

Perceived Performance

Based on UX research, 0.1 second (100 ms) is the gold standard of loading time. Slower applications risk losing the user’s attention. Accomplishing this load time for non-trivial apps is actually pretty difficult, so this is where you can take advantage of perceived performance. Perceived performance refers to how fast your product feels. The idea is that you show users placeholder content at load time and then add the actual content on the screen once it finishes loading. This is related to the Do Minimal Upfront Work and Queue the Rest model.

Never Trust User Input Without Validating it First

The internet works because we managed to create predictable and secure abstractions on top of unpredictable and insecure networks of computers. These abstractions are mostly invisible to users but there’s a lot happening in the background to make it work. As an engineer, you should be mindful of this and never trust input without validating it first. There are a few fundamental issues when receiving input from the user.

  1. You need to validate that the user is who they say they are (authentication).
  2. You need to ensure that the communication channel is secure, and no one else is snooping (confidentiality).
  3. You need to validate that the incoming data was not manipulated in the network (data integrity).
  4. You need to prevent replay attacks and ensure that the same data isn’t being sent multiple times.
  5. You could also have the case where a trusted entity is sending malicious data.

This is simplified, and there are more things that can go wrong, so you should always validate user input before trusting it.

Safety Valves

Building a system means accounting for all possibilities. In addition to worst case scenarios, you have to be prepared to deal with things that you cannot anticipate. The general approach for handling these scenarios is stopping the system to prevent any possible damage. In practice, this means having controls that let you reject additional requests while you diagnose a solution. One way to do this is adding an environment variable that can be toggled without deploying a new version of your code.

Automatic Cache Expiration

Your caching setup can be greatly simplified with automatic cache expiration. To illustrate why, consider the example where the server is rendering a product on a page. You want to expire the cache whenever this product changes. The manual method is by expiring the cache expiration code after the product is changed. This requires two separate steps, 1) Changing the product, and then 2) Expiring the cache. If you build your system with key-based caching, you avoid the second step all together. It’s typically done by using a combination of the product’s ID and it’s last_updated_at_timestamp as the key for the product’s cache. This means that when a product changes it’ll have a different last_updated_at_timestamp field. Since you’ll have a different key, you won’t find anything in the cache matching that key and fetch the product in it’s newest state. The downside of this approach is that your cache datastore (e.g., Memcached or Redis) will fill up with old caches. You can mitigate it by adding an expiry time to all caches so old caches automatically disappear. You can also configure Memcached so it evicts the oldest caches to make room for new ones.

Introducing New Tech Should Make an Impossible Task Possible or Something 10x Easier

Most companies eventually have to evaluate new technologies. In the tech industry, you have to do this to stay relevant. However, introducing a new technology has two negative consequences. First, it becomes more difficult for developers to move across teams. This is a problem because it creates knowledge silos within the company, and slows down career growth. The second consequence is that fewer libraries or insights can be shared across the company because of the tech fragmentation. Moving over to new tech might come up because of people’s tendency to want to start over and write things from scratch—it’s almost always a bad idea. On the other hand, there are a few cases where introducing a new technology makes sense like when it enables your company to take on previously impossible tasks. It makes sense when the technical limitation of your current stack is preventing you from reaching your product goals.

Failure Modes

Designers and product folks focus on the expected use cases. As an engineer you also have to think about the worst case scenarios because that’s where the majority of your time will go. At scale, all bad things that can happen do happen. Asking “What could go wrong” or “How can I be wrong” really helps; these questions also cancel out our bias towards confirmation of our existing ideas. Think about what happens when no data, or a lot of data is flowing through the system (Think “Min-Max”). You should expect computers to occasionally die and handle those cases gracefully, and expect network requests to be slow or stall all together.

Management Mental Models

The key insight here is that "Management" might be the wrong name for this discipline all together. What you are really doing is growing people. You’ll rarely have to manage others if you align your interests with the people that report to you. Compared to engineering, management is more fuzzy and subjective. This is why engineers struggle with it. It's really about calibration; you are calibrating approaches for yourself and your reports. What works for you, might not work for me because the world has different expectations from us. Likewise, just reading books on this stuff doesn't help because advice from the book is calibrated for the author.

With that said, I believe the following mental models are falsifiable. I also believe that doing the opposite of these will always be harmful. I find these particularly valuable while planning my week. Enjoy!

Create Motivation by Aligning Incentives

Incentives drive our behavior above all else. You probably feel this yourself when you procrastinate on tasks that you don't really want to do.  Work with your reports to identify the intersection of:

  1. What do they want to work on?
  2. What does the product need?
  3. What does that company need?  

Venn diagram of the intersection of the three incentives
The intersection of the three incentives

Magic happens when these questions produce themes that overlap with each other. The person will have intrinsic motivation for tasks that build their skills, improve their product, and their company. Working on the two intersecting themes to these questions can be fruitful too. You can replace 'product' with 'direct team' if appropriate.

 Occasionally you'll find someone focusing on a task that's only:

  • done because that's what the person wants (neglecting the product and the company)
  • what the product needs (neglecting the person's needs or the company)
  • what the company wants (neglecting the person's needs or their product). 

This is fine in the short term, but not a good long term strategy. You should nudge your reports towards these overlapping themes.

Create Clarity by Understanding the "Why" and Having a Vision for the Product

You should have a vision for where your product needs to go. This ends up being super helpful when deciding between competing tactical options and also helps clear up general confusion. You must communicate the vision with your team.  While being a visionary isn't included in your job description, aligning on a "why" often counteracts the negative effects of broken-telephone effect in communication and entropy in organizations.

Focus on High Leverage Activities

This is the central idea in High output management. The core idea is similar to the "Pareto principle" where you focus your energy on the 20% of the tasks that have 80% of the impact. If you don't do this, your team will spend a lot of time, but not accomplish much. So, take some time to plan your approach and focus on the activities that give you the most leverage. I found Donella Medow’s research to be a super user for understanding leverage. A few examples of this include:

Promote Growth Mindset in Your Team

If you had to start with zero, what you'd want is the ability to acquire new skills. You want to instil a mindset of growth in yourself and the rest of the team. Create  an environment where reflection and failures are talked about. Lessons that you truly learn are the ones that you have learnt by making mistakes. Create an environment where people obsess about the craft and consider failures a learning opportunity.

Align Your Team on the Common Vision

Aligning your team towards a common direction is one of the most important things you can do. It'll mean that people will go in the same general direction.

Build Self-organizing Teams

Creating self-sufficient teams is the only way to scale yourself. You can enable a team to do this by promoting a sense of ownership. You can give your input without taking authority away from others and offer suggestions without steamrolling leaders.

Communication and Structural Organization

You should focus on communication and organization tools that keep the whole team organized. Communication fragmentation leads to massive waste.

Get the Architecture Right

This is where your engineering chops will come in handy. From a technical perspective, getting the core pieces of the architecture right ends up being critical and defines the flow of information in the system.

Don’t Try to be Efficient with Relationships

As an engineer your brain is optimized to seek efficiency. Efficiency isn’t a good approach when it comes to relationships with people as you often have the opposite effect as to what you intended. I have found that 30 minute meetings are too fast for one-on-ones with your reports. You want to give some time for banter and a free flow of information. This eases people up, you have better conversations and they often end up sharing more critical information than they would otherwise. Of course, you don't want to spend a lot of time in meetings, so I prefer to have longer infrequent meetings instead of frequent short meetings.

This model also applies to pushing for changes or influencing others in any way. This is a long game, and you should be prepared for that. Permanent positive behavioral changes take time.

Hire Smart People Who Get Stuff Done and You Want to Be Around

Pick business partners with high intelligence, energy, and, above all, integrity
Pick business partners with high intelligence, energy, and, above all, integrity. -@Naval

Hiring, when done right, is one of the highest-leverage activities that you can work on. You are looking for three key signals when hiring:

  • smart
  • gets stuff done
  • good to be around. 

When looking for the "smart" signal, be aware of the "halo effect" and be weary of charmers. "Get stuff done" is critical because you don't want to be around smart people who aren’t adding value to your company. Just like investing, your aim should be to identify people on a great trajectory. "Good to be around" is tricky because it's filled with personal bias. A good rule is to never hire assholes. Even if they are smart and get stuff done, they’ll do that at the expense of others and wreak havoc on the team's culture. Avoid! Focus on hiring people that you would want long term relationships with.

Be Useful

You could behave in a number of different ways at any given time or interaction. What you want is to be useful and add value. There is a useful way of giving feedback to someone that reports you and a useful way to review code. Apply this to yourself and plan your day to be more useful to others. Our default instinct is to seek confirmation bias and think about how we are right. We don’t give others the same courtesy. The right approach is to reverse that default instinct: Think “how can I make this work” for other people, and “how can this be wrong” for your own ideas.

Don’t compete with your reports either. As a manager, this is particularly important because you want to grow your reports. Be aware of situations or cases where you might be competing, and default to being useful instead of pushing your own agenda.

Get the Requirements Right Early and Come Up with a Game Plan

Planning gets a bad rep in agile organizations, but it ends up being critical in the long term. Doing some planning almost always ends up being much better than no planning at all. What you want to do is plan until you have a general direction defined and start iterating towards that. There are a few questions can help getting these requirements right:

  • What are the things that we want in the near future? You want to pick the path that gives you the most options for the expected future.
  • How can this be wrong? Counteract your confirmation bias with this question by explicitly thinking about failure modes.
  • Where do you not want to go? Inversion ends up being really useful. It’s easier to avoid stupidity than seeking brilliance
  • What happens once you get there? Seek second order effects. What will one path unlock or limit?
  • What other paths could you take? Has your team settled on a local maxima instead and not the global maxima?

Once you have a decent idea of how to proceed with this, you are responsible for communicating this plan with the rest of the team too. Not getting the requirements right early on means that your team can potentially end up going in the wrong direction which ends up being a net negative.

Establish Rapport Before Getting to Work

You will be much more effective at work, if you connect with the other people before getting to work. This could mean banter, or just listening—there’s a reason why people small-talk.  Get in the circle before attempting to change the circle. This leads to numerous positive improvements in your workflow. Slack conversations will sound like conversations instead of arguments and you'll assume positive intent.  You’ll also find that getting alignment in meetings and nudging reports towards positive changes ends up being much more useful this way. Icebreakers in meetings and room for silliness helps here.

There Is No One-size-fits-all Approach to People. Personality Tests Are Good Defaults

Management is about calibration. You are calibrating  your general approach to others, while calibrating a specific approach to each person. This is really important because an approach that might work for one person won’t work on others. You might find that personality tests like the Enneagram serve as great defaults to approaching a person. Type-5, the investigators, work best when you give them autonomy and new ideas. Type-6, the loyalists, typically want frequent support and the feeling of being entrusted. The Last Dance miniseries on Netflix is a master class on this topic.

Get People to Lead with Their Strengths and Address Their Growth Areas as a Secondary Priority

There are multiple ways to approach a person's personal growth. I’ve found that what works best is first identifying their strengths and then areas of improvements. Find people’s strengths and obsessions then point them to that. You have to get people to lead with their strengths. It’s the right approach because it gives people confidence and momentum. Turns out, that’s also how they add the most value. Ideally, one should develop to be more well-rounded, so it’s also important to come up with a game plan for addressing any areas of improvements. 

Focus on the Positives and Don't Over Index on the Negatives

For whatever reasons, we tend to focus on the negatives more than we should. It might be related to "deprival super reaction syndrome" where we hate losing more than we like winning. In management, we might have the proclivity to focus on what people are doing poorly instead of what they’re doing well. People may not feel appreciated if you only focus on the negatives. I believe this also means that we end up focusing on improving low-performers more than amplifying high-performers. Amplifying high-performers may have an order of magnitude higher impact.

People Will Act Like Owners if You Give Them Control and Transparency

Be transparent and don’t do everything yourself. Talk to people and make them feel included. When people feel left out of the loop, they generally grow more anxious as they feel that they’re losing control. Your ideal case is that your reports act like owners. You can do this by being transparent about how decisions are made. You also have to give others control and autonomy. Expect some mistakes as they calibrate their judgement and nudge in the right direction instead of steamrolling them.

There are other times where you'll have to act like an owner and lead by example. One hint of this case will be when you have a nagging feeling about something that you don't want to do. Ultimately, the right thing here is to take full ownership and not ask others to do what you wouldn't want.

Have High Standards for Yourself and the People Around You

Tech markets and products are generally winner-take-all. This means that second place isn’t a viable option—winning, in tech, leads to disproportionately greater rewards. Aim to be the best in the world in your space and iterate towards that. There’s no point in doing things in a mediocre way.  Aiming high, and having high standards is what pushes everything forward.

To make this happen, you need to partner with people who expect more from you. You should also have high standards for your reports. One interesting outcome of this is that you get the positive effects of the Pygmalion effect: people will rise to your positive expectations.

Hold People Accountable

When things don't get delivered or when you see bad behavior, you have to have high standards and hold people accountable. If you don't, that's an implicit message that these bad outcomes are ok. There are many cases where not holding others accountable can have a spiraling effect.

Your approach to this has to be calibrated for your work style and the situation. Ultimately, you should enforce all deal-breaker rules. Set clear expectations early on. When something goes wrong, work with the other person or team to understand why. Was it a technical issue, does the tooling need to be improved, or was it an issue with leadership?

Bring Other People Up with You

We like working with great and ambitious people because they raise the bar for everyone else. We’re allergic to self-obsessed people who only care about their own growth. Your job as a manager is to bring other people up. Don’t take credit for work that you didn’t do and give recognition to those that did the work. What's interesting is that most people only really care about their own growth. So, being the person who actually spends time thinking about the growth of others differentiates you, making more people want to work with you.

Maintain Your Mental Health with Mindfulness, Rest, and Distance

A team's culture is top down and bottom up. This means people mimic the behavior of others in the position of authority—for better or for worse. Keeping this in mind, you have to be aware of your own actions. Generally, most people become less aware of their actions as fatigue builds up. Be mindful of your energy levels when entering meetings.  Energy and positivity is phenomenal, because it's something that you can give to others, and it doesn't cost you anything.

Stress management is another important skill to develop. Most people can manage problems with clear yes/no solutions. Trickier problems with nuances and unclear paths, or split decisions tend to bubble up. Ambiguity and conflicting signals are a source of stress for many people and treating this like a skill is really important. Dealing with stressful situations by adding more emotion generally doesn't help. Keep your cool in stressful situations.

Aim to Operate Two to Three Months Into the Future

As an engineer you typically operate in scope between days and weeks. As you expand your influence, you also have to start thinking in a greater time horizon. As a manager, your horizon will be longer than a typical engineer, but smaller than someone who focuses on high level strategy. This means that you need to project your team and reports a few months into the future and anticipate their challenges. Ideally, you can help resolve these issues before they even happen. This exercise also helps you be more proactive instead of reacting to daily events.

Give Feedback to People That You Want Long Term Relationships with

Would you give feedback to a random stranger doing something that's bad for them? Probably not. Now, imagine that you knew this person. You would try to reason with this person, and hopefully nudge them in the right direction. You give feedback to people that you care about and want a long term relationship with. I believe this is also true at work. Even if someone isn’t your report, it’s worth sharing your feedback if you can deliver it usefully.

Giving feedback is tricky since people often get defensive. There are different schools of thoughts on this, but I try to build a rapport with someone before giving them direct feedback. Once you convince someone that you are on their side, people are much more receptive to it. Get in the circle. While code review feedback is best when it's all at once, that isn't necessarily true for one-on-one feedback. Many people default to quick-feedback and I think that doesn't work for people you don't have good rapport with and that it only really works if you are in a position of authority. The shortest path is not always the path of least resistance, and so you should build rapport before getting to work.

ShipIt! Presents: Building Mental Models of Ideas That Don’t Change


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Do an In-depth Liquid Render Analysis with Theme Inspector

How to Do an In-depth Liquid Render Analysis with Theme Inspector

Shopify’s Online Store provides greater flexibility to build themes representing the brand of a merchant’s online store. However, like with any programming language, one often writes code without being aware of all the performance impact it creates. Whether it be performance impact on Shopify’s servers or observable performance impact on the browser, ultimately, it’s the customers that experience the slowness. The speed of server-side rendering is one of the most important performance timings to optimize for. That’s because customers wait on a blank screen until server-side rendering completes. Even though we’re working hard to make server-side rendering as fast as possible, bottlenecks may originate from the Liquid source itself.

In this article, I’ll look at:  

  • how to interpret flame graphs generated by the Shopify Theme Inspector
  • what kind of flame graphs generate from unoptimized Liquid code patterns
  • tips for spotting and avoiding these performance issues.

Install the Shopify Theme Inspector

With a Google Chrome browser, install the Shopify Theme Inspector extension. Follow this article on Debug Liquid Render Performance with Shopify Theme Inspector for Chrome for how to start with the extension and get to a point where you can produce a flame graph on your store.

A flame graph example
A flame graph example

The flame graph produced by this tool is a data representation of the code path and the time it took to execute. With this tool, as a developer, you can find out how long a piece of code took to render.

Start with Clean Code

We often forget what a clean implementation looks like, and this is often how we, Shopify, envision this piece of liquid code will be used—it’s often not the reality as developers will find their own ways to achieve their goals. As time passes, code becomes complicated. We need to go back to the clean implementation to understand what makes it take the time to render.

The simple code above creates a flame graph that looks like this image below:

Flame graph for a 10 item paginated collection
Flame graph for a 10 item paginated collection

The template section took 13 ms to complete rendering. Let’s have a better understanding of what we are seeing here.

Highlighted flame graph for a 10 item paginated collection
Highlighted flame graph for a 10 item paginated collection

The area where the server took the time to render is where the code for the pagination loop is executed. In this case, we rendered 10 product titles. Then there’s a block of time that seems to disappear. It‘s actually the time spent on Shopify’s side collecting all the information that belongs to the products in the paginate collection.

Look at Inefficient Code

To know what’s an inefficient code, one must know what it looks like, why it is slow, and how to recognize it in the flame graph. This section walks through a side-by-side comparison of code and it’s flame graphs, and how a simple change results in bad performance.

Heavy Loop

Let’s take that clean code example and make it heavy.

What I’ve done here is accessed attributes in a product while iterating through a collection. Here’s the corresponding flame graph:

Flame graph for a 10 item paginated collection with accessing to its attributes
Flame graph for a 10 item paginated collection with accessing to its attributes

The total render time of this loop is now at 162 ms compared to 13 ms from the clean example. The product attributes access changes a less than 1 ms render time per tile to a 16 ms render time per tile. This produces exactly the same markup as the clean example but at the cost of 16 times more rendering time. If we increase the number of products to paginate from 10 to 50, it takes 800 ms to render.


  • Instead of focusing on how many 1 ms bars there are, focus on the total rendering time of each loop iteration
  • Clean up any attributes aren’t being used
  • Reduce the number of products in a paginated page (Potentially AJAX the next page of products)
  • Simplify the functionality of the rendered product

Nested Loops

Let’s take that clean code example and make it render with nested loops.

This code snippet is a typical example of iterating through the options and variations of a product. Here’s the corresponding flame graph:

Flame graph for two nested loop example
Flame graph for two nested loop example

This code snippet is a two-level nested loop rendering at 55 ms.

Nested loops are hard to notice when just looking at code because it’s separated by files. With the flame graph, we see the flame graph start to grow deeper.

Flame graph of a single loop on a product
Flame graph of a single loop on a product

As highlighted in the above screenshot, the two inner for-loops stacks side by side. This is okay if there are only one or two loops. However, each iteration rendering time will vary based on how many inner iterations it has.

Let’s look at what a three nested loop looks like.

Flame graph for three nested loop example
Flame graph for three nested loop example

This three level nested loop rendered at 72 ms. This can get out of hand really quickly if we aren’t careful. A small addition to the code inside the loop could blow your budget on server rendering time.


  • Look for a sawtooth shaped flame graph to target potential performance problem
  • Evaluate each flame graph layer and see if the nested loops are required

Mix Usage of Multiple Global Liquid Scope

Let’s now take that clean code example and add another global scoped liquid variable.

And here’s the corresponding flame graph:

Flame graph of when there’s one item in the cart with rendering time at 45 ms
Flame graph of when there’s one item in the cart with rendering time at 45 ms

Flame graph of when there’s 10 items in the cart with rendering time at 124 ms
Flame graph of when there’s 10 items in the cart with rendering time at 124 ms

This flame graph is an example of a badly nested loop where each variation is accessing the cart items. As more items are added to the cart, the page takes longer to render.


  • Look for hair comb or sawtooth shaped flame graph to target potential performance problem
  • Compare flame graphs between one item and multiple items in cart
  • Don’t mix global liquid variable usage. If you have to, use  AJAX to fetch for cart items instead

What is Fast Enough?

Try to aim for 200 ms but no more than 500 ms total page rendering time reported by the extension. We didn’t just pick a number out of the hat. It’s made with careful consideration of what other allocation of available time during a page render that we need to include to hit a performance goal. Google Web Vitals stated that a good score for Largest Content Paint (LCP) is less than 2.5 seconds. However, the largest content paint is dependent on many other metrics like time to first byte (TTFB) and first content paint (FCP). So, let’s make some time allocation! Also, let’s understand what each metric represents:

Flow diagram: Shopify Server to Browser to FCP to LCP

  • From Shopify’s server to a browser is the network overhead time required. It varies based on the network the browser is on. For example, navigating your store on 3G or Wi-Fi.
  • From a browser blank page (TTFB) to showing anything (FCP) is the time the browser needs to read and display the page.
  • From the FCP to the LCF is the time the browser needs to get all other resources (images, css, fonts, scripts, video, … etc.) to complete the page.

The goal is an LCP < 2.5 seconds to receive a good score

Server → Browser

300 ms for network overhead

Browser → FCP

200 ms for browser to do its work


1.5 sec for above the fold image and assets to download


Which leaves us 500 ms for total page render time.

Does this mean that as long as we keep server rendering below 500 ms, we can get a good LCP score? No, there’s other considerations like critical rendering path that aren’t addressed here, but we’re at least half way there.


  • Optimizing for critical rendering path on the theme level can bring the 200 ms requirement between the browser to FCP timing down to a lower number.

So, we have 500 ms for total page render time, but this doesn’t mean you have all 500 ms to spare. There’s some mandatory server render times that are dedicated to Shopify and others that the theme dedicates to rendering global sections like the header and footer. Depending how you want to allocate the rendering resources, the available rendering time you leave yourself for the page content varies. For example:


500 ms

Shopify (content for header)

50 ms

Header (with menu)

100 ms


25 ms


25 ms

Page Content

300 ms

I mentioned trying to aim for 200 ms total page rendering time—this is a stretch goal. By keeping ourselves mindful of a goal, it’s much easier to start recognizing when performance starts to degrade.

An Invitation to the Shopify Theme Developer Community

We couldn’t possibly know every possible combination of how the world is using Shopify. So, I invite you to share your experience with Shopify’s Theme Inspector and let us know how we can improve at or tweet us at @shopifydevs.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Continue reading

How to Build an Experiment Pipeline from Scratch

How to Build an Experiment Pipeline from Scratch

One of the most compelling ways to prove the value of any decision or intervention—to technical and non-technical audiences alike—is to run an A/B test. But what if that wasn’t an option on your current stack? That’s the challenge we faced at Shopify. Our amazing team of engineers built robust capabilities for experimentation on our web properties and Shopify admin experiences, but testing external channels like email was unexplored. When it came time to ship a new recommendation algorithm that generates personalized blog post suggestions, we had no way to measure its incremental benefit against the generic blog emails.

To address the problem I built an email experimentation pipeline from the ground up. This quick build helps the Marketing Data Science team solve challenges around experimentation for external channels, and it’s in use by various data teams across Shopify. Below is a guide that teaches you how to implement a similar pipeline with a relatively simple setup from scratch. 

The Problem

Experimentation is one of the most valuable tools for data teams, providing a systematic proof of concept mechanism for interface tweaks, product variations, and changes to the user experience. With our existing experimentation framework Verdict, we can randomize shops, as well as sessions for web properties that exist before the login gate. However, this didn’t extend to email experiments since the randomization isn’t triggered by a site visit and the intervention is in the user’s inbox, not our platform.

As a result, data scientists randomized emails themselves, shipped the experiment, and stored the results in a local text file. This was problematic for a number of reasons: 

  1. Local storage isn’t discoverable and can be lost or deleted. 
  2. The ad hoc randomization didn’t account for users that unsubscribed from our mailing list and didn’t resolve the many-to-many relationship of emails to shops, creating the risk for cross-contamination between the variations. Some shops have multiple staff each with an email address, and some people create multiple shops under the same email address.
  3. Two marketers can simultaneously test the same audience with no exclusion criteria, violating the assumption that all other variables are controlled. 

Toward the end of 2019, when email experimentation became even more popular among marketers as the marketing department grew at Shopify, it became clear that a solution was overdue and necessary.

Before You Start

There are few things I find more mortifying than shipping code just to ship more code to fix your old code, and I’m no stranger to this. My pull requests (PRs) were rigorously reviewed, but myself and the reviewers were in uncharted territory. Exhibit A: a selection of my failed attempts at building a pipeline through trial and error: 

Github PR montage, showing a series of bug fixes.
Github PR montage, showing a series of bug fixes

All that to say that requirements gathering isn’t fun, but it’s necessary. Here are some steps I’d recommend before you start.

1. Understanding the Problem

The basic goal is to create a pipeline that can pull a group constrained by eligibility criteria, randomly assign each person to one of many variations, and disseminate the randomized groups to the necessary endpoints to execute the experiment. The ideal output is repeatable and usable across many data teams. 

We define the problem as: given a list of visitors, we want to randomize so that each person is limited to one experiment at a time, and the experiment subjects can be fairly split among data scientists who want to test on a portion of the visitor pool. At this point, we won’t outline the how, we’re just trying to understand the what.

2. Draw a System Diagram

Get the lay of the land with a high-level map of how your solution will interact with its environment. It’s important to be abstract to prevent prescribing a solution; the goal is to understand the inputs and outputs of the system. This is what mine looked like:

Example of a system diagram for email experiment pipeline
Example of a system diagram for email experiment pipeline

In our case, the data come from two sources: our data warehouse and our email platform.

In a much simpler setup—say, with no ETL at all—you can replace the inputs in this diagram with locally-stored CSVs and the experiment pipeline can be a Jupyter notebook. Whatever your stack may be, this diagram is a great starting point.

3. Plan the Ideal Output

I anticipated the implementation portion to be complicated, so I started by whiteboarding my ideal production table and reverse-engineered the necessary steps. Some of the immediate decisions that arose as part of this exercise were:

  1. Choosing the grain of the table: subjects will get randomized at the shop grain, but the experience of the experiment variation is surfaced with the primary email associated with that shop.
  2. Considering necessary resolvers: each experiment is measured on its own success metric, meaning the experiment output table needs to be joined to other tables in our database.
  3. Compatibility with existing analysis framework: I didn’t want to reinvent the wheel; we already have an amazing experiment dashboard system, which can be leveraged if my output is designed with that in mind.

I built a table with one row per email, per shop, per experiment, and with some additional attributes detailing the timing and theme of the experiment. Once I had a rough idea of this ideal output, I created a mock version of an experiment with some dummy data in a CSV file that I uploaded as a temporary table in our data warehouse. With this, I brainstormed some common use cases and experiment KPIs and attempted to query my fake table. This allowed me to identify pitfalls of my first iteration; for example, I realized that in my first draft that my keys wouldn’t be compatible with the email engagement data we get from the email platform API, which is the platform that sends our emails.

I sat with some stakeholders that included my teammates, members of the experimentation team, and non-technical members of the marketing organization. I did a guided exercise where I asked them to query my temporary table and question whether the structure can support the analysis required for their last few email experiments. In these conversations, we nailed down several requirements: 

  • Exclude subjects from other experiments: all subjects in a current experiment should be excluded from other experiments for a minimum of 30 days, but the tool should support an override for longer exclusion periods for testing high risk variations, such as product pricing.
  • Identify missing experiment category tags: the version of the output table I had was missing the experiment category tags (ex. research, promotional, etc) which is helpful for experiment discoverability.
  • Exclude linked shops: if an email was linked to multiple shops that qualified for the same experiment, all shops linked to that email should be excluded altogether.
  • Enable on-going randomization of experiments: the new pipeline should allow experiments to randomize on an ongoing basis, assigning new users as they qualify over time (as opposed to a one-time batch randomization).
  • Backfill past experiments into the pipeline: all past email experiments needed to be backfilled into the pipeline, and if a data scientist inadvertently bypassed this new tool, the pipeline needs to support a way to backfill these experiments as well. 

After a few iterations and with stakeholders’ blessing, I was ready to move to technical planning.

4. Technical Planning

At Shopify, all major projects are drafted in a technical document that’s peer-reviewed by a panel of data scientists and relevant stakeholders. My document included the ideal output and system requirements I’d gathered in the planning phase, as well as expected use cases and sample queries. I also had to draw a blueprint for how I planned to structure the implementation in our ETL platform. After chatting with my lead and discussions on Slack, I decided to build the pipeline in three stages, demonstrated by the diagram below.

Proposed high-level ETL structure for the randomization pipeline
Proposed high-level ETL structure for the randomization pipeline

Data scientists may need to ship experiments simultaneously; therefore for the first phase, I needed to create an experiment definition file that defines the criteria for candidates in the form of a SQL query. For example, you may want to limit a promotional offer to shops that have been on the platform for at least a year, and only in a particular region. This also allows you to tag your experiment with the necessary categories and specify a maximum sample size, if applicable. All experiment definition files are validated on an output contract as they need to be in agreement to be unioned in the next phase.

Phase two contains a many-to-one transform stage that consolidates all incoming experiments into a single output. If an experiment produces randomizations over time, it continues to append new rows incrementally. 

In phase three, the table is filtered down in many ways. First, users that have been chosen for multiple experiments are only included in the first experiment to avoid cross-contamination of controlled variables. Additionally, users with multiple shops within the same experiment are excluded altogether. This is done by deduping a list at the email grain with a lookup at the shop grain. Finally, the job adds features such as date of randomization and indicators for whether the experiment included a promotional offer.

With this blueprint in hand, I scheduled a technical design review session and pitched my game plan to a panel of my peers. They challenged potential pitfalls, provided engineering feedback, and ultimately approved the decision to move into build.

5. Building the Pipeline

Given the detailed planning, the build phase follows as the incremental implementation of the steps described above. I built the jobs in PySpark and shipped in increments, small enough increments to be consumable by code reviewers since all of the PRs totalled several thousand lines of code.

6. Ship, Ship, Ship! 

Once all PRs were shipped into production, the tool was ready to use. I documented its use in our internal data handbook and shared it with the experimentation team. Over the next few weeks, we successfully shipped several email experiments using the pipeline, which allowed me to work out small kinks in the implementation as well. 

The biggest mistake I made in the shipping process is that I didn’t  share the tool enough across the data organization. I found that many data scientists didn’t know the tool existed, and continued to use local files as their workaround solution. Well, better late than never, I did a more thorough job of sending team-wide emails, posting in relevant Slack channels, and setting up GitHub alerts to notify me when other contributors edit experiment files.

As a result, the tool has not only been used by the Marketing Data Science, but across the data organization by teams that focus on shipping, retail and international growth, to ship email experiments for the past year. The table produced by this pipeline integrated seamlessly with our existing analysis framework, so no additional work was required to see statistical results once an experiment is defined.

Key Takeaways

To quickly summarize, the most important takeaways are:

  1. Don’t skip out on requirement gathering! Understand the problem you’re trying to solve, create a high-level map of how your solution will interact with its environment, and plan your ideal output before you start.
  2. Draft your project blueprint in a technical document and get it peer-reviewed before you build.
  3. When building the pipeline, keep PRs smaller where possible, so that reviewers can focus on detailed design recommendations and so production failures are easier to debug.
  4. Once shipped, make sure you share effectively across your organization.

Overall, this project was a great lesson that a simple solution, built with care for engineering design, can quickly solve for the long-term. In the absence of a pre-existing A/B testing framework, this type of project is a quick and resourceful way to unlock experimentation for any data science team with very few requirements from the data stack.

Are you passionate about experiments and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

Continue reading

Images as Code: Representing Localized and Evolving Products on Marketing Pages

Images as Code: Representing Localized and Evolving Products on Marketing Pages

Last year, our marketing team kicked off a large effort based on user research to revamp to better serve the needs of our site visitors. We recognized that visitors wanted to see screenshots and visuals of the product itself, however we found that most of the screenshots across our website were outdated.

The image was used to showcase our Shopify POS software on, however it misrepresented our product when our POS software was updated and rebranded.
Old Shopify POS software on

The above image was used to showcase our Shopify POS software on, however it misrepresented our product when our POS software was updated and rebranded. 

While we first experimented with a Scalable Vector Graphics (SVG) based solution to visuals, we found that it wouldn’t scale and forced us to restrict usage to only “high-value” pages. Still, other teams expressed interest in this approach, so we recreated these in HTML and JavaScript (JS) and compared the lift between them. The biggest question was around getting these to resize in a given container—with SVG all content, including text size, grows and shrinks proportionally with a width of 100%, appearing as an image to users. With CSS there’s no way to get font sizes to scale proportionally to a container, only the window. We created a solution that resizes all the contents of the element at the same rate in response to container size, and reused it to create a better

The Design Challenge

We wanted to create new visuals of our product that needed to be available and translated across more than 35 different localized domains. Many domains support different currencies, features, and languages. Re-capturing screenshots on each domain to keep in sync with all our product changes is extremely inefficient.

Screenshots of our product were simplified in order to highlight features relevant to the page or section.
Screenshots of our product were simplified in order to highlight features relevant to the page or section.

After a number of iterations and as part of a collaborative effort outlined in more detail by Robyn Larsen on our UX blog, our design team came up with simplified representations of our user interface, UI Illustrations as we called them, for the parts of the product that we wanted to showcase. This was a clever solution to drive user focus to the parts of the product that we’re highlighting in each situation, however it required that someone maintain translations and versions of the product as separate image assets. We had an automated process for updating translations in our code but not in the design editor. 

What Didn’t Work: The SVG Approach

As an experimental solution, we attempted to export these visuals as SVG code and added those SVGs inline in our HTML. Then we’d replace the text and numbers with translated and localized text.

SVGs don’t support word wrapping so visuals with long translations would look broken.
SVGs don’t support word wrapping so visuals with long translations would look broken.

Exported SVGs were cool, they actually worked to accomplish what we had set out to do, but they had a bunch of drawbacks. Certain effects like gaussian blur caused performance issues in Firefox, and SVG text doesn’t wrap when reaching a max-width like HTML can. This resulted in some very broken looking visuals (see above). Languages with longer word lengths, like German, had overflowing text. In addition, SVG export settings in our design tool needed to be consistent for every developer to avoid massive changes to the whole SVG structure every time someone else exported the same visual. Even with a consistent export process, the developer would have to go through the whole process of swapping out text with our own hooks for translated content again. It was a huge mess. We were writing a lot of documentation just to create consistency in the process, and new challenges kept popping up when new settings in Sketch were used. It felt like we had just replaced one arduous process with another.

Our strategy of using SVGs for these visuals was quickly becoming unmanageable, and that was just with a few simple visuals. A month in, we still saw a lot of value in creating visuals as code, but needed to find a better approach.

Our Solution: The HTML/JavaScript Approach

After toying around with using JS to resize the content, we ended up with a utility we call ScaleContentAsImage. It calculates how much size is available for the visual and then resizes it to fit in that space. Let’s break down the steps required to do this.

Starting with A Simple Class

We start by creating a simple class that accepts a reference to the element in the DOM that we want to scale, and initialize it by storing the computed width of the element in memory. This assumes that we assigned the element a fixed pixel width somewhere in our code already (this fixed pixel width matches the width of the visual in the design file). Then we override the width of the element to 100% so that it can fill the space available to it.I’ve purposely separated the initialization sequence into a separate method from the constructor. While not demonstrated in this post, that separation allows us to add lazy loading or conditional loading to save on performance.

Creating an Element That We Can Transform

Next we’ll need to create an element that will scale as needed using a CSS transform. We assign that wrapper the fixed width that its parent used to have. Note that we haven’t actually added this element anywhere in the DOM yet.

Moving the Content Over

We transfer all the contents of the visual out from where it is now and into the wrapper we just created, and put that back into the parent element. This method preserves any event bindings (such as lazy load listeners) that previously were bound to these elements. At this point the content might overflow the container, but we’ll apply the transform to resolve that.

Applying the Transformation

Now, we determine how much the wrapper should scale the contents by and apply that property. For example, if the visual was designed to be 200px wide but it’s rendered in an element that’s 100px wide, the wrapper would be assigned transform: scale(0.5);.

Preserving Space in the Page

A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right. The image is of the Shopify admin, with the Shopify logo, a search bar, and a user avatar next to “Helen B.” on the top of the screen. Below in a grid are summaries of emails delivered, purchase totals, and total growth represented in graphs.
A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right.

So now our visual is resizing correctly, however our page layout is now looking all wonky. Our text content and the visual are meant to display as equal width side-by-side, like the above.

So why does the page look like this? The colored highlight shows what’s happening with our CSS transform.

A screenshot of a webpage in a desktop viewport where text is pushed to the very left of the screen taking up one sixth of the width. The image of the Shopify admin to the right is only one third of the screen wide. The entire right half of the page is empty. The image is highlighted in blue, however a larger green box is also highlighted in the same position taking up more of the empty space but matching the same dimensions of the image. A diagonal line from the bottom right corner of the larger green box to the bottom right corner of the highlighted image hints at a relationship between both boxes.
A screenshot of a webpage in a desktop viewport after CSS transform.

CSS transforms don’t change the content flow, so even though our visual size is reduced correctly, the element still takes up its fixed width. We add some additional logic here to fix this problem by adding an empty element that takes up the correct amount of vertical space. Unless the visual contains images that are lazy loaded, or animates vertically in some way, we only need to make this calculation once since a simple CSS trick to maintain an aspect ratio will work just fine.

Removing the Transformed Element from the Document Flow

We also need to set the transformed element to be absolutely positioned, so that it doesn’t affect the document flow.

A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right. The image is of the Shopify admin, with the Shopify logo, a search bar, and a user avatar next to “Helen B.” on the top of the screen. Below in a grid are summaries of emails delivered, purchase totals, and total growth represented in graphs.
A screenshot of a webpage in a desktop view with text content aligned to the left and an image on the right.

Binding to Resize

Success! Looks good! Now we just add a bit of logic to update our calculations if the window is resized.

Finishing theCode

Our class is now complete.

Looking at the Detailed Changes in the DOM

1. Before JS is initialized, in this example the container width is 378px and the assigned width of the element is 757px. The available space is about 50% of the original size of the visual.

A screenshot of a page open in the browser with developer tools open. In the page, a UI Illustration is shown side-by-side with some text, and the highlighted element in the inspector matches that as described above. The override for the container width can be seen in the style inspector. In addition, the described element is assigned a property of “aria-hidden: true”, and its container is a `div` with `role: “img”` and an aria-label describing the visual as “View of the Shopify admin showing emails delivered, purchase totals, and total growth represented in graphs”
A screenshot of a page open in the browser with developer tools open

2. As seen in our HTML post-initialization, in JS we have overridden the size of the container to be 100%

3. We’ve also moved all the content of the visual inside of a new element that we created, to which we apply a scale of 0.5 (based on the 50% calculated in step 1).

4. We absolutely position the element that we scaled so that it doesn’t disturb the document flow.

5. We added a placeholder element to preserve the correct amount of space in the document flow.

A Solution for a React Project

For a project using React, the same thing is accomplished without any of the logic we wrote to create, move, or update the DOM elements. The result is a much simpler snippet that only needs to worry about determining how much space is available within its container. A project using CSS-in-JS benefits in that the fixed width is directly passed into the element.

Problems with Localization and Currency

Shopify Order Summary Page
Shopify admin order summary page

An interesting problem we ran into was displaying prices in local currencies for fictional products. For instance, we started off with a visual of a product, a checkout button, and an order summary. Shown in the order summary were two chairs, each priced at ~$200, which were made-up prices and products for demonstrative purposes only.

It didn’t occur to us that 200 Japanese Yen is the equivalent of under $1.89 USD (today), so when we just swapped the currency symbol the visual of the chair did not realistically match the price. We ended up creating a table of currency conversion rates pulled on that day. We don’t update those conversion values on a regular basis, since we don’t need accurate rates for our invented prices. We’re ok with fluctuations, even large ones, as long as the numbers look reasonable in context. We obviously don’t take this approach with real products and prices.

Comparing Approaches: SVG vs HTML/JavaScript

The HTML/JS approach took some time to build upfront, but its advantages clearly outweighed the developer lift required even from the start. The UI Illustrations were fairly quick to build out given how simply and consistently they were designed. We started finding that other projects were reusing and creating their own visuals using the same approach. We created a comparison chart between approaches, evaluating major considerations and the support for these between the two.

Text Support

While SVG resizes text automatically and in proportion to the visual resizing, it didn’t support word wrap which is available in HTML

Implementation and Maintenance

HTML/JS had a lot going for compared to the SVG approach when it came to implementation and maintenance. Using HTML and JS would mean that developers don’t need to have technical knowledge of SVGs, and they code these visuals with the help of our existing components. Code is easy to parse and tested using our existing testing framework. From an implementation standpoint, the only thing that SVG really had going for it was that it usually resulted in fewer lines of code, since styles are inline and elements are absolutely positioned relative to each other. That in itself isn’t reason to choose a less maintainable and human-readable solution.


While both would support animations—something we may want to add in the future—an HTML/JS approach allows us to easily use our existing play/pause buttons to control these animations.


The SVG approach works with JS disabled, however it’s less performant and caused a lot of jankiness on the page when certain properties like shadows were applied to it


Design is where HTML/JS really stood out against SVG. With our original SVG approach, designers needed to follow a specific process and use a specific design tool that worked with that process. For example, we started requiring that shadows applied to elements were consistent in order to prevent multiple versions of Gaussian Blur from being added to the page and creating jankiness. It also required our designers to design in a way that text would never break onto a new line because of the lack of support for word wrapping. Without introducing SVG, none of these concerns applied and designers had more flexibility to use any tools they wanted to build freely.

Documentation and Ramp-up

HTML/JS was a clear winner , as we did away with all of the documentation describing the SVG export process, design guidelines, and quirks we discovered. With HTML, all we’d need to document that wouldn’t apply to SVGs is how to apply the resize functionality to the content.

Scaling Our Solution

We started off with a set of named visuals, and designed our system around a single component that accepted a name (for example “Shopify admin dashboard” or “POS software”) and rendered the desired visual. We thought that having a single entry point would help us better track each visual and restrict us to a small, maintainable set of UI Illustrations. That single component was tested and documented and for each new visual we added respective tests and documentation.

We worried about overuse given that each UI Illustration needed to be maintained by a developer. But with this system, a good portion of that development effort ended up being the education of the structure, maintenance of documentation, and tests for basic HTML markup that’s only used in one place. We’ve since provided a more generic container that can be used to wrap any block of HTML for initialization with our ScaleContentLikeImage module and provides a consistent implementation of descriptive text for screen readers.

The Future of UI Illustrations

ScaleContentLikeImage and its application for our UI Illustrations is a powerful tool for our team to highlight our product in a very intentional and relevant way for our users. Jen Taylor dives deeper into our UX considerations and user-focused approach to UI Illustrations on the Shopify UX Blog. There are still performance and structural wins to be had, specifically around how we recalculate sizing for lazy loaded images, and how we document existing visuals for reuse. However, until there’s a CSS-only solution to handle this use case our HTML/JS approach seems to be the cleanest. Looking to the future, this could be an excellent application to explore with CSS Houdini once the layout API is made available (it’s not yet supported in any major browser).

Based on Anton Dosov’s CSS Houdini with Layout API demo, I can imagine a scenario where we can create a custom layout renderer and then apply this logic with a few lines of CSS.

We’ve all learned a lot in this process and like any system, its long term success relies on our team’s collaborative relationship in order to keep evolving and growing in a maintainable, scalable way. At Shopify one of our core values is to thrive on change, and this project certainly has done so.

If sounds this sounds like the kind of projects you want to be a part of please check out our open positions.

Continue reading

How to Use Quasi-experiments and Counterfactuals to Build Great Products

How to Use Quasi-experiments and Counterfactuals to Build Great Products

Descriptive statistics and correlations are every data scientists’ bread and butter, but they often come with the caveat that correlation isn’t causation. At Shopify, we believe that understanding causality is the key to unlocking maximum business value. We aim to identify insights that actually indicate why we see things in the data, since causal insights can validate (or invalidate) entire business strategies. Below I’ll discuss different causal inference methods and how to use them to build great products.

The Causal Inference “Levels of Evidence Ladder”

A data scientist can use various different methods to estimate the causal effects of a factor. The “levels of evidence ladder” is a great mental model that introduces the ideas of causal inference.

Levels of evidence ladder. First level (clearest evidence): A/B tests (a.k.a statistical experiments). Second level (reasonable level of evidence): Quasi-experiments (including Difference-in-differences, matching, controlled regression). Third level (weakest level of evidence): Full estimation of counterfactuals. Bottom of the chart: descriptive statistics—provides no direct evidence for causal relationship.
Levels of evidence ladder. First level (clearest evidence): A/B tests (a.k.a statistical experiments). Second level (reasonable level of evidence): Quasi-experiments (including difference-in-differences, matching, controlled regression). Third level (weakest level of evidence): Full estimation of counterfactuals. Bottom of the chart: descriptive statistics—provides no direct evidence for causal relationship.

The ladder isn’t a ranking of methods, instead it’s a loose indication of the level of proof each method will give you. The higher the method is on the ladder, the easier it is to compute estimates that constitute evidence of a strong causal relationship. Methods at the top of the ladder typically (but not always) require more focus on the experimentation setup. On the other end, methods at the bottom of the ladder use more observational data, but require more focus on robustness checks (more on this later).

The ladder does a good job of explaining that there is no free lunch in causal inference. To get a powerful causal analysis you either need a good experimental setup, or a good statistician and a lot of work. It’s also simple to follow. I’ve recently started sharing this model with my non-data stakeholders. Using it to illustrate your work process is a great way to get buy-in from your collaborators and stakeholders.

Causal Inference Methods

A/B tests

A/B tests, or randomized controlled trials, are the gold standard method for causal inference—they’re on rung one of the levels of evidence ladder! For A/B tests, group A and group B are randomly assigned. The environment both groups are placed in is identical except for one parameter: the treatment. Randomness ensures that both groups are clones “on average”. This enables you to deduce causal estimates from A/B tests, because the only way they differ is the treatment. Of course in practice, lots of caveats apply! For example, one of the frequent gotchas of A/B testing is when the units in your treatment and control groups self-select to participate in your experiment.

Setting up an A/B test for products is a lot of work. If you’re starting from scratch, you’ll need

  • A way to randomly assign units to the right group as they use your product.
  • A tracking mechanism to collect the data for all relevant metrics.
  • To analyze these metrics and their associated statistics to compute effect sizes and validate the causal effects you suspect.

And that only covers the basics! Sometimes you’ll need much more to be able to detect the right signals. At Shopify, we have the luxury of an experiments platform that does all the heavy work and allows data scientists to start experiments with just a few clicks.


Sometimes it’s just not possible to set up an experiment. Here are a few reasons why A/B tests won’t work in every situation:

  • Lack of tooling. For example, if your code can’t be modified in certain parts of the product.
  • Lack of time to implement the experiment.
  • Ethical concerns  for example, at Shopify, randomly leaving some merchants out of a new feature that could help them with their business is sometimes not an option).
  • Just plain oversight (for example, a request to study the data from a launch that happened in the past).

Fortunately, if you find yourself in one of the above situations, there are methods that exist  which enable you to obtain causal estimates.

A quasi-experiment (rung two) is an experiment where your treatment and control group are divided by a natural process that isn’t truly random, but are considered close enough to compute estimates. Quasi-experiments frequently occur in product companies, for example, when a feature rollout happens at different dates in different countries, or if eligibility for a new feature is dependent on the behaviour of other features (like in the case of a deprecation). In order to compute causal estimates when the control group is divided using a non-random criterion, you’ll use different methods that correspond to different assumptions on how “close” you are to the random situation.

I’d like to highlight two of the methods we use at Shopify. The first is linear regression with fixed effects. In this method, the assumption is that we’ve collected data on all factors that divide individuals between treatment and control group. If that is true, then a simple linear regression on the metric of interest, controlling for these factors, gives a good estimate of the causal effect of being in the treatment group.

The parallel trends assumption for differences-in-differences. In the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is a constant. Plotting both lines in a temporal graph like this can help check the validity of the assumption. Credits to Youcef Msaid.

The parallel trends assumption for differences-in-differences. In the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is a constant. Plotting both lines in a temporal graph like this can help check the validity of the assumption. Credits to Youcef Msaid.

The second is also a very popular method in causal inference: difference in difference. For this method to be applicable, you have to find a control group that shows a trend that’s parallel to your treatment group for the metric of interest, prior to any treatment being applied. Then, after treatment happens, you assume the break in the parallel trend is only due to the treatment itself. This is summed up in the above diagram.


Finally, there will be cases when you’ll want to try to detect causal factors from data that only consists of observations of the treatment. A classic example in tech is estimating the effect of a new feature that was released to all the user base at once: no A/B test was done and there’s absolutely no one that could be the control group. In this case, you can try making a counterfactual estimation (rung three).

The idea behind counterfactual estimation is to create a model that  allows you to compute a counterfactual control group. In other words, you estimate what would happen had this feature not existed. It isn’t always simple to compute an estimate. However, if you have a model of your users that you’re confident about, then you have enough material to start doing counterfactual causal analyses!

Example of time series counterfactual vs. observed data
Example of time series counterfactual vs. observed data

A good way to explain counterfactuals is with an example. A few months ago, my team faced a situation where we needed to assess the impact of a security update. The security update was important and it was rolled out to everyone, however it introduced friction for users. We wanted to see if this added friction caused a decrease in usage. Of course, we had no way of finding a control group among our users.

With no control group, we created a time-series model to get a robust counterfactual estimation of usage of the updated feature. We trained the model on data such as usage of other features not impacted by the security update and global trends describing the overall level of activity on Shopify. All of these variables were independent from the security update we were studying. When we compared our model’s prediction to actuals, we found that there was no lift. This was a great null result which showed that the new security feature did not negatively affect usage.

When using counterfactual methods, the quality of your prediction is key. If a confounding factor that’s independent from your newest rollout varies, you don’t want to attribute this change to your feature. For example, if you have a model that predicts daily usage of a certain feature, and a competitor launches a similar feature right after yours, your model won’t be able to account for this new factor. Domain expertise and rigorous testing are the best tools to do counterfactual causal inference. Let’s dive into that a bit more.

The Importance of Robustness

While quasi-experiments and counterfactuals are great methods when you can’t perform a full randomization, these methods come at a cost! The tradeoff is that it’s much harder to compute sensible confidence intervals, and you’ll generally have to deal with a lot more uncertainty—false positives are frequent. The key to avoiding falling into traps is robustness checks.

Robustness really isn't that complicated. It just means clearly stating assumptions your methods and data rely on, and gradually relaxing each of them to see if your results still hold. It acts as an efficient coherence check if you realize your findings can dramatically change due to a single variable, especially if that variable is subject to noise, error measurement, etc.

Direct Acyclic Graphs (DAGs) are a great tool for checking robustness. They help you clearly spell out assumptions and hypotheses in the context of causal inference. Popularized by the famous computer scientist, Judea Pearl, DAGs have gained a lot of traction recently in tech and academic circles.

At Shopify, we’re really fond of DAGs. We often use Dagitty, a handy browser-based tool. In a nutshell, when you draw an assumed chain of causal events in Dagitty, it provides you with robustness checks on your data, like certain conditional correlations that should vanish. I recommend you explore the tool

The Three Most Important Points About Causal Inference

Let’s quickly recap the most important points regarding causal inference:

  • A/B tests are awesome and should be a go to tool in every data science team’s toolbox.
  • However, it’s not always possible to set up an A/B test. Instead, look for natural experiments to replace true experiments. 
  • If no natural experiment can be found, counterfactual methods can be useful. However, you shouldn’t expect to be able to detect very weak signals using these methods. 

I love causal inference applications for business and I think there is a huge untapped potential in the industry. Just like generalizing A/B tests lead to building a very successful “Experimentation Culture” since the end of the 1990s, I hope the 2020s and beyond will be an era of the “Causal Culture” as a whole! I hope sharing how we do it at Shopify will help. If any of this sounds interesting to you, we’re looking for talented data scientists to join our team.

Continue reading

Enforcing Modularity in Rails Apps with Packwerk

Enforcing Modularity in Rails Apps with Packwerk

On September 30, 2020 we held ShipIt! presents: Packwerk by Shopify. A video for the event is now available for you to learn more about our latest open source tool for creating packages with enforced boundaries in Rails apps. Click here to watch the video.

The Shopify core codebase is large, complex, and growing by the day. To better understand these complex systems, we use software architecture to create structural boundaries. Ruby doesn't come with a lot of boundary enforcements out of the box. Ruby on Rails only provides a very basic layering structure, so it's hard to scale the application without any solid pattern for boundary enforcement. In comparison, other languages and frameworks have built-in mechanisms for vertical boundaries, like Elixir’s umbrella projects.

As Shopify grows, it’s crucial we establish a new architecture pattern so large scale domains within the monolith can interact with each other through well-defined boundaries, and in turn, increase developer productivity and happiness. 

So, we created an open source tool to build a package system that can be used to guide and enforce boundaries in large scale Rails applications. Packwerk is a static analysis tool used to enforce boundaries between groups of Ruby files we call packages.

High Cohesion and Low Coupling In Code

Ideally, we want to work on a codebase that feels small. One way to make a large codebase feel small is for it to have high cohesion and low coupling.

Cohesion refers to the measure of how much elements in a module or class belong together. For example, functional cohesion is when code is grouped together in a module because they all contribute to one single task. Code that is related changes together and therefore should be placed together.

On the other hand, coupling refers to the level of dependency between modules or classes. Elements that are independent of each other should also be independent in location of implementation. When a certain domain of code has a long list of dependencies of unrelated domains, there’s no separation of boundaries. 

Boundaries are barriers between code. An example of a code boundary is to have a separate repository and service. For the code to work together in this case, network calls have to be made. In our case, a code boundary refers to different domains of concern within the same codebase.

With that, there are two types of boundaries we’d like to enforce within our applications—dependency and privacy. A class can have a list of dependencies of constants from other classes. We want an intentional and ideally small list of dependencies for a group of relevant code. Classes shouldn’t rely on other classes that aren’t considered their dependencies. Privacy boundaries are violated when there’s external use of private constants in your module. Instead, external references should be made to public constants, where a public API is established.

A Common Problem with Large Rails Applications

If there are no code boundaries in the monolith, developers find it harder to make changes in their respective areas. You may remember making a straightforward change that shockingly resulted in the breaking of unrelated tests in a different part of the codebase, or digging around a codebase to find a class or module with more than 2,000 lines of code. 

Without any established code boundaries, we end up with anti-patterns such as spaghetti code and large classes that know too much. As a codebase with low cohesion and high coupling grows, it becomes harder to develop, maintain, and understand. Eventually, it’s hard to implement new features, scale and grow. This is frustrating to developers working on the codebase. Developer happiness and productivity when working on our codebase is important to Shopify.

Rails Is Like an Open-concept Living Space

Let’s think of a large Rails application as a living space within a house without any walls. An open-concept living space is like a codebase without architectural boundaries. In an effort to separate concerns of different types of living spaces, you can arrange the furniture in a strategic manner to indicate boundaries. This is exactly what we did with the componentization efforts in 2017. We moved code that made sense together into folders we call components. Each of the component folders at Shopify represent domains of commerce, such as orders and checkout.

In our open-concept analogy, imagine having a bathroom without walls—it’s clear where the bathroom is supposed to be, but we would like it to be separate from other living spaces with a wall. The componentization effort was a great first step towards modularity for the great Shopify monolith, but we are still far from a modular codebase—we need walls. Cross-component calls are still being made, and Active Record models are shared across domains. There’s no wall imposing those boundaries, just an agreed upon social contract that can be easily broken.

Boundary Enforcing Solutions We Researched

The goal is to find a solution for boundary enforcement. The Ruby we all know and love doesn't come with boundary enforcements out of the box. It allows specifying visibility on the class level only and loads all dependencies into the global namespace. There’s no differences between direct and indirect dependencies.

There are some existing ways of potentially enforcing boundaries in Ruby. We explored a combination of solutions: using the private_constant keyword to set private constants, creating gems to set boundaries, using tests to prevent cross-boundary associations, and testing out external gems such as Modulation.

Setting Private Constants

The private_constant keyword is a built-in Ruby method to make a constant private so it cannot be accessed outside of its namespace. A constant’s namespace is the modules or classes where it’s nested and defined. In other words, using private_constant provides visibility semantics for constants on a namespace level, which is desirable. We want to establish public and private constants for a class or a group of classes.

However, there are drawbacks of using the private_constant method of privacy enforcement. If a constant is privatized after it has been defined, the first reference to it will not be checked. It is therefore not a reliable method to use.

There’s no trivial way to tell if there’s a boundary violation using private_constants. When declaring a constant private to your class, it is hard to determine if the use of the constant is getting bypassed or used appropriately. Plus, this is just a solution for privacy issues and not dependency.

Overall, only using private_constant is insufficient to enforce boundaries across large domains. We want a tool that is flexible and can integrate into our current workflow. 

Establishing Boundaries Through Gems

The other method of creating a modular Rails application is through gems. Ruby gems are used to distribute and share Ruby libraries between Rails applications. People may place relevant code into an internal gem, separating concerns from the main application. The gem may also eventually be extracted from the application with little to no complications.

Gems provide a list of dependencies through the gemspec which is something we wanted, but we also wanted the list of dependencies to be enforced in some way. Our primary concern was that gems don't have visibility semantics. Gems make transitive dependencies available in the same way as direct dependencies in the application. The main application can use any dependency within the internal gem as it would its own dependency. Again, this doesn't help us with boundary enforcement.

We want a solution where we’re able to still group code that’s relevant together, but only expose certain parts of that group of code as public API. In other words, we want to control and enforce the privacy and dependency boundaries for a group of code—something we can’t do with Ruby gems.

Using Tests to Prevent Cross-component Associations

We added a test case that rejects any PRs that introduce Active Record associations across components, which is a pattern we’re trying to avoid. However, this solution is insufficient for several reasons. The test doesn’t account for the direction of the dependency. It also isn’t a complete test. It doesn’t cover use cases of Active Record objects that aren’t associations and generally doesn’t cover anything that isn’t Active Record.

The test was good enforcement, but lacked several key features. We wanted a solution that determined the direction of dependencies and accounted for different types of Active Record associations. Nonetheless, the test case still exists in our codebase as we still found it helpful in triggering developer thought and discussions to whether or not an association between components is truly needed.

Using the Modulation Ruby Gem

Modulation is a Ruby gem for file-level dependency management within the Ruby application that was experimental at the time of our exploration. Modulation works by overriding the default Ruby code loading, which is concerning, as we’d have to replace the whole autoloading system in our Rails application. The level of complexity added to the code and runtime application behaviour is because dependency introspection performed at runtime.

There are obvious risks that come with modifying how our monolith works for an experiment. If we went with Modulation as a solution and had to change our minds, we’d likely have to revert changes to hundreds of files, which is impractical in a production codebase. Plus, the gem works at file-level granularity which is too fine for the scale we were trying to solve.

Creating Microservices?

The idea of extracting components from the core monolith into microservices in order to create code boundaries is often brought up at Shopify. In our monolith’s case, creating more services in an attempt to decouple code is solving code design problems the wrong way.

Distributing code over multiple machines is a topology change, not an architectural change. If we try to extract components from our core codebase into separate services, we introduce the added concern of networked communication and create a distributed system. A poorly designed API within the monolith will still be a poorly designed API within a service, but now with additional complexities. These complexities can come in forms such as stateless network boundary and serialisation between the systems, and reliability issues with networked communications. Microservices are a great solution when the service is isolated and unique enough to reason the tradeoff of the network boundary and complexities that come with it.

The Shopify core codebase still stands as a majestic modular monolith, with all the code broken up into components and living in a singular codebase. Now, our goal is to advance our application’s modularity to the next step—by having clear and enforced boundaries.

Packwerk: Creating Our Own Solution

Taking our learnings from the exploration phase for the project, we created Packwerk. There are two violations that Packwerk enforces: dependency and privacy. Dependency violations occur when a package references a private constant from a package that hasn’t been declared as a dependency. Privacy violations occur when an external constant references a package’s private constants. However, constants within the public folder, app/public, can be accessed and won't be a violation.

How Packwerk Works 

Packwerk parses and resolves constants in the application statically with the help of an open-sourced Shopify Ruby gem called ConstantResolver. ConstantResolver uses the same assumptions as Zeitwerk, the Rails code loader, to infer the constant's file location. For example, Some::Nested::Model will be resolved to the constant defined in the file path, models/some/nested/model.rb. Packwerk then uses the file path to determine which package defines the constant.

Next, Packwerk will use the resolved constants to check against the configurations of the packages involved. If all the checks are enforced (i.e. dependency and privacy), references from Package A to Package B are valid if:

  1. Package A declares a dependency on Package B, and;
  2. The referenced constant is a public constant in Package B

Ensuring Application Validity

Before diving into further details, we have to make sure that the application is in a valid state for Packwerk to work correctly. To be considered valid, an application has to have a valid autoload path cache, package definition files and application folder structure. Packwerk comes with a command, packwerk validate, that runs on a continuous integration (CI) pipeline to ensure the application is always valid.

Packwerk also checks for any acyclic dependencies within the application. According to the Acyclic Dependency Principle, no cycles should be allowed in the component dependency graph. If packages depend on each other in a cycle, making a change to one package will create a domino effect and force a change on all packages in the cycle. This dependency cycle will be difficult to manage.

In practical terms, imagine working on a domain of the codebase concurrently with 100 other developers. If your codebase has cyclic dependencies, your change will impact the components that depend on your component. When you are done with your work, you want to merge it into the main branch, along with the changes of other developers. This code will create an integration nightmare because all the dependencies have to be modified in each iteration of the application.

An application with an acyclic dependency graph can be tested and released independently without having the entire application change at the same time.

Creating a Package 

A package is defined by a package.yml file at the root of the package folder. Within that file, specific configurations are set. Packwerk allows a package to declare the type of boundary enforcement that the package would like to adhere to. 

Additionally, other useful package-specific metadata can be specified, like team and contact information for the package. We’ve found that having granular, package-specific ownership makes it easier for cross-team collaboration compared to ownership of an entire domain.

Enforcing Boundaries Between Packages

Running packwerk check
Running packwerk check

Packwerk enforces boundaries between packages through a check that can be run both locally and on the CI pipeline. To perform a check, simply run the line packwerk check. We also included this in Shopify’s CI pipeline to prevent any new violations from being merged into the main branch of the codebase.

Enforcing Boundaries in Existing Codebases

Because of the lack of code structure in Rails apps, legacy large scale Rails apps tend to have existing dependency and privacy violations between packages. If this is the case, we want to stop the bleeding and prevent new violations from being added to the codebase.

Users can still enforce boundaries within the application despite existing violations, ensuring the list of violations doesn't continue to increase. This is done by generating a deprecated references list for the package.

We want to allow developers to continue with their workflow, but prevent any further violations. The list of deprecated references can be used to help a codebase transition to a cleaner architecture. It iteratively establishes boundaries in existing code as developers work to reduce the list.

List of deprecated references for components/online_store
List of deprecated references for components/online_store

The list of deprecated references contains some useful information about the violation within the package. In the example above, we can tell that there was a privacy violation in the following files that are referring to the ::RetailStore constant that was defined in the components/online_store package.

By surfacing the exact references where the package’s boundaries are being breached, we essentially have a to-do list that can be worked off.

Conventionally, the deprecated references list was meant for developers to start enforcing the boundaries of an application immediately despite existing violations, and use it to remove the technical debt. However, the Shipping team at Shopify found success using this list to extract a domain out of their main application into its own service. Also, the list can be used if the package were extracted into a gem. Ultimately, we make sure to let developers know that the list of deprecated references should be used to refactor the code and reduce the amount of violations in the list.

The purpose of Packwerk would be defeated if we merely added to the list of violations (though, we’ve made some exceptions to this rule). When a team is unable to add a dependency in the correct direction because the pattern doesn’t exist, we recommend adding the violation to the list of deprecated references. Doing so will ensure that when such a pattern exists, we eventually refactor the code and remove the violation from the list. This results in a better alternative than creating a dependency in the wrong direction.

Preventing New Violations 

After creating packages within your application and enforcing boundaries for those packages, Packwerk should be ready to go. Packwerk will display violations when packwerk check is run either locally or on the CI pipeline.

The error message as seen above displays the type of violation, location of violation, and provides actionable next steps for developers. The goal is to make developers aware of the changes they make and to be mindful of any boundary breaking changes they add to the code.

The Caveats 

Statically analyzing Ruby is complex. If a constant is not autoloaded, Packwerk ignores it. This ensures that the results produced by Packwerk won’t have any false positives, but it can create false negatives. If we get most of the references right, it’ll be enough to shift the code quality in a positive direction. The Packwerk team made this design decision as our strategy to handle the inaccuracy that comes with Ruby static analysis. 

How Shopify Is Using Packwerk

There was no formal push for the adoption of Packwerk within Shopify. Several teams were interested in the tool and volunteered to beta test before it was released. Since its release, many teams and developers are adopting Packwerk to enforce boundaries within their components.

Currently Packwerk runs in six Rails applications at Shopify, including the core monolith. Within the core codebase, we have 48 packages with 30 boundary enforcements within those packages. Packwerk integrates in the CI pipeline for all these applications and has commands that can run locally for packaging-related checks.

Since Packwerk was released for use within the company, new conversations related to software architecture have been sparked. As developers worked on removing technical debt and refactoring the code using Packwerk, we noticed there’s no established pattern for decoupling of code and creating single-direction dependencies. We’re currently researching and discussing inversion of control and establishing patterns for dependency inversion within Rails applications.

Start Using Packwerk. It’s Open Source!

Packwerk is now out in the wild and ready for you to try it out!

To get Packwerk installed in your Rails application, add it as a gem and simply run the command packwerk init. The command will generate the configuration files needed for you to use Packwerk.

The Packwerk team will be maintaining the gem and we’re stoked to see how you will be using the tool. You are also welcome to report bugs and open pull requests in accordance with our contribution guidelines.


Packwerk is inspired by Stripe’s internal Ruby packages solution with its idea adapted to the more complex world of Rails applications.

ShipIt! Presents: Packwerk by Shopify

Without code boundaries in a monolith, it’s difficult for developers to make changes in their respective areas. Like when you make a straightforward change that shockingly results in breaking unrelated tests in a different part of the codebase, or dig around a codebase to find a class or module with more than 2,000 lines of code!

You end up with anti-patterns like spaghetti code and large classes that know too much. The codebase is harder to develop, maintain and understand, leading to difficulty adding new features. It’s frustrating for developers working on the codebase. Developer happiness and productivity is important to us.

So, we created an open source tool to establish code boundaries in Rails applications. We call it Packwerk.

During this event you will

  • Learn more about the problems Packwerk solves.
  • See how we built Packwerk.
  • Understand how we use Packwerk at Shopify.
  • See a demo of Packwerk.
  • Learn how you can get started with Packwerk.

Additional Information 

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default

Continue reading

Under Deconstruction: The State of Shopify’s Monolith

Under Deconstruction: The State of Shopify’s Monolith

Ruby on Rails is a great framework for rapidly building beautiful web applications that users and developers love. But if an application is successful, there’s usually continued investment, resulting in additional features and increased overall system complexity.

Shopify’s core monolith has over 2.8 million lines of Ruby code and 500,000 commits. Rails doesn’t provide patterns or tooling for managing the inherent complexity and adding features in a structured, well-bounded way.

That’s why, over three years ago, Shopify founded a team to investigate how to make our Rails monoliths more modular. The goal was to help us scale towards ever increasing system capabilities and complexity by creating smaller, independent units of code we called components. The vision went like this:

  • We can more easily onboard new developers to just the parts immediately relevant to them, instead of the whole monolith.
  • Instead of running the test suite on the whole application, we can run it on the smaller subset of components affected by a change, making the test suite faster and more stable.
  • Instead of worrying about the impact on parts of the system we know less well, we can change a component freely as long as we’re keeping its existing contracts intact, cutting down on feature implementation time.

In summary, developers should feel like they are working on a much smaller app than they actually are.

It’s been 18 months since we last shared our efforts to make our Rails monoliths more modular. I’ve been working on this modularity effort for the last two and a half years, currently on a team called Architecture Patterns. I’ll lay out the current state of my team’s work, and some things we’d do differently if we started fresh right now.

The Status Quo

We generally stand by the original ideas as described in Deconstructing the Monolith, but almost all of the details have changed.  We make consistent progress, but it's important to note that making changes at this scale requires a significant shift in thinking for a critical mass of contributors, and that takes time.

While we’re far from finished, we already reap the benefits of our work. The added constraints on how we write our code trigger deep software design discussions throughout the organization. We see a mindset shift across our developers with a stronger focus on modular design. When making a change, developers are now more aware of the consequences on the design and quality of the monolith as a whole. That means instead of degrading the design of existing code, new feature implementations now more often improve it. Parts of the codebase that received heavy refactoring in recent years are now easier to understand because their relationship with the rest of the system is clearer.

We automatically triage exceptions to components, enabling teams to act on them without having to dig through the sometimes noisy exception stream for the whole monolith. And with each component explicitly owned by a team, whole-codebase chores like Rails upgrades are easily distributed and collaboratively solved. Shopify is running its main monolith on the newest, unreleased revisions of Rails. The clearly defined ownership for areas of the codebase is one of the factors enabling us to do that.

What We Learned so Far

Our main monolith is one of the oldest, largest Rails codebases on the planet, under continuous development since at least 2006, with hundreds of developers currently adding features.

A refactor on this scale needs to be approached completely differently from smaller efforts. We learned that all large scale changes start

  • with understanding and influencing developer behavior
  • at the grassroots
  • with a holistic perspective on architecture 
  • with careful application of tooling
  • with being aware of the tradeoffs involved

Understand Developer Behaviour

A single centralized team can’t make change happen by working against the momentum of hundreds of developers adding features.

Also, it can’t anticipate all the edge cases and have context on all domains of the application. A single team can make simple change happen on a large scale, or complex change on a small scale. To modularize a large monolith though, we need to make complex change happen on a large scale. Even if a centralized team could make it happen, the design would degrade once the team switches its focus to something else. 

That’s why making a fundamental architecture change to a system that’s being actively worked on is in large part a people problem. We need to change the behavior of the average developer on the codebase. We need to all iteratively evolve the system towards the envisioned future together. The developers are an integral part of the system.

Dr. B.J. Fogg, founder of the Behavior Design Lab at Stanford University, developed a model for thinking about behaviors that matches our experiences. The model suggests that for a behavior to occur, three things need to be in place: Ability, Motivation, and Prompt.

Fogg Behaviour Model by BJ Fogg, PhD
Fogg Behaviour Model by  BJ Fogg, PHD

In a nutshell, prompts are necessary for a desired behavior to happen, but they're ineffective unless there's enough motivation and ability. Exceptionally high motivation can, within reason, compensate for low ability and vice versa.

Automated tooling and targeted manual code reviews provide prompts. That’s the easy part. Creating ability and motivation to make positive change is harder. Especially when that goes against common Ruby on Rails community practices and requires a view of the system that’s much larger than the area that most individual developers are working on. Spreading an understanding of what we’re aiming for, and why, is critical.

For example, we invested quite a bit of time and energy into developing patterns to ensure some consistency in how component boundary interfaces are designed. Again and again we pondered: How should components call each other? We then pushed developers to use these patterns everywhere. In hindsight, this strategy didn’t increase developer ability or motivation. It didn’t solve the problems actually holding them back, and it didn’t explain the reasons or long term goals well enough. Pushing for consistency added rules, which always add some friction, because they have to be learned, remembered, and followed. It didn’t make any hard problem significantly easier to solve. In some cases, the patterns were helpful. In other cases, they lead developers to redefine their problem to fit the solution we provided, which degraded the overall state of the monolith.

Today, we’re still providing some general suggestions on interface consistency, but we have a lot less hard rules. We’re focusing on finding the areas where developers are hungry to make positive change, but don’t end up doing it because it’s too hard. Often, making our code more modular is hard because legacy code and tooling are based on assumptions that no longer hold true. One of the most problematic outdated assumptions is that all Active Record models are OK to access everywhere, when in this new componentized world we want to restrict their usage to the component that owns them. We can help developers overcome this problem.

So in the words of Dr. Fogg, these days we’re looking for areas where the prompt is easy, the motivation is present, and we just have to amp up the ability to make things happen.

Foster the Grassroots

As I mentioned, we, as a centralized team, can’t make this change happen by ourselves. So, we work to create a grassroots movement among the developers at Shopify. We aim to increase the number of people that have ability, motivation and prompt to move the system a tiny step further in the right direction.

We give internal talks, write documentation, share wins, embed in other teams, and pair with people all over the company. Embedding and pairing make sure we’re solving the problems that product developers are most struggling with in practice, avoiding what’s often called Ivory Tower Syndrome where the solutions don’t match the problems. It also lets us gain context on different areas of the codebase and the business while helping motivated people achieve goals that align with ours.

As an example, we have a group called the Architecture Guild. The guild has a slack channel for software architecture discussions and bi-weekly meetups. It’s an open forum, and a way to grow more architecture conscious mindsets while encouraging architectural thinking. The Architecture Patterns team provides some content that we think is useful, but we encourage other people to share their thoughts, and most of the contributions come from other teams. Currently, the Architecture Guild has ~400 members and 54 documented meetups with meeting notes and recordings that are shared with all developers at Shopify.

The Architecture Guild grew organically out of the first Componentization team at Shopify after the first year of Componentization. If I were to start a similar effort again today, I’d establish a forum like this from the beginning to get as many people on board with the change as early as possible. It’s also generally a great vehicle to spread software design knowledge that’s siloed in specific teams to other parts of the company.

Other methods we use to create fertile ground for ambitious architecture projects are

  • the Developer Handbook, an internal online resource documenting how we develop software at Shopify.
  • Developer Talks, our internal weekly livestreamed and recorded talks about software development at Shopify.

Build Holistic Architecture

Some properties of software are so closely related that they need to be approached in pairs. By working on one property and ignoring its “partner property,” you could end up degrading the system.

Balance Encapsulation With A Simple Dependency Graph

We started out by focusing our work on building a clean public interface around each component to hide the internals. The expectation was that this would allow reasoning about and understanding the behavior of a component in isolation. Changing internals of a component wouldn’t break other components—as long as the interface stays stable.

It’s not that straightforward though. The public interface is what other components depend on; if a lot of components depend on it, it’s hard to change. The interface needs to be designed with those dependencies in mind, and the more components depend on it, the more abstract it needs to be. It’s hard to change because it’s used everywhere, and it will have to change often if it contains knowledge about concrete parts of the business logic.

When we started analyzing the graph of dependencies between components, it was very dense, to the point that every component depended on over half of all the other components. We also had lots of circular dependencies.

Circular Dependancies
Circular Dependancies

Circular dependencies are situations where for example component A depends on component B but component B also depends on component A. But circular dependencies don’t have to be direct, the cycles can be longer than two. For example, A depends on B depends on C depends on A.

These properties of the dependency graph mean that the components can’t be reasoned about, or evolved, separately. Changes to any component in a cycle can break all other components in the cycle. Changes to a component that has almost all other components depend on can break almost all other components. So these changes require a lot of context. A dense, cyclical dependency graph undermines the whole idea of Componentization—it blocks us from making the system feel smaller.

When we ignored the dependency graph, in large parts of the codebase the public interface turned out to just be an added layer of indirection in the existing control flows. This made it harder to refactor these control flows because it added additional pieces that needed to be changed. It also didn’t make it a lot easier to reason about parts of the system in isolation.

The simplest possible way to introduce a public interface to a private implementation
The simplest possible way to introduce a public interface to a private implementation

The diagram shows that the simplest possible way to introduce a public interface could just mean that a previously problematic design is leaked into a separate interface class, making the underlying design problem harder to fix by spreading it into more files.

Discussions about the desirable direction of a dependency often surface these underlying design problems. We routinely discover objects with too many responsibilities and missing abstractions this way.

Perhaps not surprisingly, one of the central entities of the Shopify system is the Shop and so almost everything depends on the Shop class. That means that if we want to avoid circular dependencies, the Shop class can depend on almost nothing. 

Luckily, there are proven tools we can use to straighten out the dependency graph. We can make arrows point in different directions, by either moving responsibilities into the component that depends on them or applying inversion of control. Inversion of control means to invert a dependency in such a way that control flow and source code dependency are opposed. This can be done for example through a publish/subscribe mechanism like ActiveSupport::Notifications.

This strategy of eliminating circular dependencies naturally guides us towards removing concrete implementation from classes like Shop, moving it towards a mostly empty container holding only the identity of a shop and some abstract concepts.

If we apply the aforementioned techniques while building out the public interfaces, the result is therefore much more useful. The simplified graph allows us to reason about parts of the system in isolation, and it even lays out a path towards testing parts of the system in isolation.

Dependencies diagram between Platform, Supporting, and Frontend components
Dependencies diagram between Platform, Supporting, and Frontend components

If determining the desired direction of all the dependencies on a component ever feels overwhelming, we think about the components grouped into layers. This allows us to prioritize and focus on cleaning up dependencies across layers first. The diagram above sketches out an example. Here, we have platform components, Platform and Shop Identity, that purely provide functionality to other components. Supporting components, like Merchandising and Inventory, depend on the platform components but also provide functionality to others and often serve their own external APIs. Frontend components, like Online Store, are primarily externally facing. The dependencies crossing the dotted lines can be prioritized and cleaned up first, before we look at dependencies within a layer, for example between Merchandising and Inventory.

Balance Loose Coupling With High Cohesion

Tight coupling with low cohesion and loose coupling with high cohesion
Tight coupling with low cohesion and loose coupling with high cohesion

Meaningful boundaries like those we want around components require loose coupling and high cohesion. A good approximation for this is Change Locality: The degree to which code that changes together lives together.

At first, we solely focused on decoupling components from each other. This felt good because it was an easy, visible change, but it still left us with cohesive parts of the codebase that spanned across component boundaries. In some cases, we reinforced a broken state. The consequence is that often small changes to the functionality of the system still meant changes in code across multiple components, for which the developers involved needed to know and understand all of those components.

Change Locality is a sign of both low coupling and high cohesion and makes evolving the code easier. The codebase feels smaller, which is one of our stated goals. And Change Locality can also be made visible. For example, we are working on automation analyzing all pull requests on our codebase for which components they touch. The number of components touched should go down over time.

An interesting side note here is that different kinds of cohesion exist. We found that where our legacy code respects cohesion, it’s mostly informational cohesion—grouping code that operates on the same data. This arises from a design process that starts with database tables (very common in the Rails community). Change Locality can be hindered by that. To produce software that is easy to maintain, it makes more sense to focus on functional cohesion—grouping code that performs a task together. That’s also much closer to how we usually think about our system. 

Our focus on functional cohesion is already showing benefits by making our business logic, the heart of our software, easier to understand.

Create a SOLID foundation

There are ideas in software design that apply in a very similar way on different levels of abstraction—coupling and cohesion, for example. We started out applying these ideas on the level of components. But most of what applies to components, which are really large groups of classes, also applies on the level of individual classes and even methods.

On a class level, the most relevant software design ideas are commonly summarized as the SOLID principles. On a component level, the same ideas are called “package principles.” Here’s a SOLID refresher from Wikipedia:

Single-responsibility principle

A class should only have a single responsibility, that is, only changes to one part of the software's specification should be able to affect the specification of the class.

Open–closed principle

Software entities should be open for extension, but closed for modification.

Liskov substitution principle

Objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program.

Interface segregation principle

Many client-specific interfaces are better than one general-purpose interface.

Dependency inversion principle

Depend upon abstractions, not concretions.

The package principles express similar concerns on a different level, for example (source):

Common Closure Principle

Classes that change together are packaged together.

Stable Dependencies Principle

Depend in the direction of stability.

Stable Abstractions Principle

Abstractness increases with stability.

We found that it’s very hard to apply the principles on a component level if the code doesn’t follow the equivalent principles on a class and method level. Well designed classes enable well designed components. Also, people familiar with applying the SOLID principles on a class level can easily scale these ideas up to the component level.

So if you’re having trouble establishing components that have strong boundaries, it may make sense to take a step back and make sure your organization gets better at software design on a scale of methods and classes first.

This is again mostly a matter of changing people’s behavior that requires motivation and ability. Motivation and ability can be increased by spreading awareness of the problems and approaches to solving them.

In the Ruby world, Sandi Metz is great at teaching these concepts. I recommend her books, and we’re lucky enough to have her teach workshops at Shopify repeatedly. She really gets people excited about software design.

Apply Tooling Deliberately

To accelerate our progress towards the modular monolith, we’ve made a few major changes to our tooling based on our experience so far.

Use Rails Engines

While we started out with a lot of custom code, our components evolved to look more and more like Rails Engines. We’re doubling down on engines going forward. They are the one modularity mechanism that comes with Rails out of the box. They have the familiar looks and features of Rails applications, but other than apps, we can run multiple engines in the same process. And should we make the decision to extract a component from the monolith, an engine is easily transformed into a standalone application.

Engines don’t fit the use case perfectly though. Some of the roughest edges are related to libraries and tooling assuming a Rails application structure, not the slightly different structure of an engine. Others relate to the fact that each engine can (and probably should) specify its own external gem dependencies, and we need a predictable way to unify them into one set of gems for the host application. Thankfully, there are quite a few resources out there from other projects encountering similar problems. Our own explorations have yielded promising results with multiple production applications currently using engines for modularity, and we’re using engines everywhere going forward.

Define and Enforce Contracts

Strong boundaries require explicit contracts. Contracts in code and documentation allow developers to use a component without reading its implementation, making the system feel smaller.

Initially, we built a hash schema validation library called Component::Schema based on dry-schema. It served us well for a while, but we ran into problems keeping up with breaking changes and runtime performance for checking more complex contracts.

In 2019, Stripe released their static Ruby type checker, Sorbet. Shopify was involved in its development before that release and has a team contributing to Sorbet, as we are using it heavily. Now it’s our go-to tool for expressing input and output contracts on component boundaries. Configured correctly, it has barely any runtime performance impact, it’s more stable, and it provides advanced features like interfaces.

This is what an entrypoint into a component looks like using Component::Schema:

And this is what that entrypoint looks like today, using Sorbet:

Perform Static Dependency Analysis

As Kirsten laid out in the original blog post on Componentization at Shopify, we initially built a call graph analysis tool we called Wedge. It logged all method calls during test suite execution on CI to detect calls between components.

We found the results produced were often not useful. Call graph logging produces a lot of data, so it’s hard to separate the signal from the noise. Sometimes it’s not even clear which component a call is from or to. Consider a method defined in component A which is inherited by a class in component B. If this method is making a call to component C, which component is the call coming from? Also, because this analysis depended on the full test suite with added instrumentation, it took over an hour to run, which doesn’t make for a useful feedback cycle.

So, we developed a new tool called Packwerk to analyze static constant references. For example, the line Shop.first, contains a static reference to Shop and a method call to a method on that class that’s called first. Packwerk only analyzes the static constant reference to Shop. There’s less ambiguity in static references, and because they’re always explicitly introduced by developers, highlighting them is more actionable. Packwerk runs a full analysis on our largest codebase in a few minutes, so we’re able to integrate it with our Pull Request workflow. This allows us to reject changes that break the dependency graph or component encapsulation before they get merged into our main branch.

We’re planning to make Packwerk open source soon. Stay tuned!

Decide to Prioritize Ownership or Boundaries

There are two major ways to partition an existing monolith and create components from a big ball of mud. In my experience, all large architecture changes end up in an incomplete state. Maybe that’s a pessimistic view, but my experience tells me that the temporary incomplete state will at least last longer than you expect. So choose an approach based on which intermediary state is most useful for your specific situation.

One option is to draw lines through the monolith based on some vision of the future and strengthen those lines over time into full fledged boundaries. The other option is to spin off parts of it into tiny units with strong boundaries and then transition responsibilities over iteratively, growing the components over time.

For our main monolith, we took the first approach; our vision was guided by the ideas of Domain Driven Design. We defined components as implementations of subdomains of the domain of commerce, and moved the files into corresponding folders. The main advantage is that even though we’re not finished building out the boundaries, responsibilities are roughly grouped together, and every file has a stewardship team assigned. The disadvantage is that almost no component has a complete, strong boundary yet, because with the components containing large amounts of legacy code, it’s a huge amount of work to establish these. This vision of the future approach is good if well-defined ownership and a clearly visible partition of the app are most important for you—which they were for us because of the huge number of people working on the codebase.

On other large apps within Shopify, we’ve tried out the second approach. The advantage is that large parts of the codebase are in isolated and clean components. This creates good examples for people to work towards. The disadvantage of this approach is that we still have a considerable sized ball of mud within the app that has no structure whatsoever. This spin-off approach is good if clean boundaries are the priority for you.

What We’re Building Right Now

While feature development on the monolith is going on as fast as ever, many developers are making things more modular at the same time. We see an increase of people in a position to do this, and the number of good examples around the codebase is expanding.

We currently have 37 components in our main monolith, each with public entrypoints covering large parts of its responsibilities. Packwerk is used on about a third of the components to restrict their dependencies and protect the privacy of their internal implementation. We’re working on making Packwerk enticing enough that all components will adopt it.

Through increased adoption we’re progressively enforcing properties of the dependency graph. Total acyclicity is the long term goal, but the more edges we can remove from the graph in the short term the easier the system will be to reason about.

We have a few other monolithic apps going through similar processes of componentization right now; some with the goal of splitting into separate services long term, some aiming for the modular monolith. We are very deliberate about when to split functionality out into separate services, and we only do it for good reasons. That’s because splitting a single monolithic application into a distributed system of services increases the overall complexity considerably.

For example, we split out storefront rendering because it’s a read-only use case with very high throughput and it makes sense for us to scale and distribute it separately from the interface that our merchants use to manage their stores. Credit card vaulting is a separate service because it processes sensitive data that shouldn’t flow through other parts of the system.

In addition, we’re preparing to have all new Rails applications at Shopify componentized by default. The idea is to generate multiple separately tested engines out of the box when creating a Rails app, removing the top level app folder and setting up developers for a modular future from the start.

At the same time, we’re looking into some of the patterns necessary to unblock further adoption of Packwerk. First and foremost that means making the dependency graph easy to clean up. We want to encourage inversion of control and more generally dependency inversion, which will probably lead us to use a publish/subscribe mechanism instead of straightforward method calls in many cases.

The second big blocker is efficiently querying data across components without coupling them too tightly. The most interesting problems in this area are

  • Our GraphQL API exposes a partially circular graph to external consumers while we’d like the implementation in the components to be acyclic.
  • Our GraphQL query execution and ElasticSearch reindexing currently heavily rely on Active Record features, which defeats the “public interface, private implementation” idea.

The long term vision is to have separate, isolated test suites for most of the components of our main monolith.

Last But Not Least

I want to give a shout out to Josh Abernathy, Bryana Knight, Matt Todd, Matthew Clark, Mike Chlipala and Jakob Class at Github. This blog post is based on, and indirectly the result of a conversation I had with them. Thank you!

Anita Clarke, Edward Ocampo-Gooding, Gannon McGibbon, Jason Gedge, Martin LaRochelle, and Keyfer Mathewson contributed super valuable feedback on this article. Thank you BJ Fogg for the behavior model and use of your image.

If you’re interested in the kinds of challenges I described, you should join me at Shopify!

Further Reading


    Continue reading

    Tophatting in React Native

    Tophatting in React Native

    On average in 2019, Shopify handled billions of dollars of transactions per week. Therefore, it’s important to ensure new features are thoroughly tested before shipping them to our merchants. A vital part of the software quality process at Shopify is a practise called tophatting. Tophatting is manually testing your coworker’s changes and making sure everything is working properly before approving their pull request (PR).

    Earlier this year, we announced that React Native is the future of mobile development in the company. However, the workflow for tophatting a React Native app was quite tedious and time consuming. The reviewer had to 

    1. save their current work
    2. switch their development environment to the feature branch
    3. rebuild the app and load the new changes
    4. verify the changes inside the app.

    To provide a more convenient and painless experience, we built a tool enabling React Native developers to quickly load their peer’s work within seconds. I’ll explain how the tool works in detail.

    React Native Tophatting vs Native Tophatting

    About two years ago, the Mobile Tooling team developed a tool for tophatting native apps. The tool works by storing the app’s build artifacts in cloud storage, mobile developers can download the app and launch it in an emulator or a simulator on demand. However, the tool’s performance can be improved when there are only React Native code changes because we don’t need to rebuild and re-download the entire app. One major difference between React Native and native apps is that React Native apps produce an additional build artifact, the JavaScript bundle. If a developer only changes the React Native code and not native code, then the only build artifact needed to load the changes is the new JavaScript bundle. We leveraged this fact and developed a tool to store any newly built JavaScript bundles, so React Native apps can fetch any bundle and load the changes almost instantly.

    Storing the JavaScript Bundle

    The main idea behind the tool is to store the JavaScript bundle of any new builds in our cloud storage, so developers can simply download the artifact instead of building it on demand when tophatting.

    New PR on React Native project triggers a CI pipeline in Shopify Build

    New PR on React Native project triggers a CI pipeline in Shopify Build

    When a developer opens a new PR on GitHub or pushes a new commit in a React Native project, it triggers a CI pipeline in Shopify Build, our internal continuous integration/continuous delivery (CI/CD) platform then performs the following steps:

    1. The pipeline first builds the app’s JavaScript bundle.
    2. The pipeline compresses the bundle along with any assets that the app uses.
    3. The pipeline makes an API call to a backend service that writes the bundle’s metadata to a SQL database. The metadata includes information such as the app ID, the commit’s Secure Hash Algorithms (SHA) checksum, and the branch name.
    4. The backend service generates a unique bundle ID and a signed URL for uploading to cloud storage.
    5. The pipeline uploads the bundle to cloud storage using the signed URL.
    6. The pipeline makes an API call to the backend service to leave a comment on the PR.

    QR code that developers can scan on their mobile device

    QR code that developers can scan on their mobile device

    The PR comment records that the bundle upload is successful and gives developers three options to download the bundle, which include

    • A QR code that developers can scan on their mobile device, which opens the app on their device and downloads the bundle.
    • A bundle ID that developers can use to download the bundle without exiting the app using the Tophat screen. This is useful when developers are using a simulator/emulator.
    • A link that developers can use to download the bundle directly from a GitHub notification email. This allows developers to tophat without opening the PR on their computer.

    Loading the JavaScript Bundle

    Once the CI pipeline uploads the JavaScript bundle to cloud storage, developers need a way to easily download the bundle and load the changes in their app. We built a React Native component library providing a user interface (called the Tophat screen) for developers to load the changes.

    The Tophat Component Library 

    The component library registers the Tophat screen as a separate component and a URL listener that handles specific deep link events. All developers need to do is to inject the component into the root level of their application.

    The library also includes an action that shows the Tophat screen on demand. Developers open the Tophat screen to see the current bundle version or to reset the current bundle. In the example below, we use the action to construct a “Show Tophat” button, which opens the Tophat screen on press.

    The Tophat Screen

    The Tophat screen looks like a modal or an overlay in the app, but it’s separate from the app’s component tree, so it introduces a non-intrusive UI for React Native apps. 

    React Native tophat screen in action
    Tophat screen in action

    Here’s an example of using the tool to load a different commit in our Local Delivery app.

    The Typical Tophat Workflow

    Typical React Native tophat workflow

    Typical React Native tophat workflow

    The typical workflow using the Tophat library looks like:

    1. The developer scans the QR code or clicks the link in the GitHub PR comment that resolves to an URL in the format “{appId}://tophat_bundle/{bundle_id}”.
    2. The URL opens the app on the developer’s device and triggers a deep link event.
    3. The component library captures the event and parses the URL for the app ID and bundle ID.
    4. If the app ID in the URL matches the current app, then the library makes an API call to the backend service requesting a signed download URL and metadata for the corresponding JavaScript bundle.
    5. The Tophat screen displays the bundle’s metadata and asks the developer to confirm whether or not this is the bundle they wish to download.
    6. Upon confirmation, the library downloads the JavaScript bundle from cloud storage and saves the bundle’s metadata using local storage. Then it decompresses the bundle and restarts the app.
    7. When the app is restarting, it detects the new JavaScript bundle and starts the app using that bundle instead.
    8. Once the developer verifies the changes, they can reset the bundle in the Tophat screen.

    Managing Bundles Using a Backend Service

    In the native tophatting project, we didn’t use a backend service. However, we decided to use a backend service to handle most of the business logic in this tool. This creates additional maintenance and infrastructure cost to the project, but we believe its proven advantages outweigh its costs. There are two main reasons why we chose to use a backend service:

    1. It abstracts away authentication and implementation details with third-party services.
    2. It provides a scalable solution of storing metadata that enables better UI capabilities.

    Abstracting Implementation Details

    The tool requires the use of Google Cloud’s and GitHub’s SDKs, which means the client needs to have an authentication token for each of these services. If a backend service didn’t exist, then each app and its respective CI pipeline would need to configure their own tokens. The CI pipeline and the component library would also need to have consistent storage path formats. This introduces extra complexity and adds additional steps in the tool’s installation process.

    The backend service abstracts away the interaction with third party services such as authentication, uploading assets, and creating Github comments. The service also generates each bundle’s storage path, eliminating the issue of having inconsistent paths across different components. 

    Storing Metadata

    Each JavaScript bundle has important metadata that developers need to quickly retrieve along with the bundle. A solution used by the native tophatting project is to store the metadata in the filename of the build artifact. We could leverage the same technique to store the metadata in the JavaScript Bundle’s storage path. However, this isn’t scalable if we wish to include additional metadata. For example, if we want to add the author of the commit to the bundle’s metadata, it would introduce a change in the storage path format, which requires changes in every app’s CI pipeline and the component library.

    By using a backend service, we store more detailed metadata in a SQL database and decouple it from the bundle’s storage. This opens up the possibility of adding features like a confirmation step before downloading the bundle and querying bundles by app IDs or branch names.

    What’s Next?

    The first iteration of the tool is complete and React Native developers use the tool to tophat each other’s pull request by simply scanning a QR code or entering a bundle ID. There are improvements that we want to make in the future:

    • Building and uploading the JavaScript bundle directly from the command line.
    • Showing a list of available JavaScript bundles in the Tophat screen.
    • Detecting native code changes.
    • Designing a better UI in the Tophat screen.

    Almost all of the React Native projects at Shopify are now using the tool and my team keeps working to improve the tophatting experience for our React Native developers.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Continue reading

    5 Ways to Improve Your React Native Styling Workflow

    5 Ways to Improve Your React Native Styling Workflow

    In April, we announced Shop, our digital shopping assistant that brings together the best features of Arrive and Shop Pay. The Shop app started from our React Native codebase for our previous package tracking app Arrive, with every screen receiving a complete visual overhaul to fit the new branding.

    While our product designers worked on introducing a whole new design system that would decide the look and feel of the app, we on the engineering side took the initiative to evolve our thinking around how we work with styling of screens and components. The end-product became Restyle, our open source library that allowed us to move forward quickly and easily in our transformation from Arrive to Shop.

    I'll walk you through the styling best practices we learned through this process. They served as the guiding principles for the design of Restyle. However anyone working with a React app can benefit from applying these best practices, with or without using our library.

    The Questions We Needed to Answer

    We faced a number of problems with our current approach in Arrive, and these were the questions we needed to answer to take our styling workflow to the next level:

    • With a growing team working in different countries and time zones, how do we make sure that the app keeps a consistent style throughout all of its different screens?
    • What can we do to make it easy to style the app to look great on multiple different device sizes and formats?
    • How do we allow the app to dynamically adapt its theme according to the user’s preferences, to support for example dark mode?
    • Can we make working with styles in React Native a more enjoyable experience?

    With these questions in place, we came up with the following best practices that provided answers to them. 

    #1. Create a Design System

    A prerequisite for being able to write clean and consistent styling code is for the design of the app to be based on a clean and consistent design system. A design system is commonly defined as a set of rules, constraints and principles that lay the foundation for how the app should look and feel. Building a complete design system is a topic far too big to dig into here, but I want to point out three important areas that the system should define its rules for.


    Size and spacing are the two parameters used when defining the layout of an app. While sizes often vary greatly between different components presented on a screen, the spacing between them should often stay as consistent as possible to create a coherent look. This means that it’s preferred to stick to a small set of predefined spacing constants that’s used for all margins and paddings in the app.

    There are many conventions to choose between when deciding how to name your spacing constants, but I've found the t-shirt size scale (XS, S, M, L, XL, etc) work best. The order of sizes are easy to understand, and the system is extensible in both directions by prefixing with more X’s.


    When defining colors in a design system, it’s important not only to choose which colors to stick with, but also how and when they should be used. I like to split these definitions up into two layers:

    • The color palette - This is the set of colors that’s used. These can be named quite literally, e.g. “Blue”, “Light Orange”, “Dark Red”, “White”, “Black”.
    • The semantic colors - A set of names that map to and describe how the color palette should be applied, that is, what their functions are. Some examples are “Primary”, “Background”, “Danger”, “Failure”. Note that multiple semantic colors can be mapped to the same palette color, for example, both the “Danger” and “Failure” color could both map to “Dark Red”.

    When referring to a color in the app, it should be through the semantic color mapping. This makes it easy to later change, for example, the “Primary” color to be green instead of blue. It also allows you to easily swap out color schemes on the fly to, for example, easily accommodate a light and dark mode version of the app. As long as elements are using the “Background” semantic color, you can swap it between a light and dark color based on the chosen color scheme.


    Similar to spacing, it‘s best to stick to a limited set of font families, weights and sizes to achieve a coherent look throughout the app. A grouping of these typographic elements are defined together as a named text variant. Your “Header” text might be size 36, have a bold weight, and use the font family “Raleway”. Your “Body” text might use the “Merriweather” family with a regular font weight, and size 16.

    #2. Define a Theme Object

    A carefully put together design system following the spacing, colour, and typography practices above should be defined in the app‘s codebase as a theme object. Here‘s how a simple version might look:

    All values relating to your design system and all uses of these values in the app should be through this theme object. This makes it easy to tweak the system by only needing to edit values in a single source of truth.

    Notice how the palette is kept private to this file, and only the semantic color names are included in the theme. This enforces the best practice with colors in your design system.

    #3. Supply the Theme through React‘s Context API

    Now that you've defined your theme object, you might be tempted to start directly importing it in all the places where it's going to be used. While this might seem like a great approach at first, you’ll quickly find its limitations once you’re looking to work more dynamically with the theming. In the case of wanting to introduce a secondary theme for a dark mode version of the app, you would need to either:

    • Import both themes (light and dark mode), and in each component determine which one to use based on the current setting, or
    • Replace the values in the global theme definition when switching between modes. 

    The first option will introduce a large amount of tedious code repetition. The second option will only work if you force React to re-render the whole app when switching between light and dark modes, which is typically considered a bad practice. If you have a dynamic value that you want to make available to all components, you’re better off using React’s context API. Here’s how you would set this up with your theme:

    The theme in React’s context will make sure that whenever the app changes between light and dark mode, all components that access the theme will automatically re-render with the updated values. Another benefit of having the theme in context is being able to swap out themes on a sub-tree level. This allows you to have different color schemes for different screens in the app, which could, for example, allow users to customize the colors of their profile page in a social app.

    #4. Break the System into Components

    While it‘s entirely possible to keep reaching into the context to grab values from the theme for any view that needs to be styled, this will quickly become repetitious and overly verbose. A better way is to have components that directly map properties to values in the theme. There are two components that I find myself needing the most when working with themes this way, Box and Text.

    The Box component is similar to a View, but instead of accepting a style object property to do the styling, it directly accepts properties such as margin, padding, and backgroundColor. These properties are configured to only receive values available in the theme, like this:

    The “m” and “s” values here map to the spacings we‘ve defined in the theme, and “primary” maps to the corresponding color. This component is used in most places where we need to add some spacing and background colors, simply by wrapping it around other components.

    While the Box component is handy for creating layouts and adding background colors, the Text component comes into play when displaying text. Since React Native already requires you to use their Text component around any text in the app, this becomes a drop in replacement for it:

    The variant property applies all the properties that we defined in the theme for textVariant.header, and the color property follows the same principle as the Box component’s backgroundColor, but for the text color instead.

    Here’s how both of these components would be implemented:

    Styling directly through properties instead of keeping a separate style sheet might seem weird at first. I promise that once you start doing it you’ll quickly start to appreciate how much time and effort you save by not needing to jump back and forth between components and style sheets during your styling workflow.

    #5. Use Responsive Style Properties

    Responsive design is a common practice in web development where alternative styles are often specified for different screen sizes and device types. It seems that this practice has yet to become commonplace within the development of React Native apps. The need for responsive design is apparent in web apps where the device size can range from a small mobile phone to a widescreen desktop device. A React Native app only targeting mobile devices might not work with the same extreme device size differences, but the variance in potential screen dimensions is already big enough to make it hard to find a one-size-fits-all solution for your styling.

    An app onboarding screen that displays great on the latest iPhone Pro will most likely not work as well with the limited screen estate available on a first generation iPhone SE. Small tweaks to the layout, spacing and font size based on the available screen dimensions are often necessary to craft the best experience for all devices. In responsive design this work is done by categorizing devices into a set of predefined screen sizes defined by their breakpoints, for example:

    With these breakpoints we're saying that anything below 321 pixels in width should fall in the category of being a small phone, anything above that but below 768 is a regular phone size, and everything wider than that is a tablet.

    With these set, let's expand our previous Box component to also accept specific props for each screen size, in this manner:

    Here's roughly how you would go about implementing this functionality:

    In a complete implementation of the above you would ideally use a hook based approach to get the current screen dimensions that also refreshes on change (for example when changing device orientation), but I’ve left that out in the interest of brevity.

    #6. Enforce the System with TypeScript

    This final best practice requires you to be using TypeScript for your project.

    TypeScript and React pair incredibly well together, especially when using a modern code editor such as Visual Studio Code. Instead of relying on React’s PropTypes validation, which only happens when the component is rendered at run-time, TypeScript allows you to validate these types as you are writing the code. This means that if TypeScript isn’t displaying any errors in your project, you can rest assured that there are no invalid uses of the component anywhere in your app.

    Using the prop validation mechanisms of TypeScript

    Using the prop validation mechanisms of TypeScript

    TypeScript isn’t only there to tell you when you’ve done something wrong, it can also help you in using your React components correctly. Using the prop validation mechanisms of TypeScript, we can define our property types to only accept values available in the theme. With this, your editor will not only tell you if you're using an unavailable value, it will also autocomplete to one of the valid values for you.

    Here's how you need to define your types to set this up:

    Evolve Your Styling Workflow

    Following the best practices above via our Restyle library made a significant improvement to how we work with styles in our React Native app. Styling has become more enjoyable through the use of Restyle’s Box and Text components, and by restricting the options for colors, typography and spacing it’s now much easier to build a great-looking prototype for a new feature before needing to involve a designer in the process. The use of responsive style properties has also made it easy to tailor styles to specific screen sizes, so we can work more efficiently with crafting the best experience for any given device.

    Restyle’s configurability through theming allowed us to maintain a theme for Arrive while iterating on the theme for Shop. Once we were ready to flip the switch, we just needed to point Restyle to our new theme to complete the transformation. We also introduced dark mode into the app without it being a concrete part of our roadmap—we found it so easy to add we simply couldn't resist doing it.

    If you've asked some of the same questions we posed initially, you should consider adopting these best practices. And if you want a tool that helps you along the way, our Restyle library is there to guide you and make it an enjoyable experience.

    Wherever you are, your next journey starts here! Intrigued? We’d love to hear from you.

    Continue reading

    How to Track State with Type 2 Dimensional Models

    How to Track State with Type 2 Dimensional Models

    Application databases are generally designed to only track current state. For example, a typical user’s data model will store the current settings for each user. This is known as a Type 1 dimension. Each time they make a change, their corresponding record will be updated in place:







    2019-01-01 12:14:23

    2019-01-01 12:14:23



    2019-01-01 15:21:45

    2019-01-02 05:20:00


    This makes a lot of sense for applications. They need to be able to rapidly retrieve settings for a given user in order to determine how the application behaves. An indexed table at the user grain accomplishes this well.

    But, as analysts, we not only care about the current state (how many users are using feature “X” as of today), but also the historical state. How many users were using feature “X” 90 days ago? What is the 30 day retention rate of the feature? How often are users turning it off and on? To accomplish these use cases we need a data model that tracks historical state:








    2019-01-01 12:14:23

    2019-01-01 12:14:23




    2019-01-01 15:21:45

    2019-01-02 05:20:00




    2019-01-02 05:20:00




    This is known as a Type 2 dimensional model. I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).

    Implementing Type 2 Dimensional Models at Shopify

    I currently work as a data scientist in the International product line at Shopify. Our product line is focused on adapting and scaling our product around the world. One of the first major efforts we undertook was translating Shopify’s admin in order to make our software available to use in multiple languages.

    Shopify admin translatedShopify admin translated

    At Shopify, data scientists work across the full stack—from data extraction and instrumentation, to data modelling, dashboards, analytics, and machine learning powered products. As a product data scientist, I’m responsible for understanding how our translated versions of the product are performing. How many users are adopting them? How is adoption changing over time? Are they retaining the new language, or switching back to English? If we default a new user from Japan into Japanese, are they more likely to become a successful merchant than if they were first exposed to the product in English and given the option to switch? In order to answer these questions, we first had to figure out how our data could be sourced or instrumented, and then eventually modelled.

    The functionality that decides which language to render Shopify in is based on the language setting our engineers added to the users data model. 








    2019-01-01 12:14:23

    2019-06-01 07:15:03




    2019-02-02 11:00:35

    2019-02-02 11:00:35


    User 1 will experience the Shopify admin in English, User 2 in Japanese, etc... Like most data models powering Shopify’s software, the users model is a Type 1 dimension. Each time a user changes their language, or any other setting, the record gets updated in place. As I alluded to above, this data model doesn’t allow us to answer many of our questions as they involve knowing what language a given user is using at a particular point in time. Instead, we needed a data model that tracked user’s languages over time. There are several ways to approach this problem.

    Options For Tracking State

    Modify Core Application Model Design

    In an ideal world, the core application database model will be designed to track state. Rather than having a record be updated in place, the new settings are instead appended as a new record. Due to the fact that the data is tracked directly in the source of truth, you can fully trust its accuracy. If you’re working closely with engineers prior to the launch of a product or new feature, you can advocate for this data model design. However, you will often run into two challenges with this approach:

    1. Engineers will be very reluctant to change the data model design to support analytical use cases. They want the application to be as performant as possible (as should you), and having a data model which keeps all historical state is not conducive to that.
    2. Most of the time, new features or products are built on top of pre-existing data models. As a result, modifying an existing table design to track history will come with an expensive and risky migration process, along with the aforementioned performance concerns.

    In the case of rendering languages for the Shopify admin, the language field was added to the pre-existing users model, and updating this model design was out of the question.

    Stitch Together Database Snapshots

    System that extracts newly created or updated records from the application databases on a fixed schedule

    System that extracts newly created or updated records from the application databases on a fixed schedule

    At most technology companies, snapshots of application database tables are extracted into the data warehouse or data lake. At Shopify, we have a system that extracts newly created or updated records from the application databases on a fixed schedule.

    Using these snapshots, one can leverage them as an input source for building a Type 2 dimension. However, given the fixed schedule nature of the data extraction system, it is possible that you will miss updates happening between one extract and the next.

    If you are using dbt for your data modelling, you can leverage their nice built-in solution for building Type 2’s from snapshots!

    Add Database Event Logging

    Newly created or updated record is stored in this log stored in Kafka

    Newly created or updated record is stored in this log in Kafka

    Another alternative is to add a new event log. Each newly created or updated record is stored in this log. At Shopify, we rely heavily on Kafka as a pipeline for transferring real-time data between our applications and data land, which makes it an ideal candidate for implementing such a log.

    If you work closely with engineers, or are comfortable working in your application codebase, you can get new logging in place that will stream any new or updated record to Kafka. Shopify is built on the Ruby on Rails web framework. Rails has something called “Active Record Callbacks”, which allows you to trigger logic before or after an alternation of an object’s (read “database records”) state. For our use case, we can leverage the after_commit callback to log a record to Kafka after it has been successfully created or updated in the application database.

    While this option isn’t perfect, and comes with a host of other caveats I will discuss later, we ended up choosing it as it was the quickest and easiest solution to implement that provided the required granularity.

    Type 2 Modelling Recipes

    Below, I’ll walk through some recipes for building Type 2 dimensions from the event logging option discussed above. We’ll stick with our example of modelling user’s languages over time and work with the case where we’ve added event logging to our database model from day 1 (i.e. when the table was first created). Here’s an example of what our user_update event log would look like:







    2019-01-01 12:14:23

    2019-01-01 12:14:23



    2019-02-02 11:00:35

    2019-02-02 11:00:35



    2019-02-02 11:00:35

    2019-02-02 12:15:06



    2019-02-02 11:00:35

    2019-02-02 13:01:17



    2019-02-02 11:00:35

    2019-02-02 14:10:01


    This log describes the full history of the users data model.

    1. User 1 gets created at 2019-01-01 12:14:23 with English as the default language.
    2. User 2 gets created at 2019-02-02 11:00:35 with English as the default language.
    3. User 2 decides to switch to French at 2019-02-02 12:15:06.
    4. User 2 changes some other setting that is tracked in the users model at 2019-02-02 13:01:17.
    5. User 2 decides to switch back to English at 2019-02-02 14:10:01.

    Our goal is to transform this event log into a Type 2 dimension that looks like this:








    2019-01-01 12:14:23





    2019-02-02 11:00:35

    2019-02-02 12:15:06




    2019-02-02 12:15:06

    2019-02-02 14:10:01




    2019-02-02 14:10:01




    We can see that the current state for all users can easily be retrieved with a SQL query that filters for WHERE is_current. These records also have a null value for the valid_to column, since they are still in use. However, it is common practice to fill these nulls with something like the timestamp at which the job last ran, since the actual values may have changed since then.


    Due to Spark’s ability to scale to massive datasets, we use it at Shopify for building our data models that get loaded to our data warehouse. To avoid the mess that comes with installing Spark on your machine, you can leverage a pre-built docker image with PySpark and Jupyter notebook pre-installed. If you want to play around with these examples yourself, you can pull down this docker image with docker pull jupyter/pyspark-notebook:c76996e26e48 and then run docker run -p 8888:8888 jupyter/pyspark-notebook:c76996e26e48 to spin up a notebook where you can run PySpark locally.

    We’ll start with some boiler plate code to create a Spark dataframe containing our sample of user update events:

    With that out of the way, the first step is to filter our input log to only include records where the columns of interest were updated. With our event instrumentation, we log an event whenever any record in the users model is updated. For our use case, we only care about instances where the user’s language was updated (or created for the first time). It’s also possible that you will get duplicate records in your event logs, since Kafka clients typically support “at-least-once” delivery. The code below will also filter out these cases:

    We now have something that looks like this:






    2019-01-01 12:14:23



    2019-02-02 11:00:35



    2019-02-02 12:15:06



    2019-02-02 14:10:01


    The last step is fairly simple; we produce one record per period for which a given language was enabled:








    2019-01-01 12:14:23

    2020-05-23 00:56:49




    2019-02-02 11:00:35

    2019-02-02 12:15:06




    2019-02-02 12:15:06

    2019-02-02 14:10:01




    2019-02-02 14:10:01

    2020-05-23 00:56:49




    dbt is an open source tool that lets you build new data models in pure SQL. It’s a tool we are currently exploring using at Shopify to supplement modelling in PySpark, which I am really excited about. When writing PySpark jobs, you’re typically taking SQL in your head, and then figuring out how you can translate it to the PySpark API. Why not just build them in pure SQL? dbt lets you do exactly that:

    With this SQL, we have replicated the exact same steps done in the PySpark example and will produce the same output shown above.

    Gotchas, Lessons Learned, and The Path Forward

    I’ve leveraged the approaches outlined above with multiple data models now. Here are a few of the things I’ve learned along the way.

    1. It took us a few tries before we landed on the approach outlined above. 

    In some initial implementations, we were logging the record changes before they had been successfully committed to the database, which resulted in some mismatches in the downstream Type 2 models. Since then, we’ve been sure to always leverage the after_commit callback based approach.

    2. There are some pitfalls with logging changes from within the code:

    • Your event logging becomes susceptible to future code changes. For example, an engineer refactors some code and removes the after_commit call. These are rare, but can happen. A good safeguard against this is to leverage tooling like the CODEOWNERS file, which notifies you when a particular part of the codebase is being changed.
    • You may miss record updates that are not triggered from within the application code. Again, these are rare, but it is possible to have an external process that is not using the Rails User model when making changes to records in the database.

    3. It is possible to lose some events in the Kafka process.

    For example, if one of the Shopify servers running the Ruby code were to fail before the event was successfully emitted to Kafka, you would lose that update event. Same thing if Kafka itself were to go down. Again, rare, but nonetheless something you should be willing to live with. There are a few ways you can mitigate the impact of these events:

    • Have some continuous data quality checks running that compare the Type 2 dimensional model against the current state and checks for discrepancies.
    • If & when any discrepancies are detected, you could augment your event log using the current state snapshot.

    4. If deletes occur in a particular data model, you need to implement a way to handle this.

    Otherwise, the deleted events will be indistinguishable from normal create or update records with the logging setup I showed above. Here are some ways around this:

    • Have your engineers modify the table design to use soft deletes instead of hard deletes. 
    • Add a new field to your Kafka schema and log the type of event that triggered the change, i.e. (create, update, or delete), and then handle accordingly in your Type 2 model code.

    Implementing Type 2 dimensional models for Shopify’s admin languages was truly an iterative process and took investment from both data and engineering to successfully implement. With that said, we have found the analytical value of the resulting Type 2 models well worth the upfront effort.

    Looking ahead, there’s an ongoing project at Shopify by one of our data engineering teams to store the MySQL binary logs (binlogs) in data land. Binlogs are a much better source for a log of data modifications, as they are directly tied to the source of truth (the MySQL database), and are much less susceptible to data loss than the Kafka based approach. With binlog extractions in place, you don’t need to add separate Kafka event logging to every new model as changes will be automatically tracked for all tables. You don’t need to worry about code changes or other processes making updates to the data model since the binlogs will always reflect the changes made to each table. I am optimistic that with binlogs as a new, more promising source for logging data modifications, along with the recipes outlined above, we can produce Type 2s out of the box for all new models. Everybody gets a Type 2!

    Additional Information

    SQL Query Recipes

    Once we have our data modelled as a Type 2 dimension, there are a number of questions we can start easily answering:

    Are you passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Continue reading

    ShipIt! Presents: A Look at Shopify's API Health Report

    ShipIt! Presents: A Look at Shopify's API Health Report

    On July 17, 2020, ShipIt!, our monthly event series, presented A Look at Shopify's API Health Report. Our guests, Shuting Chang, Robert Saunders, Karen Xie, and Vrishti Dutta join us to talk about Shopify’s API Health Report, the tool, this multidisciplinary team, built to surface breaking changes affecting Shopify Partner apps. 

    Additional Information

    The links shared to the audience during the event:

    API Versioning at Shopify

    Shopify GraphQL

    API Support Channels

    Other Links

    If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions.

    Continue reading

    How Shopify Reduced Storefront Response Times with a Rewrite

    How Shopify Reduced Storefront Response Times with a Rewrite

    In January 2019, we set out to rewrite the critical software that powers all online storefronts on Shopify’s platform to offer the fastest online shopping experience possible, entirely from scratch and without downtime.

    The Storefront Renderer is a server-side application that loads a Shopify merchant's storefront Liquid theme, along with the data required to serve the request (for example product data, collection data, inventory information, and images), and returns the HTML response back to your browser. Shaving milliseconds off response time leads to big results for merchants on the platform as buyers increasingly expect pages to load quickly, and failing to deliver on performance can hinder sales, not to mention other important signals like SEO.

    The previous storefront implementation‘s development, started over 15 years ago when Tobi launched Snowdevil, lived within Shopify’s Ruby on Rails monolith. Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith: it has much stricter performance requirements and can accept more complexity implementation-wise to improve performance, whereas other components (such as payment processing) need to favour correctness and readability.

    In addition to this difference in paradigm, storefront requests progressively became slower to compute as we saw more storefront traffic on the platform. This performance decline led to a direct impact on our merchant storefronts’ performance, where time-to-first-byte metrics from Shopify servers slowly crept up as time went on.

    Here’s how the previous architecture looked:

    Old Storefront Implementation
    Old Storefront Implementation

    Before, the Rails monolith handled almost all kinds of traffic: checkout, admin, APIs, and storefront.

    With the new implementation, traffic routing looks like this:

    New Storefront Implementation
    New Storefront Implementation

    The Rails monolith still handles checkout, admin, and API traffic, but storefront traffic is handled by the new implementation.

    Designing the new storefront implementation from the ground up allowed us to think about the guarantees we could provide: we took the opportunity of this evergreen project to set us up on strong primitives that can be extended in the future, which would have been much more difficult to retrofit in the legacy implementation. An example of these foundations is the decision to design the new implementation on top of an active-active replication setup. As a result, the new implementation always reads from dedicated read replicas, improving performance and reducing load on the primary writers.

    Similarly, by rebuilding and extracting the storefront-related code in a dedicated application, we took the opportunity to think about building the best developer experience possible: great debugging tools, simple onboarding setup, welcoming documentation, and so on.

    Finally, with improving performance as a priority, we work to increase resilience and capacity in high load scenarios (think flash sales: events where a large number of buyers suddenly start shopping on a specific online storefront), and invest in the future of storefront development at Shopify. The end result is a fast, resilient, single-purpose application that serves high-throughput online storefront traffic for merchants on the Shopify platform as quickly as possible.

    Defining Our Success Criteria

    Once we clearly outlined the problem we’re trying to solve and scoped out the project, we defined three main success criteria:

    • Establishing feature parity: for a given input, both implementations generate the same output.
    • Improving performance: the new implementation runs on active-active replication setup and minimizes server response times.
    • Improving resilience and capacity: in high-load scenarios, the new implementation generally sustains traffic without causing errors.

    Building A Verifier Mechanism

    Before building the new implementation, we needed a way to make sure that whatever we built would behave the same way as the existing implementation. So, we built a verifier mechanism that compares the output of both implementations and returns a positive or negative result depending on the outcome of the comparison.

    This verification mechanism runs on storefront traffic in production, and it keeps track of verification results so we can identify differences in output that need fixing. Running the verifier mechanism on production traffic (in addition to comparing the implementations locally through a formal specification and a test suite) lets us identify the most impactful areas to work on when fixing issues, and keeps us focused on the prize: reaching feature parity as quickly as possible. It’s desirable for multiple reasons:

    • giving us an idea of progress and spreading the risk over a large amount of time
    • shortening the period of time that developers at Shopify work with two concurrent implementations at once
    • providing value to Shopify merchants as soon as possible.

    There are two parts to the entire verifier mechanism implementation:

    1. A verifier service (implemented in Ruby) compares the two responses we provide and returns a positive or negative result depending on the verification outcome. Similar to a `diff` tool, it lets us identify differences between the new and legacy implementations.
    2. A custom nginx routing module (implemented in Lua on top of OpenResty) sends a sample of production traffic to the verifier service for verification. This module acts as a router depending on the result of the verifications for subsequent requests.

    The following diagram shows how each part interacts with the rest of the architecture:

    Legacy implementation and new implementation at the same conceptual layer
    Legacy implementation and new implementation at the same conceptual layer

    The legacy implementation (the Rails monolith) still exists, and the new implementation (including the Verifier service) is introduced at the same conceptual layer. Both implementations are placed behind a custom routing module that decides where to route traffic based on the request attributes and the verification data for this request type. Let’s look at an example.

    When a buyer’s device sends an initial request for a given storefront page (for example, a product page from shop XYZ), the request is sent to Shopify’s infrastructure, at which point an nginx instance handles it. The routing module considers the request attributes to determine if other shop XYZ product page requests have previously passed verification.

    First request routed to Legacy implementation
    First request routed to Legacy implementation

    Since this is the first request of this kind in our example, the routing module sends the request to the legacy implementation to get a baseline reference that it will use for subsequent shop XYZ product page requests.

    Routing module sends original request and legacy implementation’s response to the new implementation
    Routing module sends original request and legacy implementation’s response to the new implementation

    Once the response comes back from the legacy implementation, the Lua routing module sends that response to the buyer. In the background, the Lua routing module also sends both the original request and the legacy implementation’s response to the new implementation. The new implementation computes a response to the original request and feeds both its response and the forwarded legacy implementation’s response to the verifier service. This is done asynchronously to make sure we’re not adding latency to responses we send to buyers, who don’t notice anything different.

    At this point, the verifier service received the responses from both the legacy and new implementations and is ready to compare them. Of course, the legacy implementation is assumed to be correct as it’s been running in production for years now (it acts as our reference point). We keep track of differences between the two implementations’ responses so we can debug and fix them later. The verifier service looks at both responses’ status code, headers, and body, ensuring they’re equivalent. This lets us identify any differences in the responses so we make sure our new implementation behaves like the legacy one.

    Time-related and randomness-related exceptions make it impossible to have exactly byte-equal responses, so we ignore certain patterns in the verifier service to relax the equivalence criteria. The verifier service uses a fixed time value during the comparison process and sets any random values to a known value so we reliably compare the outputs containing time-based and randomness-based differences.

    The verifier service sends comparison result back to the Lua module
    The verifier service sends comparison result back to the Lua module

    The verifier service sends the outcome of the comparison back to the Lua module, which keeps track of that comparison outcome for subsequent requests of the same kind.

    Dynamically Routing Requests To the New Implementation

    Once we had verified our new approach, we tested rendering a page using the new implementation instead of the legacy one. We iterated upon our verification mechanism to allow us to route traffic to the new implementation after a given number of successful verifications. Here’s how it works.

    Just like when we only verified traffic, a request arrives from a client device and hits Shopify’s architecture. The request is sent to both implementations, and both outputs are forwarded to the verifier service for comparison. The comparison result is sent back to the Lua routing module, which keeps track of it for future requests.

    When a subsequent storefront request arrives from a buyer and reaches the Lua routing module, it decides where to send it based on the previous verification results for requests similar to the current one (based on the request attributes

    For subsequent storefront requests, the Lua routing module decides where to send it
    For subsequent storefront requests, the Lua routing module decides where to send it

    If the request was verified multiple times in the past, and nearly all outcomes from the verifier service were “Pass”, then we consider the request safe to be served by the new implementation.

    If nearly all verifier service results are “Pass”, then it uses the new implementation
    If most verifier service results are “Pass”, then it uses the new implementation

    If, on the other hand, some verifications failed for this kind of request, we’ll play it safe and send the request to the legacy implementation.

    If most verifier service results are “Fail”, then it uses the old implementation
    If most verifier service results are “Fail”, then it uses the old implementation

    Successfully Rendering In Production

    With the verifier mechanism and the dynamic router in place, our first goal was to render one of the simplest storefront pages that exists on the Shopify platform: the password page that protects a storefront before the merchant makes it available to the public.

    Once we reached full parity for a single shop’s password page, we tested our implementation in production (for the first time) by routing traffic for this password page to the new implementation for a couple of minutes to test it out.

    Success! The new implementation worked in production. It was time to start implementing everything else.

    Increasing Feature Parity

    After our success with the password page, we tackled the most frequently accessed storefront pages on the platform (product pages, collection pages, etc). Diff by diff, endpoint by endpoint, we slowly increased the parity rate between the legacy and new implementations.

    Having both implementations running at the same time gave us a safety net to work with so that if we introduced a regression, requests would easily be routed to the legacy implementation instead. Conversely, whenever we shipped a change to the new implementation that would fix a gap in feature parity, the verifier service starts to report verification successes, and our custom routing module in nginx automatically starts sending traffic to the new implementation after a predetermined time threshold.

    Defining “Good” Performance with Apdex Scores

    We collected Apdex (Application Performance Index) scores on server-side processing time for both the new and legacy implementations to compare them.

    To calculate Apdex scores, we defined a parameter for a satisfactory threshold response time (this is the Apdex’s “T” parameter). Our threshold response time to define a frustrating experience would then be “above 4T” (defined by Apdex).

    We defined our “T” parameter as 200ms, which lines up with Google’s PageSpeed Insights recommendation for server response times. We consider server processing time below 200ms as satisfying and a server processing time of 800ms or more as frustrating. Anything in between is tolerated.

    From there, calculating the Apdex score for a given implementation consists of setting a time frame, and counting three values:

    • N, the total number of responses in the defined time frame
    • S, the number of satisfying responses (faster than 200ms) in the time frame
    • T, the number of tolerated responses (between 200ms and 800ms) in the time frame

    Then, we calculate the Apdex score: 

    By calculating Apdex scores for both the legacy and new implementations using the same T parameter, we had common ground to compare their performance.

    Methods to Improve Server-side Storefront Performance

    We want all Shopify storefronts to be fast, and this new implementation aims to speed up what a performance-conscious theme developer can’t by optimizing data access patterns, reducing memory allocations, and implementing efficient caching layers.

    Optimizing Data Access Patterns

    The new implementation uses optimized, handcrafted SQL multi-select statements maximizing the amount of data transferred in a single round trip. We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries.

    Reducing Memory Allocations

    We reduce the number of memory allocations as much as possible so Ruby spends less time in garbage collection. We use methods that apply modifications in place (such as #map!) rather than those that allocate more memory space (like #map). This kind of performance-oriented Ruby paradigm sometimes leads to code that’s not as simple as idiomatic Ruby, but paired with proper testing and verification, this tradeoff provides big performance gains. It may not seem like much, but those memory allocations add up quickly, and considering the amount of storefront traffic Shopify handles, every optimization counts.

    Implementing Efficient Caching Layers

    We implemented various layers of caching throughout the application to reduce expensive calls. Frequent database queries are partitioned and cached to optimize for subsequent reads in a key-value store, and in the case of extremely frequent queries, those are cached directly in application memory to reduce I/O latency. Finally, the results of full page renders are cached too, so we can simply serve a full HTTP response directly from cache if possible.

    Measuring Performance Improvement Successes

    Once we could measure the performance of both implementations and reach a high enough level of verified feature parity, we started migrating merchant shops. Here are some of the improvements we’re seeing with our new implementation:

    • Across all shops, average server response times for requests served by the new implementation are 4x to 6x faster than the legacy implementation. This is huge!
    • When migrating a storefront to the new implementation, we see that the Apdex score for server-side processing time improves by +0.11 on average.
    • When only considering cache misses (requests that can’t be served directly from the cache and need to be computed from scratch), the new implementation increases the Apdex score for server-side processing time by a full +0.20 on average compared to the previous implementation.
    • We heard back from merchants mentioning a 500ms improvement in time-to-first-byte metrics when the new implementation was rolled out to their storefront.

    So another success! We improved store performance in production.

    Now how do we make sure this translates to our third success criteria?

    Improving Resilience and Capacity

    While working on the new implementation, the Verifier service identified potential parity gaps, which helped tremendously. However, a few times we shipped code to production that broke in exceedingly rare edge cases that it couldn’t catch.

    As a safety mechanism, we made it so that whenever the new implementation would fail to successfully render a given request, we’d fall back to the legacy implementation. The response would be slower, but at least it was working properly. We used circuit breakers in our custom nginx routing module so that we’d open the circuit and start sending traffic to the legacy implementation if the new implementation was having trouble responding successfully. Read more on tuning circuit breakers in this blog post by my teammate Damian Polan.

    Increase Capacity in High-load Scenarios

    To ensure that the new implementation responds well to flash sales, we implemented and tweaked two mechanisms. The first one is an automatic scaling mechanism that adds or remove computing capacity in response to the amount of load on the current swarm of computers that serve traffic. If load increases as a result of an increase in traffic, the autoscaler will detect this increase and start provisioning more compute capacity to handle it.

    Additionally, we introduced in-memory cache to reduce load on external data stores for storefronts that put a lot of pressure on the platform’s resources. This provides a buffer that reduces load on very-high traffic shops.

    Failing Fast

    When an external data store isn’t available, we don’t want to serve buyers an error page. If possible, we’ll try to gracefully fall back to a safe way to serve the request. It may not be as fast, or as complete as a normal, healthy response, but it’s definitely better than serving a sad error page.

    We implemented circuit breakers on external datastores using Semian, a Shopify-developed Ruby gem that controls access to slow or unresponsive external services, avoiding cascading failures and making the new implementation more resilient to failure.

    Similarly, if a cache store isn’t available, we’ll quickly consider the timeout as a cache miss, so instead of failing the entire request because the cache store wasn’t available, we’ll simply fetch the data from the canonical data store instead. It may take longer, but at least there’s a successful response to serve back to the buyer.

    Testing Failure Scenarios and the Limits of the New Implementation

    Finally, as a way to identify potential resilience issues, the new implementation uses Toxiproxy to generate test cases where various resources are made available or not, on demand, to generate problematic scenarios.

    As we put these resilience and capacity mechanisms in place, we regularly ran load tests using internal tooling to see how the new implementation behaves in the face of a large amount of traffic. As time went on, we increased the new implementation’s resilience and capacity significantly, removing errors and exceptions almost completely even in high-load scenarios. With BFCM 2020 coming soon (which we consider as an organic, large-scale load test), we’re excited to see how the new implementation behaves.

    Where We’re at Currently

    We’re currently in the process of rolling out the new implementation to all online storefronts on the platform. This process happens automatically, without the need for any intervention from Shopify merchants. While we do this, we’re adding more features to the new implementation to bring it to full parity with the legacy implementation. The new implementation is currently at 90%+ feature parity with the legacy one, and we’re increasing that figure every day with the goal of reaching 100% parity to retire the legacy implementation.

    As we roll out the new implementation to storefronts we are continuing to see and measure performance improvements as well. On average, server response times for the new implementation are 4x faster than the legacy implementation. Rhone Apparel, a Shopify Plus merchant, started using the new implementation in April 2020 and saw dramatic improvements in server-side performance over the previous month.

    We learned a lot during the process of rewriting this critical piece of software. The strong foundations of this new implementation make it possible to deploy it around the world, closer to buyers everywhere, to reduce network latency involved in cross-continental networking, and we continue to explore ways to make it even faster while providing the best developer experience possible to set us up for the future.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    Building Reliable Mobile Applications

    Building Reliable Mobile Applications

    Merchants worldwide rely on Shopify's Point Of Sale (POS) app to operate their brick and mortar stores. Unlike many mobile apps, the POS app is mission-critical. Any downtime leads to long lineups, unhappy customers, and lost sales. The POS app must be exceptionally reliable, and any outages resolved quickly.

    Reliability engineering is a well-solved problem on the server-side. Back-end teams are able to push changes to production several times a day. So, when there's an outage, they can deploy fixes right away.

    This isn't possible in the case of mobile apps as app developers don’t own distribution. Any update to an app has to be submitted to Apple or Google for review. It's available to users for download only when they approve it. A review can take anywhere between a few hours to several days. Additionally, merchants may not install the update for weeks or even months.

    It's important to reduce the likelihood of bugs as much as possible and resolve issues in production as quickly as possible. In the following sections, we will detail the work we’ve done in both these areas over the last few years.


    We rely heavily on automation testing at Shopify. Every feature in the POS app has unit, integration, functional, and UI snapshot tests. Developers on the team write these simultaneously as they are adding new functionality to the code-base. Changes aren’t merged unless they include automated tests that cover them. These tests run for each push to the repo in our Continuous Integration environment. You can learn more about our testing strategy here.

    Besides automation testing, we also perform manual testing at various stages of development. Features like pairing a Bluetooth card reader or printing a receipt are difficult to test using automation. While we use mocks and stubs to test parts of such features, we manually test the full functionality.

    Sometimes tests that can be automated, inadvertently end up in the manual test suite. This causes us to spend time testing something manually when computers can do that for us. To avoid this, we audit the manual test suite every few months to weed out all such test cases.

    Code Reviews

    Changes made to the code-base aren’t merged until reviewed by other engineers on the team. These reviews allow us to spot and fix issues early in the life-cycle. This process works only if the reviewers are knowledgeable about that particular part of the code-base. As the team grew, finding the right people to do reviews became difficult.

    To overcome this, we have divided the code-base into components. Each team owns the component(s) that make up the feature that they are responsible for. Anyone can make changes to a component, but the team that owns it must review them before merging. We have set up Code Owners so that the right team gets added as reviewers automatically.

    Reviewers must test changes manually, or in Shopify speak, "tophat", before they approve them. This can be a very time-consuming process. They need to save their work, pull the changes, build them locally, and then deploy to a device or simulator. We have automated this process, so any Pull Request can be top-hatted by executing a single command:

    `dev android tophat <pull-request-url>`


    `dev ios tophat <pull-request-url>`


    You can learn more about mobile tophatting at Shopify here.

    Release Management

    Historically, updates to POS were shipped whenever the team was “ready.” When the team decided it was time to ship, a release candidate was created, and we spent a few hours testing it manually before pushing it to the app stores.

    These ad-hoc releases made sense when only a handful of engineers were working on the app. As the team grew, our release process started to break down. We decided to adopt the release train model and started shipping monthly.

    This method worked for a few months, but the team grew so fast that it wasn’t working anymore. During this time, we went from being a single engineering team to a large team of teams. Each of these teams is responsible for a particular area of the product. We started shipping large changes every month, so testing release candidates was taking several days.

    In 2018, we decided to switch to weekly releases. At first, this seemed counter-intuitive as we were doing the work to ship updates more often. In practice, it provided several benefits:

    • The number of changes that we had to test manually reduced significantly.
    • Teams weren’t as stressed about missing a release train as the next train left in a few days.
    • Non-critical bug fixes could be shipped in a few days instead of a month.

    We then made it easier for the team to ship updates every week by introducing Release Captain and ShipIt Mobile.

    Release Captain

    Initially, the engineering lead(s) were responsible for shipping updates, which included:
    • making sure all the changes are merged before the cut-off
    • incrementing the build and version numbers
    • updating the release notes
    • making sure the translations are complete
    • creating release candidates for manual testing
    • triaging bugs found during testing and getting them fixed
    • submitting the builds to app stores
    • updating the app store listings
    • monitoring the rollout for any major bugs or crashes

    As you can see, this is quite involved and can take a lot of time if done by the same person every week. Luckily, we had quite a large team, so we decided to make this a rotating responsibility.

    Each week, the engineer responsible for the release is called the Release Captain. They work on shipping the release so that the rest of the team can focus on testing, fixing bugs, or working on future releases.

    Each engineer on the team is the Release Captain for two weeks before the next engineer in the schedule takes over. We leverage PagerDuty to coordinate this, and it makes it very easy for everyone to know when they will be Release Captain next. It also simplifies planning around vacations, team offsites, etc.

    To simplify things even further, we configured our friendly chatbot, spy, to automatically announce when a new Release Captain shift begins.

    ShipIt Mobile

    We’ve automated most of the manual work involved in doing releases using ShipIt Mobile. With just a few clicks, the Release Captain can generate a new release candidate.

    Once ready, the rest of the team is automatically notified in Slack to start testing.

    After fixing all the bugs found, the update is submitted to the app store with just a single click. You can learn more about ShipIt Mobile here. These improvements not only make weekly releases easier, but they also make it significantly faster to ship hotfixes in case of a critical issue in production.

    Staged Rollouts

    Despite our best efforts, bugs sometimes slip into production. To reduce the surface area of a disruption, we make the updates available only to a small fraction of our user base at first. We then monitor the release to make sure there are no crashes or regressions. If everything goes well, we gradually increase the percentage of users the update is available to over the next few days. This is done using Phased Releases and Staged Rollouts in iOS AppStore and Google Play, respectively.

    The only exception to this approach is when a fix for a critical issue needs to go out immediately. In such cases, we make the update available to 100% of the users right away. We also can block users from using the app until they update to the latest version.

    We do this by having the POS app query the server for the minimum supported version that we set. If the current version is older than that, the app blocks the UI and provides update instructions. This is quite disruptive and can be annoying to merchants who are trying to make a sale. So we do it very rarely and only for critical security issues.

    Beta Flags

    Staged rollouts are useful for limiting how many users get the latest changes. But, they don’t provide a way to explicitly pick which users. When building new features, we often handpick a few merchants to take part in early-access. During this phase, they get to try the new features and give us feedback that we can work on before a final release.

    To do that, we put features, and even big refactors behind server-side beta flags. Only merchants whose stores we have explicitly set a beta flag will see the app’s new feature. This makes it easy to run closed betas with selected merchants. We also can do staged rollouts for beta flags, which gives us another layer of flexibility.

    Automated Monitoring and Alerts

    When something goes wrong in production, we want to be the first to know about it. The POS app and backend is instrumented with comprehensive metrics, reported in real-time. Using these metrics, we have dashboards set up to track the health of the product in production.

    Using these dashboards, we can check the health of any feature in a geography with just a few clicks. For example, the % of successful chip transactions made using a VISA credit card with the Tap, Chip & Swipe reader in the UK, or the % of successful tap transactions made using an Interac debit card with the Tap & Chip reader in Canada for a particular merchant.

    While this is handy, we didn’t want to have to keep checking these dashboards for anomalies all the time. Instead, we wanted to get notified when something goes wrong. This is important because while most of our engineering team is in North America, Shopify POS is used worldwide.

    This is harder to do than it may seem because the volume of commerce varies throughout the year. Time of day, day of the week, holidays, seasons, and even the ongoing pandemic affect how much merchants are able to sell. Setting manual thresholds to detect issues can cause a lot of false negatives and alert fatigue. To overcome this, we leverage Datadog’s Anomaly Detection. Once the selected algorithm has enough data to establish a baseline, alerts will only get fired if there’s an anomaly for that particular time of the year.

    We direct these alerts to Slack so that the right folks can investigate and fix them.

    Handling Outages

    Air Traffic Control

    In the early days of POS, bugs and outages were reported in the team Slack channel, and whoever on the team had the bandwidth, investigated them. This worked well when we had a handful of developers, but this approach didn’t scale as the team grew. Issues kept going to just a few folks who had the most context, and teams kept getting distracted from regular project work, causing delays.

    To fix this, we set up a rotating on-call schedule called Retail ATC (Air Traffic Control). Every week, there is a group of developers on the team dedicated to monitoring how things are working in production and handling outages. These developers are responsible only for this and are not expected to contribute to regular project work. When there are no outages, ATCs spend time tackling tech debt and helping our Technical Merchant Support team.

    Every developer on the team is on-call for two weeks at a time. The first week they are Primary ATC, and the next week they are Secondary ATC. Primary ATC is paged when something goes wrong, and they are responsible for triaging and investigating it. If they need help or are unavailable (commute time, connectivity issues, etc.), the Secondary ATC is paged. ATCs are not expected to fix all issues that arise by themselves, while often they can. They are instead responsible for working with the team that has the most context.


    Since we offer the POS app on both Android on iOS, we have ATC schedules for developers that work on each of those apps. Some areas, like payments, for instance, need a lot of domain knowledge to investigate issues. So we have dedicated ATCs for developers that work in those areas.

    Having folks dedicated to handling issues in production frees up the rest of the team to focus on regular project work. This approach has greatly reduced the amount of context switching teams had to do. It has also reduced the stress that comes with the responsibility of working on a mission-critical mobile application.

    Over the last couple of years, ATC has also become a great way for us to help new team members onboard faster. Investigating bugs and outages exposes them to various tools and parts of our codebase in a short amount of time. This allows them to become more self-sufficient quickly. However, being on-call can be stressful. So, we only add them to the schedule after they have been on the team for a few months and have undergone training. We also pair them with more experienced folks when they go on call.

    Incident Management

    When an outage occurs, it must be resolved as quickly as possible. To do this, we have a set of best practices that the team can follow so that we can spend more time investigating the issue vs figuring out how to do something.

    An incident is started by the ATC in response to an automated alert. ATCs use our ChatOps tools to start the incident in a dedicated Slack channel.


    Incidents are always started in the same channel, and all communication happens in it. This is to ensure that there is a single source of information for all stakeholders.

    As the investigation goes on, findings are documented by adding the 📝 emoji to messages. Our chatbot, spy automatically adds them to a service disruption document and confirms it by adding a  emoji to the same message.

    Once we identify the cause of the outage and verify that it has been resolved, the incident is stopped.



    The ATC then schedules a Root Cause Analysis (RCA) for the incident on the next working day. We have a no-blame culture, and the meeting is focused on determining what went wrong and how we can prevent it from happening in the future.

    At the end of the RCA, action items are identified and assigned owners. Keeping track of outages over time allows us to find areas that need more engineering investment to improve reliability.

    Thanks to these efforts, we've been able to take an app built for small stores and scale it for some of our largest merchants. Today, we support a large number of businesses to sell products worth billions of dollars each year. Along the way, we also scaled up our engineering team and can ship faster while improving reliability.

    We are far from done, though, as each year we are onboarding bigger and bigger merchants onto our platform. If these kinds of challenges sound interesting to you, come work with us! Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Continue reading

    Using DNS Traffic Management to Add Resiliency to Shopify’s Services

    Using DNS Traffic Management to Add Resiliency to Shopify’s Services

    If you are lacking understanding of what is DNS, traffic management, or why we would even use it, read Part 1: Introduction to DNS traffic management.

    Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management (TM) is a well-used approach to do so. In this second part of the two-part series, we’re sharing Shopify’s DNS traffic management journey from the numerous manually set-up, maintained, and updated traffic management approaches to the fully automated self-served system used by 40+ domains owned amongst 12+ different teams, handling 100M+ requests per day.

    Shopify’s Previous Approaches to DNS Traffic Management

    DNS traffic management isn’t entirely new at Shopify. A number of different teams had their own way of doing traffic management through DNS changes before we automated in 2019, which brought different sets of features and techniques to update records. 

    Traffic management through DNS changes

    Streaming Platform Team

    The team handling our Kafka pipelines used Kubernetes ConfigMaps to define target clusters. So, making changes required

    On top of the process duration which isn't ideal for failover time, using this manual approach doesn't open the door to any active/active configuration (where we share the traffic between two active clusters), since it would require to use two target clusters at once, while this only allows using the one defined in the ConfigMaps. At the time, being able to share traffic wasn't considered necessary for Kafka.

    Search Platform Team

    The team handling Elasticsearch set up the beginnings of DNS updates automation. They used our chatops bot, spy, to run commands requesting failovers. The bot creates a PR in our record-store repository (which uses the record_store open source project) with the requested DNS change, which then needs to be approved, merged, and deployed after passing the tests. Except for the automation, it’s a similar approach to the manual one, hence it has the same limitations with failover delays and active/active capabilities.

    Edgescale Team

    Part of the Edgescale team's responsibilities is to handle our assets (images and static files) and Content Delivery Networks (CDN), which bring the assets as close to the client as possible thanks to a network of distributed servers that store those files. To succeed in this mission, the team wanted more control and active/active capabilities. They used DNS providers allowing for weighted traffic management. They set weights to their records and define which share of traffic goes to which endpoint. It allowed them to share the traffic between their two CDN providers. To set this up, they used the DNS provider’s APIs with weights that could span from 0 (disabled) to 15, using DNS A records (hostname to IP address). To make their lives easier, they wrote a spy cdn command, responsible for making the API calls to the two DNS providers. It reduced the limitations of failover time, as well as provided the active/active capabilities. However, we couldn’t produce a stable, easy to reproduce, and version-controlled configuration of the providers. Adding endpoints to the traffic management was to be done manually, and thus prone to errors. The providers we use for our CDN needs don’t perform the same way in every region of the world and incur different costs in those regions. Using different traffic shares depending on the geographical location of the requesters is one of the traffic-management use-cases we presented in Part 1: Introduction to DNS traffic management(URL), but that wasn’t available here.

    Other Teams

    Finally, a few other teams decided to manually create records in one of the DNS providers used by the team handling our assets. They had fast failover and active/active capabilities but didn’t have a stable configuration, nor redundancy from using a single DNS provider.

    We had too many setups going all over the place, corresponding to maintainability problems. Also, any impactful changes needed a lot of coordination and moving pieces. All of these setups had similar use cases and needs. We started working on how to improve and consolidate our approach to DNS traffic management to

    • reduce the limitations encountered by teams
    • connect other teams with similar needs
    • improve maintainability
    • create ownership

    Consolidation and Ownership of DNS Traffic Management

    The first step of creating a better service used by many teams at Shopify was to define the requirements and goals. We wanted a reliable and redundant system that would provide

    • regionalized traffic
    • fine-grained traffic sharing
    • failover capabilities
    • easy setup and updating by the teams owning traffic-managed domains

    Consolidation and ownership of DNS traffic management

    The final state of our setup provides a unified way of setting up and handling our services. We use a git repository to store the domain's configuration that’s then deployed to two DNS providers. The configuration can be tweaked in a fast and easy manner for both providers through a set of spy commands, allowing for efficient failovers. Let's talk about those choices to build our system, and how we built it.

    Establish Reliability and Redundancy

    Each domain name has a set of nameservers, and when using a DNS client, one of those nameservers is selected and queried first, another one is used when a timeout occurs. Shopify used a single DNS provider until 2016, where a large DNS outage happened while our DNS provider was under a distributed denial of service (DDoS) attack, effectively dropping a large number of legitimate requests. We learned from our mistakes and increased our reliability and redundancy by using more than one DNS provider.

    When CDN traffic management was set up, it used two different DNS providers to follow in the steps of our static DNS records in the record_store. The decision for our new system was easy to make since it prevented being dependent on a single provider, we wanted to follow the same approach and adopt two providers to build our new standard.

    Define The Traffic Management Layers

    Two of our DNS providers allowed for regionalized and weighted traffic management, as well as multiple failover layers. It was just a matter of defining how we wanted things to work and build the equivalent approach for both providers.

    We defined our approach in layers of traffic management and considered that each layer had a decision to make that would reduce the set of options that the next layers can choose from.

    Layer 1 Geographical Fencing

    The layer supports globally matching endpoints, which are mandatoryThe layer supports globally matching endpoints, which are mandatory. We always have an answer to a DNS request for an existing traffic-managed domain, even if there is no region specifically matching the requester. We defined a global region that is selected when nothing more specific matches. The geographical fencing layer selects the endpoints that fit the region where the request originates from. This layer selects the best geographical match with the client’s request. For instance, we set a rule to have endpoint A answered for Canada and endpoint B for Quebec. When we get a request originating from Montreal, we return B. If the request originates from Ottawa, we return A.

    Layer 2 Endpoint Status

    Automatically setting endpoints status depends on a process called health checking or monitoringWe provided a way to enable and disable the use of endpoints depending on their status, which is manually or automatically set. Automatically setting endpoints status depends on a process called health checking or monitoring, where we try to reach the endpoint regularly in order to verify if it does (healthy) or does not (unhealthy) answer. We added a layer of traffic management based on the endpoint status aimed at selecting only the endpoints currently considered as healthy. However, we don’t want the requester to receive an empty answer for a domain that does exist, as it would trigger the negative TTL, most of the time higher than our traffic-managed domain TTL. If any of the endpoints is healthy, then only the healthy endpoints are returned by Layer 2. If none of the endpoints are healthy, all the endpoints are returned. The logic behind this is simple: returning something that doesn’t work is better than not returning anything, as it allows the client to start back using the service as soon as endpoints are back online.

    Layer 3 Endpoint Priority

    Endpoint priority. We allow users to define levels of priorities for the traffic shares of their domainsAnother aspect we want control over is the failover approach for our endpoints. We allow users to define levels of priorities for the traffic shares of their domains. For instance, they could define, as the highest priority, that three endpoints A, B, and C would receive 100%, 0%, and 0% of the traffic respectively. However, when A is unhealthy, instead of using B and C, we define, as a second priority, that B would receive 100% of the traffic. This can’t be done without a layer selecting endpoints based on their priority, as we’d be sending a share of the normal traffic to B, or we don’t have automated control over how B and C share the traffic in the case where A is failing. Endpoints of higher priority layers with a weight set to 0 (not receiving traffic) are also considered down for those layers. This means when the endpoints receiving traffic are unhealthy, through health checking, any higher-priority endpoints get discarded at Layer 2, allowing Layer 3 to keep only the highest priority endpoints left in the returned list.

    Layer 4 Weighted Selection

    This final layer deals with the weights defined for the endpointsThis final layer deals with the weights defined for the endpoints. Each endpoint E reaching this layer has a probability PE of being selected as the answer. PE is obtained through the formula <weight of E>/<sum of the weights of all endpoints reaching Layer 4>. Any 0-weighted endpoints will be automatically discarded unless there are only 0-weighted endpoints where all endpoints will have an equal chance of being selected and returned to the requester. 

    Deploying and Maintaining The Traffic-Managed Domains

    We try to build tooling in a self-service way. It creates a new standard requiring us to make tooling easily accessible for other teams to deploy their traffic-managed domains. Since we use Terraform with Atlantis for a number of our deployments, we built a Terraform module that receives only the required parameters for an application owner and hides most of the work happening behind the scenes to configure our providers.

    The above code represents the gist of what an application owner needs to provide to deploy their own traffic-managed domain.

    We work to keep our deployments organized, so we derive the zone and subdomain parameters from the path of the domain being terraformed. For example, the path to this file is terraform/ allows deriving that the zone is and the subdomain is test.

    When an application owner wants to make changes to their traffic-managed domain, they just need to update the file and apply the terraform change. There are a number of extra features that control

    • automated monitoring and failover for their domains 
    • monitoring configuration for domains
    • paging when a failover is automatically triggered or not.

    When we make changes to how traffic-managed domains are deployed, or add new features, we update the module and move domains to the new module version one by one. Everything stays transparent for the application owners and easy to maintain for us, the Edgescale team.

    Everyday Traffic Steering Operations

    We allowed the users of our new standard to make changes fast and easily applied to their traffic-managed domain. We built a new command in our chatops bot, spy endpoints, to perform operations on the traffic-managed domains.

    Those commands will operate on relative and absolute domains. Relative domains will automatically receive our default traffic-managed zone as a suffix. It’s also possible to specify in which region the change should apply by using square brackets; for example, cdn[us-*,na] would concern the cdn traffic-managed domain but only in region na and only ones starting with us-.

    The spy endpoints get command gets the current traffic shares between endpoints for a given domain

    The spy endpoints get command gets the current traffic shares between endpoints for a given domain. If all providers are holding the same data (which should be the case most of the time), then the command results without any mention of the DNS providers. When results are different (the providers went out of sync), the data will be shown with mentions of the providers to make sure that the current traffic shares are known.

    The spy endpoints set command changes the traffic shares using specific weight values we provide

    The spy endpoints set command changes the traffic shares using specific weight values we provide. It updates every provider and runs the spy endpoints set command to show the new traffic shares. Instead of specifying the weights for each endpoint, it’s possible to use a number of defined profiles that set predefined traffic shares. For example, our test domain with its two endpoints mostly[central-4] will define weights of 95 for central-4 and 5 for central-5.

    Our Success Stories

    Moving to ElasticSearch 7.0.0 - We talked about the process that was used by the ElasticSearch team and the fact that it was limited to failovers and didn’t allow traffic shares. When we moved our internal ElasticSearch clusters to ElasticSearch 7.0.0, the team was able to use the weighted load balancing provided by our tools to move the traffic chunk by chunk and ensure everything was working properly. It allowed them to keep the regular traffic going and mitigate any issue they might have encountered along the way, making the transition to ElasticSearch 7.0.0 seamless to the different systems using it.

    Recovering from Kafka overload during a flash sale - During a large flash sale, the Kafka brokers in one of our clusters started overloading from the traffic they had to handle. Once the problem was identified, it took a few minutes for the Kafka team to realize that they now had traffic share capabilities from our DNS traffic manager. They used it to divert half the traffic from the overloaded region and send it to another available region. Less than five minutes after making that change, the Kafka queues started recovering.

    Relieving on-call stress - Being on-call is stressful, especially when running errands and we don’t want to be stuck at home waiting for our phone to ring. Even with the great on-call culture at Shopify, and people always happy to override parts of your shift, being able to use the DNS traffic manager to steer traffic of an application to another cluster when something happens helps in so many cases. One aspect the different teams appreciated is that work can be done from the phone easily (thanks spy!). Another one is allowing Shopifolk to stay serene while solving the incidents which are mitigated thanks to traffic management and don’t impact our merchants and their clients. In summary, easy to use tooling and practical features together improve the experience of both our merchants and coworkers.

    Since the creation of our new DNS traffic management standard in the middle of 2019, we’ve onboarded more than 40 different domains across more than 12 different teams.

    Why Ownership Is Important. Demonstrated By Example

    A few months ago, while we moved teams to use the initial version of our DNS traffic manager, we got an email from one of our DNS providers letting us know that they would discontinue their service because it would be merged with the services of the company that bought them a few years prior. Of course, we weren’t so lucky, having their systems merged together would require action on our part. We needed to manually migrate our zones to the new provider.

    We launched a project to find our next DNS provider as a result. Since we needed to manually migrate our zones and consequently all of our tooling, we might as well evaluate our options. We looked at more than 40 providers, keeping in mind our needs for our static zones and traffic management requirements. We selected a few providers that fit our needs and decided on which one to sign a contract with.

    Once we chose the provider, the big migration happened. First, we updated our terraform module to support the new provider and deployed the traffic-managed domains in the three providers. Then we updated our spy endpoints tooling to update all providers when making changes so everything was ready and in sync. Next, we moved the nameservers of our traffic-managed zone one by one from the DNS provider we were leaving to the new DNS provider, making sure that in case of a problem only a controlled share of the traffic would be affected. We explained our migration plans to the different teams owning domains in the traffic manager, letting them know when it would happen, and that it should be transparent to them, but if anything unexpected seems to be happening, they should contact us. We also told the incident manager on-call of the changes happening and the timelines.

    Everything was in the plan for the change to happen. However, 30 minutes before change time, the provider we were leaving had an incident, preventing us from moving traffic and happening at the same time as one of the application owners having an incident that they wanted to mitigate with the traffic manager. It pushed our timeline forward, but we continued with the change without any issue and it was fully transparent to all the application owners.

    Looking back to how things were before we rolled out our new standard for DNS traffic management, we easily can say that moving to a new DNS provider wouldn’t have been that smooth. We would have had to

    • contact every team using their own approach to gather their needs and usage so we could find a good alternative to our current DNS provider (luckily this was done while preparing and building this project)
    • coordinate between those teams for the change to happen, and then chase after them to make sure they updated any tooling used

    The change couldn’t be handled for all of them as a whole as there wasn’t one product that one team handles, but many products that many teams handle.

    With our DNS traffic management system, we brought ownership to this aspect of our infrastructure because we understand the capabilities and requirements of teams, and how we can maintain and evolve as our teams’ needs evolve, improving the experience of our merchants and their customers.

    Our DNS traffic management journey took us from many manually setup, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Continue reading

    An Introduction to DNS Traffic Management

    An Introduction to DNS Traffic Management

    Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management is a well-used approach to do so. In this first part of a two-part series, we aim to give a broad overview of DNS and how it’s used for traffic management, as well as the different reasons why we want to use DNS traffic management.

    If you already have context on what is DNS, what is traffic management, and the reasons why you would need to use DNS traffic management, you can skip directly to where we share our journey and improvements made regarding DNS traffic management at Shopify in Part 2: Shopify’s DNS Traffic Management.

    A Summarized History of DNS

    Everything started with humans trying to communicate and a plain text file, even before the advent of the modern internet.

    The Advanced Research Projects Agency Network (ARPANET) was thought, in 1966, to enable access to remote computers. In 1969, the first computers were connected to ARPANET, followed by the implementation of the Network Control Program (NCP) in 1970. Guided by the need to connect more and more computers together, and as the work on the Transmission Control Protocol (TCP), started in 1974, evolved, TCP/IP was created in the late 1970s to provide the ability to join separate networks in a network of networks and replaced NCP in ARPANET on January 1st, 1983.

    At the beginning of ARPANET there were just a handful of computers from four different universities connected together, which was easier for people to remember the addresses. This became challenging with new computers joining the network. The Stanford Research Institute provided, through file sharing, a manually maintained file containing the hostnames and related addresses of hosts as provided by member organizations of ARPANET. This file, originally named HOSTS.TXT, is now also broadly known as the /etc/hosts file on Unix and Unix-like systems.

    A growing network with an increasingly large number of computers meant an increasingly large file to download and maintain. By the early 1980s, this process became slow and an automated naming system was required to address the technical and personnel issues of the current approach. The Domain Name System (DNS) was born, a protocol converting human-readable (and rememberable) domain names into Internet Protocol (IP) addresses.

    What is DNS?

    Let’s consider that DNS is a very large library where domains are organized from the least to the most meaningful parts of their names. For instance, if you (the client) want to resolve, you would consider that .com is the least meaningful part as it’s shared with many domain names, and shops the most meaningful part as it’s a specification on the subdomain you’re requesting. Finding in this DNS Library would thus mean going to the .com shelves and finding the myshopify book. Once the book in hand, we would then open it to the shops page, and see something that looks like the following:

    The image is telling us that corresponds to the IP address
    DNS Library Book

    The image is telling us that corresponds to the IP address Also, our DNS Library provides us with the equivalent of a Due Date, which is called Time To Live (TTL). It corresponds to the amount of time the association of hostname to IP address is valid. We remember or cache that information for the given amount of time. If we need this information after expiration, we have to “find that book” again to verify if the association is still valid.

    The opposite concept already exists: if you’re trying to find a page in the book and can’t find it, chances are that you won’t wait there until someone writes it down for you. In DNS, this concept is driven by the Negative TTL, which represents the duration we consider a NXDOMAIN (non-existing domain) answer can be cached. This means that the author of a new page in this book cannot consider their update is known by everyone until that period of time has elapsed.

    Another relevant element is that the DNS Library doesn't necessarily hold only one book but multiple identical ones, from different editors, enabling others to be consulted if one copy is unavailable.

    In DNS terms, the editors are DNS providers. The shelves contain multiple sections for each domain nameserver, the servers that provide DNS resolution as a source of truth for a domain. The books are zones in the domain nameservers, and the book pages are DNS records, the direct relation between the queried record and the value it should resolve to. The DNS Library is what we call root servers, a set of 13 nameservers (named from a to m) that hold the keys to the root of the hostnames. The root servers are responsible for helping to locate the shelves, the nameservers of the Top Level Domains (TLD), the domains at the highest level in the hierarchy for DNS.

    What is Traffic Management?

    Traffic management is a key branch of logistics that aims to plan and control everything required to provide for the safe, orderly, and efficient movement of persons and goods. Traffic management helps to manage situations such as congestion or roadblocks, by redirecting traffic or sharing traffic between multiple routes. For instance, some navigation applications use data they get from their users (current location, current speed, etc.) to know where congestion is happening and improve the situation by suggesting alternative routes instead of sending them to the already overloaded roads.

    A more generic description is that traffic management uses data to decide where to direct the traffic. We could have different paths depending on the country of origin (think country border waiting lines for the booths, where the checks are different depending on the passport you hold), different paths depending on vehicle size (bike lanes, directions for trucks vs. cars, etc.) or any other information we find relevant.

    DNS + Traffic Management = DNS Traffic Management

    Bringing the concept of traffic management to DNS means serving data-driven answers to DNS queries resulting in different answers depending on the location of the requester or for each request. For instance, we could have two clusters of servers and want to split the traffic between the two: we can decide to answer 50% of the requests with the first cluster and the other 50% with the second. The clients obtaining the answers would connect to the cluster they got directed to, without any other action on their part.

    DNS queries are cached to avoid overloading servers with queries.

    However, from the previous section, DNS queries are cached to avoid overloading servers with queries. Each time a query is cached by a resolver, it won’t be repeated by that resolver for the duration of its TTL. Using a low TTL will make sure that the information is kept around but not for too long. For example, returning a TTL of 15 seconds means that after 15 seconds the client needs to resolve the record again, and can get a different answer than before.

    A low TTL needs careful consideration, as the time it takes to obtain the DNS record’s content from the DNS servers, called DNS resolution time, sometimes can dominate the time it takes to retrieve a resource like a webpage. The connection performance and accuracy of the result are thus often at odds. For instance, if I want my changes to appear to users in at most 15 seconds (hence setting a 15 seconds TTL), but the DNS resolution time takes 1 second means that every 15 seconds the users will take 1 more second to reach the service they are connecting to. Over a day, this added resolution time adds up to 5760 seconds, or 1 hour and 36 minutes. If we slightly sacrifice the accuracy by moving the TTL to 60 seconds, the resolution time becomes 1440 seconds over a day, or only 24 minutes, improving the overall performance.

    The use of caching and TTL implies that doing DNS traffic management isn’t instant. There's a short delay in refreshing the record that should be at most the TTL that we configured. In practice, it can be slightly more as some DNS resolvers, unbeknownst to the client, might cache the results for a longer time than they see fit. The override of TTL shouldn’t happen often, however, but it’s something to be aware of when choosing DNS to do traffic management.

    Examining Four DNS Traffic Management Use Cases

    DNS traffic management is interesting when handling systems that don’t necessarily hold load balancer capabilities at the network level, either through an IP-level load balancer or any front-facing proxy, i.e. once already connected to the service we are trying to reach. There are many reasons to use DNS traffic management in front of services, and multiple reasons why we use it at Shopify.

    Easy Failover

    One of Shopify’s use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reasonOne of Shopify’s use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reason. Using DNS management and having it ready to target two clusters, but using one by default, simply redirects the traffic to the second cluster whenever the first one crashes, it then redirects back the traffic when it recovers. This is commonly called active-passive. If you’re able to identify the unavailability of your main cluster in a timely fashion, this approach makes it almost seamless (considering the TTL) to the clients using the service, as they’d use the still-working cluster while the issue is solved, either automatically or through the intervention of the responsible on-call team (as a last resort). The pressure is relieved on those on-call teams, as they know that clients can still use the service while they solve the issues, sometimes even pushing the work to be done to the next working day.

    Share traffic between endpoints

    Share traffic between endpoints

    Services inevitably grow and end up receiving requests from many clients. Now those requests need to be shared between available endpoints offering the exact same service. This is called active-active. Another motivation behind this approach is money related, when using external vendors (an external company contracted to provide your users a service) with minimum commitment, allowing you to share your traffic load between those vendors in a way that ensures reaching those commitments. You define the percent of traffic sent to each given endpoint corresponding to the percentage of DNS requests answered with that endpoint.

    Deploy a Change Progressively

    DNS traffic management can help by allowing movement of a small percentage of your traffic to a cluster that’s already updated

    When developing production services, sometimes making a potentially disruptive change (such as deploying a new feature, changing the behavior of an existing one, or updating a system to a new version) is needed. In such cases, deploying your change and crossing your fingers while hoping for success is, at best, risky. DNS traffic management can help by allowing movement of a small percentage of your traffic to a cluster that’s already updated, then move more and more chunks of traffic until all of the traffic has been moved to the cluster with the new feature. This approach is called green-blue deployment. You can then update the other cluster, which allows you to be ready for the next update or failover.

    Regionalize Traffic Decisions

    geolocation can be fine-grained to the country, state or province level, or applied to a broader region of the world

    You might find cases where some endpoints are more performant in some regions than others, which might happen when using external vendors. If performance is important for users, as it is for Shopify’s merchants and their customers, then you want to make sure the most performant endpoints are used for users in each region by allowing DNS answers based on the client’s location. Most of the time, geolocation can be fine-grained to the country, state or province level, or applied to a broader region of the world. Routing rules are defined to indicate what should be answered depending on the origin of requests. Once done, a client connecting from a location will get the answer that fits them. 

    Our DNS traffic management journey took us from many manually set-up, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Continue reading

    Migrating Large TypeScript Codebases To Project References

    Migrating Large TypeScript Codebases To Project References

    In 2017, we began migrating the merchant admin UI of Shopify from a traditional Ruby on Rails Embedded RuBy (ERB) based front-end to an entirely new codebase, TypeScript paired with React and GraphQL. Using TypeScript enabled our ever-growing Admin teams to leverage TypeScript’s compiler to catch potential bugs and errors well before they ship. The VSCode editor also provides useful TypeScript language-specific features, such as inline feedback from TypeScript’s type-checker.

    Many teams work independently but in parallel on the Shopify merchant admin UI. These teams need to stay highly aligned and loosely coupled to ensure they quickly ship reliable code. We chose TypeScript to empower our developers with great tools that help them ship with confidence. Pairing TypeScript with VSCode provides developers with actionable real-time feedback right in their editor without the need to run separate commands or push to CI. With these tools, teams leverage the benefits of TypeScript’s static type-checker right in the editor as they pull in the latest changes.

    The Problems with Large TypeScript Codebases

    Over the years as teams grew, many new features shipped, and the codebase dramatically increased in size. Teams had a poorer experience in the editor. Our codebase was taxing the editor tooling available to use. TS-Server (VSCode’s built-in language server for TypeScript) would seemingly halt, leaving developers with all the weight of TypeScript’s syntax and none of the editor tooling payoff. The editor took ~2-3 minutes to load up all of TypeScript’s great tooling, slowing how fast we ship great features for our merchants, and it also created a very frustrating development experience.

    We need to ship commerce features fast, so we investigated solutions to bring the developer experience in the editor up to speed.

    Understanding What VSCode Didn’t Like About Our Codebase

    First, we started by understanding why and where our codebase was taxing VSCode and TypeScript’s type-checker. Thankfully, the TypeScript team has an in-depth wiki page on Performance, and the VSCode team has an insightful wiki page on Performance Issues.

    Before implementing any solutions, we needed to understand what VSCode didn’t like about our codebase. To my pleasant surprise, around this time Taryn Musgrave, a developer from our Orders & Fulfillment team, had recently attended JSConfHi 2020. At the conference, she met a few folks from TC39 (the team that standardizes the JavaScript language under the “ECMAScript” specification) , one of them being Daniel Rosenwasser, the Program Manager for TypeScript at Microsoft. After the conference, we connected and shared our codebase diagnostics with the core TypeScript team from Microsoft.

    The TypeScript team had asked us for more insight into our codebase. We ran some initial diagnostics against the single TypeScript configuration (tsconfig.json) file for the project. The TypeScript compiler provides an --extendedDiagnostics flag that we can pass to get some useful diagnostic info

    Their team was surprised by the size of our codebase and mentioned that it was in the top 1% of codebase sizes they’d seen. After sharing some more context into our tools, libraries, and build setup they suggested we break up our codebase into smaller projects and leverage a new TypeScript feature released in 3.0 called project references.

    Measuring Improvements

    Before diving into migrating our entire admin codebase, we needed to take a step back to decide how we could measure and track our improvements. To confidently know we’re making the right improvements, we need tools that enable us to measure changes.

    At first, we used VSCode’s TSServer logs to measure and verify our changes. For our Admin codebase the TSServer would spit out ~80,000 lines of logs over the course of ~2m30s on every bootup of VSCode.

    We quickly realized that this approach wasn’t scalable for our teams. Expecting teams to parse through 80,000 lines to verify their improvement wasn’t feasible. So, for this reason we set out to build a VSCode plugin to help our teams at Shopify measure and track their editor initialization times over time. Internally we called this plugin TypeTrack. 

    TypeTrack in action
    TypeTrack in action

    TypeTrack is a lightweight plugin to measure VSCode's TypeScript language feature initialization time. This plugin is intended to be used by medium-large TypeScript codebases to track editor performance improvements as projects migrate their code to TypeScript project references.

    Migrating to Project References

    In large projects, migrating to project references in one go isn’t feasible. The migration would have to be done gradually over time. We found it best to start out by identifying leaf nodes in our project’s dependency graph. These leaf nodes generally include the most widely shared parts of our codebase and have little to no coupling with the rest of our codebase.

    Our Admin codebase’s structure loosely resembles this:

    In our case, a good place to start was by migrating the packages and tests folders. Both of these folders are meant to be treated as “isolated projects” already. Migrating them to leverage project references not only improves their initialization performance in VSCode, but also encourages a healthier codebase. Developers are encouraged to break up their codebase into smaller pieces that are more manageable and reusable.

    Starting At The Leaves

    Once we’ve identified our leaf nodes, we start by creating an entrypoint for the project references.

    Whenever a project is referenced, that respective folder is also migrated to a project reference, meaning it requires a project-level tsconfig specifying the dependencies that project references in turn. In the case of the above-mentioned project-references.tsconfig.json, the ../../package path reference looks for a packages/tsconfig.json.

    Once you’ve got a few folder with project references created, you run the TypeScript compiler to check if your project builds:

    > Protip: Use the --diagnostics flag to get insight into what the compiler is doing.

    At this point it’s very likely that you’ll get tons of errors, the most common being that the compiler can’t find a module that your project is referencing.

    To illustrate this, see the screenshot below. The compiler isn’t able to find the module @shopify/address-consts from within the @shopify/address package

    Compiler can’t find a module the project is referencing
    Compiler can’t find a module the project is referencing

    The solution here would be:

    1. Create a project-level tsconfig.json for the module being depended on. You can think of this as a child node in your project dependency graph. For our example error from above, we need to create a tsconfig.json in packages/address-consts.
    2. Reference the new project reference (the tsconfig.json created from 1) in the consuming parent tsconfig.json that requires the new module. In our example, we need to include the new project reference.

    Keep in mind that you'll have to respect the following restrictions for project references:

    • The include/files compiler option must include all the input files that the project reference relies on.
    • Any project that’s referenced must itself have a references array (which may be empty).
    • Any local namespaces you import must be listed in the config/typescript/tsconfig.base.json #compilerOptions#paths field.

    General Steps to Migrating Your Codebase to Project References

    Once the TypeScript compiler successfully builds the leaf nodes that you’ve migrated to project references, you can start working your way up your project dependency graph. Outlined below are general steps you’ll take to continue migrating your whole codebase:

    1. Identify the folder you plan to migrate. Add this to your project’s config/typescript/project-references.tsconfig.json
    2. Create a tsconfig.json for the folder you’ve identified.
    3. Run the TypeScript compiler
    4. Fix the errors. Determine what other dependencies outside of the identified folder need to be migrated. Migrate and reference those dependencies.
    5. Go to step 3 until the compiler succeeds.

    For large projects, you may see hundreds or even thousands of compiler errors in step 4. If this is the case, pipe out those errors to a log file and look for patterns in the errors.

    Here are some common solutions we’ve found when we’ve come across this roadblock:


    • The folder is just too large. At the root of the folder create a tsconfig.json that includes references to child folders. Migrate the child folders with the general steps mentioned above
    • This folder contains spaghetti code. It’s possible this folder could be reaching into other dependencies/folders that it shouldn’t. In our case we’ve found it best to move shared code into our packages folder consisting of shared isolated packages.

    We Saw Drastic Improvements in VSCode

    Once we migrated our packages and tests folders in our Admin codebase, we made great improvements in the amount of time it takes VSCode to initialize the editor’s TypeScript language features in those folders.

    File Before  After
    packages/@web-utilities/env/index.ts ~155s ~13s
    packages/@shopify/address-autocomplete/AddressAutocomplete.ts ~155s ~8s
    tests/utilities/environment/app.tsx ~155s ~10s

    How Does This Make Sense? Why Is This 10x Faster?

    When a TypeScript file is opened up in VSCode, the editor sends off a request to the TypeScript Server for analysis on that file. The TS Server then looks for the closest tsconfig.json relative to that file for it to understand the types associated with that project. In our case, that’s the project-level tsconfig.json which included our whole codebase.

    With the addition of project references (i.e tsconfig.json with references to its dependent projects) in the packages and tests folders, we’re explicitly separating the codebase into smaller blocks of code. Now, when VSCode loads a file that uses project references, the TS Server will only load up the files that the project depends on.

    Scaling Migration Duties to Other Teams

    In our case, given the size of our Admin codebase, my team couldn’t migrate all of it ourselves. For this reason we wrote documentation and provided tools (our internal VSCode plugin) to other Admin section teams to improve their section’s editor performance.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

    Continue reading

    ShipIt! Presents: AR/VR at Shopify

    ShipIt! Presents: AR/VR at Shopify

    On June 20, 2020, ShipIt!, our monthly event series, presented AR/VR at Shopify. Daniel Beauchamp, Head of AR/VR at Shopify talked about how we’re using this increasingly ubiquitous technology at Shopify.

    The missing Shopify AR Stroller in video in the AR/VR at Shopify presentation can be viewed at

    Additional Information

    Daniel Beauchamp



    Shopify AR



    Continue reading

    Start your free 14-day trial of Shopify