Stop trying to do it all alone, add Kit to your team. Learn more.
Unlocking Real-time Predictions with Shopify's Machine Learning Platform

Unlocking Real-time Predictions with Shopify's Machine Learning Platform

In 2022, we shipped Merlin, Shopify’s new and improved machine learning platform built on Ray. We built a flexible platform to support the varying requirements, inputs, data types, dependencies and integrations we deal with at Shopify (you can read all about Merlin’s architecture in our previous blog). Since then, we’ve enhanced the platform by onboarding more use cases and adding features to complete the end-to-end machine learning workflow, including:

  1. Merlin Online Inference, which provides the ability to deploy and serve machine learning models for real-time predictions.
  2. Model Registry and experiment tracking with Comet ML.
  3. Merlin Pipelines, a framework for reproducible machine learning pipelines on top of Merlin.
  4. Pano Feature Store, an offline / online feature store built on top of an open source feature store, Feast.

The ability to provide real-time predictions for user-facing applications and services is an important requirement at Shopify, and will become increasingly critical as more machine learning models are integrated closer to user-facing products. But it’s a challenging requirement. As machine learning is utilized by many different teams across Shopify, each with its own use-cases and requirements, we had to ensure that Merlin’s online inference could be an effective, generalized solution. We needed to build something robust that could be used by all of our use-cases and allow low-latency, while serving machine learning models at Shopify scale.

In this blog post, we’ll walk through how we built Merlin’s online inference capabilities to deploy and serve machine learning models for real-time prediction at scale. We’ll cover everything from the serving layer where our users can focus solely on their specific inference logic, to service deployment where we utilized our internal service ecosystem to ensure that online inference services can scale to Shopify’s capabilities.

What is online inference?

In machine learning workflows, once a model has been trained, evaluated and productionized, it can be applied on input data to return predictions. This is called machine learning inference. There are two main types of inference, batch and online inference.

While batch inference can run periodically on a finite set of data, online inference is the ability to compute predictions in real-time as the input becomes available. With online inference, the observations that we produce predictions for can be infinite, as we’re using streams of data that are generated over time.

Popular examples for online inference and real-time predictions can be found in many use cases. Some examples are recommender systems, where having real-time predictions can provide users with the most relevant results, or fraud detection, where the ability to detect fraud is required as soon as it happens. Other Shopify specific use cases that benefit from online inference are product categorization and inbox classification.

When serving machine learning models for real time predictions, latency becomes much more important compared to batch jobs, as the latency of machine learning models can impact the performance of user facing services and impact the business. When considering online inference for a machine learning use case, it is important to consider: 

  • The cost
  • The use-case requirements 
  • The skillset of the team members (e.g. being able to handle running and maintaining machine learning services in production versus running them in batch jobs).

As we prepared to expand Merlin for online inference, we first interviewed our internal stakeholders to better understand their use-case requirements more in detail. Once we had these, we could design the system, its architecture and infrastructure in ways that optimize for cost and performance, as well as advise the machine learning teams on staffing and required support, based on the service-level objectives of their use-case.

Online Inference with Merlin

There are several requirements that we set out to provide with Merlin Online Inference:

  • Robust and flexible serving of machine learning models, to enable us to serve and deploy different models and machine learning libraries used at Shopify (e.g. TensorFlow, PyTorch, XGBoost).
  • Low latency, in order to optimize processing of an inference request to provide responses with minimal delay.
  • State-of-the-art features for online inference services, like rolling deployments, autoscaling, observability, automatic model updates, etc.
  • Integration with the Merlin machine learning platform, so users can utilize the different Merlin features, such as model registry and Pano online feature store.
  • Seamless and streamlined service creation, management and deployment for machine learning models.

With these requirements in mind, we aimed to build Merlin Online Inference to enable deploying and serving machine learning models for real-time predictions.

Merlin Online Inference Architecture

In our previous Merlin blog post, we described how Merlin is used for training machine learning models. With Merlin Online Inference, we can provide our data scientists and machine learning engineers with the tools to deploy and serve their machine learning models and use cases.

Every machine learning use case that requires online inference runs as its own dedicated service. These services are deployed on Shopify’s Kubernetes clusters (Google Kubernetes Engine) as any other Shopify service. Each service lives in its own Kubernetes namespace and can be configured individually, for example, to autoscale based on different parameters and metrics.

Merlin Online Inference Architecture
A high level architecture diagram of Merlin Online Inference

Each service loads its dedicated machine learning model from our model registry, Comet ML, as well as any other required artifacts. Different clients can call an inference endpoint to generate predictions in real-time. The main clients that use Merlin Online Inference services are Shopify’s core services (or any other internal service that requires real-time inference), as well as streaming pipelines on Flink for near real-time predictions. Pano, our feature store, can be used to access features in low latency during inference both from the Merlin Online Inference service or from the different clients that send requests to the service.

Each service has a monitoring dashboard with predefined metrics such as latency, requests per second, CPU, etc. This can be used to observe the health of the service, and can be further customized per service.

Each Merlin Online Inference service has two main components:

  • Serving layer: the API that enables an endpoint to return predictions from a model.
  • Deployment: how the service will be deployed in Shopify’s infrastructure.

Serving Layer

The Merlin Online Inference serving layer is the API that serves the model (or models). It exposes an endpoint for processing inputs from the clients, and returns predictions from the model. The serving layer accomplishes the following:

  1. Starts a web server for the service
  2. Loads the model and additional artifacts to memory as part of the initialization process
  3. Exposes an endpoint for the inference function that will take features as inputs and return predictions
Merlin Online Inference Serving Layer
A high level diagram of the serving layer

The serving layer is written in Python, which makes it easier for our data scientists and machine learning engineers to implement it. It’s defined in the Merlin Project of each use case. A Merlin Project is a folder in our Merlin mono-repo where the code, configuration and tests of the use case are kept. This allows the serving layer to reuse different parts of the machine learning workflow logic. The serving layer is added to the Merlin Docker image, which is created by Podman from a dedicated Dockerfile. It is then deployed in the Merlin Online Inference Kubernetes clusters.

Serving Layer Types

When analyzing the requirements of our users, we identified a need to support multiple types of serving layers. We wanted to make sure that we can abstract from our users a lot of the hassle of writing a complete API for every machine learning use case, while also enabling other more custom services.

In order to do that, we currently support two types of serving libraries with Merlin Online Inference:

  1. MLServer: An open source inference server for machine learning models built by Seldon. MLServer supports REST and gRPC interfaces out of the box as well the ability to batch requests. It supports the V2 inference protocol that standardizes communication protocols around with different inference servers and by that increases their utility and portability.
  2. FastAPI: A fast and high-performance web framework for building APIs with Python.

These two libraries enable three different methods for our users to implement their online inference serving layer, starting from no code or low code with MLServer, to a fully customizable API using FastAPI. The following table describes the differences between these approaches:

Serving Layer Type Description What to Use
No Code

Uses MLServer’s pre-built inference serving implementations. Only requires configuration changes in MLServer json files.

For this serving layer type, you have a model and all you need is to deploy it behind an end point such as Scikit-learn, XGBoost, LightGBM, ect.

Low Code

Uses MLServer’s custom serving implementation.  Requires minimal code implementation of the serving class.

Include transformation or business logic in the serving layer, or a model that uses an unsupported ML library in MLServer.

Full Custom In this case, the user gets pre-defined boilerplate code for a FastAPI serving layer which they can fully customize. The machine learning use case needs to expose additional endpoints or a requirement that is unsupported in MLServer. 

Creating the Serving Layer in Merlin

In order to abstract away much of the complexity of starting to write a serving layer from scratch, we wrote Merlin CLI, which uses input from the user to build a custom serving layer. This serving layer will be generated from a predefined cookiecutter, and will provide boilerplate code for the user to build on. 

Creating the serving layer in Merlin
Creating the serving layer in Merlin

In the image above, the user can choose the inference library that they want to implement their serving layer in, then pick the machine learning library that they will use to load and serve their model.

Example of Serving a Custom Model with MLServer

Once we’ve created the boilerplate code for the serving layer, we can start implementing the specific logic for our use-case. In the following example, we’re serving a 🤗Hugging Face 🤗 model to translate English to French with online inference.

We create a class for our model which inherits from mlserver.MLModel and implements the following methods:

  1. Load: loads the model and any additional required artifact to memory.
  2. Predict: generates predictions from the model based on a payload that the method received.

In addition, with MLServer, there’s an option to serve models as a pre-built HuggingFace runtime using only configuration files:

The above is a very basic example of how low code and no-code examples can be used with MLServer without the need to define the whole API from scratch. This allows our users to focus only on what matters most to them, which is loading the model and serving it to generate inference.

Testing the Serving Layer with Merlin Workspaces

For testing purposes, our users can take advantage of the tools we already have in Merlin. They can create Merlin Workspaces, which are dedicated environments that can be defined by code, dependencies and required resources. These dedicated environments also enable distributed computing and scalability for the machine learning tasks that run on them. They are intended to be short lived, and our users can create them, run the serving layer in them, and expose a temporary endpoint for development, debugging and stress testing. This enables fast iterations on the serving layer, as it abstracts the friction of deploying a complete service between each iteration. 

Merlin Workspaces Architecture
A high level architecture diagram of Merlin Workspaces

In the diagram above, we can see how different use cases run with their respective Merlin Workspaces on Merlin. Each use case can define their own infrastructure resources, machine learning libraries, Python packages, serving layer, etc. and run them in an isolated and scalable environment. This is where our users can iterate on their use cases while developing their model and serving layer, in the same way that they would for any other part of their machine learning workflow.

The Merlin Workspace enables our users to access the swagger page of their API. This enables them to test out their code and make sure that it works before deploying it as a service.  

Merlin Online Inference Service Deployment

Once the serving layer has been tested and validated, it’s ready to be deployed to production as a service. The deployment phase is where all of the Merlin components will be put together to form the deployed service. These components include the serving layer code, API, model, artifacts, libraries and requirements, image container that was created in the CI/CD pipelines and any additional configuration libraries and package requirements to form the Merlin Service.

Merlin Service Components
The components that form a Merlin Service

Creating a Merlin Service

Similarly to the serving layer creation, we leverage the same Merlin CLI to create a Merlin Service in Shopify’s ecosystem. When generating the service, we utilize as much of Shopify’s service infrastructure to deploy Merlin Services as any other Shopify service. This ensures that Merlin Services can scale to Shopify’s capabilities, as well as allowing them to integrate and benefit from the existing tooling that we have in place.

When a Merlin Service is created, it is registered to Shopify’s Services DB. Services DB is an internal tool used to track all production services running at Shopify. It supports creating new services and provides comprehensive views and tools to help development teams maintain and operate their services with high quality.

Merlin Service Deployment Process
The deployment process of a Merlin Service

Once the Merlin Service is created, the entire build and deployment workflow is automatically generated for it. When a user merges new changes to their repository, Shopify’s Buildkite pipeline is automatically triggered and among other actions, builds the image for the service. In the next step of the workflow, that image is then deployed on Shopify’s Kubernetes clusters using our internal Shipit pipelines.

Merlin Service Configuration

Each Merlin Service is created with two configuration files, one for a production environment and one for a staging environment. These include settings for the resources, parameters and related artifacts of the service. Having a different configuration for each environment allows the user to define a different set of resources and parameters per environment. This can help optimize the resources used by the service, which in turn can reduce infrastructure cost. In addition, our users can leverage the staging environment of Merlin Services to test new model versions or configuration settings before deploying them to production.

The following is an example of a Merlin Service configuration file which contains different parameters such as the project name, metadata, artifact paths, CPU, memory, GPUs, autoscaling configuration, etc.: 

This example shows a configuration file for the classification_model_example project which uses MLServer to serve its model. It uses at least 3 replicas that can scale to 10 replicas, and each one has 2 CPUs, 32 GB of memory and nvidia-tesla-t4 GPU. In addition, when the service starts, it will load a model from our model registry.

What’s Next for Merlin Online Inference

And there you have it, our path for deploying and serving machine learning models for real-time predictions with Merlin. To sum it up, our users start by creating a Merlin Project that contains everything required for their machine learning use case. An image is automatically built for their project which is then used in the training pipeline, resulting in a trained model that is saved to our model registry. If the use case requires online inference, Merlin can be used to create a serving layer and a dedicated Merlin Service. Once the service is deployed to production, Merlin users can continue iterating on their models and deploy new versions as they become available.

Merlin Online Inference User Journey
A high level overview of the user journey

As we onboard new online inference use cases to Merlin we plan to tackle additional areas in order to enable:

  1. Ensemble models / inference graphs: while we have the ability to deploy and serve machine learning models, we are aware that in some cases we will need to combine multiple models in the inference process. We are looking into leveraging some open source tools that can help us achieve that with Ray Serve, Seldon Core or BentoML.
  2. Monitoring for online inference: with our current abilities, it is possible to create workflows, metrics and dashboards to detect different drifts. However, at the moment this step is completely manual and requires a lot of effort from our users. We want to enable a platform-based monitoring solution that will seamlessly integrate with the rest of Merlin.
  3. Continuous training: as the amount of input and data which are used for predictions increases, some of our use cases will need to start training their models more frequently and will require an automated and easier deployment process. We are looking into automating more of the service management process and lifecycle of our online inference models.

While online inference is still a new part of Merlin, it’s already empowering our users and data science teams with the low latency, scalability and fast iterations that we had in mind when designing it. We're excited to keep building the platform and onboarding new use cases, so we can continue to unlock new possibilities to keep merchants on the cutting edge of innovation. With Merlin we help enable the millions of businesses powered by Shopify.

Isaac Vidas is a tech lead on the ML Platform team, focusing on designing and building Merlin, Shopify’s machine learning platform. Connect with Isaac on LinkedIn.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

The Case Against Monkey Patching, From a Rails Core Team Member

The Case Against Monkey Patching, From a Rails Core Team Member

Monkey patching is considered one of the more powerful features of the Ruby programming language. However, by the end of this post I’m hoping to convince you that they should be used sparingly, if at all, because they are brittle, dangerous, and often unnecessary. I’ll also share tips on how to use them as safely as possible in the rare cases where you do need to monkey patch.

Continue reading

Making Your React Native Gestures Feel Natural

Making Your React Native Gestures Feel Natural

When working with draggable elements in React Native mobile apps, I’ve learned that there are some simple ways to help gestures and animations feel better and more natural.

Let’s look at the Shop app’s Sheet component as an example:

A gif showing a sample Shop app store that shows the Sheet Component being dragged and close on the screen
The Sheet component being dragged open and closed by the user’s gestures

This component can be dragged by the user. Once the drag completes, it either animates back to the open position or down to the bottom of the screen to close.

To implement this, we can start by using a gesture handler which sets yPosition to move the sheet with the user’s finger:

When the drag ends and the user lifts their finger, we animate to either the closed or open position based on the finger's position, as implemented in onEnd above. This works but there are some issues.

Problem 1: Speed of Drag

If we drag down quickly from the top, shouldn’t it close? We only take the position into account when determining whether it opens or closes. Shouldn’t we also take the speed of the drag when it ends as well?

A gif showing a sample Shop app store that shows the Sheet Component screen being flicked by a user but not closing“ style=
The user tries to drag the Sheet closed by quickly flicking it down, but it does not close

In this example above, the user may feel frustrated that they are flicking the sheet down hard, yet it won’t close.

Problem 2: Position Animation

No matter what the distance is from the end position, the animation after the drag ends always takes 600 ms. If it’s closer, shouldn’t it take less time to get there? If you drag it with more force before letting go, shouldn’t that momentum make it go to the destination faster?

A gif showing a sample Shop app store that shows the Sheet Component being dragged to the open position on the screen
The Sheet takes the same amount of time to move to the open position regardless of the distance it has to move

Springs and Velocity

To address problem number one, we use event.velocityY from onEnd, and add it to the position to determine whether to close or open. We have a multiplier as well to adjust how much we want velocity to count towards where the sheet ends up.

For problem number two, we use a spring animation rather than a fixed duration one! Spring animations don’t necessarily need to have an elastic bounce back. withSpring takes into account distance and velocity to animate in a physically realistic way.

A gif showing a sample Shop app store that shows the Sheet Component being dragged and close on the screen
The Sheet can now be quickly flicked open or closed easily. It animates to the open or closed position in a way that takes distance and drag velocity into account.

In the example above, it’s now easy to flick it quickly closed or open, and the animations to the open or closed position behave in a more realistic and natural way by taking distance and drag velocity into account.

Elasticity and Resistance

The next time you drag down a photo or story to minimize or close it, try doing it slowly and watch what’s happening. Is the element that’s being dragged matching your finger position exactly? Or is it moving slower than your finger?

When the dragged element moves slower than your finger, it can create a feeling of elasticity, as if you’re pulling against a rubber band that resists the drag.

In the Sheet example below, what if the user drags it up instead of down while the sheet is already open?

A gif showing a sample Shop app store that shows the Sheet Component being dragged up the screen
The Sheet stays directly under the user’s finger as it’s dragged further up while open

Notice that the Sheet matches the finger position perfectly as the finger moves up. As a result, it feels very easy to continue dragging it up. However, dragging it up further has no functionality since the Sheet is already open. To teach the user that it can’t be dragged up further, we can add a feeling of resistance to the drag. We can do so by dividing the distance dragged so the element only moves a fraction of the distance of the finger:

alt text
Instead of moving directly under the user’s finger, the sheet is dragged up by a fraction of the distance the finger has moved, giving a sense of resistance to the drag gesture.

The user will now feel that the Sheet is resisting being dragged up further, intuitively teaching them more about how the UI works.

Make Gestures Better for Everyone

This is the final gesture handler with all the above techniques included:

As user interface developers, we have an amazing opportunity to delight people and make their experiences better.

If we care about and nail these details, they’ll combine together to form a holistic user experience that feels good to touch and interact with.

I hope that you have as much fun working on gestures like I do!

The above videos were taken with the simulator in order to show the simulated touches. For testing the gestures yourself however, I recommend trying the above examples by touching a real device.

Andrew Lo is a Staff Front End Developer on the Shop's Design Systems team. He works remotely from Toronto, Canada.

We all get shit done, ship fast, and learn. We operate on low process and high trust, and trade on impact. You have to care deeply about what you’re doing, and commit to continuously developing your craft, to keep pace here. If you’re seeking hypergrowth, can solve complex problems, and can thrive on change (and a bit of chaos), you’ve found the right place. Visit our Engineering career page to find your role.

Continue reading

Monte Carlo Simulations: Separating Signal from Noise in Sampled Success Metrics

Monte Carlo Simulations: Separating Signal from Noise in Sampled Success Metrics

Usually, when you set success metrics you’re able to directly measure the value of interest in its entirety. For example, Shopify can measure Gross Merchandise Volume (GMV) with precision because we can query our databases for every order we process. However, sometimes the information that tells you whether you’re having an impact isn’t available, or is too expensive or time consuming to collect. In these cases, you'll need to rely on a sampled success metric.

In a one-shot experiment, you can estimate the sample size you’ll need to achieve a given confidence interval. However, success metrics are generally tracked over time, and you'll want to evaluate each data point in the context of the trend, not in isolation. Our confidence in our impact on the metric is cumulative. So, how do you extract the success signal from sampling noise? That's where a Monte Carlo Simulation comes in.

A Monte Carlo simulation can be used to understand the variability of outcomes in response to variable inputs. Below, we’ll detail how to use a Monte Carlo simulation to identify the data points you need for a trusted sampled success metric. We’ll walkthrough an example and share how to implement this in Python and pandas so you can do it yourself.

What is a Monte Carlo Simulation? 

A Monte Carlo simulation can be used to generate a bunch of random inputs based on real world assumptions. It does this by feeding these inputs through a function that approximates the real world situation of interest, and observing the attributes of the output to understand the likelihood of possible outcomes given reasonable scenarios.

In the context of a sampled success metric, you can use the simulation to understand the tradeoff between:

  • Your sample size
  • Your ability to extract trends in the underlying population metric from random noise

These results can then be used to explain complex statistical concepts to your non-technical stakeholders. How? You'll be able to simply explain the percentage of certainty your sample size yields, against the cost of collecting more data.

Using a Monte Carlo Simulation to Estimate Metric Variability 

To show you how to use a Monte Carlo simulation for a sampled success metric, we'll turn to the Shopify App Store as an example. The Shopify App Store is a marketplace where our merchants can find apps and plugins to customize their store. We have over 8,000 apps solving a range of problems. We set a high standard for app quality, with over 200 minimum requirements focused on security, functionality, and ease of use. Each app needs to meet these requirements in order to be listed, and we have various manual and automated app review processes to ensure these requirements are met. 

We want to continuously evaluate how our review processes are improving the quality of our app store. At the highest level, the question we want to answer is, “How good are our apps?”. This can be represented quantitatively as, “How many requirements does the average app violate?”. With thousands of apps in our app store, we can’t check every app, every day. But we can extrapolate from a sample.

By auditing randomly sampled apps each month, we can estimate a metric that tells us how many requirement violations merchants experience with the average installed app—we call this metric the shop issue rate. We can then measure against this metric each month to see whether our various app review processes are having an impact on improving the quality of our apps. This is our sampled success metric. 

With mock data and parameters, we’ll show you how we can use a Monte Carlo simulation to identify how many apps we need to audit each month to have confidence in our sampled success metric. We'll then repeatedly simulate auditing randomly selected apps, varying the following parameters:

  • Sample size
  • Underlying trend in issue rate

To understand the sensitivity of our success metric to relevant parameters, we need to conduct five steps:

  1. Establish our simulation metrics
  2. Define the distribution we’re going to draw our issue count from 
  3. Run a simulation for a single set of parameters
  4. Run multiple simulations for a single set of parameters
  5. Run multiple simulations across multiple parameters

To use a Monte Carlo simulation, you'll need to have a success metric in mind already. While it’s ideal if you have some idea of its current value and the distribution it’s drawn from, the whole point of the method is to see what range of outcomes emerges from different plausible scenarios. So, don’t worry if you don’t have any initial samples to start with. 

Step 1: Establishing Our Simulation Metrics

We start by establishing simulation metrics. These are different from our success metric as they describe the variability of our sampled success metric. Metrics on metrics!

For our example, we'll want to check on this metric on a monthly basis to understand whether our approach is working. So, to establish our simulation metric, we ask ourselves, “Assuming we decrease our shop issue rate in the population by a given amount per month, in how many months would our metric decrease?”. Let’s call this bespoke metric: 1 month decreases observed or 1mDO.

We can also ask this question over longer time periods, like two consecutive months (2mDO) or a full quarter (1qDO). As we make plans on an annual basis, we’ll want to simulate these metrics for one year into the future. 

On top of our simulation metric, we’ll also want to measure the mean absolute percentage error (MAPE). MAPE will help us identify the percentage by which the shop issue rate departs from the true underlying distribution each month. 

Now, with our simulation metrics established, we need to define what distribution we're going to be pulling from. 

Step 2: Defining Our Sampling Distribution

For the purpose of our example, let’s say we’re going to generate a year’s worth of random app audits, assuming a given monthly decrease in the population shop issue rate (our success metric). We’ll want to compare the sampled shop issue rate that our Monte Carlo simulation generates to that of the population that generated it.

We generate our Monte Carlo inputs by drawing from a random distribution. For our example, we've identified that the number of issues an app has is well represented by the Poisson distribution which models the sum of a collection of independent Bernoulli trials (where the evaluation of each requirement can be considered as an individual trial). However, your measure of interest might match another, like the normal distribution. You can find more information about fitting the right distribution to your data here.

The Poisson distribution has only one parameter, λ (lambda), which ends up being both the mean and the variance of the population. For a normal distribution, you’ll need to specify both the population mean and the variance.

Hopefully you already have some sample data you can use to estimate these parameters. If not, the code we’ll work through below will allow you to test what happens under different assumptions. 

Step 3: Running Our Simulation with One Set of Parameter Values

Remember, the goal is to quantify how much the sample mean will differ from the underlying population mean given a set of realistic assumptions, using your bespoke simulation metrics. 

We know that one of the parameters we need to set is Poisson’s λ. We also assume that we’re going to have a real impact on our metric every month. We’ll want to specify this as a percentage by which we’re going to decrease the λ (or mean issue count) each month.

Finally, we need to set how many random audits we’re going to conduct (aka our sample size). As the sample size goes up, so does the cost of collection. This is a really important number for stakeholders. We can use our results to help communicate the tradeoff between certainty of the metric versus the cost of collecting the data.

Now, we’re going to write the building block function that generates a realistic sampled time series given some assumptions about the parameters of the distribution of app issues. For example, we might start with the following assumptions:

  1. Our population mean is 10 issues per install. This is our λ parameter.
  2. Our shop issue rate decreases 5 percent per month. This is how much of an impact we expect our app review processes to have.

Note that these assumptions could be wrong, but the goal is not to get your assumptions right. We’re going to try lots of combinations of assumptions in order to understand how our simulation metrics respond across reasonable ranges of input parameters. 

For our first simulation, we’ll start with a function that generates a time series of issue counts, drawn from a distribution of apps where the population issue rate is in fact decreasing by a given percentage per month. For this simulation, we’ll draw from 100 sample time series. This sample size will provide us with a fairly stable estimate of our simulation metrics, without taking too long to run. Below is the output of the simulation:

This function returns a sample dataset of n=audits_per_period apps over m=periods months, where the number of issues for each app is drawn from a Poisson distribution. In the chart below, you can see how the sampled shop issue rate varies around the true underlying number. We can see 10 mean issues decreasing 5 percent every month.

A Monte Carlo Simulation
Our first Monte Carlo simulation with one set of parameter values

Now that we’ve run our first simulation, we can calculate our variability metrics MAPE and 1mDO. The below code block will calculate our variability metrics for us:

This code will tell us how many months it will take before we actually see a decrease in our shop issue rate. Interpreted another way, "How long do we need to wait to act on this data?".

In this first simulation, we found that the MAPE was 4.3 percent. In other words, the simulated shop issue rate differed from the population mean by 4.3 percent on average. Our 1MDO was 72 percent, meaning our sampled metric decreased in 72 percent of months. These results aren’t great, but was it a fluke? We’ll want to run a few more simulations to identify confidence in your simulation metrics.

Step 4: Running Multiple Simulations with the Same Parameter Values 

The code below runs our generate_time_series function n=iterations times with the given parameters, and returns a DataFrame of our simulation metrics for each iteration. So, if we run this with 50 iterations, we'll get back 50 time series, each with 100 sampled audits per month. By averaging across iterations, we can find the averages of our simulation metrics.

Now, the number of simulations to run depends on your use case, but 50 is a good place to start. If you’re simulating a manufacturing process where millimeter precision is important, you’ll want to run hundreds or thousands of iterations. These iterations are cheap to run, so increasing the iteration count to improve your precision just means they’ll take a little while longer to complete.

Multiple Monte Carlo Simulations
Four sample Monte Carlo simulations with the same parameter values

For our example, 50 sampled time series enables us with enough confidence that these metrics represent the true variability of the shop issue rate. That is, as long as our real world inputs are within the range of our assumptions. 

Step 5: Running Simulations Across Combinations of Parameter Values

Now that we’re able to get representative certainty for our metrics for any set of inputs, we can run simulations across various combinations of assumptions. This will help us understand how our variability metrics respond to changes in inputs. This approach is analogous to the grid search approach to hyperparameter tuning in machine learning. Remember, for our app store example, we want to identify the impact of our review processes on the metric for both the monthly percentage decrease and monthly sample size.

We'll use the code below to specify a reasonable range of values for the monthly impact on our success metric, and some possible sample sizes. We'll then run the run_simulation function across those ranges. This code is designed to allow us to search across any dimension. For example, we could replace the monthly decrease parameter with the initial mean issue count. This allows us to understand the sensitivity of our metrics across more than two dimensions.

The simulation will produce a range of outcomes. Looking at our results below, we can tell our stakeholders that if we start at 10 average issues per audit, run 100 random audits per month, and decrease the underlying issue rate by 5 percent each month, we should see monthly decreases in our success metric 83 percent of the time. Over two months, we can expect to see a decrease 97 percent of the time. 

Monte Carlo Simulation Outcomes
Our Monte Carlo simulation outputs

With our simulations, we're able to clearly express the uncertainty tradeoff in terms that our stakeholders can understand and implement. For example, we can look to our results and communicate that an additional 50 audits per month would yield quantifiable improvements in certainty. This insight can enable our stakeholders to make an informed decision about whether that certainty is worth the additional expense.

And there we have it! The next time you're looking to separate signal from noise in your sampled success metric, try using a Monte Carlo simulation. This fundamental guide just scratches the surface of this complex problem, but it's a great starting point and I hope you turn to it in the future.

Tom is a data scientist working on systems to improve app quality at Shopify. In his career, he tried product management, operations and sales before figuring out that SQL is his love language. He lives in Brooklyn with his wife and enjoys running, cycling and writing code.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

React Native Skia: A Year in Review and a Look Ahead

React Native Skia: A Year in Review and a Look Ahead

With the latest advances in the React Native architecture, allowing direct communication between the JavaScript and native sides, we saw an opportunity to provide an integration for Skia, arguably the most versatile 2D graphic engine. We wondered: How should these two pieces of technology play together? Last December, we published the first alpha release of React Native Skia and eighty nine more releases over the past twelve months. We went from offering a model that decently fit React Native and Skia together to a fully-tailored declarative architecture that’s highly performant. We’re going on what kept Christian Falch, Colin Gray, and I busy and a look at what's ahead for the library.

One Renderer, Three Platforms (and Counting... )

React Native Skia relies on a custom React Renderer to express Skia drawings declaratively. This allows us to use the same renderer on iOS and Android, the Web, and even Node.js. Because this renderer has no coupling with DOM nor Native API, it provides a straightforward path for the library to be integrated onto new platforms where React is available and provided that the Skia host API is available as well.

A gif showing a breath animation on each different platform: iOS, Android, and web
The React renderer runs on iOS, Android and Web.
Because the renderer is not coupled with DOM or Native APIs, we can use it for testing on Node.js.

On React Native, the Skia host API is available via the JavaScript Interface (JSI), exposing the C++ Skia API to JavaScript. On the Web, the Skia API is available via CanvasKit, a WebAssembly build of Skia. We liked the CanvasKit API from the get-go: the Skia team did a great job of conciseness and completeness with this API. It is also based on the Flutter drawing API, showing great relevance to our use-cases. We immediately decided to make our host API fully compatible with it. An interesting side-effect of this compatibility is that we could use our renderer on the Web immediately; in fact, the graphic motions we built for the original project announcement were written using React Native Skia itself via Remotion, a tool to make videos in React.

A screenshot of a React Native Skia video tutorial showing the code to render the word hello in rainbow colours.
Thanks to Remotion, React Native Skia video tutorials are renderer using React Native Skia.

After the first release we received a great response from the community and we had at heart to ship the library to as many people as possible. The main tool to have Web like development and release agility for React Native is Expo. We coordinated with the team at Expo to have the library working out of the box with Expo dev clients and integrate it as part of the Expo GO client. Part of this integration with Expo, it was important to ship full React Native Web support.

A gif showing an Expo screen for universal-skia-demo code on the left hand side and the corresponding code executing on the right
Thanks to Expo, all you need to get started with React Native Skia is a Web browser

On each platform, different GPU APIs are available. We integrated with Metal on iOS, and OpenGL on Android and the Web. Finally, we found our original declarative API to be quite productive; it closely follows the Skia imperative API and augments it with a couple of sensible concepts. We added a paint (an object describing the colors and effects applied to a drawing) to the original Skia drawing context to enable cascading effects such as opacity and some utilities that would feel familiar to React Native developers. The React Native transform syntax can be used directly instead of matrices, and images can be resized in a way that should also feel familiar.

The Road to UI Thread Rendering

While the original alpha release was able to run some compelling demos, we quickly identified two significant bottlenecks:

  1. Using the JavaScript thread. Originally we only ran parts of the drawings on the JS thread to collect the Skia drawing commands to be executed on another thread. But this dependency on the JS thread was preventing us from displaying the first canvas frame as fast as possible. In scenarios where you have a screen transition displaying many elements, including many Skia canvases, locking the JavaScript thread for each canvas creates a short delay that’s noticeable on low-end devices.
  2. Too many JSI allocations. We quickly came up with use cases where a drawing would contain a couple of thousand components. This means thousands of JSI object allocations and invocations. At this scale, the slight overhead of using JSI ( instead of using C++ directly) adds up to something severely noticeable.

Based on this analysis, it was clear that we needed a model to

  1. execute drawings entirely on the UI thread
  2. not rely on JSI for executing the drawing.

That led us to design an API called Skia DOM. While we couldn't come up with a cool name for it, what's inside is super cool.

The Skia DOM API allows us to express any Skia drawing declaratively. Skia DOM is platform agnostic. In fact, we use a C++ implementation of that API on iOS and Android and a JS implementation on React Native Web. This API is also framework agnostic. It doesn’t rely on concepts from React, making it quite versatile, especially for animations.

Behind the scenes, Skia DOM keeps a source of truth of the drawing. Any change to the drawing recomputes the necessary parts of the drawing state and only these necessary parts, allowing for incredible performance levels.

The Skia DOM API allows to execute Skia drawings outside the JavaScript thread.
The Skia DOM API allows to execute Skia drawings outside the JavaScript thread.
  1. The React reconciler builds the SkiaDOM, a declarative representation of a Skia drawing via JSI.
  2. Because the SkiaDOM has no dependencies with the JavaScript thread, it can always draw on the UI thread and the first time to frame is very fast.
  3. Another benefit of the SkiaDOM API is that it only computes things once. It can receive updates from the JS or animation thread. An update will recompute all the necessary parts of a drawing.
  4. The Skia API is directly available via a thin JSI layer. This is useful to build objects such as paths or shaders.

Interestingly enough, when we started with this project, we took a lot of inspiration from existing projects in the Skia ecosystem such as CanvasKit. With Skia DOM, we have created a declarative model for Skia drawing that can be extremely useful for projects outside the React ecosystem.

The Magic Of Open Source

For React Native Skia to be a healthy open-source project, we focused on extensibility and quality assurance. React Native Skia provides extension points allowing developers to build their own libraries on top of it. And the community is already taking advantage of it. Two projects, in particular, have caught our attention.

The first one extends React Native Skia with the Skottie module. Skottie is a Lottie player implemented in Skia. While we don’t ship the Skottie module part of React Native Skia, we made sure that library developers can use our C++ API to extend it with any module they wish. That means we can keep the size of the core library small while letting developers opt-in for the extra modules they wish to use.

A gif of 5 different coloured square lego blocks sliding around each other
Skottie is an implementation of Lottie in Skia

Of all our great open-source partners, none has taken the library on such a crazy ride as the Margelo agency did. The React Native Vision Camera is a project that allows React Native developers to write JavaScript plugins to process camera frames on the UI frame. The team has worked hard to enable Skia image filters and effects to be applied in real time onto camera frames.

A gif showing how the Skia shader applies it’s image filters to a smartphone camera.
A Skia shader is applied on every camera frame

React Native Skia is written in TypeScript and C++. As part of the project quality assurance, we heavily rely on static code analysis for both languages. We also built an end-to-end test suite that draws each example on iOS, Android, and Web. Then we check that the drawings are correct and identical on each platform. We can also use it to test for Skia code executions where the result is not necessarily a drawing but can be a Skia object such as a path for instance. By comparing results across platforms, we gained tons of insights on Skia (for instance, we realized how each platform displays fonts slightly differently). And while the idea of building reliable end-to-end testing in React Native can seem daunting, we worked our way up (by starting from node testing only and then incrementally adding iOS and Android) to a test environment that is really fun and has substantially helped improve the quality of the library.

A gif showing a terminal window running tests. On the right hand side of the image is a phone showing the tests running.
Tests are run on iOS, Android, and Web and images are checked for correctness

We also worked on documentation. Docusaurus appears to be the gold standard for documenting open-source project and it hasn’t disappointed. Thanks to Shiki Twoslash, we could add TypeScript compiler checking to our documentation examples. Allowing us to statically check that all of our documentation examples compile against the current version, and that the example results are part of the test suite.

A screenshot of the indices page on React Native Skia Docs
Thanks to Docusaurus, documentation examples are also checked for correctness.

A Look Ahead to 2023

Now that we have a robust model for UI thread rendering with Skia DOM, we’re looking to use it for animations. This means butter smooth animation even when the JavaScript thread is not available. We have already prototyped Skia animations via JavaScript worklets and we are looking to deliver this feature to the community. For animations, we are also looking at UI-thread-level integration between Skia and Reanimated. As mentioned in a Reanimated/Skia tutorial, we believe that a deep integration between these two libraries is key.

We’re also looking to provide advanced text layouts using the Skia paragraph module. Advanced text layouts will enable a new range of use cases such as advanced text animations which are currently not available in React Native as well as having complex text layouts available alongside existing Skia drawings.

A gif showing code on the left hand side and the result on the left which is a paragraph automatically adjusting upon resize
Sneak peek at the Skia paragraph module in React Native

Can Skia take your React Native App to the next level in 2023? Let us know your thoughts on the project discussion board, and until then: Happy Hacking!

William Candillon is the host of the “Can it be done in React Native?” YouTube series, where he explores advanced user-experiences and animations in the perspective of React Native development. While working on this series, William partnered with Christian to build the next-generation of React Native UIs using Skia.

We all get shit done, ship fast, and learn. We operate on low process and high trust, and trade on impact. You have to care deeply about what you’re doing, and commit to continuously developing your craft, to keep pace here. If you’re seeking hypergrowth, can solve complex problems, and can thrive on change (and a bit of chaos), you’ve found the right place. Visit our Engineering careers page to find your role.

Continue reading

3 (More) Tips for Optimizing Apache Flink Applications

3 (More) Tips for Optimizing Apache Flink Applications

By Kevin Lam and Rafael Aguiar

At Shopify, we’ve adopted Apache Flink as a standard stateful streaming engine that powers a variety of use cases. Earlier this year, we shared our tips for optimizing large stateful Flink applications. Below we’ll walk you through 3 more best practices.

1. Set the Right Parallelism

A Flink application consists of multiple tasks, including transformations (operators), data sources, and sinks. These tasks are split into several parallel instances for execution and data processing. 

Parallelism refers to the parallel instances of a task and is a mechanism that enables you to scale in or out. It's one of the main contributing factors to application performance. Increasing parallelism allows an application to leverage more task slots, which can increase the overall throughput and performance. 

Application parallelism can be configured in a few different ways, including:

  • Operator level
  • Execution environment level
  • Client level
  • System level

The configuration choice really depends on the specifics of your Flink application. For instance, if some operators in your application are known to be a bottleneck, you may want to only increase the parallelism for that bottleneck. 

We recommend starting with a single execution environment level parallelism value and increasing it if needed. This is a good starting point as task slot sharing allows for better resource utilization. When I/O intensive subtasks block, non I/O subtasks can make use of the task manager resources. 

A good rule to follow when identifying parallelism is:

The number of task managers multiplied by the number of tasks slots in each task manager must be equal (or slightly higher) to the highest parallelism value 

For example, when using parallelism of 100 (either defined as a default execution environment level or at a specific operator level), you would need to run 25 task managers, assuming each task manager has four slots: 25 x 4 = 100.

2. Avoid Sink Bottlenecks 

Data pipelines usually have one or more data sinks (destinations like Bigtable, Apache Kafka, and so on) which can sometimes become bottlenecks in your Flink application. For example, if your target Bigtable instance has high CPU utilization, it may start to affect your Flink application due to Flink being unable to keep up with the write traffic. You may not see any exceptions, but decreased throughput all the way to your sources. You’ll also see backpressure in the Flink UI.

When sinks are the bottleneck, the backpressure will propagate to all of its upstream dependencies, which could be your entire pipeline. You want to make sure that your sinks are never the bottleneck! 

In cases where latency can be sacrificed a little, it’s useful to combat bottlenecks by first batch writing to the sink in favor of higher throughput. A batch write request ​​is the process of collecting multiple events as a bundle and submitting those to the sink at once, rather than submitting one event at a time. Batch writes will often lead to better compression, lower network usage, and smaller CPU hit on the sinks. See Kafka’s batch.size property, and Bigtable’s bulk mutations for examples. 

You’ll also want to check and fix any data skew. In the same Bigtable example, you may have heavily skewed keys which will affect a few of Bigtable’s hottest nodes. Flink uses keyed streams to scale out to nodes. The concept involves the events of a stream being partitioned according to a specific key. Flink then processes different partitions on different nodes. 

KeyBy is frequently used to re-key a DataStream in order to perform aggregation or a join. It’s very easy to use, but it can cause a lot of problems if the chosen key isn’t properly distributed. For example, at Shopify, if we were to choose a shop ID as our key, it wouldn’t be ideal. A shop ID is the identifier of a single merchant shop on our platform. Different shops have very different traffic, meaning some Flink task managers would be busy processing data, while the others would stay idle. This could easily lead to out-of-memory exceptions and other failures. Low cardinality IDs (< 100) are also problematic because it’s hard to distribute them properly amongst the task managers. 

But what if you absolutely need to use a less than ideal key? Well, you can apply a bucketing technique:

  • Choose a maximum number (start with a number smaller than or equal to the operator parallelism)
  • Randomly generate a value between 0 and the max number
  • Append it to your key before keyBy

By applying a bucketing technique, your processing logic is better distributed (up to the maximum number of additional buckets per key). However, you need to come up with a way to combine the results in the end. For instance, if after processing all your buckets you find the data volume is significantly reduced, you can keyBy the stream by your original “less than ideal” key without creating problematic data skew. Another approach could be to combine your results at query time, if your query engine supports it. 

3. Use HybridSource to Combine Heterogeneous Sources 

Let’s say you need to abstract several heterogeneous data sources into one, with some ordering. For example, at Shopify a large number of our Flink applications read and write to Kafka. In order to save costs associated with storage, we enforce per-topic retention policies on all our Kafka topics. This means that after a certain period of time has elapsed, data is expired and removed from the Kafka topics. Since users may still care about this data after it’s expired, we support configuring Kafka topics to be archived. When a topic is archived, all Kafka data for that topic are copied to a cloud object storage for long-term storage. This ensures it’s not lost when the retention period elapses. 

Now, what do we do if we need our Flink application to read all the data associated with a topic configured to be archived, for all time? Well, we could create two sources—one source for reading from the cloud storage archives, and one source for reading from the real-time Kafka topic. But this creates complexity. By doing this, our application would be reading from two points in event time simultaneously, from two different sources. On top of this, if we care about processing things in order, our Flink application has to explicitly implement application logic which handles that properly. 

If you find yourself in a similar situation, don’t worry there’s a better way! You can use HybridSource to make the archive and real-time data look like one logical source. Using HybridSource, you can provide your users with a single source that first reads from the cloud storage archives for a topic, and then when the archives are exhausted, switches over automatically to the real-time Kafka topic. The application developer only sees a single logical DataStream and they don’t have to think about any of the underlying machinery. They simply get to read the entire history of data. 

Using HybridSource to read cloud object storage data also means you can leverage a higher number of input partitions to increase read throughput. While one of our Kafka topics might be partitioned across tens or hundreds of partitions to support enough throughput for live data, our object storage datasets are typically partitioned across thousands of partitions per split (e.g. day) to accommodate for vast amounts of historical data. The superior object storage partitioning, when combined with enough task managers, will allow Flink to blaze through the historical data, dramatically reducing the backfill time when compared to reading the same amount of data straight from an inferiorly partitioned Kafka topic.

Here’s what creating a DataStream using our HybridSource powered KafkaBackfillSource looks like in Scala:

In the code snippet, the KafkaBackfillSource abstracts away the existence of the archive (which is inferred from the Kafka topic and cluster), so that the developer can think of everything as a single DataStream.

HybridSource is a very powerful construct and should definitely be considered if you need your Flink application to read several heterogeneous data sources in an ordered format. 

And there you go! 3 more tips for optimizing large stateful Flink applications. We hope you enjoyed our key learnings and that they help you out when implementing your own Flink applications. If you’re looking for more tips and haven’t read our first blog, make sure to check them out here

Kevin Lam works on the Streaming Capabilities team under Production Engineering. He's focused on making stateful stream processing powerful and easy at Shopify. In his spare time he enjoys playing musical instruments, and trying out new recipes in the kitchen.

Rafael Aguiar is a Senior Data Engineer on the Streaming Capabilities team. He is interested in distributed systems and all-things large scale analytics. When he is not baking some homemade pizza he is probably lost outdoors. Follow him on Linkedin.

Interested in tackling the complex problems of commerce and helping us scale our data platform? Join our team.

Continue reading

Planning in Bets: Risk Mitigation at Scale

Planning in Bets: Risk Mitigation at Scale

What do you do with a finite amount of time to deal with an infinite number of things that can go wrong? This post breaks down a high-level risk mitigation process into four questions that can be applied to nearly any scenario in order to help you make the best use of your time and resources available.

Continue reading

Using Server Sent Events to Simplify Real-time Streaming at Scale

Using Server Sent Events to Simplify Real-time Streaming at Scale

When building any kind of real-time data application, trying to figure out how to send messages from the server to the client (or vice versa) is a big part of the equation. Over the years, various communication models have popped up to handle server-to-client communication, including Server Sent Events (SSE). 

SSE is a unidirectional server push technology that enables a web client to receive automatic updates from a server via an HTTP connection. With SSE data delivery is quick and simple because there’s no periodic polling, so there’s no need to temporarily stage data.

This was a perfect addition to a real-time data visualization product Shopify ships every year—our Black Friday Cyber Monday (BFCM) Live Map. 

Our 2021 Live Map system was complex and used a polling communication model that wasn’t well suited. While this system had 100 percent uptime, it wasn't without its bottlenecks. We knew we could improve performance and data latency.

Below, we’ll walk through how we implemented an SSE server to simplify our BFCM Live Map architecture and improve data latency. We’ll discuss choosing the right communication model for your use case, the benefits of SSE, and code examples for how to implement a scalable SSE server that’s load-balanced with Nginx in Golang.  

Choosing a Real-time Communication Model

First, let’s discuss choosing how to send messages. When it comes to real-time data streaming, there are three communication models:

  1. Push: This is the most real-time model. The client opens a connection to the server and that connection remains open. The server pushes messages and the client waits for those messages. The server manages a registry of connected clients to push data to. The scalability is directly related to the scalability of this registry.
  2. Polling: The client makes a request to the server and gets a response immediately, whether there's a message or not. This model can waste bandwidth and resources when there are no new messages. While this model is the easiest to implement, it doesn’t scale well. 
  3. Long polling: This is a combination of the two models above. The client makes a request to the server, but the connection is kept open until a response with data is returned. Once a response with new data is returned, the connection is closed. 

No model is better than the other, it really depends on the use case. 

Our use case is the Shopify BFCM Live Map, a web user interface that processes and visualizes real-time sales made by millions of Shopify merchants over the BFCM weekend. The data we’re visualizing includes:

  • Total sales per minute 
  • Total number of orders per minute 
  • Total carbon offset per minute 
  • Total shipping distance per minute 
  • Total number of unique shoppers per minute 
  • A list of latest shipping orders
  • Trending products
Shopify BFCM Live Map 2022 Frontend
Shopify’s 2022 BFCM Live Map frontend

BFCM is the biggest data moment of the year for Shopify, so streaming real-time data to the Live Map is a complicated feat. Our platform is handling millions of orders from our merchants. To put that scale into perspective, during BFCM 2021 we saw 323 billion rows of data ingested by our ingestion service. 

For the BFCM Live Map to be successful, it requires a scalable and reliable pipeline that provides accurate, real-time data in seconds. A crucial part of that pipeline is our server-to-client communication model. We need something that can handle both the volume of data being delivered, and the load of thousands of people concurrently connecting to the server. And it needs to do all of this quickly.

Our 2021 BFCM Live Map delivered data to a presentation layer via WebSocket. The presentation layer then deposited data in a mailbox system for the web client to periodically poll, taking (at minimum) 10 seconds. In practice, this worked but the data had to travel a long path of components to be delivered to the client.

Data was provided by a multi-component backend system consisting of a Golang based application (Cricket) using a Redis server and a MySQL database. The Live Map’s data pipeline consisted of a multi-region, multi-job Apache Flink based application. Flink processed source data from Apache Kafka topics and Google Cloud Storage (GCS) parquet-file enrichment data to produce into other Kafka topics for Cricket to consume.

Shopify BFCM 2021 Backend Architecture
Shopify’s 2021 BFCM globe backend architecture

While this got the job done, the complex architecture caused bottlenecks in performance. In the case of our trending products data visualization, it could take minutes for changes to become available to the client. We needed to simplify in order to improve our data latency. 

As we approached this simplification, we knew we wanted to deprecate Cricket and replace it with a Flink-based data pipeline. We’ve been investing in Flink over the past couple of years, and even built our streaming platform on top of it—we call it Trickle. We knew we could leverage these existing engineering capabilities and infrastructure to streamline our pipeline. 

With our data pipeline figured out, we needed to decide on how to deliver the data to the client. We took a look at how we were using WebSocket and realized it wasn’t the best tool for our use case.

Server Sent Events Versus WebSocket

WebSocket provides a bidirectional communication channel over a single TCP connection. This is great to use if you’re building something like a chat app, because both the client and the server can send and receive messages across the channel. But, for our use case, we didn’t need a bidirectional communication channel. 

The BFCM Live Map is a data visualization product so we only need the server to deliver data to the client. If we continued to use WebSocket it wouldn’t be the most streamlined solution. SSE on the other hand is a better fit for our use case. If we went with SSE, we’d be able to implement:

  • A secure uni-directional push: The connection stream is coming from the server and is read-only.
  • A connection that uses ubiquitously familiar HTTP requests: This is a benefit for us because we were already using a ubiquitously familiar HTTP protocol, so we wouldn’t need to implement a special esoteric protocol.
  • Automatic reconnection: If there's a loss of connection, reconnection is automatically retried after a certain amount of time.

But most importantly, SSE would allow us to remove the process of retrieving, processing, and storing data on the presentation layer for the purpose of client polling. With SSE, we would be able to push the data as soon as it becomes available. There would be no more polls and reads, so no more delay. This, paired with a new streamlined pipeline, would simplify our architecture, scale with peak BFCM volumes and improve our data latency. 

With this in mind, we decided to implement SSE as our communication model for our 2022 Live Map. Here’s how we did it.

Implementing SSE in Golang

We implemented an SSE server in Golang that subscribes to Kafka topics and pushes the data to all registered clients’ SSE connections as soon as it’s available. 

Shopify BFCM Live Map 2022 Frontend
Shopify’s 2022 BFCM Live Map backend architecture with SSE server

A real-time streaming Flink data pipeline processes raw Shopify merchant sales data from Kafka topics. It also processes periodically-updated product classification enrichment data on GCS in the form of compressed Apache Parquet files. These are then computed into our sales and trending product data respectively and published into Kafka topics.

Here’s a code snippet of how the server registers an SSE connection:

Subscribing to the SSE endpoint is simple with the EventSource interface. Typically, client code creates a native EventSource object and registers an event listener on the object. The event is available in the callback function:

When it came to integrating the SSE server to our frontend UI, the UI application was expected to subscribe to an authenticated SSE server endpoint to receive data. Data being pushed from the server to client is publicly accessible during BFCM, but the authentication enables us to control access when the site is no longer public. Pre-generated JWT tokens are provided to the client by the server that hosts the client for the subscription. We used the open-sourced EventSourcePolyfill  implementation to pass an authorization header to the request:

Once subscribed, data is pushed to the client as it becomes available. Data is consistent with the SSE format, with the payload being a JSON parsable by the client.

Ensuring SSE Can Handle Load 

Our 2021 system struggled under a large number of requests from user sessions at peak BFCM volume due to the message bus bottleneck. We needed to ensure our SSE server could handle our expected 2022 volume. 

With this in mind, we built our SSE server to be horizontally scalable with the cluster of VMs sitting behind Shopify’s NGINX load-balancers. As the load increases or decreases, we can elastically expand and reduce our cluster size by adding or removing pods. However, it was essential that we determined the limit of each pod so that we could plan our cluster accordingly.

One of the challenges of operating an SSE server is determining how the server will operate under load and handle concurrent connections. Connections to the client are maintained by the server so that it knows which ones are active, and thus which ones to push data to. This SSE connection is implemented by the browser, including the retry logic. It wouldn’t be practical to open tens of thousands of true browser SSE connections. So, we need to simulate a high volume of connections in a load test to determine how many concurrent users one single server pod can handle. By doing this, we can identify how to scale out the cluster appropriately.

We opted to build a simple Java client that can initiate a configurable amount of SSE connections to the server. This Java application is bundled into a runnable Jar that can be distributed to multiple VMs in different regions to simulate the expected number of connections. We leveraged the open-sourced okhttp-eventsource library to implement this Java client.

Here’s the main code for this Java client:

Did SSE Perform Under Pressure?

With another successful BFCM in the bag, we can confidently say that implementing SSE in our new streamlined pipeline was the right move. Our BFCM Live Map saw 100 percent uptime. As for data latency in terms of SSE, data was delivered to clients within milliseconds of its availability. This was much improved from the minimum 10 second poll from our 2021 system. Overall, including the data processing in our Flink data pipeline, data was visualized on the BFCM’s Live Map UI within 21 seconds of its creation time. 

We hope you enjoyed this behind the scenes look at the 2022 BFCM Live Map and learned some tips and tricks along the way. Remember, when it comes to choosing a communication model for your real-time data product, keep it simple and use the tool best suited for your use case.

Bao is a Senior Staff Data Engineer who works on the Core Optimize Data team. He's interested in large-scale software system architecture and development, big data technologies and building robust, high performance data pipelines.

Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Want to help us scale and make commerce better for everyone? Join our team.

Continue reading

How to Export Datadog Metrics for Exploration in Jupyter Notebooks

How to Export Datadog Metrics for Exploration in Jupyter Notebooks

"Is there a way to extract Datadog metrics in Python for in-depth analysis?" 

This question has been coming up a lot at Shopify recently, so I thought detailing a step-by-step guide might be useful for anyone going down this same rabbit hole.

Follow along below to learn how to extract data from Datadog and build your analysis locally in Jupyter Notebooks.

Why Extract Data from Datadog?

As a quick refresher, Datadog is a monitoring and security platform for cloud applications, used to find issues in your platform, monitor the status of different services, and track the health of an infrastructure in general. 

So, why would you ever need Datadog metrics to be extracted?

There are two main reasons why someone may prefer to extract the data locally rather than using Datadog:

  1. Limitation of analysis: Datadog has a limited set of visualizations that can be built and it doesn't have the tooling to perform more complex analysis (e.g. building statistical models). 
  2. Granularity of data: Datadog dashboards have a fixed width for the visualizations, which means that checking metrics across a larger time frame will make the metric data less granular. For example, the below image shows a Datadog dashboard capturing a 15 minute span of activity, which generates metrics on a 1 second interval:
Datadog dashboard granularity of data over 15 minutes
Datadog dashboard showing data over the past 15 minutes

Comparatively, the below image shows a Datadog dashboard that captures a 30 day span of activity, which generates metrics on a 2 hour interval:

Datadog dashboard granularity of data over 30 days
Datadog dashboard showing data over the past 30 days

As you can see, Datadog visulaizes an aggregated trend in the 2 hour window, which means it smoothes (hides) any interesting events. For those reasons, someone may prefer to extract the data manually from Datadog to run their own analysis.

How to Extract Data and Build Your Own analysis

For the purposes of this blog, we’ll be running our analysis in Jupyter notebooks. However, feel free to use your own preferred tool for working with Python.

Datadog has a REST API which we’ll use to extract data from.

In order to extract data from Datadog's API, all you need are two things :

  1. API credentials: You’ll need credentials (an API key and an APP key) to interact with the datadog API. 
  2. Metric query: You need a query to execute in Datadog. For the purposes of this blog, let’s say we wanted to track the CPU utilization over time.

Once you have the above two requirements sorted, you’re ready to dive into the data.

Step 1: Initiate the required libraries and set up your credentials for making the API calls:


Step 2: Specify the parameters for time-series data extraction. Below we’re setting the time period from Tuesday, November 22, 2022 at 16:11:49 GMT to Friday, November 25, 2022 at 16:11:49 GMT:

One thing to keep in mind is that Datadog has a rate limit of API requests. In case you face rate issues, try increasing the “time_delta” in the query above to reduce the number of requests you make to the Datadog API.

Step 3: Run the extraction logic. Take the start and the stop timestamp and split them into buckets of width = time_delta

An example of bucketing start and stop timestamp
An example of taking the start and the stop timestamp and splitting them into buckets of width = time_delta

Next, make calls to the Datadog API for the above bucketed time windows in a for loop. For each call, append the data you extracted for bucketed time frames to a list.

Lastly, convert the lists to a dataframe and return it:


Step 4: Voila, you have the data! Looking at the below mock data table, this data will have more granularity compared to what is shown in Datadog.

Granularity of data after exporting from Datadog
Example of the granularity of data exported from Datadog

Now, we can use this to visualize data using any tool we want. For example, let’s use seaborn to look at the distribution of the system’s CPU utilization using KDE plots:


As you can see below, this visualization provides a deeper insight.

Data visualization in seaborn
Visualizing the data we pulled from Datadog in seaborn to look at the distribution using KDE plots

And there you have it. A super simple way to extract data from Datadog for exploration in Jupyter notebooks.

Kunal is a data scientist on the Shopify ProdEng data science team, working out of Niagara Falls, Canada. His team helps make Shopify’s platform performant, resilient and secure. In his spare time, Kunal enjoys reading about tech stacks, working on IoT devices and spending time with his family.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

Caching Without Marshal Part 2: The Path to MessagePack

Caching Without Marshal Part 2: The Path to MessagePack

In part one of Caching Without Marshal, we dove into the internals of Marshal, Ruby’s built-in binary serialization format. Marshal is the black box that Rails uses under the hood to transform almost any object into binary data and back. Caching, in particular, depends heavily on Marshal: Rails uses it to cache pretty much everything, be it actions, pages, partials, or anything else.

Marshal’s magic is convenient, but it comes with risks. Part one presented a deep dive into some of the little-documented internals of Marshal with the goal of ultimately replacing it with a more robust cache format. In particular, we wanted a cache format that would not blow up when we shipped code changes.

Part two is all about MessagePack, the format that did this for us. It’s a binary serialization format, and in this sense it’s similar to Marshal. Its key difference is that whereas Marshal is a Ruby-specific format, MessagePack is generic by default. There are MessagePack libraries for Java, Python, and many other languages.

You may not know MessagePack, but if you’re using Rails chances are you’ve got it in your Gemfile because it’s a dependency of Bootsnap.

The MessagePack Format

On the surface, MessagePack is similar to Marshal: just replace .dump with .pack and .load with .unpack. For many payloads, the two are interchangeable.

Here’s an example of using MessagePack to encode and decode a hash:

MessagePack supports a set of core types that are similar to those of Marshal: nil, integers, booleans, floats, and a type called raw, covering strings and binary data. It also has composite types for array and map (that is, a hash).

Notice, however, that the Ruby-specific types that Marshal supports, like Object and instance variable, aren’t in that list. This isn’t surprising since MessagePack is a generic format and not a Ruby format. But for us, this is a big advantage since it’s exactly the encoding of Ruby-specific types that caused our original problems (recall the beta flag class names in cache payloads from Part One).

Let’s take a closer look at the encoded data of Marshal and MessagePack. Suppose we encode a string "foo" with Marshal, this is what we get:

Visual representation encoded data of Marshall.dump("foo") =  0408 4922 0866 6f6f 063a 0645 54
Encoded data from Marshal for Marshall.dump("foo")

Let’s look at the payload: 0408 4922 0866 6f6f 063a 0645 54. We see that the payload "foo" is encoded in hex as 666f6f and prefixed by 08 representing a length of 3 (f-o-o). Marshal wraps this string payload in a TYPE_IVAR, which as mentioned in part 1 is used to attach instance variables to types that aren’t strictly implemented as objects, like strings. In this case, the instance variable (3a 0645) is named :E. This is a special instance variable used by Ruby to represent the string’s encoding, which is T (54) for true, that is, this is a UTF-8 encoded string. So Marshal uses a Ruby-native idea to encode the string’s encoding.

In MessagePack, the payload (a366 6f6f) is much shorter:

Visual representation of encoded data MessagePack(“foo") = 0408 4922 0866 6f6f 063a 0645 54
Encoded data from MessagePack for MessagePack.pack("foo")

The first thing you’ll notice is that there isn’t an encoding. MessagePack’s default encoding is UTF-8, so there’s no need to include it in the payload. Also note that the payload type (10100011), String, is encoded together with its length: the bits 101 encodes a string of less than 31 bytes, and 00011 says the actual length is 3 bytes. Altogether this makes for a very compact encoding of a string.

Extension Types

After deciding to give MessagePack a try, we did a search for Rails.cache.write and in the codebase of our core monolith, to figure out roughly what was going into the cache. We found a bunch of stuff that wasn’t among the types MessagePack supported out of the box.

Luckily for us, MessagePack has a killer feature that came in handy: extension types. Extension types are custom types that you can define by calling register_type on an instance of MessagePack::Factory, like this:

An extension type is made up of the type code (a number from 0 to 127—there’s a maximum of 128 extension types), the class of the type, and a serializer and deserializer, referred to as packer and unpacker. Note that the type is also applied to subclasses of the type’s class. Now, this is usually what you want, but it’s something to be aware of and can come back to bite you if you’re not careful.

Here’s the Date extension type, the simplest of the extension types we use in the core monolith in production:

As you can see, the code for this type is 3, and its class is Date. Its packer takes a date and extracts the date’s year, month, and day. It then packs them into the format string "s< C C" using the Array#pack method with the year to a 16 bit signed integer, and the month and day to 8-bit unsigned integers. The type’s unpacker goes the other way: it takes a string and, using the same format string, extracts the year, month, and day using String#unpack, then passes them to to create a new date object.

Here’s how we would encode an actual date with this factory:

Converting the result to hex, we get d603 e607 0909 that corresponds to the date (e607 0909) prefixed by the extension type (d603):

Visual breakdown of hex results d603 e607 0909
Encoded date from the factory

As you can see, the encoded date is compact. Extension types give us the flexibility to encode pretty much anything we might want to put into the cache in a format that suits our needs.

Just Say No

If this were the end of the story, though, we wouldn’t really have had enough to go with MessagePack in our cache. Remember our original problem: we had a payload containing objects whose classes changed, breaking on deploy when they were loaded into old code that didn’t have those classes defined. In order to avoid that problem from happening, we need to stop those classes from going into the cache in the first place.

We need MessagePack, in other words, to refuse encoding any object without a defined type, and also let us catch these types so we can follow up. Luckily for us, MessagePack does this. It’s not the kind of “killer feature” that’s advertised as such, but it’s enough for our needs.

Take this example, where factory is the factory we created previously:

If MessagePack were to happily encode this—without any Object type defined—we’d have a problem. But as mentioned earlier, MessagePack doesn’t know Ruby objects by default and has no way to encode them unless you give it one.

So what actually happens when you try this? You get an error like this:

NoMethodError: undefined method `to_msgpack' for <#Object:0x...>

Notice that MessagePack traversed the entire object, through the hash, into the array, until it hit the Object instance. At that point, it found something for which it had no type defined and basically blew up.

The way it blew up is perhaps not ideal, but it’s enough. We can rescue this exception, check the message, figure out it came from MessagePack, and respond appropriately. Critically, the exception contains a reference to the object that failed to encode. That’s information we can log and use to later decide if we need a new extension type, or if we are perhaps putting things into the cache that we shouldn’t be.

The Migration

Now that we’ve looked at Marshal and MessagePack, we’re ready to explain how we actually made the switch from one to the other.

Making the Switch

Our migration wasn’t instantaneous. We ran with the two side-by-side for a period of about six months while we figured out what was going into the cache and which extension types we needed. The path of the migration, however, was actually quite simple. Here’s the basic step-by-step process:

  1. First, we created a MessagePack factory with our extension types defined on it and used it to encode the mystery object passed to the cache (the puzzle piece in the diagram below).
  2. If MessagePack was able to encode it, great! We prefixed a version byte prefix that we used to track which extension types were defined for the payload, and then we put the pair into the cache.
  3. If, on the other hand, the object failed to encode, we rescued the NoMethodError which, as mentioned earlier, MessagePack raises in this situation. We then fell back to Marshal and put the Marshal-encoded payload into the cache. Note that when decoding, we were able to tell which payloads were Marshal-encoded by their prefix: if it’s 0408 it’s a Marshal-encoded payload, otherwise it’s MessagePack.
Path of the migration
The migration three step process

The step where we rescued the NoMethodError was quite important in this process since it was where we were able to log data on what was actually going into the cache. Here’s that rescue code (which of course no longer exists now since we’re fully migrated to MessagePack):

As you can see, we sent data (including the class of the object that failed to encode) to both logs and StatsD. These logs were crucial in flagging the need for new extension types, and also in signaling to us when there were things going into the cache that shouldn’t ever have been there in the first place.

We started the migration process with a small set of default extension types which Jean Boussier, who worked with me on the cache project, had registered in our core monolith earlier for other work using MessagePack. There were five:

  • Symbol (offered out of the box in the messagepack-ruby gem. It just has to be enabled)
  • Time
  • DateTime
  • Date (shown earlier)
  • BigDecimal

These were enough to get us started, but they were certainly not enough to cover all the variety of things that were going into the cache. In particular, being a Rails application, the core monolith serializes a lot of records, and we needed a way to serialize those records. We needed an extension type for ActiveRecords::Base.

Encoding Records

Records are defined by their attributes (roughly, the values in their table columns), so it might seem like you could just cache them by caching their attributes. And you can.

But there’s a problem: records have associations. Marshal encodes the full set of associations along with the cached record. This ensures that when the record is deserialized, the loaded associations (those that have already been fetched from the database) will be ready to go without any extra queries. An extension type that only caches attribute values, on the other hand, needs to make a new query to refetch those associations after coming out of the cache, making it much more inefficient.

So we needed to cache loaded associations along with the record’s attributes. We did this with a serializer called ActiveRecordCoder. Here’s how it works. Consider a simple post model that has many comments, where each comment belongs to a post with an inverse defined:

Note that the Comment model here has an inverse association back to itself via its post association. Recall that Marshal handles this kind of circularity automatically using the link type (@ symbol) we saw in part 1, but that MessagePack doesn’t handle circularity by default. We’ll have to implement something like a link type to make this encoder work.

Instance Tracker handles circularity
Instance Tracker handles circularity

The trick we use for handling circularity involves something called an Instance Tracker. It tracks records encountered while traversing the record’s network of associations. The encoding algorithm builds a tree where each association is represented by its name (for example :comments or :post), and each record is represented by its unique index in the tracker. If we encounter an untracked record, we recursively traverse its network of associations, and if we’ve seen the record before, we simply encode it using its index.

This algorithm generates a very compact representation of a record’s associations. Combined with the records in the tracker, each encoded by its set of attributes, it provides a very concise representation of a record and its loaded associations.

Here’s what this representation looks like for the post with two comments shown earlier:

Once ActiveRecordCoder has generated this array of arrays, we can simply pass the result to MessagePack to encode it to a bytestring payload. For the post with two comments, this generates a payload of around 300 bytes. Considering that the Marshal payload for the post with no associations we looked at in Part 1 was 1,600 bytes in length, that’s not bad.

But what happens if we try to encode this post with its two comments using Marshal? The result is shown below: a payload over 4,000 bytes long. So the combination of ActiveRecordCoder with MessagePack is 13 times more space efficient than Marshal for this payload. That’s a pretty massive improvement.

Visual representation of the difference between an ActiveRecordCoder + MessagePack payload vs a Marshal payload
ActiveRecordCoder + MessagePack vs Marshal

In fact, the space efficiency of the switch to MessagePack was so significant that we immediately saw the change in our data analytics. As you can see in the graph below, our Rails cache memcached fill percent dropped after the switch. Keep in mind that for many payloads, for example boolean and integer valued-payloads, the change to MessagePack only made a small difference in terms of space efficiency. Nonetheless, the change for more complex objects like records was so significant that total cache usage dropped by over 25 percent.

Line graph showing Rails cache memcached fill percent versus time. The graph shows a decrease when changed to MessagePackRails cache memcached fill percent versus time

Handling Change

You might have noticed that ActiveRecordCoder, our encoder for ActiveRecord::Base objects, includes the name of record classes and association names in encoded payloads. Although our coder doesn’t encode all instance variables in the payload, the fact that it hardcodes class names at all should be a red flag. Isn’t this exactly what got us into the mess caching objects with Marshal in the first place?

And indeed, it is—but there are two key differences here.

First, since we control the encoding process, we can decide how and where to raise exceptions when class or association names change. So when decoding, if we find that a class or association name isn’t defined, we rescue the error and re-raise a more specific error. This is very different from what happens with Marshal.

Second, since this is a cache, and not, say, a persistent datastore like a database, we can afford to occasionally drop a cached payload if we know that it’s become stale. So this is precisely what we do. When we see one of the exceptions for missing class or association names, we rescue the exception and simply treat the cache fetch as a miss. Here’s what that code looks like:

The result of this strategy is effectively that during a deploy where class or association names change, cache payloads containing those names are invalidated, and the cache needs to replace them. This can effectively disable the cache for those keys during the period of the deploy, but once the new code has been fully released the cache again works as normal. This is a reasonable tradeoff, and a much more graceful way to handle code changes than what happens with Marshal.

Core Type Subclasses

With our migration plan and our encoder for ActiveRecord::Base, we were ready to embark on the first step of the migration to MessagePack. As we were preparing to ship the change, however, we noticed something was wrong on continuous integration (CI): some tests were failing on hash-valued cache payloads.

A closer inspection revealed a problem with HashWithIndifferentAccess, a subclass of Hash provided by ActiveSupport that makes symbols and strings work interchangeably as hash keys. Marshal handles subclasses of core types like this out of the box, so you can be sure that a HashWithIndifferentAccess that goes into a Marshal-backed cache will come back out as a HashWithIndifferentAccess and not a plain old Hash. The same cannot be said for MessagePack, unfortunately, as you can confirm yourself:

MessagePack doesn’t blow up here on the missing type because  HashWithIndifferentAccess is a subclass of another type that it does support, namely Hash. This is a case where MessagePack’s default handling of subclasses can and will bite you; it would be better for us if this did blow up, so we could fall back to Marshal. We were lucky that our tests caught the issue before this ever went out to production.

The problem was a tricky one to solve, though. You would think that defining an extension type for HashWithIndifferentAccess would resolve the issue, but it didn’t. In fact, MessagePack completely ignored the type and continued to serialize these payloads as hashes.

As it turns out, the issue was with msgpack-ruby itself. The code handling extension types didn’t trigger on subclasses of core types like Hash, so any extensions of those types had no effect. I made a pull request (PR) to fix the issue, and as of version 1.4.3, msgpack-ruby now supports extension types for Hash as well as Array, String, and Regex.

The Long Tail of Types

With the fix for HashWithIndifferentAccess, we were ready to ship the first step in our migration to MessagePack in the cache. When we did this, we were pleased to see that MessagePack was successfully serializing 95 percent of payloads right off the bat without any issues. This was validation that our migration strategy and extension types were working.

Of course, it’s the last 5 percent that’s always the hardest, and indeed we faced a long tail of failing cache writes to resolve. We added types for commonly cached classes like ActiveSupport::TimeWithZone and Set, and edged closer to 100 percent, but we couldn’t quite get all the way there. There were just too many different things still being cached with Marshal.

At this point, we had to adjust our strategy. It wasn’t feasible to just let any developer define new extension types for whatever they needed to cache. Shopify has thousands of developers, and we would quickly hit MessagePack’s limit of 128 extension types.

Instead, we adopted a different strategy that helped us scale indefinitely to any number of types. We defined a catchall type for Object, the parent class for the vast majority of objects in Ruby. The Object extension type looks for two methods on any object: an instance method named as_pack and a class method named from_pack. If both are present, it considers the object packable, and uses as_pack as its serializer and from_pack as its deserializer. Here’s an example of a Task class that our encoder treats as packable:

Note that, as with the ActiveRecord::Base extension type, this approach relies on encoding class names. As mentioned earlier, we can do this safely since we handle class name changes gracefully as cache misses. This wouldn’t be a viable approach for a persistent store.

The packable extension type worked great, but as we worked on migrating existing cache objects, we found many that followed a similar pattern, caching either Structs or T::Structs (Sorbet’s typed struct). Structs are simple objects defined by a set of attributes, so the packable methods were each very similar since they simply worked from a list of the object’s attributes. To make things easier, we extracted this logic into a module that, when included in a struct class, automatically makes the struct packable. Here’s the module for Struct:

The serialized data for the struct instance includes an extra digest value (26450) that captures the names of the struct’s attributes. We use this digest to signal to the Object extension type deserialization code that attribute names have changed (for example in a code refactor). If the digest changes, the cache treats cached data as stale and regenerates it:

Simply by including this module (or a similar one for T::Struct classes), developers can cache struct data in a way that’s robust to future changes. As with our handling of class name changes, this approach works because we can afford to throw away cache data that has become stale.

The struct modules accelerated the pace of our work, enabling us to quickly migrate the last objects in the long tail of cached types. Having confirmed from our logs that we were no longer serializing any payloads with Marshal, we took the final step of removing it entirely from the cache. We’re now caching exclusively with MessagePack.

Safe by Default

With MessagePack as our serialization format, the cache in our core monolith became safe by default. Not safe most of the time or safe under some special conditions, but safe, period. It’s hard to underemphasize the importance of a change like this to the stability and scalability of a platform as large and complex as Shopify’s.

For developers, having a safe cache brings a peace of mind that one less unexpected thing will happen when they ship their refactors. This makes such refactors—particularly large, challenging ones—more likely to happen, improving the overall quality and long-term maintainability of our codebase.

If this sounds like something that you’d like to try yourself, you’re in luck! Most of the work we put into this project has been extracted into a gem called Shopify/paquito. A migration process like this will never be easy, but Paquito incorporates the learnings of our own experience. We hope it will help you on your journey to a safer cache.

Chris Salzberg is a Staff Developer on the Ruby and Rails Infra team at Shopify. He is based in Hakodate in the north of Japan.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Caching Without Marshal Part 1: Marshal from the Inside Out

Caching Without Marshal Part 1: Marshal from the Inside Out

Caching is critical to how Rails applications work. At every layer, whether it be in page rendering, database querying, or external data retrieval, the cache is what ensures that no single bottleneck brings down an entire application. 

But caching has a dirty secret, and that secret’s name is Marshal.

Marshal is Ruby’s ultimate sharp knife, able to transform almost any object into a binary blob and back. This makes it a natural match for the diverse needs of a cache, particularly the cache of a complex web framework like Rails. From actions, to pages, to partials, to queries—you name it, if Rails is touching it, Marshal is probably caching it. 

Marshal’s magic, however, comes with risks.

A couple of years ago, these risks became very real for us. It started innocently enough. A developer at Shopify, in an attempt to clean up some code in our core monolith, shipped a PR refactoring some key classes around beta flags. The refactor got the thumbs up in review and passed all tests and other checks.

As it went out to production, though, it became clear something was very wrong. A flood of exceptions triggered an incident, and the refactor was quickly rolled back and reverted. We were lucky to escape so easily.

The incident was a wake-up call for us. Nothing in our set of continuous integration (CI) checks had flagged the change. Indeed, even in retrospect, there was nothing wrong with the code change at all. The issue wasn’t the code, but the fact that the code had changed.

The problem, of course, was Marshal. Being so widely used, beta flags were being cached. Marshal serializes an object’s class along with its other data, so many of the classes that were part of the refactor were also hardcoded in entries of the cache. When the newly deployed code began inserting beta flag instances with the new classes into the cache, the old code—which was still running as the deploy was proceeding—began choking on class names and methods that it had never seen before.

As a member of Shopify’s Ruby and Rails Infrastructure team, I was involved in the follow-up for this incident. The incident was troubling to us because there were really only two ways to mitigate the risk of the same incident happening again, and neither was acceptable. The first is simply to put less things into the cache, or less variety of things; this decreases the likelihood of cached objects conflicting with future code changes. But this defeats the purpose of having a cache in the first place.

The other way to mitigate the risk is to change code less, because it’s code changes that ultimately trigger cache collisions. But this was even less acceptable: our team is all about making code cleaner, and that requires changes. Asking developers to stop refactoring their code goes against everything we were trying to do at Shopify.

So we decided to take a deeper look and fix the root problem: Marshal. We reasoned that if we could use a different serialization format—one that wouldn’t cache any arbitrary object the way Marshal does, one that we could control and extend—then maybe we could make the cache safe by default.

The format that did this for us is MessagePack. MessagePack is a binary serialization format that’s much more compact than Marshal, with stricter typing and less magic. In this two-part series (based on a RailsConf talk by the same name), I’ll pry Marshal open to show how it works, delve into how we replaced it, and describe the specific challenges posed by Shopify’s scale.

But to start, let’s talk about caching and how Marshal fits into that.

You Can’t Always Cache What You Want

Caching in Rails is easy. Out of the box, Rails provides caching features that cover the common requirements of a typical web application. The Rails Guides provide details on how these features work, and how to use them to speed up your Rails application. So far, so good.

What you won’t find in the guides is information on what you can and can’t put into the cache. The low-level caching section of the caching guide simply states: “Rails’ caching mechanism works great for storing any kind of information.” (original emphasis) If that sounds too good to be true, that’s because it is.

Under the hood, all types of cache in Rails are backed by a common interface of two methods, read and write, on the cache instance returned by Rails.cache. While there are a variety of cache backends—in our core monolith we use Memcached, but you can also cache to file, memory, or Redis, for example—they all serialize and deserialize data the same way, by calling Marshal.load and Marshal.dump on the cached object.

A diagram showing the differences between the cache encoding format between Rails 6 and Rail 7
Cache encoding format in Rails 6 and Rails 7

If you actually take a peek at what these cache backends put into the cache, you might find that things have changed in Rails 7 for the better. This is thanks to work by Jean Boussier, who’s also in the Ruby and Rails Infrastructure team at Shopify, and who I worked with on the cache project. Jean recently improved cache space allocation by more efficiently serializing a wrapper class named ActiveSupport::Cache::Entry. The result is a more space-efficient cache that stores cached objects and their metadata without any redundant wrapper.

Unfortunately, that work doesn’t help us when it comes to the dangers of Marshal as a serialization format: while the cache is slightly more space efficient, all those issues still exist in Rails 7. To fix the problems with Marshal, we need to replace it.

Let’s Talk About Marshal

But before we can replace Marshal, we need to understand it. And unfortunately, there aren’t a lot of good resources explaining what Marshal actually does.

To figure that out, let’s start with a simple Post record, which we will assume has a title column in the database:

We can create an instance of this record and pass it to  Marshal.dump:

This is what we get back:

This is a string of around 1,600 bytes, and as you can see, a lot is going on in there. There are constants corresponding to various Rails classes like ActiveRecord, ActiveModel and ActiveSupport. There are also instance variables, which you can identify by the @ symbol before their names. And finally there are many values, including the name of the post, Caching Without Marshal, which appears three times in the payload.

The magic of Marshal, of course, is that if we take this mysterious bytestring and pass it to Marshal.load, we get back exactly the Post record we started with.

You can do this a day from now, a week from now, a year from now, whenever you want—you will get the exact same object back. This is what makes Marshal so powerful.

And this is all possible because Marshal encodes the universe. It recursively crawls objects and their references, extracts all the information it needs, and dumps the result to the payload.

But what is actually going on in that payload? To figure that out, we’ll need to dig deeper and go to the ultimate source of truth in Ruby: the C source code. Marshal’s code lives in a file called marshal.c. At the top of the file, you’ll find a bunch of constants that correspond to the types Marshal uses when encoding data.

Marshal types defined in marshal.c
Marshal types defined in marshal.c

Top in that list are MARSHAL_MAJOR and MARSHAL_MINOR, the major and minor versions of Marshal, not to be confused with the version of Ruby. This is what comes first in any Marshal payload. The Marshal version hasn’t changed in years and can pretty much be treated as a constant.

Next in the file are several types I will refer to here as “atomic”, meaning types which can’t contain other objects inside themself. These are the things you probably expect: nil, true, false, numbers, floats, symbols, and also classes and modules.

Next, there are types I will refer to as “composite” that can contain other objects inside them. Most of these are unsurprising: array, hash, struct, and object, for example. But this group also includes two you might not expect: string and regex. We’ll return to this later in this article.

Finally, there are several types toward the end of the list whose meaning is probably not very obvious at all. We will return to these later as well.


Let’s first start with the most basic type of thing that Marshal serializes: objects. Marshal encodes objects using a type called TYPE_OBJECT, represented by a small character o.

Marshal-encoded bytestring for the example Post
Marshal-encoded bytestring for the example post

Here’s the Marshal-encoded bytestring for the example Post we saw earlier, converted to make it a bit easier to parse.

The first thing we can see in the payload is the Marshal version (0408), followed by an object, represented by an ‘o’ (6f). Then comes the name of the object’s class, represented as a symbol: a colon (3a) followed by the symbol’s length (09) and name as an ASCII string (Post). (Small numbers are stored by Marshal in an optimized format—09 translates to a length of 4.) Then there’s an integer representing the number of instance variables, followed by the instance variables themselves as pairs of names and values.

You can see that a payload like this, with each variable itself containing an object with further instance variables of its own, can get very big, very fast.

Instance Variables

As mentioned earlier, Marshal encodes instance variables in objects as part of its object type. But it also encodes instance variables in other things that, although seemingly object-like (subclassing the Object class), aren’t in fact implemented as such. There are four of these, which I will refer to as core types, in this article: String, Regex, Array, and Hash. Since Ruby implements these types in a special, optimized way, Marshal has to encode them in a special way as well.

Consider what happens if you assign an instance variable to a string, like this:

This may not be something you do every day, but it’s something you can do. And you may ask: does Marshal handle this correctly?

The answer is: yes it does.

It does this using a special type called TYPE_IVAR to encode instance variables on things that aren’t strictly implemented as objects, represented by a variable name and its value. TYPE_IVAR wraps the original type (String in this case), adding a list of instance variable names and values. It’s also used to encode instance variables in hashes, arrays, and regexes in the same way.


Another interesting problem is circularity: what happens when an object contains references to itself. Records, for example, can have associations that have inverses pointing back to the original record. How does Marshal handle this?

Take a minimal example: an array which contains a single element, the array itself:

What happens if we run this through Marshal? Does it segmentation fault on the self-reference? 

As it turns out, it doesn’t. You can confirm yourself by passing the array through Marshal.dump and Marshal.load:

Marshal does this thanks to an interesting type called the link type, referred to in marshal.c as TYPE_LINK.

TYPE_LINK example

The way Marshall does this is quite efficient. Let’s look at the payload: 0408 5b06 4000. It starts with an open square bracket (5b) representing the array type, and the length of the array (as noted earlier, small numbers are stored in an optimized format, so 06 translates to a length of 1). The circularity is represented by a @ (40) symbol for the link type, followed by an index of the element in the encoded object the link is pointing to, in this case 00 for the first element (the array itself).

In short, Marshal handles circularity out of the box. That’s important to note because when we deal with this ourselves, we’re going to have to reimplement this process.

Core Type Subclasses

I mentioned earlier that there are a number of core types that Ruby implements in a special way, and that Marshal also needs to handle in a way that’s distinct from other objects. Specifically, these are: String, Regex, Array, and Hash.

One interesting edge case is what happens when you subclass one of these classes, like this:

If you create an instance of this class, you’ll see that while it looks like a hash, it’s, indeed, an instance of the subclass:

So what happens if you encode this with Marshal? If you do, you’ll find that it actually captures the correct class:

Marshal does this because it has a special type called TYPE_UCLASS. To the usual data for the type (hash data in this case), TYPE_UCLASS adds the name of the class, allowing it to correctly decode the object when loading it back. It uses the same type to encode subclasses of strings, arrays, and regexes (the other core types).

The Magic of Marshal

We’ve looked at how Marshal encodes several different types of objects in Ruby. You might be wondering at this point why all this information is relevant to you.

The answer is because—whether you realize it or not—if you’re running a Rails application, you most likely rely on it. And if you decide, like we did, to take Marshal’s magic out of your application, you’ll find that it’s exactly these things that break. So before doing that, it’s a good idea to figure out how to replace each one of them.

That’s what we did, with a little help from a format called MessagePack. In the next part of this series, we’ll take a look at the steps we took to migrate our cache to MessagePack. This includes re-implementing some of the key Marshal features, such as circularity and core type subclasses, explored in this article, as well as a deep dive into our algorithm for encoding records and their associations.

Chris Salzberg is a Staff Developer on the Ruby and Rails Infra team at Shopify. He is based in Hakodate in the north of Japan.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Reducing BigQuery Costs: How We Fixed A $1 Million Query

Reducing BigQuery Costs: How We Fixed A $1 Million Query

During the infrastructural exploration of a pipeline my team was building, we discovered a query that could have cost us nearly $1 million USD a month in BigQuery. Below, we’ll detail how we reduced this and share our tips for lowering costs in BigQuery.

Processing One Billion Rows of Data

My team was responsible for building a data pipeline for a new marketing tool we were shipping to Shopify merchants. We built our pipeline with Apache Flink and launched the tool in an early release to a select group of merchants. Fun fact: this pipeline became one of the first productionized Flink pipelines at Shopify. During the early release, our pipeline ingested one billion rows of data into our Flink pipeline's internal state (managed by RocksDB), and handled streaming requests from Apache Kafka

We wanted to take the next step by making the tool generally available to a larger group of merchants. However, this would mean a significant increase in the data our Flink pipeline would be ingesting. Remember, our pipeline was already ingesting one billion rows of data for a limited group of merchants. Ingesting an ever-growing dataset wouldn’t be sustainable. 

As a solution, we looked into a SQL-based external data warehouse. We needed something that our Flink pipeline could submit queries to and that could write back results to Google Cloud Storage (GCS). By doing this, we could simplify the current Flink pipeline dramatically by removing ingestion, ensuring we have a higher throughput for our general availability launch.

The external data warehouse needed to meet the following three criteria:

  1. Atomically load the parquet dataset easily
  2. Handle 60 requests per minute (our general availability estimation) without significant queuing or waiting time
  3. Export the parquet dataset to GCS easily

The first query engine that came to mind was BigQuery. It’s a data warehouse that can both store petabytes of data and query those datasets within seconds. BigQuery is fully managed by Google Cloud Platform and was already in use at Shopify. We knew we could load our one billion row dataset into BigQuery and export query results into GCS easily. With all of this in mind, we started the exploration but we met an unexpected obstacle: cost.

A Single Query Would Cost Nearly $1 Million

As mentioned above, we’ve used BigQuery at Shopify, so there was an existing BigQuery loader in our internal data modeling tool. So, we easily loaded our large dataset into BigQuery. However, when we first ran the query, the log showed the following:

total bytes processed: 75462743846, total bytes billed: 75462868992

That roughly translated to 75 GB billed from the query. This immediately raised an alarm because BigQuery is charged by data processed per query. If each query were to scan 75 GB of data, how much would it cost us at our general availability launch? 

I quickly did some rough math. If we estimate 60 RPM at launch, then:

60 RPM x 60 minutes/hour x 24 hours/day x 30 days/month = 2,592,000 queries/month 

If each query scans 75 GB of data, then we’re looking at approximately 194,400,000 GB of data scanned per month. According to BigQuery’s on-demand pricing scheme, it would cost us $949,218.75 USD per month!

Clustering to the Rescue

With the estimation above, we immediately started to look for solutions to reduce this monstrous cost. 

We knew that clustering our tables could help reduce the amount of data scanned in BigQuery. As a reminder, clustering is the act of sorting your data based on one or more columns in your table. You can cluster columns in your table by fields like DATE, GEOGRAPHY, TIMESTAMP, ect. You can then have BigQuery scan only the clustered columns you need.

With clustering in mind, we went digging and discovered several condition clauses in the query that we could cluster. These were ideal because if we clustered our table with columns appearing in WHERE clauses, we could apply filters in our query that would ensure only specific conditions are scanned. The query engine will stop scanning once it finds those conditions, ensuring only the relevant data is scanned instead of the entire table. This reduces the amount of bytes scanned and would save us a lot of processing time. 

We created a clustered dataset on two feature columns from the query’s WHERE clause. We then ran the exact same query and the log now showed 508.1 MB billed. That’s 150 times less data scanned than the previous unclustered table. 

With our newly clustered table, we identified that the query would now only scan 108.3 MB of data. Doing some rough math again:

2,592,000 queries/month x 0.1 GB of data = 259,200 GB of data scanned/month

That would bring our cost down to approximately $1,370.67 USD per month, which is way more reasonable.

Other Tips for Reducing Cost

While all it took was some clustering for us to significantly reduce our costs, here are a few other tips for lowering BigQuery costs:

  • Avoid the SELECT* statement: Only select the columns in the table you need queried. This will limit the engine scan to only those columns, therefore limiting your cost. 
  • Partition your tables: This is another way to restrict the data scanned by dividing your table into segments (aka partitions). You can create partitions in BigQuery based on time-units, ingestion time or integer range.
  • Don’t run queries to explore or preview data: Doing this would be an unnecessary cost. You can use table preview options to view data for free.

And there you have it. If you’re working with a high volume of data and using BigQuery, following these tips can help you save big. Beyond cost savings, this is critical for helping you scale your data architecture. 

Calvin is a senior developer at Shopify. He enjoys tackling hard and challenging problems, especially in the data world. He’s now working with the Return on Ads Spend group in Shopify. In his spare time, he loves running, hiking and wandering in nature. He is also an amateur Go player.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

The Management Poles of Developer Infrastructure Teams

The Management Poles of Developer Infrastructure Teams

Over the past few years, as I’ve been managing multiple developer infrastructure teams at once, I’ve found some tensions that are hard to resolve. In my current mental model, I have found that there are three poles that have a natural tension and are thus tricky to balance: management support, system and domain expertise, and road maps. I’m going to discuss the details of these poles and some strategies I’ve tried to manage them.

What’s Special About Developer Infrastructure Teams?

Although this model likely can apply to any software development team, the nature of developer infrastructure (Dev Infra) makes this situation particularly acute for managers in our field. These are some of the specific challenges faced in Dev Infra:

  • Engineering managers have a lot on their plates. For whatever reason, infrastructure teams usually lack dedicated product managers, so we often have to step in to fill that gap. Similarly, we’re responsible for tasks that usually fall to UX experts, such as doing user research.
  • There’s a lot of maintenance and support. Teams are responsible for keeping multiple critical systems online with hundreds or thousands of users, usually with only six to eight developers. In addition, we often get a lot of support requests, which is part of the cost of developing in-house software that has no extended community outside the company.
  • As teams tend to organize around particular phases in the development workflow, or sometimes specific technologies, there’s a high degree of domain expertise that’s developed over time by all its members. This expertise allows the team to improve their systems and informs the team’s road map.

What Are The Three Poles?

The Dev Infra management poles I’ve modelled are tensions, much like that between product and engineering. They can’t, I don’t believe, all be solved at the same time—and perhaps they shouldn’t be. We, Dev Infra managers, balance them according to current needs and context and adapt as necessary. For this balancing act, it behooves us to make sure we understand the nature of these poles.

1. Management Support

Supporting developers in their career growth is an important function of any engineering manager. Direct involvement in team projects allows the tightest feedback loops between manager and report, and thus the highest-quality coaching and mentorship. We also want to maximize the number of reports per manager. Good managers are hard to find, and even the best manager adds a bit of overhead to a team’s impact.

We want the manager to be as involved in their reports’ work as possible, and we want the highest number of reports per manager that they can handle. Where this gets complicated is balancing the scope and domain of individual Dev Infra teams and of the whole Dev Infra organization. This tension is a direct result of the need for specific system and domain expertise on Dev Infra teams.

2. System and Domain Expertise

As mentioned above, in Dev Infra we tend to build teams around domains that represent phases in the development workflow, or occasionally around specific critical technologies. It’s important that each team has both domain knowledge and expertise in the specific systems involved. Despite this focus, the scope of and opportunities in a given area can be quite broad, and the associated systems grow in size and complexity.

Expertise in a team’s systems is crucial just to keep everything humming along. As with any long-running software application, dependencies need to be managed, underlying infrastructure has to be occasionally migrated, and incidents must be investigated and root causes solved. Furthermore, at any large organization, Dev Infra services can have many users relative to the size of the teams responsible for them. Some teams will require on-call schedules in case a critical system breaks during an emergency (finding out the deployment tool is down when you’re trying to ship a security fix is, let’s just say, not a great experience for anyone).

A larger team means less individual on-call time and more hands for support, maintenance, and project work. As teams expand their domain knowledge, more opportunities are discovered for increasing the impact of the team’s services. The team will naturally be driven to constantly improve the developer experience in their area of expertise. This drive, however, risks a disconnect with the greatest opportunities for impact across Dev Infra as a whole.

3. Road Maps

Specializing Dev Infra teams in particular domains is crucial for both maintenance and future investments. Team road maps and visions improve and expand upon existing offerings: smoothing interfaces, expanding functionality, scaling up existing solutions, and looking for new opportunities to impact development in their domain. They can make a big difference to developers during particular phases of their workflow like providing automation and feedback while writing code, speeding up continuous integration (CI) execution, avoiding deployment backlogs, and monitoring services more effectively.

As a whole Dev Infra department, however, the biggest impact we can have on development at any given time changes. When Dev Infra teams are first created, there’s usually a lot of low- hanging fruit—obvious friction at different points in the development workflow—so multiple teams can broadly improve the developer experience in parallel. At some point, however, some aspects of the workflow will be much smoother than others. Maybe CI times have finally dropped to five minutes. Maybe deploys rarely need attention after being initiated. At a large organization, there will always be edge cases, bugs, and special requirements in every area, but their impact will be increasingly limited when compared to the needs of the engineering department as a whole.

At this point, there may be an opportunity for a large new initiative that will radically impact development in a particular way. There may be a few, but it’s unlikely that there will be the need for radical changes across all domains. Furthermore, there may be unexplored opportunities and domains for which no team has been assembled. These can be hard to spot if the majority of developers and managers are focused on existing well-defined scopes.

How to Maintain the Balancing Act

Here’s the part where I confess that I don’t have a single amazing solution to balance management support, system maintenance and expertise, and high-level goals. Likely there are a variety of solutions that can be applied and none are perfect. Here are three ideas I’ve thought about and experimented with.

1. Temporarily Assign People from One Team to a Project on Another

If leadership has decided that the best impact for our organization at this moment is concentrated in the work of a particular team, call it Team A, and if Team A’s manager can’t effectively handle any more reports, then a direct way to get more stuff done is to take a few people from another team (Team B) and assign them to the Team A’s projects. This has some other benefits as well: it increases the number of people with familiarity in Team A’s systems, and people sometimes like to change up what they’re working on.

When we tried this, the immediate question was “should the people on loan to Team A stay on the support rotations for their ‘home’ team?” From a technical expertise view, they’re important to keep the lights on in the systems they’re familiar with. Leaving them on such rotations prevents total focus on Team A, however, and at a minimum extends the onboarding time. There are a few factors to consider: the length of the project(s), the size of Team B, and the existing maintenance burden on Team B. Favour removing the reassigned people from their home rotations, but know that this will slow down Team A’s work even more as they pick up the extra work.

The other problem we ran into is that the manager of Team B is disconnected from the work their reassigned reports are now working on. Because the main problem is that Team A’s manager doesn’t have enough bandwidth to have more reports, there’s less management support for the people on loan, in terms of mentoring, performance management, and prioritization. The individual contributor (IC) can end up feeling disconnected from both their home team and the new one.

2. Have a Whole Team Contribute to Another Team’s Goals

We can mitigate at least the problem of ICs feeling isolated in their new team if we have the entire team (continuing the above nomenclature, Team B) work on the systems that another team (Team A) owns. This allows members of Team B to leverage their existing working relationships with each other, and Team B’s manager doesn’t have to split their attention between two teams. This arrangement can work well if there is a focused project in Team A’s domain that somehow involves some of Team B’s domain expertise.

This is, of course, a very blunt instrument, in that no project work will get done on Team B’s systems, which themselves still need to be maintained. There’s also a risk of demotivating the members of Team B, who may feel that their domain and systems aren’t important, although this can be mitigated to some extent if the project benefits or requires their domain expertise. We’ve had success here in exactly that way in an ongoing project done by our Test Infrastructure team to add data from our CI systems to Services DB, our application-catalog app stewarded by another team, Production Excellence. Their domain expertise allowed them to understand how to expose the data in the most intuitive and useful way, and they were able to more rapidly learn Services DB’s codebase by working together.

3. Tiger Team

A third option we’ve tried out in Dev Infra is a tiger team: “a specialized, cross-functional team brought together to solve or investigate a specific problem or critical issue.” People from multiple teams form a new, temporary team for a single project, often prototyping a new idea. Usually the team operates in a fast-paced, autonomous way towards a very specific goal, so management oversight is fairly limited. By definition, most people on a tiger team don’t usually work together, so the home and new team dichotomy is sidestepped, or at least very deliberately managed. The focus of the team means that members put aside maintenance, support, and other duties from their home team for the duration of the team’s existence.

The very first proof of concept for Spin was built this way over about a month. At that time, the value was sufficiently clear that we then formed a whole team around Spin and staffed it up to tackle the challenge of turning it into a proper product. We’ve learned a lot since then, but that first prototype was crucial in getting the whole project off the ground!

No Perfect Solutions

From thinking about and experimenting with team structures during my decade of management experience, there doesn’t seem to be a perfect solution to balance the three poles of management support, system maintenance and domain expertise, and high-level goals. Each situation is unique, and trade-offs have to be judged and taken deliberately. I would love to hear other stories of such balancing acts! Find me on Twitter and LinkedIn.

Mark Côté is the Director of Engineering, Developer Infrastructure, at Shopify. He's been in the developer-experience space for over a decade and loves thinking about infrastructure-as-product and using mental models in his strategies.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Hubble: Our Tool for Encapsulating and Extending Security Tools

Hubble: Our Tool for Encapsulating and Extending Security Tools

Fundamentally, Shopify is a company that thrives by building simplicity. We take hard, risky, and complex things and make them easy, safe, and simple.

Trust is Shopify’s team responsible for making commerce secure for everyone. First and foremost, that means securing our internal systems and IT resources, and maintaining a strong cybersecurity posture. If you’ve worked in these spaces before, you know that it takes a laundry list of tools to effectively manage and secure a large fleet of computers. Not only does it take tons of tools, but also takes training, access provisioning and deprovisioning, and constant patching. In any large or growing company, these problems compound and can become exponential costs if they aren’t controlled and solved for.

You either pay that cost by spending countless human hours on menial processes and task switching, or you accept the risk of shadow IT—employees developing their own processes and workarounds rather than following best practices. You either get choked by bureaucracy, or you create such a low trust environment that people don’t feel their company is interested in solving their problems.

Shopify is a global company that, in 2020, embraced being Digital by Design—in essence, the firm belief that our people have the greatest impact when we support them to work whenever and wherever they like. As you can imagine, this only magnified the problems described above. With the end of office centricity, suddenly the work of securing our devices got a lot more important, and a lot more difficult. Network environments got more varied, the possibility of in-person patching or remediation went out the window—the list goes on. Faced with these challenges, we searched for off-the-shelf solutions, but couldn’t find anything that fully fit our needs.

Hubble logo which features an image of a telescope followed by the word Hubble
Hubble Logo

So, We Built Hubble.

An evolution of previous internal solutions, Hubble is a tool that encapsulates and extends many of the common tools used in security. Mobile device management services and more are all fully integrated into Hubble. For IT staff, Hubble is a one stop shop for inventory, device management, and security. Rather than granting hundreds of employees access to multiple admin panels, they access Hubble—which ingests and standardizes data from other systems, and then sends commands back to those systems. We also specify levels of granularity in access (a specialist might have more access than an entry level worker, for instance). On the back end, we also track and audit access in one central location with a consistent set of fields—making incident response and investigation less of a rabbit hole.

A screenshot of the Hubble screen on a Macbook Pro. It shows a profile picture and several lines of status updates about the machine
Hubble’s status screen on a user’s machine

For everyone else at Shopify, Hubble is a tool to manage and view the devices that belong to them. At a glance, they can review the health of their device and its compliance, and not just an arbitrary set of metrics, but something that we define and find valuable - things like OS/Patch Compliance, VPN usage, and more. Folks don’t need to ask IT or just wonder if their device is secure. Hubble informs them, either via the website or device notification pings. And if their device isn’t secure, Hubble provides them with actionable information on how to fix it. Users can also specify test devices, or opt in to betas that we run. This enables us to easily build beta cohorts for any testing we might be running. When you give people the tools to be proactive about their security, and show that you support that proactivity, you help build a culture of ownership.

And, perhaps most importantly, Hubble is a single source of truth for all the data it consumes. This makes it easier for other teams to develop automations and security processes. They don’t have to worry about standardizing data, or making calls to 100 different services. They can access Hubble, and trust that the data is reliable and standardized.

Now, why should you care about this? Hubble is an internal tool for Shopify, and unfortunately it isn’t open source at this time. But these two lessons we learned building and realizing Hubble are valuable and applicable anywhere.

1. When the conversation is centered on encapsulation, the result is a partnership in creating a thoughtful and comprehensive solution.

Building and maintaining Hubble requires a lot of teams talking to each other. Developers talk to support staff, security engineers, and compliance managers. While these folks often work near each other, they rarely work directly together. This kind of collaboration is super valuable and can help you identify a lot of opportunities for automation and development. Plus, it presents the opportunity for team members to expand their skills, and maybe have an idea of what their next role could be. Even if you don’t plan to build a tool like this, consider involving frontline staff with the design and engineering processes in your organization. They bring valuable context to the table, and can help surface the real problems that your organization faces.

2. It’s worth fighting for investment.

IT and Cybersecurity are often reactive and ad-hoc driven teams. In the worst cases, this field lends itself to unhealthy cultures and an erratic work life balance. Incident response teams and frontline support staff often have unmanageable workloads and expectations, in large part due to outdated tooling and processes. We strive to make sure it isn’t like that at Shopify, and it doesn’t have to be that way where you work. We’ve been able to use Hubble as a platform for identifying automation opportunities. By having engineering teams connected to support staff via Hubble, we encourage a culture of proactivity. Teams don’t just accept processes as broken and outdated—they know that there’s talent and resources available for solving problems and making things better. Beyond culture and work life balance, consider the financial benefits and risk-minimization that this strategy realizes.

For each new employee onboarded to your IT or Cybersecurity teams, you spend weeks if not months helping them ramp up and safely access systems. This can incur certification and training costs (which can easily run in the thousands of dollars per employee if you pay for their certifications), and a more difficult job search to find the right candidate. Then you take on the risk of all these people having direct access to sensitive systems. And finally, you take on the audit and tracking burden of all of this.

With each tool you add to your environment, you increase complexity exponentially. But there’s a reason those tools exist, and complexity on its own isn’t a good enough reason to reject a tool. This is a field where costs want to grow exponentially. It seems like the default is to either accept that cost and the administrative overhead it brings, or ignore the cost and just eat the risk. It doesn’t have to be that way.

We chose to invest and to build Hubble to solve these problems at Shopify. Encapsulation can keep you secure while keeping everyone sane at the same time.

Tony is a Senior Engineering Program Manager and leads a team focussed on automation and internal support technologies. He’s journaled daily for more than 9 years, and uses it as a fun corpus for natural language analysis. He likes finding old bread recipes and seeing how baking has evolved over time!

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

How to Structure Your Data Team for Maximum Influence

How to Structure Your Data Team for Maximum Influence

One of the biggest challenges most managers face (in any industry) is trying to assign their reports work in an efficient and effective way. But as data science leaders—especially those in an embedded model—we’re often faced with managing teams with responsibilities that traverse multiple areas of a business. This juggling act often involves different streams of work, areas of specialization, and stakeholders. For instance, my team serves five product areas, plus two business areas. Without a strategy for dealing with these stakeholders and related areas of work, we risk operational inefficiency and chaotic outcomes. 

There are many frameworks out there that suggest the most optimal way to structure a team for success. Below, we’ll review these frameworks and their positives and negatives when applied to a data science team. We’ll also share the framework that’s worked best for empowering our data science teams to drive impact.

An example of the number of product and business areas my team supports at Shopify
An example of the number of product and business areas my data team supports at Shopify

First, Some Guiding Principles

Before looking at frameworks for managing these complex team structures, I’ll first describe some effective guiding principles we should use when organizing workflows and teams:  

  1. Efficiency: Any structure must provide an ability to get work done in an efficient and effective manner. 
  2. Influence: Structures must be created in such a way that your data science team continues to have influence on business and product strategies. Data scientists often have input that is critical to business and product success, and we want to create an environment where that input can be given and received.
  3. Stakeholder clarity: We need to create a structure where stakeholders clearly know who to contact to get work done, and seek help and advice from.
  4. Stability: Some teams structures can create instability for reports, which leads to a whole host of other problems.
  5. Growth: If we create structures where reports only deal with stakeholders and reactive issues, it may be difficult for them to develop professionally. We want to ensure reports have time to tackle work that enables them to acquire a depth of knowledge in specific areas.
  6. Flexibility: Life happens. People quit, need change, or move on. Our team structures need to be able to deal with and recognize that change is inevitable. 

Traditional Frameworks for Organizing Data Teams

Alright, now let’s look at some of the more popular frameworks used to organize data teams. While they’re not the only ways to structure teams and align work, these frameworks cover most of the major aspects in organizational strategy. 

Swim Lanes

You’ve likely heard of this framework before, and maybe even cringed when someone has told you or your report to "stay in your swim lanes". This framework involves assigning someone to very strictly defined areas of responsibility. Looking at the product and business areas my own team supports as an example, we have seven different groups to support. According to the swim lane framework, I would assign one data scientist to each group. With an assigned product or business group, their work would never cross lanes. 

In this framework, there's little expected help or cross-training that occurs, and everyone is allowed to operate with their own fiefdom. I once worked in an environment like this. We were a group of tenured data scientists who didn’t really know what the others were doing. It worked for a while, but when change occurred (new projects, resignations, retirements) it all seemed to fall apart.  

Let’s look at this framework’s advantages: 

  • Distinct areas of responsibility. In this framework, everyone has their own area of responsibility. As a manager, I know exactly who to assign work to and where certain tasks should go on our board. I can be somewhat removed from the process of workload balancing.  
  • High levels of individual ownership. Reports own an area of responsibility and have a stake in its success. They also know that their reputation and job are on the line for the success or failure of that area.
  • The point-of-contact is obvious to stakeholders. Ownership is very clear to stakeholders, so they always know who to go. This model also fosters long-term relationships. 

And the disadvantages:

  • Lack of cross-training. Individual reports will have very little knowledge of the work or codebase of their peers. This becomes an issue when life happens and we need to react to change.
  • Reports can be left on an island. Reports can be left alone which tends to matter more when times are tough. This is a problem for both new reports who are trying to onboard and learn new systems, but also for tenured reports who may suddenly endure a higher workload. Help may not be coming.  
  • Fails under high-change environments. For the reasons mentioned above, this system fails under high-change environments. It also creates a team-level rigidity that means when general organizational changes happen, it’s difficult to react and pivot.

Referring back to our guiding principles when considering how to effectively organize a date team, this framework hits our stakeholder clarity and efficiency principles, but only in stable environments. Swim lanes often fail in conditions of change or when the team needs to pivot to new responsibilities—something most teams should expect.  

Stochastic Process

As data scientists, we’re often educated in the stochastic process and this framework resembles this theory. As a refresher, the stochastic process is defined by randomness of assignment, where expected behavior is near random assignments to areas or categories.  

Likewise, in this framework each report takes the next project that pops up, resembling a random assignment of work. However, projects are prioritized and when an employee finishes one project, they take on the next, highest priority project. 

This may sound overly random as a system, but I’ve worked on a team like this before. We were a newly setup team, and no one had any specific experience with any of the work we were doing. The system worked well for about six months, but over the course of a year, we felt like we'd been put through the wringer and as though no one had any deep knowledge of what we were working on.  

The advantages of this framework are:

  • High levels of team collaboration. Everyone is constantly working on each other’s code and projects, so a high-level of collaboration tends to develop.
  • Reports feel like there is always help. Since work is assigned in terms of next priority gets the resource, if someone is struggling with a high-priority task, they can just ask for help from the next available resource.
  • Extremely flexible under high levels of change. Your organization decides to reorg to align to new areas of the business? No problem! You weren’t aligned to any specific groups of stakeholders to begin with. Someone quits? Again, no problem. Just hire someone new and get them into the rotation.

And the disadvantages:

  • Can feel like whiplash. As reports are asked to move constantly from one unrelated project to the next, they can develop feelings of instability and uncertainty (aka whiplash). Additionally, as stakeholders work with a new resource on each project, this can limit the ability to develop rapport.
  • Inability to go deep on specialized subject matters. It’s often advantageous for data scientists to dive deep into one area of the business or product. This enables them to develop deep subject area knowledge in order to build better models. If we’re expecting them to move from project to project, this is unlikely to occur.
  • Extremely high management inputs. As data scientists become more like cogs in a wheel in this type of framework, management ends up owning most stakeholder relationships and business knowledge. This increases demands on individual managers.

Looking at the advantages and disadvantages of this framework, and measuring them against our guiding principles, this framework only hits two of our principles: flexibility and efficiency. While this framework can work in very specific circumstances (like brand new teams), the lack of stakeholder clarity, relationship building, and growth opportunity will result in the failure of this framework to sufficiently serve the needs of the team and stakeholders. 

A New Framework: Diamond Defense 

Luckily, we’ve created a third way to organize data teams and work. I like to compare this framework to the concept of diamond defense in basketball. In diamond defense, players have general areas (zones) of responsibility. However, once play starts, the defense focuses on trapping (sending extra resources) to the toughest problems, while helping out areas in the defense that might be left with fewer resources than needed.  

This same defense method can be used to structure data teams to be highly effective. In this framework, you loosely assign reports to your product or business areas, but ensure to rotate resources to tough projects and where help is needed.

Referring back to the product and business areas my team supports, you can see how I use this framework to organize my team: 

An example of how I use the diamond defense framework to structure my data team and align them to zones of work
An example of how I use the diamond defense framework to structure my data team

Each data scientist is assigned to a zone. I then aligned our additional business areas (Finance and Marketing) to a product group, and assigned resources to these groupings. Finance and Marketing are aligned differently here because they are not supported by a team of Software Engineers. Instead, I aligned them to the product group that mostly closely resembles their work in terms of data accessed and models built. Currently, Marketing has the highest number of requests for our team, so I added more resources to support this group. 

You’ll notice on the chart that I keep myself and an additional data scientist in a bullpen. This is key to the diamond defense as it ensures we always have additional resources to help out where needed. Let’s dive into some examples of how we may use resources in the bullpen: 

  1. DS2 is under-utilized. We simultaneously find out that DS1 is overwhelmed by the work of their product area, so we tap DS2 to help out. 
  2. SR DS1 quits. In this case, we rotate DS4 into their place, and proceed to hire a backfill. 
  3. SR DS2 takes a leave of absence. In this situation, I as the manager slide in to manage SR DS2’s stakeholders. I would then tap DS4 to help out, while the intern who is also assigned to the same area continues to focus on getting their work done with help from DS4. 

This framework has several advantages:

  • Everyone has dedicated areas to cover and specialize in. As each report is loosely assigned to a zone (specific product or business area), they can go deep and develop specialized skills.  
  • Able to quickly jump on problems that pop up. Loose assignment to zones enable teams the flexibility to move resources to the highest-priority areas or toughest problems.  
  • Reports can get the help they need. If a report is struggling with the workload, you can immediately send more resources towards that person to lighten their load.  

And the disadvantages:

  • Over-rotation. In certain high-change circumstances, a situation can develop where data scientists spend most of their time covering for other people. This can create very volatile and high-risk situations, including turnover.

This framework hits all of our guiding principles. It provides the flexibility and stability needed when dealing with change, it enables teams to efficiently tackle problems, focus areas enable report growth and stakeholder clarity, and relationships between reports and their stakeholders improves the team's ability to influence policies and outcomes. 


There are many ways to organize data teams to different business or product areas, stakeholders, and bodies of work. While the traditional frameworks we discussed above can work in the short-term, they tend to over-focus either on rigid areas of responsibility or everyone being able to take on any project. 

If you use one of these frameworks and you’re noticing that your team isn’t working as effectively as you know they can, give our diamond defense framework a try. This hybrid framework addresses all the gaps of the traditional frameworks, and ensures:

  • Reports have focus areas and growth opportunity 
  • Stakeholders have clarity on who to go to
  • Resources are available to handle any change
  • Your data team is set up for long-term success and impact

Every business and team is different, so we encourage you to play around with this framework and identify how you can make it work for your team. Just remember to reference our guiding principles for complex team structures.

Levi manages the Banking and Accounting data team at Shopify. He enjoys finding elegant solutions to real-world business problems using math, machine learning, and elegant data models. In his spare time he enjoys running, spending time with his wife and daughters, and farming. Levi can be reached via LinkedIn.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

Finding Relationships Between Ruby’s Top 100 Packages and Their Dependencies

Finding Relationships Between Ruby’s Top 100 Packages and Their Dependencies

In June of this year, RubyGems, the main repository for Ruby packages (gems), announced that multi-factor authentication (MFA) was going to be gradually rolled out to users. This means that users eventually will need to login with a one-time password from their authenticator device, which will drastically reduce account takeovers.

The team I'm interning on, the Ruby Dependency Security team at Shopify, played a big part in rolling out MFA to RubyGems users. The team’s mission is to increase the security of the Ruby software supply chain, so increasing MFA usage is something we wanted to help implement.

A large Ruby with stick arms and leg pats a little Ruby with stick arms and legs
Illustration by Kevin Lin

One interesting decision that the RubyGems team faced is determining who was included in the first milestone. The team wanted to include at least the top 100 RubyGems packages, but also wanted to prevent packages (and people) from falling out of this cohort in the future.

To meet those criteria, the team set a threshold of 180 million downloads for the gems instead. Once a gem crosses 180 million downloads, its owners are required to use multi-factor authentication in the future.

Bar graph showing gem download numbers for Gem 1 and Gem 2
Gem downloads represented as bars. Gem 2 is over the 180M download threshold, so its owners would need MFA.

This design decision led me to a curiosity. As packages frequently depend on other packages, could some of these big (more than 180M downloads) packages depend on small (less than 180M downloads) packages? If this was the case, then there would be a small loophole: if a hacker wanted to maximize their reach in the Ruby ecosystem, they could target one of these small packages (which would get installed every time someone installed one of the big packages), circumventing the MFA protection of the big packages.

On the surface, it might not make sense that a dependency would ever have fewer downloads than its parent. After all, every time the parent gets downloaded, the dependency does too, so surely the dependency has at least as many downloads as the parent, right?

Screenshot of a Slack conversation between coworkers discussing one's scepticism about finding exceptions
My coworker Jacques, doubting that big gems will rely on small gems. He tells me he finds this hilarious in retrospect.

Well, I thought I should try to find exceptions anyway, and given that this blog post exists, it would seem that I found some. Here’s how I did it.

The Investigation

The first step in determining if big packages depended on small packages was to get a list of big packages. The stats page shows the top 100 gems in terms of downloads, but the last gem on page 10 has 199 million downloads, meaning that scraping these pages would yield an incomplete list, since the threshold I was interested in is 180 million downloads.

A screenshot of a page of statistics
Page 10 of, just a bit above the MFA download threshold

To get a complete list, I instead turned to using the data dumps that makes available. Basically, the site takes a daily snapshot of the database, removes any confidential information, and then publishes it. Their repo has a convenient script that allows you to load these data dumps into your own local database, and therefore run queries on the data using the Rails console. It took me many tries to make a query that got all the big packages, but I eventually found one that worked:

Rubygem.joins(:gem_download).where(gem_download: {count: 180_000_000..}).map(&:name)

I now had a list of 112 big gems, and I had to find their dependencies. The first method I tried was using the API. As described in the documentation, you can give the API the name of a gem and it’ll give you the name of all of its dependencies as part of the response payload. The same endpoint of this API also tells you how many downloads a gem has, so the path was clear: for each big gem, get a list of its dependencies and find out if any of them had fewer downloads than the threshold.

Here are the functions that get the dependencies and downloads:

Ruby function that gets a list of dependencies as reported by the API. Requires built-in uri, net/http, and json packages.
Ruby function that gets downloads from the same API endpoint. Also has a branch to check the download count for specific versions of gems, that I later used.

Putting all of this together, I found that 13 out of the 112 big gems had small gems as dependencies. Exceptions! So why did these small gems have fewer downloads than their parents? I learned that it was mainly due to two reasons:

  1. Some gems are newer than their parents, that is, a new gem came out and a big gem developer wanted to add it as a dependency.
  2. Some gems are shipped with Ruby by default, so they don’t need to be downloaded and thus have low(er) download count (for example, racc and rexml).

With this, I now had proof of the existence of big gems that would be indirectly vulnerable to account takeover of a small gem. While an existence proof is nice, it was pointed out to me that the API only returns a list symbolic of the direct dependencies of a gem, and that those dependencies might have sub-dependencies that I wasn’t checking. So how could I find out which packages get installed when one of these big gems gets installed?

With Bundler, of course!

Bundler is the Ruby dependency manager software that most Ruby users are probably familiar with. Bundler takes a list of gems to install (the Gemfile), installs dependencies that satisfy all version requirements, and, crucially for us, makes a list of all those dependencies and versions in a Gemfile.lock file. So, to find out which big gems relied in any way on small gems, I programmatically created a Gemfile with only the big gem in it, programmatically ran bundle lock, and programmatically read the Gemfile.lock that was created to get all the dependencies.

Here’s the function that did all the work with Bundler:

Ruby function that gets all dependencies that get installed when one gem is installed using Bundler

With this new methodology, I found that 24 of the 112 big gems rely on small gems, which is a fairly significant proportion of them. After discovering this, I wanted to look into visualization. Up until this point, I was just printing out results to the command line to make text dumps like this:

Text dump of dependency results. Big gems are red, their dependencies that are small are indented in black
Text dump of dependency results. Big gems are red, their dependencies that are small are indented in black

This visualization isn’t very convenient to read, and it misses out on patterns. For example, as you can see above, many big gems rely on racc. It would be useful to know if they relied directly on it, or if most packages depended on it indirectly through some other package. The idea of making a graph was in the back of my mind since the beginning of this project, and when I realized how helpful it might be, I committed to it. I used the graph gem, following some examples from this talk by Aja Hammerly. I used a breadth-first search, starting with a queue of all the big gems, adding direct dependencies to the queue as I went. I added edges from gems to their dependencies and highlighted small gems in red. Here was the first iteration:

The output of the graph gem that highlights gem dependencies
The first iteration

It turns out there a lot of AWS gems, so I decided to remove them from the graph and got a much nicer result:

The output of the graph gem that highlights gem dependencies
Full size image link if you want to zoom and pan

The graph, while moderately cluttered, shows a lot of information succinctly. For instance, you can see a galaxy of gems in the middle-left, with rails being the gravitational attractor, a clear keystone in the Ruby world.

Output of the gem graph with Rails at the center
The Rails galaxy

The node with the most arrows pointing into it is activesupport, so it really is an active support.

A close up view of activesupport in the output of the gem graph. activesupport has many arrows pointing into it.
14 arrows pointing into activesupport

Racc, despite appearing in my printouts as a small gem for many big gems, is only the dependency of nokogiri.

A close up view of racc in the output of the gems graph
racc only has 1 edge attached to it

With this nice graph created, I followed up and made one final printout. This time, whenever I found a big gem that depended on a small gem, I printed out all the paths on the graph from the big gem to the small gem, that is, all the ways that the big gem relied on the small gem.

Here’s an example printout:

Big gem is in green (googleauth), small gems are in purple, and the black lines are all the paths from the big gem to the small gem.

I achieved this by making a directional graph data type and writing a depth-first search algorithm to find all the paths from one node to another. I chose to create my own data type because finding all paths on a graph isn’t already implemented in any Ruby gem from what I could tell. Here’s the algorithm, if you’re interested (`@graph` is a Hash of `String:Array` pairs, essentially an adjacency list):

Recursive depth-first search to find all paths from start to end

What’s Next

In summary, I found four ways to answer the question of whether or not big gems rely on small gems:

  1. direct dependency printout (using API)
  2. sub-dependency printout (using Bundler)
  3. graph (using graph gem)
  4. sub-dependency printout with paths (2. using my own graph data type).

I’m happy with my work, and I’m glad I got to learn about file I/O and use graph theory. I’m still relatively new to Ruby, so offshoot projects like these are very didactic.

The question remains of what to do with the 24 technically insecure gems. One proposal is to do nothing, since everyone will eventually need to have MFA enabled, and account takeover is still an uncommon event despite being on the rise. 

Another option is to enforce MFA on these specific gems as a sort of blocklist, just to ensure the security of the top gems sooner. This would mean a small group of owners would have to enable MFA a few months earlier, so I could see this being a viable option. 

Either way, more discussion with my team is needed. Thanks for reading!

Kevin is an intern on the Ruby Dependency Security team at Shopify. He is in his 5th year of Engineering Physics at the University of British Columbia.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

How to Write Code Without Having to Read It

How to Write Code Without Having to Read It

Do we need to read code before editing it?

The idea isn’t as wild as it sounds. In order to safely fix a bug or update a feature, we may need to learn some things about the code. However, we’d prefer to learn only that information. Not only does extra reading waste time, it overcomplicates our mental model. As our model grows, we’re more likely to get confused and lose track of critical info.

But can we really get away with reading nothing? Spoiler: no. However, we can get closer by skipping over areas that we know the computer is checking, saving our focus for areas that are susceptible to human error. In doing so, we’ll learn how to identify and eliminate those danger areas, so the next person can get away with reading even less.

Let’s give it a try.

Find the Entrypoint

If we’re refactoring code, we already know where we need to edit. Otherwise, we’re changing a behavior that has side effects. In a backend context, these behaviors would usually be exposed APIs. On the frontend, this would usually be something that’s displayed on the screen. For the sake of example, we’ll imagine a mobile application using React Native and Typescript, but this process generalizes to other contexts (as long as they have some concept of build or test errors; more on this later).

If our goal was to read a lot of code, we might search for all hits on RelevantFeatureName. But we don’t want to do that. Even if we weren’t trying to minimize reading, we’ll run into problems if the code we need to modify is called AlternateFeatureName, SubfeatureName, or LegacyFeatureNameNoOneRemembersAnymore.

Instead, we’ll look for something external: the user-visible strings (including accessibility labels—we did remember to add those, right?) on the screen we’re interested in. We search various combinations of string fragments, quotation marks, and UI inspectors until we find the matching string, either in the application code or in a language localization file. If we’re in a localization file, the localization key leads us to the application code that we’re interested in.

If we’re dealing with a regression, there’s an easier option: git bisect. When git bisect works, we really don’t need to read the code. In fact, we can skip most of the following steps. Because this is such a dramatic shortcut, always keep track of which bugs are regressions from previously working code.

Make the First Edit

If we’ve come in to make a simple copy edit, we’re done. If not, we’re looking for a component that ultimately gets populated by the server, disk, or user. We can no longer use exact strings, but we do have several read-minimizing strategies for zeroing in on the component:

  1. Where is this component on the screen, relative to our known piece of text?
  2. What type of standard component is it using? Is it a button? Text input? Text?
  3. Does it have some unusual style parameter that’s easy to search? Color? Corner radius? Shadow?
  4. Which button launches this UI? Does the button have searchable user-facing text?

These strategies all work regardless of naming conventions and code structure. Previous developers would have a hard time making our life harder without making the code nonfunctional. However, they may be able to make our life easier with better structure. 

For example, if we’re using strategy #1, well-abstracted code helps us quickly rule out large areas of the screen. If we’re looking for some text near the bottom of the screen, it’s much easier to hit the right Text item if we can leverage a grouping like this:

<SomeHeader />
<SomeContent />
<SomeFooter />

rather than being stuck searching through something like this:

// Header
<StaticImage />
<Text />
<Text />
<Button />
<Text />
// Content
// Footer

where we’ll have to step over many irrelevant hits.

Abstraction helps even if the previous developer chose wacky names for header, content, or footer, because we only care about the broad order of elements on the screen. We’re not really reading the code. We’re looking for objective cues like positioning. If we’re still unsure, we can comment out chunks of the screen, starting with larger or highly-abstracted components first, until the specific item we care about disappears.

Once we’ve found the exact component that needs to behave differently, we can make the breaking change right now, as if we’ve already finished updating the code. For example, if we’re making a new component that displays data newText, we add that parameter to its parent’s input arguments, breaking the build.

If we’re fixing a bug, we can also start by adjusting an argument list. For example, the condition “we shouldn’t be displaying x if y is present” could be represented with the tagged union {mode: 'x', x: XType} | {mode: 'y'; y: YType}, so it’s physically impossible to pass in x and y at the same time. This will also trigger some build errors.

Tagged Unions
Tagged unions go by a variety of different names and syntaxes depending on language. They’re most commonly referred to as discriminated unions, enums with associated values, or sum types.

Climb Up the Callstack

We now go up the callstack until the build errors go away. At each stage, we edit the caller as if we’ll get the right input, triggering the next round of build errors. Notice that we’re still not reading the code here—we’re reading the build errors. Unless a previous developer has done something that breaks the chain of build errors (for example, accepting any instead of a strict type), their choices don’t have any effect on us.

Once we get to the top of the chain, we adjust the business logic to grab newText or modify the conditional that was incorrectly sending x. At this point, we might be done. But often, our change could or should affect the behavior of other features that we may not have thought about. We need to sweep back down through the callstack to apply any remaining adjustments. 

On the downswing, previous developers’ choices start to matter. In the worst case, we’ll need to comb through the code ourselves, hoping that we catch all the related areas. But if the existing code is well structured, we’ll have contextual recommendations guiding us along the way: “because you changed this code, you might also like…”

Update Recommended Code

As we begin the downswing, our first line of defense is the linter. If we’ve used a deprecated library, or a pattern that creates non-obvious edge cases, the linter may be able to flag it for us. If previous developers forgot to update the linter, we’ll have to figure this out manually. Are other areas in the codebase calling the same library? Is this pattern discouraged in documentation?

After the linter, we may get additional build errors. Maybe we changed a function to return a new type, and now some other consumer of that output raises a type error. We can then update that other consumer’s logic as needed. If we added more cases to an enum, perhaps we get errors from other exhaustive switches that use the enum, reminding us that we may need to add handling for the new case. All this depends on how much the previous developers leaned on the type system. If they didn’t, we’ll have to find these related sites manually. One trick is to temporarily change the types we’re emitting, so all consumers of our output will error out, and we can check if they need updates.

An exhaustive switch statement handles every possible enum case. Most environments don’t enforce exhaustiveness out of the box. For example, in Typescript, we need to have strictNullChecks turned on, and ensure that the switch statement has a defined return type. Once exhaustiveness is enforced, we can remove default cases, so we’ll get notified (with build errors) whenever the enum changes, reminding us that we need to reassess this switch statement.

Our final wave of recommendations comes from unit test failures. At this point, we may also run into UI and integration tests. These involve a lot more reading than we’d prefer; since these tests require heavy mocking, much of the text is just noise. Also, they often fail for unimportant reasons, like timing issues and incomplete mocks. On the other hand, unit tests sometimes get a bad rap for requiring code restructures, usually into more or smaller abstraction layers. At first glance, it can seem like they make the application code more complex. But we didn’t need to read the application code at all! For us, it’s best if previous developers optimized for simple, easy-to-interpret unit tests. If they didn’t, we’ll have to find these issues manually. One strategy is to check git blame on the lines we changed. Maybe the commit message, ticket, or pull request text will explain why the feature was previously written that way, and any regressions we might cause if we change it.

At no point in this process are comments useful to us. We may have passed some on the upswing, noting them down to address later. Any comments that are supposed to flag problems on the downswing are totally invisible—we aren’t guaranteed to find those areas unless they’re already flagged by an error or test failure. And whether we found comments on the upswing or through manual checking, they could be stale. We can’t know if they’re still valid without reading the code underneath them. If something is important enough to be protected with a comment, it should be protected with unit tests, build errors, or lint errors instead. That way it gets noticed regardless of how attentive future readers are, and it’s better protected against staleness. This approach also saves mental bandwidth when people are touching nearby code. Unlike standard comments, test assertions only pop when the code they’re explaining has changed. When they’re not needed, they stay out of the way.

Clean Up

Having mostly skipped the reading phase, we now have plenty of time to polish up our code. This is also an opportunity to revisit areas that gave us trouble on the downswing. If we had to read through any code manually, now’s the time to fix that for future (non)readers.

Update the Linter

If we need to enforce a standard practice, such as using a specific library or a shared pattern, codify it in the linter so future developers don’t have to find it themselves. This can trigger larger-scale refactors, so it may be worth spinning off into a separate changeset.

Lean on the Type System

Wherever practical, we turn primitive types (bools, numbers, and strings) into custom types, so future developers know which methods will give them valid outputs to feed into a given input. A primitive like timeInMilliseconds: number is more vulnerable to mistakes than time: MillisecondsType, which will raise a build error if it receives a value in SecondsType. When using enums, we enforce exhaustive switches, so a build error will appear any time a new case may need to be handled.

We also check methods for any non-independent arguments:

  • Argument A must always be null if Argument B is non-null, and vice versa (for example, error/response).
  • If Argument A is passed in, Argument B must also be passed in (for example, eventId/eventTimestamp).
  • If Flag A is off, Flag B can’t possibly be on (for example, visible/highlighted).

If these arguments are kept separate, future developers will need to think about whether they’re passing in a valid combination of arguments. Instead, we combine them, so the type system will only allow valid combinations:

  • If one argument must be null when the other is non-null, combine them into a tagged union: {type: 'failure'; error: ErrorType} | {type: 'success'; response: ResponseType}.
  • If two arguments must be passed in together, nest them into a single object: event: {id: IDType; timestamp: TimestampType}.
  • If two flags don’t vary independently, combine them into a single enum: 'hidden'|'visible'|'highlighted'.

Optimize for Simple Unit Tests

When testing, avoid entanglement with UI, disk or database access, the network, async code, current date and time, or shared state. All of these factors produce or consume side effects, clogging up the tests with setup and teardown. Not only does this spike the rate of false positives, it forces future developers to learn lots of context in order to interpret a real failure.

Instead, we want to structure our code so that we can write simple tests. As we saw, people can often skip reading our application code. When test failures appear, they have to interact with them. If they can understand the failure quickly, they’re more likely to pay attention to it, rather than adjusting the failing assertion and moving on. If a test is starting to get complicated, go back to the application code and break it into smaller pieces. Move any what code (code that decides which side effects should happen) into pure functions, separate from the how code (code that actually performs the side effects). Once we’re done, the how code won’t contain any nontrivial logic, and the what code can be tested—and therefore documented—without complex mocks.

Trivial vs. Nontrivial Logic
Trivial logic would be something like if (shouldShow) show(). Something like if (newUser) show() is nontrivial (business) logic, because it’s specific to our application or feature. We can’t be sure it’s correct unless we already know the expected behavior.

Whenever we feel an urge to write a comment, that’s a signal to add more tests. Split the logic out into its own unit tested function so the “comment” will appear automatically, regardless of how carefully the next developer is reading our code.

We can also add UI and integration tests, if desired. However, be cautious of the impulse to replace unit tests with other kinds of tests. That usually means our code requires too much reading. If we can’t figure out a way to run our code without lengthy setup or mocks, humans will need to do a similar amount of mental setup to run our code in their heads. Rather than avoiding unit tests, we need to chunk our code into smaller pieces until the unit tests become easy.


Once we’ve finished polishing our code, we manually test it for any issues. This may seem late, but we’ve converted many runtime bugs into lint, build, and test errors. Surprisingly often, we’ll find that we’ve already handled all the edge cases, even if we’re running the code for the first time.

If not, we can do a couple more passes to address the lingering issues… adjusting the code for better “unread”-ability as we go.

Sometimes, our end goal really is to read the code. For example, we might be reviewing someone else’s code, verifying the current behavior, or ruling out bugs. We can still pose our questions as writes:

  • Could a developer have done this accidentally, or does the linter block it when we try?
  • Is it possible to pass this bad combination of arguments, or would that be rejected at build time?
  • If we hardcode this value, which features (represented by unit tests) would stop working?

JM Neri is a senior mobile developer on the Shop Pay team, working out of Colorado. When not busy writing unit tests or adapting components for larger text sizes, JM is usually playing in or planning for a TTRPG campaign.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

A Flexible Framework for Effective Pair Programming

A Flexible Framework for Effective Pair Programming

Pair programming is one of the most important tools we use while mentoring early talent in the Dev Degree program. It’s an agile software development technique where two people work together, either to share context, solve a problem, or learn from one another. Pairing builds technical and communication skills, encourages curiosity and creative problem-solving, and brings people closer together as teammates.

In my role as a Technical Educator, I’m focused on setting new interns joining the Dev Degree program up for success in their first 8 months at Shopify. Because pair programming is a method we use so frequently in onboarding, I saw an opportunity to streamline the process to make it more approachable for people who might not have experienced it before. I developed this framework during a live workshop I hosted at RenderATL. I hope it helps you structure your next pair programming session!

Pair Programming in Dev Degree

“Pair programming was my favorite weekly activity while on the Training Path. When I first joined my team, I was constantly pairing with my teammates. My experiences on the Training Path made these new situations manageable and incredibly fun! It was also a great way to socialize and work technically with peers outside of my school cohort that I wouldn't talk to often. I've made some good friends just working on a little project for the weekly module.” - Mikail Karimi, Dev Degree Intern

In the Dev Degree program, we use mentorship to build up the interns’ technical and soft skills. As interns are usually early in their career journey, having recently graduated from high school or switched careers, their needs differ from someone with more career experience. Mentorship from experienced developers is crucial to prepare interns for their first development team placement. It shortens the time it takes to start making a positive impact with their work by developing their technical and soft skills like problem solving and communication. This is especially important now that Shopify is digital by design, as learning and working remotely is a completely new experience to many interns. 

All first year Dev Degree interns at Shopify go through an eight-month period known as the Training Path. During this period, we deliver a set of courses designed to teach them all about Shopify-specific development. They’re mentored by Student Success Specialists, who coach them on building their soft skills like communication, and Technical Instructors , who focus on the technical aspects of the training. Pair programming with a peer or mentor is a great way to support both of these areas of development.

Each week, we allocate two to three hours for interns to pair program with each other on a problem. We don’t expect them to solve the problem completely, but they should use the concepts they learned from the week to hone their technical craft.

We also set up a bi-weekly 30 minute pair programming sessions with each intern. The purpose of these sessions is to provide dedicated one-on-one time to learn and work directly with an instructor. They can share what they are having trouble with, and we help them work through it.

“When I’m switching teams and disciplines, pair programming with my new team is extremely helpful to see what resources people use to debug, the internal tools they use to find information and how they approach a problem. On my current placement, I got better at resolving problems independently when I saw how my mentor handled a new problem neither of us had seen.” Sanaa Syed, Dev Degree Intern

As we scale up the program, there are some important questions I keep returning to:

  • How do we track their progress most effectively?
  • How do we know what they want to pair on each day?
  • How can we provide a safe space?
  • What are some best practices for communicating?

I started working on a framework to help solve these issues. I know I’m not the only one on my team who may be asking themselves this. Along the way, an opportunity arose to do a workshop at RenderATL. At Shopify, we are encouraged to learn as part of our professional development. Wanting to level up my public speaking skills, I decided to talk about mentorship through a pair programming lens. As the framework was nearly completed, I decided to crowdsource and finish the framework together with the RenderATL attendees.

Crowdsourcing a Framework for All

On June 1st, 2022, Shopify hosted free all-day workshops at RenderATL called Heavy Hitting React at Shopify Engineering. It contained five different workshops, covering a range of topics from specific technical skills like React Native to broader skills like communication. We received a lot of positive feedback, met many amazing folks, and made sure those who attended gained new knowledge or skills they could walk away with.

For my workshop, Let’s Pair Program a Framework Together, we pair programmed a pair programming framework. The goal was to crowdsource and finish the pair programming framework I was working on based on the questions I mentioned above. We had over 30 attendees, and the session was structured to be interactive. I walked the audience through the framework and got their suggestions on the unfinished parts of the framework. At the end, the attendees paired up and used the framework to work together and draw a picture they both wanted to draw.

Before the workshop, I sent a survey internally asking developers a few questions about pair programming. Here are the results:

  • 62.5% had over 10 years of programming experience
  • 78.1% had pair programmed before joining Shopify
  • 50% of the surveyor pair once or twice a week at Shopify

When asked “What is one important trait to have when pair programming?”, this is what Shopify developers had to say:


  • Expressing thought processes (what you’re doing, why you’re making this change, etc.)
  • Sharing context to help others get a thorough understanding
  • Use of visual aids to assist with explanation


  • Being aware of energy levels
  • Not being judgemental to others


  • Curious to learn
  • Willingness to take feedback and improve
  • Don’t adhere to just one’s opinion


  • Providing time to allow your partner to think and formulate opinions
  • Encourage repetition of steps, instructions to encourage question asking and learn by doing

Now, let’s walk through the crowdsourced framework we finished at RenderATL.

Note: For those who attended the workshop, the framework below is the same framework that you walked away with, but with more details and resources.

The Framework

A women with her back to the viewer writing on a whiteboard
Pair programming can be used for more than just writing code. Try pairing on other problems and using tools like a whiteboard to work through ideas together.

This framework covers everything you need to run a successful pair programming session, including: roles, structure, agenda, environment, and communication. You can pick and choose within each section to design your session based on your needs.

1a. Pair Programming Styles

There are many different ways to run a pair programming session. Here are the ones we found to be the most useful, and when you may want to use each depending on your preferences and goals.

Driver and Navigator

Think about this style like a long road trip. One person is focused on driving to get from point A to B. While the other person is providing directions, looking for future pit stops for breaks, and just observing the surroundings. As driving can be taxing, it’s a good idea to switch roles frequently.

The driver is the person leading the session and typing on the keyboard. As they are typing, they’re explaining their thought process. The navigator, also known as the observer, is the person observing, reviewing code that’s being written, and making suggestions along the way. For example, suggesting refactoring code and thinking about potential edge cases.

If you’re an experienced person pairing with an intern or junior developer, I recommend using this style after you paired together a few sessions. They’re likely still gaining context and getting comfortable with the code base in the first few sessions.

Tour Guide

This style is like giving someone a personal tour of the city. The experienced person drives most of the session, hence the title tour guide. While the partner is just observing and asking questions along the way.

I suggest using this style when working with someone new on your team. It’s a great way to give them a personal tour to how your team’s application works and share context along the way. You can also flip it, where the least experienced person is the tour guide. I like to do this with the Dev Degree interns who are a bit further into their training when I pair with them. I find it helps bring out their communication skills once they’ve started to gain some confidence in their technical abilities.


The unstructured style is more of a freestyle way to work on something together, like learning a new language or concept. The benefit of the unstructured style is the team building and creative solutions that can come from two people hacking away and figuring things out. This is useful when a junior developer or intern pairs with someone at their level. The only downside is that without a mentor overseeing them, there’s a risk of missteps or bad habits going unchecked. This can be solved after the session by having them share their findings with a mentor for discussion.

We allocated time for the interns to pair together. This is the style the interns typically go with, figuring things out using the concepts they learned.

1b. Activities

When people think of pair programming, they often strictly think of coding. But pair programming can be effective for a range of activities. Here are some we suggest.

Code Reviews

I remember when I first started reviewing code, I wasn’t sure what exactly I was meant to be looking for. Having an experienced mentor support with code reviews helps early talent pick up context and catch things that they might not otherwise know to look for. Interns also bring a fresh perspective, which can benefit mentors as well by prompting them to unpack why they might make certain decisions.

Technical Design and Documentation

Working together to design a new feature or go through team documents. If you put yourself in a junior developer’s shoes, what would that look like? It could look like a whiteboarding session mapping out logic for the new feature or improving the team’s onboarding documentation. This could be an incredibly impactful session for them. You’ll be helping broaden their technical depth, helping future teammates onboard faster, and sharing your expertise along the way.

Writing Test Cases Only

Imagine you’re a junior developer working on your very first task. You have finished writing the functionality, but haven’t written any tests for it. You tested it manually and know it works. One thing you’re trying to figure out is how to write a test for it now. This is where a pair programming session with an experienced developer is beneficial. You work together to extend testing coverage and learn team-specific styles when writing tests.


Pairing is a great way to help onboard someone new onto your team. It helps the new person joining your team ramp up quicker with your wealth of knowledge. Together you explore the codebase, documentation, and team-specific rituals.

Hunting Bugs

Put yourselves in your users’ shoes and go on a bug hunt together. As you test functionalities on production, you'll gain context on the product and reduce the number of bugs on your application. A win-win!

2. Set an Agenda

A photo of a women writing in a notebook on a desk with a coffee and croissant
Pair programming can be used for more than just writing code. Try pairing on other problems and using tools like a whiteboard to work through ideas together.

Setting an agenda beforehand is key to making sure your session is successful.

Before the session, work with your partner to align on the style, activity, and goals for your pair programming session. This way you can hit the ground running while pairing and work together to achieve a common goal. Here are some questions you can use to set your agenda: 

  • What do you want to pair on today?
  • How do you want the session to go? You drive? I drive? Or both?
  • Where should we be by the end of the session?
  • Is there a specific skill you want to work on?
  • What’s blocking you?

3. Set the Rules of Engagement

A classroom with two desks and chairs in the foreground 
Think of your pair programming session like a classroom. How would you make sure it’s a great environment to learn?

"After years of frequent pair programming, my teammates and I have established patterns that give the impression that we are always right next to each other, which makes context sharing and learning from peers much simpler." -Olakitan Bello, Developer

Now that your agenda is set, it’s time to think about the environment you want to have during the session. Imagine yourself as a teacher. If this was a classroom, how would you provide the best learning environment?

Be Inclusive

Everyone should feel welcomed and invited to collaborate with others. One way we can set the tone is to establish that “There are no wrong answers here” or “There are no dumb questions.” If you’re a senior colleague, saying “I don’t know” to your partner is a very powerful thing. It shows that you’re a human too! Keep accessibility in mind as well. There are tools and styles available to tailor pair programming sessions to the needs of you and your partner. For example, there are alternatives to verbal communication, like using a digital whiteboard or even sending messages over a communication platform. Invite people to be open about how they work best and support each other to create the right environment.

Remember That Silence Isn’t Always Golden

If it gets very quiet as you’re pairing together, it’s usually not a good sign. When you pair program with someone, communication is very important to both parties. Without it, it’s hard for one person to perceive the other person’s thoughts and feelings. Make a habit of explaining your thought process out loud as you work. If you need a moment to gather your thoughts, simply let your partner know instead of going silent on them without explanation.

Respect Each Other

Treat them like how you want to be treated, and value all opinions. Someone's opinions can help lead to an even greater solution. Everyone should be able to contribute and express their opinions.

Be Empathic Not Apathetic

If you’re pair programming remotely, displaying empathy goes a long way. As you’re pair programming with them, read the room. Do they feel flustered with you driving too fast? Are you aware of their emotional needs at the moment?

As you’re working together, listen attentively and provide them space to contribute and formulate opinions.

Mistakes Are Learning Opportunities

If you made a mistake while pair programming, don’t be embarrassed about it. Mistakes happen, and are actually opportunities to learn. If you notice your partner make a mistake, point it out politely—no need to make a big deal of it.

4. Communicate Well

A room with five happy people talking in a meeting
Communication is one of the most important aspects of pair programming.

Pair programming is all about communication. For two people to work together to build something, both need to be on the same page. Remote work can introduce unique communication challenges, since you don’t have the benefit of things like body language or gestures that come with being in the same physical room. Fortunately, there’s great tooling available, like Tuple, to solve these challenges and even enhance the pair programming experience. It’s a MacOS only application which allows people to pair program with each other remotely. Users can share their screen and either can take control to drive. The best part is that it’s a seamless experience without any additional UI taking up space on your screen.

During your session, use these tips to make sure you’re communicating with intention. 

Use Open-ended Questions

Open-ended questions lead to longer dialogues and provide a moment for someone to think critically. Even if they don’t know, it lets them learn something new that they will take away from the session. With closed questions, it’s usually a “yes” or a “no” only. Let’s say we’re working together on building a grouped React component of buttons, which one sounds more inviting for a discussion:

  • Is ButtonGroup a good name for the component? (Close-ended question)
  • What do you think of the name ButtonGroup? (Open-ended question)

Other examples of open-ended questions:

  • What are some approaches you took to solving this issue?
  • Before we try this approach, what do you think will happen?
  • What do you think this block of code is doing?

Give Positive Affirmations

Encouragement goes a long way, especially when folks are finding their footing early in their career. After all, knowing what’s gone right can be just as important as knowing what’s gone wrong. Throughout the session, pause and celebrate progress by noting when you see your partner do something well.

For example, you and your partner are building a new endpoint that’s part of a new feature your team is implementing. Instead of waiting until the feature is released, celebrate the small win.

Here are a few example messages you can give:

  • It looks like we marked everything off the agenda. Awesome session today.
  • Great work on catching the error. This is why I love pair programming because we work together as a team.
  • Huge win today! The PR we worked together is going to help not only our team, but others as well.

Communication Pitfalls to Avoid

No matter how it was intended, a rude or condescending comment made in passing can throw off the vibe of a pair programming session and quickly erode trust between partners. Remember that programming in front of someone can be a vulnerable experience, especially for someone just starting out. While it might seem obvious, we all need a reminder sometimes to be mindful of what we say. Here are some things to watch out for.

Passive-Aggressive (Or Just-Plain-Aggressive) Comments

Passive-aggressive behavior is when someone expresses anger or negative feelings in an indirect way. If you’re feeling frustrated during a session, avoid slipping into this behavior. Instead, communicate your feelings directly in a constructive way. When negative feelings are expressed in a hostile or outwardly rude way, this is aggressive behavior and should be avoided completely. 

Passive-aggressive behavior examples:

  • Giving your partner the silent treatment
  • Eye-rolling or sighing in frustration when your partner makes a mistake
  • Sarcastic comments like “Could you possibly code any slower?” 
  • Subtle digs like “Well…that’s an interesting idea.”

Aggressive behavior examples

  • “Typical intern mistake, rookie.”
  • “This is NOT how someone at your level should be working.”
  • “I thought you knew how to do this.”

Absolute Words Like Always and Never

Absolute words means you are 100 percent certain. In oral communication, depending on the use of the absolute word, it can sound condescending. Programming is also a world full of nuance, so overgeneralizing solutions as right or wrong is often misleading. Instead, use these scenarios as a teaching opportunity. If something is usually true, explain the scenarios where there might be exceptions. If a solution rarely works, explain when it might. Edge cases can open some really interesting and valuable conversations.

For example:

  • “You never write perfect code”
  • “I always review code”
  • “That would never work”

Use alternative terms instead:

  • For always, you can use usually
  • For never, you can use rarely

“When you join Shopify, I think it's overwhelming given the amount of resources to explore even for experienced people. Pair programming is the best way to gain context and learn. In this digital by design world, pair programming really helped me to connect with team members and gain context and learn how things work here in Shopify which helped me with faster onboarding.” -Nikita Acharya, Developer

I want to thank everyone at RenderATL who helped me finish this pair programming framework. If you’re early on in your career as a developer, pair programming is a great way to get to know your teammates and build your skills. And if you’re an experienced developer, I hope you’ll consider mentoring newer developers using pair programming. Either way, this framework should give you a starting point to give it a try.

In the pilot we’ve run so far with this framework, we’ve received positive feedback from our interns about how it allowed them to achieve their learning goals in a flexible format. We’re still experimenting and iterating with it, so we’d love to hear your feedback if you give it a try! Happy pairing!

Raymond Chung is helping to build a new generation of software developers through Shopify’s Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you’ll find Raymond exploring for the best bubble tea shop. You can follow Raymond on Twitter, GitHub, or LinkedIn.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Lessons From Building Android Widgets

Lessons From Building Android Widgets

By Matt Bowen, James Lockhart, Cecilia Hunka, and Carlos Pereira

When the new widget announcement was made for iOS 14, our iOS team went right to work designing an experience to leverage the new platform. However, widgets aren’t new to Android and have been around for over a decade. Shopify cares deeply about its mobile experience and for as long as we’ve had the Shopify mobile app, both our Android and iOS teams ship every feature one-to-one in unison. With the spotlight now on iOS 14, this was a perfect time to revisit our offering on Android.

Since our contribution was the same across both platforms, just like our iOS counterparts at the time, we knew merchants were using our widgets, but they needed more.

Why Widgets are Important to Shopify

Our widgets mainly focus on analytics that help merchants understand how they’re doing and gain insights to make better decisions quickly about their business. Monitoring metrics is a daily activity for a lot of our merchants, and on mobile, we have the opportunity to give merchants a faster way to access this data through widgets. They provide merchants a unique avenue to quickly get a pulse on their shops that isn’t available on the web.

Add Insights widget
Add Insights widget

After gathering feedback and continuously looking for opportunities to enhance our widget capabilities, we’re at our third iteration, and we’ll share with you the challenges we faced and how we solved them.

Why We Didn’t Use React Native

A couple of years ago Shopify decided to go full on React Native. New development should be done in React Native, and we’re also migrating some apps to the technology. This includes the flagship admin app, which is the companion app to the widgets.

Then why not write the widgets in React Native?

After doing some initial investigation, we quickly hit some roadblocks (like the fact that RemoteViews are the only way to create widgets). There’s currently no official support in the React Native community for RemoteViews, which is needed to support widgets. This felt very akin to a square peg in a round hole. Our iOS app also ran into issues supporting React Native, and we were running down the same path. Shopify believes in using the right tool for the job, we believe that native development was the right call in this case.

Building the Widgets

When building out our architecture for widgets, we wanted to create a consistent experience on both Android and iOS while preserving platform idioms where it made sense. In the sections below, we want to give you a view of our experiences building widgets. Pointing out some of the more difficult challenges we faced. Our aim is to shed some light around these less used surfaces, hopefully give some inspiration, and save time when it comes to implementing widgets in your applications.

Fetching Data

Some types of widgets have data that change less frequently (for example, reminders) and some that can be forecasted for the entire day (for example, calendar and weather). In our case, the merchants need up-to-date metrics about their business, so we need to show data as fresh as possible. Delays in data can cause confusion, or even worse, delay information that could change an action. Say you follow the stock market, you expect the stock app and widget data to be as up to date as possible. If the data is multiple hours stale, you may have missed something important! For our widgets to be valuable, we need information to be fresh while considering network usage.

Fetching Data in the App

Widgets can be kept up to date with relevant and timely information by using data available locally or fetching it from a server. The server fetching can be initiated by the widget itself or by the host app. In our case, since the app doesn’t need the same information the widget needed, we decided it would make more sense to fetch it from the widget. 

One benefit to how widgets are managed in the Android ecosystem over iOS is the flexibility. On iOS you have limited communication between the app and widget, whereas on Android there doesn’t seem to be the same restrictions. This becomes clear when we think about how we configure a widget. The widget configuration screen has access to all of the libraries and classes that our main app does. It’s no different than any other screen in the app. This is mostly true with the widget as well. We can access the resources contained in our main application, so we don’t need to duplicate any code. The only restrictions in a widget come with building views, which we’ll explore later.

When we save our configuration, we use shared preferences to persist data between the configuration screen and the widget. When a widget update is triggered, the shared preferences data for a given widget is used to build our request, and the results are displayed within the widget. We can read that data from anywhere in our app, allowing us to reuse this data in other parts of our app if desired.

Making the Widgets Antifragile

The widget architecture is built in a way that updates are mindful of battery usage, where updates are controlled by the system. In the same way, our widgets must also be mindful of saving bandwidth when fetching data over a network. While developing our second iteration, we came across a peculiar problem that was exacerbated by our specific use case. Since we need data to be fresh, we always pull new data from our backend on every update. Each update is approximately 15 minutes apart to avoid having our widgets stop updating. What we found is that widgets call their update method onUpdate(), more than once in an update cycle. In widgets like calendar, these extra calls come without much extra cost as the data is stored locally. However, in our app, this was triggering two to five extra network calls for the same widget with the same data in quick succession.

In order to correct the unnecessary roundtrips, we built a simple short-lived cache:

  1. The system asks our widget to provide new data from Reportify (Shopify’s data service)
  2. We first look into the local cache using the widgetID provided by the system.
  3. If there’s data, and that data was set less than one minute ago, we return it and avoid making a network request. We also take into account configuration such as locale so as not to avoid forcing updates after a language change.
  4. Otherwise, we fetch the data as normal and store it in the cache with the timestamp.
A flow diagram highlighting the steps of the cache
The simple short-lived cache flow

With this solution, we reduced unused network calls and system load and avoided collecting incorrect analytics.

Implementing Decoder Strategy with Dynamic Selections

We follow a similar approach as we have on iOS. We create a dynamic set of queries based on what the merchant has configured.

For each metric we have a corresponding definition implementation. This approach allows each metric the ability to have complete flexibility around what data it needs, and how it decodes the data from the response.

When Android asks us to update our widgets, we pull the merchants selection from our configuration object. Since each of the metric IDs has a definition, we map over them to create a dynamic set of queries.

We include an extension on our response object that binds the definitions to a decoder. Our service sends back an array of the response data corresponding to the queries made. We map over the original definitions, decoding each chunk to the expected return type.

Building the UI

Similar to iOS, we support three widget sizes for versions prior to Android 12 and follow the same rules for cell layout, except for the small widget. The small widget on Android supports a single metric (compared to the two on iOS) and the smallest widget size on Android is a 2x1 grid. We quickly found that only a single metric would fit in this space, so this design differs slightly between the platforms.

Unlike iOS with swift previews, we were limited to XML previews and running the widget on the emulator or device. We’re also building widgets dynamically, so even XML previews were relatively useless if we wanted to see an entire widget preview. Widgets are currently on the 2022 Jetpack Compose roadmap, so this is likely to change soon with Jetpack composable previews.

With the addition of dynamic layouts in Android 12, we created five additional sizes to support each size in between the original three. These new sizes are unique to Android. This also led to using grid sizes as part of our naming convention as you can see in our WidgetLayout enum below.

For the structure of our widget, we used an enum that acts as a blueprint to map the appropriate layout file to an area of our widget. This is particularly useful when we want to add a new widget because we simply need to add a new enum configuration.

To build the widgets dynamically, we read our configuration from shared preferences and provide that information to the RemoteViews API.

If you’re familiar with the RemoteViews API, you may notice the updateView() method, which is not a default RemoteViews method. We created this extension method as a result of an issue we ran into while building our widget layout in this dynamic manner. When a widget updates, the new remote views get appended to the existing ones. As you can probably guess, the widget didn’t look so great. Even worse, more remote views get appended on each subsequent update. We found that combining the two RemoteViews API methods removeAllViews() and addView() solved this problem.

Once we build our remote views, we then pass the parent remote view to the AppWidgetProvider updateAppWidget() method to display the desired layout.

It’s worth noting that we attempted to use partiallyUpdateAppWidget() to stop our remote views from appending to each other, but encountered the same issue.

Using Dynamic Dates

One important piece of information on our widget is the last updated timestamp. It helps remove confusion by allowing the merchants to quickly know how fresh the data is they are looking at. If the data is quite stale (say you went to the cottage for the weekend and missed a few updates) and there wasn’t a displayed timestamp, you would assume the data you’re looking at is up to the second. This can cause unnecessary confusion for our merchants. The solution here was to ensure there’s some communication to our merchant when the last update was made.

In our previous design, we only had small widgets, and they were able to display only one metric. This information resulted in a long piece of text that, on smaller devices, would sometimes wrap and show over two lines. This was fine when space was abundant in our older design but not in our new data rich designs. We explored how we could best work with timestamps on widgets, and the most promising solution was to use relative time. Instead of having a static value such as “as of 3:30pm” like our previous iteration. We would have a dynamic date that would look like: “1 min, 3 sec ago.”

One thing to remember is that even though the widget is visible, we have a limited number of updates we can trigger. Otherwise, it would be consuming a lot of unnecessary resources on the device. We knew we couldn’t keep triggering updates on the widget as often as we wanted. Android has a strategy for solving this with TextClock. However, there’s no support for relative time, which wouldn’t be useful in our use case. We also explored using Alarms, but potentially updating every minute would consume too much battery. 

One big takeaway we had from these explorations was to always test your widgets under different conditions. Especially low battery or poor network. These surfaces are much more restrictive than general applications and the OS is much more likely to ignore updates.

We eventually decided that we wouldn’t use relative time and kept the widget’s refresh time as a timestamp. This way we have full control over things like date formatting and styling.

Adding Configuration

Our new widgets have a great deal of configuration options, allowing our merchants to choose exactly what they care about. For each widget size, the merchant can select the store, a certain number of metrics and a date range. This is the only part of the widget that doesn’t use RemoteViews, so there aren’t any restrictions on what type of View you may want to use. We share information between the configuration and the widget via shared preferences.

Insights widget configuration
Insights widget configuration

Working with Charts and Images

Android widgets are limited to RemoteViews as their building blocks and are very restrictive in terms of the view types supported. If you need to support anything outside of basic text and images, you need to be a bit creative.

Our widgets support both a sparkline and spark bar chart built using the MPAndroidCharts library. We have these charts already configured and styled in our main application, so the reuse here was perfect; except, we can’t use any custom Views in our widgets. Luckily, this library is creating the charts via drawing to the canvas, and we simply export the charts as a bitmap to an image view. 

Once we were able to measure the widget, we constructed a chart of the required size, create a bitmap, and set it to a RemoteView.ImageView. One small thing to remember with this approach, is that if you want to have transparent backgrounds, you’ll have to use ARGB_8888 as the Bitmap Config. This simple bitmap to ImageView approach allowed us to handle any custom drawing we needed to do. 

Eliminating Flickering

One minor, but annoying issue we encountered throughout the duration of the project is what we like to call “widget flickering.” Flickering is a side-effect of the widget updating its data. Between updates, Android uses the initialLayout from the widget’s configuration as a placeholder while the widget fetches its data and builds its new RemoteViews. We found that it wasn’t possible to eliminate this behavior, so we implemented a couple of strategies to reduce the frequency and duration of the flicker.

The first strategy is used when a merchant first places a widget on the home screen. This is where we can reduce the frequency of flickering. In an earlier section “Making the Widgets Antifragile,” we shared our short-lived cache. The cache comes into play because the OS will trigger multiple updates for a widget as soon as it’s placed on the home screen. We’d sometimes see a quick three or four flickers, caused by updates of the widget. After the widget gets its data for the first time, we prevent any additional updates from happening for the first 60 seconds, reducing the frequency of flickering.

The second strategy reduces the duration of a flicker (or how long the initialLayout is displayed). We store the widgets configuration as part of shared preferences each time it’s updated. We always have a snapshot of what widget information is currently displayed. When the onUpdate() method is called, we invoke a renderEarlyFromCache() method as early as possible. The purpose of this method is to build the widget via shared preferences. We provide this cached widget as a placeholder until the new data has arrived.

Gathering Analytics

Largest Insights widget in light mode
Largest widget in light mode

Since our first widgets were developed, we added strategic analytics in key areas so that we could understand how merchants were using the functionality.  This allowed us to learn from the usage so we could improve on them. The data team built dashboards displaying detailed views of how many widgets were installed, the most popular metrics, and sizes.

Most of the data used to build the dashboards came from analytics events fired through the widgets and the Shopify app.

For these new widgets, we wanted to better understand adoption and retention of widgets, so we needed to capture how users are configuring their widgets over time and which ones are being added or removed.

Detecting Addition and Removal of Widgets 

Unlike iOS, capturing this data in Android is straight-forward. To capture when a merchant adds a widget, we send our analytical event when the configuration is saved. When removing a widget, the widgets built-in onDeleted method gives us the widget ID of the removed widget. We can then look up our widget information in shared preferences and send our event prior permanently deleting the widget information from the device. 

Supporting Android 12

When we started development of our new widgets, our application was targeting Android 11. We knew we’d be targeting Android 12 eventually, but we didn’t want the upgrade to block the release. We decided to implement Android 12 specific features once our application targeted the newer version, leading to an unforeseen issue during the upgrade process with widget selection.

Our approach to widget selection in previous versions was to display each available size as an option. With the introduction of responsive layouts, we no longer needed to display each size as its own option. Merchants can now pick a single widget and resize to their desired layout. In previous versions, merchants can select a small, medium, and large widget. In versions 12 and up, merchants can only select a single widget that can be resized to the same layouts as small, medium, large, and several other layouts that fall in between. This pattern follows what Google does with their large weather widget included on devices, as well as an example in their documentation. We disabled the medium and small widgets in Android 12 by adding a flag to our AndroidManifest and setting that value in attrs.xml for each version:

The approach above behaves as expected, the medium and small widgets were now unavailable from the picker. However, if a merchant was already on Android 12 and had added a medium or large widget before our widget update, those widgets were removed from their home screen. This could easily be seen as a bug and reduce confidence in the feature. In retrospect, we could have prototyped what the upgrade would have looked like to a merchant who was already on Android 12.

Allowing only the large widget to be available was a data-driven decision. By tracking widget usage at launch, we saw that the large widget was the most popular and removing the other two would have the least impact on current widget merchants.

Building New Layouts

We encountered an error when building the new layouts that fit between the original small, medium and large widgets.

After researching the error, we were exceeding the Binder transaction buffer. However, the buffer’s size is 1mb and the error displayed .66mb, which wasn’t exceeding the documented buffer size. The error has appeared to stump a lot of developers. After experimenting with ways to get the size down, we found that we could either drop a couple of entire layouts or remove support for a fourth and fifth row of the small metric. We decided on the latter, which is why our 2x3 widget only has three rows of data when it has room for five.

Rethinking the Configuration Screen

Now that we have one widget to choose from, we had to rethink what our configuration screen would look like to a merchant. Without our three fixed sizes, we could no longer display a fixed number of metrics in our configuration. 

Our only choice was to display the maximum number of metrics available for the largest size (which is seven at the time of this writing). Not only did this make the most sense to us from a UX perspective, but we also had to do it this way because of how the new responsive layouts work. Android has to know all of the possible layouts ahead of time. Even if a user shrinks their widget to a size that displays a single metric, Android has to know what the other six are, so it can be resized to our largest layout without any hiccups.

We also updated the description that’s displayed at the top of the configuration screen that explains this behavior.

Capturing More Analytics

On iOS, we capture analytical data when a merchant reconfigures a widget to gain insights into usage patterns. Reconfiguration for Android was only possible in version 12 and due to the limitations of the AppWidgetProvider’s onAppWidgetOptionsChanged() method, we were unable to capture this data on Android.

I’ll share more information about building our layouts in order to give context to our problem. Setting breakpoints for new dynamic layouts works very well across all devices. Google recommends creating a mapping of your breakpoints to the desired remote view layout. To build on a previous example where we showed the buildWidgetRemoteView() method, we used this method again as part of our breakpoint mapping. This approach allows us to reliably map our breakpoints to the desired widget layout.

When reconfiguring or resizing a widget, the onAppWidgetOptionsChanged() method is called. This is where we’d want to capture our analytical data about what had changed. Unfortunately,  this view mapping doesn’t exist here. We have access to width and height values that are measured in dp, initially appearing to be useful. At first, we felt that we could discern these measurements into something meaningful and map these values back to our layout sizes. After testing on a couple of devices, we realized that the approach was unreliable and would lead to a large volume of bad analytical data. Without confidently knowing what size we are coming from, or going to, we decided to omit this particular analytics event from Android. We hope to bring this to Google’s attention, and get it included in a future release.

Shipping New Widgets

Already having a pair of existing widgets, we had to decide how to transition to the new widgets as they would be replacing the existing implementation.

We didn’t find much documentation around migrating widgets. The docs only provided a way to enhance your widget, which means adding the new features of Android 12 to something you already had. This wasn’t applicable to us since our existing widgets were so different from the ones we were building.

The major issue that we couldn’t get around was related to the sizing strategies of our existing and new widgets. The existing widgets used a fixed width and height so they’d always be square. Our new widgets take whatever space is available. There wasn’t a way to guarantee that the new widget would fit in the available space that the existing widget had occupied. If the existing widget was the same size as the new one, it would have been worth exploring further.

The initial plan we had hoped for, was to make one of our widgets transform into the new widget while removing the other one. Given the above, this strategy would not work.

The compromise we came to, so as to not completely remove all of a merchant’s widgets overnight, was to deprecate the old widgets at the same time we release the new one. To deprecate, we updated our old widget’s UI to display a message informing that the widget is no longer supported and the merchant must add the new ones.

Screen displaying the notice: This widget is no longer active. Add the new Shopify Insights widget for an improved view of your data. Learn more.
Widget deprecation message

There’s no way to add a new widget programmatically or to bring the merchant to the widget picker by tapping on the old widget. We added some communication to help ease the transition by updating our help center docs, including information around how to use widgets, pointing our old widgets to open the help center docs, and just giving lots of time before removing the deprecation message. In the end, it wasn’t the most ideal situation and we came away learning about the pitfalls within the two ecosystems.

What’s Next

As we continue to learn about how merchants use our new generation of widgets and Android 12 features, we’ll continue to hone in on the best experience across both our platforms. This also opens the way for other teams at Shopify to build on what we’ve started and create more widgets to help Merchants. 

On the topic of mobile only platforms, this leads us into getting up to speed on WearOS. Our WatchOS app is about to get a refresh with the addition of Widgetkit; it feels like a great time to finally give our Android Merchants watch support too!

Matt Bowen is a mobile developer on the Core Admin Experience team. Located in West Warwick, RI. Personal hobbies include exercising and watching the Boston Celtics and New England Patriots.

James Lockhart is a Staff Developer based in Ottawa, Canada. Experiencing mobile development over the past 10+ years: from Phonegap to native iOS/Android and now React native. He is an avid baker and cook when not contributing to the Shopify admin app.

Cecilia Hunka is a developer on the Core Admin Experience team at Shopify. Based in Toronto, Canada, she loves live music and jigsaw puzzles.

Carlos Pereira is a Senior Developer based in Montreal, Canada, with more than a decade of experience building native iOS applications in Objective-C, Swift and now React Native. Currently contributing to the Shopify admin app.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Lessons From Building iOS Widgets

Lessons From Building iOS Widgets

By Carlos Pereira, James Lockhart, and Cecilia Hunka

When the iOS 14 beta was originally announced, we knew we needed to take advantage of the new widget features and get something to our merchants. The new widgets looked awesome and could really give merchants a way to see their shop’s data at a glance without needing to open our app.

Fast forward a couple of years, and we now have lots of feedback from the new design. We knew merchants were using them, but they needed more. The current design was lacking and only provided two metrics—also, they took up a lot of space. This experience prompted us to start a new project. To upgrade our original design to better fit our merchant’s needs.

Why Widgets Are Important to Shopify

Our widgets mainly focus on analytics. Analytics can help merchants understand how they’re doing and gain insights to make better decisions quickly about their business. Monitoring metrics is a daily activity for a lot of our merchants, and on mobile, we have the opportunity to give merchants a faster way to access this data through widgets. As widgets provide access to “at a glance” information about your shop and allow merchants a unique avenue to quickly get a pulse on their shops that they wouldn’t find on desktop.

A screenshot showing the add widget screen for Insights on iOS
Add Insights widget

After gathering feedback and continuously looking for opportunities to enhance our widget capabilities, we’re at our third iteration, and we’ll share with you how we approached building widgets and some of the challenges we faced.

Why We Didn’t Use React Native

A couple years ago Shopify decided to go all in on React Native. New development was done in React Native and we began migrating some apps to the new stack. Including our flagship admin app, where we were building our widgets. Which posed the question, should we write the widgets in React Native?

After doing some investigation we quickly hit some roadblocks: app extensions are limited in terms of memory, WidgetKit’s architecture is highly optimized to work with SwiftUI as the view hierarchy is serialized to disk, there’s also, at this time, no official support in the React Native community for widgets.

Shopify believes in using the right tool for the job, we believe that native development with SwiftUI was the best choice in this case.

Building the Widgets

When building out our architecture for widgets, we wanted to create a consistent experience on both iOS and Android while preserving platform idioms where it made sense. Below we’ll go over our experience and strategies building the widgets, pointing out some of the more difficult challenges we faced. Our aim is to shed some light around these less talked about surfaces, give some inspiration for your projects, and hopefully, save time when it comes to implementing your widgets.

Fetching Data

Some types of widgets have data that change less frequently (for example, reminders) and some that can be forecasted for the entire day (for example, Calendar and weather). In our case, the merchants need up-to-date metrics about their business, so we need to show data as fresh as possible. Time for our widget is crucial. Delays in data can cause confusion, or even worse, delay information that could inform a business decision. For example, let’s say you watch the stocks app. You would expect the stock app and its corresponding widget data to be as up to date as possible. If the data is multiple hours stale, you could miss valuable information for making decisions or you could miss an important drop or rise in price. With our product, our merchants need as up to date information as we can provide them to run their business.

Fetching Data in the App

Widgets can be kept up to date with relevant and timely information by using data available locally or fetching it from a server. The server fetching can be initiated by the widget itself or by the host app. In our case, since the app doesn’t share the same information as the widget, we decided it made more sense to fetch it from the widget.

We still consider moving data fetching to the app once we start sharing similar data between widgets and the app. This architecture could simplify the handling of authentication, state management, updating data, and caching in our widget since only one process will have this job rather than two separate processes. It’s worth noting that the widget can access code from the main app, but they can only communicate data through keychain and shared user defaults as widgets run on separate processes. Sharing the data fetching, however, comes with an added complexity of having a background process pushing or making data available to the widgets, since widgets must remain functional even if the app isn’t in the foreground or background. For now, we’re happy with the current solution: the widgets fetch data independently from the app while sharing the session management code and tokens.

A flow diagram highlighting widgets fetch data independently from the app while sharing the session management code and tokens
Current solution where widgets fetch data independently

Querying Business Analytics Data with Reportify and ShopifyQL

The business data and visualizations displayed in the widgets are powered by Reportify, an in-house service that exposes data through a set of schemas queried via ShopifyQL, Shopify’s Commerce data querying language. It looks very similar to SQL but is designed around data for commerce. For example, to fetch a shops total sales for the day:

Making Our Widgets Antifragile

iOS Widgets' architecture is built in a way that updates are mindful of battery usage and are budgeted by the system. In the same way, our widgets must also be mindful of saving bandwidth when fetching data over a network. While developing our second iteration we came across a peculiar problem that was exacerbated by our specific use case.

Since we need data to be fresh, we always pull new data from our backend on every update. Each update is approximately 15 minutes apart to avoid having our widgets stop updating (which you can read about why on Apple’s Developer site). We found that iOS calls the update methods, getTimeline() and getSnapshot(), more than once in an update cycle. In widgets like calendar, these extra calls come without much extra cost as the data is stored locally. However, in our app, this was triggering two to five extra network calls for the same widget with the same data in quick succession.

We also noticed these calls were causing a seemingly unrelated kick out issue affecting the app. Each widget runs on a different process than the main application, and all widgets share the keychain. Once the app requests data from the API, it checks to see if it has an authenticated token in the keychain. If that token is stale, our system pauses updates, refreshes the token, and continues network requests. In the case of our widgets, each widget call to update was creating another workflow that could need a token refresh. When we only had a single widget or update flow, it worked great! Even four to five updates would usually work pretty well. However, eventually one of these network calls would come out of order and an invalid token would get saved. On our next update, we have no way to retrieve data or request a new token resulting in a session kick out. This was a great find as it was causing a lot of frustration for our affected merchants and ourselves, who could never really put our finger on why these things would, every now and again, just log us out.

In order to correct the unnecessary roundtrips, we built a simple short-lived cache:

  1. The system asks our widget to provide new data
  2. We first look into the local cache using a key specific to that widget. On iOS, our key is produced from a configuration for that widget as there’s no unique identifiers provided. We also take into account configuration such as locale so as not to avoid forcing updates after a language change.
  3. If there’s data, and that data was set less than one minute ago, we return it and avoid making a network request. 
  4. Otherwise, we fetch the data as normal and store it in the cache with the timestamp.
      A flow diagram highlighting the steps of the cache
      The simple short-lived cache flow

      With this solution, we reduced unused network calls and system load, avoided collecting incorrect analytics, and fixed a long running bug with our app kick outs!

      Implementing Decoder Strategy with Dynamic Selections

      When fetching the data from the Analytics REST service, each widget can be configured with two to seven metrics from a total of 12. This set should grow in the future as new metrics are available too! Our current set of metrics are all time-based and have a similar structure.

      But that doesn’t mean the structure of future metrics will not change. For example, what about a metric that contains data that isn’t mapped over a time range? (like orders to fulfill, which does not contain any historical information).

      The merchant is also able to configure the order the metrics appear, which shop (if they have more than one shop), and which date range represents the data: today, last 7 days, and last 30 days.

      We had to implement a data fetching and decoding mechanism that:

      • only fetches the data the merchant requested in order to avoid asking for unneeded information
      • supports a set of metrics as well as being flexible to add future metrics with different shapes
      • supports different date ranges for the data.

      A simplified version of the solution is shown below. First, we create a struct to represent the query to the analytics service (Reportify).

      Then, we create a class to represent the decodable response. Right now it has a fixed structure (value, comparison, and chart values), but in the future we can use an enum or different subclasses to decode different shapes.

      Next, we create a response wrapper that attempts to decode the metrics based on a list of metric types passed to it. Each metric has its configuration, so we know which class is used to read the values.

      Finally, when the widget Timeline Provider asks for new data, we fetch the data from the current metrics and decode the response. 

      Building the UI

      We wanted to support the three widget sizes: small, medium, and large. From the start we wanted to have a single View to support all sizes as an attempt to minimize UI discrepancies and make the code easy to maintain.

      We started by identifying the common structure and creating components. We ended up with a Metric Cell component that has three variations:

      A metric cell from a widget
      A metric cell
      A metric cell from a widget
      A metric cell with a sparkline
      A metric cell from a widget
      A metric cell with barchart

      All three variations consist of a metric name and value, chart, and a comparison. As the widget containers become bigger, we show the merchant more data. Each view size contains more metrics, and the largest widget contains a full width chart on the first chosen metric. The comparison indicator also gets shifted from bottom to right on this variation.

      The first chosen metric, on the large widget, is shown as a full width cell with a bar chart showing the data more clearly; we call it the Primary cell. We added a structure to indicate if a cell is going to be used as primary or not. Besides the primary flag, our component doesn’t have any context about the widget size, so we use chart data as an indicator to render a cell primary or not. This paradigm fits very well with SwiftUI.

      A simplified version of the actual Cell View:

      After building our cells, we need to create a structure to render them in a grid according to the size and metrics chosen by the merchant. This component also has no context of the widget size, so our layout decisions are mainly based on how many metrics we are receiving. In this example, we’ll refer to the View as a WidgetView.

      The WidgetView is initialized with a WidgetState, a struct that holds most of the widget data such as shop information, the chosen metrics and their data, and a last updated string (which represents the last time the widget was updated).

      To be able to make decisions on layout based on the widget characteristics, we created an OptionSet called LayoutOption. This is passed as an array to the WidgetView.

      Layout options:

      That helped us not to tie this component to Widget families, rather to layout characteristics that makes this component very reusable in other contexts.

      The WidgetView layout is built using mainly a LazyVGrid component:

      A simplified version of the actual View:

      Adding Dynamic Dates

      One important piece of information on our widget is the last updated timestamp. It helps remove confusion by allowing merchants too quickly know how fresh the data is they’re looking at. Since iOS has an approximate update time with many variables, coupled with data connectivity, it’s very possible the data could be over 15 minutes old. If the data is quite stale (say you went to the cottage for the weekend and missed a few updates) and there was no update string, you would assume the data you’re looking at is up to the second. This can cause unnecessary confusion for our merchants. The solution here was to ensure there’s some communication to our merchant when the last update was.

      In our previous design, we only had small widgets, and they were able to display only one metric. This information resulted in a long string, that on smaller devices, would sometimes wrap and show over two lines. This was fine when space was abundant in our older design but not in our new data rich designs. We explored how we could best work with timestamps on widgets, and the most promising solution was to use relative time. Instead of having a static value such as “as of 3:30pm” like our previous iteration, we would have a dynamic date that would look like: “1 min, 3 sec ago.”

      One thing to remember is that even though the widget is visible, we have a limited number of updates we can trigger. Otherwise, it would be consuming a lot of unnecessary resources on the merchant’s device. We knew we couldn’t keep triggering updates on the widget as often as we wanted (nor would it be allowed), but iOS has ways to deal with this. Apple did release support for dynamic text on widgets during our development that allowed using timers on your widgets without requiring updates. We simply need to pass a style to a Text component and it automatically keeps everything up to date:

      Text("\(now, style: .relative) ago")

      It was good, but we have no options to customize the relative style. Being able to customize the relative style was an important point for us, as the current supported style does not fit well with our widget layout. One of our biggest constraints with widgets is space as we always need to think about the smallest widget possible. In the end we decided not to move forward with the relative time approach, and kept a reduced version of our previous timestamp.

      Adding Configuration

      Our new widgets have a great amount of configuration, allowing for merchants to choose exactly what they care about. For each widget size, the merchant can select the store, a certain number of metrics, and a date range. On iOS, widgets are configured through the SiriKit Intents API. We faced some challenges with the WidgetConfiguration, but fortunately, all had workarounds that fit our use cases.

      Insights widget configuration
      Insights widget configuration

      It’s Not Possible to Deselect a Metric

      When defining a field that has multiple values provided dynamically, we can limit the number of options per widget family. This was important for us, since each widget size has a different number of metrics it can support. However, the current UI on iOS for widget configuration only allows selecting a value but not deselecting it. So, once we selected a metric we couldn’t remove it, only update the selection. But what if the merchant were only interested in one metric on the small widget? We solved this with a small design change, by providing “None” as an option. If the merchant were to choose this option, it would be ignored and shown as an empty state. 

      It's not possible to validate the user selections

      With the addition of “None” and the way intents are designed, it was possible to select all “None” and have a widget with no metrics. In addition, it was possible to select the same metric twice.. We would like to be able to validate the user selection, but the Intents API didn't support it. The solution was to embrace the fact that a widget can be empty and show as an empty state. Duplicates were filtered out so any more than a single metric choice was changed to “None” before we sent any network requests.

      The First Calls to getTimeline and getSnapshot Don’t Respect the Maximum Metric Count

      For intent configurations provided dynamically, we must provide default values in the IntentHandler. In our case, the metrics list varies per widget family. In the IntentHandler, it’s not possible to query which widget family is being used. So we had to return at least as many metrics as the largest widget (seven). 

      However, even if we limit the number of metrics per family, the first getTimeline and getSnapshot calls in the Timeline Provider were filling the configuration object with all default metrics, so a small widget would have seven metrics instead of two!

      We ended up adding some cleanup code in the beginning of the Timeline Provider methods that trims the list depending on the expected number of metrics.

      Optimizing Testing

      Automated tests are a fundamental part of Shopify’s development process. In the Shopify app, we have a good amount of unit and snapshot tests. The old widgets on Android had good test coverage already, and we built on the existing infrastructure. On iOS, however, there were no tests since it’s currently not possible to add test targets against a widget extension on Xcode.

      Given this would be a complex project and we didn’t want to compromise on quality, we investigated possible solutions for it.

      The simplest solution would be to add each file on both the app and in the widget extension targets, then we could unit test it in the app side in our standard test target. We decided not to do this since we would always need to add a file to both targets, and it would bloat the Shopify app unnecessarily.

      We chose to create a separate module (a framework in our case) and move all testable code there. Then we could create unit and snapshot tests for this module.

      We ended up moving most of the code, like views and business logic, to this new module (WidgetCore), while the extension only had WidgetKit specific code and configuration like Timeline provider, widget bundle, and intent definition generated files.

      Given our code in the Shopify app is based on UIKit, we did have to update our in-house snapshot testing framework to support SwiftUI views. We were very happy with the results. We ended up achieving a high test coverage, and the tests flagged many regressions during development.

      Fast SwiftUI Previews 

      The Shopify app is a big application, and it takes a while to build. Given the widget extension is based on our main app target, it took a long time to prepare the SwiftUI previews. This caused frustration during development. It also removed one of the biggest benefits of SwiftUI—our ability to iterate quickly with Previews and the fast feedback cycle during UI development.

      One idea we had was to create a module that didn’t rely on our main app target. We created one called WidgetCore where we put a lot of our reusable Views and business logic. It was fast to build and could also render SwiftUI previews. The one caveat is, since it wasn’t a widget extension target, we couldn’t leverage the WidgetPreviewContext API to render views on a device. It meant we needed to load up the extension to ensure the designs and changes were always working as expected on all sizes and modes (light and dark).

      To solve this problem, we created a PreviewLayout extension. This had all the widget sizes based on the Apple documentation, and we were able to use it in a similar way:

      Our PreviewLayout extension would be used on all of our widget related views in our WidgetCore module to emulate the sizes in previews:

      Acquiring Analytics

      Shopify Insights widget largest size in light mode
      Largest widget in light mode

      Since our first widgets were developed, we wanted to understand how merchants are using the functionality, so we can always improve it. The data team built some dashboards showing things like a detailed view of how many widgets installed, the most popular metrics, and sizes.

      Most of the data used to build the dashboards come from analytics events fired through the widgets and the Shopify app.

      For the new widgets, we wanted to better understand adoption and retention of widgets, so we needed to capture how users are configuring their widgets over time and which ones are being added or removed.

      Managing Unique Ids

      WidgetKit has the WidgetCenter struct that allows requesting information about the widgets currently configured in the device through the getCurrentConfigurations method. However, the list of metadata returned (WidgetInfo) doesn’t have a stable unique identifier. Its identifier is the object itself, since it’s hashable. Given this constraint, if two identical widgets are added, they’ll both have the same identifier. Also, given the intent configuration is part of the id, if something changes (for example, date range) it’ll look like it’s a totally different widget.

      Given this limitation, we had to adjust the way we calculate the number of unique widgets. It also made it harder to distinguish between different life-cycle events (adding, removing, and configuring). Hopefully there will be a way to get unique ids for widgets in future versions of iOS. For now we created a single value derived from the most important parts of the widget configuration.

      Detecting, Adding, and Removing Widgets 

      Currently there’s no WidgetKit life cycle method that tells us when a widget was added, configured, or removed. We needed it so we can better understand how widgets are being used.

      After some exploration, we noticed that the only methods we could count on were getTimeline and getSnapshot. We then decided to build something that could simulate these missing life cycle methods by using the ones we had available. getSnapshot is usually called on state transitions and also on the widget Gallery, so we discarded it as an option.

      We built a solution that did the following

      1. Every time the Timeline providers’ getTimeline is called, we call WidgetKit’s getCurrentConfigurations to see what are the current widgets installed.
      2. We then compare this list with a previous snapshot we persist on disk.
      3. Based on this comparison we try to guess which widgets were added and removed.
      4. Then we triggered the proper life cycle methods: didAddWidgets(), didRemoveWidgets().

      Due to identifiers not being stable, we couldn’t find a reliable approach to detect configuration changes, so we ended up not supporting it.

      We also noticed that WidgetKit.getCurrentConfigurations’s results can have some delay. If we remove a widget, it may take a couple getTimeline calls for it to be reflected. We adjusted our analytics scheme to take that into account.

      Detecting, adding, and removing widgets solution
      Current solution

      Supporting iOS 16

      Our approach to widgets made supporting iOS 16 out of the gate really simple with a few changes. Since our lock screen complications will surface the same information as our home screen widgets, we can actually reuse the Intent configuration, Timeline Provider, and most of the views! The only change we need to make is to adjust the supported families to include .accessoryInline, .accessoryCircular, and .accessoryRectangular, and, of course, draw those views.

      Our Main View would also just need a slight adjustment to work with our existing home screen widgets.

      Migrating Gracefully

      WidgetKit was introduced for watchOS complications in iOS 16. This update comes with a foreboding message from Apple:

      As soon as you offer a widget-based complication, the system stops calling ClockKit APIs. For example, it no longer calls your CLKComplicationDataSource object’s methods to request timeline entries. The system may still wake your data source for migration requests.

      We really care about our apps at Shopify, so we really needed to unpack what this meant, and how does this affect our merchants running older devices? With some testing on devices, we were able to find out, everything is fine.

      If you’re currently running WidgetKit complications and add support for lock screen complications, your ClockKit app and complications will continue to function as you’d expect.

      What we had assumed was that WidgetKit itself was taking the place of WatchOS complications; however, to use Widgetkit on WatchOS, you need to create a new target for the Watch. This makes sense, although the APIs are so similar we had assumed it was a one and done approach. One WidgetKit extension for both platforms.

      One thing to watch out for,  if you do implement the new WidgetKit on WatchOS, if your users are on WatchOS 9 and above will lose all of their complications from ClockKit. Apple did provide a migration API to support the change that’s called instead of your old complications.

      If you don’t have the luxury of just setting your target to iOS 16, your complications will continue to load up for those on WatchOS 8 and below from our testing.

      Shipping New Widgets

      We already had a set of widgets running on both platforms, now we had to decide how to transition to the new update as they would be replacing the existing implementation. On iOS we had two different widget kinds each with their own small widget (you can think of kinds as a widget group). With the new implementation, we wanted to provide a single widget kind that offered all three sizes. We didn’t find much documentation around the migration, so we simulated what happens to the widgets under different scenarios.

      If the merchant has a widget on their home screen and the app updates, one of two things would happen:

      1. The widget would become a white blank square (the kind IDs matched).
      2. The widget just disappeared altogether (the kind ID was changed).

      The initial plan (we had hoped for) was to make one of our widgets transform into the new widget while removing the other one. Given the above, this strategy wouldn’t work. This also includes some annoying tech debt since all of our Intent files would continue to mention the name of the old widget.

      The compromise we came to, so as to not completely remove all of a merchant’s widgets overnight, was to deprecate the old widgets at the same time we release the new ones. To deprecate, we updated our old widget’s UI to display a message informing that the widget is no longer supported, and the merchant must add the new ones. The lesson here is you have to be careful when you make decisions around widget grouping as it’s not easy to change.

      There’s no way to add a new widget programmatically or to bring the merchant to the widget gallery by tapping on the old widget. We also added some communication to help ease the transition by:

        • updating our help center docs, including information around how to use widgets 
        • pointing our old widgets to open the help center docs 
        • giving lots of time before removing the deprecation message.

        In the end, it wasn’t the most ideal situation, and we came away learning about the pitfalls within the two ecosystems. One piece of advice is to really reflect on current and future needs when defining which widgets to offer and how to split them, since a future modification may not be straightforward.

        Screen displaying the notice: This widget is no longer active. Add the new Shopify Insights widget for an improved view of your data. Learn more.
        Widget deprecation message

        What’s Next

        As we continue to learn about how merchants use our new generation of widgets, we’ll continue to hone in on the best experience across both our platforms. Our widgets were made to be flexible, and we’ll be able to continually grow the list of metrics we offer through our customization. This work opens the way for other teams at Shopify to build on what we’ve started and create more widgets to help Merchants too.

        2022 is a busy year with iOS 16 coming out. We’ve got a new WidgetKit experience to integrate to our watch complications, lock screen complications, and live activities hopefully later this year!

        Carlos Pereira is a Senior Developer based in Montreal, Canada, with more than a decade of experience building native iOS applications in Objective-C, Swift and now React Native. Currently contributing to the Shopify admin app.

        James Lockhart is a Staff Developer based in Ottawa, Canada. Experiencing mobile development over the past 10+ years: from Phonegap to native iOS/Android and now React native. He is an avid baker and cook when not contributing to the Shopify admin app.

        Cecilia Hunka is a developer on the Core Admin Experience team at Shopify. Based in Toronto, Canada, she loves live music and jigsaw puzzles.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        Start your free 14-day trial of Shopify