Best-in-Class Developer Experience with Vite and Hydrogen

Hydrogen is a framework that combines React and Vite for creating custom storefronts on Shopify. It maximizes performance for end-users and provides a best-in-class developer experience for you and your team. Since it focuses on evergreen browsers, Hydrogen can leverage modern capabilities, best practices, and the latest tooling in web development to bring the future of ecommerce closer.

Creating a framework requires a lot of choices for frontend tooling. One major part of it is the bundler. Traditionally, developers had no native way to organize their code in JavaScript modules. Therefore, to minimize the amount of code and waterfall requests in the browser, new frontend tools like Webpack started to appear, powering projects such as Next.js and many more.

Bundling code became the de facto practice over the last decade, especially when using view libraries like React or Vue. While these tools successfully solved the problem, they quickly became hard to understand and maintain due to the increasing complexity of the modern web. On top of that, the development process started to slow down because bundling and compiling are inherently slow: the more files in a project, the more work the tool needs to do. Repeat this process for every change made during active development, and the developer experience (DX) quickly tanks.

Figure: Bundle-based dev server, in which modules are bundled and compiled before the server is ready (image from the Vite.js docs).

Thanks to the introduction of ES Modules (a native mechanism to author JavaScript modules) and their support in browsers, some new players like Snowpack and Parcel appeared and started shaping the modern web development landscape.

Figure: Native ESM-based dev server, which uses native ES Modules to minimize the amount of bundling required during development (image from the Vite.js docs).

This new generation of web tooling aims to improve the DX of building apps. Whereas Webpack's high flexibility means it needs complex configuration even for simple things, these new tools provide sensible but configurable defaults. Furthermore, they leverage native ES Modules to minimize the amount of bundling required during development. In particular, they tend to bundle and cache only third-party dependencies to keep the number of network requests (files downloaded by the browser) low. Some dependencies may have dozens or hundreds of files, but they don't need to be updated often. User code, on the other hand, is served to the browser unbundled, which speeds up refreshes when making changes.

Enter Vite. With its evergreen and modern philosophy, we believe Vite aligns perfectly with Hydrogen. Featuring a lightning-fast development server with hot module replacement, a rich plugin ecosystem, and clever default configurations that make it work out of the box for most apps, Vite was among the top options to power Hydrogen's development engine.

Why Vite?

Vite is French for "quick", and the Hydrogen team can confirm: it's really fast. From the installation and setup to its hot reloading, things that used to be a DX pain are (mostly) gone. It’s also highly configurable and simple to use.

Partially, this is thanks to the two magnificent tools that power it: esbuild, a lightning-fast JavaScript compiler written in Go, and Rollup, a flexible and intelligible bundler. However, Vite is much more than the sum of these parts.

Ease of Use

In Vite, the main entry point is a simple index.html file, making it a first-class citizen instead of an afterthought asset. Everything else flows from here through stylesheet and script tags. Vite crawls and analyzes all of the imported assets and transforms them accordingly.

Thanks to its default values, most flavors of CSS and JavaScript, including JSX, TypeScript (TS), and PostCSS, work out of the box.

Let me reiterate this: it just works™. No painful configuration is needed to get those new CSS prefixes or the latest TS syntax working (Vite transpiles TS but leaves type checking to your editor or build process). It even lets you import WebAssembly or SVG files from JavaScript just like that. Also, since Vite's main target is modern browsers, it’s prepared to optimize the code and styles by using the latest supported features by default.

We value the simplicity Vite brings in Hydrogen and share it with our users. It all sums up to saving a lot of time configuring your tooling compared to other alternatives.

A Proven Plugin System

Rollup has been around for a much longer time than Vite. It does one thing and does it very well: bundling. The key here is that Vite can tell it what to bundle.

Furthermore, Rollup has a truly rich plugin ecosystem that is fully compatible with Vite. With this, Vite provides hooks during development and building phases that enable advanced use cases, such as transforming specific syntax like Vue files. There are many plugins out there that use these hooks for anything you can imagine: Markdown pages with JSX, SSR-ready icons, automatic image minification, and more.

In Hydrogen, we found these Vite hooks easier to understand and use than those in Webpack, and they allow us to write more maintainable code.
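To give a feel for what such a hook looks like, here is a minimal sketch of a plugin with a Rollup-style `transform` hook. The plugin name and the `.txt` use case are hypothetical, and the `MinimalPlugin` type is a local stand-in so the sketch runs without Vite installed (a real plugin would use the `Plugin` type exported by Vite).

```typescript
// Local stand-in for the small part of the plugin shape used here.
interface MinimalPlugin {
  name: string;
  transform(code: string, id: string): { code: string; map: null } | null;
}

// Hypothetical plugin: turns `.txt` files into modules that export their text.
function rawTextPlugin(): MinimalPlugin {
  return {
    name: 'raw-text-example',
    transform(code, id) {
      // Returning null tells the bundler this plugin does not handle the file.
      if (!id.endsWith('.txt')) return null;
      return { code: `export default ${JSON.stringify(code)};`, map: null };
    },
  };
}

// Calling the hook by hand, the way the dev server or bundler would.
const plugin = rawTextPlugin();
const transformed = plugin.transform('hello', '/src/greeting.txt');
const ignored = plugin.transform('const a = 1;', '/src/main.js');
```

The same hook runs in development and in production builds, which is a large part of why plugins written this way are easy to reason about.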


A common task that tends to slow down web development is compiling JavaScript flavors and new features to older and widely supported code. Babel, a compiler written in JavaScript, has been the king in this area for a long time.

However, new tools like esbuild started to appear recently with a very particular characteristic: they use a language that compiles to native machine code to transform JavaScript instead of using JavaScript itself. In addition, and perhaps more importantly, they also apply sophisticated algorithms to avoid repeating AST parsing and to parallelize work, thus establishing a new baseline for speed.

Apart from using esbuild, Vite applies many other optimizations and tricks to speed up development. For instance, it pre-bundles some third-party dependencies and caches them in the filesystem to enable faster startups.

All in all, we can say Vite is one of the fastest alternatives out there when it comes to local development, and this is something we also want our users to benefit from in Hydrogen.


Along with Snowpack and Parcel, Vite is one of the first tools to embrace ECMAScript Modules (ESM) and inject JavaScript into the browser using script tags with type=module.

This, paired with hot-module replacement (HMR), means that changes to files on the local filesystem are updated instantly in the browser.

Vite is also building for the future of the web and the npm ecosystem. While most third-party libraries still ship CommonJS (CJS) modules (the format native to Node.js), the new standard is ESM. Vite performs an exhaustive import analysis of dependencies and transforms CJS modules into ESM automatically, letting you always import code in a modern fashion. And this is not something to take lightly: CJS and ESM interoperability is one of the biggest headaches web developers have faced in recent years.

As app developers ourselves in Hydrogen, it is such a relief we can focus on coding without wasting time on this issue. Someday most packages will, hopefully, follow the ESM standard. Until that day, Vite has us covered.

Server-Side Rendering

Server-side rendering (SSR) is a critical piece of modern frameworks like Hydrogen and is another place where Vite shines. It extends Rollup hooks to provide SSR information, thus enabling many advanced use cases.

For example, it is possible to transform the same imported file in different ways depending on the running environment (browser or server). This is key to supporting some advanced features we need in Hydrogen, such as React Server Components, which until now were only available in Webpack.
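As a rough illustration of environment-dependent transforms, the sketch below assumes the `transform` hook receives an options object with an `ssr` flag as its third argument, which is how recent Vite versions expose it (older versions differ, so check yours). The plugin and the `__RUNTIME__` token are hypothetical, and the type is a local stand-in rather than Vite's own.

```typescript
// Local stand-in for the relevant part of the plugin type (assumed shape).
interface SsrAwarePlugin {
  name: string;
  transform(code: string, id: string, options?: { ssr?: boolean }): string | null;
}

// Hypothetical plugin: replaces a made-up `__RUNTIME__` token per environment.
const runtimeFlagPlugin: SsrAwarePlugin = {
  name: 'runtime-flag-example',
  transform(code, _id, options) {
    if (!code.includes('__RUNTIME__')) return null; // nothing to do
    const runtime = options?.ssr ? 'server' : 'browser';
    return code.split('__RUNTIME__').join(JSON.stringify(runtime));
  },
};

// The same source yields different output for the server and the browser.
const forServer = runtimeFlagPlugin.transform('log(__RUNTIME__)', '/src/a.js', { ssr: true });
const forBrowser = runtimeFlagPlugin.transform('log(__RUNTIME__)', '/src/a.js');
```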

Vite can also load front-end code in the server by converting dependencies to a Node-compatible runtime and modules to CJS. Think of simply importing a React application in Node. It greatly eases the way SSR works and is something Hydrogen leverages to remove extra dependencies and simplify code.


Last but not least, Vite has a large and vibrant community around it.

Many projects in addition to Hydrogen are relying on and contributing to Vite, such as Vitest, SvelteKit, Astro, Storybook, and many more.

And it's not just about the projects, but also the people behind them, who are incredibly active and always willing to help in Vite's Discord channel, from Vite's creator, @youyuxi, to many other contributors and maintainers such as @patak_dev, @alecdotbiz, and @antfu7.

Hydrogen is also a proud sponsor of Vite. We want to support the project to ensure it stays up to date with the latest DX improvements that make web developers’ lives easier.

How Hydrogen uses Vite

Our goal when building Hydrogen on top of Vite was to keep things as “close to the metal” as possible and not reinvent the wheel. CLI tools can rely on Vite commands internally, and most of the required configuration is abstracted away.

Creating a Vite-powered Hydrogen storefront is as easy as adding the @shopify/hydrogen/plugin plugin to your vite.config.js:
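A minimal configuration might look like the sketch below. It assumes the plugin is the default export of `@shopify/hydrogen/plugin` and can be called without arguments; the exact options depend on the Hydrogen version you have installed.

```typescript
// vite.config.js (sketch; plugin options vary between Hydrogen versions)
import {defineConfig} from 'vite';
import hydrogen from '@shopify/hydrogen/plugin';

export default defineConfig({
  // The Hydrogen plugin wires up the rest of the required configuration.
  plugins: [hydrogen()],
});
```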

Behind the scenes, we are invoking four different plugins:

  • hydrogen-config: This is responsible for altering the default Vite config values for Hydrogen projects. It helps ensure bundling for both Node.js and Worker runtimes work flawlessly, and that third-party packages are processed properly.
  • react-server-dom-vite: It adds support for React Server Components (RSC). We extracted this plugin from Hydrogen core and made it available in the React repository.
  • hydrogen-middleware: This plugin is used to hook into Vite’s dev server configuration and inject custom behavior. It allows us to respond to SSR and RSC requests while leaving the asset requests to Vite’s default web server.
  • @vitejs/plugin-react: This is an official Vite plugin that adds some goodies for React development such as fast refresh in the browser.

With just this, Hydrogen is able to support server components, streaming requests, clever caching, and more. By combining this with all the features Shopify already provides, you can unlock unparalleled performance and best-in-class DX for your storefront.

Choosing the Right Tool

There are still many advanced use cases where Webpack is a good fit since it is very mature and flexible. Many projects and teams, such as React’s, rely heavily on it for their day-to-day development.

However, Vite makes building modern apps a delightful experience and empowers framework authors with many tools to make development easier. Storefront developers can enjoy a best-in-class DX while building new features at a faster pace. We chose Vite for Hydrogen and are happy with that decision so far.

Fran works as a Staff Software Engineer on the Hydrogen team at Shopify. Located in Tokyo, he's a web enthusiast and an active open source contributor who enjoys all things tech and all things coconut. Connect with Fran on Twitter and GitHub.

If you’re passionate about solving complex problems at scale, and you’re eager to learn more, we're hiring! Reach out to us or apply on our careers page.

10 Books Shopify’s Tech Talent Think You Should Read

How we think, absorb information, and maximize time—these are the topics Shopify developers and engineers are reading up on.

We have a book bar of the company’s favorite reads and make sure any employee who wants a copy of any title can get one. So we thought we’d flip the script and ask 10 of our technical minds to tell us the books they think everyone in tech should read this year.

Many of their choices were timeless, suggesting a clear desire to level up beyond hard skills. There are a couple of deep dives into software design and computing systems, but many of the titles on this reading list are guides for reframing personal habits and patterns: taking notes, receiving feedback, sharing knowledge, and staying focused amid myriad distractions.

The Talent Code by Daniel Coyle

(Bantam Books)

I received my copy of The Talent Code shortly before uprooting my life to attend a front-end bootcamp. The school sent a copy to every student about to start their nine-week program. Coyle’s thesis is “Greatness isn’t born. It’s grown.” He highlights areas that allow us to become great at almost anything: deep practice, passion, and master coaching. The book made me rethink whether I’m destined to be bad at some things. One example for me was softball, but a more pressing use case was my upcoming immersion in coding. Coyle’s lessons helped me thrive during my course’s long hours, but I haven’t applied the same lessons to softball, yet.

Carys Mills, Staff Front End Developer

The 5 Elements of Effective Thinking by Edward B. Burger and Michael Starbird

(Princeton University Press)

I’ve always followed the adage of “work smarter, not harder,” but in knowledge work, how do we “think smarter, not harder”? The 5 Elements of Effective Thinking presents an answer, packaged in a framework that’s applicable in work and life more broadly. The book is short and pithy. I keep it near my desk. The elements of the book include how to understand a topic, how to think about failure, how to generate good questions, and how to follow those questions. I won’t spoil the fifth element for you; you’ll have to read about it yourself!

Ash Furrow, Senior Staff Developer

Thanks for the Feedback: The Science and Art of Receiving Feedback Well by Sheila Heen and Douglas Stone

(Viking Adult)

As developers, we give and receive feedback all the time—every code review, tech review, and, of course, feedback on our foundational and soft skills too. There’s a lot of focus on how to do a good code review—how to give feedback, but there’s also an art of receiving feedback. Sheila Heen and Douglas Stone’s Thanks for the Feedback: The Science and Art of Receiving Feedback Well does an excellent job of laying out the different layers involved in receiving feedback and the different kinds there are. Being able to identify the kind of feedback I’m getting (beyond "constructive")—appreciation or encouragement, coaching or evaluative—has helped me leverage even poorly delivered feedback to positively impact my personal and professional growth.

Swati Swoboda, Development Manager, Shipping

How to Take Smart Notes by Sönke Ahrens


Occasionally there are books that will totally flip how you think about doing something. How to Take Smart Notes is one of those. The title is about notes, but the book is about taking a totally different approach to learning and digesting information. Even if you choose not to follow the exact note taking technique it describes, the real value is in teaching you how to think about your own methods of absorbing and integrating new information. It’s completely changed the approach I take to studying nonfiction books.

Rose Wiegley, Staff Software Engineer

Extreme Ownership: How U.S. Navy SEALs Lead and Win by Jocko Willink and Leif Babin

(Echelon Front)

The book that I'd recommend people read, if they haven't read it before, is actually a book we recommend internally at Shopify: Extreme Ownership by Jocko Willink and Leif Babin. Don't let the fact that it's about the Navy SEALs put you off. There are so many generally applicable lessons that are critical as our company continues to grow at a rapid pace. Success in a large organization—especially one that is globally distributed—is about decentralized leadership from teams and individuals: we all have the autonomy and permission to go forth and build amazing things for our merchants, so we should do just that whilst setting great examples for others to follow.

James Stanier, Director of Engineering, Core

The Elements of Computing Systems: Building a Modern Computer from First Principles by Noam Nisan and Shimon Schocken

(The MIT Press)

Curious how tiny hardware chips become the computers we work on? I highly recommend The Elements of Computing Systems for any software developer wanting a more well-rounded understanding of a computer’s abstraction layers—not just at the level you’re most comfortable with, but even deeper. This workbook guides you through building your own computer from the ground up: hardware chip specifications, assembly language, programming language, and operating system. The authors did a great job of including the right amount of knowledge to not overwhelm readers. This book has given me a stronger foundation in computing systems while working at Shopify. Don’t like technical books? The authors also have lectures on Coursera available for free.

Maple Ong, Senior Developer

A Philosophy of Software Design by John Ousterhout

(Yaknyam Press)

A Philosophy of Software Design tackles a complicated topic: how to manage complexity while building systems. And, surprisingly, it’s an easy read. One of Stanford computer science professor John Ousterhout’s insights I strongly agree with is that working code isn’t enough. Understanding the difference between tactical vs strategic coding helps you level up—specifically, recognizing when a system is more complex than it needs to be is a crucial yet underrated skill. I also like how Ousterhout likens software to playing a team sport, and when he explains why our work as developers isn’t only writing code that works, but also creating code and systems that allow others to work easily. Read with an open mind. A Philosophy of Software Design offers a different perspective from most books on the subject.

Stella Miranda, Senior Developer

Living Documentation by Cyrille Martraire

(Addison-Wesley Professional)

Living Documentation isn’t so much about writing good documentation as about transmitting knowledge, which is the real purpose of documentation. In the tech world where the code is the source of truth, we often rely on direct interactions when sharing context, but this is a fragile process as knowledge can be diluted from one person to another and even lost when people leave a company. On the other side of the spectrum lies traditional documentation. It’s more perennial but requires significant effort to keep relevant, and that’s the main reason why documentation is the most overlooked task in the tech world. Living Documentation is an attempt at bridging the gap between these two extremes by applying development tools and methods to documentation in an incremental way, ensuring knowledge transmission in a 100-year company.

Frédéric Bonnet, Staff Developer

Uncanny Valley by Anna Wiener

(MCD Books)

Sometimes you need to read something that’s both resonant and entertaining in addition to job or specific skill-focused books. In the memoir Uncanny Valley, Anna Wiener vividly describes her journey from working as a publishing assistant in New York to arriving in the Bay Area and befriending CEOs of tech unicorns. At a time when tech is one of the biggest and most influential industries in the world, her sharp observations and personal reflections force those of us working in the sector to look at ourselves with a critical eye.

Andrew Lo, Staff Front End Developer

Deep Work: Rules For Focused Success in a Distracted World by Cal Newport

(Grand Central Publishing)

I've found that the most impactful way to tackle hard problems is to first get into a flow state. Having the freedom to work uninterrupted for long blocks of time has often been the differentiator in discovering creative solutions. Once you've experienced it, it's tough going back to working any other way. Most of the activities we do as knowledge workers benefit from this level of attention and focus. And if you've never tried working in long, focused time blocks, Deep Work should convince you to give it a shot. A word of warning though: make sure you have a bottle of water and some snacks handy. It's easy to completely lose track of time and skip meals. Don't do that!

Krishna Satya, Development Manager

For more book recommendations, check out this Twitter thread from last year’s National Book Lovers Day.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

How We Built the Add to Favorite Animation in Shop

I just want you to feel it

Jay Prince, from the song Feel It

I use the word feeling a lot when working on animations and gestures. For example, animations or gestures sometimes feel right or wrong. I think about that word a lot because our experiences using software are based on an intuitive understanding of the real world. When you throw something in real life, it influences how you expect something on screen to behave after you drag and release it.

By putting work, love, and care into UI details and designs, we help shape the experience and feeling users have when using an app. All the technical details and work are in service of the user's experiences and feelings. The user may not consciously notice the subtle animations we create, but if we do our job well, the tiniest gesture will feel good to them.

The team working on Shop, our digital shopping assistant, recently released a feature that allows buyers to favorite products and shops. By pressing a heart button on a product, buyers can save those products for later. When they do, the product image drops into the heart icon (containing a list of favorite products) in the navigation tab at the bottom.

In this post, I’ll show you how I approached implementing the Add to Favorite animation in Shopify’s Shop app. Specifically, we can look at the animation of the product image thumbnail appearing, then moving into the favorites tab bar icon:

Together, we'll learn:

  • How to sequence animations.
  • How to animate multiple properties at the same time.
  • What interpolation is.

Getting Started

When I start working on an animation from a video provided by a designer, I like to slow it down so I can see what's happening more clearly:

If a slowed video isn’t provided, you can record the animation using Monosnap or QuickTime. This also allows you to slowly scrub through the video. Fortunately, we also have this great motion spec to work with:

As you can see, the motion spec defines the sequence of animations. Based on the spec, we can determine:

  • which properties are animating
  • what values to animate to
  • how long each animation will take
  • the easing curve of the animation
  • the overall order of the animations

Planning the Sequence

Firstly, we should recognize that there are two elements being animated:

  • the product thumbnail
  • the favorites tab bar icon

The product thumbnail is animated first, then the Favorites tab bar icon. Let's break it down step by step:

1. The product thumbnail fades in from 0% to 100% opacity. At the same time, it scales from 0 to 1.2.
2. The product thumbnail scales from 1.2 to 1.
(A 50 ms pause where nothing happens.)
3. The product thumbnail moves down, then disappears instantly at the end of this step.
4. The Favorites tab bar icon moves down. At the same time, it changes color from white to purple.
5. The Favorites tab bar icon moves up. At the same time, it changes color from purple to white.
6. The Favorites tab bar icon moves down.
7. The Favorites tab bar icon moves up to its original position.


Each of the above steps is an animation with a duration and an easing curve, as specified in the motion spec provided by the motion designer. Our motion specs define the easing curves that describe how a property changes over time:

Coding the Animation Sequence

Let's write code! The Shop app is a React Native application and we use the Reanimated library to implement animations.

For this animation sequence, multiple properties are sometimes animated at once. However, these animations happen together, driven by the same timings and curves. Therefore, we can use a single shared value for the whole sequence. That shared progress value can drive the animations for each step by moving from 0 to 1 to 2 to 3, and so on.

So the progress value tells us which step of the animation we are in, and we can set the animated properties accordingly. As you can see, this sequence of steps matches the steps we wrote down above, along with each step's duration and easing curve, including a delay at step 3:
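To make the idea concrete outside of React Native, here is a plain TypeScript model of a single progress value stepping through the sequence. In the app this would be expressed with Reanimated helpers such as `withSequence`, `withTiming`, and `withDelay`; the `Step` type, the `progressAt` function, the linear easing, and every duration below are assumptions for illustration (only the 50 ms pause comes from the steps above).

```typescript
// One step of the sequence: progress moves from the previous value to `to`.
interface Step {
  to: number;         // progress value reached at the end of the step
  durationMs: number; // placeholder duration, not taken from the motion spec
  delayMs?: number;   // pause before the step starts
}

const steps: Step[] = [
  { to: 1, durationMs: 200 },
  { to: 2, durationMs: 100 },
  { to: 3, durationMs: 200, delayMs: 50 }, // the 50 ms pause before step 3
  { to: 4, durationMs: 100 },
  { to: 5, durationMs: 100 },
  { to: 6, durationMs: 100 },
  { to: 7, durationMs: 100 },
];

// Returns the progress value after `elapsedMs`, using linear easing for simplicity.
function progressAt(elapsedMs: number, sequence: Step[]): number {
  let start = 0; // time at which the current step (including its delay) begins
  let from = 0;  // progress value at the start of the current step
  for (const step of sequence) {
    const delay = step.delayMs ?? 0;
    const end = start + delay + step.durationMs;
    if (elapsedMs < end) {
      // During the delay, the elapsed portion is clamped to zero.
      const t = Math.max(0, elapsedMs - start - delay) / step.durationMs;
      return from + (step.to - from) * t;
    }
    start = end;
    from = step.to;
  }
  return from; // sequence finished
}
```

Because every animated property reads from this one value, the steps cannot drift apart in time.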

We can now start mapping the progress value to the animated properties!

Product Thumbnail Styles

First let's start with the product thumbnail fading in:

What does interpolate mean?

Interpolating maps a value from an input range to an output range. For example, if the input range is [0, 1] and the output range is [0, 10], then as the input increases from 0 to 1, the output increases correspondingly from 0 to 10. In this case, we're mapping the progress value from [0, 1] to [0, 1] (so no change in value).

In the first step of the animation, the progress value changes from 0 to 1 and we want the opacity to go from 0 to 1 during that time so that it fades in. “Clamping” means that when the input value is greater than 1, the output value stays at 1 (it restricts the output to the maximum and minimum of the output range). So the thumbnail will fade in during step 1, then stay at full opacity for the next steps because of the clamping.
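As a rough sketch of what clamped interpolation computes, the function below maps a value through matching input and output ranges and pins it to the ends of the output range. It is a hypothetical helper written for illustration, not Reanimated's implementation, and it assumes the input range is strictly increasing and both ranges have the same length.

```typescript
// Piecewise-linear interpolation with clamping at both ends.
function interpolateClamped(
  value: number,
  inputRange: number[],
  outputRange: number[],
): number {
  const last = inputRange.length - 1;
  // Clamp: outside the input range, stick to the nearest output endpoint.
  if (value <= inputRange[0]) return outputRange[0];
  if (value >= inputRange[last]) return outputRange[last];
  // Find the segment that contains `value`.
  let i = 0;
  while (value > inputRange[i + 1]) i++;
  const t = (value - inputRange[i]) / (inputRange[i + 1] - inputRange[i]);
  return outputRange[i] + (outputRange[i + 1] - outputRange[i]) * t;
}

// The example from the text: [0, 1] mapped onto [0, 10].
const halfway = interpolateClamped(0.5, [0, 1], [0, 10]);
// Opacity during later steps: progress is past 1, so the output stays at 1.
const opacityAtStep2 = interpolateClamped(2, [0, 1], [0, 1]);
```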

However, we also want the thumbnail to disappear instantly at step 3. In this case, we don't use interpolate because we don't want it to animate a fade-out. Instead, we want an instant disappearance:

Now the item is fading in, but it also has to grow in scale and then shrink back a bit:

This interpolation is saying that from step 0 to 1, we want scale to go from 0 to 1.2. From step 1 to 2, we want the scale to go from 1.2 to 1. After step 2, it stays at 1 (clamping).

Let's do the final property, translating it vertically:

So we're moving from position -60 to -34 (halfway behind the tab bar) between steps 2 and 3. After step 3, the opacity becomes 0 and it disappears! Let's test the above code:
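Putting the three thumbnail properties together, a framework-free sketch could look like this. In the app the same mapping would live inside a Reanimated animated style. The `thumbnailStyle` and `interp` names are hypothetical, and treating a progress of 3 or more as the moment the thumbnail disappears is an assumption about how the instant cut is implemented.

```typescript
// Clamped piecewise-linear interpolation, inlined so this sketch is self-contained.
function interp(v: number, input: number[], output: number[]): number {
  if (v <= input[0]) return output[0];
  for (let i = 1; i < input.length; i++) {
    if (v <= input[i]) {
      const t = (v - input[i - 1]) / (input[i] - input[i - 1]);
      return output[i - 1] + (output[i] - output[i - 1]) * t;
    }
  }
  return output[output.length - 1];
}

interface ThumbnailStyle {
  opacity: number;
  scale: number;
  translateY: number;
}

function thumbnailStyle(progress: number): ThumbnailStyle {
  return {
    // Fade in during step 1; cut to 0 with no fade once step 3 completes.
    opacity: progress >= 3 ? 0 : interp(progress, [0, 1], [0, 1]),
    // Grow to 1.2 during step 1, settle back to 1 during step 2.
    scale: interp(progress, [0, 1, 2], [0, 1.2, 1]),
    // Slide from -60 to -34 during step 3.
    translateY: interp(progress, [2, 3], [-60, -34]),
  };
}
```

For example, `thumbnailStyle(2.5)` describes a fully opaque thumbnail at scale 1, halfway through its downward slide.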

Nice, it fades in while scaling up, then scales back down, then slides down halfway under the tab bar, and then disappears.

Tab Bar Icon Styles

Now we just need to write the Favorite tab bar icon styles!

First, let's handle the heart becoming filled (turning purple), then unfilled (turning white). I did this by positioning the filled heart icon over the unfilled one, then fading in the filled one over the unfilled one. Therefore, we can use a simple opacity animation where we move from 0 to 1 and back to 0 over steps 3, 4 and 5:

For the heart bouncing up and down we have:

From steps 3 to 7, this makes the icon move up and down, creating a bouncing effect. Let's see how it looks!

Nice, we now see the tab bar icon react to having a product move into it.

Match Cut

By using a single shared value, we ensured that the heart icon moves down immediately when the thumbnail disappears, creating a match cut. A “match cut” is a cinematic technique where the movement of an item immediately cuts to the movement of another item during a scene transition. The movement that the user's eye expects as the product thumbnail moves down cuts to a matching downward movement of the heart icon. This creates an association of the item and the Favorites section in the user's mind.

In another approach, I tried using setTimeout to start the tab bar icon animation after the thumbnail one. I found that when the JS thread was busy, this would delay the second animation, which ruined the match cut transition! It felt wrong when seeing it with that delay. Therefore, I did not use this approach. Using withDelay from Reanimated would have avoided this issue by keeping the timer on the UI thread.

When I started learning React Native, the animation code was intimidating. I hope this post helps make implementing animations in React Native more fun and approachable. When done right, they can make user interactions feel great!

You can see this animation by favoriting a product in the Shop app!

Special thanks to Amber Xu for designing these animations, providing me with great specs and videos to implement them, and answering my many questions.

Andrew Lo is a Staff Front End Developer on Shop's Design Systems team. He works remotely from Toronto, Canada.

A Data Scientist’s Guide To Measuring Product Success

If you’re a data scientist on a product team, much of your work involves getting a product ready for release. You may conduct exploratory data analyses to understand your product’s market, or build the data models and pipelines needed to power a new product feature, or design a machine learning model to unlock new product functionality. But your work doesn’t end once a product goes live. After a product is released, it’s your job to help identify if your product is a success.

Using Terraform to Manage Infrastructure

Large applications are often a mix of code your team has written and third-party applications your team needs to manage. These third-party applications could be things like AWS or Docker. In my team’s case, it’s Twilio TaskRouter.

The configuration of these services may not change as often as your app code does, but when it does, the process is fraught with the potential for errors. This is because there is no way to write tests for the changes or easily roll them back, two things we depend on as developers when shipping our application code.

Using Terraform improves your infrastructure management by allowing users to implement engineering best practices in what would otherwise be a GUI with no accountability, tests, or revision history.

On the Conversations team, we recently implemented Terraform to manage a piece of our infrastructure with great success. Let’s take a deeper look at why we did it, and how.

My team builds Shopify’s contact center. When a merchant or partner interacts with an agent, they are likely going through a tool we’ve built. Our app suite contains applications we’ve built in-house and third-party tools. One of these tools is Twilio TaskRouter.

TaskRouter is a multi-channel skill-based task routing API. It handles creating tasks (voice, chat, etc.) and routing them to the most appropriate agent, based on a set of routing rules and agent skills that we configure.

As our business grows and becomes more complex, we often need to make changes to how merchants are routed to the appropriate agent.

Someone needs to go into our Twilio console and use the graphical user interface (GUI) to update the configuration. This process is fairly straightforward and works well for getting off the ground quickly. However, the complexity quickly becomes too high for one person to understand it in its entirety.

In addition, the GUI doesn’t provide a clear history of changes or a way to roll them back.

As developers, we are used to viewing a commit history, reading PR descriptions and tests to understand why changes happened, and rolling back changes that are not working as expected. When working with Twilio TaskRouter, we had none of these.

Using Terraform to Configure Infrastructure

Terraform is an open source tool for configuring infrastructure as code.

It is a state machine for infrastructure that brings all the benefits of the engineering best practices listed above to infrastructure that was previously only manageable via a GUI.

Terraform requires three things to work:

  1. A reliable API. When using Terraform, we stop using the GUI and rely on Terraform to make our changes via the API. Anything you can’t change with the API, you won’t be able to manage with Terraform.
  2. A Go client library. Terraform is written in Go and requires a Go client library for the API you’re targeting. The client library makes HTTP(S) calls to your target app.
  3. A Terraform provider. The core Terraform software uses a provider to interact with the target API. Providers are written in Go using the Terraform Plugin SDK.

With these three pieces, you can manage just about any application with Terraform!


A Terraform provider adds a set of resources Terraform can manage. Providers are not part of Terraform’s code. They are created separately to manage a specific application. Twilio did not have a provider when we started this project, so we made our own.

Since launching this project, Twilio has developed its own Terraform provider.

At its core, a provider enables Terraform to perform CRUD operations on a set of resources. Armed with a provider, Terraform can manage the state of the application.

Creating a Provider

Note: If you are interested in setting up Terraform for a service that already has a provider, you can skip to the next section.

Here is the basic structure of a Terraform provider:

This folder structure contains your Go dependencies, a Makefile for running commands, an example file for local development, and a directory called twilio. This is where our provider lives.

A provider must contain a resource file for every type of resource you want to manage. Each resource file contains a set of CRUD instructions for Terraform to follow–you’re basically telling Terraform how to manage this resource.

Here is the function defining what an activity resource is in our provider:

Note: Go is a strongly typed language, so the syntax might look unusual if you’re not familiar with it. Luckily you do not need to be a Go expert to write your own provider!

This file defines what Terraform needs to do to create, read, update, and destroy activities in TaskRouter. Each of these operations is defined by a function in the same file.

The file also defines an Importer function, a special type of function that allows Terraform to import existing infrastructure. This is very handy if you already have infrastructure running and want to start using Terraform to manage it.

Finally, the function defines a schema: the parameters the API accepts for performing CRUD operations. In the case of TaskRouter activities, the parameters are friendly_name, available, and workspace_sid.

To round out the example, let’s look at the create function we wrote:

Note: Most of this code is boilerplate Terraform provider code which you can find in their docs.

The function is passed context, a schema resource, and an empty interface.

We instantiate the Twilio API client and find our workspace (TaskRouter activities all exist under a single workspace).

Then we format our parameters (defined in our Schema in the resourceTwilioActivity function) and pass them into the create method provided to us by our API client library.

Because this function creates a new resource, we set the ID (SetId) to the sid of the result of our API call. In Twilio, a sid is a unique identifier for a resource. Now Terraform is aware of the newly created resource and its unique identifier, which means it can make changes to the resource.

Using Terraform

Once you have created your provider or are managing an app that already has a provider, you’re ready to start using Terraform.

Terraform uses a DSL for managing resources. The good news is that this DSL is more straightforward than the Go code that powers the provider.

The DSL is simple enough that with some instruction, non-developers should be able to make changes to your infrastructure safely–but more on that later.

Here is the code for defining a new TaskRouter activity:

Yup, that’s it!

We create a block declaring the resource type and what we want to call it. In that block, we pass the variables defined in the Schema block of our resourceTwilioActivity, and any resources that it depends on. In this case, activities need to exist within a workspace. So we pass in the workspace resource in the depends_on array. Terraform knows it needs this resource to exist or to create it before attempting to create the activity.

Now that you have defined your resource, you’re ready to start seeing the benefits of Terraform.

Terraform has a few commands, but plan and apply are most common. Plan will print out a text-based representation of the changes you’re about to make:

Terraform makes visualizing the changes to your infrastructure very easy. At this planning step you may uncover unintended changes: if there was already an offline activity, the plan step would show you an update instead of a create. In that case, all you need to do is change your resource block’s name and run terraform plan again.
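To make the create-versus-update distinction concrete, here is a toy sketch in Go of how a plan step could compare the desired resources against the known state. It is not Terraform’s actual algorithm: the Activity type, the plan function, and the action strings are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Activity is a hypothetical, simplified stand-in for a TaskRouter activity.
type Activity struct {
	FriendlyName string
	Available    bool
}

// plan compares the desired resources with the known state and returns one
// human-readable action per changed resource, sorted for stable output.
func plan(desired, state map[string]Activity) []string {
	var actions []string
	for name, want := range desired {
		have, exists := state[name]
		switch {
		case !exists:
			actions = append(actions, "create "+name)
		case have != want:
			actions = append(actions, "update "+name)
		}
	}
	// Anything in the state that is no longer declared would be destroyed.
	for name := range state {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "destroy "+name)
		}
	}
	sort.Strings(actions)
	return actions
}

func main() {
	state := map[string]Activity{
		"offline": {FriendlyName: "Offline", Available: false},
	}
	desired := map[string]Activity{
		"offline": {FriendlyName: "Offline", Available: true},
		"break":   {FriendlyName: "On Break", Available: false},
	}
	for _, action := range plan(desired, state) {
		fmt.Println(action) // prints "create break", then "update offline"
	}
}
```

Because the offline resource already exists in the state, the sketch reports an update for it rather than a create, which is the kind of surprise the plan step is meant to surface before anything changes.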

When you are satisfied with your changes, run terraform apply to make the changes to your infrastructure. Now Terraform will know about the newly created resource, and its generated id, allowing you to manage it exclusively through Terraform moving forward.

To get the full benefit of Terraform (PRs, reviews, etc.), we use an additional tool called Atlantis to manage our GitHub integration.

This allows people to make pull requests with changes to resource files, and have Atlantis add a comment to the PR with the output of terraform plan. Once the review process is done, we comment atlantis apply -p terraform to make the change. Then the PR is merged.

We have come a long way from managing our infrastructure with a GUI in a web app! We have a Terraform provider communicating via a Go API client to manage our infrastructure as code. With Atlantis plugged into our team’s GitHub, we now have many of the best practices we rely on when writing software–reviewable PRs that are easy to understand and roll back if necessary, with a clear history that can be scanned with a git blame.

How Was Terraform Received by Other Teams?

The most rewarding part of this project was how it was received by other teams. Instead of business and support teams making requests and waiting for developers to change Twilio workflows, Terraform empowered them to do it themselves. In fact, some people’s first PRs were changes to our Terraform infrastructure!

Along with freeing up developer time and making the business teams more independent, Terraform provides visibility to infrastructure changes over time. Terraform shows the impact of changes, and the ease of searching GitHub for previous changes makes it easy to understand the history of changes our teams have made.

Building great tools will often require maintaining third-party infrastructure. In my team’s case, this means managing Twilio TaskRouter to route tasks to support agents properly.

As the needs of your team grow, the way you configure your infrastructure will likely change as well. Tracking these changes and being confident in making them is very important but can be difficult.

Terraform makes these changes more predictable and empowers developers and non-developers alike to use software engineering best practices when making these changes.

Jeremy Cobb is a developer at Shopify. He is passionate about solving problems with code and improving his serve on the tennis court.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.


Creating a React Library for Consistent Data Visualization


At Shopify, we tell a lot of stories through data visualization. This is the driving force behind business decisions—not only for our merchants, but also for teams within Shopify.

With more than 10,000 Shopify employees, though, it’s only natural that different teams started using different tools to display data. That’s great; after all, creative minds create diverse solutions, right? The problem is that it led to a lot of inconsistencies, like these two line charts that used to live in the Shopify admin (the page you see after logging in to Shopify, where you can set up your store, configure your settings, and manage your business):

Let’s play Spot the Difference: line widths, dashed line styles, legend styles, background grids, one has labels on the X axis, the other doesn’t... This isn’t just a “visual styles” problem: since they used different libraries, one was accessible to screen readers and the other wasn’t; one was printable, the other not.

To solve this problem, the Insights team has been working on creating a React data visualization library—Polaris Viz—that other teams can rely on to quickly implement data visualization without having to solve the same problems over and over again.

But first things first, if you haven’t yet, I recommend you start by reading my co-worker Miru Alves’ amazing blog post where she describes how we used Delta-E and Contrast Ratio calculations to create a color matrix with a collection of colors we can choose from to safely use without violating any accessibility rules.

This post is going to focus on the process of implementing the light and dark themes in the library, as well as allowing library consumers to create their own themes, since not all Shopify brands like Shop, Handshake, or Oberlo use the same visual identity.

Where Did the Inconsistencies Come From?

When we started tackling this issue, the first thing we noticed was that even in places that were already using only Polaris Viz, we had visual inconsistencies. This is because our original components API looked like this:

As you can see, changing the appearance of a chart involved many different options spread in different props, and you either had to create a wrapping component that has all the correct values or pass the props over and over again to each instance. OK, this explains a lot.

Ideally, all charts in the admin should use either the default dark or light themes that the UX team created, so we should make it easy for developers to choose light or dark without all this copyin’ && pasta.

Implementing Themes

To cover the use cases of teams that used the default dark or light themes, we removed all the visual style props and introduced a new theme prop to all chart components:

  • The theme prop accepts the name of a theme defined in a record of Themes.
  • The Theme type contains all visual configurations like colors, line styles, spacing, and if bars should be rounded or not.

These changes allow consumers to have all the good styles by default (styles that match our visual identity, take accessibility into consideration, and have no accidental discrepancies), and they just have to pass in theme='Light' if they want to use the Light theme instead of the Dark theme.

This change should cover the majority of use cases, but we still need to support other visual identities. Putting back all those style props would lead to the same problems for whoever wasn’t using the default styles. So how could we make it easy to specify a different visual identity?

Introducing the PolarisVizProvider

We needed a way to allow consumers to define what their own visual identity looks like in a centralized manner so all charts across their applications would just use the correct styles. So instead of having the chart components consume the themes record from a const directly, we introduced a context provider that stores the themes:

By having the provider accept a themes prop, we allow consumers to overwrite the Default and Light themes or add their own. This implementation could cause some problems though: what happens if a user overwrites the Default theme, but doesn’t provide all the properties that are necessary to render a chart? For example, what if they forget to pass the tooltip background color?

To solve this, we first implemented a createTheme function:

createTheme allows you to pass in a partial theme and obtain a complete theme. All the properties that are missing in the partial theme will just use the library’s default values.
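The merge behavior described here can be sketched as follows. This is a minimal illustration written in Go, not the library’s TypeScript implementation; the flattened Theme type, the property names, and the default values are all hypothetical.

```go
package main

import "fmt"

// Theme is a hypothetical, flattened theme: property name -> value.
type Theme map[string]string

// defaultTheme holds made-up library defaults, for illustration only.
var defaultTheme = Theme{
	"chartBackground":   "black",
	"tooltipBackground": "gray",
	"lineStyle":         "solid",
}

// createTheme returns a complete theme: every property starts from the
// default value and is overwritten only where the partial theme sets it.
func createTheme(partial Theme) Theme {
	complete := make(Theme, len(defaultTheme))
	for key, value := range defaultTheme {
		complete[key] = value
	}
	for key, value := range partial {
		complete[key] = value
	}
	return complete
}

func main() {
	theme := createTheme(Theme{"chartBackground": "blue"})
	// The overridden property changes; the missing ones fall back to defaults.
	fmt.Println(theme["chartBackground"], theme["tooltipBackground"]) // prints "blue gray"
}
```

Copying the defaults into a fresh map first means a consumer’s partial theme can never mutate the library’s defaults, so a forgotten property such as the tooltip background still has a value at render time.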

Next, we implemented a createThemes function. It guarantees that even if properties are overwritten, the theme record will always contain the Default and Light themes:

With both of these in place, we just needed to update the PolarisVizProvider implementation:

Overwriting the Default Theme

From a consumer perspective, this means that you could wrap your application with a PolarisVizProvider, define your Default theme, and all charts will automagically inherit the correct styles. For example:

All charts inside of <App/> will have a blue background by default:

It hurts my eyes, but IT WORKS!

Creating Multiple Themes

You can also define multiple extra themes in the PolarisVizProvider. Each top level key in this object is used as a theme name that you can pass to individual charts later on. For example:

The first chart uses a theme named AngryRed and the second uses HappyGreen.

We did have to define the single series color twice though (seriesColors.single = ['black']), and it would be even more annoying if we had multiple properties shared by both themes and only wanted to overwrite some. We can make this easier by changing the implementation of the createTheme function to accept an optional baseTheme, instead of always using the default from the library:

With those changes in place, as a consumer I can just import createTheme from the library and use AngryRed as the base theme when creating HappyGreen:

Making Colors Change According to the Data Set

Another important feature we had in the library and didn’t want to lose was to change the series colors according to the data.

In this example, we’re applying a green gradient to the first chart to highlight the highest values as having more ordered items—more sales—is a good thing! In the second chart though, we’re applying a red gradient to highlight the highest values, since having more people return what they ordered isn’t such a good thing.

It would be super cumbersome to create extra themes any time we wanted a specific data series to use a different color, so we changed our DataSeries type to accept an optional color that can overwrite the series color coming from the theme:

So for the example above, we could have something like:

Next Steps

Polaris Viz will be open source soon! If you want to get access to the beta version of the library, help us test, or suggest features that might be useful for you, reach out to us at

Krystal is a Staff Developer on the Visualization Experiences team. When she’s not obsessing over colors, shapes and animation she’s usually hosting karaoke & billiards nights with friends or avoiding being attacked by her cat, Pluma.


Test Budget: Time Constrained CI Feedback


At Shopify we run more than 170,000 tests in our core monolith. Naturally, we're constantly exploring ways to make this faster, and the Test Infrastructure team analyzed the feasibility of introducing a test budget: a fixed amount of time for tests to run. The goal is to speed up the continuous integration (CI) test running phase by accepting more risk. To achieve that goal we used prioritization to reorder the test execution plan in order to increase the probability of a fast failure. Our analysis provided insights into the effectiveness of executing prioritized tests under a time constraint. The single most important finding was that we were able to find failures after we had run only 70% of the test-selection suite.

The Challenge

Shopify’s codebase relies on CI to avoid regressions before releasing new features. As the code submission rate grows along with the development team size, so does the size of the test pool and the time between code check-ins and test result feedback. As seen in the figure below, developers will occasionally get late CI feedback, while other times the CI builds complete in under 10 minutes. This non-normal cadence of receiving CI feedback leads to more frequent context switches.

The feedback time varies

Various techniques exist to speed up CI such as running tests in parallel or reducing the number of tests to run with test selection. Balancing the cost of running tests against the value of running them is a fundamental topic in test selection. Furthermore, if we think of the value as a variable then we can make the following observations for executing tests:

  • No amount of tests can give us complete confidence that no production issue will occur.
  • The risk of production issues is lower if we run all the tests.
  • As complexity of the system increases, the value of testing any individual component decreases.
  • Not all tests increase our confidence level the same way.

The Approach

It’s important to first note the difference between test selection and test prioritization. Test selection deterministically selects all tests that correspond to the given changes using a call graph. Test prioritization, on the other hand, orders the tests with the goal of discovering failures fast. Also, that sorted set won’t always be the same for the same change, since the prioritization techniques use historical data.

The system we built produces a prioritized set of tests on top of test selection and constrains the execution of those tests using a predetermined time budget. Having established that there’s a limited time to execute the tests, the next step is to determine the best time to stop executing tests and enforce it.

The time constraint, or budget (which is where the name Test Budget comes from), is the predetermined time at which we terminate test execution, keeping in mind that we must find as many failures as possible within that period of time.

System Overview

The guiding principle we used to build the Test Budget was: we can't be sure there will be no bugs in production that affect the users after running our test suite in any configuration.

To identify the most valuable tests to run within an established time budget, the following steps must be performed:

  1. identify prioritization criteria and compute the respective prioritized sets of tests
  2. compute the metrics for all criteria and analyze the results to determine the best criteria
  3. further analyze the data to pick a time constraint for running the tests

The image below gives a structural overview of the test prioritization system we built. First, we compute the prioritized sets of tests using historical test results for every prioritization criterion (for example, the failure rate criterion has its own prioritized set of tests). Then, given some commit and the test-selection set that corresponds to that commit, we execute the prioritized tests as a CI build. These prioritized tests are a subset of the test-selection test suite.

Test Prioritization System

First, the system obtains the test result data needed by the prioritization techniques. The data is ingested into a Rails app that’s responsible for the processing and persistence. It exposes the test results through an HTTP API and a GUI. For persistence, we chose to use Redis, not only because of the unstructured nature of our data, but also because of the Redis Sorted Sets data structure that enables us to query for ordered sets of tests in O(log n + m) time, where n is the number of elements in the set and m is the number of elements returned.

The goal of the next step is to select a subset of tests given the changes in the committed code. We created a pipeline that’s triggered for a percentage of the builds that contain failures. We execute this pipeline with a specific prioritization each time and calculate metrics based on it.

Modeling Risk

During the CI phase, the risk of not finding a fault can be thought of as a numbers game. How certain are we that the application will be released successfully if we have tested all the flows? What if we test the same flows 1000 times? We leaned on test prioritization to order the tests in such a way that faults are found as early as possible, which encouraged the application of heuristics as the prioritization criteria. This section explores how to measure the risk of not detecting faults under the time budget when tests are skipped based on the best heuristics rather than at random.

Prioritization Criteria

We built six test prioritization criteria that produced a rating for every test in the codebase:

  • failure_rate: how frequently a test fails based on historical data.
  • avg_duration: how fast a test executes. Executing faster tests allows us to execute more tests in a short amount of time.
  • churn: a file that’s changing too much could be more brittle.
  • coverage: how much of the source code is executed when running a test.
  • complexity: based on the lines of code per file.
  • default: this is the random order set.

Evaluation Criteria

After we get the prioritized tests, we need to evaluate the results of executing the test suite following the prioritized order. We chose two groups of metrics to evaluate the criteria:

  1. The first includes the Time to First Failure (TTFF) which acts as a tripwire since if the time to first failure is 10 minutes then we can’t enforce a lower time constraint than 10 minutes.
  2. The second group of metrics includes the Average Percentage of Faults Detected (APFD) and the Convergence Index. We needed to start thinking of the test execution timing problem using a risk scale, which would open the way for us to run fewer tests by tweaking how much risk we will accept.

The APFD is a measure of how early a particular test suite execution detects failures. APFD is calculated using the following formula:

$$\mathrm{APFD} = 1 - \frac{F_1 + F_2 + \dots + F_m}{n \cdot m} + \frac{1}{2n}$$
APFD Formula

The equation tells us that, to calculate the APFD, we sum the positions of the first tests that expose each failure, divide that sum by the product of the number of tests and the number of failures, subtract the result from 1, and then add 1/(2n). In the equation above:

  • n is the number of test cases in the test suite
  • m is the total number of failures in the test suite
  • F_i is the position in the prioritized order set of the first test that exposes the fault i.

The APFD values range from 0 to 1, where higher APFD values imply a better prioritization.

For example, for the test suites T1 and T2 (produced by different prioritization algorithms), each with a total number of tests (n) = 100 and a total number of faults (m) = 4, we get the following matrix:

And we calculate their APFD values:

APFD Values

The first prioritization has a better APFD rating (0.7525 versus 0.6425).
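As a rough illustration of how such a rating could be computed, here is a minimal Go sketch. It assumes the standard APFD definition, 1 - (F_1 + ... + F_m)/(n·m) + 1/(2n); the apfd function and the fault positions in main are made up for the example and are not the data behind the values above.

```go
package main

import "fmt"

// apfd computes the Average Percentage of Faults Detected for a suite of n
// tests, where firstPositions[i] is the 1-based position, in the prioritized
// order, of the first test that exposes fault i.
func apfd(n int, firstPositions []int) float64 {
	m := len(firstPositions)
	if n == 0 || m == 0 {
		return 0 // APFD is undefined here; this sketch returns 0 by convention
	}
	sum := 0
	for _, position := range firstPositions {
		sum += position
	}
	return 1 - float64(sum)/float64(n*m) + 1/float64(2*n)
}

func main() {
	// Hypothetical suite of 10 tests with 2 faults.
	fmt.Printf("%.4f\n", apfd(10, []int{1, 3})) // faults exposed early: 0.8500
	fmt.Printf("%.4f\n", apfd(10, []int{6, 9})) // faults exposed late: 0.3000
}
```

The ordering that exposes both faults within the first three tests scores much higher than the one that exposes them near the end, which is exactly the property used to compare prioritizations.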

The Convergence Index tells us when to stop testing within a time constrained environment because a high convergence indicates we’re running fewer tests and finding a big percentage of failures.

$$\text{Convergence Index} = \frac{\text{Percentage of faults detected}}{\text{Percentage of tests executed}}$$
Convergence Index

The formula to calculate the Convergence Index is the percentage of faults detected divided by the percentage of tests executed.
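A small Go sketch of that calculation follows. The convergenceIndex function, its parameters, and the counts in main are hypothetical and chosen only to make the arithmetic easy to follow.

```go
package main

import (
	"errors"
	"fmt"
)

// convergenceIndex divides the percentage of faults detected by the
// percentage of tests executed. It returns an error when a percentage
// cannot be computed or the divisor would be zero.
func convergenceIndex(faultsDetected, totalFaults, testsExecuted, totalTests int) (float64, error) {
	if totalFaults == 0 || totalTests == 0 || testsExecuted == 0 {
		return 0, errors.New("convergence index is undefined for zero totals or zero tests executed")
	}
	percentFaults := 100 * float64(faultsDetected) / float64(totalFaults)
	percentTests := 100 * float64(testsExecuted) / float64(totalTests)
	return percentFaults / percentTests, nil
}

func main() {
	// Hypothetical build: 8 of 10 faults found after 60 of 100 tests.
	index, err := convergenceIndex(8, 10, 60, 100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f\n", index) // prints 1.33
}
```

A value above 1 means the share of faults found is larger than the share of tests executed, which is the situation a good prioritization aims for early in the run.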

Data Analysis

For each build, we created and instrumented a prioritized pipeline to produce artifacts for building the prioritization sets and emit test results to Kafka topics.

The prioritization pipeline in Buildkite

We ran the prioritized pipeline multiple times to apply statistical analysis to our results. Finally, we used Python Notebooks to combine all the measurements and easily visualize the percentiles. For APFD and TTFF we decided to use boxplots to visualize possible outliers and skewness of the data.

When Do We Find the First Failing Test?

We used the TTFF metric to quantify how fast we could know that the CI will eventually fail. Finding the failure within a time window is critical because the goal is to enforce that window and stop the test execution when the time window ends.


In the figure above we present the statistical distributions for the prioritization criteria using boxplots. The median time to find a failure is less than five minutes for all the criteria. Complexity, churn, and avg_duration have the worst third quartile results with a maximum of 16 minutes. On the other hand, default and failure_rate gave more promising results with a median of less than three minutes.

Which Prioritization Criteria Have the Best Failure Detection Rates?

We used the APFD metric to compare the prioritization criteria. A higher APFD value indicates a better failure detection rate.

APFD scores

The figure above presents the boxplots of APFD values for all the prioritization criteria. We notice that there isn’t a significant difference between the churn and complexity prioritization criteria. Both of these have median values close to zero, which makes them very inappropriate for prioritizing the tests. We also see that failure_rate has the best detection rate, which is marginally better than the random (default) one.

Which Prioritization Criterion Has the Quickest Convergence Time?

The rate at which we detect new test failures decreases as we execute more tests. We visualized this with the convergence index data using a step chart. In all the convergence graphs the step is 10% of the test suite executed.

Mean convergence index

The above figure indicates that, for the mean case, all the criteria find some percentage of faults after running only 50% of the test suite, but the default and failure_rate prioritization criteria stand out.

For the mean case, executing 50% of the test suite finds 50% of the failures using the default prioritization and 60% using the failure_rate. The failure_rate criterion is able to detect 80% of the failures after running only 60% of the test suite.

How Much Can We Shrink the Test Suite Given a Time Constraint?

The p20 and p5 visualizations of the convergence quantify how reliably we could detect faults within the time budget. We use the p20 and p5 visualizations because a higher value of convergence is better. The time budget is an upper bound. The CI system executes the tests up to that time bound.

Convergence index p20

For example, after looking at the p20 (80% of builds) plot (the above figure), we need to execute 60% of the test-selection tests (the test-selection suite is 40% of the whole test suite on median) to detect an acceptable amount of failures. Then, the time budget is the time it takes to execute 60% of the selected tests.

Convergence index p5

Looking at the plot of the 5th percentile (95% of the builds, see the figure above), we notice that we could execute 70% of the already test-selection-reduced test suite to detect 50% of the failures.

The Future of Test Budget Prioritization

Looking at our convergence and TTFF results, if we want to emphasize the discovery of a faulty commit (that is, the first failure), we can see that we could execute less than 70% of the test-selection suite.

The results of the data analysis suggest several alternatives for future work. First, deep learning models could use the time budget as a constraint while they build the prioritized sets. Second, prioritizing tests using a feedback mechanism could be the next approach to explore: tests that never run could be automatically deleted from the codebase, and failures that result in problems during production testing could be given a higher priority.

Finally, one potential application of a Test Budget prioritization system lies outside the scope of the Continuous Integration environment: the development environment. Another way of looking at the ordered sets is that the first tests are more impactful or more susceptible to failures. Then we could use such data to inform developers during the development phase that parts of the codebase are more likely to have failing tests in CI. A message such as “this part of the codebase is covered by a high priority test which breaks in 1% of the builds” would give feedback to developers immediately while they’re writing the code. It would shift testing to the left by giving code suggestions during development, and eventually reduce the costs and time of executing tests in the CI environment.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.


Adding the V8 CPU Profiler to v8go


V8 is Google’s open source high-performance JavaScript and WebAssembly engine written in C++. v8go is a library written in Go and C++ allowing users to execute JavaScript from Go using V8 isolates. Using Cgo bindings allows us to run JavaScript in Go at native performance.

The v8go library, developed by Roger Chapman, aims to provide an idiomatic way for Go developers to interface with V8. As it turns out, this can be tricky. For the past few months, I’ve been contributing to v8go to expose functionality in V8. In particular, I’ve been adding support to expose the V8 CPU Profiler.

From the start, I wanted this new API to be:

  • easy for the library's Go users to reason about
  • easy to extend for other profiler functionality eventually
  • aligned closely with the V8 API
  • as performant as possible.

The point about performance is especially interesting. I theorized that my first iteration of the implementation was less performant than a proposed alternative. Without benchmarking them, I proceeded to rewrite. That second implementation was merged, and I moved on with my life. Then I thought, "Hey! I should write a post about the PR and benchmark the results," only to actually see the benchmarks and reconsider everything.

If you’re interested in API development, Go/Cgo/C++ performance or the importance of good benchmarks, this is a story for you.

Backing Up to the Starting Line: What Was My Goal?

The goal of adding the V8 CPU Profiler to v8go was so users of the library could measure the performance of any JavaScript being executed in a given V8 context. Besides providing insight on the code being executed, the profiler returns information about the JavaScript engine itself including garbage collection cycles, compilation and recompilation, and code optimization. While virtual machines and the like can run web applications incredibly fast, code should still be performant, and it helps to have data to understand when it's not. 

If we have access to a CPU profiler, we can ask it to start profiling before we start executing any code. The profiler samples the CPU stack frames at a preconfigured interval until it's told to stop. Sufficient sampling helps show the hot code paths whether that be in the source code or in the JavaScript engine. Once the profiler has stopped, a CPU profile is returned. The profile comes in the form of a top-down call tree composed of nodes. To walk the tree, you get the root node and then follow its children all the way down.
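Walking such a tree can be sketched in a few lines of Go. The Node type below is a hypothetical stand-in with only a name and children; it is not v8go’s or V8’s actual node type, and the function names in main are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for a profile node, not v8go's type.
type Node struct {
	FunctionName string
	Children     []*Node
}

// walk visits the tree depth-first from the given node and returns one
// line per node, indented by its depth in the call tree.
func walk(node *Node, depth int) []string {
	if node == nil {
		return nil
	}
	name := node.FunctionName
	if name == "" {
		name = "(anonymous)" // anonymous functions have an empty name
	}
	lines := []string{strings.Repeat("  ", depth) + name}
	for _, child := range node.Children {
		lines = append(lines, walk(child, depth+1)...)
	}
	return lines
}

func main() {
	root := &Node{FunctionName: "(root)", Children: []*Node{
		{FunctionName: "(program)"},
		{FunctionName: "start", Children: []*Node{{FunctionName: "foo"}}},
	}}
	fmt.Println(strings.Join(walk(root, 0), "\n"))
}
```

Starting at the root and recursing into each child in order yields the top-down view: callers appear before the functions they call, with indentation showing the call depth.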

Here’s an example of some JavaScript code we can profile:

Using v8go, we start by creating the V8 isolate, context, and CPU profiler. Before running the above code, the profiler is told to start profiling:

After the code has finished running, the profiling is stopped and the CPU profile returned. A simplified profile in a top-down view for this code looks like:

Each of these lines corresponds to a node in the profile tree. Each node comes with plenty of details including:

  • name of the function (empty for anonymous functions)
  • id of the script where the function is located
  • name of the script where the function originates
  • number of the line where the function originates
  • number of the column where the function originates
  • whether the script where the function originates is flagged as being shared cross-origin
  • count of samples where the function was currently executing
  • child nodes of this node
  • parent node of this node
  • and more found in the v8-profiler.h file.
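
To make the shape of the tree concrete, here is a minimal sketch of a node type and a top-down walk. The `Node` struct, its field names, and `Render` are hypothetical stand-ins for illustration only; they are not v8go's or V8's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, simplified stand-in for a CPU profile node.
// The field names are illustrative and are not v8go's actual API.
type Node struct {
	FunctionName string // empty for anonymous functions
	ScriptName   string
	Line, Column int
	HitCount     int // samples where this function was executing
	Children     []*Node
}

// Render walks the tree top-down and returns one indented line per node.
func Render(n *Node, depth int) []string {
	name := n.FunctionName
	if name == "" {
		name = "(anonymous)"
	}
	lines := []string{fmt.Sprintf("%s%s %s:%d:%d (hits: %d)",
		strings.Repeat("  ", depth), name, n.ScriptName, n.Line, n.Column, n.HitCount)}
	for _, child := range n.Children {
		lines = append(lines, Render(child, depth+1)...)
	}
	return lines
}

func main() {
	// Made-up sample data, not the output of a real profiling run.
	root := &Node{FunctionName: "(root)", Children: []*Node{
		{FunctionName: "start", ScriptName: "script.js", Line: 1, Column: 15, HitCount: 2,
			Children: []*Node{{ScriptName: "script.js", Line: 3, Column: 9, HitCount: 7}}},
	}}
	for _, line := range Render(root, 0) {
		fmt.Println(line)
	}
}
```

The walk starts at the root and follows each node's children all the way down, which is the same order a top-down view is printed in.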

For the purposes of v8go, we don’t need to have opinions about how the profile should be formatted, printed, or used since this can vary. Some may even turn the profile into a flame graph. It’s more important to focus on the developer experience of trying to generate a profile in a performant and idiomatic way.

Evolving the API Implementation

Given the focus on performance and an idiomatic-to-Go API, the PR went through a few different iterations. These iterations can be categorized into two distinct rounds: the first where the profile was lazily loaded and the second where the profile was eagerly loaded. Let’s start with lazy loading.

Round 1: Lazy Loading

The initial approach I took aligned v8go with V8's API as closely as possible. This meant introducing a Go struct for each V8 class we needed and their respective functions (that is, CPUProfiler, CPUProfile, and CPUProfileNode).

This is the Go code that causes the profiler to stop profiling and return a pointer to the CPU profile:

This is the corresponding C++ code that translates the request in Go to V8's C++:

With access to the profile in Go, we can now get the top-down root node:

The root node exercises this C++ code to access the profiler pointer and its corresponding GetTopDownRoot() method:

With the top-down root node, we can now traverse the tree. Each call to get a child, for instance, is its own Cgo call as shown here:

The Cgo call exercises this C++ code to access the profile node pointer and its corresponding GetChild() method:

The main differentiator of this approach is that to get any information about the profile and its nodes, we have to make a separate Cgo call. For a very large tree, this makes at least kN more Cgo calls where k is the number of properties queried, and N is the number of nodes. The value for k will only increase as we expose more properties on each node.
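
The call count above can be illustrated with a small simulation. This is only a sketch: no Cgo is involved, the `lazyNode` type is hypothetical, and every getter simply increments a counter where the real implementation would cross into C.

```go
package main

import "fmt"

// crossings counts simulated Go-to-C boundary crossings. In real code each
// of these would be a Cgo call; here it is only a counter.
var crossings int

// lazyNode is a hypothetical handle to a node that lives on the C++ side.
type lazyNode struct {
	name     string
	hits     int
	children []*lazyNode
}

// Each getter below stands in for one Cgo call.
func (n *lazyNode) FunctionName() string { crossings++; return n.name }
func (n *lazyNode) HitCount() int { crossings++; return n.hits }
func (n *lazyNode) ChildrenCount() int { crossings++; return len(n.children) }
func (n *lazyNode) Child(i int) *lazyNode { crossings++; return n.children[i] }

// Walk visits every node, reads two properties (k = 2) from each, and
// returns the number of nodes visited.
func Walk(n *lazyNode) int {
	_ = n.FunctionName()
	_ = n.HitCount()
	nodes := 1
	count := n.ChildrenCount()
	for i := 0; i < count; i++ {
		nodes += Walk(n.Child(i))
	}
	return nodes
}

// Chain builds a tree of n nodes in which every node has a single child.
func Chain(n int) *lazyNode {
	root := &lazyNode{name: "fn0"}
	cur := root
	for i := 1; i < n; i++ {
		next := &lazyNode{name: fmt.Sprintf("fn%d", i)}
		cur.children = append(cur.children, next)
		cur = next
	}
	return root
}

func main() {
	crossings = 0
	nodes := Walk(Chain(100))
	fmt.Printf("%d nodes, %d simulated crossings\n", nodes, crossings)
}
```

With N = 100 and k = 2, the walk makes 399 simulated crossings: kN for the properties, plus N calls to get each node's child count and N - 1 calls to fetch the children themselves.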

How Go and C Talk to Each Other

At this point, I should explain more clearly how v8go works. v8go uses Cgo to bridge the gap between Go and V8's C code. Cgo allows Go programs to interoperate with C libraries: calls can be made from Go to C and vice versa.

If you research Cgo's performance, you'll find Sean Allen’s GopherCon 2018 talk where he made the following recommendation:

“Batch your CGO calls. You should know this going into it, since it can fundamentally affect your design. Additionally once you cross the boundary, try to do as much on the other side as you can. So for go => “C” do as much as you can in a single “C” call. Similarly for “C” => go do as much as you can in a single go call. Even more so since the overhead is much higher.”

Similarly, you’ll find Dave Cheney’s excellent “cgo is not go” that explains the implications of using cgo: 

“C doesn’t know anything about Go’s calling convention or growable stacks, so a call down to C code must record all the details of the goroutine stack, switch to the C stack, and run C code which has no knowledge of how it was invoked, or the larger Go runtime in charge of the program.

The take-away is that the transition between the C and Go world is non trivial, and it will never be free from overhead.”

When we talk about “overhead,” the actual cost can vary by machine, but some benchmarks that another v8go contributor (Dylan Thacker-Smith) ran show an overhead of about 54 nanoseconds per operation (ns/op) for Go to C calls and 149 ns/op for C to Go calls:

Given this information, the concern for the lazy loading is justified: when a user needs to traverse the tree, they’ll make many more Cgo calls, incurring the overhead cost each time. After reviewing the PR, Dylan suggested building the entire profile graph in C code and then passing a single pointer back to Go, so Go could rebuild the same graph using Go data structures loaded with all the information, which can then be passed to the user. This dramatically reduces the number of Cgo calls. This brings us to round #2.
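
To put the per-call figure in perspective, here is a back-of-the-envelope estimate. It's a sketch under assumptions: the values of k and N are invented, the 54 ns figure is the approximate Go to C number quoted above, and the estimate covers only the boundary crossings, not the work done on either side.

```go
package main

import (
	"fmt"
	"time"
)

// goToCOverhead is the approximate per-call figure quoted in the text.
// The actual cost varies by machine.
const goToCOverhead = 54 * time.Nanosecond

// EstimateOverhead returns the boundary-crossing overhead alone for a given
// number of Go-to-C calls. It ignores the work done on either side.
func EstimateOverhead(calls int) time.Duration {
	return time.Duration(calls) * goToCOverhead
}

func main() {
	// Hypothetical workload: 5 properties read on each of 10,000 nodes.
	const k, n = 5, 10000
	fmt.Println("lazy, kN calls:", EstimateOverhead(k*n))
	fmt.Println("eager, 1 call: ", EstimateOverhead(1))
}
```

The estimate leaves out what eager loading adds in exchange: building the graph on the C side and rebuilding it in Go. That is exactly the kind of cost only a benchmark can weigh.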

Round 2: Eager Loading

To build out a profile for visualization, users will need access to most, if not all, of the nodes of the profile. We also know that, for performance, we want to limit the number of C calls that have to be made in order to do so. So, we move the heavy lifting of getting the entire call graph inside our C++ function StopProfiling, so that the pointer we return to the Go code points to the call graph fully loaded with all the nodes and their properties. Our Go CPUProfile and CPUProfileNode objects match V8’s API in that they have the same getters, but now, internally, they just return the values from the struct’s private fields instead of reaching back to the C++ code.

This is what the StopProfiling function in C++ does now: once the profiler returns the profile, the function traverses the graph starting at the root node and builds out the C data structures. A single pointer to the profile is then returned to the Go code, which traverses the graph to build the corresponding Go data structures.

The corresponding function in Go, StopProfiling, uses Cgo to call the above C function (CPUProfilerStopProfiling) to get the pointer to our C struct CPUProfile. By traversing the tree, we can build the Go data structures so the CPU profile is completely accessible from the Go side:

With this eager loading, the rest of the Go calls to get profile and node data are as simple as returning the values from the private fields on the struct.
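
The eager approach can be sketched roughly as follows. It's a simplified illustration, not the merged implementation: `rawNode` is a hypothetical plain-Go stand-in for the C struct, `newNode` is an invented helper, and the getter names are modeled on the getters described above.

```go
package main

import "fmt"

// rawNode stands in for the C struct handed back across the boundary.
// It is hypothetical; the real layout lives in v8go's C/C++ sources.
type rawNode struct {
	functionName string
	lineNumber   int
	children     []rawNode
}

// CPUProfileNode keeps its data in private fields and exposes getters,
// so reading a property never needs to cross back into C.
type CPUProfileNode struct {
	functionName string
	lineNumber   int
	parent       *CPUProfileNode
	children     []*CPUProfileNode
}

func (n *CPUProfileNode) GetFunctionName() string { return n.functionName }
func (n *CPUProfileNode) GetLineNumber() int { return n.lineNumber }
func (n *CPUProfileNode) GetParent() *CPUProfileNode { return n.parent }
func (n *CPUProfileNode) GetChildrenCount() int { return len(n.children) }
func (n *CPUProfileNode) GetChild(i int) *CPUProfileNode { return n.children[i] }

// newNode copies a raw node and all of its descendants into Go structs,
// wiring up parent pointers along the way.
func newNode(raw rawNode, parent *CPUProfileNode) *CPUProfileNode {
	n := &CPUProfileNode{functionName: raw.functionName, lineNumber: raw.lineNumber, parent: parent}
	n.children = make([]*CPUProfileNode, 0, len(raw.children))
	for _, c := range raw.children {
		n.children = append(n.children, newNode(c, n))
	}
	return n
}

func main() {
	// Made-up sample data standing in for what the C side would return.
	raw := rawNode{functionName: "(root)", children: []rawNode{
		{functionName: "start", lineNumber: 1},
	}}
	root := newNode(raw, nil)
	child := root.GetChild(0)
	fmt.Println(child.GetFunctionName(), child.GetLineNumber(), child.GetParent() == root)
}
```

Once `newNode` has run, every getter is a plain field read, which is the property the eager design is after.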

Round 3 (Maybe?): Lazy or Eager Loading

There’s the potential for a variation where both of the above implementations are options. This means allowing users to decide whether they want to lazily or eagerly load everything on the profile. It’s another reason why, in the final implementation of the PR, the getters were kept instead of just making all of the Node and Profile fields public. With the getters and private fields, we can change what’s happening under the hood based on how the user wants the profile to load.
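
One way such a switch could look is sketched below. Everything here is hypothetical: `LoadMode`, `NewNode`, and the `fetch` callback (which stands in for a Cgo call) are not part of v8go. This lazy variant also caches the value after the first read, a small departure from Round 1, where every getter call crossed into C.

```go
package main

import "fmt"

// LoadMode selects when node data is fetched. It is a hypothetical option,
// not part of v8go.
type LoadMode int

const (
	Lazy LoadMode = iota
	Eager
)

// Node hides its data behind a getter so the loading strategy can change
// without touching the public API.
type Node struct {
	name   string
	loaded bool
	fetch  func() string // stands in for a Cgo call
}

// NewNode fetches immediately in Eager mode and defers the fetch in Lazy mode.
func NewNode(mode LoadMode, fetch func() string) *Node {
	n := &Node{fetch: fetch}
	if mode == Eager {
		n.load()
	}
	return n
}

func (n *Node) load() {
	n.name = n.fetch()
	n.loaded = true
}

// GetFunctionName returns the cached value, fetching it on first use if needed.
func (n *Node) GetFunctionName() string {
	if !n.loaded {
		n.load()
	}
	return n.name
}

func main() {
	calls := 0
	fetch := func() string { calls++; return "start" }

	lazy := NewNode(Lazy, fetch)
	fmt.Println("fetches after creating lazy node:", calls)
	fmt.Println(lazy.GetFunctionName(), "fetches:", calls)

	eager := NewNode(Eager, fetch)
	fmt.Println(eager.GetFunctionName(), "fetches:", calls)
}
```

Because callers only ever see `GetFunctionName`, the choice between the two modes stays an internal detail.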

Speed is Everything, So Which One's Faster?

Comparing lazy and eager loading required a test that executed some JavaScript program with a decently sized tree so we could exercise a number of Cgo calls on many nodes. We would measure if there was a performance gain by building the tree eagerly in C and returning that complete call graph as a pointer back to Go.

For quite a while, I ran benchmarks using the JavaScript code from earlier. From those tests, I found that:

  1. When lazy loading the tree, the average duration to build it is ~20 microseconds.
  2. When eagerly loading the tree, the average duration to build it is ~25 microseconds.

It's safe to say these results were unexpected. As it turns out, the theorized behavior of the eager approach wasn’t more optimal than lazy loading; in fact, it was the opposite. It relied on more Cgo calls for this tree size.

However, because these results were unexpected, I decided to try a much larger tree using the Hydrogen starter template. From testing this, I found that:

  1. When lazy loading the tree, the average duration to build it is ~90 microseconds.
  2. When eagerly loading the tree, the average duration to build it is ~60 microseconds.

These results aligned better with our understanding of the performance implications of making numerous Cgo calls. It seems that, for a tiny tree, traversing it three times (twice to eagerly load information and once to print it) doesn’t cost less than a single walk to print it that includes numerous Cgo calls. The true cost only shows itself on a much larger tree, where paying the upfront graph traversal cost greatly benefits the eventual walkthrough of the tree to print it. If I hadn’t tried a differently sized input, I would never have seen that the value of eager loading eventually shows itself. If I drew growth lines for the respective approaches on a graph, it would look something like:

Simple graph with time to build profile on the y axis and size of JavaScript on the x axis. Two lines indicating eager and lazy are plotted on the graph, with lazy being higher
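
A benchmark that covers several input sizes can be sketched like this. It is a generic harness under assumptions: `countNodes` is only a placeholder workload, and `buildTree` and `measure` are invented helpers. A real comparison would swap in the lazy and eager code paths and would more likely use Go's `testing` package.

```go
package main

import (
	"fmt"
	"time"
)

type node struct{ children []*node }

// buildTree returns a complete tree with the given depth and fan-out.
func buildTree(depth, fanout int) *node {
	n := &node{}
	if depth > 1 {
		for i := 0; i < fanout; i++ {
			n.children = append(n.children, buildTree(depth-1, fanout))
		}
	}
	return n
}

// countNodes is a placeholder workload. In a real comparison this is where
// the lazy or eager traversal under test would go.
func countNodes(n *node) int {
	total := 1
	for _, c := range n.children {
		total += countNodes(c)
	}
	return total
}

// measure reports the average duration of fn over the given number of runs.
func measure(runs int, fn func()) time.Duration {
	start := time.Now()
	for i := 0; i < runs; i++ {
		fn()
	}
	return time.Since(start) / time.Duration(runs)
}

func main() {
	// Cover tiny, medium, and large trees so that size-dependent behavior
	// has a chance to show up.
	for _, depth := range []int{2, 6, 10} {
		tree := buildTree(depth, 3)
		avg := measure(100, func() { countNodes(tree) })
		fmt.Printf("nodes=%d avg=%v\n", countNodes(tree), avg)
	}
}
```

Running the same measurement across sizes is what reveals a crossover like the one described above; a single small input would hide it.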

Looking Back at the Finish Line

As a longtime Go developer, there are plenty of things I take for granted about memory management and performance. Working on the v8go library has forced me to learn about Cgo and C++ in such a way that I can understand where the performance bottlenecks might be, how to experiment around them, and how to find ways to optimize for them. Specifically, contributing the functionality of CPU profiling to the library reminded me that:

  1. I should benchmark code when performance is critical rather than just going with my (or another’s) gut. It absolutely takes time to flesh out a sufficient alternative code path to do fair benchmarking, but chances are there are discoveries made along the way. 
  2. Designing a benchmark matters. If the variables in the benchmark aren’t reflective of the average use case, then the benchmarks are unlikely to be useful and may even be confusing.

Thank you to Cat Cai, Oliver Fuerst, and Dylan Thacker-Smith for reviewing, clarifying, and generally just correcting me when I'm wrong.

About the Author:

Genevieve is a Staff Developer at Shopify, currently working on Oxygen.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

RubyConf 2021: The Talks You Might Have Missed

Shopify loves Ruby and opportunities to get together with other engineers who love Ruby to learn, share, and build relationships. In November, Rubyists from Shopify’s Ruby and Rails infrastructure teams gathered in Denver at RubyConf 2021 to immerse themselves in all things Ruby with a community of their peers. If you weren’t there or want to revisit the content, we’ve compiled a list of the talks from our engineers. 

A History of Compiling Ruby by Chris Seaton

Love Ruby compilers? Chris does.

“Why is it worth looking at Ruby compilers? Why is it worth looking at compilers at all? Well, I think compilers are fascinating. I’ve been working on them for a couple of decades. I think one of the great things about compilers, you can talk to anyone who’s a developer about compilers, because we all use compilers. Everyone’s got an opinion on how the languages should be designed. You can have conversations with anyone at every level about compilers, and compilers are just really fun. They may seem like a deeply technical topic, but they’re conceptually fairly simple. They take a file as input, they do something internally, and they produce a file as output.”

In this talk, Chris dives into the history of Ruby compilers, the similarities and differences, and what we can learn from them.

Learn more about Chris’ work on TruffleRuby:

Some Assembly Required by Aaron Patterson 

In typical Aaron style, this talk is filled with puns and humor while being educational and thought-provoking. Aaron shares why he wrote a JIT compiler for Ruby. Why did he write a JIT compiler? 

To see if he could.

“I wanted to see if I could build this thing. For me, programming is a really creative and fun endeavor. I love to program. And many times I’ll just write a project just to see if I can do it. And this is one of those cases. So, I think maybe people are asking, ‘does this thing actually work?’” 

Watch Aaron’s talk to find out if it does work and learn how to build a JIT compiler in pure Ruby. 

Learn more about TenderJIT on GitHub

Building a New JIT Compiler Inside CRuby by Maxime Chevalier Boisvert

In this talk, Maxime talks about YJIT, an open-source project led by a small team of developers at Shopify to incrementally build a new JIT compiler inside CRuby. She discusses the key advantages of YJIT, the approach the team is taking to implement YJIT, and early performance results.

“The objective is to produce speedups on real-world software. For us, real-world software means large web workloads, such as Ruby on Rails. The benefits of our approach is we’re highly compatible with all existing Ruby code and we’re able to support all of the latest Ruby features.”

Check out YJIT in Ruby 3.1!

Learn more about YJIT:

Gradual Typing in Ruby–A Three Year Retrospective by Ufuk Kayserilioglu and Alexandre Terrasa 

Ufuk and Alexandre share a retrospective of adopting Sorbet at Shopify, why you don’t have to go full-in on types out of the gate, and why gradual typing might be a great middle-ground for your team. They also share lessons learned from a business and technical perspective. 

“You shouldn’t be getting in the way of people doing work. If you want adoption to happen, you need to ramp up gently. We’re doing gradual type adoption. And because this is gradual-type adoption, it’s totally okay to start slow, to start at the lowest strictness levels, and to gradually turn it up as people are more comfortable and as you are more comfortable using the tools.”

Check out the following posts from Ufuk and Alexandre to learn more about static typing for Ruby and adopting Sorbet at scale at Shopify.

Building Native Extensions. This Could Take A While... by Mike Dalessio 

At RubyKaigi 2021, Mike did a deep dive into the techniques and toolchain used to build and ship native C extensions for Ruby. In his latest talk at RubyConf 2021, Mike expands upon the conversation to explore why Nokogiri evolved to use more complex techniques for compilation and installation over the years and touches upon human trust and security. 

“Nokogiri is web-scale now. Since January (2021), precompiled versions of Nokogiri have been downloaded 60 million times. It’s a really big number. If you do back of the envelope power calculations, assuming some things about your core, 2.75 megawatts over 10 months have been saved.”

Mike has provided companion material to the talk on GitHub.

Parsing Ruby by Kevin Newton

Kevin digs into the topic of Ruby parsers with a thorough deep dive into the technical details and tradeoffs of different tools and implementations. While parsing is a technically challenging topic, Kevin delivers a talk that speaks to junior and senior developers, so there’s something for everyone! 

“Parser generators are complicated technologies that use shift and reduce operations to build up syntax trees. Parser generators are difficult to maintain across implementations of languages. They’re not the most intuitive of technologies and it’s difficult to maintain upstream compatibility. It’s a good thing that Ruby is going to slow down on syntax and feature development because it’s going to give an opportunity for all the other Ruby implementations to catch up.”

Problem Solving Through Pair Programming by Emily Harber

We love pair programming at Shopify. In this talk, Emily explores why pair programming is a helpful tool for getting team members up to speed and writing high-quality code, allowing your team to move faster and build for the long term. Emily also provides actionable advice to get started to have more productive pairing sessions.

“Pair programming is something that should be utilized at all levels and not exclusively as a part of your onboarding or mentorship processes. Some of the biggest benefits of pairing carry through all stages of your career and through all phases of development work. Pairing is an extremely high fidelity way to build and share context with your colleagues and to keep your code under constant review and to combine the strengths of multiple developers on a single piece of a shared goal.”


Achieving Fast Method Metaprogramming: Lessons from MemoWise by Jemma Issroff

In this talk, Jemma and Jacob share the journey of developing MemoWise, Ruby’s most performant memoization gem. The presentation digs into benchmarking, unexpected object allocations, performance problems common to Ruby metaprogramming, and their experimentation to develop techniques to overcome these concerns.

“So we were really critically concerned with optimizing our performance as much as possible. And like any good scientist, we followed the scientific method to ensure this happens. So four steps: Observation, hypothesis, experiment, and analysis. Benchmarks are one of the best ways to measure performance and to an experiment that we can use over and over again to tell us exactly how performant our code is or isn’t.” 

Programming with Something by Tom Stuart

In this talk, Tom explores how to store executable code as data in Ruby and write different kinds of programs that process it. He also tries to make “fasterer” and “fastererer” words, but we’ll allow it because he shares a lot of great content.

“A simple idea like the SECD machine is the starting point for a journey of iterative improvement that lets us eventually build a language that’s efficient, expressive, and fast.”

If you are interested in exploring the code shown in Tom’s talk, it’s available on GitHub.

The Audacious Array by Ariel Caplan

Do you love Arrays? In this talk, Ariel explores the “powerful secrets” of Ruby arrays by using…cats! Join Ariel on a journey through his game, CatWalk, which he uses to discuss the basics of arrays, adding and removing elements, creating randomness, interpretation, arrays as sets, and more. 

“When we program, many of the problems that we solve fall into the same few categories. We often need to create constructs like a randomizer, a 2D representation of data like a map, some kind of search mechanism, or data structures like stacks and queues. We might need to take some data and use it to create some kind of report, And sometimes we even need to do operations that are similar to those we do on a mathematical set. It turns out, to do all of these things, and a whole lot more, all we need is a pair of square brackets. All we need is one of Ruby’s audacious arrays.” 

If you want to explore the code for Ariel’s “nonsensical” game, CatWalk, check it out on GitHub

Ruby Archaeology by Nick Schwaderer

In this talk, Nick “digs” into Ruby archaeology to run old code and explore Ruby history and interesting gems from the past, and shares insights into what works and what’s changed from these experiments.

“So why should you become a Ruby archeologist? There are hundreds of millions, if not billions, of lines of valid code, open source for free, on the internet that you can access today. In the Ruby community today, sometimes it feels like we’re converging.”

Keeping Developers Happy With a Fast CI by Christian Bruckmayer

As a member of Shopify’s test infrastructure team, Christian ensures that the continuous integration (CI) systems are scalable, robust, and usable. In this talk, Christian shares techniques such as monitoring, test selection, timeouts, and the 80/20 rule to speed up test suites. 

“The reason we have a dedicated team is just the scale of Shopify. So the Rails core monolith has approximately 2.8 million lines of code, over a thousand engineers work on it, and in terms of testing we have 210,000 Ruby tests. If you execute them it would take around 40 hours. We run around 1,000 builds per day, which means we run around 100 million test runs per day. So that’s a lot.”

Read more about keeping development teams happy with fast CI on the blog.

Note: The first 1:40 of Christian’s talk has minor audio issues, but don’t bail on the talk because the audio clears up quickly, and it’s worth it!

Parallel Testing With Ractors–Putting CPU's to Work by Vinicius Stock

Vini talks about using Ractors to parallelize test execution, builds a test framework on top of Ractors, compares current solutions, and discusses the advantages and limitations.

“Fundamentally, tests are just pieces of code that we want to organize and execute. It doesn’t matter if in Minitest they are test methods and in RSpec they are Ruby blocks, they’re just blocks of code that we want to run in an organized manner. It then becomes a matter of how fast we can do it in order to reduce the feedback loop for our developers. Then we start getting into strategies for parallelizing the execution of tests.”

Optimizing Ruby's Memory Layout by Peter Zhu & Matt Valentine-House

Peter and Matt discuss how their variable width allocation project can move system heap memory into Ruby heap memory, reducing system heap allocations, and providing finer control of the memory layout to optimize for performance.

“We’re confident about the stability of variable width allocation. Variable width allocation passes all tests on CI on Shopify’s Rails monolith, and we ran it for a small portion of production traffic of a Shopify service for a week, where it served over 500 million requests.”

Bonus: Meet Shopify's Ruby and Rails Infrastructure Team (AMA)

There were a LOT of engineers from the Ruby and Rails teams at Shopify at RubyConf 2021. Attendees had the opportunity to sit with them at a meet and greet session to ask questions about projects, working at Shopify, “Why Ruby?”, and more.

Jennie Lundrigan is a Senior Engineering Writer at Shopify. When she's not writing nerd words, she's probably saying hi to your dog.

How to Get an Engineering Internship at Shopify: A Complete Guide

An important component of being an engineer is getting hands-on experience in the real world, which internships can provide. This is especially true for engineering internships, which are critical for helping students develop real-life skills they can’t learn in the classroom. Sadly the internship market has suffered heavily since the pandemic with internship opportunities dropping by 52%, according to Glassdoor.

The silver lining is that many companies are now transitioning to incorporate virtual internships like we did. Whether you are a student, recent graduate, career switcher, bootcamp graduate, or another type of candidate, our virtual engineering internships are designed to kickstart your career and impact how entrepreneurs around the world do business.

What is it like to be a Shopify intern? During our latest intern satisfaction survey, 98% of respondents said they would recommend the program to friends. There are many opportunities to build your career, make an impact, and gain real-world experience at Shopify. But don’t just take our word for it! Keep reading to see what our interns had to say about their experience and learn how you can apply.

How to Get an Internship at Shopify

If you’re looking to jumpstart your personal growth and start an internship that can help lay the foundation for a successful career, we can help. Interning at Shopify allows you to work on real projects, solve hard problems, and gain practical feedback along the way. We provide the tools you need to succeed and trust you to take ownership and make great decisions. Here are the steps to get started.

Step 1: Review Available Opportunities

At Shopify, our engineering internships vary in length from three to eight months, with disciplines such as front-end and back-end development, infrastructure engineering, data engineering, mobile development, and more. Currently, we run three intern application cycles a year. Applicants for the Fall 2022 cohort will be able to apply in May of 2022. Join our Shopify Early Talent Community, and we'll notify you. We also list available internships on our Early Careers page; these include a variety of three-, four-, and eight-month paid programs.

Step 2: Apply Online

Getting a Shopify engineering internship starts with an online application. We'll ask you for your resume, cover letter, contact information, education status, LinkedIn profile, and personal website. You will also be asked to complete an Intern Challenge to demonstrate your interest in the internship topic. This is a great place to show off your love for engineering. Perhaps you built your own site using Ruby on Rails. We’d love to hear about it!

Step 3: Get Ready for the Skills Challenge

Depending on your specialization, you may be asked to submit a personal project, such as a GitHub link, so that the recruiter can test your skills. Challenges differ by category, but you might be asked to design a Shopify store or to use a language or framework like Python or Ruby on Rails to solve a problem. We want to see that you care about the subject, so be specific and put effort into your challenges to make your skills stand out.

Step 4: Prepare for the Interview Process

Shopify's interview process is divided into two phases. Our first stage allows us to get to know you better. Our conversation is called the Life Story, and it's a two-sided conversation that presents both your professional and personal experiences so far. Our second stage is used to assess your technical skills. A challenge will be presented to you, and you will be asked to propose a technical solution.

Top Skills for Engineering Interns

In a series of recent Twitter discussions from August and January, we asked about the most important skills for an engineering intern. More than 100 hiring managers, engineering professionals, and thought leaders responded. Here’s a summary of the skills they look for, along with how our very own interns have learned and applied them.

A visual representation of interns acting out the top skills: collaboration, lifelong learning, curiosity, GitHub experience, remote work experience, communication, interviewing, and accountability
Top skills for engineering interns


Collaboration

When you are working with a team, as most interns do, you need to be able to work together smoothly and effectively. Collaboration can encompass several characteristics, including communication, group brainstorming, emotional intelligence, and more. According to one follower on Twitter: “tech is a small part of software engineering, the valuable part is working well in teams.”

Our interns collaborate with talented people around the world. Emily Liu, a former intern and upcoming UX designer at Shopify, said the core team was spread out across five countries. The time differences didn’t hinder them from collaborating to achieve a common goal. “Teamwork makes the dream work,” says Emily.

Lifelong Learning

Being a constant learner is one of Shopify's values and is considered a measure of success. As one Twitter follower pointed out, this is especially important in engineering since you should "always be willing to learn, to adapt, and to accept help" and that “even the most senior staff developer can learn something from an intern.”

This is echoed by former intern Andrea Herscovich, who says “if you are looking to intern at a company which values impact over everything, lifelong learning, and entrepreneurship, apply to Shopify!”


Curiosity

Without curiosity, an intern might become stagnant and not stay on top of the latest tools and technologies. Lack of curiosity can hinder an intern's career in engineering, where technological developments are rapid. One hiring manager responded that curiosity is one of the key things he looks for in engineering interns, but that it's hard to find.

Andrea Herscovich also says she was encouraged to "be curious." This curiosity allowed her to build her own path for the internship. A particularly memorable project involved contributing to Polaris, Shopify's open-source design system, says Andrea. When working on adding a feature to a component in Polaris, Andrea learned how to develop for a more general audience.   

GitHub Experience

GitHub is an essential tool for collaborating with other developers in most engineering environments. As one Twitter user says: “I don't care if you got an A+ or C- in compilers; I'm going to look at your GitHub (or other public work) to see if you've been applying what you learned.” At Shopify, GitHub plays an important role in collaboration.

Using GitHub, former Shopify intern Kelly Ma says that her mentor provided a list of challenges instead of clearly-defined work for her to solve. During this time, Kelly had a chance to ask questions and learn more about the work of her team. As a result, she interacted with Shopifolk outside of her team and forged new relationships.

Remote Work Experience

A growing number of engineers are now working remotely. The trend is likely to continue well into the future due to COVID-19. As an intern, you will have the opportunity to gain experience working remotely, which can prepare you for the growing virtual workforce. Perhaps you're wondering if a remote internship can deliver the same experience as an in-person internship?

One former Shopify intern, Alex Montague, was anxious about how a remote internship would work. After completing the program, he told us, "working from home was pretty typical for a normal day at work" and that the tools he used made remote work easy, and he was "just as productive, if not more so, than if I was in the office." Alex is now a front-end developer on our App Developer Experience team, which provides insights and tools to help partners and merchants build and maintain apps.


Communication

Today, communication is one of the most important skills engineers can have—and one that they sometimes lack. As one Twitter follower puts it: "nothing in CS you learn will be more important than how to communicate with humans." Fortunately, as an intern, you get the chance to improve on these skills even before you enter the workforce.

Meeting over Google Hangouts, pair programming on Tuple, brainstorming together on Figma, communicating through Slack, and discussing on GitHub are just a few ways that Shopify interns communicate, says Alex Montague. Interns can take advantage of these opportunities to develop core communication skills such as visual communication, written communication, and nonverbal communication.


Interviewing

“Practice interviewing. This is a skill,” one Twitter follower advises. Interviewing well is key to a successful internship search, and it can set you apart from other candidates. At Shopify, the interview process is divided into two different phases. We begin with a Life Story to learn more about you, what motivates you, and how we can help you grow. Our later rounds delve into your technical skills.

As part of his preparation for his Life Story internship interview, Elio Hasrouni noted all the crucial events in his life that have shaped who he is today, starting from his childhood. Among other things, he mentioned his first job, his first coding experience, and what led him into Software Engineering. Elio is now a full-time developer within our Retail and Applications division, which helps power our omnichannel commerce.


Accountability involves taking responsibility for actions, decisions, and failures. For an engineering intern, accountability might mean accepting responsibility for your mistakes (and you’ll make plenty of them) and figuring out how to improve. Acknowledging your mistakes helps you demonstrate self-awareness that enables you to identify the problem, address it, and avoid repeating it.

How do you keep yourself accountable? Kelly Ma credits stretch goals, which are targets designed to be difficult to achieve, as a way to remain accountable. She also stays accountable by exploring new technological frontiers and taking on new challenges. One way that Shopify challenged her to be accountable was by asking her to own a project goal for an entire cycle (six weeks). This process included bringing stakeholders together via ad hoc meetings, updating GitHub issues to convey the state of the goal, and learning how to find the right context.

Tips to Help You Succeed

As you might expect, great internship opportunities like this are highly competitive. Applicants must stand out in order to increase their chances of being selected. In addition to the core skills discussed above, there are a few other things that can make you stand out from the crowd.

Practice Our Sample Intern Challenges

It is likely that we will ask you to participate in an Intern Challenge to showcase your skills and help us better understand your knowledge. To help you prepare, you can practice some of our current and previous intern challenges below.

Showcase Your Past Projects

You don’t need prior experience to apply as a Shopify intern, but if you compile all your previous projects relevant to the position you're applying for, your profile will certainly stand out. Your portfolio is the perfect place to show us what you can do. 

Research the Company

It's a good idea to familiarize yourself with Shopify before applying. Take the time to learn about our product, values, mission, and vision, and find a connection with them. For the internship to be a success, our goals and values should align with your own.

Additional Resources

Want to learn more about Shopify's Engineering intern program? Check out these posts:

Want to learn more about the projects our interns work on? Check out these posts:

About the Author:

Nathan Quarrie is a Digital Marketing Lead at Shopify based in Toronto, Ontario. Before joining Shopify, he worked in the ed-tech industry where he developed content on topics such as Software Engineering, UX & UI, Cloud Computing, Data Analysis, and Web Development. His content and articles have been published by more than 30 universities including Columbia, Berkeley, Northwestern, and University of Toronto.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Changing a polymorphic_type in Rails

In this post I'm going to share how my teammates and I redefined the way we store one of the polymorphic associations in the Shopify codebase. I am part of the newly formed Payment Flexibility team. We work on features that empower merchants to better manage their payments and receivables on Shopify.

Code at Shopify is organized in components. As a new team, we decided to take ownership over some existing code and to move it under the component we’re responsible for (payment flexibility). This resulted in moving classes (including models) from one module to another, meaning their namespace had to change. While thinking about how we were going to move certain classes under different modules, we realized we may benefit from changing the way Rails persists a polymorphic association to a database. Our team had not yet entirely agreed on the naming of the modules and classes. We wanted to facilitate name changes during the future build phase of the project.

We decided to stop storing class names as a polymorphic type for certain records. By default, Rails stores class names as polymorphic types. We decided to instead use an arbitrary string. This article is a step by step representation of how we solved this challenge. I say representation because the classes and data used for this article are not taken from the Shopify codebase. They’re a practical example of the initial situation and the solution we applied.

I’m going to start with a short and simple reminder of what polymorphism is, then move on to a description of the problem, and finish with a detailed explanation of the solution we chose.

What is Polymorphism?

Polymorphism means that something has many forms (from the Greek “polys” for many and “morphē” for form).

A polymorphic relationship in Rails refers to a type of Active Record association. This concept is used to attach a model to another model that can be of a different type by only having to define one association.

For the purpose of this post, I’ll take the example of a Vehicle that has_one :key and the Key belongs_to :vehicle.

A Vehicle can be a Car or a Boat.

You can see here that Vehicle has many forms. The relationship between Key and Vehicle is polymorphic.

The foreign key stored on the child object (the Key record in our example) points to a single object (Vehicle) that can have different forms (Car or Boat). The form of the parent object is stored on the child object under the polymorphic_type column. The value of the polymorphic_type is equal to the class name of the parent object, "Car" or "Boat" in our example.

The code block below shows how a polymorphic association is stored in Rails.
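As a rough illustration, here is a minimal plain-Ruby sketch of what ends up on the child record. It does not use Active Record: the Key struct, the Vehicle base class, and the build_key_for helper are hypothetical stand-ins, and the leading comment shows the association declarations they imitate.

```ruby
# Plain-Ruby stand-in (no Active Record) for these declarations:
#   class Key < ApplicationRecord
#     belongs_to :vehicle, polymorphic: true
#   end
#   class Car < ApplicationRecord
#     has_one :key, as: :vehicle
#   end

# Hypothetical row of the keys table: foreign key plus polymorphic type.
Key = Struct.new(:vehicle_id, :vehicle_type)

class Vehicle
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

class Car < Vehicle; end
class Boat < Vehicle; end

# Hypothetical helper: by default the parent's class name is stored as the type.
def build_key_for(vehicle)
  Key.new(vehicle.id, vehicle.class.name)
end

car_key = build_key_for(Car.new(1))
boat_key = build_key_for(Boat.new(1))

puts car_key.vehicle_type   # Car
puts boat_key.vehicle_type  # Boat
```

Both keys share vehicle_id 1, so the vehicle_type column is the only thing that tells a car's key from a boat's key.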

The Issue

As I said initially, our vehicle classes had to move under another module, and a change in module results in a different namespace. For this example, I’ll pretend I want to change how our code is organized and put Car and Boat under the Garage module.

I go ahead and move the Car and Boat models under the new module Garage:

I’m now running into the following:

The vehicle_type column now contains "Garage::Car", which means we’ll have vehicle_type: "Car" and vehicle_type: "Garage::Car" both stored in our database.

Having these two different vehicle_type values means the Key records with vehicle_type: "Car" won’t be returned when calling a_vehicle.key. The Active Record association has to be aware of all the possible values for vehicle_type in order to find the associated record:

Both these vehicle_type values should point towards the updated model Garage::Car for our polymorphic ActiveRecord association to continue to work. The association is broken in both directions. Calling #vehicle on a Key record that has vehicle_type: "Car" won’t return the associated record:
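The failure in both directions can be imitated without Active Record. In this sketch the KEYS array plays the role of the keys table, and key_for and vehicle_class_for are hypothetical helpers standing in for the two sides of the association. Returning nil on a missing constant is a simplification chosen for the sketch.

```ruby
# Plain-Ruby sketch (no Active Record) of why the association breaks.
Key = Struct.new(:id, :vehicle_id, :vehicle_type)

module Garage
  class Car
    attr_reader :id

    def initialize(id)
      @id = id
    end
  end
end

# Keys table after the move: one legacy row and one new row.
KEYS = [
  Key.new(1, 10, "Car"),         # written before the move
  Key.new(2, 11, "Garage::Car"), # written after the move
].freeze

# Parent -> child: the lookup filters on a single type, the current class name.
def key_for(vehicle)
  KEYS.find { |k| k.vehicle_id == vehicle.id && k.vehicle_type == vehicle.class.name }
end

# Child -> parent: the stored type must resolve to a constant that still exists.
def vehicle_class_for(key)
  Object.const_get(key.vehicle_type)
rescue NameError
  nil
end

p key_for(Garage::Car.new(10))   # nil, the legacy "Car" row is not matched
p key_for(Garage::Car.new(11))   # the "Garage::Car" row
p vehicle_class_for(KEYS[0])     # nil, there is no top-level Car anymore
p vehicle_class_for(KEYS[1])     # Garage::Car
```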

The Idea

Once we realized changing a namespace was going to introduce complexity and a set of tasks (see next paragraph), one of my teammates said to me, “Let's stop storing class names in the database altogether. By going from a class name to an arbitrary string we could decrease the coupling between our codebase and our database. This means we could more easily change class names and namespaces if we need to in the future.” For our example, instead of storing "Garage::Car" or "Garage::Boat" why don't we just store "car" or "boat"?

To go forward with a module and class name change without modifying the way Active Record stores a polymorphic association, we would have had to add the ability to read from several polymorphic types when setting the Active Record association. We also would have had to update existing records so they point to the new namespace. If we go back to our example, records with vehicle_type: "Car" should point towards the new Garage::Car model until we could perform a backfill of the column with the updated model class name.

In Practice: Going From Storing a Class Name to an Arbitrary String

Rails has a way to override the writing of a polymorphic_type value. It’s done by redefining the polymorphic_name method. The code below is taken from the Rails gem source code:

Let's redefine the source code above for our Garage::Car example:

When creating a Key record we now have the following:
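A minimal sketch of the override and its effect, written in plain Ruby rather than against Active Record. The CLASS_MAPPING contents, the Garage::Vehicle base class, and the build_key_for helper are assumptions made for illustration. On a real model, the same class-method override would take the place of the default polymorphic_name.

```ruby
# Plain-Ruby sketch (no Active Record) of storing an arbitrary string
# instead of the class name.
Key = Struct.new(:vehicle_id, :vehicle_type)

module Garage
  # Assumed mapping from class name to the arbitrary string we want stored.
  CLASS_MAPPING = {
    "Garage::Car" => "car",
    "Garage::Boat" => "boat",
  }.freeze

  class Vehicle
    attr_reader :id

    def initialize(id)
      @id = id
    end

    # Return the arbitrary string instead of the class name.
    def self.polymorphic_name
      CLASS_MAPPING.fetch(name)
    end
  end

  class Car < Vehicle; end
  class Boat < Vehicle; end
end

# Hypothetical helper: the stored type is whatever polymorphic_name returns.
def build_key_for(vehicle)
  Key.new(vehicle.id, vehicle.class.polymorphic_name)
end

puts build_key_for(Garage::Car.new(1)).vehicle_type   # car
puts build_key_for(Garage::Boat.new(2)).vehicle_type  # boat
```

Using fetch rather than [] makes an unmapped class fail loudly with a KeyError instead of silently writing nil as the type.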

Now we have both "Car" the class name and "car" the arbitrary string stored as vehicle_type. Having two possible values for vehicle_type brings another problem. In a polymorphic association, the target (associated record) is looked up using the single value returned in .polymorphic_name, and this is where the limitation lies. The association is only able to look for one vehicle_type value. vehicle_type is stored as the value returned by polymorphic_name when the record was created.

An example of this limitation:

Look closely at the SQL expression, and you’ll see that we’re only looking for keys with a vehicle_type = "car" (the arbitrary string). The association won’t find the Key for vehicles created before we started our code change (keys where vehicle_type = "Car"). We have to redefine our association scope so it can look for keys with vehicle_type of "Car" or "car":

Our association now becomes the following SQL expression:

The association is now looking up keys with either "car" or "Car" as vehicle_type.
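The widened lookup can be sketched in plain Ruby as follows. CAR_TYPES, KEYS, and key_for_car are hypothetical names, and the has_one line in the comment is only an assumed shape for such a scope, not the original listing.

```ruby
# Plain-Ruby sketch (no Active Record) of a lookup that accepts both the
# legacy class name and the new arbitrary string. On a real model this could
# be an association scope along the lines of
#   has_one :key, -> { unscope(where: :vehicle_type).where(vehicle_type: %w[car Car]) }, as: :vehicle
# which is an assumed shape for illustration only.
Key = Struct.new(:id, :vehicle_id, :vehicle_type)

# Assumed list of every value that may still identify a car in the database.
CAR_TYPES = ["car", "Car"].freeze

KEYS = [
  Key.new(1, 10, "Car"),  # legacy row, class name
  Key.new(2, 11, "car"),  # new row, arbitrary string
  Key.new(3, 10, "boat"), # same id, different vehicle type
].freeze

# Equivalent of: WHERE vehicle_id = ? AND vehicle_type IN ('car', 'Car')
def key_for_car(car_id)
  KEYS.find { |k| k.vehicle_id == car_id && CAR_TYPES.include?(k.vehicle_type) }
end

p key_for_car(10).id  # 1
p key_for_car(11).id  # 2
```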

Now that we can read from both the class name and new arbitrary string as a vehicle_type for our association, we can go ahead and clean up our database to only have arbitrary strings stored as vehicle_type. At Shopify, we use MaintenanceTasks. You could run a migration or a script like the one below to update your records.
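A minimal sketch of the backfill logic, assuming an in-memory list of records instead of a database table and ignoring MaintenanceTasks entirely. LEGACY_TYPE_MAPPING and backfill_vehicle_types are hypothetical names. Against a real table, the same idea would likely be one batched update per legacy value.

```ruby
# Plain-Ruby sketch of the backfill (no Active Record, no MaintenanceTasks).
Key = Struct.new(:id, :vehicle_id, :vehicle_type)

# Assumed mapping from legacy class names to the new arbitrary strings.
LEGACY_TYPE_MAPPING = {
  "Car" => "car",
  "Boat" => "boat",
}.freeze

# Rewrites legacy types in place and returns how many records changed.
def backfill_vehicle_types(keys)
  updated = 0
  keys.each do |key|
    new_type = LEGACY_TYPE_MAPPING[key.vehicle_type]
    next if new_type.nil? # already migrated or unknown, leave untouched

    key.vehicle_type = new_type
    updated += 1
  end
  updated
end

keys = [
  Key.new(1, 10, "Car"),
  Key.new(2, 11, "car"),
  Key.new(3, 12, "Boat"),
]

puts backfill_vehicle_types(keys)  # 2
p keys.map(&:vehicle_type)         # ["car", "car", "boat"]
puts backfill_vehicle_types(keys)  # 0, running it again changes nothing
```

Because already-migrated rows are skipped, the script can be interrupted and rerun safely.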

Once the cleanup is complete, we only have arbitrary strings stored as vehicle_type. We can go ahead and remove the .unscope on the Garage::Car and Garage::Boat associations.

But Wait, All This for What?

The main benefit from this patch is that we reduced the coupling between our codebase and our database.

Not storing class names as polymorphic types means you can move your classes, rename your modules and classes, without having to touch your existing database records. All you have to do is update the class names used as keys and values in the three CLASS_MAPPING hashes. The value stored in the database will remain the same unless you change the arbitrary strings these classes and class names resolve to.

Our solution adds complexity. It’s probably not worth it for most use cases. For us it was a good trade off since we knew the naming of our modules and classes could change in the near future.

The solution I explained isn’t the one we initially adopted. We initially went an even more complex route. This post is the solution we wish we had found when we started looking into the idea of changing how a polymorphic association is stored. After a bit of research and experimentation, I came to this simplified version and thought it was worth sharing.

Diego is a software engineer on the Payment Flexibility Team. He lives in the Canadian Rockies.

We want your feedback! Take our reader survey and tell us what you're interested in reading about this year.

We Want Your Feedback for the Shopify Engineering Blog

Update (March 11, 2022): The reader survey is now closed. Thanks to all who provided feedback. Keep in touch with us on Twitter at @ShopifyEng.


Hello Shopify Engineering readers,

We’re conducting a survey so we can get a better sense of the stories you’re interested in reading. We want to learn more about you, your likes and dislikes, so we can create the best content possible, from deeply technical guides to pieces on developer culture.

The survey will take five minutes to complete. Your responses will be used for the purpose of improving our content and tailoring newsletters to better reflect your interests.*

Thank you for your feedback—and for reading!


Anita Clarke
Senior Managing Editor

*Your responses will be analyzed in aggregate and used for research purposes; some aggregated data may be shared externally. Your data will be treated in accordance with Shopify's privacy policy, which can be found here.

Shopify's Playbook for Scaling Machine Learning

Five years ago, my team and I launched the first machine learning product at Shopify. We were determined to build an algorithm-powered product that solved a merchant problem and made a real impact on the business. Figuring out where to start was the hardest part. There was (and still is!) a lot of noise out there on best practices. 

Fast forward to today, and machine learning is threaded into many aspects of Shopify. How did we get to this point? Through our experience building our first few models, we carved out a pragmatic step-by-step guide that has enabled us to successfully scale machine learning across our organization.

Our playbook is tech-independent and can be applied in any domain, no matter what point you’re at in your machine learning journey—which is why we’re sharing it with you today.

Starting From Zero

The first few problems your team chooses to solve with machine learning have a disproportionate impact on the success and growth of your machine learning portfolio. But knowing which problem to start with can be difficult.

1. Identify A Problem Worth Solving

You want to pick a problem that your users care about. This will ensure your users use your solution day in and day out and enable you to gather feedback quickly. Having a deep understanding of your problem domain will help you achieve this. Nothing surpasses the ability to grasp your business goals and user needs. This context will guide your team on what your product is trying to achieve and what data-driven solutions will have a real impact on these priorities.

One way Shopify achieves this is by embedding our data scientists into our various product and commercial lines. This ensures that they have their finger on the pulse and are partners in decision making. With this domain context, we were able to identify a worthy first problem—order fraud detection. Sales are the main goal of our merchants, so we knew this problem impacted every merchant. We also knew the existing solution was a pain point for them (more on this later).

Screenshot of the Fraud Analysis screen in the Shopify admin showing the details for an order that's ranked Low
Order fraud detection in the Shopify admin.

2. Ensure You Have Enough Data

Good, accessible data is half the battle for building a successful model. Many organizations collect data, but you need to have a high degree of confidence in it, otherwise, you have to start collecting it anew. And it needs to be accessible. Is your data loaded into a medium that your data scientists can use? Or do you have to call someone in operations to move data from an S3 bucket? 

In our case, we had access to 10 years of transaction data that could help us understand the best inputs and outputs for detecting fraudulent orders. We have an in-house data platform and our data scientists have easy access to data through tools like Trino (formerly Presto). But the technology doesn’t matter. All that matters is that, whatever problem you choose, you have trustworthy and accessible data to help you understand the problem you’re trying to solve.

3. Identify Your Model’s Downstream Dependencies

Keep in mind that any problem you pick won’t be an abstract, isolated problem—there are going to be things downstream that are impacted by it. Understanding your user’s workflow is important as it should influence the conditions of your target.

For example, in order fraud, we know that fulfillment is a downstream dependency. A merchant won’t want to fulfill an order if it carries a high risk of fraud. With this dependency in mind, we realized that we needed to detect fraud before an order is fulfilled: a prediction that arrives after fulfillment would be useless.

4. Understand Any Existing Solutions

If the problem you’re trying to solve has an existing solution, dig into the code and data, talk to the domain experts and fully understand how that solution is performing. If you’re going to add machine learning, you need to identify what you’re trying to improve. By understanding the existing solution, you’ll be able to identify benchmarks for your new solution, or decide if adding machine learning is even needed.

When we dug into the existing rule-based solution for detecting order fraud, we uncovered that it had a high false positive rate. For example, if the billing and shipping address on an order differed, our system would flag that order. Every time an order was flagged our merchants had to investigate and approve it, which ate up precious time they could be spending focused on growing their business. We also noticed that the high false positive rate was causing our merchants to cancel good orders. Lowering the false positive rate became a tangible benchmark for our new solution.

5. Optimize For Product Outcomes

Remember, this is not an exercise in data science—your product is going to be used by real people. While it’s tempting to optimize for scores such as accuracy, precision and recall, if those scores don’t improve your user experience, is your model actually solving your problem?

A venn diagram showing two circles with Product Outcome and User Trust, the overlap is real world success
In order for a model to have real world success, you need to optimize for product outcome and user trust, not just model scores.

For us, helping merchants be successful (i.e. make valid sales) was our guiding principle, which influenced how we optimize our models and where we put our thresholds. If we optimized our model to ensure zero fraud, then our algorithm would simply flag every order. While our merchants would sell nothing, we would achieve our result of zero fraud. Obviously this isn’t an ideal experience for our merchants. So, for our model, we optimized for helping merchants get the highest number of valid sales.
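To make the trade-off concrete, here is a small sketch of choosing a flagging threshold by product outcome rather than by fraud count alone. Everything in it is hypothetical: the scored orders, the candidate thresholds, and FRAUD_COST, an assumed weight that expresses one accepted fraudulent order in units of valid sales. It is not the objective Shopify used.

```ruby
# Hypothetical scored orders: model score in [0, 1] and the true label.
Order = Struct.new(:score, :fraud)

ORDERS = [
  Order.new(0.05, false), Order.new(0.10, false), Order.new(0.20, false),
  Order.new(0.35, false), Order.new(0.40, true),  Order.new(0.55, false),
  Order.new(0.70, true),  Order.new(0.90, true),
].freeze

# Assumed cost of one accepted fraudulent order, in units of one valid sale.
FRAUD_COST = 1.5

# Orders scoring at or above the threshold are flagged, the rest are accepted.
def outcome(orders, threshold)
  accepted = orders.select { |o| o.score < threshold }
  valid = accepted.count { |o| !o.fraud }
  fraud = accepted.count(&:fraud)
  { threshold: threshold, valid: valid, fraud: fraud, net: valid - FRAUD_COST * fraud }
end

# Picks the candidate threshold with the best net outcome.
def best_threshold(orders, candidates)
  candidates.map { |t| outcome(orders, t) }.max_by { |o| o[:net] }
end

zero_fraud = outcome(ORDERS, 0.0)
best = best_threshold(ORDERS, [0.0, 0.3, 0.5, 0.6, 1.01])

p zero_fraud  # flags every order: no fraud, but no sales either
p best        # accepts more valid sales at the cost of some fraud
```

With this toy data, the threshold that guarantees zero fraud also produces zero sales, while the best-scoring threshold lets five valid orders through and accepts one fraudulent one.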

While you might not pick the same problem, or have the same technology, by focusing on these steps you’ll be able to identify where to add machine learning in a way that drives impact. For more tips from Shopify on building a machine learning model, check out this blog.

Zero to One

So you’ve built a model, but now you’re wondering how to bring it to production? The answer starts with recognizing that everything data science builds rests on the foundation of a strong platform. To bring your models to production in a way that can scale, you need to begin investing in good data engineering practices.

1. Create Well-Defined Pipelines

In order to confidently bring your model to production, you need to build well-defined pipelines for all stages of predictive modeling. For your training pipeline, you don’t want to waste time trying to keep track of your data and asking, “Did I replace the nulls with zeros? Did my colleagues do the same?” If you don’t trust your training, you’ll never get to the point where you feel comfortable putting your model into production. In our case, we created a clean pipeline by clearly labeling our input data, transformations and the features that go into our model.

You’ll want to do the same with your verification and testing pipeline. Building a pipeline that captures rich metadata around which model or dataset was used in your training will enable you to reproduce metrics and trace bugs. With these good data engineering practices in place, you’ll remove burdensome work and be able to establish model trust with your stakeholders.

The model lifecycle: Model Building, Model Evaluation, Productionize Model, Testing, Deployment, Monitoring & Observability
Model lifecycle

2. Decide How to Deploy Your Model

There are a lot of opinions on this, but the answer really depends on the problem and product context. Regardless of which decision you make, there are two key things to consider:

  • What volume will your model experience? Is your model going to run for every user? Or only a select group? Planning for volume means you’ll make better choices. In our case, we knew that our deployment choice had to be able to deal with varying order volumes, from normal traffic days to peak sales moments like Black Friday. That consideration influenced us to deploy the model on Shopify’s core tech stack, Ruby on Rails, because those services are used to high volume and have resources dedicated to keeping them up and running.
  • What is the commitment between the user and the product? Understand what the user expects or what the product needs because these will have implications on what you can build. For example, our checkout is the heartbeat of our platform and our merchants expect it to be fast. In order to detect fraud as soon as an order is made, our system would have to do a real-time evaluation. If we built an amazing model, but it slowed down our checkout, we would solve one problem, but cause another. You want to limit any unnecessary product or user strain.

By focusing on these steps, we were able to quickly move our order fraud detection model into production and demonstrate whether it actually worked, and it did! Our model beat the baseline, which is all we could have asked for. What we shipped was a very simple logistic regression model, but that simplicity allowed us to ship quickly and show impact. Today, the product runs on millions of orders a day and scales with the volume of Shopify.

Our first model became the stepping stone that enabled us to implement more models. Once your team has one successful solution in production, you now have an example that will evangelize machine learning within your organization. Now it’s time to scale.

One to One Hundred

Now that you have your first model in production, how do you go from one model to multiple models, whether in the same product or by bringing machine learning to other existing products? You have to think about how you can speed up and scale your model building workflows.

1. Build Trust In Your Models

While deploying your first model you focused on beginning to build good engineering practices. As you look to bring models to new products, you need to solidify those practices and build trust in your models. After we shipped our order fraud detection model, we implemented the following key processes into our model lifecycle to ensure our models are trustworthy, and would remain trustworthy:

  • Input and output reconciliation: Ensure the data sets that you use during training match the definition and the measurements of what you see at the time of inference. You’ll also want to reconcile the outcomes of the model to make sure that for the same data you’re predicting the same thing. It seems basic, but we’ve found a lot of bugs this way.
  • Production backtesting: Run your model in shadow for a cohort of users, as if it’s going to power a real user experience. Running backtests for our order fraud detection model allowed us to observe our model predictions, and helped us learn the intricacies of how what we’d built functioned with real world data. It also gave us a deployment mechanism for comparing models.
  • Monitoring: Conditions that once made a feature true may change over time. As your models become more complex, keeping on top of these changes becomes difficult. For example, early on in Shopify’s history, mobile transactions were highly correlated with fraud. However, we passed a tipping point in ecommerce where mobile orders became the primary way of shopping, making our correlation no longer true. You have to make sure that as the world changes, as features change, or distributions change, there are either systems or humans in place to monitor these changes.
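The monitoring item above could start as something as small as the check sketched here. The daily mobile-order shares are made-up numbers, and mean, drifted?, and the 0.1 tolerance are hypothetical choices for illustration, not a description of Shopify's monitoring.

```ruby
# Hypothetical feature snapshots: share of mobile orders per day.
TRAINING_MOBILE_SHARE = [0.18, 0.22, 0.20, 0.19, 0.21].freeze
RECENT_MOBILE_SHARE   = [0.55, 0.60, 0.58, 0.62, 0.65].freeze

def mean(values)
  values.sum / values.size.to_f
end

# Flags a feature when its recent mean moved more than `tolerance`
# (absolute difference) away from the mean seen at training time.
def drifted?(training, recent, tolerance: 0.1)
  (mean(recent) - mean(training)).abs > tolerance
end

puts drifted?(TRAINING_MOBILE_SHARE, RECENT_MOBILE_SHARE)    # true
puts drifted?(TRAINING_MOBILE_SHARE, TRAINING_MOBILE_SHARE)  # false
```

A check like this only says that an input has shifted. Deciding whether the model needs retraining is still a job for a person or a more careful evaluation.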

2. Encode Best Practices In Your Platform

Now that you’ve solidified some best practices, how do you scale that as your team grows? When your team is small and you only have 10 data scientists, it’s relatively straightforward to communicate standards. You may have a Slack channel or a Google Doc. But as both your machine learning portfolio and team grow, you need something more unifying. Something that scales with you. 

A good platform isn’t just a technology enabler—it’s also a tool you can use to encode culture and best practices. That’s what we did at Shopify. For example, as we outlined above, backtesting is an important part of our training pipeline. We’ve encoded that into our platform by ensuring that if a model isn’t backtested before it goes into production, our platform will fail that model.
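As a sketch of what encoding a practice in the platform can mean, the gate below refuses to promote a model that has no recorded backtest. ModelCandidate, BacktestMissingError, and promote! are hypothetical names invented for this example, not part of Shopify's platform.

```ruby
# Hypothetical model metadata captured by the training pipeline.
ModelCandidate = Struct.new(:name, :backtested, keyword_init: true)

class BacktestMissingError < StandardError; end

# Hypothetical promotion gate: refuse any model without a recorded backtest.
def promote!(model)
  raise BacktestMissingError, "#{model.name} has no backtest" unless model.backtested

  "#{model.name} promoted"
end

puts promote!(ModelCandidate.new(name: "fraud-v2", backtested: true))

begin
  promote!(ModelCandidate.new(name: "fraud-v3", backtested: false))
rescue BacktestMissingError => e
  puts "blocked: #{e.message}"
end
```

Putting the rule in code means nobody has to remember it, and skipping it takes a deliberate change rather than an oversight.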

While encoding best practices will help you scale, it’s important that you don’t abstract too early. We took the best practices we developed while deploying our order fraud detection model, and a few other models implemented in other products, and refined them through trial and error. Only after taking a few years to see what worked did we encode these practices into our platform.

3. Automate Things!

If on top of building the foundations, our team had to monitor, version, and deploy our models every single day, we’d still be tinkering with our very first model. Ask yourself, “How can I scale far beyond the hours I invest?” and begin thinking in terms of model operations—scheduling runs, automatic checks, model versioning, and, one day, automatic model deployment. In our case, we took the time to build all of this into our infrastructure. It all runs on a schedule every day, every week, for every merchant. Of course, we still have humans in the loop to dig into any anomalies that are flagged. By automating the more operational aspects of machine learning, you’ll free up your data scientists’ time, empowering them to focus on building more awesome models.

Shopify's Order Fraud Pipeline that goes from Python to PMML via Apache Airflow and then to Rails
Shopify’s automated order fraud detection pipeline. Models are built in Python, then PMML (predictive modeling markup language) serializes the models to become language independent, enabling us to deploy in our production system which runs on Ruby. Everything runs on a scheduler with Apache Airflow.

These last three steps have enabled us to deploy and retrain models fast. While we started with one model that sought to detect order fraud, we were able to apply our learnings to build various other models for products like Shopify Capital, product categorization, the Shopify Help Center search, and hundreds more products. If you’re looking to go from one to one hundred, follow these steps, wash, rinse and repeat and you’ll have no problem scaling.

This Is a Full-Stack Problem

Now you have a playbook to scale machine learning in your organization. And that’s where you want to end up—in a place where you’re delivering more value for the business. But even with all of these steps, you won’t truly be able to scale unless your data scientists and data engineers work together. Building data products is a full-stack problem. Regardless of your organization structure, data scientists and data engineers are instrumental to the success of your machine learning portfolio. As my last piece of wisdom, ensure your data scientists and data engineers work in alignment, off of the same road map, and towards the same goal. Here’s to making products smarter for our users!

Solmaz is the Head of Commerce Intelligence and VP of Data Science and Engineering at Shopify. In her role, Solmaz oversees Data, Engineering, and Product teams responsible for leveraging data and artificial intelligence to reduce the complexities of commerce for millions of businesses worldwide.

If you’re passionate about solving complex problems at scale, and you’re eager to learn more, we're hiring! Reach out to us or apply on our careers page.

Hydrogen & Tailwind: The Perfect Match for Building Beautiful Storefronts

Let’s get this out of the way: I really, really like Tailwind. It's my preferred way to style websites, and it enables developers to build beautiful storefronts quickly with Hydrogen, our React-based framework for building custom storefronts. If you’re not familiar with Hydrogen and want to give it a quick spin, visit

To add Tailwind to a new Hydrogen app, you don’t have to do anything. It’s the default option. It’s literally there the moment you run npx create-hydrogen-app@latest. We bundled Tailwind with the Hydrogen starter template because we think it’s a really powerful and customizable set of tools to get building quickly.

So what’s the best way to use Tailwind in your project? Let’s start with componentization. I consider it one of the most effective ways to work with Tailwind.

Componentization with Tailwind

The first thing you’ll notice about Tailwind is that you use a bunch of CSS classes (often called “utility classes”) to build your website. That’s it—you don’t need to write CSS inside a dedicated CSS file if you don’t want to.

To decipher the code you see above:

  • text-center is the equivalent of setting “text-align: center;”
  • mb-16 indicates that there should be a good amount of margin at the bottom of the div
  • font-extrabold is to assign a font-weight that’s heavier than bold, but not as heavy as black
  • text-5xl is a way to say make this text pretty large
  • md:text-7xl indicates that, at the medium breakpoint, the text should be even larger. (Yes, you read that correctly: you can define responsive styles using class names instead of needing to write `@media` rules in a stylesheet! You can’t do that with regular inline styles.)

The abundance of CSS classes catches people off guard the first time they see a Tailwind website. I was one of these people, too.

One important thing to consider is that most websites are built with components these days. If you’re building a new website, it’s probably componentized on the server (think WordPress files or Rails partials) or componentized on the client (think React or Vue).

Hydrogen is built with React. This means you can use Tailwind classes within each component, and then reuse those components throughout your Hydrogen storefront without having to copy and paste a bunch of CSS classes.

The above example is from Hydrogen’s starter template. It represents a navigation that should be hidden at small breakpoints but displayed at larger breakpoints (hidden lg:block). It outputs an unordered list which displays its items in a centered way using flexbox (flex items-center justify-center). When the navigation links are hovered, their opacity changes to 80% (hover:opacity-80).

Here’s what the navigation looks like at a larger breakpoint:

A screenshot of the Hydrogen Starter Template homepage. The navigation is centered at the top of the screen and separated from the content by a gradient blue bar.
Hydrogen starter template homepage

You can check out the /src/components folder to see a bunch of examples of using Tailwind classes in different components in the Hydrogen starter template.

You might be asking yourself, “What’s the difference between building React components with Tailwind and building React components with something like Bootstrap or my own custom CSS framework?”

At the end of the day, you’re still building a component-based system, just like you would in Bootstrap or a custom framework. The difference is that the classes you apply to your components in a Bootstrap world have names that are tightly coupled to the function of each component.

This makes for a more brittle system. Imagine that I have a custom framework in which I’ve designed a product card that contains a product title, image, and description:

Screenshot of a Product Card of a brown nike shoe. The title is above the photo and a description is below it.
Product card

Now, let’s pretend that I really like this design. I have some blog posts on my landing page, and I want to use this same card layout for those too. I also want to show an author avatar between my title and my image on those blog posts.

Unfortunately, my class names are tightly-coupled to the product component. My options are:

  • Just re-use my product component and grimace every time I see it being used for the wrong thing
  • Rename my product class names to be more generic, like “card”
  • Duplicate all the class definitions to a new set of classes prefixed with blog-card

I’m not faced with this same dilemma when I’m using Tailwind, since I’m using utility classes that aren’t bound to the semantic meaning of their original use: product-*. I’m free to copy and paste my Tailwind and HTML markup to a new component called <BlogCard> without having to update CSS classes or jump to a stylesheet. I can also easily extract a subset of inner markup to a dedicated component that is shared between <BlogCard> and <ProductCard> without having to deal with renaming BEM-style product-card__title classes.
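
To make that concrete, here is a minimal sketch of sharing inner markup between the two cards. It uses plain template strings instead of React components so that it runs without a build step; `cardShell`, `productCard`, `blogCard`, the field names, and the specific utility classes are all hypothetical choices for illustration.

```javascript
// Shared inner markup, styled only with utility classes.
// Illustration only: values are interpolated without HTML escaping.
function cardShell({ title, imageUrl, body, extra = "" }) {
  return [
    '<article class="rounded-lg p-4 shadow">',
    `  <h3 class="text-lg font-semibold">${title}</h3>`,
    extra, // optional slot between the title and the image
    `  <img class="my-2 w-full rounded" src="${imageUrl}" alt="">`,
    `  <p class="text-sm text-gray-600">${body}</p>`,
    "</article>",
  ]
    .filter(Boolean)
    .join("\n");
}

function productCard(product) {
  return cardShell({
    title: product.title,
    imageUrl: product.image,
    body: product.description,
  });
}

function blogCard(post) {
  // The author avatar sits between the title and the image.
  const avatar = `  <img class="h-8 w-8 rounded-full" src="${post.authorAvatar}" alt="${post.author}">`;
  return cardShell({
    title: post.title,
    imageUrl: post.image,
    body: post.excerpt,
    extra: avatar,
  });
}
```

Nothing in the shared markup mentions products or blog posts, so neither card needs a class renamed when the other changes.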

What About the Learning Curve?

Another question you might have: “Why do I effectively have to learn a new language in order to be productive in Tailwind?”

It’s a fair question. The learning curve for Tailwind can be steep, especially for folks who haven’t touched CSS before. In order to be effective, you still need to have at least some knowledge of how CSS works—when to use margin, when to use padding, and how to leverage flexbox and CSS grid for layouts.

Thankfully, Tailwind’s docs are amazing. They have autocomplete search, logical grouping of CSS topics, and lots of examples. Whenever you’re using Tailwind, you’ll likely have their docs open in another browser tab. Also, Tailwind’s VSCode extension is a must-have. It makes working with Tailwind a brilliant experience in the editor because CSS classes are autocompleted along with their style representations, and you get inline swatch previews for properties like background color.

In my experience, the best way to learn Tailwind is to use it in a real project. This forces you to learn the design patterns and memorize commonly-used Tailwind classes. After working on a project for a couple hours and building up muscle memory, I found myself being way more productive using the framework than I ever was writing custom CSS.

What’s the Deal with All of These Classes?

So you’re off and running with Hydrogen and Tailwind, but maybe one thing is rubbing you the wrong way: why are there so many CSS classes? Isn’t this just like writing inline styles?

Thankfully, no, it’s not like writing inline styles. One huge benefit of Tailwind is enforced consistency and constraints. As a developer who isn’t super great at design, I know that if I’m given a blank canvas with no constraints, it’s likely that I’ll create something that is very meh. Hey, I’m trying to get better! But if I have too many options, or put another way, not enough constraints, my design leads to inconsistent choices. This manifests itself as wonky spacing between elements, subpar typography decisions, and a wild gradient of colors that mimics the result of a toddler getting unsupervised access to their parent’s makeup bag.

Tailwind offers spacing and color stops that enforce a consistent visual look:

As a developer who struggles with analysis paralysis, Tailwind’s constraints are a breath of fresh air. This is how my brain works:

  • Need a little padding? Use p-1.
  • A little more padding? OK, use p-2.
  • Gosh, just a little bit more? Ahh, p-4 should do the trick.

I don’t need to think about pixels, ems, rems, or percentages. And I don’t need to double check that my other hundred components adhere to the same convention since Tailwind enforces it for me. Hydrogen’s developer experience is rooted in this philosophy as well: we don’t want developers to have to think about the nitty-gritty boilerplate, so we provide it for them.
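
For the curious, the arithmetic behind those stops is simple: in Tailwind’s default theme, each numeric spacing step is a quarter of a rem. The sketch below mirrors that rule. `paddingFor` is a hypothetical helper written for this example, not something Tailwind exports, and it neither handles keys like p-px nor checks that a step actually exists in the default scale.

```javascript
// Illustrative only: mirrors Tailwind's default spacing scale,
// where one numeric step equals 0.25rem.
const REM_PER_STEP = 0.25;

// Hypothetical helper (not a Tailwind API): maps a class such as "p-4"
// to the padding value it produces under the default theme.
function paddingFor(className) {
  const match = /^p-(\d+(?:\.\d+)?)$/.exec(className);
  if (!match) {
    throw new Error(`Unsupported class: ${className}`);
  }
  return `${Number(match[1]) * REM_PER_STEP}rem`;
}

console.log(["p-1", "p-2", "p-4"].map(paddingFor));
// [ '0.25rem', '0.5rem', '1rem' ]
```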

This doesn’t mean you’re absolutely constrained to the stops Tailwind has defined! You can override Tailwind’s design system to define your own values. You can also write arbitrary values as Tailwind classes.
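
As a rough sketch of both escape hatches, assuming Tailwind 3: the `theme.extend` key in tailwind.config.js adds stops to the defaults, and square-bracket classes cover one-off values. The `brand` color and the `18` spacing step below are made-up values for illustration.

```javascript
// tailwind.config.js (sketch; "brand" and "18" are made-up values)
module.exports = {
  theme: {
    extend: {
      // Adds to the default stops instead of replacing them.
      spacing: {
        18: '4.5rem', // enables p-18, m-18, gap-18, ...
      },
      colors: {
        brand: '#5b21b6', // enables bg-brand, text-brand, ...
      },
    },
  },
};

// One-off arbitrary values need no config at all:
//   <div class="p-[18px] bg-[#5b21b6]">...</div>
```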


Tailwind is built in a way that it can be composed into a set of components that fit your design system. These design systems are portable.

Since Tailwind leverages utility classes, this means you can copy examples from really smart developers and designers on the Internet and paste them into your website as a starting point. This is really tough to do if you’re not using Tailwind or another utility CSS framework. Without Tailwind, you’d need to:

  • copy one or more CSS files
  • place it in whatever structure you’ve defined for your website’s CSS files
  • paste the HTML into your website
  • update the CSS classes everywhere to conform to your website’s style convention.

You can get a head start by purchasing Tailwind UI, which is a product by Tailwind Labs, the creators of Tailwind. They offer an e-commerce kit with a bunch of really useful components for building custom storefronts. You can also check out other cool Tailwind component collections like Tailwind Starter Kit, HyperUI, and daisyUI.

Because of Tailwind’s composability, copy and paste is actually a feature of Tailwind! It means you can browse something like Tailwind UI, copy something that strikes your fancy, and paste it into your storefront to customize without any other changes or manual CSS file updates.

Working with a Team

Maybe you work as a solo developer, but working with other developers is fun, too. You should try it! When you work on a team, everybody who edits the codebase needs to be familiar with how things are supposed to be done. Otherwise, it’s easy for a codebase to get out of hand with lots of inconsistencies between each developer’s individual choices.

Tailwind is gold for working with teams. Everyone has access to Tailwind’s docs (I’ve mentioned they’re great, by the way). Once team members get accustomed to Tailwind’s classes, they can look at any component and instantly know how the component is styled at each breakpoint. They don’t need to jump between stylesheets and component markup. They don’t need to spend a few minutes figuring out how the Sass partials work together or how the style mixins function. In order to be productive, they just read and write CSS classes! This is great news not only for teams but also for open-source projects.

There are so many unique choices we make as individuals that don’t necessarily contribute to a team project in a good way. One example of this is ordering CSS properties in a typical CSS file. Another example of this is naming things. Oh, this actually brings up a great point…

Not Having to Name Things is By Far the Best Part About Using Tailwind, Period

If there’s one thing you take away from this post, let it be this: I’ve spent so many hours of my life as a developer trying to decide what to name things. When I use Tailwind, I don’t have to use that time naming things. Instead, I go for a walk outside. I spend time with my family. I keep writing the screenplay I’ve been putting off for so long.

It’s a hard thing to understand unless you’ve spent some time using Tailwind, not naming things. Plus, when you’re working with other people, you don’t have to quibble over naming conventions in PRs or accrue technical debt when a component’s scope changes slightly and its class names no longer make sense. Granted, you’ll still have to name some things—like components—in your codebase. However, Tailwind’s utility classes grant you the mental freedom from having to assign semantic class names that represent a chunk of styles.

Hydrogen and Tailwind: A Perfect Match

I think you’ll enjoy using Tailwind inside Hydrogen. I didn’t even find an adequate place to mention the fact that Tailwind allows you to use dark mode out of the box! Or that the Tailwind team built a complementary JavaScript library called HeadlessUI that helps you create accessible interactive experiences with any CSS styles, not just Tailwind.

If you finished reading this post, and you still don’t like Tailwind—that’s fine! I don’t think I’ll convince you with this single blog post. But I’d encourage you to give it a shot within the context of a Hydrogen storefront, because I think Tailwind and Hydrogen make for a good combination. Tailwind’s utility classes lend themselves to encapsulation inside Hydrogen’s commerce components. Developers get the best of both worlds with ready-made starter components along with composable styles. Tailwind lets you focus on what is important: building out a Hydrogen storefront and selling products to your customers.

Josh Larson is a Senior Staff Developer at Shopify working on the Hydrogen team. He works remotely from Des Moines, Iowa. Outside of work, he enjoys spending time with his wife, son, and dogs.

Learn More About Hydrogen

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

React Server Components Best Practices You Can Use with Hydrogen

When my team and I started experimenting with React Server Components (RSC) while building Hydrogen, our React-based framework for building custom storefronts, I was incredibly excited. Not only for the impact this would have on Hydrogen, and the future of ecommerce experience (goodbye large bundle sizes, hello improved buying experiences!), but also for the selfish reason that many of us developers have when encountering new tech: this is going to be fun.

And, indeed, it was… but it was also pretty challenging. RSC is a paradigm shift and, personally, it took some getting used to. I started out building way too many client components and very few server components. My client components were larger than they needed to be and contained logic in them that really had no business existing on the client. Eventually, after months of trial and error and refactoring, it clicked for me. I found it (dare I say it?) easy to build server and client components!

In this post, I’m going to dive into the patterns and best practices for RSC that my team and I learned while building Hydrogen. My goal is to increase your understanding of how to approach writing components in an RSC application and cut down your trial-and-error time. Let’s go!

Default to Shared Components

When you need to build a component from scratch in an RSC application, start out with a shared component. Shared components’ entire functionality can execute in both server and client contexts without any issues. They’re a natural middle ground between client and server components and a great starting point for development.

Starting in the middle helps you ask the right questions that lead you to build the right type of component. You’ll have to ask yourself: “Can this bit of code run only on the client?” and, similarly, “Should this bit of code execute on the client?” The next section identifies some of the questions that you should ask.

In our experience, the worst approach you can take in an RSC application is to default to always building client components. While this will get you up and running quickly, your application ends up with a larger than necessary bundle size, containing too many client components that are better suited as server components.

Pivot to a Client Component in Rare Cases

The majority of the components in your RSC application should be server components, so you’ll need to analyze the use case carefully when determining if a client component is even necessary.

In our experience, there are very specific use cases in which a shared component should be pivoted to a client component. Generally, it’s not necessary to convert the entire component into a client component; only the logic that’s necessary for the client needs to be extracted into a client component. These use cases include:

  • incorporating client side interactivity
  • using useState or useReducer
  • using lifecycle rendering logic (for example, useEffect)
  • making use of a third-party library that doesn’t support RSC
  • using browser APIs that aren’t supported on the server.

An important note on this: don’t just blindly convert your whole shared component into a client component. Rather, intentionally extract just the specific functionality you need into a client component. This helps keep your client component and bundle size as small as possible. I’ll show you some examples at the end of this post.

Pivot to a Server Component as Often as Possible

If the component doesn’t include any of the client component use cases, then it should be pivoted to a server component when one of the following applies:

  • The component includes code that shouldn’t be exposed on the client, like proprietary business logic and secrets.
  • The component won’t be used by a client component.
  • The code never executes on the client (to the best of your knowledge).
  • The code needs to access the filesystem or databases (which aren’t available on the client).
  • The code fetches data from the storefront API (in Hydrogen-specific cases).

If the component is used by a client component, dig into the use cases and implementation. It’s likely you could pass the component through to the client component as a child instead of having the client component import it and use it directly. This eliminates the need to convert your component into a client component, since client components can use server components when they’re passed into them as children.
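
One way to internalize the two checklists is to write them down as a function. The sketch below is only that: `suggestComponentType` and its flag names are hypothetical and not part of React or Hydrogen, and real decisions will involve more judgment than a handful of booleans.

```javascript
// Hypothetical helper, not part of React or Hydrogen: it encodes the
// checklists above so that the order of the questions is explicit.
function suggestComponentType({
  hasInteractivity = false,
  usesStateHooks = false, // useState or useReducer
  usesEffects = false, // for example, useEffect
  usesNonRscLibrary = false,
  usesBrowserOnlyApis = false,
  containsSecrets = false,
  readsFilesOrDatabase = false,
  fetchesStorefrontData = false,
  importedByClientComponent = false,
} = {}) {
  const needsClient =
    hasInteractivity ||
    usesStateHooks ||
    usesEffects ||
    usesNonRscLibrary ||
    usesBrowserOnlyApis;
  if (needsClient) {
    // Extract only the client logic; the rest can stay shared or server.
    return "extract-client-part";
  }

  const prefersServer =
    containsSecrets ||
    readsFilesOrDatabase ||
    fetchesStorefrontData ||
    !importedByClientComponent;
  return prefersServer ? "server" : "shared";
}

console.log(suggestComponentType({ usesStateHooks: true })); // extract-client-part
console.log(suggestComponentType({ fetchesStorefrontData: true })); // server
console.log(suggestComponentType({ importedByClientComponent: true })); // shared
```

A component that comes back as "shared" only because a client component imports it is a good candidate for the children pattern described above.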

Explore Some Examples

These are a lot of things to keep in mind, so let’s try out some examples with the Hydrogen starter template.

Newsletter Sign-up

Our first example is a component that allows buyers to sign up to my online store’s newsletter. It appears in the footer on every page, and it looks like this:

Screenshot of the footer Newsletter signup. It has a text box for email and a Sign Me Up button
Newsletter sign-up component

We’ll start with a shared component called NewsletterSignup.jsx:

In this component, we have two pieces of client interactivity (input field and submit button), which indicate that this component, as currently written, can’t be a shared component.

Instead of fully converting this into a client component, we’re going to extract just the client functionality into a separate NewsletterSignupForm.client.jsx component:

And then update the NewsletterSignup component to use this client component:

It would be tempting to stop here and keep the NewsletterSignup component as a shared component. However, I know for a fact that I want this component to only be used in the footer of my online store, and my footer component is a server component. There’s no need for this to be a shared component and be part of the client bundle, so we can safely change this to a server component by simply renaming it to NewsletterSignup.server.jsx.

And that’s it! You can take a look at the final Newsletter sign-up product on Stackblitz.

Product FAQs

For the next example, let’s add a product FAQ section to product pages. The content here is static and will be the same for each product in my online store. The interaction from the buyer can expand or collapse the content. It looks like this:

Screenshot of a collapsible Product FAQ content. The question has a toggle to hide the answers
Product FAQ content

Let’s start with a shared ProductFAQs.jsx component:

Next, we’ll add it to our product page. The ProductDetails.client component is used for the main content of this page, so it’s tempting to turn the ProductFAQs into a client component so that the ProductDetails component can use it directly. However, we can avoid this by passing the ProductFAQs through to the product/[handle].server.jsx page:

And then update the ProductDetails component to use the children:

Next, we want to add the client interactivity to the ProductFAQs component. Again, it would be tempting to convert the ProductFAQs component from a shared component into a client component, but that isn’t necessary. The interactivity is only for expanding and collapsing the FAQ content. The content itself is hardcoded and doesn’t need to be part of the client bundle. What we’ll do instead is extract the client interactivity into an exclusively client component, Accordion.client.jsx:

We’ll update the ProductFAQs component to use the Accordion:

At this point, there’s no reason for the ProductFAQs component to remain a shared component. All the client interactivity is extracted out and, similar to the NewsletterSignup component, I know this component will never be used by a client component. All that’s left now is to:

  • rename the file from ProductFAQs.jsx to ProductFAQs.server.jsx
  • update the import statement in product/[handle].server.jsx
  • add some nice styling to it via Tailwind.

You can view the final Product FAQ code on Stackblitz.

React Server Components are a paradigm shift, and writing a component for an RSC application can take some getting used to. Keep the following in mind while you’re building:

  • Start out with a shared component.
  • Extract functionality into a client component in specific cases.
  • Pivot to a server component if the code never needs to or never should execute on the client.

Happy coding!

Cathryn is a Staff Front End Developer on Shopify’s Checkout team and a founding member of Hydrogen. She works remotely in Montreal, Canada. When not coding, she’s usually playing with her dog, crafting, or reading.

Rapid Development with Hydrogen: Building a Product Page

Updated for compatibility with Hydrogen 0.26.0

Last year we released Hydrogen, our React-based framework for building custom storefronts. Hydrogen allows developers to build fast, dynamic commerce experiences by leveraging streaming server-side rendering, React Server Components, and caching APIs. Hydrogen is currently in developer preview and I'm excited to show you how you can rapidly build out a simple product page by leaning on Hydrogen's components.

Sample Snowdevil Product Display Page showing an image of a snowboard, the name, price, variant picker, and Add to cart button
We’ll be using Hydrogen to build a product display page.

Previously, constructing a custom storefront required developers to manually manipulate data and create custom components for each page. Hydrogen accelerates this process by offering Shopify-specific commerce components, hooks, and utilities that allow developers to focus on the fun stuff: building unique storefront experiences.

Getting Started

To get started, generate a new Hydrogen app with the ‘Hello World’ template over on StackBlitz.

Most of the files you’ll work with are located in the /src directory. This directory will contain routes, components and the main app component (App.server.jsx). For an in-depth overview, see the getting started guide.

Add a styling library

We’ll be using the Tailwind CSS framework to style the product page today. You can learn more about Tailwind on Hydrogen here.

  1. Stop the StackBlitz development server (CTRL + C)
  2. Install tailwindcss and its peer dependencies, and generate the tailwind.config.js and postcss.config.js files:
    $ npm install -D tailwindcss @tailwindcss/typography postcss autoprefixer
    $ npx tailwindcss init -p
  3. Add the paths to the template files in your tailwind.config.js file:

  4. Add Tailwind directives to /src/index.css:

  5. Start the development server again.
    $ vite
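
For steps 3 and 4, the files might look roughly like the following. This is a sketch based on Tailwind’s standard setup rather than the exact files from this tutorial; the glob patterns are an assumption, so adjust them to wherever your templates live.

```javascript
// tailwind.config.js (sketch for step 3; adjust the globs to your project)
module.exports = {
  // Files Tailwind scans for class names.
  content: ['./index.html', './src/**/*.{js,jsx,ts,tsx}'],
  theme: {
    extend: {},
  },
  // Registers the typography plugin installed in step 2.
  plugins: [require('@tailwindcss/typography')],
};

// Step 4: /src/index.css needs the three Tailwind directives:
//   @tailwind base;
//   @tailwind components;
//   @tailwind utilities;
```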

You now have access to Tailwind classes. Make a change to the Index route and watch the styling kick in:

Hydrogen Hello World
A styled heading

Creating a Product route

Hydrogen uses file-based routing. To register a /products/snowboard route, we can create a /src/routes/products/snowboard.server.jsx component.

Given product handles are dynamic, we want to catch all /products/:handle requests. We can do this by using square brackets to define a parameter.
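
If the bracket convention is new to you, the toy matcher below shows the idea: a bracketed file name segment becomes a named parameter in the URL. This is not Hydrogen’s router, just a minimal sketch; `routeFromFile` is a hypothetical name, and it assumes route files live under /src/routes and end in .server.jsx.

```javascript
// Illustrative only: a toy version of bracket-style file routing.
// Hydrogen's real router is not shown here.

// Turns "/src/routes/products/[handle].server.jsx" into a matcher for
// "/products/:handle".
function routeFromFile(filePath) {
  const urlPattern = filePath
    .replace(/^\/src\/routes/, "")
    .replace(/\.server\.jsx$/, "");
  const paramNames = [];
  const source = urlPattern
    .split("/")
    .map((segment) => {
      const param = /^\[(\w+)\]$/.exec(segment);
      if (param) {
        paramNames.push(param[1]);
        return "([^/]+)"; // a parameter matches exactly one path segment
      }
      // Escape regex metacharacters in static segments.
      return segment.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    })
    .join("/");
  const regex = new RegExp(`^${source}$`);

  return function match(pathname) {
    const result = regex.exec(pathname);
    if (!result) return null;
    const params = {};
    paramNames.forEach((name, i) => {
      params[name] = result[i + 1];
    });
    return params;
  };
}

const matchProduct = routeFromFile("/src/routes/products/[handle].server.jsx");
console.log(matchProduct("/products/the-full-stack")); // { handle: 'the-full-stack' }
console.log(matchProduct("/collections/boards")); // null
```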

Create a new file /src/routes/products/[handle].server.jsx and export a Product component. We can lean on the useRouteParams hook to retrieve the handle parameter:

Pointing your browser to /products/the-full-stack renders a simple header and the the-full-stack handle on screen:

Sample Hydrogen Product Display Page that's missing the image of a snowboard, name, price, variant picker, and Add to cart button
A product route displaying the product handle.

Fetching data

Hydrogen communicates with Shopify via the Storefront API which makes it possible for customers to view products and collections, add products to a cart, and check out. Hydrogen conveniently exposes a useShopQuery hook to query the Storefront API, with an access token already configured (the details can be found in /shopify.config.js).

Out of the box, the Demo Store and Hello World templates are connected to a Hydrogen Preview store, which has a number of snowboard collections, products, variants and media - ideal for testing.

Import the useShopQuery hook and use the dynamic product handle to fetch a product’s title and description:

By providing a prose class to the description, the Tailwind CSS Typography plugin adds typographic defaults to the vanilla HTML pulled from the Shopify Admin.

Sample Snowdevil Product Display Page that's missing the image of a snowboard
A product page with a title and description.

Using state

Hydrogen implements React Server Components, which allow the server and the client (the browser) to collaborate in rendering the React application (learn more). By default, all routes are server components.

We'll be using a ProductOptionsProvider component to set up a context with state that tracks the selected variant and options. To use state, create a client component (/src/components/ProductDetails.client.jsx) and import it into your server component (/src/routes/products/[handle].server.jsx).

Update the product query to fetch product media, variants and options, and then wrap the product details in a ProductOptionsProvider component.

With the context in place, it's a breeze to build out the interactive parts of the product page, like the variant selector. By leaning on the useProductOptions hook, we can get a list of options and manage selected option state. Passing the selected variant ID to ProductPrice dynamically updates the selected variant’s price.

A variant picker has been added to the product page.

Adding a buy button

Hydrogen exports a BuyNowButton component which sends customers to checkout. Get the selected variant ID, and pass it to a BuyNowButton. If the selected variant is out of stock, display a message:

Media gallery & finishing touches

With a functioning product page in place, create a media gallery (you guessed it, there's a component for that too) and add some additional styling:

The final code is found on StackBlitz.

A variant picker has been added to the product page
The final product!

Hydrogen Enables Rapid Development

Taking advantage of these components, hooks and utilities allows you to skip many of the repetitive parts of building a custom storefront, speeding up the development process.

I hope Hydrogen has piqued your interest. Explore the docs or build a complete storefront by following the new tutorial and take Hydrogen for a spin on your next project!

Scott’s a Developer Advocate at Shopify, located on the east coast of Australia and formerly a Shopify app developer and developer bootcamp coach. When he's not tinkering with code, you'll find him checking the surf or hanging out with his family.

How We Fixed the Dependency Confusion Vulnerability in Over 600 Ruby Applications

Shopify has grown significantly over the years, and our success makes us an attractive target for malicious actors. We take the safety of our merchants seriously, so we have a good reason to continuously improve the security at Shopify. 

I’ll share how the Ruby Conventions team, which focuses on creating conventions to make Ruby services sustainable, used an iterative approach to solve complex problems at scale while responding to shifting circumstances. In particular, how we solved the dependency confusion vulnerability in over 600 Ruby applications, developed tooling that allows us to do large-scale migration with ease, and made the Ruby community a bit safer.

Understanding the Dependency Confusion Problem

Shopify runs a bug bounty program where we pay people to find vulnerabilities on our platform and learn what we have to improve on. One such report showed that we were vulnerable to a dependency confusion vulnerability that could give an attacker access to our local, continuous integration/continuous deployment (CI/CD), and production environments.

The vulnerability leverages the ambiguity of a package source to install malicious dependencies. If an external package is created with a higher version number under the same name as an internal Shopify package, the external dependency is resolved instead of the internal dependency. 
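
The mechanics are easy to model. The sketch below is not Bundler’s resolver; it is a simplified illustration in which an ambiguous lookup searches every source and picks the highest version, while a pinned lookup only searches the source assigned to the package. The package name, versions, and source names are made up.

```javascript
// Illustrative model of dependency confusion; not Bundler's resolver.
// Package names, versions, and source names below are made up.

// Compares dotted numeric versions such as "1.4.0" and "99.0.0".
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Ambiguous resolution: every source is searched and the highest version wins.
function resolveAmbiguous(name, sources) {
  const candidates = sources.flatMap((source) =>
    (source.packages[name] || []).map((version) => ({
      version,
      source: source.name,
    }))
  );
  candidates.sort((x, y) => compareVersions(y.version, x.version));
  return candidates[0] || null;
}

// Pinned resolution: only the source assigned to the package is searched.
function resolvePinned(name, sources, pinnedSourceName) {
  const pinned = sources.filter((source) => source.name === pinnedSourceName);
  return resolveAmbiguous(name, pinned);
}

const sources = [
  { name: "internal", packages: { "shop-utils": ["1.4.0"] } },
  { name: "public", packages: { "shop-utils": ["99.0.0"] } }, // attacker-published
];

console.log(resolveAmbiguous("shop-utils", sources).source); // public
console.log(resolvePinned("shop-utils", sources, "internal").source); // internal
```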

In Ruby, developers use Bundler to manage their dependencies and make their environments reproducible. Bundler resolves dependencies so that you use the correct versions and sources for each gem. The Bundler team fixed the issue by introducing a new Gemfile.lock file format that’s created by a fresh install or an update. The new format assigns each gem to an explicit source:

However, at that time, the new format required you to upgrade. That meant Bundler updated all dependencies in the lockfile, which would require vetting each update and testing the application for regressions in behavior.

Identifying the Impact

We didn’t know how many applications were susceptible to the dependency confusion vulnerability, which made it hard to assess the impact of the problem. Our first step was to disambiguate the situation, so we could understand the problem better.

Disambiguating unknowns doesn’t need to be fancy, and it’s better to have some insight than none. In our case, we defined a cron job in our CI system to get the Bundler version information from all repositories into our data lake. It turned out that around 600 Ruby applications were susceptible to the dependency confusion vulnerability.
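
As an illustration of how little is needed, the Bundler version that produced a lockfile is recorded in its BUNDLED WITH section. The sketch below is not the job we ran; it only shows one way to read that section and compare it against a threshold. The sample lockfile is made up, and "2.2.0" is a placeholder threshold rather than the actual fixed version.

```javascript
// Illustrative sketch, not the job described above: reads the Bundler
// version recorded at the bottom of a Gemfile.lock.
function bundledWith(lockfileText) {
  const match = /^BUNDLED WITH\r?\n\s+(\S+)/m.exec(lockfileText);
  return match ? match[1] : null;
}

// True when `version` is lower than `minimum` (dotted numeric versions only).
function isOlderThan(version, minimum) {
  const a = version.split(".").map(Number);
  const b = minimum.split(".").map(Number);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const diff = (a[i] || 0) - (b[i] || 0);
    if (diff !== 0) return diff < 0;
  }
  return false;
}

// A made-up lockfile for demonstration.
const lockfile = [
  "GEM",
  "  remote: https://rubygems.org/",
  "  specs:",
  "    rake (13.0.6)",
  "",
  "DEPENDENCIES",
  "  rake",
  "",
  "BUNDLED WITH",
  "   2.1.4",
  "",
].join("\n");

// "2.2.0" is a placeholder threshold, not the actual fixed version.
const version = bundledWith(lockfile);
console.log(version, isOlderThan(version, "2.2.0")); // 2.1.4 true
```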

Having that data also allowed us to create a metric of outstanding migrations and measure progress towards solving our problem. It’s also a great way of detaching the solution from the goal, which is less constraining.

Changing Assumptions Through Experimentation

As developers, we have to take quite a few constraints into account in our solutions. When developing software iteratively, we try to change some of those constraints and reevaluate our solution quickly. Making those changes as soon as possible surfaces unknowns, increasing the likelihood of a successful project.

In our case, having over 600 repositories to migrate meant that manually migrating every application would be too time-consuming. Requiring teams to do it themselves would be tedious and error-prone because the Gemfile.lock file couldn’t be automatically updated while keeping the current gem versions. In that case, developers would need to modify the lockfile to revert the version updates to prevent regressions from being introduced.

If we were able to update a Gemfile.lock to the new format without updating dependencies, it would enable us to automate rolling this upgrade out to all Ruby applications in Shopify. We would only rely on the application owners to deploy the changes.

We experimented with building a Bundler plugin (a gem that extends Bundler’s functionality) to automate the upgrade. It updated the Gemfile.lock file to the new format without updating dependencies. The plugin boiled down to:

  1. Initializing the specification for a given Gemfile.lock file that contains information about the gems such as the name, the version, and remote.
  2. Updating the Gemfile.lock file to the new lockfile format, which updates all gems in the process. We minimize updates by only permitting patch version updates.
  3. Replacing the versions in the updated Gemfile.lock file with the gem versions from the old Gemfile.lock file.
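
The three steps can be sketched with plain objects standing in for lockfiles. This is a simplified model, not the plugin’s code: the real plugin worked on Gemfile.lock files through Bundler, which isn’t modeled here, and the gem name, versions, and object shape are chosen purely for illustration.

```javascript
// Illustrative sketch of the three steps with plain objects; the real plugin
// worked on Gemfile.lock files through Bundler, which is not modeled here.

// Step 1: remember the version each gem is locked to before the migration.
function lockedVersions(lock) {
  return Object.fromEntries(lock.gems.map((gem) => [gem.name, gem.version]));
}

// Step 3: keep the new format and per-gem remotes, but put the old versions back.
function restoreVersions(updatedLock, oldVersions) {
  return {
    ...updatedLock,
    gems: updatedLock.gems.map((gem) => ({
      ...gem,
      version: oldVersions[gem.name] ?? gem.version,
    })),
  };
}

const oldLock = {
  format: "old",
  gems: [{ name: "rack", version: "2.2.3" }],
};

// Step 2 (stand-in for Bundler's update): new format, patch-level bump.
const updatedLock = {
  format: "new",
  gems: [{ name: "rack", version: "2.2.4", remote: "https://rubygems.org/" }],
};

const migratedLock = restoreVersions(updatedLock, lockedVersions(oldLock));
console.log(migratedLock);
```

The result keeps the new format and the explicit remote for each gem while leaving every version exactly where it was before the migration.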

This approach wasn’t a perfect solution, but it worked well enough to run Bundler migrations. It allowed us to proceed to the next problem area of migrating large numbers of applications.

Running Migrations at Scale

One of the biggest challenges in running large-scale migrations is handling edge cases. Rather than exploring how migrations can go wrong beforehand, it’s more effective to migrate a handful of applications and discover the actual problems. The other benefit is that we can identify and migrate the subset of applications with issues that have known solutions while resolving the edge cases at the same time. This approach allows us to constantly deliver on our goals and put ourselves in a better spot each day.

Our Bundler plugin migrated the lockfile without dependency updates, and then we could start migrating applications. We started out running the plugin on a handful of applications that weren’t merchant-facing. This went smoothly, and we decided to run it on a larger batch for non-critical repositories. However, we noticed issues arising from inconsistent build setups, Ruby versions, and other configurations in the larger batches of migrations.

Some of our tooling didn’t support the latest Bundler version, and we had to work with our deployment, CI, and local environments teams to update them. Our collaborations were particularly fruitful when we:

  • investigated the issue first
  • tried to solve it
  • shared the context with the team. 

Most people want to help and making it easy for them benefits everyone.

Some of our Docker images are built with Heroku’s Ruby buildpack, which didn’t support the required Bundler version. This situation rendered a percentage of applications unable to migrate. To solve this issue, we worked with the Heroku Buildpack team to adopt the latest Bundler version. They released a new version with the Bundler update, making it broadly available in the Ruby community.

Another critical element was raising awareness with project owners and setting a deadline to deprecate the old Bundler version. Being upfront with owners and communicating the impact of the change allowed teams to prioritize and work with us to update their projects.

The Bundler migration plugin was run locally, but scalability issues arose. It became too complicated to manage different Ruby versions, parallelize them, and address failures. Instead of wasting time on building a solution that would have solved all eventualities at the start, we used the migration plugin to its breaking point, investigated the problem areas, and implemented improvements. 

As a response to our scaling issues, we built a command-line interface (CLI) tool on top of our CI system to set up the right environment for a repository, run commands on it, and open a pull request (PR) based on the changes made. Having an environment per repository worked great because we didn’t run into misconfiguration problems anymore. Using our CI system also allowed us to parallelize the execution, which, in turn, sped the process up. Furthermore, migration failures were easier to recover from and track.

Preventing Future Problems

Part of iteratively solving a problem means focusing on current problems rather than future concerns. However, it doesn’t mean ignoring future concerns altogether. It’s important to distinguish between critical concerns and ones that can be figured out later on.

One example was preventing a Gemfile.lock file from regressing to its previous format that would make us vulnerable. We were aware of the possibility of regressions, but we also knew that we could build tooling to solve this issue. Instead of investing time in tackling the problem upfront, we decided to wait and start working on it once we migrated most applications. This approach also allowed us to gauge the magnitude of the problem rather than wasting resources working out hypotheticals.

We encountered a handful of regressions during our migration and were a bit concerned. We investigated each manually to see if there were bigger problems present. Since we didn’t find anything suggesting deeper problems, we carried on and continued monitoring, knowing that if we ran into more regressions, we’d have more information to change course and face the new reality.

We investigated the lockfile regression problem and shared what we learned with the Bundler team. They enhanced the tool to prevent these cases from occurring in the future. We didn’t need to implement special tooling to prevent regressions, which saved us a lot of work and time. We only had to make sure that all applications were using the correct Bundler version.

Because we staggered the migration to make continuous progress, most of our applications had been migrated to a Bundler version that didn’t yet prevent regressions. By then we had battle-tested our migration tooling and resolved most configuration issues, so we were able to migrate all of our applications to the latest Bundler version in less than a day.

Rather than waiting for the perfect solution, we made iterative changes that improved our tooling to the point where changes that used to be hard became easy. This de-risked the deployment.

To prevent the installation of malicious gems, we made changes to our local environment tooling to ensure it always defaults to the recommended Bundler version. This ensures that an individual developer machine isn’t susceptible to running malicious code from the dependency confusion vulnerability. We also started failing CI whenever it encountered an out-of-date Bundler version, ensuring that any code change that could introduce the dependency confusion vulnerability wouldn’t be merged. Since most of our other automated processes require CI to execute, we rely on CI to catch vulnerable Bundler versions.
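A CI gate of this kind can be sketched in a few lines of Ruby. This is a minimal sketch rather than our actual tooling: the helper names and the minimum version constant are assumptions chosen for illustration, and it only reads the `BUNDLED WITH` section that Bundler writes at the end of a Gemfile.lock.

```ruby
require "rubygems" # Gem::Version ships with Ruby

# Assumed minimum for illustration; pick the version your organization recommends.
MINIMUM_BUNDLER = Gem::Version.new("2.2.22")

# Returns the version recorded under "BUNDLED WITH", or nil if the section is absent.
def bundled_with(lockfile_text)
  match = lockfile_text.match(/^BUNDLED WITH\n\s+(\S+)/)
  match && Gem::Version.new(match[1])
end

# True only when the lockfile records a Bundler version at or above the minimum.
def bundler_up_to_date?(lockfile_text, minimum: MINIMUM_BUNDLER)
  version = bundled_with(lockfile_text)
  !version.nil? && version >= minimum
end
```

A CI step could then read the repository's Gemfile.lock, call `bundler_up_to_date?`, and exit with a non-zero status when it returns false.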

Sharing What We've Done with the Community

We love open source at Shopify, and we like giving back to the community. When contributing, it is quite valuable to share the purpose as well as the solution. It leads to insightful conversations that result in a better solution. Often, contributions aren’t solely PRs. Providing context on investigative work, bringing problems to someone's attention, or testing another contributor’s prototypes are just as valuable.

Our plugin worked pretty well for us, so we created a proposal in Bundler to fix the issue for the Ruby community. These changes would allow Bundler to update the Gemfile.lock file without upgrading gems in the process. Our proposal didn’t make it in, but led to a conversation resulting in an alternative approach that was shipped in Bundler 2.2.21. We helped test their approach on our applications to ensure that we caught as many edge cases as possible to help minimize the potential burden on the community. 

We also ran into issues where developers using an insecure version of Bundler could accidentally revert to the old lockfile format. The problem was that the latest Bundler version (at the time) still resolved the old Gemfile.lock file on `bundle install`, which made it very simple to regress to the old format. We created a prototype to prevent that from happening, which sparked another conversation with the maintainers of Bundler and brought the issue to their attention. They released version 2.2.22 of Bundler, which prevents regressions and makes everybody in the community more secure.
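To make the idea of a format regression concrete, here is a rough sketch of a detector. It rests on an assumption that is not spelled out above: that the old format can be recognized by a single `GEM` section listing more than one `remote:`, while the newer format keeps one `GEM` section per source. The method name and the private source URL are hypothetical.

```ruby
# Returns true if any GEM section of the lockfile lists more than one remote.
def merged_remotes?(lockfile_text)
  in_gem_section = false
  remotes = 0
  lockfile_text.each_line do |line|
    if line.match?(/\A\S/) # an unindented line starts a new section
      in_gem_section = line.strip == "GEM"
      remotes = 0
    elsif in_gem_section && line.match?(/\A  remote: /)
      remotes += 1
      return true if remotes > 1
    end
  end
  false
end

# Illustrative lockfile fragments; gems.example.com is a placeholder source.
OLD_STYLE = <<~LOCK
  GEM
    remote: https://rubygems.org/
    remote: https://gems.example.com/
    specs:
      rake (13.0.6)
LOCK

NEW_STYLE = <<~LOCK
  GEM
    remote: https://gems.example.com/
    specs:

  GEM
    remote: https://rubygems.org/
    specs:
      rake (13.0.6)
LOCK

puts merged_remotes?(OLD_STYLE) # => true
puts merged_remotes?(NEW_STYLE) # => false
```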

We set out to fix the dependency confusion vulnerability in every Ruby project at Shopify and succeeded. This wouldn’t have been possible if we hadn’t followed an iterative approach that allowed us to make steady progress while taking shifting circumstances into account. We developed tooling that allows us to do large-scale migration, which has come in handy for other uses. We also aggregated Bundler version data on our Ruby projects to track adoption and make future decision-making easier. Lastly, we have worked closely with the Bundler team to improve the base functionality while leveraging Shopify’s scale to find edge cases, fix bugs, improve Bundler, and make it better for everyone in the Ruby community.

Frederik is a production engineer at Shopify and part of the Ruby & Rails infrastructure team. He contributed to massively scaling Shopify’s CI/CD system and making Ruby services more secure across Shopify and the Ruby community.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Cloud, Load, and Modular Code: What 2022 Looks Like for Shopify

You may have heard that 2021 was Shopify’s biggest Black Friday Cyber Monday (BFCM) ever. This four-day period was monumental for both Shopify’s merchants and our engineering teams.

Last year’s numbers capture a moment in time but can also help us predict what’s to come in the year ahead. On our cloud in 2021, our peak BFCM traffic surpassed 32 million app server requests per minute (RPM). In the same time period our load balancers peaked at more than 34 million RPM. To put that in perspective, this means that the equivalent of Texas’s total population hit our load balancers in a given minute. One flash sale—a short-lived sale that exceeds our checkout per minute threshold—even generated enough load to use over 20% of our total computing capacity at its peak.

During BFCM 2021, we also:

  • sent nearly 145 million emails
  • averaged 30 TB per minute of egress network traffic
  • handled 42 billion API calls and delivered 13 billion webhooks
  • wrote 3.18 PB and read 15 PB of data from our storefront caching infrastructure
  • performed over 11 million queries per second and delivered 11 terabytes per second of read I/O with our MySQL database fleet

The year ahead poses even bigger challenges for our engineers, data scientists, user experience designers, and product managers. More BFCM sales are happening on mobile devices. More people are shopping on social media. Commerce is happening across a growing array of platforms and buyers expect a fast and consistent experience. If the metaverse becomes a reality, there will be commerce opportunities within that world that need to be explored. What does a flash sale look like in the metaverse and how does that play out?

Infographic of Shopify's BFCM 2021 technical stats
Shopify's technical stats from BFCM 2021

If the data and trends above tell us anything, it's that there’s no getting around the fact that flash sales, huge floods of web traffic, and many different buying environments are a big part of the future of commerce. The questions for me are: What are the enduring challenges for the engineering teams working to enable this incredible growth in the next five to ten years? How do we build scalable products and infrastructure so millions of merchants can go from zero to IPO—and beyond? Engineering at Shopify is about solving challenges and building resilient systems so merchants can focus on their business instead of technology. 

Here are a few things we’re planning on doing in 2022 to work quickly in a world that’s growing rapidly, becoming more global, and at the same time moving closer to where merchants do business and where buyers are shopping.

We are building more modular code. Shopify is famously one of the world’s largest Rails monolith codebases. We’ve been actively changing the architecture of the monolith to a majestic, modular monolith for several years. And more recently, we’ve been changing our architectural patterns as we deconstruct parts of the monolith for better developer productivity.

As an example, we split out our storefront rendering process from the modular monolith repo to make sure merchants (and their customers) get the fastest online shopping experience possible. When we were done with the split and some code refactoring work, the results were four times faster cache fill rates and five times faster page render times. Also, pulling the storefront renderer out means it can now be deployed in geographies around the planet without having to deploy our full Rails monolith. The closer we can render the storefront to the buyer, the fewer round-trips between the store and the browser need to be made, again improving overall storefront performance. In 2022, we’re going to continue exploring majestic monoliths. We see that engineers working on repos that directly improve merchant performance, like storefront rendering, iterate and deploy quickly. This model also allows us to put our developer experience first and provide a simpler setup with tighter coupling with our debugging and resiliency tools. 

We are leveraging new cloud development platforms to work more efficiently on a global scale. This year, we’ll spend a lot of time making sure developers can create impact fast—in minutes not hours. We’re moving the majority of our developers into our cloud development environment, called Spin. Devs can spin up (pun intended) a full development environment in seconds as opposed to minutes. You can even have multiple environments for experimentation to share work-in-progress with teammates. (We plan to share more about Spin in the future.)

Another big part of this year will be about building on this cloud development platform foundation to make our developer workflow faster and even smoother. We also moved all of our engineering to working on Apple M1 MacBook Pro laptops, and these powerful devices, combined with Spin, are already making developers much more productive. Spin creates opportunities for us to build much improved IDE and browser extensions for enhanced productivity and delight, and to explore new ways to solve developer problems at scale that just weren’t possible in our previous local development environment paradigm.

We are making load testing a more natural part of the development process. To prepare for BFCM 2021, we began load testing in July and ran the highest load test in Shopify’s history: a load balancer peak of 50.7 million RPM. But flash sales that spike in minutes are not as predictable in their load requirements as a seasonal growth pattern like BFCM. To help prepare our infrastructure and products to handle larger and spikier scale, we’re continuing to improve our load testing. These load tests, built in-house, help our teams understand how products handle the larger platform-wide surge scenarios. Our load testing helps test product sales regardless of whether they are exclusively online, in-person using our retail POS products, or a combination of both. Automating and combining load tests as part of our product development processes is absolutely critical to avoid performance issues as we scale alongside our merchants.

These are a few ways we’re making it as easy as possible for developers to do the best work of their lives. We want to have the right tools so we can be creative about commerce—not “How do I set up my environment?” or “How does my code get built?” Engineers want to work at scale, ship impactful changes on a regular cadence, and work with a great team.

Speaking of great teams, a team of engineers from Shopify and GitHub built YJIT, a new just-in-time (JIT) compiler that was merged into Ruby 3.1. It’s 31% faster than interpreted CRuby and 26% faster than MJIT, reaching near-peak performance after a single iteration of any benchmark. It’s having a huge impact on the Ruby community inside and outside of Shopify and accelerating lots of production code execution times.

What isn’t changing in 2022: We remain opinionated about our tech stack. We’re all in on Rails and doubling down on React Native for mobile. We are going to continue to make big bets on our infrastructure, on building delightful developer environments, and making sure that we’re building for the success of all of our merchants. BFCM 2022? Bring it on.

Allan Leinwand is Chief Technology Officer at Shopify leading the engineering and data teams. Allan was previously SVP of Engineering at Slack and CTO at ServiceNow. He co-founded and held senior leadership positions at multiple companies, has authored books, and ventured to the dark side as a venture capital investor for seven years. He’s passionate about helping Shopify be the best commerce platform for everyone!

Search at Shopify—Range in Data and Engineering is the Future

One thing I’ve always appreciated about Shopify is the emphasis on range: the ability to navigate across expertise. Range isn’t just a book we love at Shopify, it’s built into our entire outlook. If you’re a developer at Shopify, you could start your career building data science infrastructure, but decide a few years later to pivot to Ruby internals.

The emphasis on range inspires me. In my coding journey, I’ve loved ranging. I started building AppleBasic programs in 4th grade. Years later my high school friends would try to one-up each other, obsessed with the math behind 3D games.

What does any of this have to do with search?

While most would see search and discovery as some kind of deep specialty, it actually requires an intense amount of range. Many search teams focus too much on specialists. In the words of my former colleague Charlie Hull, teams always wanted to hire “magical search unicorns” that often don’t exist. Instead, they tended to have siloed data and engineers working on search.

I’ve taken these painful experiences to heart when helping build Shopify’s search team. I want to share why range is a core team principle that separates us from the herd and sets us up for long-term success. (And of course, why you should join, even if you’re not a magical search unicorn!).

Lack of Range: Dysfunction between Data and Engineering 

In reality, nobody on our search team is an “engineer” or “data scientist”. Instead they have the range to be both at the same time. In fact, most of the team has a wide range when it comes to past jobs or hobbies: from linguists to physicists! After all, good decisions require fitting both data science and engineering skills into one brain.  

Why? Because of the trade-offs.

Pure data scientists or engineers waste time making poor decisions because they lack full context. They won’t see the other competency’s constraints. That’s why generalizing beyond our expertise is a major part of how Shopifolk work on every project. And that’s precisely why we’ve brought this value to the search domain.

Consider life in the data silo: without engineering context, data scientists could easily chase bleeding-edge machine learning research without considering how to deliver to production. They develop a new model, decide shipping to production isn’t their job, and instead give the new model to engineers to translate.

In the engineer silo, they don’t have the context needed to make the important tradeoffs. Can they know where to tweak the model to remove bloat that doesn’t hurt relevance? Can pure engineers make the dozens of minute-by-minute decisions they need to optimize relevance, performance, and stability? Without the data context in their brain, they’ll fail, leading to suboptimal solutions!

Great engineering is about making the best decision given the constraints. So when an engineer lacks one crucial piece of know-how (data and relevance), they won’t arrive at the optimal solution between relevance, performance, stability, and other product factors. They’ll blindly implement the model, unsure where to tweak, leading to disastrous results in one of these dimensions.

That leads me to the other end of the trade-off spectrum: the data team creates a reasonable solution, but the infrastructure won’t bend. Unfortunately the engineers, specifically skilled in performance and reliability, might not see the full search quality spectrum of relevance, experience, and performance. Their incentives focus on questions like: Does search satisfy its service-level agreement? Does it keep me from being woken up at 3 a.m. when I’m on call? With only those constraints, why would an engineer care to build a complicated-looking search relevance model that only runs the risk of creating more complexity and instability?

Coordination between two groups—each with only half of the skills needed to make decisions—creates dysfunction. It adds needless time to production deployment and creates politics. 

Silos like these only lead to the dark side.

The solution? RANGE

Range: The Solution to Dysfunction between Data and Engineering

At Shopify, we have one team with members from both competencies. We draw very few lines between “data” and “engineering” work. Instead we have “search” work.

Engineers on our team must grow data science skills—they learn to build and run experiments. They think scientifically and evaluate the quality of a model. Data scientists find themselves pushed to become good engineers. They must build high quality, performant, and testable code. When they build a model, it’s not just a random idea in a notebook, it’s on them to get it to production and create a maintainable system.

Why does this matter? Because search, like all software development, requires making dozens of deeply intricate tradeoffs between correctness, scalability, performance, and maintainability. Good decisions require fitting both data science and engineering skills in one brain. An elegant solution to a problem is the simplest one that satisfies all of the constraints. If you can only fit half the constraints in your head, you’ll fail to see the best solution that makes search smart, fast, and scalable.

A close partnership between data and engineering organizations makes this possible. Management on both sides has experience and commitment to close collaboration and partnership. At the level of individual contributors, we don’t think of ourselves as two teams. We’re one team, with individuals that report to a few different leads. We organize, plan, and execute together. We don’t carve out territorial fiefdoms.

Data and Engineering Range is the Future

When you look at the problems of tomorrow, they’ll increasingly be less about point-and-click interactivity. They’ll frequently include some “smart” user interaction. The user wants to:

  • talk to the system 
  • start with a curated set of possibilities tailored to them and fine tune them with their preferences 
  • be given options or taken on a journey that doesn’t filter out obvious paths they won’t care about.

This isn’t just the cool stuff people add on to an existing application: it’s increasingly the core part of what’s being built. 

I see search and discovery at Shopify as just the beginning. The more personalized or conversational products we build, like those listed above, the more engineers must have the range to push into data (and vice versa). The future isn’t specialization within data science and engineering—it’s having the range to move between both.

Doug Turnbull is a Sr. Staff Engineer at Shopify working on search and discovery. Doug wrote Relevant Search and contributed to AI Powered Search. Doug also blogs heavily at Shopify and his personal site. Currently Doug’s passion includes incubating search and discovery skills at Shopify, planning technical initiatives in search and discovery, and collaborating with peers to make commerce better for everyone through search!

That Old Certificate Expired and Started an Outage. This is What Happened Next

In distributed systems, there are plenty of occasions for things to go wrong. This is why resiliency and redundancy are important. But no matter the systems you put in place, no matter whether you did or didn’t touch your deployments, issues might arise. This makes it critical to acknowledge the near misses: the situations where something could have gone wrong and the situations where something did, but it could have been worse. When was the last time it happened to you? For us, at Shopify, it was on September 30th, 2021, when the expiration of Let’s Encrypt’s (old) root certificate almost led to a global outage of our platform.

In April 2021, Let’s Encrypt announced that its former root certificate was expiring. Because we’ve used Let’s Encrypt as our public certificate provider since we became a sponsor in 2016, we made sure that Shopify’s edge infrastructure was up to date with the different requirements, so we wouldn’t stop serving traffic to all of (y)our beloved shops. As always, Let’s Encrypt did their due diligence with communications and by having their new root certificate cross-signed by the old one. This means that clients that didn’t trust the new root certificate yet, but trusted the old one, would extend that trust to the new one because the new root was signed by the old. Also, the period of time between the announcement and the expiration was sufficient for any Let’s Encrypt-issued certificates, which expire after three months, to be signed by the new cross-signed root certificate and considered valid using either the old or the new root certificate. We didn’t expect anything bad to happen on September 30th, 2021, when the root certificate was set to expire at 10:00 a.m. Eastern Time.

At 10:15 a.m. that same day, our monitors started complaining about certificate errors, not at Shopify’s edge (that is, between the public and Shopify) but between our services. As a member of Shopify’s traffic team, which handles a number of matters related to bringing traffic safely and reliably into Shopify’s platform (including the provisioning and handling of certificates), I joined the incident response to try and help figure out what was happening. Our first response was to lock the deployments of the Shopify monolith (using spy, our chatops) while some of us connected to containers to try and figure out what was happening in there. In the meantime, we started looking at the deployments that were happening when this issue started manifesting. It didn’t make any sense, as those changes didn’t have anything to do with the way services interconnected, nor with certificates. This is when the Let’s Encrypt root certificate expiry started clicking in our minds. An incident showing certificate validity errors happening right after the expiry date couldn’t be a coincidence, but we couldn’t reproduce the error in our browsers or even using curl. Using openssl, we could, however, observe the certificate expiry for the old root certificate.
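The same kind of expiry check can be sketched with Ruby's standard `openssl` library. This is an illustration, not the command we ran during the incident: the `expired?` helper is a hypothetical name, and the certificate is a stand-in built in memory so the example runs offline. Its expiry mirrors the time quoted above, assuming Eastern Daylight Time (UTC-4).

```ruby
require "openssl"

# True when the certificate's validity window has ended at the given instant.
def expired?(certificate, at: Time.now)
  certificate.not_after < at
end

# Stand-in for the old root certificate. A real check would instead load the
# PEM file, for example OpenSSL::X509::Certificate.new(File.read(path)).
stand_in_root = OpenSSL::X509::Certificate.new
stand_in_root.subject = OpenSSL::X509::Name.parse("/CN=Example Old Root")
stand_in_root.not_before = Time.utc(2000, 9, 30)
stand_in_root.not_after = Time.utc(2021, 9, 30, 14, 0) # 10:00 a.m. Eastern, assuming UTC-4

puts expired?(stand_in_root, at: Time.utc(2021, 9, 30, 13, 59)) # => false
puts expired?(stand_in_root, at: Time.utc(2021, 9, 30, 14, 15)) # => true
```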

The error was related to the client being used for those connections. And we saw those errors appearing in multiple services across Shopify using different configurations and libraries. For a number of those services, the errors were bubbling up from the internally-built library allowing services to check people’s authentication to Shopify. While Faraday is the library we generally use for HTTP connections, our internal library has dependencies on rack-oauth2 and openid_connect. Looking at the dependency chains for both applications, we saw the following:

Both rack-oauth2 (directly) and openid_connect (indirectly) depend on httpclient, which, according to the GitHub repository of the library, “gives something like the functionality of libwww-perl (LWP) in Ruby.”
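For readers who want to trace a dependency chain like this themselves, here is a minimal sketch that walks the `specs:` entries of a Gemfile.lock and lists every gem that pulls in a target gem, directly or indirectly. The helper names are hypothetical, and the sample lockfile is a trimmed illustration with placeholder versions rather than one of our real lockfiles. It relies on the lockfile convention that resolved gems are indented four spaces and their dependencies six.

```ruby
# Maps each resolved gem in the lockfile to the gems it depends on directly.
def direct_dependencies(lockfile_text)
  deps = {}
  current = nil
  lockfile_text.each_line do |line|
    if (m = line.match(/\A {4}(\S+) \(/))            # four spaces: a resolved gem
      current = m[1]
      deps[current] ||= []
    elsif current && (m = line.match(/\A {6}(\S+)/)) # six spaces: one of its dependencies
      deps[current] << m[1]
    elsif line.match?(/\A\S/)                        # a new top-level section
      current = nil
    end
  end
  deps
end

# Follows dependency edges upward to list every gem that pulls in `target`.
def dependents_of(target, deps)
  found = []
  queue = [target]
  until queue.empty?
    name = queue.shift
    deps.each do |gem_name, children|
      next unless children.include?(name) && !found.include?(gem_name)
      found << gem_name
      queue << gem_name
    end
  end
  found
end

# Trimmed, illustrative lockfile; the versions are placeholders.
SAMPLE_LOCKFILE = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      httpclient (1.0.0)
      openid_connect (1.0.0)
        rack-oauth2
      rack-oauth2 (1.0.0)
        httpclient
LOCK

p dependents_of("httpclient", direct_dependencies(SAMPLE_LOCKFILE))
# => ["rack-oauth2", "openid_connect"]
```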

From other service errors, we identified that google-api-client was also failing. Using the same process, we pinpointed the same library as a dependency.

And so we took a closer look at httpclient and...

Code snippet from httpclient/nahi

Uh-oh, that doesn’t look good. httpclient is heavily used, whether directly or indirectly through the dependency chain. Like web browsers, httpclient embeds a version of the root certificates. The main difference is that in this case, the version of the root certificate store in the library is six years old (!!) while reference root certificate stores are generally updated every few months. So even with Let’s Encrypt’s due diligence, a stale client store that trusted neither the new root certificate (which it didn’t contain) nor the old one (which had expired) was sufficient to cause internal issues.

Our emergency fix was simple. We forked the Git repository, created a branch that overrode cacert.pem with the most recent root certificate bundle and started using that branch in our deployments to make things work. After confirming the fix was working as expected and deploying it in our canaries, the problem was solved for the monolith. Then automation helped create pull requests for all our affected repositories.

Overriding cacert.pem with a more recent bundle is a temporary fix. However, following a solve-fast approach, it was the one we knew would work automatically for all our deployments without making any other changes. To support this fix and make sure a similar issue does not happen soon, we put in place systems to keep track of changes in the root certificates and automatically update them in our fork when needed. A better long-term approach could be, for instance, to use the system root certificate store, which we can adopt after reviewing the system root certificate stores across all of our runtime environments.

We wondered why it took about 15 minutes for us to start seeing the effects of that certificate expiry. The answer is in the trigger: we started seeing the issue on the Shopify monolith when a deployment happened. HTTP supports persistent connections, also called HTTP keep-alive, which keep a connection open as long as it’s being used and only close it after it has been idle for a short period of time. Also, TLS validation, the check of the validity of certificates, is only performed while initializing the connection, and the trust is maintained for the duration of that connection. Given the traffic on Shopify, our deployments kept the connections to other systems alive, and the only reason those connections were broken was that Kubernetes pods were being recreated to deploy the new version, leading to new HTTP connections and the failure of TLS validation. Hence the 15-minute discrepancy.
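The interaction between keep-alive and TLS validation can be illustrated with a toy model. This sketch performs no real networking or TLS; the `ToyConnection` class and its parameters are invented purely to show that the certificate check runs once, when a connection is opened, and never again while that connection is reused. The expiry time assumes Eastern Daylight Time (UTC-4).

```ruby
# Toy model of the behaviour described above; it performs no real TLS.
class ToyConnection
  class CertificateExpired < StandardError; end

  # The certificate check runs only here, when the connection is opened.
  def initialize(root_expires_at:, opened_at:)
    if opened_at >= root_expires_at
      raise CertificateExpired, "root certificate expired at #{root_expires_at}"
    end
  end

  # Requests reuse the open connection; nothing re-checks the certificate.
  def request(path)
    "200 OK #{path}"
  end
end

expiry = Time.utc(2021, 9, 30, 14, 0) # 10:00 a.m. Eastern, assuming UTC-4

# Opened an hour before the expiry and kept alive by steady traffic.
kept_alive = ToyConnection.new(root_expires_at: expiry, opened_at: expiry - 3600)
puts kept_alive.request("/internal/auth") # keeps working after the expiry

# A recreated pod has to open a fresh connection, and validation fails.
begin
  ToyConnection.new(root_expires_at: expiry, opened_at: expiry + 15 * 60)
rescue ToyConnection::CertificateExpired => error
  puts "new connection rejected: #{error.message}"
end
```

In this model, as in the incident, nothing fails until something forces a reconnect.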

Besides our Ruby applications having (indirect!) dependencies on httpclient, a few of our other systems were affected by the same problem. Particularly, services powered by data were left hanging as the application providing them with data was affected by the disruption. For instance, product recommendations weren’t shown during that time, marketing campaigns ended up being throttled temporarily, and, more visible to our merchants’ customers, order confirmations were delayed for a short period because the risk analysis couldn’t be performed.

Of the Shopify monolith, however, only the canaries (that is, the servers to which we roll changes first to test their effect in production before rolling them to the rest of the fleet) were affected by the issue. Our incident response’s initial action of locking deployments also stops any deployment process in its current state. This simple action allowed us to avoid cycling Kubernetes pods for the monolith and keep the current version running, protecting us from a global outage of Shopify and making September 30th, 2021 that one time an outage could have been way worse.

Raphaël is a Staff Production Engineer and the tech lead of the Traffic team at Shopify, taking care of the interfaces between Shopify and the outside world and providing reliable and scalable systems for configuring the edge of our ever-growing applications. He holds a Ph.D. in Computer Engineering in systems performance analysis and tracing, and sometimes gives lectures to future engineers at Polytechnique Montréal about Distributed Systems and Cloud Computing.

Nerd Out on 10 of Our Favorite Posts From 2021

Shopify engineers not only work on challenging and impactful projects, but they also take the time to share their craft expertise at events and on the blog. As we close out the year, here are ten of our favorite posts of 2021, a curated selection spanning technologies and engineering disciplines, as well as a few from previous years that many of you still love, read, and share. 

While pulling this list together, the song My Favorite Things kept repeating in my mind. And with that, I apologize in advance for my cringeworthy poetic interpretation. It isn’t easy to rhyme with CRuby.

Building apps with and without Rails libraries
GraphQL how-tos not one but a series
Upgrading MySQL makes your heart sing
These are a few of your favorite things

Building a JIT compiler for CRuby
Making apps faster with caching is groovy
Hydrogen-powered storefronts give you wings
These are a few of your favorite things

1. How to Build a Web App With and Without Rails Libraries

An illustrated factory producing Ruby Gems

Maple Ong, Senior Developer, wrote our most-read post of 2021, an in-depth tutorial that asks you to forget that Rails exists for a minute (whaaaat?!) and takes you on a journey to build a web application only using the standard Ruby libraries.

“With just a little bit of familiarity with the Rails framework, you’re able to build a web application—that’s the magic of Rails. Building your own Ruby web application from scratch, however, isn’t only educational—I’d argue that it’s also a rite of passage in a Ruby and Rails developer’s career!”

Want to learn more about Maple? Follow her on Twitter.

2. Building Blocks of High Performance Hydrogen-powered Storefronts

A structure built with different shapes and colors

Shopify Unite created a lot of buzz around Hydrogen, a React-based framework for building custom and creative storefronts. Ilya Grigorik, Principal Engineer at Shopify, takes us behind the scenes and shares how Hydrogen is built and optimized to power personalized, contextual, and dynamic commerce.

Fun fact: Hydrogen is in developer preview in case you want to take it out for a spin.

3. YJIT: Building a New JIT Compiler for CRuby

A cartoon Ruby Gem levelling up, jumping to higher pipes

Shopify engineers worked on many cool and impactful projects in 2021. Exhibit A: YJIT. Created by a small team led by Staff Engineer Maxime Chevalier-Boisvert, YJIT is a new JIT (just-in-time) compiler built inside CRuby that has now been merged upstream and is part of the Ruby 3.1.0 release. In this post, Maxime writes about the importance of the project to Shopify and Ruby developers worldwide.

For extra credit and additional reading, check out Noah Gibbs’ YJIT posts:

4. A Five-Step Guide for Conducting Exploratory Data Analysis

A person working on graphs at their desk

Cody Mazza-Anthony, Shopify Data Scientist, explains how to use exploratory data analysis (EDA) for answering important business questions and walks through five tips for performing an effective EDA. Here are the key takeaways:

  • Missing values can plague your data
  • Provide a basic description of your features and categorize them
  • Understand your data by visualizing its distribution
  • Your features have relationships! Make note of them
  • Outliers can dampen your fun only if you don’t know about them

5. Upgrading MySQL at Shopify

A cartoon server working out at the gym

Yi Qing Sim, Senior Production Engineer, shares how the Database Platform team performed the most recent MySQL upgrade at Shopify. She also discusses the roadblocks they encountered during rollback testing, the internal tooling built to aid in upgrading and scaling our fleet in general, and guidelines for approaching upgrades going forward.

6. Apache Beam for Search: Getting Started by Hacking Time

An image of a wristwatch

You might know Doug Turnbull, Senior Staff Engineer, for writing the book “Relevant Search”, contributing to “AI-Powered Search”, and creating relevance tooling for Solr and Elasticsearch like Splainer, Quepid, and the Elasticsearch Learning to Rank plugin. We also know him as the author of this excellent post about using Apache Beam for search at Shopify.

7. Understanding GraphQL for Beginners

An illustration of three hamburgers, varying in sizes and toppings

OK, I might be cheating here because this is a three-part series, but it’s been a top read in 2021, prompting our team to refer to these tutorials as the “Everybody Loves Raymond” series. In this hands-on tutorial, Raymond Chung, Technical Educator on the Dev Degree team, teaches you all about GraphQL and digs into the difference between REST and GraphQL.

8. Keeping Developers Happy with a Fast CI

A car races, blurred by the speed

Christian Bruckmayer, Senior Production Engineer, is part of the Test Infrastructure team responsible for ensuring Shopify’s CI systems are scalable, robust, and usable. In this post, Christian shares how the team reduced the p95 of Shopify’s core monolith from 45 minutes to 18, allowing developers to spend less time waiting and ship faster.

9. Rate Limiting GraphQL APIs by Calculating Query Complexity

Illustrated cereal boxes with various GraphQL references

Guilherme Vieira, Senior Developer on the API Patterns team, explores Shopify’s rate-limiting system for the GraphQL Admin API and how it addresses some limitations of methods commonly used in REST APIs. He also describes how we calculate query costs that adapt to the data that clients need while providing a more predictable load on servers.

10. Building an App Clip with React Native

A cartoon phone genie with the Shop App opened on the screen emitting from a lamp

Sebastian Ekström, Senior Developer on the Shop team, recounts what it was like being the first to build an App Clip in React Native that would be surfaced to millions of users each day. 

“We approached this project with a lot of unknowns—the technology was new and new to us. We were trying to build an App Clip with React Native, which isn’t typical! Our approach (to fail fast and iterate) worked well. Having a developer with native iOS development was very helpful because App Clips—even ones written in React Native—involve a lot of Apple’s tooling.”

Bonus: Older Posts You Still Love

The following posts are still very popular on the blog, so it felt wrong to leave them off the list just because they are a couple of years old. Considering the number of people who have read Building a Data Table Component in React, I’m assuming there are thousands of data table components built with React out there, thanks to this post.

  1. Building a Data Table Component in React
  2. Under Deconstruction: The State of Shopify’s Monolith
  3. How to Write Fast Code in Ruby on Rails

Jennie Lundrigan is a Senior Engineering Writer at Shopify. When she's not writing nerd words, she's probably saying hi to your dog.

If you’re passionate about solving complex problems at scale, and you’re eager to learn more, we're always hiring! Reach out to us or apply on our careers page.

Shopify’s Unique Data Science Hierarchy Of Needs

You’ve probably seen the “Data Science Hierarchy of Needs” (pictured below) before. Inspired by Maslow, it shows the tooling you would use at different levels in data science—from logging and user-generated content at the bottom, to AI and deep learning at the very top.

While hierarchies like this one can serve as helpful guides, at Shopify, we don’t think it always captures the whole picture. For one, it emphasizes particular tools over finding the best solution to a given problem. Plus, it can have a tendency of prioritizing more “advanced” solutions, when a simple one would do. 

Data Science hierarchy of needs showing pyramid from top to bottom in ascending order of importance: AI, Learn, Aggregate/Label, Explore/Transform, Move/Store, and Collect
The Data Science Hierarchy of Needs

That’s why we’ve chosen to take a different approach. We’ve created our own Data Science Hierarchy of Needs to reflect the various ways we as a data team create impact, not only for Shopify, but also for our merchants and their customers. In our version, each level of the hierarchy represents a different way we deliver value—not better or worse, just different. 

Our philosophy is much more tool-agnostic, and it emphasizes trying simple solutions before jumping to more advanced ones. This enables us to make an impact faster, then iterate with more complex solutions, if necessary. We see the pinnacle of data science not as machine learning or AI, but in the impact that we’re able to have, no matter the technology we use. Above all, we focus on helping Shopify and our merchants make great decisions, no matter how we get there.  

Below, we’ll walk you through our Data Science Hierarchy of Needs and show you how our tool-agnostic philosophy was the key to navigating the unprecedented COVID-19 pandemic for Shopify and our merchants. 

Tackling The Pandemic With Data

During the pandemic, we depended on our data to give us a clear lens into what was happening, how our merchants were coping, and what we could do to support them. Our COVID-19 impact analysis—a project we launched to understand the impact of the pandemic and support our merchants—is a great example of how our Data Science Hierarchy of Needs works. 

For context—at Shopify, data scientists are embedded in different business units and product teams. When the pandemic hit, we were able to quickly launch a task force with data science representatives from each area of the business. The role of these data scientists was to surface important insights about the effects of the pandemic on our merchants and help us make timely, data-informed decisions to support them.

At every step of the way, we relied on our Data Science Hierarchy of Needs to support our efforts. With the foundations we had built, we were able to quickly ship insights to all of Shopify that were used to inform decisions on how we could best help our merchants navigate these challenging times. Let’s break it down. 

Shopify Data Science hierarchy of needs pyramid showing from top to bottom in increasing size:  Influence, prescribe, predict/infer, describe, collect and model
Shopify’s Data Science Hierarchy of Needs

1. Collecting And Modeling Data To Create A Strong Foundation

The base of our hierarchy is all about building a strong foundation that we can use to support our efforts as we move up the pyramid. After all, we can’t build advanced machine learning models or provide insightful and impactful analysis if we don’t have the data accessible in a clean and conformed manner.  

Activities At The Collect & Model Level

  • Data generation
  • Data platform
  • Acquisition
  • Pipeline build
  • Data modeling
  • Data cleansing

At Shopify, we follow the Dimensional Modeling methodology developed by Ralph Kimball (a specific set of rules around organizing data in our data warehouse) to ensure all of our data is structured consistently. Since our team is familiar with how things are structured in the foundation, it’s easy for them to interact with the data and start using the tools at higher levels in the pyramid to analyze it.

It’s important to note that even though these foundational practices, by necessity, precede activities at the higher levels, they’re not “less than”: they are critical to everything we do as data scientists. Having this groundwork in place was absolutely critical to the success of our COVID-19 impact analysis. We weren’t scrambling to find data; it was already clean, structured, and ready to go. Knowing that we had put in the effort to collect data the right way also gave us confidence that we could trust the insights that came out of our analysis.

2. Describing The Data To Gain A Baseline Understanding Of The Business

This next level of the hierarchy is about leveraging the data we’ve collected to describe what we observe happening within Shopify. With a strong foundation in place, we’re able to report metrics and answer questions about our business. For instance, for every product we release, we create associated dashboards to help understand how well the product is meeting merchants’ needs. 

At this phase, we’re able to start asking key questions about our data. These might be things like: What was the adoption of product X over the last three months? How many products do merchants add in their first week on the platform? How many buyers viewed our merchants’ storefronts? The answers to these questions offer us insight into particular business processes, which can help illuminate the steps we should take next—or, they might establish the building blocks for more complex analysis (as outlined in steps three and four). For instance, if we see that the adoption of product X was a success, we might ask, Why? What can we learn from it? What elements of the product launch can we repeat for next time?

Activities At The Describe Level

During our COVID-19 impact analysis, we were interested in discovering how the pandemic was affecting Shopify and our merchants’ businesses: What does COVID-19 mean for our merchants’ sales? Are they being affected in a positive or negative way, and why? This allowed us to establish a baseline understanding of the situation. While for some projects it might have been possible to stop the analysis here, we needed to go deeper—to be able to predict what might happen next and take the right actions to support our merchants. 

3. Predicting And Inferring The Answers To Deeper Questions With More Advanced Analytical Techniques

At this level, the problems start to become more complex. With a strong foundation and clear ability to describe our business, we can start to look forward and offer predictions or inferences as to what we think may happen in the future. We also have the opportunity to start applying more specialized skills to seek out the answers to our questions. 

Activities At The Predict / Infer Level 

These questions might include things like: What do we think sales will be like in the future? What do we think caused the adoption of a particular product? Once we have the answers, we can start to explain why certain things are happening—giving us a much clearer picture of our business. We’re also able to start making predictions about what is likely to happen next.

Circling back to our COVID-19 impact analysis, we investigated what was happening globally and conducted statistical analysis to predict how different regions we serve might be affected. An example of the kinds of questions we asked includes: Based on what we see happening to our merchants in Italy as they enter lockdown, what can we predict will happen in the U.S. if they were to do the same? Once we had a good idea of what we thought might happen, we were able to move on to the next level of the pyramid and decide what we wanted to do about it. 

4. Using Insights To Prescribe Action

At this level, we’re able to take everything from the underlying levels of the hierarchy to start forming opinions about what we should do as a business based on the information we’ve gathered. Within Shopify, this means offering concrete recommendations internally, as well as providing guidance to our merchants. 

Activities At The Prescribe Level

When it came to our COVID-19 impact analysis, our research at the lower levels helped provide the insights to pivot our product roadmap and ship products that we knew could support our merchants. For example:

  • We observed an increase of businesses coming online due to lockdowns, so we offered an extended 90-day free trial to all new merchants
  • Knowing the impact lockdowns would have on businesses financially, we expanded Shopify Capital (our funding program for merchants), then only available in the U.S., to Canada and the UK
  • With the increase of online shopping and delays in delivery, we expanded our shipping options, adding local delivery and the option to buy online, pick up in-store
  • Observing the trend of consumers looking to support local businesses, we made gift cards available for all Shopify plans and added a new feature to our shopping app, Shop, that made it easier to discover and buy from local merchants

By understanding what was happening in the world and the commerce industry, and how that was impacting our merchants and our business, we were able to take action and create a positive impact—which is what we’ll delve into in our next and final section. 

5. Influencing The Direction Of Your Business 

This level of the hierarchy is the culmination of the work below and represents all we should strive to achieve in our data science practice. With a strong foundation and a deep understanding of our challenges, we’ve been able to put forward recommendations—and now, as the organization puts our ideas into practice, we start to make an impact.

Activities At The Influence Level

  • Analytics
  • Machine learning
  • Artificial intelligence
  • Deep dives
  • Whatever it takes! 

It’s critical to remember that the most valuable insights don’t necessarily have to come from using the most advanced tools. Any insight can be impactful if it helps us inform a decision, changes the way we view something, or (in our case) helps our merchants.

Our COVID-19 impact analysis didn’t actually involve any artificial intelligence or machine learning, but it nevertheless had wide-reaching positive effects. It helped us support our merchants through a challenging time and ensured that Shopify also continued to thrive. In fact, in 2020, our merchants made a total of $119.6 billion, an increase of 96% over 2019. Our work at all the prior levels ensured that we could make an impact when it mattered most. 

Delivering Value At Every Level

In practice, positive influence can occur as a result of output at any level of the hierarchy—not just the very top. The highest level represents something that we should keep in mind as we deliver anything, whether it be a model, tool, data product, report analysis, or something else entirely. The lower levels of the hierarchy enable deeper levels of inquiry, but this doesn’t make them any less valuable on their own. 

Using our Data Science Hierarchy of Needs as a guide, we were able to successfully complete our COVID-19 impact analysis. We used the insights we observed and put them into action to support our merchants at the moment they needed them most, and guided Shopify’s overarching business and product strategies through an unprecedented time. 

No matter what level in the hierarchy we’re working at, we ensure we’re always asking ourselves about the impact of our work and how it is enabling positive change for Shopify and our merchants. Our Data Science Hierarchy of Needs isn’t a rigid progression—it’s a mindset.

Phillip Rossi is the Head of Expansion Intelligence at Shopify. He leads the teams responsible for using data to inform decision making for Shopify, our merchants, and our partners at scale.

If you’re passionate about solving complex problems at scale, and you’re eager to learn more, we're always hiring! Reach out to us or apply on our careers page.

Building a Real-time Buyer Signal Data Pipeline for Shopify Inbox

By Ashay Pathak and Selina Li

Tens of thousands of merchants use Shopify Inbox as a single business chat app for all online customer interactions and staff communications. Over four million conversations were exchanged on Shopify Inbox in 2020, and 70 percent of Shopify Inbox conversations are with customers making a purchasing decision. This prompted the Shopify Data team to ask ourselves, “How can we help merchants identify and convert those conversations to sales?” 

We built a real-time buyer signal data pipeline to surface relevant customer context—including active cart activities and order completion information—to merchants while they’re chatting with their customers. With these real-time, high-intent customer signals, merchants know where the buyers are in their shopping journey—from browsing products on online stores to placing orders. Merchants can ask more direct questions, better answer customer inquiries, and prioritize conversations that are more likely to convert. 

Animation showing cart events displaying prompts in the merchant's chat window.
Animation of cart event

We’ll share how we designed our pipeline, along with how we uncovered insights on merchant behaviors through A/B testing. We’ll also discuss how we address the common problems of streaming solutions, tackle complex use cases by leveraging various Apache Beam functions, and measure success using an experiment.


Buyers can message merchants from many different channels like Online Store Chat, Facebook Messenger, and Apple Business Chat. Shopify Inbox allows merchants to manage customer conversations from different messaging channels within a single business chat app.  While it’s a great tool for managing customer conversations, we wanted to go one step further by helping merchants optimize sales opportunities on existing conversations and prioritize conversations as they grow.

The majority of Shopify Inbox conversations are with customers making a purchasing decision. We need to identify signals that represent buyers’ purchase intent and surface them at the right time during a conversation. We achieve this by building a real-time Apache Beam pipeline that surfaces high-intent buyer signals in Shopify Inbox.

When a buyer has an active conversation with a merchant, we currently share two buyer signals with the merchant: 

  1. Cart action event: Provides information on buyers’ actions on the cart, product details, and the current status of the cart. 
  2. Order completion event: Provides information on the recent purchase a buyer has made, including an order number URL that enables merchants to view order details in the Shopify admin (where merchants log in to manage their business).

These signals are shared in the form of conversation events (as shown in the image below). Conversation events are the means of communicating context or buyer behavior that is relevant to merchants at the time of the conversation. They’re inserted in chronological order within the message flow of the conversation without adding extensive cognitive load for merchants.

An image of a Shopify Inbox chat window on a mobile phone showing conversation events from the cart and order completion event
Example of conversation events—cart and order completion event in Shopify Inbox

In general, the cart and order completion events are aggregated and shared based on the following characteristics:

  • Pre-conversation events: Events that happen up to 14 days before a conversation is initiated.
  • Post-conversation events: Events that happen after a conversation is initiated. The conversation has a life cycle of seven days, and we maintain events in state until the conversation expires. 
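The two ranges above can be sketched as a small classification function. This is a minimal plain-Python illustration, not the pipeline's actual code: the function name and the treatment of boundary values (an event exactly at the conversation start counts as post-conversation) are assumptions; only the 14-day and seven-day durations come from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

# Durations taken from the text; everything else here is illustrative.
PRE_CONVERSATION_LOOKBACK = timedelta(days=14)
CONVERSATION_LIFETIME = timedelta(days=7)


def classify_event(event_time: datetime, conversation_start: datetime) -> Optional[str]:
    """Return 'pre', 'post', or None when the event falls outside both ranges."""
    if conversation_start - PRE_CONVERSATION_LOOKBACK <= event_time < conversation_start:
        return "pre"
    if conversation_start <= event_time <= conversation_start + CONVERSATION_LIFETIME:
        return "post"
    return None
```

An event three days before the conversation start is classified as "pre", while one that is 15 days old is ignored.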


To deliver quality information to merchants on time, there are two main requirements our system needs to fulfill: low latency and high reliability. We do so by leveraging three key technologies:

  • Apache Kafka 
  • Apache Beam 
  • Google Cloud Dataflow
A system diagram showing the flow from Apache Kafka to Apache Beam to Google Cloud Dataflow
Diagram of system architecture

Message Queues with Apache Kafka

For this pipeline we use two different forms of Kafka events: Monorail and Change Data Capture.


Monorail is an abstraction layer developed internally at Shopify that adds structure to raw events before producing them to Kafka. The structure also brings support for versioning: if the schema changes upstream, events are produced to the updated version while the Kafka topic remains the same. Having version control is useful in our case as it helps to ensure data integrity.

Change Data Capture (CDC)

CDC uses binlogs and Debezium to create a stream of events from changed data in MySQL databases and supports large record delivery. Some of the inputs to our pipeline aren’t streams by nature, so CDC allows us to read such data by converting it to a stream of events.

Real-time Streaming Processing with Apache Beam 

Apache Beam is a unified batch and stream processing system. Instead of using a batch system to aggregate months of old data and a separate streaming system to process the live user traffic, Apache Beam keeps these workflows together in one system. For our specific use case where the nature of events is transactional, it’s important for the system to be robust and handle all behaviors in a way that the results are always accurate. To make this possible, Apache Beam provides support with a variety of features like windowing, timers, and stateful processing.

Google Cloud Dataflow for Deploying Pipeline

We chose Google Cloud Dataflow as the runner for our Apache Beam pipeline. Using a managed service helps us concentrate on the logical composition of our data processing job without worrying too much about the physical orchestration of parallel processing.

High Level System Design

Diagram of real time buyer signal system design

The pipeline ingests data from CDC and Monorail while the sink only writes to a Monorail topic. We use Monorail as the standardized communication tool between the data pipeline and dependent service. The downstream consumer processes Monorail events that are produced from our model, structuring those events and sending them to merchants in Shopify Inbox.

The real-time buyer signals pipeline includes the following two main components:

  • Events Filtering Jobs: The cart and checkout data are transactional and include snapshots of every buyer interaction with cart and checkout. Even during non-peak hours, we read tens of thousands of events from the cart and checkout sources every second. To reduce the workload of the pipeline and optimize resources, these jobs only keep mission-critical events (that is, only relevant transactional events of Shopify Inbox users).
  • Customer Events Aggregation Job: This job hosts the heavy lifting logic for our data pipeline. It maintains the latest snapshot of a buyer’s activities in an online store, including the most recent conversations, completed orders, and latest actions with carts. To make sure this information is accessible at any point of time, we rely on stateful processing with Timers and Global Window in Apache Beam. The event-emitting rule is triggered when a buyer starts a conversation.

The customer events aggregation job is the core of our real-time pipeline, so let’s dive into the design of this job.

Customer Events Aggregation Job

A system diagram of Customer Events Aggregation Job
Diagram of Customer Events Aggregation Job

As shown in the diagram above, the customer events aggregation job ingests three input collections: filtered conversation, checkout, and cart events. All input elements are keyed by _shopify_y, the unique identifier of a buyer on an online store (see our policy on what information Shopify collects from visitors’ devices). Using the CoGroupByKey operator, we group all input elements into a single Tuple collection for easier downstream processing. To ensure we have access to historical information, we leverage state in Apache Beam, which stores values per key and window, to access last seen events. As states expire when a window ends, we maintain the key over a Global Window, which is unbounded and contains a single window, to allow access to states at any time. We maintain three separate states, one for each customer event stream: conversation, checkout, and cart state. Upon arrival of new events, a processing time trigger is used to emit the current data of a window as a pane. Next, we process last seen events from state and new events from the pane through logic defined in a PTransform.
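To make the grouping step concrete, here is a plain-Python sketch of what a co-group by key produces: for each buyer key, one list of values per tagged input. It imitates the output shape of Apache Beam's CoGroupByKey but isn't Beam code, and the function name, tags, and sample payloads are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

KeyedEvent = Tuple[str, Any]  # (buyer identifier, event payload)


def co_group_by_key(**tagged_inputs: Iterable[KeyedEvent]) -> Dict[str, Dict[str, List[Any]]]:
    """Group several keyed collections into {key: {tag: [payloads]}}."""
    tags = list(tagged_inputs)
    # Every key gets an entry for every tag, even when that tag has no events.
    grouped: Dict[str, Dict[str, List[Any]]] = defaultdict(lambda: {tag: [] for tag in tags})
    for tag, events in tagged_inputs.items():
        for key, payload in events:
            grouped[key][tag].append(payload)
    return dict(grouped)


grouped_events = co_group_by_key(
    conversation=[("buyer-1", {"conversation_id": "c1"})],
    cart=[("buyer-1", {"quantity": 2}), ("buyer-2", {"quantity": 1})],
    checkout=[],
)
```

Here "buyer-1" ends up with one conversation event and one cart event, while "buyer-2" has a cart event and empty conversation and checkout lists.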

In this stage of the system, upon receiving new events from a buyer, we try to answer the following questions:

1. Does this buyer have an active conversation with the merchant?

This question determines whether our pipeline should emit any output or should just process the cart/checkout events and then store them to its corresponding state. The business logic of our pipeline is to emit events only when the buyer has started a conversation with the merchant through Shopify Inbox.

2. Do these events occur before or after a conversation is started?

This question relates to how we aggregate the incoming events. We aggregate events based on the two characteristics we mentioned above:

  • Pre-conversation events: We show transactional data on buyers’ activities that occur before a conversation is initiated (up to 14 days earlier, as described above).
  • Post-conversation events: We show transactional data on buyers’ activities that occur after a conversation is initiated. Using the same scenario mentioned above, we show a cart addition event and an order completion event to the merchant.
Examples of pre-conversation event(left) versus post-conversation event(right)

3. What is the latest interaction of a buyer on an online store?

This question reflects the key design principle of our pipeline: the information we share with merchants should be up-to-date and always relevant to a conversation. The way streaming data arrives at a pipeline, together with the interconnected process between cart and checkout, introduces the main problems we need to resolve in our system.

We faced a few challenges when designing the pipeline. Here they are, along with their solutions.

Interdependency of Cart and Checkout

Cart to checkout is a closely connected process in a buyer’s shopping journey. For example, when a buyer places an order and returns to the online store, the cart should be empty. The primary goal of this job is to mock this process in the system so that the cart and checkout status is reflected correctly at any time. The challenge is that cart and checkout events come from different Monorail sources but depend on each other. Using a single PTransform allows us to access all mutable states and create dynamic logic based on them. For example, we clear the cart state when we receive a checkout event for the same user token.
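The cart-clearing rule can be sketched in a few lines of plain Python. This is only an illustration of the idea of handling both event types in one place; the `BuyerState` class, the `apply_event` function, and the event fields are hypothetical names, not the pipeline's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BuyerState:
    """Hypothetical per-buyer state holding the latest cart and order."""
    cart: Dict[str, int] = field(default_factory=dict)  # product id -> quantity
    last_order: Optional[str] = None


def apply_event(state: BuyerState, event: dict) -> BuyerState:
    """Update one buyer's state; a checkout event also empties the cart."""
    if event["type"] == "cart":
        state.cart = dict(event["items"])
    elif event["type"] == "checkout":
        state.last_order = event["order_id"]
        state.cart = {}  # the order is placed, so the cart should be empty
    return state
```

Because one function sees both event types, the dependency between cart and checkout is resolved in a single step instead of being coordinated across separate jobs.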

Handling Out-of-Order Events

As the information we share in the event is cumulative (for example, total cart value), sharing the buyer signal events in the correct sequence is critical to a positive merchant experience. The output event order should be based on the chronological order of buyers’ interactions with the cart and chat. For example, removal of an item should always come after an item addition. However, one of the common problems with streaming data is that we can’t guarantee events across data sources are read and processed in order. On top of that, the action on the cart isn’t explicitly stated in the source. Therefore, we rely on comparing quantity changes between transactional events to extract the cart action.
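Extracting the cart action from quantity changes could look like the following sketch. It assumes each cart snapshot is a mapping from product ID to quantity, which the text doesn't specify; the function name and action labels are likewise illustrative.

```python
from typing import Dict, List, Tuple

CartSnapshot = Dict[str, int]  # assumed shape: product id -> quantity


def infer_cart_actions(previous: CartSnapshot, current: CartSnapshot) -> List[Tuple[str, str, int]]:
    """Compare two snapshots and return (action, product, quantity changed) tuples."""
    actions = []
    for product in sorted(set(previous) | set(current)):
        delta = current.get(product, 0) - previous.get(product, 0)
        if delta > 0:
            actions.append(("added", product, delta))
        elif delta < 0:
            actions.append(("removed", product, -delta))
    return actions
```

Comparing `{"A": 1, "B": 1}` with `{"A": 2}` yields one addition for product A and one removal for product B.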

This problem can be solved by leveraging stateful processing in Apache Beam. A state is a buffer that stores values per key and window. It’s mutable and evolves over time as new elements arrive. State allows us to access previous buyer activity snapshots and identify any out-of-order events by comparing the timestamps of new events with those of events in the state. This ensures no outdated information is shared with merchants.
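The timestamp comparison can be illustrated with a small in-memory stand-in for per-key state. This is a sketch under assumptions, not Beam's state API: the class name, the `timestamp` field, and the choice to drop events with an equal timestamp are all hypothetical.

```python
from typing import Dict, Optional


class LastSeenState:
    """In-memory stand-in for per-key state that keeps the latest event."""

    def __init__(self) -> None:
        self._latest: Dict[str, dict] = {}

    def accept(self, key: str, event: dict) -> bool:
        """Store the event and return True unless it is older than the stored one."""
        last = self._latest.get(key)
        if last is not None and event["timestamp"] <= last["timestamp"]:
            return False  # out-of-order or duplicate: don't emit outdated data
        self._latest[key] = event
        return True

    def latest(self, key: str) -> Optional[dict]:
        return self._latest.get(key)
```

An event that arrives late with an older timestamp is rejected, so the stored snapshot always reflects the buyer's most recent activity.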

Garbage Collection 

To ensure we’re not overloading states with data, we use timers to manually clean up the expired or irrelevant per-key-and-window values in states. The timer is set to use the event_time domain to manage the state lifecycle per key and window. We use this to accommodate the extendable lifespan of a cart.
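The cleanup idea can be sketched without Beam as a store whose entries expire once event time passes a per-key deadline, with each new write pushing the deadline out. The class, the integer timestamps, and the `ttl` parameter are assumptions for illustration; the real lifespans aren't given in the text.

```python
from typing import Any, Dict, List, Optional


class ExpiringState:
    """Per-key values with an event-time deadline that each write extends."""

    def __init__(self, ttl: int) -> None:
        self.ttl = ttl  # hypothetical lifespan, in the same unit as event_time
        self._values: Dict[str, Any] = {}
        self._expiry: Dict[str, int] = {}

    def write(self, key: str, value: Any, event_time: int) -> None:
        self._values[key] = value
        self._expiry[key] = event_time + self.ttl  # re-arm the timer

    def advance_watermark(self, now: int) -> List[str]:
        """Fire due timers: drop every key whose deadline has passed."""
        expired = [key for key, deadline in self._expiry.items() if deadline <= now]
        for key in expired:
            del self._values[key]
            del self._expiry[key]
        return expired

    def read(self, key: str) -> Optional[Any]:
        return self._values.get(key)
```

Re-arming the deadline on every write is what models an extendable lifespan: a cart that keeps receiving updates stays in state, while an idle one is eventually removed.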

Sharing Buyer Context Across Conversations

Conversations and cart cookies have different life spans. The problem we had was that the characteristics of events can change over time. For example, a post-conversation cart event can be shared as a pre-conversation event once a conversation expires. To address this, we introduced a dynamic tag in states to indicate whether the events have been shared in a conversation. Whenever the timer for the conversation state fires, it resets this tag in the cart and checkout states.
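A minimal sketch of the tag mechanism follows, with hypothetical names throughout (`SignalState`, `shared_in_conversation`, and the helper functions are not from the pipeline). It shows the tag being set when a signal is shared and reset when the conversation expires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalState:
    """Hypothetical cart or checkout state carrying the 'shared' tag."""
    payload: Optional[dict] = None
    shared_in_conversation: bool = False


def should_share(state: SignalState) -> bool:
    """A signal is eligible when it exists and hasn't been shared yet."""
    return state.payload is not None and not state.shared_in_conversation


def mark_shared(state: SignalState) -> None:
    state.shared_in_conversation = True


def on_conversation_expired(cart: SignalState, checkout: SignalState) -> None:
    """Reset the tag so the same signals can surface in a later conversation."""
    cart.shared_in_conversation = False
    checkout.shared_in_conversation = False
```

After the reset, a cart event that was already shared in the expired conversation becomes eligible again and can surface as pre-conversation context when the buyer starts a new conversation.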

Testing Our Pipeline

Through this real-time system, we expect the conversation experience to be better for our merchants by providing them with these intelligent insights about the buyer journey. We carried out an experiment and measured the impact on our KPIs to validate the hypothesis. The experiment had a conventional A/B test setup where we divided the total audience (merchants using Shopify Inbox) into two equal groups: control and treatment. Merchants in the control group continued to have the old behavior, while merchants in the treatment group saw the real-time buyer signal events in their Shopify Inbox client app. We tracked the merchant experience using the following metrics:

  • Response Rate: Percent of buyer conversations that got merchant replies. We observed a significant increase of two percentage points.
  • Response Time: Time between first buyer message and first merchant response. While we saw the response rate significantly increase, we observed no significant change in response time, signifying that merchants are showing intent to reply quicker.
  • Conversion Rate: Percent of buyer conversations that got attributed to a sale. We observed a significant increase of 0.7 percentage points.

    Our experiments showed that with these new buyer signals being shown to merchants in real-time, they’re able to better answer customer queries because they know where the buyers are in their shopping journey. Even better, they’re able to prioritize the conversations by responding to the buyer who is already in the checkout process, helping buyers to convert quicker. Overall, we observed a positive impact on all the above metrics. 

    Key Takeaways of Building Real-time Buyer Signals Pipeline 

    Building a real-time buyer signal data pipeline to surface relevant customer context was a challenging process, but one that makes a real impact on our merchants. To quickly summarize the key takeaways: 

    • Apache Beam is a good fit for transactional use cases like cart, as it provides functionality such as state management and timers.
    • Handling out-of-order events is very important for such use cases, and doing so requires robust state management.
    • Controlled experiments are an effective approach to measure the true impact of major feature changes and derive valuable insights on users' behaviors.

    Ashay Pathak is a Data Scientist working on Shopify’s Messaging team. He is currently working on building intelligence in conversations and improving the chat experience for merchants. Previously he worked on an intelligent product that delivered proactive marketing recommendations to merchants using ML. Connect with Ashay on LinkedIn to chat.

    Selina Li is a Data Scientist on the Messaging team. She is currently working to build intelligence in conversation and improve merchant experiences in chat. Previously, she was with the Self Help team where she contributed to delivering better search experiences for users in Shopify Help Center and Concierge. Check out her last blog post on Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms. If you would like to connect with Selina, reach out on LinkedIn.

    Interested in tackling challenging problems that make a difference? Visit our Data Science & Engineering career page to browse our open positions.

    Scaling Shopify's BFCM Live Map: An Apache Flink Redesign

    By Berkay Antmen, Chris Wu, and Dave Sugden

    In 2017, various teams at Shopify came together to build an external-facing live-streamed visualization of all the sales made by Shopify merchants during the Black Friday and Cyber Monday (BFCM) weekend. We call it the Shopify BFCM live map.

    Shopify’s BFCM live map is a visual signal of the shift in consumer spending towards independent businesses and our way to celebrate the power of entrepreneurship. Over the years, it’s become a tradition for different teams within Shopify to iterate on the live map to see how we can better tell this story. Because of our efforts, people all over the world can watch our merchant sales in real-time, online, broadcast on television, and even in Times Square.

    This year, the Shopify Data Platform Engineering team played a significant role in the latest iteration of the BFCM live map. Firstly, we sought to explore what new insights we could introduce and display on the live map. Secondly, and most importantly, we needed to figure out a way to scale the live map. Last year we had more than 1 million merchants. That number has grown to over 1.7 million. With just weeks left until BFCM, we were tasked with not only figuring out how to address the system’s scalability issues but also challenging ourselves to do so in a way that would help us create patterns we could repeat elsewhere in Shopify.

    We’ll dive into how our team, along with many others, revamped the data infrastructure powering our BFCM live map using Apache Flink. In a matter of weeks, we created a solution that displayed richer insights and processed a higher volume of data at a higher uptime—all with no manual interventions.

    Last Year’s Model Had Met Its Limit

    Last year’s live map drew a variety of transaction data and metadata types from our merchants. The live map looked amazing and did the job, but now with more than 1.7 million merchants on our platform, we weren’t confident that the backend architecture supporting it would be able to handle the volume predicted for 2021.

    With just weeks until BFCM, Shopify execs challenged us to “see if we know our systems” by adding new metrics and scaling the live map.

    In this ask, the Shopify Data Platform Engineering team saw an opportunity. We have an internal consulting team that arose organically to assist Shopify teams in leveraging our data stack. Lately, they'd been helping teams adopt stateful stream processing technologies. Streaming is still a developing practice at Shopify, but we knew we could tap this team to help us use this technology to scale the BFCM live map. With this in mind, we met with the Marketing, Revenue, UX, Product, and Engineering teams, all of whom were equally invested in this project, to discuss what we could accomplish in advance of BFCM.

    Deconstructing Last Year’s Model

    We started by taking stock of the system powering the 2020 live map. The frontend was built with React and a custom 3D visualization library. The backend was a home-grown, bespoke stateful streaming service we call Cricket, built in Go. Cricket processes messages from relevant Kafka topics and broadcasts metrics to the frontend via Redis.

    Image showing the 2020 BFCM live map system diagram.
    2020 BFCM live map system diagram

    Our biggest concern was that this year Cricket could be overloaded with the volume coming from the checkout Kafka topic. To give you an idea of what that volume looked like, at the peak we saw roughly 50,000 messages per second during the 2021 BFCM weekend. On top of volume concerns, our Kafka topic contains more than just the subset of events that we need, and those events contain fields we didn’t intend to use.

    Image showing a snapshot of a Nov 27, 2020 live map including a globe view, sales per minute at $1,541,390, orders per minute at 15,875, and carbon offset at 254,183 Tonnes.
    Shopify’s 2020 Black Friday Cyber Monday Live Map

    Another challenge we faced was that the connection between Cricket and the frontend had a number of weaknesses. The original authors were aware of these, but there were trade-offs they’d made to get things ready in time. We were using Redis to queue up messages and broadcast our metrics to browsers, which was inefficient and relatively complex. The metrics displayed on our live map have more relaxed requirements around ordering than, say, chat applications where message order matters. Instead, our live map metrics:

    • Can tolerate some data loss: If you take a look at the image above of last year’s live map, you’ll see arc visuals that represent where an order is made and shipping to. These data visualizations are already sampled because we’re not displaying every single order on the browser (it would be too many!). So it’s okay if we lose some of the arc visuals because we’re unable to draw all arcs on the screen anyways.
    • Only require the latest value: While Cricket relays near real-time updates, we’re only interested in displaying the latest statistics for our metrics. Last year those metrics included sales per minute, orders per minute, and our carbon offset. Queuing up and publishing the entire broadcasted history for these metrics would be excessive.
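These two properties mean a publisher can overwrite a single slot per metric instead of queuing every update. The sketch below illustrates that idea in plain Python; the `LatestValueStore` class and the metric names are hypothetical, and this is not how Cricket or Redis were wired together.

```python
import threading
from typing import Dict


class LatestValueStore:
    """Keeps only the newest value per metric; older values are overwritten."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, float] = {}

    def publish(self, metric: str, value: float) -> None:
        # No history is retained: a slow reader simply misses intermediate values.
        with self._lock:
            self._values[metric] = value

    def snapshot(self) -> Dict[str, float]:
        """Return a copy of the current values for a polling client."""
        with self._lock:
            return dict(self._values)
```

Memory use stays proportional to the number of metrics rather than to the number of updates published.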

    This year, on top of the metrics listed above, we sought to add in:

    • Product trends: Calculated as the top 500 categories of products with the greatest change in sale volume over the last six hours.
    • Unique shoppers: Calculated as unique buyers per shop, aggregated over time.

    Now in our load tests, we observed that Redis would quickly become a bottleneck due to the increase in the number of published messages and subscribers or connections. This would cause the browser long polling to sometimes hang for too long, causing the live map arc visuals to momentarily disappear until getting a response. We needed to address this because we forecasted that this year there would be more data to process. After talking to the teams who built last year’s model, and evaluating what existed, we developed a plan and started building our solution.

    The 2021 Solution

    At a minimum, we knew that we had to deliver a live map that scaled at least as well as last year’s, so we were hesitant to go about changing too much without time to rigorously test it all. In a way this complicated things because while we might have preferred to build from scratch, we had to iterate upon the existing system.

    2021 BFCM live map system diagram

    In our load tests, with 1 million checkout events per second at peak, the Flink pipeline was able to operate well under high volume. We decided to put Flink on the critical path to filter out irrelevant checkout events and resolve the biggest issue—that of Cricket failing to scale. By doing this, Cricket was able to process one percent of the event volume to compute the existing metrics, while relying on Flink for the rest.

    Due to our high availability requirements for the Flink jobs, we used a mixture of cross-region sharding and cross-region active-active deployment. Deduplications were handled in Cricket. For the existing metrics, Cricket continued to be the source of computation and for the new metrics, computed by Flink, Cricket acted as a relay layer.

    For our new product trends metric, we leveraged our product categorization algorithm. We emitted 500 product categories with sales quantity changes, every five minutes. For a given product, the sales quantity percentage change was computed based on the following formula:

    change = SUM(prior 1hr sales quantity) / MEAN(prior 6hr sales quantity) - 1
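As a worked illustration of the formula, the sketch below computes the change from six hourly totals and ranks categories by it. It rests on assumptions that go beyond the formula itself: sales are bucketed hourly, the mean is taken over all six hourly totals, "greatest change" is ranked by absolute value, and categories with a zero baseline are skipped. The function names are hypothetical and this is not the Flink job.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple


def sales_change(hourly_quantities: Sequence[float]) -> Optional[float]:
    """Apply the formula to six hourly totals, ordered oldest first.

    Assumes the last bucket is the prior hour. Returns None when the
    six-hour baseline is zero, since the ratio is undefined.
    """
    if len(hourly_quantities) != 6:
        raise ValueError("expected six hourly buckets")
    baseline = mean(hourly_quantities)
    if baseline == 0:
        return None
    return hourly_quantities[-1] / baseline - 1


def top_trends(
    by_category: Dict[str, Sequence[float]], limit: int = 500
) -> List[Tuple[str, float]]:
    """Rank categories by the absolute size of their change (an assumption)."""
    scored = []
    for category, quantities in by_category.items():
        change = sales_change(quantities)
        if change is not None:
            scored.append((category, change))
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)[:limit]
```

For example, a category that sold 10 units in each of the five earlier hours and 20 units in the prior hour has a baseline of 70 / 6, so its change is 20 / (70 / 6) - 1 = 5/7, or roughly a 71 percent increase.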

    For all product trends job at a high level:

    So How Did It Do?

    Pulling computation out of Cricket into Flink proved to be the right move. Those jobs ran with 100 percent uptime throughout BFCM without backpressure and required no manual intervention. To mitigate risk, we also implemented the new metrics as batch jobs on our time-tested Spark infrastructure. While these jobs ran well, we ended up not relying on them because Flink met our expectations.

    Here’s a look at what we shipped:

    Shopify’s 2021 Black Friday Cyber Monday Live Map with new data points including unique shoppers and product trends

    In the end, user feedback was positive, and we processed significantly more checkout events, as well as produced new metrics.

    However, not everything went as smoothly as planned. The method that we used to fetch messages from Redis and serve them to the end users caused high CPU loads on our machines. This scalability issue was compounded by Cricket producing metrics at a faster rate and our new product trends metric clogging Redis with its large memory footprint.

    A small sample of users noticed a visual error: some of the arc visuals would initiate, then blip out of existence. With the help of our Production Engineering team, we dropped some of the unnecessary Redis state and quickly unclogged it within two hours.

    Despite the hiccup, the negative user impact was minimal. Flink met our high expectations, and we took notes on how to improve the live map infrastructure for the next year.

    Planning For Next Year

    With another successful BFCM behind us, the internal library we built for Flink has proven its worth: it enabled our teams to assemble sophisticated pipelines for the live map in a matter of weeks, showing that we can run mission-critical applications on this technology.

    Beyond BFCM, what we’ve built can be used to improve other Shopify analytic visualizations and products. These products are currently powered by batch processing and the data isn’t always as fresh as we’d like. We can’t wait to use streaming technology to power more products that help our merchants stay data-informed.

    As for the next BFCM, we’re planning to simplify the system powering the live map. And, because we had such a great experience with it, we’re looking to use Flink to handle all of the complexity.

    This new system will enable us to:

    • no longer have to maintain our own stateful stream processor
    • remove the bottleneck in our system
    • only have to consider back pressure at a single point (versus having to handle back pressure in our streaming jobs, in Cricket, and between Cricket and Web).

    We are exploring a few different solutions, but the following is a promising one:

    Image showing potential future BFCM live map system diagram. Add data sources via events to Flink all metrics and snapshot metrics to the database. Poll from the browser to web and read metrics from the web to the database
    Potential future BFCM live map system diagram

    The above design is relatively simple and satisfies both our scalability and complexity requirements. All of the metrics would be produced by Flink jobs and periodically snapshotted in a database or key-value store. The Web tier would then periodically synchronize its in-memory cache and serve the polling requests from the browsers.
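The Web tier's side of that design could look something like the following sketch. It is only an illustration of the periodic-sync idea: the `MetricsCache` class, the `load_snapshot` callable standing in for the database or key-value store read, and the refresh interval are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class MetricsCache:
    """In-memory cache that a web process refreshes from a snapshot store."""

    def __init__(
        self,
        load_snapshot: Callable[[], Dict[str, float]],  # hypothetical store read
        refresh_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_snapshot = load_snapshot
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._metrics: Dict[str, float] = {}
        self._loaded_at: Optional[float] = None

    def get(self) -> Dict[str, float]:
        """Serve a polling request, reloading only when the cache is stale."""
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds
        if stale:
            self._metrics = dict(self._load_snapshot())
            self._loaded_at = now
        return self._metrics
```

With this shape, the number of store reads depends on the refresh interval and the number of web processes, not on the number of browsers polling.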

    Overall, we’re pleased with what we accomplished and excited that we have such a head start on next year’s design. Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Want to help us scale and make commerce better for everyone? Join our team.

    Berkay Antmen leads the Streaming Capabilities team under Data Platform Engineering. He’s interested in computational mathematics and distributed systems. His current Shopify mission is to make large-scale near real-time processing easy. Follow Berkay on Twitter.

    Chris Wu is a Product Lead who works on the Data Platform team. He focuses on making great tools to work with data. In his spare time he can be found buying really nice notebooks but never actually writing in them.

    Dave Sugden is a Staff Data Developer who works on the Customer Success team, enabling Shopifolk to onboard to streaming technology.

    If you’re passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Upgrading MySQL at Shopify

    In early September 2021, we retired our last Shopify database virtual machine (VM) that was running Percona Server 5.7.21, marking the complete cutover to 5.7.32. In this post, I’ll share how the Database Platform team performed the most recent MySQL upgrade at Shopify. I’ll talk about some of the roadblocks we encountered during rollback testing, the internal tooling that we built out to aid upgrading and scaling our fleet in general, and our guidelines for approaching upgrades going forward, which we hope will be useful for the rest of the community.

    Why Upgrade and Why Now?

    We were particularly interested in upgrading due to the replication improvements that would preserve replication parallelism in a multi-tier replication hierarchy via transaction writesets. However, in a general sense, upgrading our version of MySQL was on our minds for a while and the reasons have become more important over time as we’ve grown:

    • We’ve transferred more load to our replicas over time, and without replication improvements, high load could cause replication lag and a poor merchant and buyer experience.
    • Due to our increasing global footprint, to maintain efficiency, our replication topology can be up to four “hops” deep, which increases the importance of our replication performance.
    • Without replication improvements, in times of high load such as Black Friday/Cyber Monday (BFCM) and flash sales, there’s a greater likelihood of replication lag that in turn heightens the risk to merchants’ data availability in the event of a writer failure.
    • It’s industry best practice to stay current with all software dependencies to receive security and stability patches.
    • We expect to eventually upgrade to MySQL 8.0. Building the upgrade tooling required for this minor upgrade helps us prepare for that.

    To the last point, one thing we definitely wanted to achieve as a part of this upgrade was, in the words of my colleague Akshay, to “Make MySQL upgrades at Shopify a checklist of tasks going forward, as opposed to a full-fledged project.” Ideally, by the end of the project, we’d have documented steps for performing an upgrade that anyone on the Database Platform team could follow, and an upgrade would take on the order of weeks, rather than months, to complete.

    Database Infrastructure at Shopify


    Shopify's Core database infrastructure is horizontally sharded by shop, spread across hundreds of shards, each consisting of a writer and five or more replicas. These shards are run on Google Compute Engine Virtual Machines (VM) and run the Percona Server fork of MySQL. Our backup system makes use of Google Cloud’s persistent disk snapshots. While we’re running the upstream versions of Percona Server, we maintain an internal fork and build pipeline that allows us to patch it as necessary.


    Without automation, there’s a non-trivial amount of toil involved in just the day-to-day operation of our VM fleet due to its sheer size. VMs can go down for many reasons, including failed GCP live migrations, zone outages, or just run-of-the-mill VM failures. Mason was developed to respond to VMs going down by spinning up a VM to replace it—a task far more suited to a robot rather than a human, especially in the middle of the night.

    Mason was developed as a self-healing service for our VM-based databases and was born out of a Shopify Hack Days project in late 2019.

    Healing Isn’t All That’s Needed

    Shopify’s query workload can differ vastly from shard to shard, which necessitates maintenance of vastly different configurations. Our minimal configuration is six instances: three instances in Google Cloud’s us-east1 region and three instances in us-central1. However, each shard’s configuration can differ in other ways:

    • There may be additional replicas to accommodate higher read workloads or to provide replicas in other locations globally.
    • The VMs for the replicas may have a different number of cores or memory to accommodate differing workloads.

    With all of this in mind, you can probably imagine how it would be desirable to have automation built around maintaining these differences—without it, a good chunk of the manual toil involved in on-call tasks would be simply provisioning VMs, which isn’t an enviable set of responsibilities.

    Using Mason to Upgrade MySQL

    Upgrades at our scale are extremely high effort, as our VM fleet currently numbers in the thousands. We decided that building additional functionality onto Mason would be the way forward to automate our MySQL upgrade, and called it the Declarative Database Topologies project. Where Mason was previously used as a solely reactive tool that only maintained a hardcoded default configuration, we envisioned its next iteration as a proactive tool: one that allows us to define a per-shard topology and do the provisioning work that reconciles its current state to a desired state. Doing this would allow us to automate provisioning of upgraded VMs, thus removing much of the toil involved in upgrading a large fleet, and automate scale-up provisioning for events such as BFCM or other high-traffic occurrences.
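The reconciliation idea can be sketched as a diff between a desired and a current list of instances. The code below is an illustration only, not Mason's implementation: the `InstanceSpec` fields and the `reconcile` function are hypothetical, and a real system would also need to sequence takeovers and decommissioning safely.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True, order=True)
class InstanceSpec:
    region: str          # e.g. "us-east1"
    machine_type: str    # hypothetical sizing label
    mysql_version: str   # e.g. "5.7.32"


def reconcile(
    desired: List[InstanceSpec], current: List[InstanceSpec]
) -> Tuple[List[InstanceSpec], List[InstanceSpec]]:
    """Return (to_provision, to_decommission) for one shard."""
    want, have = Counter(desired), Counter(current)
    # Counter subtraction keeps only positive counts, so duplicates are handled.
    to_provision = sorted((want - have).elements())
    to_decommission = sorted((have - want).elements())
    return to_provision, to_decommission
```

Changing the MySQL version in a shard's desired topology then shows up as new instances to provision and old ones to retire, which is the same shape of work as a scale-up.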

    The Project Plan

    We had approximately eight months before BFCM preparations would begin to achieve the following:

    • pick a new version of MySQL
    • benchmark and test the new version for any regressions or bugs
    • perform rollback testing and create a rollback plan so we can safely downgrade if necessary
    • finally, perform the actual upgrade.

    At the same time, we also needed to evolve Mason to:

    • increase its stability
    • move from a global hardcoded configuration to a dynamic per-shard configuration
    • have it respond to scale-ups when the configuration changed
    • have it care about Chef configuration, too
    • … do all of that safely.

    One of the first things we had to do was pick a version of Percona Server. We wanted to maximize the gains that we would get from an upgrade while minimizing our risk. This led us to choose the highest minor version of Percona Server 5.7, which was 5.7.32 at the start of the project. By doing so, we benefited from the bug and security fixes made since we last upgraded; in the words of one of our directors, “incidents that never happened” because we upgraded. At the same time, we avoided some of the larger risks associated with major version upgrades.

    Once we had settled on a version, we made changes in Chef to have it handle an in-place upgrade. Essentially, we created a new Chef role with the existing provisioning code but with the new version specified for the MySQL server version variable and modified the code so that the following happens:

    1. Restore a backup taken from a 5.7.21 VM on a VM with 5.7.32 installed.
    2. Allow the VM and MySQL server process to start up normally. 
    3. Check the contents of the mysql_upgrade_info file in the data directory. If the version differs from that of the MySQL server version installed, run mysql_upgrade (via a wrapper script that’s necessary to account for unexpected behaviour of the mysql_upgrade script that exits with the return code 2, instead of the typical return code of 0, when an upgrade wasn’t required).
    4. Perform the necessary replication configuration and proceed with the rest of the MySQL server startup.
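Step 3 can be sketched in Python as below. This is a hedged illustration rather than our actual wrapper script or Chef code: the function names are hypothetical, the handling of a missing mysql_upgrade_info file is an assumption, and the command is invoked without the connection options a real deployment would need. The exit code handling follows the behaviour described above, where 2 means no upgrade was required.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

UPGRADE_NOT_NEEDED = 2  # exit code described above for "no upgrade required"


def needs_upgrade(data_dir: Path, server_version: str) -> bool:
    """Compare the recorded version with the installed server version."""
    info_file = data_dir / "mysql_upgrade_info"
    if not info_file.exists():
        # Assumption: with no record of a previous upgrade, run the upgrade.
        return True
    return info_file.read_text().strip() != server_version


def run_upgrade(
    data_dir: Path,
    server_version: str,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> bool:
    """Return True if mysql_upgrade ran and reported that it upgraded."""
    if not needs_upgrade(data_dir, server_version):
        return False
    code = run(["mysql_upgrade"])
    if code == 0:
        return True
    if code == UPGRADE_NOT_NEEDED:
        # Treat "nothing to do" as success instead of failing provisioning.
        return False
    raise RuntimeError(f"mysql_upgrade failed with exit code {code}")
```

Passing the runner in as a parameter keeps the version check and exit code handling testable without a MySQL server.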

    After this work was completed, all we had to do to provision an upgraded version was to specify that the new VM be built with the new Chef role.

    Preparing for the Upgrade

    Performing the upgrade is the easy part, operationally. You can spin up an instance with a backup from the old version, let mysql_upgrade do its thing, have it join the existing replication topology, optionally take backups from this instance with the newer version, populate the rest of the topology, and then perform a takeover. Making sure the newer version performs the way we expect and can be safely rolled back to the old version, however, is the tricky part.

    During our benchmarking tests, we didn’t find anything anomalous, performance-wise. However, when testing the downgrade from 5.7.32 back to 5.7.21, we found that the MySQL server wouldn’t properly start up. This is what we saw when tailing the error logs:

    When we allowed the calculation of transient stats at startup to run to completion, it took over a day due to a lengthy table analyze process on some of our shards—not great if we needed to roll back more urgently than that.

    A cursory look at the Percona Server source code revealed that the table_name column in the innodb_index_stats and innodb_table_stats tables changed from VARCHAR(64) in 5.7.21 to VARCHAR(199) in 5.7.32. We patched mysql_system_tables_fix.sql in our internal Percona Server fork so that the column lengths were set back to a value that 5.7.21 expected, and re-tested the rollback. This time, we didn’t see the errors about the column lengths; however, we still saw the analyze table process causing full table rebuilds, again leading to an unacceptable startup time. It became clear to us that we had merely addressed a symptom of the problem by fixing these column lengths.

    At this point, while investigating our options, it occurred to us that one of the reasons why this analyze table process might be happening is because we run ALTER TABLE commands as a part of the MySQL server start: we run a startup script that sets the AUTO_INCREMENT value on tables to set a minimum value (this is due to the auto_increment counter not being persisted across restarts, a long-standing bug which is addressed in MySQL 8.0).

    Investigating the Bug

    Once we had our hypothesis, we started to test it. This culminated in a group debugging session where a few members of our team found that the following steps reproduced the bug that resulted in the full table rebuild:

    1. On 5.7.32: A backup previously taken from 5.7.21 is restored.
    2. On 5.7.32: An ALTER TABLE is run on a table that should just be an instantaneous metadata change, for example, ALTER TABLE t AUTO_INCREMENT=n. The table is changed instantaneously, as expected.
    3. On 5.7.32: A backup is taken.
    4. On 5.7.21: The backup taken from 5.7.32 in the previous step is restored.
    5. On 5.7.21: The MySQL server is started up, and mysql_upgrade performs the in-place downgrade.
    6. On 5.7.21: An ALTER TABLE statement similar to the one in step 2 is performed. A full rebuild of the table is performed, unexpectedly and unnecessarily.

    Stepping through the above steps with the GNU Debugger (GDB), we found the place in the MySQL server source code where it’s incorrectly concluded that indexes have changed in a way that required a table rebuild (from Percona Server 5.7.21 in the has_index_def_changed function in sql/

    We saw, while inspecting in GDB, that the flags for the old version of the table (table_key->flags above) don’t match that of the new version of the table (new_key->flags above), despite the fact that only a metadata change was applied:

    Digging deeper, we found past attempts to fix this bug. In the 5.7.23 release notes, there’s the following:

    “For attempts to increase the length of a VARCHAR column of an InnoDB table using ALTER TABLE with the INPLACE algorithm, the attempt failed if the column was indexed. If an index size exceeded the InnoDB limit of 767 bytes for COMPACT or REDUNDANT row format, CREATE TABLE and ALTER TABLE did not report an error (in strict SQL mode) or a warning (in nonstrict mode). (Bug #26848813)”

    A fix was merged for the bug; however, we saw that there was a second attempt to fix this behaviour. In the 5.7.27 release notes, we see:

    “For InnoDB tables that contained an index on a VARCHAR column and were created prior to MySQL 5.7.23, some simple ALTER TABLE statements that should have been done in place were performed with a table rebuild after an upgrade to MySQL 5.7.23 or higher. (Bug #29375764, Bug #94383)”

    A fix was merged for this bug as well, but it didn’t fully address the issue of some ALTER TABLE statements that should be simple metadata changes instead leading to a full table rebuild.

    My colleague Akshay filed a bug against this; however, the included patch ultimately wasn’t accepted by the MySQL team. In order to safely upgrade past this bug, we still needed MySQL to behave in a reasonable way on downgrade, and we ended up patching Percona Server in our internal fork. We tested our patched version successfully in our final rollback tests, unblocking our upgrade.

    What are “Packed Keys” Anyway?

    The PACK_KEYS feature of the MyISAM storage engine allows keys to be compressed, thereby making indexes much smaller and improving performance. This feature isn’t supported by the InnoDB storage engine as its index layout and expectations are completely different. In MyISAM, when indexed VARCHAR columns are expanded past eight bytes, thus converting from unpacked keys to packed keys, it (rightfully) triggers an index rebuild.

    However, the first attempt to fix the bug, in 5.7.23, shows that the same type of change triggered the same behaviour in InnoDB, even though InnoDB doesn’t support packed keys. To remedy this, from 5.7.23 onwards, the HA_PACK_KEY and HA_BINARY_PACK_KEY flags weren’t set if the storage engine didn’t support them.

    That, however, meant that tables created prior to 5.7.23 still had these flags set, even on storage engines that didn’t support them. So upon upgrade to 5.7.23 or higher, any metadata-only ALTER TABLE command executed on such an InnoDB table incorrectly concluded that a full index rebuild was necessary. This brings us to the second attempt to fix the issue, in which the flags were removed entirely if the storage engine didn’t support them. Unfortunately, that second bug fix didn’t account for the downgrade case, where the flags differ between versions but the difference should be ignored when evaluating whether the indexes need to be rebuilt in the earlier version, and that’s what we addressed in our proposed patch. In our patch, during downgrade, if the old version of the table (from 5.7.32) didn’t specify the flag, but the new version of the table (in 5.7.21) does, then we bypass the index rebuild.
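The decision our patch makes can be modelled in a few lines of Python. This is a simplified model, not the C++ patch itself: the function name is hypothetical, the bit values are placeholders rather than values copied from the server headers, and the real comparison in the server considers much more than these flags.

```python
# Placeholder bit values for illustration; consult the server headers
# for the real definitions.
HA_PACK_KEY = 1 << 1
HA_BINARY_PACK_KEY = 1 << 5
PACK_FLAGS = HA_PACK_KEY | HA_BINARY_PACK_KEY


def index_flags_require_rebuild(
    old_flags: int, new_flags: int, engine_supports_packed_keys: bool
) -> bool:
    """Decide whether a difference in key flags should force an index rebuild."""
    changed = old_flags ^ new_flags
    if not engine_supports_packed_keys:
        # Ignore pack-key bits that the old definition lacks but the new
        # definition sets: the engine never packs keys, so nothing changed.
        gained_pack_bits = ~old_flags & new_flags & PACK_FLAGS
        changed &= ~gained_pack_bits
    return changed != 0
```

Only the direction described above is ignored, so any other flag difference still triggers a rebuild, as does the same difference on an engine that supports packed keys.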

    Meanwhile, in the Mason Project… 

    While all of this rollback testing work was in progress, another part of the team was hard at work shipping new features in Mason to let it handle the upgrades. These were some of the requirements we had that guided the project work:

    • The creation of a “priority” lane—self-healing should always take precedence over a scale-up related provisioning request.
    • We needed to throttle the scale-up provisioning queue to limit how much work was done simultaneously.
    • Feature flags were required to limit the number of shards to release the scale-up feature to, so that we could control which shards were provisioned and release the new features carefully.
    • A dry-run mode for scale-up provisioning was necessary to allow us to test these features without making changes to the production systems immediately.
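Taken together, these requirements describe a prioritized, throttled queue with feature flags and a dry-run switch. The sketch below shows one way such a queue could behave; it is not Mason's code, and the class, constants, and log strings are hypothetical.

```python
import heapq
import itertools
from typing import List, Set, Tuple

HEALING, SCALE_UP = 0, 1  # lower number means higher priority


class ProvisioningQueue:
    """Toy queue: healing first, scale-ups throttled, flagged, and dry-runnable."""

    def __init__(
        self, max_scale_ups_per_tick: int, enabled_shards: Set[str], dry_run: bool = True
    ) -> None:
        self.max_scale_ups_per_tick = max_scale_ups_per_tick
        self.enabled_shards = enabled_shards   # stand-in for feature flags
        self.dry_run = dry_run
        self._heap: List[Tuple[int, int, str]] = []
        self._sequence = itertools.count()     # preserves submission order

    def submit(self, kind: int, shard: str) -> None:
        heapq.heappush(self._heap, (kind, next(self._sequence), shard))

    def tick(self) -> List[str]:
        """Process the queue once and return a log of the actions taken."""
        actions: List[str] = []
        deferred: List[Tuple[int, int, str]] = []
        scale_ups = 0
        while self._heap:
            kind, seq, shard = heapq.heappop(self._heap)
            if kind == HEALING:
                actions.append(f"heal {shard}")
            elif shard not in self.enabled_shards:
                actions.append(f"skip {shard} (feature flag off)")
            elif scale_ups >= self.max_scale_ups_per_tick:
                deferred.append((kind, seq, shard))  # throttled until next tick
            else:
                scale_ups += 1
                verb = "would provision" if self.dry_run else "provision"
                actions.append(f"{verb} {shard}")
        for item in deferred:
            heapq.heappush(self._heap, item)
        return actions
```

Because healing requests sort ahead of scale-ups, a failed VM is always handled before any queued scale-up work, regardless of submission order.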

    Underlying all of this was an abundance of caution in shipping the new features. Because of our large fleet size, we didn’t want to risk provisioning a lot of VMs we didn’t need or VMs in the incorrect configuration that would cost us either way in terms of GCP resource usage or engineering time spent in decommissioning resources.

    In the initial stages of the project, stabilizing the service was important since it played a critical role in maintaining our MySQL topology. Over time, it had turned into a critical component of our infrastructure that significantly improved our on-call quality of life. Some of the early tasks that needed to be done were simply making it a first-class citizen among the services that we owned. We stabilized the staging environment it was deployed into, created and improved existing monitoring, and started using it to emit metrics to Datadog indicating when the topology was underprovisioned (in cases where Mason failed to do its job).

    Another challenge was that Mason itself talks to many disparate components in our infrastructure: the GCP API, Chef, the Kubernetes API, ZooKeeper, Orchestrator, as well as the database VMs themselves. It was often a challenge to anticipate failure scenarios—often, the failure experienced was completely new and wouldn’t have been caught in existing tests. This is still an ongoing challenge, and one that we hope to address through improved integration testing.

    Later on, as we onboarded new people to the project and started introducing more features, it also became obvious that the application was quite brittle in its current state; adding new features became more and more difficult due to the existing complexity, especially when they were being worked on concurrently. It brought to the forefront the importance of breaking down streams of work that have the potential to become hard blockers, and highlighted how much a well-designed codebase can decrease the chances of this happening.

    We faced many challenges, but ultimately shipped the project on time. Now that the project is complete, we’re dedicating time to improving the codebase so it’s more maintainable and developer-friendly.

    The Upgrade Itself

    Throughout the process of rollback testing, we had already been running 5.7.32 for a few months on several shards reserved for canary testing. A few of those shards are load tested on a regular basis, so we were reasonably confident that this, along with our own benchmarking tests, made it ready for our production workload.

    Next, we created a rollback plan in case the new version was unstable in production for unforeseen reasons. One of the early suggestions for risk mitigation was to maintain a 5.7.21 VM per-shard and continue to take backups from them. However, that would have been operationally complex and also would have necessitated the creation of more tooling and monitoring to make sure that we always have 5.7.21 VMs running for each shard (rather toilsome when the number of shards reaches the hundreds in a fleet). Ultimately, we decided against this plan, especially considering the fact that we were confident that we could roll back to our patched build of Percona Server, if we had to.

    Our intention was to do everything we could to de-risk the upgrade by performing extensive rollback testing, but ultimately we preferred to fix forward whenever possible. That is, the option to roll back was expected to be taken only as a last resort.

    We started provisioning new VMs with 5.7.32 in earnest on August 25th using Mason, after our tooling and rollback plan were in place. We decided to stagger the upgrades by creating several batches of shards. This allowed the upgraded shards to “bake” and not endanger the entire fleet in the event of an unforeseen circumstance. We also didn’t want to provision all the new VMs at once due to the amount of resource churn (at the petabyte-scale) and pressure it would put on Google Cloud.

    On September 7th, the final shards were completed, marking the end of the upgrade project.

    What Did We Take Away from This Upgrade? 

    This upgrade project highlighted the importance of rollback testing. Without the extensive testing that we performed, we would have never known that there was a critical bug blocking a potential rollback. Even though needing to rebuild the fleet with the old version to downgrade would have been toilsome and undesirable, patching 5.7.21 gave us the confidence to proceed with the upgrade, knowing that we had the option to safely downgrade if it became necessary.

    Mason, the tooling that we relied on, also became more important over time. In the past, Mason was considered a lower-tier application, and simply turning it off was the band-aid solution whenever it behaved in unexpected ways. Fixing bugs in it was rarely a priority. However, as time has gone by, we’ve recognized how large a role it plays in toil-mitigation and maintaining healthy on-call expectations, especially as the size of our fleet has grown. We have invested more time and resources into it by improving test coverage and refactoring key parts of the codebase to reduce complexity and improve readability. We also have future plans to improve the local development environments and streamline its deployment pipeline.

    Finally, investing in the documentation and easy repeatability of upgrades has been a big win for Shopify and for our team. When we first started planning for this upgrade, finding out how upgrades were done in the past was a bit of a scavenger hunt and required a lot of institutional knowledge. By developing guidelines and documentation, we paved the way for future upgrades to be done faster, more safely, and more efficiently. Rather than an intense and manual context-gathering process every time that pays no future dividends, we can now treat a MySQL upgrade as simply a series of guidelines to follow using our existing tooling.

    Next up: MySQL 8!

    Yi Qing Sim is a Senior Production Engineer and brings nearly a decade of software development and site reliability engineering experience to the Database Backend team, where she primarily works on Shopify’s core database infrastructure.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Remote Rendering: Shopify’s Take on Extensible UI

    Shopify is one of the world's largest e-commerce platforms. With millions of merchants worldwide, we support an increasingly diverse set of use cases, and we wouldn't be successful at it without our developer community. Developers build apps that add immense value to Shopify and its merchants, and solve problems such as marketing automation, sales channel integrations, and product sourcing.

    In this post, we will take a deep dive into the latest generation of our technology that allows developers to extend Shopify’s UI. With this technology, developers can better integrate with the Shopify platform and offer native experiences and rich interactions that fit into users' natural workflow on the platform.

    A GIF showing a 3rd party extension inserting a page that highlights an upsell offer before the purchase is completed in the Shopify checkout
    3rd party extension adding a post-purchase page directly into the Shopify checkout

    To put the technical challenges into context, it's important to understand our main objectives and requirements:

    • The user experience of 3rd party extensions must be consistent with Shopify's native content in terms of look & feel, performance, and accessibility features.
    • Developers should be able to extend Shopify using standard technologies they are already familiar with.
    • Shopify needs to run extensions in a secure and reliable manner, and prevent them from negatively impacting the platform (naively or maliciously).
    • Extensions should offer the same delightful experience across all supported platforms (web, iOS, Android).

    With these requirements in mind, it's time to peel the onion.

    Remote Rendering

    At the heart of our solution is a technique we call remote rendering. With remote rendering, we separate the code that defines the UI from the code that renders it, and have the two communicate via message passing. This technique fits our use case very well because extensions (code that defines UI) are typically 3rd party code that needs to run in a restricted sandbox environment, while the host (code that renders UI) is part of the main application.

    A diagram showing that Extensions define the UI and run in a sandbox and the Host renders the UI and is part of the main application. Extensions and Host communicate via messages between them.
    Separating extensions (3rd party code) from host (1st party code)

    Communication between an extension and a host is done via a MessageChannel. Using message passing for all communication means that hosts and extensions are completely agnostic of each other’s implementation and can be implemented using different languages. In fact, at Shopify, we have implemented hosts in JavaScript, Kotlin, and Swift to provide cross-platform support.

    The remote-ui Library

    Remote rendering gives us the flexibility we need, but it also introduces non-trivial technical challenges such as defining an efficient message-passing protocol, implementing function calls using message passing (aka remote procedure call), and applying UI updates in a performant way. These challenges (and more) are tackled by remote-ui, an open-source library developed at Shopify.

    Let's take a closer look at some of the fundamental building blocks that remote-ui offers and how these building blocks fit together.


    At the lower level, the @remote-ui/rpc package provides a powerful remote procedure call (RPC) abstraction. The key feature of this RPC layer is the ability for functions to be passed (and called) across a postMessage interface, supporting the common need for passing event callbacks.

    Two code snippets displayed side by side showing remote procedure calls using endpoint.expose and
    Making remote procedure calls using (script1.js) and endpoint.expose (script2.js)

    @remote-ui/rpc introduces the concept of an endpoint for exposing functions and calling them remotely. Under the hood, the library uses Promise and Proxy objects to abstract away the details of the underlying message-passing protocol.

    It's also worth mentioning that remote-ui’s RPC has very smart automatic memory management. This feature is especially useful when rendering UI, since properties (such as event handlers) can be automatically retained and released as UI components mount and unmount.
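
    To make the idea of passing functions across a message boundary concrete, here is a minimal sketch. It is not the @remote-ui/rpc implementation: the names createWirePair and createEndpoint, the message shapes, and the synchronous in-memory "wire" are all assumptions made for illustration. Functions in the arguments are swapped for numeric placeholders before serialization, and the receiving side turns each placeholder into a stub that sends a message back.

```typescript
// Hypothetical, synchronous stand-in for a postMessage-style port.
type Wire = { post(message: string): void; onMessage?: (message: string) => void };
type Callable = (...args: unknown[]) => void;

function createWirePair(): [Wire, Wire] {
  const a: Wire = { post: (m) => b.onMessage?.(m) };
  const b: Wire = { post: (m) => a.onMessage?.(m) };
  return [a, b];
}

function createEndpoint(wire: Wire) {
  const exposed = new Map<string, Callable>();
  // Functions sent to the other side. This sketch never releases them;
  // the article notes that the real library manages retain/release for you.
  const retained = new Map<number, Callable>();
  let nextId = 0;

  // Replace top-level function arguments with serializable placeholders.
  const encode = (value: unknown): unknown => {
    if (typeof value === "function") {
      const id = nextId++;
      retained.set(id, value as Callable);
      return { __fn: id };
    }
    return value;
  };

  // Turn placeholders back into stubs that call across the wire.
  const decode = (value: unknown): unknown => {
    if (typeof value === "object" && value !== null && "__fn" in value) {
      const id = (value as { __fn: number }).__fn;
      return (...args: unknown[]) =>
        wire.post(JSON.stringify({ kind: "callback", id, args: args.map(encode) }));
    }
    return value;
  };

  wire.onMessage = (raw) => {
    const msg = JSON.parse(raw) as { kind: string; name?: string; id?: number; args: unknown[] };
    const args = msg.args.map(decode);
    if (msg.kind === "call" && msg.name !== undefined) exposed.get(msg.name)?.(...args);
    if (msg.kind === "callback" && msg.id !== undefined) retained.get(msg.id)?.(...args);
  };

  return {
    expose(api: Record<string, Callable>): void {
      for (const key of Object.keys(api)) exposed.set(key, api[key]);
    },
    call(method: string, ...args: unknown[]): void {
      wire.post(JSON.stringify({ kind: "call", name: method, args: args.map(encode) }));
    },
  };
}

const [hostWire, extensionWire] = createWirePair();
const host = createEndpoint(hostWire);
const extension = createEndpoint(extensionWire);
const log: string[] = [];

// Host side: exposes a function that accepts an event callback.
host.expose({ onButtonPress: (handler) => (handler as Callable)("pressed") });
// Extension side: passes a real function across the boundary.
extension.call("onButtonPress", (payload: unknown) => log.push(`extension saw: ${String(payload)}`));
```

    Only JSON strings ever cross the wire, yet the host ends up invoking the extension's callback. A real channel is asynchronous, so return values would need to be delivered as promises.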

    Remote Root

    After RPC, the next fundamental building block is the RemoteRoot which provides a familiar DOM-like API for defining and manipulating a UI component tree. Under the hood, RemoteRoot uses RPC to serialize UI updates as JSON messages and send them to the host.

    Two code snippets showing appending a child to a `RemoteRoot` object and getting converted to a JSON message
    UI is defined with a DOM-like API and gets converted to a JSON message

    For more details on the implementation of RemoteRoot, see the documentation and source code of the @remote-ui/core package.
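
    As a rough illustration of the idea (not the actual @remote-ui/core API or wire format), the sketch below shows a tiny DOM-like root whose mutations are serialized as JSON messages. The names createSketchRoot and AppendChildMessage, and the message fields, are hypothetical.

```typescript
// Hypothetical message shape; the real wire format may differ.
type Props = Record<string, string | number | boolean>;
interface SketchComponent { id: string; kind: string; props: Props }
interface AppendChildMessage { type: "appendChild"; parentId: string; child: SketchComponent }

function createSketchRoot(send: (json: string) => void) {
  let nextId = 0;
  return {
    // Creating a component has no side effects on the host.
    createComponent(kind: string, props: Props = {}): SketchComponent {
      return { id: `c${nextId++}`, kind, props };
    },
    // Attaching it to the tree is what produces a message.
    appendChild(child: SketchComponent): void {
      const message: AppendChildMessage = { type: "appendChild", parentId: "root", child };
      send(JSON.stringify(message));
    },
  };
}

const sent: string[] = [];
const root = createSketchRoot((json) => sent.push(json));
const button = root.createComponent("Button", { label: "Buy now" });
root.appendChild(button);
```

    The extension only ever works with plain objects and method calls; the host only ever sees serialized messages in `sent`.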

    Remote Receiver

    The "opposite side" of a RemoteRoot is a RemoteReceiver. It receives UI updates (JSON messages sent from a remote root) and reconstructs the remote component tree locally. The remote component tree can then be rendered using native components.

    Code snippets showing RemoteRoot and RemoteReceiver working together

    Basic example setting up a RemoteRoot and RemoteReceiver to work together (host.jsx and extension.js)

    With RemoteRoot and RemoteReceiver we are very close to having an implementation of the remote rendering pattern. Extensions can define the UI as a remote tree, and that tree gets reconstructed on the host. The only missing thing is for the host to traverse the tree and render it using native UI components.
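
    The receiving half can be sketched in the same spirit. This is an assumption-laden illustration rather than the RemoteReceiver implementation: createSketchReceiver, the two message types, and the outline helper are hypothetical names. It applies JSON messages to a local copy of the tree, which the host can then traverse.

```typescript
interface TreeNode { id: string; kind: string; props: Record<string, unknown>; children: TreeNode[] }

// Hypothetical update messages sent by the remote side.
type UpdateMessage =
  | { type: "appendChild"; parentId: string; child: { id: string; kind: string; props: Record<string, unknown> } }
  | { type: "updateProps"; id: string; props: Record<string, unknown> };

function createSketchReceiver() {
  const rootNode: TreeNode = { id: "root", kind: "Root", props: {}, children: [] };
  const byId = new Map<string, TreeNode>([[rootNode.id, rootNode]]);

  return {
    root: rootNode,
    receive(json: string): void {
      const message = JSON.parse(json) as UpdateMessage;
      if (message.type === "appendChild") {
        const parentNode = byId.get(message.parentId);
        if (!parentNode) throw new Error(`Unknown parent ${message.parentId}`);
        const node: TreeNode = { ...message.child, children: [] };
        parentNode.children.push(node);
        byId.set(node.id, node);
      } else {
        const node = byId.get(message.id);
        if (!node) throw new Error(`Unknown component ${message.id}`);
        node.props = { ...node.props, ...message.props };
      }
    },
  };
}

// A host would walk the tree and map each kind to a native component;
// here we only produce an indented outline.
function outline(node: TreeNode, depth = 0): string[] {
  const lines = [`${"  ".repeat(depth)}${node.kind}`];
  for (const child of node.children) lines.push(...outline(child, depth + 1));
  return lines;
}

const receiver = createSketchReceiver();
receiver.receive(JSON.stringify({
  type: "appendChild",
  parentId: "root",
  child: { id: "c0", kind: "Button", props: { label: "Buy" } },
}));
receiver.receive(JSON.stringify({ type: "updateProps", id: "c0", props: { label: "Buy now" } }));
```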

    DOM Receiver

    remote-ui provides a number of packages that make it easy to convert a remote component tree to a native component tree. For example, a DomReceiver can be initialized with minimal configuration and render a remote root into the DOM. It abstracts away the underlying details of traversing the tree, converting remote components to DOM elements, and attaching event handlers.

    In the snippet above, we create a receiver that will render the remote tree inside a DOM element with the id container. The receiver will convert Button and LineBreak remote components to button and br DOM elements, respectively. It will also automatically convert any prop starting with on into an event listener.
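
    The conversion rules described here can be sketched without touching the DOM. The code below is not DomReceiver itself; planElement and ElementPlan are hypothetical names, and the way an event name is derived from a prop name (onPress becomes "press") is an assumption. It maps a remote component to a DOM tag and splits its props into attributes and event listeners.

```typescript
type Listener = (...args: unknown[]) => void;
interface ElementPlan {
  tag: string;
  attributes: Record<string, string>;
  listeners: Record<string, Listener>;
}

// Mapping taken from the text: Button -> button, LineBreak -> br.
const elementFor: Record<string, string> = { Button: "button", LineBreak: "br" };

function planElement(kind: string, props: Record<string, unknown>): ElementPlan {
  const tag = elementFor[kind];
  if (tag === undefined) throw new Error(`No DOM element registered for ${kind}`);

  const plan: ElementPlan = { tag, attributes: {}, listeners: {} };
  for (const key of Object.keys(props)) {
    const value = props[key];
    if (/^on/.test(key) && typeof value === "function") {
      // Props starting with "on" become event listeners.
      plan.listeners[key.slice(2).toLowerCase()] = value as Listener;
    } else {
      plan.attributes[key] = String(value);
    }
  }
  return plan;
}

const buttonPlan = planElement("Button", { title: "Save", onPress: () => undefined });
const breakPlan = planElement("LineBreak", {});
```

    In a browser, a plan like this could be applied with document.createElement, setAttribute, and addEventListener.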

    For more details, check out this complete standalone example in the remote-ui repo.

    Integration with React

    The DomReceiver provides a convenient way for a host to map between remote components and their native implementations, but it’s not a great fit for our use case at Shopify. Our frontend application is built using React, so we need a receiver that manipulates React components (instead of manipulating DOM elements directly).

    Luckily, the @remote-ui/react package has everything we need: a receiver (that receives UI updates from the remote root), a controller (that maps remote components to their native implementations), and the RemoteRenderer React component to hook them up.

    There's nothing special about the component implementations passed to the controller; they are just regular React components:

    However, there's a part of the code that is worth taking a closer look at:

    // Run 3rd party script in a sandbox environment
    // with the receiver as a communication channel ...


    When we introduced the concept of remote rendering, our high-level diagram included only two boxes, extension and host. In practice, the diagram is slightly more complex.

    An image showing the Sandbox as a box surrounding the Extension and a box representing the Host. The two communicate via messages
    The sandbox is an additional layer of indirection between the host and the extension

    The sandbox, an additional layer of indirection between the host and the extension, provides platform developers with more control. The sandbox code runs in an isolated environment (such as a web worker) and loads extensions in a safe and secure manner. In addition to that, by keeping all boilerplate code as part of the sandbox, extension developers get a simpler interface to implement.

    Let's look at a simple sandbox implementation that allows us to run 3rd party code and acts as “the glue” between 3rd party extensions and our host.

    The sandbox allows a host to load extension code from an external URL. When the extension is loaded, it will register itself as a callback function. After the extension finishes loading, the host can render it (that is, call the registered callback).

    Arguments passed to the render function (from the host) provide it with everything it needs. remoteChannel is used for communicating UI updates with the host, and api is an arbitrary object containing any native functionality that the host wants to make available to the extension.
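
    The load, register, and render flow can be sketched as follows. This is a simplified illustration, not our production sandbox: createSketchSandbox is a hypothetical name, the script loader is injected so the example runs anywhere (a web worker would load the script itself), and the channel is reduced to a function that accepts a serialized message. Only remoteChannel, api, and setTitle come from the article.

```typescript
type RemoteChannel = (message: string) => void; // stand-in for the real channel
type HostApi = { setTitle(title: string): void };
type RenderCallback = (remoteChannel: RemoteChannel, api: HostApi) => void;
type ScriptLoader = (url: string, onRender: (callback: RenderCallback) => void) => void;

function createSketchSandbox(loadScript: ScriptLoader) {
  let registered: RenderCallback | undefined;
  return {
    // Loading the script gives the extension a chance to register itself.
    load(url: string): void {
      loadScript(url, (callback) => {
        registered = callback;
      });
    },
    // Rendering calls the registered callback with everything it needs.
    render(remoteChannel: RemoteChannel, api: HostApi): void {
      if (!registered) throw new Error("No extension registered; call load() first");
      registered(remoteChannel, api);
    },
  };
}

const messages: string[] = [];
const titles: string[] = [];

const sandbox = createSketchSandbox((_url, onRender) => {
  // Simulates the 3rd party script registering itself once loaded.
  onRender((remoteChannel, api) => {
    api.setTitle("Hello from the extension");
    remoteChannel(JSON.stringify({ type: "appendChild", kind: "Button" }));
  });
});

sandbox.load("https://extensions.example/hello.js"); // hypothetical URL
sandbox.render(
  (message) => messages.push(message),
  { setTitle: (title) => titles.push(title) },
);
```

    The extension callback never sees how messages are transported; it only receives a channel and an api object.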

    Let's see how a host can use this sandbox:

    In the code snippet above, the host makes a setTitle function available for the extension to use. Here is what the corresponding extension script might look like:

    Notice that 3rd party extension code isn't aware of any underlying aspects of RPC. It only needs to know that the api (that the host will pass) contains a setTitle function.

    Implementing a Production Sandbox

    The implementation above can give you a good sense of our architecture. For the sake of simplicity, we omitted details such as error handling and support for registering multiple extension callbacks.

    In addition to that, our production sandbox restricts the JavaScript environment where untrusted code runs. Some globals (such as importScripts) are made unavailable and others are replaced with safer versions (such as fetch, which is restricted to specific domains). Also, the sandbox script itself is loaded from a separate domain so that the browser provides extra security constraints.
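
    One way to picture a "safer version" of fetch is a wrapper that checks the target host against an allowlist before delegating. The sketch below is an assumption about how such a restriction could look, not our production code: createRestrictedFetch, FetchLike, and the example domains are hypothetical.

```typescript
type FetchLike = (url: string) => Promise<unknown>;

function createRestrictedFetch(allowedHosts: string[], realFetch: FetchLike): FetchLike {
  return (url) => {
    const { hostname } = new URL(url);
    if (allowedHosts.indexOf(hostname) === -1) {
      // Reject before any network activity happens.
      return Promise.reject(new Error(`Blocked request to ${hostname}`));
    }
    return realFetch(url);
  };
}

const requested: string[] = [];
const safeFetch = createRestrictedFetch(["cdn.example.com"], (url) => {
  requested.push(url); // a real sandbox would delegate to the platform fetch here
  return Promise.resolve("ok");
});

safeFetch("https://cdn.example.com/data.json");
safeFetch("https://evil.example.net/steal").catch(() => undefined);
```

    Inside a sandbox, the wrapped function would replace the global fetch before any untrusted code is loaded.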

    Finally, to have cross-platform support, we implemented our sandbox on three different platforms using web workers (web), web views (Android), and JsCore (iOS).

    What’s Next?

    The technology we presented in this blog post is relatively new and is currently used to power two types of extensions, product subscriptions and post-purchase, in two different platform areas.

    We are truly excited about the potential we’re unlocking, and we also know that there's a lot of work ahead of us. Our plans include improving the experience of 3rd party developers, supporting new UI patterns as they come up, and making more areas of the platform extensible.

    If you are interested in learning more, you might want to check out the remote-ui comprehensive example and this recent React Summit talk.

    Special thanks to Chris Sauve, Elana Kopelevich, James Woo, and Trish Ta for their contribution to this blog post.

    Joey Freund is a manager on the core extensibility team, focusing on building tools that let Shopify developers extend our platform to make it a perfect fit for every merchant.

    Building an App Clip with React Native

    When App Clips were introduced in iOS 14, we immediately realized that they could be a big opportunity for the Shop app. An App Clip is designed to be a lightweight version of an app that you can download on the fly, so we wanted to investigate what it could mean for us. Being able to instantly show users the power of the Shop app, without having to download it from the App Store and go through onboarding, was something we thought could have huge growth potential.

    One of the key features, and restrictions, of an App Clip is the size limitation. To make things even more interesting, we wanted to build it in React Native, something that, to our knowledge, had never been done at this scale before.

    Being the first to build an App Clip in React Native that was going to be surfaced to millions of users each day proved to be a challenging task.

    What’s an App Clip?

    App Clips are a miniature version of an app that’s meant to be lightweight and downloadable “on the go.” To provide a smooth download experience, the App Clip can’t exceed 10MB in size. For comparison, the iOS Shop app is 51MB.

    An App Clip can’t be downloaded from the App Store; it can only be “invoked”. An invocation means that a user performs an action that opens the App Clip on their phone: scanning a QR code or an NFC tag, clicking a link in the Messages app, or tapping a Smart App Banner on a webpage. After the invocation is made, iOS displays a prompt asking the user to open the App Clip. Meanwhile, the App Clip binary is downloaded in the background so that it launches instantly. The invocation URL is passed on to the App Clip, which enables you to provide a contextual experience for the user.

    What Are We Trying to Solve?

    The Shop app helps users track all of their packages in one place with ease. When a buyer installs the app, their order is automatically imported and the buyer is kept up to date about its status without having to ask the seller.

    However, we noticed a big drop-off of users in the funnel between the “Thank you” page and opening the app. Despite the Shop app having a 4.8 star rating, the few added steps of going through an App Store meant some buyers chose not to complete the process. The App Clip would solve all of this.

    When the user landed on the “Thank you” page on their computer and invoked the App Clip by scanning a QR code, or for mobile checkouts by simply tapping the Open button, they would instantly see their order tracked. No App Store, no onboarding, just straight into the order details with the option to receive push notifications for the whole package journey.

    Why React Native?

    React Native apps aren’t famous for being small in size, so we knew building an App Clip that was below 10MB in size would pose some interesting challenges. However, being one of the most popular apps on the app stores, and champions of React Native, we really wanted to see if it was possible.

    Since the Shop app is built in React Native, all our developers (not just Swift developers) could contribute to the App Clip, and we would potentially be able to maintain code sharing and feature parity with the App Clip as we do across Android and iOS.

    In short, it was an interesting challenge that aligned with our technology choices and our values about building reusable systems designed for the long-term.

    Building a Proof of Concept–Failing Fast

    Since the App Clip was a very new piece of technology, there was a huge list of unknowns. We weren’t sure if it was going to be possible to build it with React Native and go below the 10MB limit. So we decided to set up a technical plan where if we failed, we would fail fast.

    The plan looked something like this:

    1. Build a “Hello World” App Clip in React Native and determine its size
    2. Build a very scrappy, not even functional, version of the actual App Clip, containing all the code and dependencies we estimated we would need and determine its size
    3. Clean up the code, make everything work

    We also wanted to fail fast product-wise. App Clips are a brand new technology that few people have been exposed to. We weren’t sure if our App Clip would benefit our users, so our goal was to get an App Clip out for testing, and get it out fast. If it proved to be successful, we would go back and iterate.

    Hello World

    When we started building the App Clip, there were a lot of unknowns. So to determine if this was even possible, we started off by creating a “Hello World” App Clip using just React Native’s <View /> and <Text /> components.

    The “Hello World” App Clip weighed in at a staggering 28MB. How could a bare-bones App Clip be this big in size? We investigated and realized that the App Clip was including all the native dependencies that the Shop app used, even though it only needed a subset of the React Native ones. We realized that we had to explicitly define exactly which native dependencies the App Clip needed in the Podfile:

    Defining dependencies was done by looking through React Native’s node_modules/react-native/scripts/react_native_pods to determine the bare minimum native dependencies that React Native needed. After determining the list, we calculated the App Clip size. The result was 4.3MB. This was good news, but we still didn’t know if adding all the features we wanted would make us go beyond the 10MB limit.

    Building a Scrappy Version

    Building an App Clip with React Native is almost identical to building a React Native app, with one big difference: we need to explicitly define the App Clip dependencies in the Podfile. Auto linking wouldn’t work in this case, since it would scan all the installed packages for the ones compatible with auto linking and add them. We needed to cherry-pick only the pods used by the App Clip.

    The process was pretty straightforward: we added a dependency in a React component, and if it had a native dependency, we added it to the “Shop App Clip” target in the Podfile. But the consequence of this would be quite substantial later on.

    With the baseline size at 4.3MB, it was time to start adding the functionality we needed. Since we were still exploring the design in this phase, we didn’t know exactly what the end result would be (other than displaying information about the user's order), but we could make some assumptions. For one, we wanted to share as much code with the app as possible. The Shop app has a very robust UI library that we wanted to leverage, as well as a lot of business logic that handles user and order creation. Secondly, we knew that we needed basic functionality like:

    • Network calls to our GraphQL service
    • Error reporting
    • Push notifications

    Since we only wanted to determine the build size, and in the spirit of failing fast, we implemented these features without them even working. The code was added, as well as the dependencies, but the App Clip wasn’t functional at all.

    We calculated the App Clip size once again, and the result was 6.5MB. Even though it was a scrappy implementation to say the least, and there were still quite a few unknowns regarding the functionality, we knew that building it in React Native was theoretically possible and something we wanted to pursue.

    Building the App Clip

    We knew that building our App Clip with React Native was possible: our proof of concept was 6.5MB, giving us some leeway for unknowns. And with a React Native App Clip there sure were a lot of unknowns. Will sharing code between the app and the App Clip affect its size or cause any other issues? What do we do if the App Clip requires a dependency that pushes us over the 10MB limit?

    Technology Drives Design

    Given the very rigid constraints, we decided that unlike most projects where the design leads the technology, we would approach this from the opposite direction. While developing the App Clip, the technology would drive the design. If something caused us to go over, or close to, the 10MB limit we would go back to the drawing board and find alternative solutions.

    Code Sharing Between Shop App and App Clip

    With the App Clip, we wanted to give the user a quick overview of their order and the ability to receive shipping updates through push notifications. We were heavily inspired by the order view in Shop app, and the final App Clip design was a reorganized version of that.

    A screenshot showing the App Clip order page on the left and the Shop App order page on the right. Order details are more front and center in the App Clip version.
    App Clip versus Shop App

    The Shop app is structured to share as much code as possible, and we wanted to incorporate that in the App Clip. Sharing code between the two makes sense, especially when the App Clip had similar functionality as the order view in the app.

    Our first exploration was to see if it was viable to share all the code in the order view between the app and the App Clip, and modify the layout with props passed from the App Clip.

    A flow diagram showing that App Clip and Shop App share all the code for the <OrderView /> component and therefore share <ProductRow /> and <OrderHeader /> as a result.
    App Clip and Shop App share all the code in the <OrderView /> component

    We quickly realized this wasn’t viable. For one, it would add too much complexity to the order view, but mainly, any change to the order view would affect the App Clip. If a developer adds a feature to the order view, with a big dependency, the 10MB App Clip limit could be at risk.

    For a small development team, it might have been a valid approach, but at our scale it wasn’t. Making every developer responsible for the App Clip’s size limit whenever they changed the app’s main order view would go against our values around autonomy.

    We then considered building our own version of the order view in the App Clip, but sharing its sub components. This could be a viable compromise where all the logic heavy code would live in the <OrderView /> but the simple presentational components could still be shared.

    A flow diagram showing that App Clip and Shop App share subcomponents from the <OrderView />: <ProductRow /> and <OrderHeader />.
    App Clip and Shop App share subcomponents of the <OrderView /> component

    The first component we wanted to import to the App Clip was <ProductRow />, whose job is to display the product title, price and image:

    An image showing <ProductRow />, its job is to display the product title, variant, price and image
    <ProductRow /> displaying product title, price and image

    The code for this component looks like this (simplified):

    But when we imported this component into the App Clip it crashed. After some digging, we realized that our <Image /> component uses a library called react-native-fast-image. It’s a library, built with native Swift code, we use to be able to display large lists of images in a very performant way. And as mentioned previously, to keep the App Clip size down we need to explicitly define all its native dependencies in the Podfile. We hadn’t defined the native dependency for react-native-fast-image and therefore it crashed. The fix was easy though, adding the dependency enabled us to use the <ProductRow /> component:

    However, our proof of concept App Clip weighed in at 6.5MB meaning we only had 3.5MB to spare. So we knew we only wanted to add the absolute necessary dependencies, and since the App Clip would only display a handful of images we didn’t deem this library an absolute necessity.

    With this in mind, we briefly went through all the components we wanted to share with the order view. Maybe this was just a one-time thing we could create a workaround for? We discovered that the majority of the sub components of the <OrderView /> had a native dependency somewhere down the line. Upon analyzing how they would affect the App Clip size, we discovered that they would push the App Clip far north of 10MB, with one single dependency weighing in at a staggering 2.5MB.

    Standing at a Crossroad

    We now realized that sharing components between the order view in the app and the App Clip was not possible, but was that true for all code? At this stage we were standing at a crossroad. Do we want to duplicate everything? Some things? Nothing?

    To answer this question we decided to base the decision on the following principles:

    • The App Clip is an experiment: we didn’t know if it would be successful or not, so we wanted to validate this idea as fast as possible.
    • Minimal impact on other developers: we were a small team working on the App Clip, and we didn’t want to add any responsibility to the rest of the developers working on the Shop app.
    • Easy to delete: Due to the many unknowns for the success of the experiment, we wanted to double down on writing code that was easy to delete.

    With this in mind, we decided that the similarities between the order view in the app and the App Clip are purely coincidental. This change of mindset helped us move forwards very quickly.

    Build Phase

    Building the App Clip was very similar to building any other React Native app; the only real difference was that we constantly needed to keep track of its size. Since checking the size of the App Clip was very time consuming, around 25 minutes each time on our local machines, we decided to do this only when new dependencies were added, along with some ad-hoc checks from time to time.

    All the components for the App Clip were created from scratch with the exception of the usage of our shared components and functions within the Shop app. Inside our shared/ directory there are a lot of powerful foundational tools we wanted to use in the App Clip; <Box /> and <Text /> and a few others that we rely on heavily to structure our UI in the Shop app with the help of our Restyle library. We also wanted to reuse the shared hooks for setting up push notifications, creating a user, etc. As mentioned earlier, sharing code between the app and the App Clip could potentially cause issues. If a developer decides to add a new native dependency to the <Box /> or <Text /> they would, often unknowingly, affect the App Clip as well. 

    However, we deemed these shared components mature enough to not have any large changes made to them. To catch any new dependencies being added to these shared components, we wrote a CI script to detect and notify the pull request author of this.

    The script did three things:

    1. Go through the Podfile and create a list of all the native dependencies.
    2. Traverse through all imports the App Clip made and create a list of the ones that have native dependencies.
    3. Finally, compare the two lists. If they don’t match, the CI job fails with instructions on how to proceed.
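
    The three steps above can be sketched as follows. This is not our actual CI script: the function names, the regex-based Podfile parsing, the package-to-pod lookup table, and the pod and package names in the example are illustrative assumptions, and the list of imports is passed in directly rather than discovered by traversing the module graph.

```typescript
// Step 1: collect the pod names declared in a Podfile (simplified parsing).
function podsInPodfile(podfile: string): string[] {
  const pods: string[] = [];
  const pattern = /^\s*pod\s+['"]([^'"]+)['"]/gm;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(podfile)) !== null) pods.push(match[1]);
  return pods;
}

// Step 2: map the App Clip's imports to the native pods they require.
function podsNeededByImports(imports: string[], nativePodFor: Record<string, string>): string[] {
  const needed: string[] = [];
  for (const imported of imports) {
    const pod = nativePodFor[imported];
    if (pod !== undefined && needed.indexOf(pod) === -1) needed.push(pod);
  }
  return needed;
}

// Step 3: compare the two lists; a CI job would fail if either list is non-empty.
function comparePods(declared: string[], needed: string[]) {
  return {
    missing: needed.filter((pod) => declared.indexOf(pod) === -1),
    unused: declared.filter((pod) => needed.indexOf(pod) === -1),
  };
}

const podfile = `
target 'Shop App Clip' do
  pod 'React-Core'
  pod 'RNFastImage'
end
`;

const report = comparePods(
  podsInPodfile(podfile),
  podsNeededByImports(
    ["react-native", "react-native-fast-image", "react-native-reanimated"],
    {
      "react-native": "React-Core",
      "react-native-fast-image": "RNFastImage",
      "react-native-reanimated": "RNReanimated",
    },
  ),
);
```

    Here the import of an animation library introduces a native dependency that the Podfile doesn't declare, so it shows up in `report.missing` and the pull request author can decide whether the dependency is worth its size.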

    A few times we stumbled upon issues with dependencies, either our shared ones or external ones, adding some weight to the App Clip. This could be a third-party library for animations, async storage, or getting information about the device. With our “technology drives design” principle in mind, we often removed the dependencies for non-critical features, as with the animation library.

    We now felt more confident on how to think while building an App Clip and we moved fast, continuously creating and merging pull requests.

    Support Invocation URLs in the App

    The app always has precedence over the App Clip. If you invoke the App Clip by scanning a QR code but already have the app installed, the app opens instead of the App Clip. We had to build support for invocations in the app as well, so that scanning the QR code would automatically import the order even if the user has the app installed.

    React Native enables us to do this through the Linking module.

    The module allowed us to fetch the invocation URL inside the app and create the order for already existing app users. With this, we now supported importing an order by scanning a QR code both in the App Clip and the app.
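
    A minimal sketch of that flow is shown below. It relies only on `getInitialURL()`, which React Native’s Linking module provides; the `order` query parameter, the `importOrder` callback, and the `LinkingLike` interface (declared so the snippet runs without react-native installed) are assumptions for illustration.

```typescript
// The subset of React Native's Linking module used here. In an app you would
// pass the real module: import { Linking } from "react-native".
interface LinkingLike {
  getInitialURL(): Promise<string | null>;
}

/** Extracts the order identifier from an invocation URL (the `order` parameter is hypothetical). */
function orderIdFromInvocationUrl(url: string): string | null {
  const match = /[?&]order=([^&#]+)/.exec(url);
  return match ? decodeURIComponent(match[1]) : null;
}

/** Reads the URL that launched the app and imports the order if one is present. */
async function importOrderOnLaunch(
  linking: LinkingLike,
  importOrder: (orderId: string) => Promise<void>
): Promise<boolean> {
  const url = await linking.getInitialURL();
  if (url === null) return false; // the app was not opened through a link
  const orderId = orderIdFromInvocationUrl(url);
  if (orderId === null) return false; // a link, but not an order invocation
  await importOrder(orderId);
  return true;
}
```

    Keeping the URL parsing in a pure function means the same helper could be shared between the App Clip and the app.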

    Smooth Transition to the App

    The last feature we wanted to implement was a smooth transition to the app. If the user decides to upgrade from the App Clip to the full app experience, we wanted to provide a simpler onboarding experience and also magically have their order ready for them in the app. Apple provides a very nice solution to this with shared data containers which both the App Clip and the app have access to.

    Now we can store user data in the App Clip that the app has access to, providing an optimal onboarding experience if the user decides to upgrade.

    Testing the App Clip

    Throughout the development and launch of the App Clip, testing was difficult. Apple provides a great way to mock an invocation of the App Clip by hard coding the invocation URL in Xcode, but there was no way to test the full end-to-end flow of scanning the QR code, invoking the App Clip, and downloading the app. This wasn’t possible either on our local machines or on TestFlight. To verify that the flow would work as expected, we decided to release a first version of the App Clip extremely early. With the help of beta flags we made sure the App Clip could only be invoked by the team. This early release had no functionality. It only verified that the App Clip received the invocation URL and passed along the proper data to the app for a great onboarding experience. Once this flow was working, and we could trust that our local mockups worked the same as in production, testing the App Clip got a lot easier.

    After extensive testing, we felt ready to release the App Clip. The release process was very similar to that of the app, since the App Clip is bundled into the app. The only thing needed was to provide copy and image assets in App Store Connect for the invocation modal.

    Screenshot of App Store Connect screen for uploading copy and image assets.
    App Store Connect

    We approached this project with a lot of unknowns: the technology was new, and new to us. We were trying to build an App Clip with React Native, which isn’t typical! Our approach (to fail fast and iterate) worked well. Having a developer with native iOS development experience was very helpful because App Clips, even ones written in React Native, involve a lot of Apple’s tooling.

    One challenge we didn’t anticipate was how difficult it would be to share code. It turned out that sharing code introduced too much complexity into the main application, and we didn’t want to impact the development process for the entire Shop team. So we copied code where it made sense.

    Our final App Clip size was 9.1MB, just shy of the 10MB limit. Having such a hard constraint was a fun challenge. We managed to build most of what we initially had in mind, and there are further optimizations we can still make.

    Sebastian Ekström is a Senior Developer based in Stockholm who has been with Shopify since 2018. He’s currently working in the Shop Retention team.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Five Tips for Growing Your Engineering Career

    The beginning stages of a career in engineering can be daunting. You’re trying to make the most of the opportunity at your new job and learning as much as you can, and as a result, it can be hard to find time and energy to focus on growth. Here are five practical tips that can help you grow as you navigate your engineering career.

    Using Propensity Score Matching to Uncover Shopify Capital’s Effect on Business Growth

    By Breno Freitas and Nevena Francetic

    Five years ago, we introduced Shopify Capital, our data-powered product that enables merchants to access funding from right within the Shopify platform. We built it using a version of a recurrent neural network (RNN)—analyzing more than 70 million data points across the Shopify platform to understand trends in merchants’ growth potential and offer cash advances that make sense for their businesses. To date, we’ve provided our merchants with over $2.7 billion in funding.

    But how much of an impact was Shopify Capital having, really? Our executives wanted to know—and as a Data team, we were invested in this question too. We were interested in validating our hypothesis that our product was having a measurable, positive impact on our merchants.

    We’ve already delved into the impact of the program in another blog post, Digging Through the Data: Shopify Capital's Effect on Business Growth, but today, we want to share how we got our results. In this post, we’re going behind the scenes to show you how we investigated whether Shopify Capital does what we intended it to do: help our merchants grow.

    The Research Question

    What’s the impact on future cumulative gross merchandise value (that is, sales) of a shop after they take Shopify Capital for the first time?

    To test whether Shopify merchants who accepted Capital were more successful than those who didn’t, we needed to compare their results against an alternative future (the counterfactual) in which merchants who desired Capital didn’t receive it. In other words, an A/B test.

    Unfortunately, in order to conduct a proper A/B test, we would need to randomly and automatically reject half of the merchants who expressed interest in Capital for some period of time in order to collect data for proper analysis. While this makes for good data collection, it would be a terrible experience for our users and undermine our mission to help merchants grow, which we were unwilling to do.

    With Shopify Capital only being active in the US in 2019, an alternative solution would be to use Canadian merchants who didn’t yet have access to Shopify Capital (Capital launched in Canada and the UK in Spring 2020) as our “alternate reality.” We needed to seek out Canadian shops who would have used Shopify Capital if given the opportunity, but weren’t able to because it wasn’t yet available in their market.

    We can do this comparison through a method called “propensity score matching” (PSM).

    Matchmaker, Matchmaker, Make Me a Match

    In the 1980s, researchers Rosenbaum and Rubin proposed PSM as a method to reduce bias in the estimation of treatment effects with observational data sets. This is a method that has become increasingly popular in medical trials and in social studies, particularly in cases where it isn’t possible to complete a proper random trial. A propensity score is defined as the likelihood of a unit being assigned to the treatment group. In this case: What are the chances of a merchant accepting Shopify Capital if it were offered to them?

    It works like this: After propensity scores are estimated, the participants are matched with similar counterparts in the other set, as depicted below.

    Depiction of matching performed on two sets of samples based on their propensity scores.

    We’re looking for similar propensity scores, and we only analyze samples in both sets that are close enough to get a match while respecting any other constraints imposed by the selected matching methodology. This means we could even drop samples from the treatment group when matching, if their scores fall outside of the parameters we’ve set.

    Once matched, we’ll be able to determine the difference in gross merchandise value (GMV), that is, sales, between the control and treatment groups in the six months after they take Shopify Capital for the first time.

    Digging into the Data Sets

    As previously discussed, in order to do the matching, we needed two sets of participants in the experiment, the treatment group, and the control group. We decided to set our experiment for a six-month period, starting in January 2019 to remove any confounding effect of COVID-19.

    We segment our two groups as follows:

  • Treatment Group: American shops that were first-time Capital adopters in January 2019, on the platform for at least three months prior (to ensure they were established on the platform), and still Shopify customers in April 2020.
  • Control Group: Canadian shops that had been a customer for at least three months prior to January 2019 and pre-qualified for Capital in Canada when we launched it in April 2020.

    Ideally, we would have recreated underwriting criteria from January 2019 to see which Canadian shops would have pre-qualified for Capital at that time. To proxy for this, we looked at shops that remained stable until at least April 2020 in the US and Canada, and then went backwards to analyze their 2019 data.

    Key assumptions:

  • Shops in Canada didn’t take an offer for the sole reason that Capital didn’t exist in Canada at that time.
  • Shops in the US and Canada have equal access to external financing sources we can’t control (for example, small business loans).
  • The environments that Canadian and US merchants operate in are more or less the same.

    Matchmaking Methodology

    We began our matching process with approximately 8,000 control shops and about 600 treated shops. At the end of the day, our goal was to make the distributions of the propensity scores for each group of shops match as closely as possible.

    Foundational Setup

    For the next stage in our matching, we set up some features, using characteristics from within the Shopify platform to describe a shop. The literature says there’s no right or wrong way to pick characteristics—just use your discernment to choose whichever ones make the most sense for your business problem.

    We opted to use merchants’ (which we’ll refer to as shops) sales and performance in Shopify. While we have to keep the exact characteristics a secret for privacy reasons, we can say that some of the characteristics we used are the same ones the model would use to generate a Shopify Capital offer.

    At this stage, we also logarithmically transformed many of the covariates. We did this because of the wild extremes we can get in terms of variance on some of the features we were using. Transforming them to logarithmic space shrinks the variances and thus makes the linear regressions behave better (for example, to shrink large disparities in revenue). This helps minimize skew.
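
    To illustrate why the transform helps, the sketch below compares the variance of a skewed feature before and after a log transform. The revenue figures are made up, and using log1p (rather than a plain logarithm) for the covariates is an assumption made so that zero values stay defined.

```typescript
/** log1p transform: log(1 + x), defined at x = 0 and compressing large values. */
function logTransform(values: number[]): number[] {
  return values.map((v) => Math.log1p(v));
}

/** Sample variance (n - 1 denominator). */
function variance(values: number[]): number {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const squared = values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0);
  return squared / (values.length - 1);
}

// Hypothetical monthly revenue for three shops, spanning four orders of magnitude.
const revenue = [100, 1000, 1000000];
const rawVariance = variance(revenue);
const logVariance = variance(logTransform(revenue));
```

    On data like this, the single very large shop dominates the raw variance, while in log space the three shops sit only a few units apart.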

    It’s a Match!

    There are many ways we could match the participants on both sets—the choice of algorithm depends on the research objectives, desired analysis, and cost considerations. For the purpose of this study, we chose a caliper matching algorithm.

    A caliper matching algorithm is basically a nearest neighbors (NN) greedy matching algorithm where, starting from the largest score, the algorithm tries to find the closest match on the other set. It differs from a regular NN greedy algorithm as it only allows for matches within a certain threshold. The caliper defines the maximum distance the algorithm is allowed to have between matches—this is key because if the caliper is infinite, you’ll always find a neighbor, but that neighbor might be pretty far away. This means not all shops will necessarily find matches, but the matches we end up with will be fairly close. We followed Austin’s recommendation to choose our caliper width.
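
    The description above can be sketched as a short function. This is an illustration under stated assumptions, not the code used in the study: it iterates over the treated shops from the largest score down, matches without replacement, and takes the caliper as a given number. The `Shop` and `Match` types are hypothetical.

```typescript
interface Shop {
  id: string;
  score: number; // estimated propensity score (or its logit)
}

interface Match {
  treatedId: string;
  controlId: string;
  distance: number;
}

/** Greedy nearest-neighbor matching that rejects matches farther apart than the caliper. */
function caliperMatch(treated: Shop[], control: Shop[], caliper: number): Match[] {
  const available = control.slice(); // controls not matched yet
  const matches: Match[] = [];
  // Start from the largest treated score.
  const ordered = treated.slice().sort((a, b) => b.score - a.score);
  for (const shop of ordered) {
    let bestIndex = -1;
    let bestDistance = Infinity;
    for (let i = 0; i < available.length; i++) {
      const distance = Math.abs(available[i].score - shop.score);
      if (distance < bestDistance) {
        bestDistance = distance;
        bestIndex = i;
      }
    }
    // Accept the nearest neighbor only if it lies within the caliper.
    if (bestIndex >= 0 && bestDistance <= caliper) {
      const partner = available.splice(bestIndex, 1)[0]; // assumption: no replacement
      matches.push({ treatedId: shop.id, controlId: partner.id, distance: bestDistance });
    }
  }
  return matches;
}
```

    A commonly cited recommendation from Austin is a caliper of 0.2 standard deviations of the logit of the propensity score. The exact width used in this study isn’t stated here, so the caliper is left as a parameter.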

    After computing the caliper and running the greedy NN matching algorithm, we found a match among the Canadian counterparts for all but one US first-time Capital adopter.

    Matching Quality

    Before jumping to evaluate the impact of Capital, we need to determine the quality of our matching. We used the following three techniques to assess balance:

    1. Standardized mean differences: This methodology compares the averages of the distributions for the covariates for the two groups. When close to zero, it indicates good balance. Several recommended thresholds have been published in the literature with many authors recommending 0.1. We can visualize this using a “love plot,” like so:

      Love plot comparing feature absolute standardized differences before and after matching.
    2. Visual Diagnostics: Visual diagnostics such as empirical cumulative distribution plots (eCDF), quantile-quantile plots, and kernel density plots can be used to see exactly how the covariate distributions differ from each other (that is, where in the distribution are the greatest imbalances). We plot their distributions to check visually how they look pre and post matching. Ideally, the distributions are superimposed on one another after matching.

      Propensity score plots before matching - less overlap indicates fewer matches between groups.
      Propensity score plots after matching - increased overlap indicates good matches between groups.
    3. Variance Ratios: The variance ratio is the ratio of the variance of a covariate in one group to that in the other. Variance ratios close to 1 indicate good balance because they imply the variances of the samples are similar, whereas numbers close to 2 are sometimes considered extreme.

    Only one of our covariates was hitting the 0.1 threshold in the standardized mean differences method. Visual comparison (see above) showed great improvement and good alignment in covariate distributions for the matched sets. And all of our variance ratios were below 1.3.
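
    The two numeric checks can be computed per covariate as sketched below. The function names are hypothetical, and the standardized mean difference uses one common definition (difference in means divided by the square root of the average of the two group variances); the source doesn’t state which variant was used.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Sample variance (n - 1 denominator). */
function sampleVariance(values: number[]): number {
  const m = mean(values);
  return values.reduce((sum, v) => sum + (v - m) * (v - m), 0) / (values.length - 1);
}

/** Difference in means scaled by the pooled standard deviation; values near 0 indicate balance. */
function standardizedMeanDifference(treated: number[], control: number[]): number {
  const pooledSd = Math.sqrt((sampleVariance(treated) + sampleVariance(control)) / 2);
  return (mean(treated) - mean(control)) / pooledSd;
}

/** Ratio of the covariate's variance in the two groups; values near 1 indicate balance. */
function varianceRatio(treated: number[], control: number[]): number {
  return sampleVariance(treated) / sampleVariance(control);
}
```

    A covariate would then be flagged when the absolute standardized mean difference exceeds 0.1 or when the variance ratio drifts far from 1.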

    The checks above cover most of the steps recommended in the literature for making sure the matching is good enough to use in further analysis. While we could go further and keep tweaking covariates and testing different methods until a perfect matching is achieved, that would risk introducing bias and wouldn’t guarantee the assumptions would be any stronger. So, we decided to proceed with assessing the treatment effect.

    How We Evaluated Impact

    At this point, the shops were matched, we had the counterfactual and treatment group, and we knew the matching was balanced. We’d come to the real question: Is Shopify Capital impacting their sales? What’s the difference in GMV between shops who did and didn’t receive Shopify Capital? 

    In order to assess the effect of the treatment, we set up a simple regression on a binary treatment indicator: y' = β₀ + β₁ · T.

    Here, T is a binary indicator of whether the data point is for a US (treated) or Canadian (control) shop, β₀ is the intercept of the regression, and β₁ is the coefficient that shows how being on treatment influences our target on average. The target, y', is the logarithm of one plus the cumulative six-month GMV from February to July 2019 (that is, a log1p transform of six-month sales).

    Using this methodology, we found that US merchants on average had a 36% higher geometric average of cumulative six-month GMV after taking Capital for the first time than their peers in Canada.
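
    One way to read this result: with a single binary regressor, ordinary least squares gives β₀ as the mean of y' in the control group and β₁ as the difference in group means, so exp(β₁) − 1 is the relative difference in geometric means of (GMV + 1). The sketch below shows that arithmetic on made-up numbers. It assumes an unweighted matched sample, and `estimateEffect` is a hypothetical helper, not the analysis code.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Fits y' = b0 + b1 * T for binary T, where y' = log1p(GMV). */
function estimateEffect(treatedGmv: number[], controlGmv: number[]) {
  const beta0 = mean(controlGmv.map((gmv) => Math.log1p(gmv))); // control-group mean of y'
  const beta1 = mean(treatedGmv.map((gmv) => Math.log1p(gmv))) - beta0; // difference in means
  // Back-transform: relative difference in geometric means of (GMV + 1).
  return { beta0, beta1, relativeLift: Math.expm1(beta1) };
}

// Made-up six-month GMV values, chosen so the lift comes out at 36%.
const example = estimateEffect([135, 135], [99, 99]);
```

    In this toy example β₁ equals ln(1.36), which back-transforms to a 36% lift.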

    How Confident Are We in Our Estimated Treatment Effect? 

    In order to make sure we were confident in the treatment effect we calculated, we ran several robustness checks. We won’t get into the details, but we used the margins package, simulated an A/A test to validate our point estimate, and followed Greifer’s proposed method for bootstrapping.

    Cumulative geometric average of sales between groups before and after taking their first round of Capital.

    Our results show that the 95% confidence interval for the average increase in the target, after taking Capital for the first time, is between 13% and 65%. The most important takeaway is that the lower bound is positive—so we can say with high confidence that Shopify Capital has a positive effect on merchants’ sales.

    Final Thoughts

    With high statistical significance, backed by robustness checks, we concluded that the average difference in the geometric mean of GMV in the following six months after adopting Shopify Capital for the first time is +36%, bounded by +13% and +65%. We can now say with confidence that Shopify Capital does indeed help our merchants—and not only that, but it validates the work we’re doing as a data team. Through this study, we were able to prove that one of our first machine learning products has a significant real-world impact, making funding more accessible and helping merchants grow their businesses. We look forward to continuing to create innovative solutions that help our merchants achieve their goals.

    Breno Freitas is a Staff Data Scientist working on Shopify Capital Data and a machine learning researcher at Federal University of Sao Carlos, Brazil. Breno has worked with Shopify Capital for over four years and currently leads a squad within the team. Currently based in Ottawa, Canada, Breno enjoys kayaking and working on DIY projects in his spare time.

    Nevena Francetic is a Senior Data Science Manager for Money at Shopify. She’s leading teams that use data to power and transform financial products. She lives in Ottawa, Ontario and in her spare time she spoils her little nephews. To connect, reach her on LinkedIn.

    If you’re passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Building Blocks of High Performance Hydrogen-powered Storefronts

    The future of commerce is dynamic, contextual, and personalized. Hydrogen is a React-based framework for building custom and creative storefronts giving developers everything they need to start fast, build fast, and deliver the best personalized and dynamic buyer experiences powered by Shopify’s platform and APIs. We’ve built and designed Hydrogen to meet the three needs of commerce:

    1. fast user experience: fast loading and responsive
    2. best-in-class merchant capabilities: personalized, contextual, and dynamic commerce
    3. great developer experience: easy, maintainable, and fun.
    A visualization of a .tsx file showing the ease of adding an Add to Cart button to a customized storefront
    Hydrogen provides optimized React components enabling you to start fast.

    These objectives have an inherent tension that’s important to acknowledge. You can achieve fast loading through static generation and edge delivery, but you must forgo personalization or make it a client-side concern that results in a deferred display of critical content. Conversely, rendering dynamic responses from the server implies a slower initial render but, when done correctly, can deliver a better commerce and shopping experience. However, delivering efficient streaming server-side rendering for React-powered storefronts, along with smart server and client caching, is a non-trivial and unsolved developer experience hurdle for most teams.

    Hydrogen is built and optimized to power personalized, contextual, and dynamic commerce. Fast and efficient server-side rendering with high-performance storefront data access is the prerequisite for such experiences. To optimize the user experience, we leverage a collection of strategies that work together:

    • streaming server-side rendering
    • React Server Components
    • efficient data fetching, colocation, and caching
    • combining the best of dynamic and edge serving.

    There’s a lot to unpack here, so let’s take a closer look at each one.

    Streaming Server-side Rendering

    Consider a product page that contains a significant amount of buyer personalized content: a localized description and price for a given product, a dynamic list of recommended products powered by purchase and navigation history, a custom call to action (CTA) or promotion banner, and the assignment to one or several multivariate A/B tests.

    A client-side strategy would, likely, result in a fast render of an empty product page skeleton, with a series of post-render, browser-initiated fetches to retrieve and render the required content. These client-initiated roundtrips quickly add up to a subpar user experience.

    A visualization showing the differences between Client-side Rendering and Server-side Rendering
    Client-side rendering vs. server-side rendering

    The client-side rendering (CSR) strategy typically results in a delayed display of critical page content, that is, a slow largest contentful paint (LCP). An alternative strategy is server-side rendering (SSR): fetch the data on the server and return it in the response. This helps eliminate round trips (RTTs) and allows the first and largest contentful paints to fire close together, but at the cost of a slow time-to-first-byte (TTFB) because the server is blocked on the data. This is where and why streaming SSR is a critical optimization.

    A visualization showing how Streaming Server-side Rendering unlocks critical performance benefits.
    Streaming server-side rendering unlocks fast, non-blocking first render

    Hydrogen adopts the new React 18 alpha streaming SSR API powered by Suspense that unlocks critical performance benefits:

    • Fast TTFB: the server streams the HTML page shell to the browser without blocking on the server-side data fetch. This is in contrast to “standard” SSR where TTFB is blocked until all data queries are resolved.
    • Progressive hydration: as server-side data fetches are resolved, the data is streamed within the HTML response, and the React runtime progressively hydrates the state of each component, all without extra client round trips or blocking on rendering the full component tree. This also means that individual components can show custom loading states as the page is streamed and constructed by the browser.

    The ability to stream and progressively hydrate and render the application unlocks fast TTFB and eliminates the client-side waterfall of CSR—it’s a perfect fit for the world of dynamic and high-performance commerce.
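
    The ordering that streaming makes possible can be shown with a toy model. The sketch below is not React’s streaming API or Hydrogen’s implementation: `streamPage`, `Section`, and the `<template>` chunk format are hypothetical, and `write` stands in for writing to the HTTP response.

```typescript
interface Section {
  id: string;
  load: () => Promise<string>; // resolves to the section's HTML once its data is ready
}

/** Writes the page shell first, then each section as soon as its data resolves. */
async function streamPage(sections: Section[], write: (chunk: string) => void): Promise<void> {
  // 1. Flush the shell immediately, with a loading state per section (fast TTFB).
  const placeholders = sections.map((s) => `<div id="${s.id}">Loading...</div>`).join("");
  write(`<main>${placeholders}</main>`);

  // 2. Start every data fetch in parallel and flush each section when it is ready,
  //    without waiting for the slowest one.
  await Promise.all(
    sections.map(async (s) => {
      const html = await s.load();
      write(`<template data-for="${s.id}">${html}</template>`);
    })
  );
}
```

    With “standard” SSR, the equivalent function would await every `load()` before the first `write`, which is exactly the blocked TTFB described above.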

    React Server Components

    “Server Components allow developers to build apps that span the server and client, combining the rich interactivity of client-side apps with the improved performance of traditional server rendering.”
        —RFC: React Server Components

    Server components are another building block that we believe is critical to delivering high-performance storefronts, and one we’ve been collaborating on with the React core team. React Server Components (RSC) enable a separation of concerns between client and server logic and components, which brings a host of downstream benefits:

    • server-only code that has zero impact on client bundle size
    • server-side access to custom and private server-side data sources
    • seamless integration and well-defined protocol for server+client components
    • streaming rendering and progressive hydration
    • subtree and component-level updates that preserve client state
    • server and client code sharing where appropriate.
    A home.server.jsx file that has been highlighted to show where code sharing happens, the server-side data fetch, and the streaming server-side response.

    Server components are a new building block for most React developers and have a learning curve, but, after working with them for the last ten months, we’re confident in the architecture and performance benefits that they unlock. If you haven’t already, we encourage you to read the RFC, watch the overview video, and dive into Hydrogen docs on RSC.

    Efficient Data Fetching, Colocation, and Caching

    Delivering fast server-side responses requires fast and efficient first party (Shopify) and third party data access. When deployed on Oxygen (a distributed, Shopify-hosted, V8 Isolate-powered worker runtime), the Hydrogen server components query the Storefront API with localhost speed: store data is colocated and milliseconds away. For third party fetches, the runtime exposes the standard Fetch API enhanced with smart cache defaults and configurable caching strategies:

    • smart default caching policy: key generation and cache TTLs
    • ability to override and customize cache keys, TTLs, and caching policies
    • built-in support for asynchronous data refresh via stale-while-revalidate.

    To learn more, see our documentation on useShopQuery for accessing Shopify data, and fetch policies and options for efficient data fetching.
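
    To make the stale-while-revalidate idea concrete, here is a minimal in-memory sketch. It isn’t how Hydrogen or Oxygen implement caching: the `SwrCache` class, its parameters, and the injectable clock are assumptions for illustration.

```typescript
interface Entry<T> {
  value: T;
  storedAt: number;
}

class SwrCache<T> {
  private entries = new Map<string, Entry<T>>();

  constructor(
    private maxAgeMs: number, // how long an entry counts as fresh
    private staleMs: number, // extra window in which stale data may still be served
    private now: () => number = Date.now
  ) {}

  async get(key: string, fetcher: () => Promise<T>): Promise<T> {
    const entry = this.entries.get(key);
    const age = entry ? this.now() - entry.storedAt : Infinity;
    if (entry && age <= this.maxAgeMs) return entry.value; // fresh: no fetch
    if (entry && age <= this.maxAgeMs + this.staleMs) {
      // Stale but usable: answer right away and refresh in the background.
      void this.refresh(key, fetcher).catch(() => undefined);
      return entry.value;
    }
    return this.refresh(key, fetcher); // miss or too old: wait for the fetch
  }

  private async refresh(key: string, fetcher: () => Promise<T>): Promise<T> {
    const value = await fetcher();
    this.entries.set(key, { value, storedAt: this.now() });
    return value;
  }
}
```

    The caller only waits on the network for a cold or expired key. Within the stale window, the response is immediate and the next request sees the refreshed value.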

    Combining the Best of Dynamic and Edge Serving

    Adopting Hydrogen doesn’t mean all data must be fetched from the server. On the contrary, it’s good practice to defer or lazy-load non-critical content on the client. Below-the-fold or non-critical content can be loaded on the client using regular React patterns and browser APIs, for example, by using IntersectionObserver to determine when content is on or soon to be on screen and loading it on demand.

    Similarly, there’s no requirement that all requests are server-rendered. Pages and subrequests with static or infrequently updated content can be served from the edge. Hydrogen is built to give developers the flexibility to deliver the critical personalized and contextual content, rendered by the server, with the best possible performance while still giving you full access to the power of client-side fetching and interactivity of any React application.

    The important consideration isn’t which architecture to adopt, but when you should be using server-side rendering, client-side fetching, and edge delivery to provide the best commerce experience—a decision that can be made at a page and component level.

    For example, an about page or a marketing page that’s typically static can and should be safely cached, served directly from the CDN edge, and asynchronously revalidated with the help of a stale-while-revalidate strategy. The opt-in to edge serving is a few keystrokes away for any response on a Hydrogen storefront. This capability, combined with granular and optimized subrequest caching (powered by the fetch API we covered above), gives full control over data freshness and the revalidation strategy.
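
    At the HTTP level, such a per-page decision usually comes down to the Cache-Control header on the response. The sketch below builds standard header values only. It isn’t Hydrogen’s API, and the `PageKind` type, the function name, and the default durations are hypothetical.

```typescript
type PageKind = "static" | "personalized";

/** Returns a Cache-Control value for a page; the durations are illustrative defaults. */
function cacheControlFor(kind: PageKind, maxAgeSeconds = 3600, staleSeconds = 86400): string {
  if (kind === "personalized") {
    // Per-buyer responses should not be stored by shared caches such as a CDN.
    return "private, no-store";
  }
  // Static pages: serve from the edge, and revalidate asynchronously once stale.
  return `public, max-age=${maxAgeSeconds}, stale-while-revalidate=${staleSeconds}`;
}
```

    An about page would use the "static" policy, while a page rendered for a specific buyer would use "personalized".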

    Putting It All Together

    Delivering a high-performance, dynamic, contextual, and personalized commerce experience requires optimizations at each layer of the stack. Historically, this has been the domain of a few, well-resourced engineering teams. The goal of Hydrogen and Oxygen is to level the playing field:

    • the framework abstracts all the streaming
    • the components are tuned to speak to Shopify APIs
    • the Oxygen runtime colocates and distributes rendering around the globe.

    Adopting Hydrogen and Oxygen should, we hope, enable developers to focus on building amazing commerce experiences, instead of the undifferentiated technology plumbing and production operations to power a modern and resilient storefront.

    Take Hydrogen out for a spin, read the docs, leave feedback. Let’s build.

    Ilya Grigorik is a Principal Engineer at Shopify and author of High Performance Browser Networking (O'Reilly), on a mission to supercharge commerce and empower entrepreneurs around the world.

    The Vitality of Core Web Vitals

    In 2020, Google introduced unified guidance for great user experience (UX) on the web called Core Web Vitals. It proposes evaluating specific metrics to put numerical estimates on such a multifaceted discipline. Current metrics focus on loading, interactivity, and visual stability. You might think, “Nice stuff, thank you, Google. I’ll save this to my bookmarks and look into it once the time comes for nice-to-have investigations!” But before deciding, have a closer look at the following. This year, Google added the Core Web Vitals metrics to the Search ranking algorithm as one of the factors. To be precise, the rollout of page experience in ranking systems began in mid-June of 2021 and completed at the end of August.

    Does that mean we should already notice a completely different ranking of Google Search results in September? Or, the horror case, that our websites will be shown on the s-e-c-o-n-d search engine results page (SERP)? Drastic changes won’t appear overnight, but the update will undoubtedly influence the future of ranking. First of all, the usability of web pages is only one factor that influences ranking. The meaning of the query, the relevance of a page, the quality of sources, context, and settings are other “big influencers” deciding the final results. Secondly, most websites are in the same boat, getting “not great, not terrible” grades. According to the Google Core Web Vitals Study from April 2021, only four percent of all studied websites are prepared for the update and have a good ranking in all three metrics. It’s good timing for companies to invest in the necessary improvements and easily stand out among other websites. Lastly, user expectations continue to rise, and Google has a responsibility to help users reach relevant search results. At the same time, Google pushes the digital community to prioritize UX because that helps to keep users on their websites. Google’s study shows that visitors are 24% less likely to abandon websites that meet the proposed metric thresholds.

    Your brain is most likely filling with dopamine at the thought of possible UX improvements to your website. Let’s use that momentum and dig deeper into each metric of Core Web Vitals.

    Core Web Vitals Metrics

    Core Web Vitals is a subset of Web Vitals, the unified guidance for indicating great UX. The core metrics highlight the ones that matter most. The metrics aren’t written in stone! They represent the best available indicators developers have today, so expect future improvements or additions.

    The current set of metrics is largest contentful paint (LCP), first input delay (FID), and cumulative layout shift (CLS).

    An image showing the three Core Web Vitals and the four Other Web Vitals
    Listed metrics of Web Vitals: mobile-friendly, safe browsing, HTTPS, no intrusive interstitials, loading, visual stability, and interactivity. The last three belong to Core Web Vitals.

    Largest Contentful Paint

    LCP measures the time to render the largest element in the currently visible part of the page (the viewport). The purpose is to measure how quickly the main content is ready for the user. Discussions and research led to treating the largest element in the viewport, such as an image or text block, as the main content. The elements considered are:

    • <img>
    • <image> inside an <svg> (Note: <svg> itself currently is not considered as a candidate)
    • <video>
    • an element with a background image loaded via the url() function
    • block-level elements containing text nodes or other inline-level text element children.

    During the page load, the largest element in a viewport is detected as a candidate of LCP. It might change until the page is fully loaded. In example A below, the candidate changed three times since larger elements were found. Commonly, the LCP is the last loaded element, but that’s not always the case. In example B below, the paragraph of text is the largest element displayed before a page loads an image. Comparing the two, example B has a better LCP score than example A.
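
    The candidate logic can be modeled in a few lines. This is a simplified sketch, not the browser’s algorithm (which, for instance, has its own rules for when to stop reporting candidates): the `PaintCandidate` type, the element names, and all timings and sizes are made up to mirror examples A and B.

```typescript
interface PaintCandidate {
  element: string;
  renderTimeMs: number;
  sizePx: number; // visible area of the element in the viewport
}

/** Walks the paints in render order; a new candidate replaces the old one only if it is larger. */
function largestContentfulPaint(paints: PaintCandidate[]): PaintCandidate | null {
  const inOrder = paints.slice().sort((a, b) => a.renderTimeMs - b.renderTimeMs);
  let candidate: PaintCandidate | null = null;
  for (const paint of inOrder) {
    if (candidate === null || paint.sizePx > candidate.sizePx) candidate = paint;
  }
  return candidate;
}

// Example A: each later element is larger, so the candidate changes until the image loads.
const exampleA = largestContentfulPaint([
  { element: "h1", renderTimeMs: 200, sizePx: 1000 },
  { element: "p", renderTimeMs: 500, sizePx: 5000 },
  { element: "img", renderTimeMs: 1800, sizePx: 40000 },
]);

// Example B: the paragraph is the largest element and renders before the image.
const exampleB = largestContentfulPaint([
  { element: "p", renderTimeMs: 600, sizePx: 30000 },
  { element: "img", renderTimeMs: 2000, sizePx: 10000 },
]);
```

    In this model, example A reports the image at 1800 ms as its LCP, while example B reports the paragraph at 600 ms even though the page finishes loading later.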

    An image depicting the differences between LCP being the last loaded element and LCP occurring before the page is fully loaded.
    LCP detection in two examples: LCP is the last loaded element on a page (A), LCP occurs before the page is fully loaded (B).
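    The candidate logic in the two examples can be sketched as follows. This is only an illustration of the idea, not how a browser implements it: the PaintedElement shape, the element names, and the sizes and timings are all made up for the example.

    ```typescript
    // Hypothetical record of an element painted in the viewport.
    interface PaintedElement {
      name: string;
      renderTime: number; // milliseconds since navigation start
      size: number; // visible area in square pixels
    }

    // Walk through the paints in time order. Every element that is larger than
    // anything painted before it becomes the new LCP candidate.
    function lcpCandidates(elements: PaintedElement[]): PaintedElement[] {
      const byTime = [...elements].sort((a, b) => a.renderTime - b.renderTime);
      const candidates: PaintedElement[] = [];
      let largest = 0;
      for (const el of byTime) {
        if (el.size > largest) {
          largest = el.size;
          candidates.push(el);
        }
      }
      return candidates;
    }

    // The reported LCP is the render time of the last candidate.
    function lcp(elements: PaintedElement[]): number | undefined {
      const candidates = lcpCandidates(elements);
      const last = candidates[candidates.length - 1];
      return last ? last.renderTime : undefined;
    }

    // Example A: each later element is larger, so the candidate keeps changing.
    const exampleA: PaintedElement[] = [
      { name: "heading", renderTime: 300, size: 2000 },
      { name: "paragraph", renderTime: 500, size: 8000 },
      { name: "image", renderTime: 1800, size: 30000 },
    ];

    // Example B: the early paragraph stays the largest element.
    const exampleB: PaintedElement[] = [
      { name: "paragraph", renderTime: 400, size: 20000 },
      { name: "image", renderTime: 1500, size: 9000 },
    ];
    ```

    With these made-up numbers, lcp(exampleA) is the render time of the image, while lcp(exampleB) is the render time of the paragraph, which is why example B scores better.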

    To get a score that indicates good UX, a website should have an LCP of 2.5 seconds or less. But… Why 2.5? The inspiration was taken from studies by Stuart K. Card and Robert B. Miller, who found that a user will wait roughly 0.3 to 3 seconds before losing focus. In addition, data gathered about top-performing sites across the Web showed that such a limit is consistently achievable for well-optimized sites.

    Metric   Good        Poor
    LCP      <= 2.5s     > 4s
    FID      <= 100ms    > 300ms
    CLS      <= 0.1      > 0.25

    The thresholds of “good” and “poor” Core Web Vitals scores. The scores in between are considered “needs improvement”.
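    The thresholds above translate directly into a small lookup. The sketch below is one possible way to encode them; the names THRESHOLDS and rate are made up for this example, and LCP and FID values are assumed to be in milliseconds.

    ```typescript
    type Metric = "LCP" | "FID" | "CLS";
    type Rating = "good" | "needs improvement" | "poor";

    // Thresholds from the table above.
    // LCP and FID are expressed in milliseconds; CLS is unitless.
    const THRESHOLDS: Record<Metric, { good: number; poor: number }> = {
      LCP: { good: 2500, poor: 4000 },
      FID: { good: 100, poor: 300 },
      CLS: { good: 0.1, poor: 0.25 },
    };

    // "good" at or below the lower bound, "poor" above the upper bound,
    // and "needs improvement" for everything in between.
    function rate(metric: Metric, value: number): Rating {
      const { good, poor } = THRESHOLDS[metric];
      if (value <= good) return "good";
      if (value > poor) return "poor";
      return "needs improvement";
    }
    ```

    For example, rate("LCP", 3000) falls between the two bounds and is rated “needs improvement”.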

    First Input Delay

    FID quantifies the user’s first impression of the responsiveness and interactivity of a page. To be precise, it measures how long the browser takes to become available to respond to the user’s first interaction on a page. For instance, it’s the time between when the user clicks on the “open modal” button and when the browser is ready to trigger the modal opening. You may wonder, shouldn’t the code be executed immediately after the user’s action? Not necessarily. During page load, the browser’s main thread is super busy parsing and executing loaded JavaScript (JS) files, so incoming events might have to wait before they’re processed.

    A visualization of a browser loading a webpage showing that FID is the time between the user's first interaction and when the browser can respond
    FID represents the time between when a browser receives the user’s first interaction and when it can respond to it.

    FID measures only the delay in event processing. The time to process the event and update the UI afterwards was deliberately excluded to discourage workarounds like moving event processing logic to asynchronous callbacks. Such a workaround would improve the metric score because it separates processing from the task associated with the event. Sadly, it wouldn’t bring any benefit to the user; likely the opposite.
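    In code, the delay described above is a single subtraction. The sketch below uses a plain object whose field names mirror the startTime, processingStart, and processingEnd fields that browsers expose on event timing entries; the FirstInputTiming type and the sample numbers are made up for the example.

    ```typescript
    // Hypothetical shape of a first-input timing record (all times in milliseconds).
    interface FirstInputTiming {
      startTime: number; // when the user interacted
      processingStart: number; // when the browser could start running handlers
      processingEnd: number; // when the handlers finished
    }

    // FID covers only the waiting time before processing starts.
    function firstInputDelay(entry: FirstInputTiming): number {
      return entry.processingStart - entry.startTime;
    }

    // Handler duration is deliberately not part of FID.
    function processingTime(entry: FirstInputTiming): number {
      return entry.processingEnd - entry.processingStart;
    }

    const click: FirstInputTiming = {
      startTime: 1000,
      processingStart: 1180,
      processingEnd: 1400,
    };
    ```

    Here the click waited 180 ms before the browser could respond, which is the FID, while the 220 ms spent in the handlers is not counted.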

    User interactions require the main thread to be idle even when no event listener is registered. For example, the main thread might delay the user’s interaction with the following HTML elements until it completes ongoing tasks:

    • text fields, checkboxes, and radio buttons (<input>, <textarea>)
    • select dropdowns (<select>)
    • links (<a>).

    In his book Usability Engineering, Jakob Nielsen wrote: “0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result”. Despite being first described in 1993, the same limit is considered good in Core Web Vitals nowadays.

    Cumulative Layout Shift

    CLS measures visual stability by quantifying how much the visible content shifts around. Layout shifts, as defined by the Layout Instability API, occur when existing visible elements change their start position. Note that when a new element is added to the DOM or an existing element changes size, it doesn’t count as a layout shift as long as it doesn’t cause other visible elements to change their start position!

    The metric is named “cumulative” because the scores of individual shifts are summed. In June 2021, the CLS calculation was changed for long-lived pages (for example, SPAs and infinite scroll apps): layout shifts are now grouped, which ensures the score doesn’t grow unbounded.

    Are all layout shifts bad? No. CLS focuses only on unexpected ones. Layout shifts that occur within 500 milliseconds of a user’s interaction (for example, clicking on a link or typing in a search box) are considered expected and are excluded from CLS score calculations. Knowing this, you can reserve extra space and show a loading state immediately after the user’s input for tasks that take longer to complete.
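    The grouping and the exclusion of expected shifts can be sketched as follows. The sketch assumes the session window rule used since the June 2021 change: shifts less than 1 second apart are grouped into a window of at most 5 seconds, and the largest window is reported. The LayoutShift shape below is a simplified stand-in for the browser entry, with hadRecentInput marking shifts that followed user input.

    ```typescript
    // Simplified stand-in for a layout shift entry.
    interface LayoutShift {
      startTime: number; // milliseconds
      value: number; // layout shift score of this single shift
      hadRecentInput: boolean; // true when the shift followed recent user input
    }

    const MAX_GAP_MS = 1000; // a longer pause starts a new window
    const MAX_WINDOW_MS = 5000; // a window never spans more than this

    function cumulativeLayoutShift(shifts: LayoutShift[]): number {
      // Expected shifts (those right after user input) are ignored.
      const unexpected = shifts
        .filter((shift) => !shift.hadRecentInput)
        .sort((a, b) => a.startTime - b.startTime);

      let worst = 0;
      let windowScore = 0;
      let windowStart = 0;
      let previous = 0;
      let open = false;

      for (const shift of unexpected) {
        const startsNewWindow =
          !open ||
          shift.startTime - previous >= MAX_GAP_MS ||
          shift.startTime - windowStart >= MAX_WINDOW_MS;
        if (startsNewWindow) {
          windowStart = shift.startTime;
          windowScore = 0;
          open = true;
        }
        windowScore += shift.value;
        previous = shift.startTime;
        worst = Math.max(worst, windowScore);
      }
      return worst;
    }
    ```

    Because only the worst window counts, a page that stays open for hours is not penalized for the sum of every small shift that ever happened on it.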

    Let’s use a tiny bit of math to calculate the layout shift score of the following example:

    1. Impact fraction describes how much of the viewport an unstable element takes up. When an element covers 60% of the viewport, its impact fraction is 0.6.
    2. Distance fraction describes how far an unstable element moves from its original to its final position, relative to the viewport. When an element moves by 25% of the viewport height, its distance fraction is 0.25.
    3. Using the formula layout_shift_score = impact_fraction * distance_fraction, the layout shift score of this example is 0.6 * 0.25 = 0.15.
    A visualization of two mobile screens showing the 0.15 layout shift score. The second mobile screen shows the page after the layout shift
    The example of 0.15 layout shift score in a mobile view.
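    The arithmetic from the three steps above fits in one function. This is only a sketch of the formula as stated in this article; the function name and the range check are made up for the example.

    ```typescript
    // layout_shift_score = impact_fraction * distance_fraction
    // Both inputs are fractions of the viewport, so they must be between 0 and 1.
    function layoutShiftScore(impactFraction: number, distanceFraction: number): number {
      for (const fraction of [impactFraction, distanceFraction]) {
        if (fraction < 0 || fraction > 1) {
          throw new RangeError("fractions must be between 0 and 1");
        }
      }
      return impactFraction * distanceFraction;
    }

    // The example above: the element covers 60% of the viewport
    // and moves by 25% of the viewport height.
    const exampleScore = layoutShiftScore(0.6, 0.25);
    ```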

    A good CLS score is considered to be 0.1 or less for a page. An evaluation of real-world pages revealed that shifts at such scores are still detectable but not excessively disruptive. Shifts scoring 0.15 and above, by contrast, were consistently disruptive.

    How to Measure the Score of My Web Page?

    There are many tools for measuring Core Web Vitals for a page. They reflect two main measurement techniques: in the lab or in the field.

    In the Lab

    Lab data, also known as synthetic data, is collected from a simulated environment without a user. Measurements in such an environment can be taken before features are released to production. Be aware that FID can’t be measured this way, because lab data doesn’t contain the required real user input. As an alternative, it’s suggested to track its proxy, Total Blocking Time (TBT).
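    As a rough sketch of what TBT adds up: a main-thread task longer than 50 ms is considered a long task, and only the part beyond 50 ms counts as blocking. Lab tools apply this to tasks between First Contentful Paint and Time to Interactive; the function below skips that windowing and simply assumes it is given the durations of the relevant tasks.

    ```typescript
    const LONG_TASK_THRESHOLD_MS = 50;

    // Sum the blocking portion (anything beyond 50 ms) of each task.
    function totalBlockingTime(taskDurationsMs: number[]): number {
      return taskDurationsMs.reduce(
        (total, duration) => total + Math.max(0, duration - LONG_TASK_THRESHOLD_MS),
        0,
      );
    }

    // Made-up task durations: only the 120 ms and 250 ms tasks block input.
    const blocking = totalBlockingTime([30, 120, 50, 250]); // 70 + 200 = 270
    ```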


    • Lighthouse: I think it is the most comprehensive tool using lab data. It can be executed either on public or authenticated web pages. The generated report indicates the scores and suggests personalised opportunities to improve the performance. The best part is that Chrome users already have this tool ready to be used under DevTools. One drawback I noticed while using the tool is that the page must be visible on screen during the measurement process. Thus, you can’t analyze several pages in parallel in the same browser. Lastly, Lighthouse can be incorporated into continuous integration workflows via Lighthouse CI.
    • WebPageTest: The tool can perform analyses for public pages. I was tricked by the tool when I provided a URL of the authenticated page for the first time. I got results. Better than I expected. Just before patting myself on the back, I decided to dig deeper into a waterfall view. The view showed clearly that the authenticated page wasn’t even reached and the test was redirected to a public login page. Despite that, the tool has handy options to test against different locations, browsers, and device emulators. This might help you identify which country or state has the most trouble and prompt you to start thinking about a Content Delivery Network (CDN). Finally, be aware that the report includes detailed analyses but doesn’t provide advice for improvements.
    • Web Vitals extension: It's the most focused tool of all. It contains only metrics and scores for the currently viewed page. In addition, the tool shows how it calculates scores in real-time. For example, FID is shown as “disabled” until you interact with the page.

    In the Field 

    A site’s performance can vary dramatically based on a user’s personalized content, device capabilities, and network conditions. Real User Monitoring (RUM) captures the reality of page performance, including the mentioned differences. Monitoring data shows the performance experienced by a site’s actual users. On the other hand, there’s a way to check the real-world performance of a site without a RUM setup. Chrome User Experience Report gathers and aggregates UX metrics across the public web from opted-in users. Such findings power the following tools:

    • Chrome UX Report Compare Tool (CrUX): As the name suggests, the tool is meant for comparing pages. The report includes metrics and scores for the selected device groups: desktop, tablet, or mobile. It is a great option to compare your site with similar pages of your competitors.
    • PageSpeed Insights: The tool provides detailed analyses for URLs that are known to Google’s web crawlers. In addition, it highlights the opportunities for improvements.
    • Search Console: The tool reports performance data per page, including historical data. Before using it, you must verify ownership of the site.
    • Web Vitals extension: The tool was mentioned among the lab tools, but there’s one more feature to reveal. For pages in which field data is available via Chrome UX Report, lab data (named “local” in the extension) is combined with real-user data from the field. This integration might indicate how similar your individual experiences are to those of other website users.

    CrUX-based tools are great, quick starting points for investigations. Despite that, your own RUM data can provide more detailed and immediate feedback. Setting up RUM for a website might look scary at the beginning, but it usually takes these steps:

    1. To send data from a website, a developer adds a RUM JavaScript snippet to the source code.
    2. Once the user interacts or leaves the website, the data about an experience is sent to a collector. This data is processed and stored in a database that anyone can view via convenient dashboards.
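    The two steps above can be sketched as a tiny reporter. This is an illustration under stated assumptions, not a real RUM product: createRumReporter, the payload shape, and the "/rum" collector URL are all hypothetical. The transport is injected so the sketch runs anywhere; in a browser it could be a thin wrapper around navigator.sendBeacon, called when the page becomes hidden.

    ```typescript
    interface MetricSample {
      name: "LCP" | "FID" | "CLS";
      value: number;
    }

    // Hypothetical payload shape that a collector might accept.
    interface RumPayload {
      page: string;
      samples: MetricSample[];
    }

    // Sends a serialized payload and reports whether it was queued.
    // In a browser: (url, body) => navigator.sendBeacon(url, body)
    type Transport = (url: string, body: string) => boolean;

    function createRumReporter(collectorUrl: string, page: string, transport: Transport) {
      const samples: MetricSample[] = [];
      return {
        // Step 1: the snippet on the page records metric samples as they arrive.
        record(sample: MetricSample): void {
          samples.push(sample);
        },
        // Step 2: when the user leaves, everything recorded so far is sent at once.
        flush(): boolean {
          if (samples.length === 0) return true;
          const payload: RumPayload = { page, samples: samples.splice(0) };
          return transport(collectorUrl, JSON.stringify(payload));
        },
      };
    }
    ```

    Batching the samples and sending them once keeps the number of requests low, which matters because the reporter itself shouldn’t hurt the performance it measures.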

    How to Improve

    Core Web Vitals provides insights into what’s hurting the UX. For example, setting up RUM even for a few hours can reveal where the most significant pain points exist. The worst scoring metrics and pages can indicate where to start searching for improvements. Other tools mentioned in the previous section might suggest how to fix the specific issues. The great thing is that changes made to improve one metric will likely improve the other scores too.

    Many of these hints and bits of advice may sound like coins in a Super Mario game, hanging there for you to grab. That isn’t the case. The hard and sweaty work remains on your table! Not all opportunities are straightforward to implement. Some might involve big and long-lasting refactoring that can’t be done in one go or that requires preparation first. With that in mind, here are several strategies to start exploring:

    1. Update third-party libraries. After reviewing your application’s libraries, you might find that some are no longer used or that lighter alternatives (covering the same use case) exist. Also, sometimes only a part of an included library is actually used, which means a portion of JS code is loaded for no purpose at all. Tree-shaking could solve this issue: it lets the bundle include only the specific features you import from a library instead of everything. Be aware that not all libraries support tree-shaking yet, but it’s getting more and more popular. Updating application dependencies may sound like a small help, but let’s look at an example. During Shopify’s internal Hack Days, my team made the mentioned updates for our dropshipping app Oberlo. It decreased the compressed bundle size of the application by 23%! How long did it take for research and development? Less than three days.
      This improves FID and LCP.
    2. Preload critical assets. The loading process might be extended due to the late discovery of crucial page resources by the browser. By noting which resources can be fetched as soon as possible, the loading can be improved drastically. For example, Shopify noticed a 50% (1.2 seconds) improvement in time-to-text-paint by preloading Web Fonts.
      This improves FID and LCP.
    3. Review your server response time. If you’re experiencing severe delays, you may try the following:
      a) use a dedicated server instead of a shared one for web hosting
      b) route the user to a nearby CDN
      c) cache static assets
      d) use service workers to reduce the amount of data users need to request from a server.
      This improves FID and LCP.
    4. Optimize heavy elements. Firstly, shorten the loading and rendering of critical resources by implementing a lazy-loading strategy. It defers the loading of large elements, like images below the page viewport, until they’re required by the user. Do not add lazy loading to elements in the initial viewport because the LCP element should be loaded as fast as possible! Secondly, compress images to have fewer bytes to download. Images don’t always require high quality or resolution and can be downgraded intentionally without affecting the user. Lastly, provide dimensions as width and height attributes or aspect-ratio boxes. This ensures that the browser can allocate the correct amount of space in the document while the image is loading, which avoids unexpected layout shifts.
      This improves FID, LCP, and CLS.
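    Two of the strategies above, preloading critical assets and handling images carefully, come down to a few HTML attributes. The helpers below are a hypothetical sketch that builds such markup as strings; the function names and file paths are made up, and the inputs are assumed to be trusted since no escaping is done.

    ```typescript
    interface ImageOptions {
      src: string;
      alt: string;
      width: number; // explicit dimensions let the browser reserve space (helps CLS)
      height: number;
      aboveTheFold: boolean; // true for images in the initial viewport
    }

    // Lazy-load only images below the initial viewport, so that a possible
    // LCP element is never deferred.
    function imageTag(options: ImageOptions): string {
      const loading = options.aboveTheFold ? "eager" : "lazy";
      return (
        `<img src="${options.src}" alt="${options.alt}" ` +
        `width="${options.width}" height="${options.height}" loading="${loading}">`
      );
    }

    // Tell the browser about a critical resource early.
    // Font preloads need the crossorigin attribute.
    function preloadTag(href: string, kind: "font" | "image" | "script" | "style"): string {
      const crossorigin = kind === "font" ? " crossorigin" : "";
      return `<link rel="preload" href="${href}" as="${kind}"${crossorigin}>`;
    }

    const hero = imageTag({ src: "/hero.jpg", alt: "Hero", width: 1200, height: 600, aboveTheFold: true });
    const footer = imageTag({ src: "/footer.jpg", alt: "Footer", width: 600, height: 200, aboveTheFold: false });
    const fontPreload = preloadTag("/fonts/brand.woff2", "font");
    ```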

    To sum everything up, Google introduced Core Web Vitals to help us to improve the UX of websites. In this article, I’ve shared clarifications of each core metric, the motives of score thresholds, tools to measure the UX scores of your web pages, and strategies for improvements. Loading, interactivity, and visual stability are the metrics highlighted today. Future research and analyses might reveal different Core Web Vitals to focus on. Be prepared!

    Meet the author of this article—Laura Silvanavičiūtė. Laura is a Web Developer who is making drop-shipping better for everyone together with the Oberlo app team. Laura loves Web technologies and is thrilled to share the most exciting parts with others via tech talks at conferences or articles on

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Always on Calibration with a Quarterly Summary

    Being always-on means we don’t wait to reach alignment around expectations and performance at the end of a review period. We’re always calibrating and aligning. This removes any sort of surprise in how an individual on the team is doing and recognizes the specific impact they had on the business. It also lets us find and act on specific just-in-time growth opportunities for each individual.

    Today, continuous deployment is an accepted and common best practice when it comes to building and releasing software. But, where’s the continuous deployment for the individuals? Just like the software we write, we’re iterating on ourselves and deploying new versions all the time. We learn, grow, and try new ideas, concepts, and approaches in our work and interactions. But how are we supposed to iterate and grow when, generally, we calibrate so infrequently? Take a minute and think about your experiences:

    • What do calibrations look like for you? 
    • When was the last time you got feedback? 
    • How often do you have performance conversations? Do you have them at all? 
    • Do you always know how your work connects to your job expectations? Do you know how your work contributes to the organizational goals?
    • Are you ever surprised by your performance evaluation? 
    • Were you able to remember, in detail, all of your contributions for the performance review window? 

    If you receive feedback, have performance reviews, and answer "no" or "I don’t know" to any of the above questions, then unfortunately, you’re not alone. I’ve answered "no" and "I don’t know" to each of these questions at one point in time or another in my career. I went years without any sort of feedback or review. I have also worked on teams where performance reviews were yearly, and only at the end of the year did we try and collect a list of the feedback and contributions made over that year.

    I have found great value in performance reviews with my team if, and only if, they are continuous and always-on. Anything short of always-on leads to surprises, missed opportunities for growth and for effective corrective feedback, as well as an incomplete view of individuals' contributions. This culminates in missed opportunities for the individual when it comes to their performance, promotion, and compensation reviews and conversations.

    Ye Old Performance Review

    Ye Old performance reviews are plagued by the same problems which surrounded software development before continuous deployments. Once every X number of months, we’d get all the latest features and build release candidates, and try to ship our code to production. These infrequent releases never worked out and were always plagued with issues. Knowing that this didn’t work so well for software deployments, why are so many still applying the same thinking to the individuals in their company and on their team?

    The Goal of Performance Reviews 

    Performance reviews intend to calibrate (meaning to capture and align) with an individual on the impact they had on the business over a given period, reward them for that impact, provide feedback, and stretch them into new opportunities (when they’re ready).

    Some Problems with Ye Old Performance Reviews 

    I don’t know about you, but I can’t recall what I had for dinner a week ago, let alone what I worked on months ago. Sure, I can give you the high-level details just like I can for that dinner I had last week. I probably had protein and some vegetables. Which one? Who knows. The same applies with impact conversations. How are we supposed to have a meaningful impact conversation if we can’t recall what we did and the impact it had? How are we supposed to provide feedback, so you can do better next time on that meal, if we don’t even recall what we ate? The specifics around those contributions are important, and those specifics are available now. 

    An always-on calibration reduces surprises and disconnects. It helps ensure specificity when reviewing contributions, and it helps individuals grow into better versions of themselves. What if, after that meal I made last week, I got feedback right away from my partner? That meal was good, but it could have used a little more spice. Great. Now the next time I make that meal, I can add some extra spice. If she didn’t tell me for a year, then every time I made it this year, it would be lacking that spice. I would have missed the opportunity to refine my palate and grow. Let’s look more closely at some of the problems which arise with these infrequent calibrations.


    When reviews are infrequent or non-existent, we’re unable to know the full scope and impact of the work performed by the individual. Their work isn’t fully realized, captured, and rewarded. If we do try to capture these contributions at the end of a review window, we run into a few problems:

    • Our memory fades and we forget our contributions and the impact we had. 
    • Our contributions often end up taking the form of general statements that lack specificity.
    • We suffer from a recency bias that colours how we see the contributions over the entire review period. This results in more weight being given to recent contributions and less to those which happened closer to the start of the review period. 
    • Managers and individuals can have a different or an incomplete view of the contributions made.

    If we are unable to see the full scope of work contributed by the individual, we’re missing opportunities to reward them for this effort and grow them towards their long-term goals.


    When reviews are infrequent, we’re missing out on the opportunity to grow the individuals on our team. With reviews come calibration, feedback, and alignment. This leads to areas for growth in terms of new opportunities and feedback as to where they can improve. If we try to capture these things at the end of review windows, we have a few problems:

    • We’re unable to grow in our careers as quickly because:
      • We can’t quickly try out new things and receive early and frequent feedback.
      • We’re infrequently looking for new opportunities that benefit our career growth. 
    • We won’t be able to provide early feedback or support when something isn’t working. 
    • There’s a lack of specificity on how team members can achieve the next level in their careers.
    • Individuals can overstay on a team when there’s a clear growth opportunity for them elsewhere in the organization.

    Frequent calibrations allow individuals to grow faster as they can find opportunities when they are ready, iterate on their current skills, and pivot towards a more successful development and contribution path. 

    The Quarterly Calibration Document

    Every quarter, each member of my team makes a copy of this quarterly objectives template. At present, this template consists of six sections:

    1. Intended Outcomes: what do they intend to accomplish going into this quarter?
    2. Top Accomplishments: what are their most impactful accomplishments this quarter?
    3. Other Accomplishments: what other impactful work did they deliver this quarter?
    4. Opportunities for the next three to six months: what opportunities have we identified for them that aren’t yet available to work on but will be available in the upcoming quarters?
    5. Feedback: what feedback did we receive from coworkers?
    6. Quarterly Review: a table that connects the individual’s specific impact for a given quarter to the organization’s expectations for their role and level.

    As you can probably guess by looking at these sections, this is a living document, and it’ll be updated over a given quarter. Each individual creates a new one at the start of the quarter and outlines their intended outcomes for the quarter. After this is done, we discuss and align on those intended outcomes during our first one-on-one of the quarter. After we have aligned, the individual updates this document every week before our one-on-one to ensure it contains their accomplishments and any feedback they’ve received from their coworkers. With this document and the organization’s role expectations, we can calibrate weekly and state where they’re meeting, exceeding, or missing our expectations of them in their role. We can also call out areas for development and look for opportunities in the current and upcoming work for them to develop. There’s never any surprise as to how they are performing and what opportunities are upcoming for their growth.

    A Few Key Points

    There are a few details I’ve learned to pay close attention to with these calibrations. 

    Review Weekly During Your One-on-one

    I’m just going to assume you are doing weekly one-on-ones. These are a great opportunity for coaching and mentorship. For always-on calibrations to be successful, you need to include and dedicate time as part of these weekly meetings. Calibrating weekly lets you:

    • Provide feedback on their work and its impact
    • Recognize their contributions regularly
    • Show them the growth opportunity available to them that connects to their development plan
    • Identify deviations from the agreed upon intended outcomes and take early action to ensure they have a successful quarter

    Who Drives and Owns the Quarterly Calibration Document?

    These calibration documents are driven by the individual as they’re mentored and coached by their manager. Putting the ownership of this document on the individual means they see how their objectives align with the expectations your organization has for someone in this role and level. They know what work they completed and the impact around that work. They also have a development plan and know what they’re working towards. Now that’s not to say their manager doesn’t have a place in this document. We’re there to mentor and coach them. If the intended outcomes aren’t appropriate for their level or are unrealistic for the provided period, we provide them with this feedback and help them craft appropriate intended outcomes and objectives. We also know about upcoming work, and how it might interest or grow them towards their long-term goals. 

    Always Incorporate Team Feedback

    Waiting to collect and share feedback on an infrequent basis results in vague, non-actionable, and non-specific feedback that hinders the growth of the team. Infrequently collected feedback often takes the form of "She did great on this project," "They're a pleasure to work with." This isn’t helpful to anyone's growth. Feedback needs to be candid, specific, timely, and focused on the behaviour. Most importantly, it needs to come from a place of caring. By discussing feedback in the quarterly document and during weekly one-on-ones, you can:

    • Collect highly specific and timely feedback.
    • Identify timely growth opportunities.
    • Provide a reminder to each individual on the importance of feedback to everyone’s growth. 

    During our weekly one-on-ones, we discuss any feedback that we’ve received during the previous week from their coworkers. I also solicit feedback for teammates at this time that they may or may not have shared with their coworkers. We take time to break down this feedback together and discuss the specifics. If the feedback is positive, we’ll make sure to note it as it’s useful in future promotion and compensation conversations. If the feedback is constructive, we discuss the specifics and highlight future opportunities where we can apply what was learned. Where appropriate, we also incorporate new intended outcomes into our quarterly calibration document. 

    What Managers Can Do when Things Aren’t on Track

    This topic deserves its own post, but the short of it is: hard conversations don’t get easier with time, they get more difficult. Let’s say we don’t appear to be on track for reaching our intended outcomes and the reason is performance-related. This is where we need to act immediately. Don’t wait. Do it now. The longer you wait, the harder it is to have these difficult conversations. For me, this is akin to howlers in Harry Potter. For those who don’t know Harry Potter, these are letters that parents send to seemingly misbehaving students, and the letter yells at the student about whatever misbehaviour occurred. If you don’t open these letters right away and get the yelling over with, they get worse and worse. They smoke in the corner, and the yelling is far worse when you eventually do open them. This is what I think of whenever I have difficult feedback I need to provide. I know it’s going to be difficult, but all parties should provide this early and give the recipient a chance to course correct. The good news is that you’re having weekly calibration sessions and not yearly, so the individual has plenty of time to correct any performance issues before they become a serious problem. But only if the manager jumps in.

    What Happens When We Aren’t Hitting Our Intended Outcomes and Objectives 

    First, it’s important to be clear as to which intended outcomes aren’t being hit. Are they specific to personal development or their current role and level?

    Personal Development Intended Outcomes

    In addition to achieving the expectations of the role and level, they’re hopefully working towards their long-term aims (see The AWARE Development Plan for more on this topic). Working with their Development Plan, they’ve worked back from their long-term aims to set a series of short-term intended outcomes that move them towards these goals. Missing these intended outcomes results in a delay in reaching their long-term aim, but this doesn’t affect their performance in their current role and level. When personal development aims are missed, we should reaffirm the value in these intended outcomes, and if they’re still appropriate, prioritize them in the future. We need to discuss and acknowledge the delay that missing these outcomes causes for their long-term aims, so all parties remain aligned on development goals.

    Role or Level Intended Outcomes

    At the start of a review period, we’ve agreed to a set of intended outcomes for an individual based on the role and level. Once we learn that the intended outcomes for the role or level aren’t going to be achieved, we need to understand why. If the reason is that priorities have changed, we need to refocus our efforts elsewhere. We acknowledge the work and impact they’ve had to date and set new intended outcomes that are appropriate for the time that remains in the review period. If, on the other hand, they’re unable to meet the intended outcomes because they aren’t up to the task, we may need to set up a performance improvement plan to help them gain the skills needed to execute at the level expected of them.

    The Quarterly Review

    We roll up and calibrate our impact every quarter. This period can be adjusted to work for your organization, but I’d recommend something less than a year but more than a month. Waiting a year is too long, and there’s too much data for you to work with in creating your evaluation. Breaking down these yearly reviews into smaller windows has a few advantages. It:

    • Allows you to highlight impactful windows of contribution (you may have a great quarter and just an ok year, breaking it down by quarter gives you the chance to celebrate that great period).
    • Allows you to snapshot smaller windows (this is your highlight reel for a given period and when looking back over the year you can look at these snapshots).
    • Allows you to assign a performance rating for this period and course correct early.

    If we want to be sure we have a true representation of each individual's contributions to an organization, ensure those contributions are meeting the organization’s expectations for their role and level, and provide the right opportunities for growth for each individual, then we need to be constantly tracking and discussing the specifics of their work, its impact, and where they can grow. It’s not enough to look back at the end of the year and collect feedback and a list of accomplishments. Your list will, at best, be incomplete, but more importantly you’ll have missed out on magnifying the growth of each individual. Worse yet, that yearly missed growth opportunity compounds, so the impact you are missing out on by not continuously calibrating with your team is huge.

    David Ward is a Development Manager who has been with Shopify since 2018 and is working to develop: Payment Flexibility, Draft Orders, and each member of the team. Connect with David on Twitter, GitHub and LinkedIn.

    The AWARE Development Plan

    A development plan helps everyone aim for and achieve their long-term goals in an intentional manner. Wherever you are in your career, there’s room for growth and a goal for you to work towards. Maybe your goal is a promotion? Maybe your goal is to acquire a given proficiency in a language or framework? Maybe your goal is to start your own company? Whatever it is, the development plan is your roadmap to that goal. Once you have that long-term goal, create a development plan to achieve it. Now I’m assuming there won’t be one single thing you can do to make it happen. If there were, you wouldn’t need a plan. When picking a long-term goal, make sure you pick one that’s far enough out to make the goal ambitious but not so far out that you can’t define a sequence of steps to reach that goal. If you can’t define the steps to reach the goal, then talk with a mentor or set a goal along a path for which you can define the steps. After you have that goal, you'll iterate towards it with the AWARE Development Plan.

    Today, my job is that of an Engineering Manager at Shopify. Outside of this position I've always been a planner: an individual iterating towards one long-term goal or another. I didn’t refine or articulate my process for how I iterate towards my long-term goals until I saw members of my team struggling to define their own paths. In an effort to help and accelerate their growth, I started breaking down and articulating how I’ve iterated towards the goals in my life. The result of this work is this AWARE Development plan.

    Having a goal means we have an aim or purpose.

    A goal on its own isn’t enough, as "a goal without a plan is just a wish."
    - Antoine de Saint-Exupéry

    Once you have a goal, you need to create a plan to understand how to reach it. Setting off towards your goal without a plan is akin to setting off on a road trip without a GPS or a clear sense of how to get to your destination. You don’t know what route you’ll take, nor when you’ll arrive. You may get to your destination, but it’s much more likely that you’ll get lost along the way. En route to your destination, you’ll undoubtedly take note of reference points along the way. These reference points are akin to our short- and medium-term goals. Our plan has these reference points built in, so we can constantly check in to make sure we’re still on course. If we’re off course, we want to know sooner rather than later, so we can adjust, ask for directions, and update the plan. The short- and medium-term goals created in your AWARE plan provide regular checkpoints along the way to your long-term goals. If we are unable to achieve short- and medium-term goals, we have the opportunity to lean in and course correct. Failing to course correct in these moments will result, at the very least, in substantial detours on the way to our goals.


    Iterating Towards My Long-Term Goals

    When I grow up I want to be a… Well, what did you want to be? I wanted to be a: fighter jet pilot, police officer, lawyer, and then when things got serious, a developer. For me, long-term goal setting started with a career goal when I was younger. I set a long-term goal, figured out what goals needed to be achieved along the way, achieved those goals, reflected on them, and then looked for the next goal.

    I was 12 years old when I got serious about wanting a career with computers. I had spent a bunch of time on my dad’s computer and I was hooked. But the question was: how do I make this happen? To figure this out, I started with the long-term goal of becoming a computer programmer and worked my way back. I took what I knew about my goals and figured out what I needed to do along the way to reach them. I made a plan.

    A summary of my 12-year-old self’s plan:

    An outline for working towards the long-term goal of a career in Computer Science.

    I looked ahead and set a long-term goal. I wanted a career as a computer programmer and to attend the University of Waterloo. I worked back from that goal to set various shorter-term goals. I needed to get good grades, get my computer, and earn some money to pay for this. I had a specific long-term goal and a set of medium and short-term goals. These medium and short-term goals moved me closer to the long-term goal one step at a time. New goals such as the purchase of a computer weren’t immediately on this plan, but this goal was added once its value became apparent to the long-term goal.

    Helping Others Reach their Goals

    I’ve always been interested in setting and reaching long-term goals. I’ve set and reached long-term goals in my education, my software engineering career, and my dance career. Working as both an Engineering Manager and a dance instructor, I’ve had the opportunity to help individuals set and achieve their goals. What I’ve learned is that the process to achieve those goals is the same irrespective of the area of interest. The AWARE process works just as well for dance as it does for software development. I’m sharing this with you because you’re the one in control of reaching your long-term goals, and yes, you can reach them.

    As a manager and instructor in both dance and software development, I’ve heard questions like “What do I need to get to the next level?” or “What do I need to do to get better at this?” more times than I can count. In either case, my answer is the same.

    • Start by building your development plan and work backwards from that goal to figure out what must be true for you to reach that goal. 
    • Sit down and critically think about the goal as well as the goals you must achieve along the way. 
    • Sit down and critically think about the result of your short-term goals after you achieve them.

    In my experience, these are the concepts that are often less clear-cut in other approaches to reaching long-term goals. With the AWARE Development Plan we are:

    • Putting the individual in the driver's seat and making them accountable for their long-term goals, which is key to them taking responsibility and achieving those goals.
    • Having the individual develop an understanding of what it means to achieve their goal, which helps them understand where they need to grow and what opportunities they need to seek out.
    • Creating a clear set of areas of development for which both they and I continuously look for opportunities.
    • Creating a clear answer to the “What’s Next?” question. We know what’s next because we can just consult our Development Plan.
    • Providing (to an extent) the individual with control over the speed at which they achieve their goals.

    If you are an individual who is unsure of what’s next or unsure how you can reach your long-term goal I’d encourage you to give this a shot. Sit down and write a Development Plan for yourself. Once you're done, share it with your lead/manager and ask them for feedback and their opinion. They are your partner. They can work with you to find opportunities and provide feedback as you work towards your goal. 

    AWARE: Aim, Work Back, Achieve, Reflect, Encore

    AWARE is a way for each of us to iterate towards our long-term goals. Its strength comes from breaking down long-term goals into actionable steps as well as reflecting on completed steps in order to provide confidence that you are on track. How does it work? Start at the end and imagine you achieved your goal. This is your long-term Aim. Now, work your way back and define what steps you need to take along the way to that outcome. You are unlikely to be able to achieve all of these Aims right away, and they are often too big to achieve all at once. When you find these, break them down into smaller Aims. From here start Achieving your Aims. As you complete your Aims take time to Reflect. What did you learn? What new goals did you uncover? Was that Aim valuable? Now take your next Aim and do it all over again. Let’s dig into each step a little more.

    Overview of each step: Aim, Work back, Achieve, Reflect, Encore.



    Aim

    At this stage, you’ll write down your aim, that is, your goal. When writing your aim, make sure:

    • You time box this aim appropriately.
    • There’s a clear definition of success.
    • You have a clear set of objectives.
    • You have a clear way to measure this aim.

    Work Back

    Can the current aim be broken down into smaller aims or can you immediately execute towards it? If you can’t immediately execute towards it then identify smaller aims that will lead to it. A goal is seldom achieved through the accomplishment of a single aim. Write down the aim and then work back from there to identify other aims you can perform as you iterate and move towards this larger aim. Each of these aims should:

    • have clear objectives
    • be measurable
    • be time-boxed
    • be reflected upon once it’s completed.


    Achieve

    Now that you have a time-boxed aim with a clear objective and a way to measure success, execute and accomplish it.


    Reflect

    After you have completed your aim, it’s important to reflect upon the aim and its outcomes:

    • Did this provide the value we expected it to?
    • What did we learn?
    • What new aims are we now aware of that we should add to our Development Plan?
    • What would you do differently next time if you were to do it again?


    Encore

    Now that you have completed an aim, repeat the process. Aim again and continue to work towards your long-term goal.

    Aim and Work Back Example

    I recently had a team member who expressed that their goal was to run their own software company one day. Awesome! They have a long-term goal. One specific worry they had was onboarding new developers to this company. Specifically, they wanted to know how to have them be impactful in a “timely manner.”  They identified that having impactful developers for their startup was important early on and could be the difference between success and failure for their company. What a great insight! So working back from their long-term goal we’ve identified “being able to onboard new developers in an organization promptly” to be an aim in the plan. In breaking down this aim, we were able to find two aims they could execute over the next eight months to grow this skill set:

    1. One area we identified right away was mentoring interns and new hires, so we added this to their plan. This aim was easy to set up: they would take on an intern in one of the upcoming terms and be responsible for onboarding them to the team and mentoring them while they’re here.
    2. This team member presented the idea of taking four months away from our team to join an onboarding group that was working to decrease the onboarding time for new developers joining Shopify. This opportunity was perfect: they could practice onboarding developers to our company and see firsthand what works and what doesn’t while helping develop the process here.

    With this individual we have:

    • A long-term goal: start your own software company.
    • A medium-term aim working towards that long-term goal: learn how to accelerate the onboarding of new hires.
    • Two short-term aims that work towards the medium-term aim:
      • Buddy an intern or new hire to the team. 
      • Mentor for four months as part of the company’s onboarding program.

    By setting the long-term goal and working back, we were able to work together to find opportunities in the short term that would grow this individual towards their long-term goal. Once they complete any one of these aims, they’ll reflect on how it went and what was learned, and then decide if there’s anything to add to the Development Plan as we continue on their path towards the long-term goal. In this example, it may look as though all of the shorter-term aims required to reach the long-term goal were fleshed out from the start. That usually won’t be the case, and that’s OK. Take it step by step: execute on the things you know, and amend the plan when you reflect.

    Bonus Tip: Sharing

    Once you have an aim, share it! Share it with your lead/manager and, if you're comfortable with the idea, your colleagues. Sharing your goals with others will help ensure the right growth opportunities are available when you are ready for them. What follows when you share your plan:

    1. New opportunities will present themselves because others are aware of your goals and can therefore share relevant opportunities with you.
    2. You will see opportunities for growth that may not have initially been obvious to you.

    Sharing your aim will enable you to proceed to your goals at a faster rate than if you were working at this alone.

    Tips for Using the AWARE Development Plan

    I provide this development plan template to team members when they join my team. I also ask them to write down their long-term goal and to time box it to at most five years. This is usually far enough out that there’s some uncertainty, but not so far that it prevents developing a plan. We then work back from that long-term goal and identify shorter-term aims (usually over the next one to two years) that move them towards this goal. The work of identifying these aims can be done together, but I also encourage them to drive this plan as much as possible. They have to do some thinking and bring ideas to the table as well.


    Aim

    To help with their aim, I ask questions like:

    • What does success look like for you in these time frames? 
    • What do we need to do to achieve this aim?
    • How will we know if we’ve achieved this aim? 
    • What opportunities are available today to start progressing towards this aim?
    • How does some of the current work you’re doing tie into this aim?

    Work Back

    Once we have an aim, we need to ask: is this something I can achieve now? If yes, you can move on to Achieve. If it’s a "no, not at this time," the situation should be one where you and your lead/manager agree that you’re ready for this aim, but you’re unable to locate an opportunity to achieve it. Finally, if it’s a "no, you’re unable to achieve this aim at this precise moment," you’ll need to work back from it to find other aims that are along the right path. For instance, this aim may require a certain degree or set of experiences that you’re missing. If this is the situation, simply work back until you find aims that are achievable. Document all discovered aims in your development plan so you and your lead can pursue them in due time.


    Achieve

    You have an aim that you can execute now. It has a clear set of objectives, and its success is measurable. Do it!


    Reflect

    You completed your aim. Amazing! Once you’ve completed an aim, it’s important to take some time to reflect on it. This is a step most folks want to skip because they don’t see the value, but it’s the most important one. Don’t skip it! Ask yourself:

    • What did we learn? 
    • Did we get what we thought we would out of it?
    • What would we do differently next time?
    • What new aims should we add to our plan based on what we know now?

    Maybe there’s a follow-up aim that you didn’t know about before you started this aim. Maybe you learned something and you want or need to pivot your aim or long-term goal? This is your chance to get feedback and course correct should you need or want to detour. Going back to our driving example from earlier: when we are driving towards a destination and aren’t paying attention to our surroundings or checking on our route, we’re far more likely to get lost. So take a minute when you complete an aim and check in on your surroundings. Am I still on course? Did I take a wrong turn? Is there another route available to me now?

    Encore: Maintenance of the Development Plan

    We have to maintain this Development Plan after we create it. Aims can change and opportunities arise. When they do, we need to adjust. My team and I check in on our plan during our weekly one-on-ones. At these meetings, we ask:

    • If an aim has been completed, what’s the next aim that you will work on?
    • What opportunities have come up, and how do they tie into your plan? 
    • Is there something in another task that came up that contributes to an aim? 
    • Are we still on track to reach our aims in the timeframe we set?
    • Did we complete an aim? Let’s make sure we reflect.

    If the answer is yes to any of these, that’s great! Call it out! Be explicit about it. Learning opportunities are magnified when you are intentional about them.

    The Path to Success Doesn’t Have to Be Complicated

    The path to success is often said to look something like this:

    An arrow shaft tied in many loops.

    But it doesn’t have to be quite as complex a ride. Get a mentor, some directions, and form a plan. Make sure this plan iterates towards your goal and has signposts along the way so you can check in to know you’re on track. This way, if you miss a turn, it’s no big deal. It’ll happen. The difference for you is that you’ll know sooner rather than later that you’re off track, and thus you’ll be able to course-correct earlier. At the end of the day your path to success may look like this:

    A depiction of an easier path to success. The arrow shaft has fewer knots than the previous image.

    An easier path to success moving forward.

    If you haven’t already done so, sit down with your lead or manager, write a plan, and figure out what is needed for you to reach your long-term goal. Then start iterating towards it.

    David Ward is a Development Manager who has been with Shopify since 2018 and is working to develop: Payment Flexibility, Draft Orders, and each member of the team. Connect with David on Twitter, GitHub and LinkedIn.

    Shopify's Path to a Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling

    Data scientists at Shopify expect fast results when querying large datasets across multiple data sources. We use Trino (a distributed SQL query engine) to provide quick access to our data lake and recently, we’ve invested in speeding up our query execution time.

    On top of handling over 500 Gbps of data, we strive to deliver p95 query results in five seconds or less. To achieve this, we’re constantly tuning our infrastructure. But with each change comes a risk to our system. A disruptive change could stall the work of our data scientists and interrupt our engineers on call.

    That’s why Shopify’s Data Reliability team built custom verification, benchmarking, and profiling tooling for testing and analyzing Trino. Our tooling is designed to minimize the risk of various changes at scale. 

    Below we’ll walk you through how we developed our tooling. We’ll share simple concepts to use in your own Trino deployment or any other complex system involving frequent iterations.

    The Problem

    A diagram showing the Trino upgrade tasks over time: Merge update to trino, Deploy candidate cluster, Run through Trino upgrade checklist, and Promote candidate to Prod. The steps include two places to Roll back.
    Trino Upgrade Tasks Over Time

    As Shopify grows, so does our data and our team of data scientists. To handle the increasing volume, we’ve scaled our Trino cluster to hundreds of nodes and tens of thousands of virtual CPUs.

    Managing our cluster gives rise to two main concerns:

    1. Optimizations: We typically have several experiments on the go for optimizing some aspect of our configuration and infrastructure.
    2. Software updates: We must keep up-to-date with new Trino features and security patches.

    Both of these concerns involve changes that need constant vetting, but the changes are further complicated by the fact that we run a fork of Trino. Our fork allows us more control over feature development and release schedules. The tradeoff is that we’re sometimes the first large organization to test new code “in the field,” and if we’re contributing back upstream to Trino, the new code must be even more thoroughly tested.

    Changes also have a high cost of failure because our data scientists are interrupted and our engineers must manually roll back Trino to a working version. Due to increasing uncertainty on how changes might negatively affect our cluster, we decided to hit the brakes on our unstructured vetting process and go back to the drawing board. We needed a tool that could give us confidence in potential changes and increase the reliability of our system as a whole.

    Identifying the Missing Tool

    To help identify our missing tool, we looked at our Trino deployment as a Formula 1 race car. We need to complete each lap (or query) as fast as possible and shorten the timeline between research and production of our engine, all the while considering safety.

    The highest-ranked Formula 1 teams have their own custom simulators. They put cars in real-life situations and use data to answer critical questions like, “How will the new steering system handle on the Grand Prix asphalt?” Teams also use simulations to prevent accidents from happening in real life.

    Taking inspiration from this practice, we iterated on a few simulation prototypes:

    1. Replaying past queries. First, we built a tool to pluck a previously run query from our logs and “replay” it in an isolated environment. 
    2. Replicating real life. Next, we wrote a tool that replicated traffic from a previous work day. Think of it like travelling back in time in our infrastructure. 
    3. Standardizing our simulations. We also explored an official benchmarking framework to create controlled simulations. 

    Another thing we were missing in our “garage” was a good set of single-purpose gauges for evaluating a possible change to Trino. We had some manual checks in our heads, but they were undocumented, and some checks caused significant toil to complete. These could surely be formalized and automated.

    A Trino upgrade checklist of reminders to our engineers. These reminders include verifying user-defined functions, connectivity, performance heuristics, resource usage, and security
    Our undocumented Trino upgrade checklist (intentionally vague)

    We were now left with a mixed bag of ideas and prototypes that lacked structure.

    The Solution 

    We decided to address all our testing concerns within a single framework to build the structure that was lacking in our solution. Initially, we had three use cases for this framework: verification, benchmarking, and profiling of Trino.

    The framework materialized as a lightweight Python library. It eliminates toil by extracting undocumented tribal knowledge into code, with familiar, intuitive interfaces. An interface may differ depending on the use case (verification, benchmarking, or profiling), but all rely on the same core code library.

    The core library is a set of classes for Trino query orchestration. It’s essentially an API that serves all of our testing needs. The higher-level Library class handles connections, query states, and multithreaded execution or cancellation of queries. The Query class handles lower-level concerns, such as query annotations, safety checks, and fetching individual results. Our library makes use of the open source repository trino-python-client, which implements the Python Database API Specification for Trino.
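
    To make the split between the two classes concrete, here is a minimal sketch of how such a library could be laid out. Only the class names Library and Query come from the description above; every method name, the annotation format, and the read-only safety check are assumptions for illustration. It works against any connection that follows the Python Database API Specification, so the standard library's sqlite3 stands in for Trino here, and it leaves out query state tracking and cancellation.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Query:
    """A single SQL statement with its low-level concerns attached."""

    sql: str
    annotation: str = "verification-suite"  # hypothetical tag for query logs

    def annotated_sql(self) -> str:
        # Append a SQL comment so test traffic is easy to spot later.
        return f"{self.sql} /* source: {self.annotation} */"

    def check_safe(self) -> None:
        # Hypothetical safety check: this sketch only allows reads.
        if not self.sql.lstrip().lower().startswith("select"):
            raise ValueError(f"refusing to run non-SELECT statement: {self.sql!r}")

    def fetch(self, connect: Callable[[], Any]) -> List[Tuple]:
        """Open a connection, run the query, and return all rows."""
        self.check_safe()
        connection = connect()
        try:
            cursor = connection.cursor()
            cursor.execute(self.annotated_sql())
            return cursor.fetchall()
        finally:
            connection.close()


class Library:
    """Runs many Query objects concurrently over DB-API connections."""

    def __init__(self, connect: Callable[[], Any], max_workers: int = 4) -> None:
        self._connect = connect
        self._max_workers = max_workers

    def run_all(self, queries: List[Query]) -> List[List[Tuple]]:
        # Each query gets its own connection and worker thread;
        # results come back in the same order as `queries`.
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(pool.map(lambda q: q.fetch(self._connect), queries))
```

    With sqlite3, the connection factory is `lambda: sqlite3.connect(":memory:")`. Against a real cluster, the factory would instead call `trino.dbapi.connect` from trino-python-client with the cluster's host, port, and user.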

    Verification: Accelerating Deployment

    Verification consists of simple checks to ensure Trino still works as expected after a change. We use verification to accelerate the deployment cycle for a change to Trino.

    A diagram showing the new Trino upgrade flow: Merge update to Trino, Deploy candidate cluster, Run Tests, Promote candidate to Prod. There is only one place for Rollback
    New Trino Upgrade Tasks Over Time (with shadow of original tasks)

    We knew the future users of our framework (Data Platform engineers) have a development background, and a high likelihood of knowing Python. Associating verification with unit testing, we decided to leverage an existing testing framework as our main developer interface. Conceptually, PyTest's features fit our verification needs quite well.

    We wrote a PyTest interface on top of our query orchestration library that abstracted away all the underlying Trino complications into a set of test fixtures. Now, all our verification concerns are structured into a suite of unit tests, in which the fixtures initialize each test in a repeatable manner and handle the cleanup after each test is done. We put a strong focus on testing standards, code readability, and ease of use. 

    Here’s an example block of code for testing our Trino cluster:
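
    A minimal sketch of such a test follows. Only the names candidate_cluster and run_query and the "correctness" mark come from this article; FakeCluster, the query, and the fixture body are assumptions made so the example runs without a Trino cluster.

```python
import pytest


class FakeCluster:
    """Stand-in for a Trino cluster connection; returns canned rows."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def run_query(self, sql):
        # A real implementation would send `sql` to Trino and fetch the rows.
        canned = {"SELECT 1": [(1,)]}
        return canned[sql]

    def close(self):
        self.closed = True


@pytest.fixture
def candidate_cluster():
    # Set up the connection before the test and clean up afterwards.
    cluster = FakeCluster("candidate")
    yield cluster
    cluster.close()


@pytest.mark.correctness
def test_select_one_returns_expected_rows(candidate_cluster):
    rows = candidate_cluster.run_query("SELECT 1")
    assert rows == [(1,)]
```

    Custom marks such as correctness would normally be registered in the pytest configuration. They can then be selected with `pytest -m correctness` or excluded with `pytest -m "not production_only"`.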

    First, the test is marked. We’ve established a series of marks, so a user can run all “correctness” tests at a time, all “performance” tests at a time, or every test except the “production_only” ones. In our case, “correctness” means we’re expecting an exact set of rows to be returned given our query. “Correctness” and “verification” have interchangeable meanings here. 

    Next, a fixture (in this case, candidate_cluster, which is a Trino cluster with our change applied) creates the connection, executes a query, fetches results, and closes the connection. Now, our developers can focus on the logic of the actual test. 

    Lastly, we call run_query and run a simple assertion. With this familiar pattern, we can already check off a handful of our items on our undocumented Trino upgrade checklist.

    A diagram showing a query being run on a single candidate Trino cluster
    Running a query on a candidate cluster

    Now, we increase the complexity:
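
    As a sketch of what a two-cluster performance test could look like: the multi_cluster fixture name and the "performance" mark come from this article, while MultiCluster, FakeCluster, QueryResult, the query, and the 20% slowdown threshold are assumptions chosen for illustration. The fake clusters report fixed timings so the example is deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Tuple

import pytest

MAX_SLOWDOWN = 1.2  # hypothetical threshold: candidate may be up to 20% slower


@dataclass
class QueryResult:
    rows: List[Tuple]
    seconds: float  # execution time reported for the query


class FakeCluster:
    """Stand-in for a Trino cluster that returns canned rows and timings."""

    def __init__(self, rows: List[Tuple], seconds: float) -> None:
        self._rows = rows
        self._seconds = seconds

    def run_query(self, sql: str) -> QueryResult:
        return QueryResult(rows=list(self._rows), seconds=self._seconds)


class MultiCluster:
    """Runs the same query on a candidate and a standby cluster in parallel."""

    def __init__(self, candidate, standby) -> None:
        self._clusters = {"candidate": candidate, "standby": standby}

    def run_query(self, sql: str) -> Dict[str, QueryResult]:
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = {
                name: pool.submit(cluster.run_query, sql)
                for name, cluster in self._clusters.items()
            }
            return {name: future.result() for name, future in futures.items()}


@pytest.fixture
def multi_cluster():
    rows = [(42,)]
    return MultiCluster(candidate=FakeCluster(rows, 1.0), standby=FakeCluster(rows, 1.1))


@pytest.mark.performance
def test_candidate_is_not_slower(multi_cluster):
    results = multi_cluster.run_query("SELECT count(*) FROM orders")
    candidate, standby = results["candidate"], results["standby"]
    # Both clusters must agree on the rows before timings are compared.
    assert candidate.rows == standby.rows
    assert candidate.seconds <= standby.seconds * MAX_SLOWDOWN
```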

    First, notice @pytest.mark.performance. Although performance testing is a broad concept, by asserting on a simple threshold comparison we can verify performance of a given factor (for example, execution time) isn’t negatively impacted.

    In this test, we call multi_cluster, which runs the same query on two separate Trino clusters. We look for any differences between the “candidate” cluster we’re evaluating and a “standby” control cluster at the same time.

    A diagram showing the same query being run on two separate Trino clusters
    Running the same query on two different clusters

    We used the multi_cluster pattern during a Trino upgrade when verifying our internal Trino User Defined Functions (which are utilized in domains such as finance and privacy). We also used this pattern when assessing a candidate cluster’s proposed storage caching layer. Although our suite didn’t yet have the ability to assert on performance automatically, our engineer evaluated the caching layer with some simple heuristics and intuition after kicking off some tests.

    We plan to use containerization and scheduling to automate our use cases further. In this scenario, we’d run verification tests at regular intervals and make decisions based on the results.

    So far, this tool covers the “gauges” in our race car analogy. We can take a pit stop, check all the readings, and analyze all the changes.

    Benchmarking: Simplifying Infrastructure

    Benchmarking is used to evaluate the performance of a change to Trino under a standardized set of conditions. Our testing solution has a lightweight benchmarking suite, so we can avoid setting up a separate system.

    Formula 1 cars need to be aerodynamic, and they must direct air to the back engine for cooling. Historically, Formula 1 cars are benchmarked in a wind tunnel, and every design is tested in the same wind tunnel with all components closely monitored. 

    We took some inspiration from the practice of Formula 1 benchmarking. Our core library runs TPC-DS queries on sample datasets that are highly relevant to the nature of our business. Fortunately, Trino generates these datasets deterministically and makes them easily accessible.

    The benchmarking queries are parametrized to run on multiple scale factors (sf). We repeat the same test on increasing magnitudes of dataset size with corresponding amounts of load on our system. For example, sf10 represents a 10 GB database, while sf1000 represents 1 TB.

    PyTest is just one interface that can be swapped out for another. But in the benchmarking case, it continues to work well:
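
    A parametrized test is one way to express "the same query on several scale factors." The sketch below assumes the TPC-DS connector is mounted as a catalog named tpcds, uses a simple row count in place of a full TPC-DS query, and invents the time budgets and the FakeCluster timings; none of these values come from our suite.

```python
from dataclasses import dataclass

import pytest

SCALE_FACTORS = ["sf1", "sf10", "sf100"]
# Hypothetical time budgets, in seconds, for each scale factor.
TIME_BUDGET_SECONDS = {"sf1": 5.0, "sf10": 15.0, "sf100": 60.0}


def build_query(scale_factor: str) -> str:
    # Assumes the TPC-DS connector is exposed as a catalog named "tpcds".
    return f"SELECT count(*) FROM tpcds.{scale_factor}.store_sales"


@dataclass
class QueryResult:
    rows: list
    seconds: float


class FakeCluster:
    """Stand-in cluster whose simulated run time grows with the scale factor."""

    def run_query(self, sql: str) -> QueryResult:
        schema = sql.split(".")[1]       # e.g. "sf10"
        scale = int(schema[2:])          # e.g. 10
        return QueryResult(rows=[(scale,)], seconds=0.5 * scale)


@pytest.fixture
def candidate_cluster():
    return FakeCluster()


@pytest.mark.performance
@pytest.mark.parametrize("scale_factor", SCALE_FACTORS)
def test_store_sales_count_within_budget(candidate_cluster, scale_factor):
    result = candidate_cluster.run_query(build_query(scale_factor))
    assert result.seconds <= TIME_BUDGET_SECONDS[scale_factor]
```

    pytest reports each scale factor as its own test case, so a regression that only shows up on larger datasets is visible at a glance.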

    A diagram showing multiple queries executing on a candidate Trino cluster
    Running the same set of queries on multiple scale factors

    This style of benchmarking is an improvement over our team members’ improvised methodologies. Some used Spark or Jupyter Notebooks, while others manually executed queries with SQL consoles. This led to inconsistencies, which defeated the purpose of benchmarking.

    We’re not looking to build a Formula 1 wind tunnel. Although more advanced benchmarking frameworks do exist, their architectures are more time-consuming to set up. At the moment, we’re using benchmarking for a limited set of simple performance checks.

    Profiling and Simulations: Stability at Scale

    Profiling refers to the instrumentation of specific scenarios for Trino, in order to optimize how the situations are handled. Our library enables profiling at scale, which can be utilized to make a large system more stable.

    In order to optimize our Trino configuration even further, we need to profile highly specific behaviours and functions. Luckily, our core library enables us to write some powerful instrumentation code.

    Notably, it executes queries and ignores individual results (we refer to these as background queries). If we fetched results when kicking off hundreds of parallel queries, which could return millions of rows of data, we’d quickly run out of memory on our laptops (or external Trino clients). With background queries, we put our cluster into overdrive with ease and profile on a much larger scale than before.

    We formalized all our prototyped simulation code and brought it into our library, illustrated with the following samples:
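
    The sketch below shows one way the two helpers could be shaped. The names generate_traffic and replay_queries come from this article; their signatures, the profile format (a list of concurrency levels), the log format (offset in seconds plus SQL), and RecordingCluster are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


class RecordingCluster:
    """Stand-in cluster that records queries and discards their results."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.executed: List[str] = []

    def run_background_query(self, sql: str) -> None:
        # A real background query would execute on Trino without fetching rows.
        with self._lock:
            self.executed.append(sql)


HIGH_TO_LOW = [8, 4, 2, 1]  # hypothetical profile: parallel queries per step


def generate_traffic(cluster, profile: Sequence[int], sql: str) -> None:
    """For each step in the profile, run that many queries in parallel."""
    for concurrency in profile:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            for _ in range(concurrency):
                pool.submit(cluster.run_background_query, sql)
        # Leaving the `with` block waits for the whole step to finish.


def replay_queries(
    cluster,
    log: Sequence[Tuple[float, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Replay (offset_seconds, sql) entries with their original spacing."""
    entries = sorted(log)
    if not entries:
        return
    previous_offset = entries[0][0]
    with ThreadPoolExecutor(max_workers=8) as pool:
        for offset, sql in entries:
            gap = offset - previous_offset
            if gap > 0:
                sleep(gap)  # wait as long as the original gap between queries
            pool.submit(cluster.run_background_query, sql)
            previous_offset = offset
```

    Passing the sleep function as a parameter keeps the replay testable: a test can substitute a recorder for time.sleep and check the gaps without waiting.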

    generate_traffic is called with a custom profile to target a specific behaviour. replay_queries plays back queries in real time to see how a modified cluster handles them. These methodologies cover edge cases that a standard benchmark test will miss.

    A diagram showing simulated "high to low traffic" being generated and sent to the candidate Trino cluster
    Generating traffic on a cluster

    This sort of profiling was used when evaluating an auto-scaling configuration for cloud resources during peak and off-hours. Although our data scientists live around the world, most of our queries happen between 9 AM and 5 PM EST, so we’re overprovisioned outside of these hours. One of our engineers experimented with Kubernetes’ horizontal pod autoscaling, kicking off simulated traffic to see how our count of Trino workers adjusted to different load patterns (such as “high to low” and “low to high”).

    The Results and What’s Next

    Building a faster and safer Trino is a platform effort supported by multiple teams. With this tooling, the Data Foundations team wrote an extensive series of correctness tests to prepare for the next Trino upgrade. The tests helped iron out issues and led to a successful upgrade! To bring back our analogy, we made an improvement to our race car, and when it left the garage, it didn’t crash. Plus, it maintained its speed: our p95 query execution time has remained stable over the past few months (the upgrade occurred during this window).

    A diagram showing p95 query execution time holding steady over three months
    95th percentile of query execution time over the past 3 months (one minute rolling window)

    Key Lessons

    By using this tool, our team learned about the effectiveness of our experimental performance changes, such as storage caching or traffic-based autoscaling. We were able to make more informed decisions about what (or what not) to ship.

    Another thing we learned along the way is that performance testing is complicated. Here are a few things to consider when creating this type of tooling:

    1. A solid statistics foundation is crucial. This helps ensure everyone is on the same page when sharing numbers, interpreting reports, or calculating service level indicators. 
    2. Many nuances of an environment can unintentionally influence results. To avoid this, understand the system and usage patterns on a deep level, and identify all differences in environments (for example, “prod” vs. “dev”).
    3. Ensure you gather all the relevant data. Some of the data points we care about, such as resource usage, are late-arriving, which complicates things. Automation, containerization, and scheduling are useful for this sort of problem.

    In the end, we scoped out most of our performance testing and profiling goals from this project, and focused specifically on verification. By ensuring that our framework is extensible and that our library is modular, we left an opportunity for more advanced performance, benchmarking, and profiling interfaces to be built into our suite in the future.

    What’s Next

    We’re excited to use this tool in several gameday scenarios to prepare our Data Reliability team for an upcoming high-traffic weekend—Black Friday and Cyber Monday—where business-critical metrics are needed at a moment’s notice. This is as good a reason as ever for us to formalize some repeatable load tests and stress tests, similar to how we test the Shopify system as a whole.

    We’re currently evaluating how we can open-source this suite of tools and give back to the community. Stay tuned!

    Interested in tackling challenging problems that make a difference? Visit our Data Science & Engineering career page to browse our open positions.

    Noam is a Hacker in both name and spirit. For the past three years, he’s worked as a data developer focused on infrastructure and platform reliability projects. From Toronto, Canada, he enjoys biking and analog photography (sometimes at the same time).

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Data Science & Engineering career page to find out about our open positions. Learn about how we’re hiring to design the future together—a future that is Digital by Design.

    Debugging Systems in the Cloud: MySQL, Kubernetes, and Cgroups

    By Rodrigo Saito, Akshay Suryawanshi, and Jeremy Cole

    KateSQL is Shopify’s custom-built Database-as-a-Service platform, running on top of Google Cloud’s Kubernetes Engine (GKE). It currently manages several hundred production MySQL instances across different Google Cloud regions and many GKE clusters.

    Earlier this year, we found a performance-related issue with KateSQL: some Kubernetes Pods running MySQL would start up and shut down more slowly than other similar Pods with the same data set. This partially impaired our ability to replace MySQL instances quickly when executing maintenance tasks like config changes or upgrades. While investigating, we found several factors that could be contributing to this slowness.

    The root cause was a bug in the Linux kernel memory cgroup controller.  This post provides an overview of how we investigated the root cause and leveraged Shopify’s partnership with Google Cloud Platform to help us mitigate it.

    The Problem

    KateSQL has an operational procedure called instance replacement. It involves creating a new MySQL replica (a Kubernetes Pod running MySQL) and then stopping the old one, repeating this process until all the running MySQL instances in the KateSQL Platform are replaced. KateSQL’s instance replacement operations revealed inconsistent MySQL Pod creation times, ranging from 10 to 30 minutes. The Pod creation time includes the time needed to:

    • spin up a new GKE node (if needed)
    • create a new Persistent Disk with data from the most recent snapshot
    • create the MySQL Kubernetes Pod
    • initialize the mysql-init container (which completes InnoDB crash recovery)
    • start the mysqld in the mysql container.

    We started by measuring the time of the mysql-init container and the MySQL startup time, and then compared the times between multiple MySQL Pods. We noticed a huge difference between these two MySQL Pods that had the exact same resources (CPU, memory, and storage) and dataset:

    KateSQL instance | mysql-init container time | MySQL startup time
    Slower Pod | 2120 seconds | 1104 seconds
    Faster Pod | 74 seconds | 17 seconds

    Later, we discovered that the MySQL Pods with slow creation times also showed a gradual decrease in performance. Evidence of that was an increased number of slow queries among queries that used temporary memory tables:

    A line graph showing queries per second over time. A purple line shows a slower MySQL Pod with the queries taking long. A blue line shows a faster pod where queries are much shorter.
    Purple line shows an affected MySQL Pod while the Blue line shows a fast MySQL Pod

    Immediate Mitigation

    A quick spot-check analysis revealed that newly provisioned Kubernetes cluster nodes performed better than those that had been up and running for a few months. With this information in hand, our first mitigation strategy was to replace the older Kubernetes cluster nodes with new ones using the following steps:

    1. Cordon (disallow any new Pods) the older Kubernetes cluster nodes.
    2. Replace instances using KateSQL to move MySQL Pods to new Kubernetes nodes, allowing GKE to autoscale the cluster by adding new cluster nodes as necessary.
    3. Once instances are moved to new Kubernetes nodes, drain the cordoned cluster nodes to scale down the cluster (automatically, through GKE autoscaler).

    This strategy was applied to production KateSQL instances, and we observed performance improvements on the new MySQL Pods.

    Further Investigation

    We began a deeper analysis to understand why newer cluster nodes performed better than older cluster nodes. We ruled out differences in software versions like kubelet, the Linux kernel, Google’s Container-optimized OS (COS), etc. Everything was the same, except their uptimes.

    Next, we started a resource analysis of each resource subsystem to narrow down the problem. We ruled out the storage subsystem, as the MySQL error log provided a vital clue as to where the slowness was. So we examined timestamps from InnoDB’s Buffer Pool initialization:

    We analyzed the MySQL InnoDB source code to understand the operations involved during InnoDB’s Buffer Pool initialization. Most importantly, memory allocation during its initialization is single-threaded, as confirmed using top, which showed it consuming approximately 100% CPU. We subsequently captured strace output of the mysqld process while it was starting up:

    We see that each mmap() system call took around 100 ms to allocate a chunk of approximately 128MB, which in our opinion is terribly slow for memory allocation.
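
    To make that kind of reading repeatable, per-call timings can be summarized with a short script. The Ruby sketch below assumes the capture was taken with strace -T, which appends the time spent in each system call in angle brackets; the exact strace flags we used aren’t recorded in this post, and the sample lines are invented to match the figures described above rather than copied from our capture.

```ruby
# Summarize time spent in mmap() from `strace -T` style lines.
# Assumes the trailing "<seconds>" field that the -T flag appends and
# no "[pid N]" prefix on the lines.
def mmap_summary(lines)
  pattern = /\Ammap\(\w+, (\d+),.*<(\d+\.\d+)>\z/
  calls = lines.filter_map do |line|
    m = pattern.match(line.strip)
    next unless m
    { bytes: m[1].to_i, seconds: m[2].to_f }
  end
  {
    calls: calls.length,
    mib: calls.sum { |c| c[:bytes] } / (1024.0 * 1024),
    seconds: calls.sum { |c| c[:seconds] }
  }
end

# Illustrative lines only (not the output we captured).
sample = <<~STRACE
  mmap(NULL, 134217728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a40000000 <0.101532>
  mmap(NULL, 134217728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a38000000 <0.098771>
  brk(NULL) = 0x55d0c2a4e000 <0.000011>
STRACE

summary = mmap_summary(sample.lines)
puts format("%d mmap calls, %.0f MiB, %.3f s", summary[:calls], summary[:mib], summary[:seconds])
```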

    We also did an On-CPU perf capture during this initialization process. Below is a snapshot of the flamegraph:

    A flamegraph showing the On-CPU perf capture during initialization
    Flamegraph of the perf output collected of a MySQL process container from Kubernetes Cluster node

    A quick analysis of the flamegraph shows how the MySQL (InnoDB) buffer pool initialization task is delegated to the memory allocator (jemalloc in our case), which then spends most of its time in a kernel function called mem_cgroup_commit_charge.

    We further analyzed what the mem_cgroup_commit_charge function does: it seems to be part of memcg (memory control group) and is responsible for charging (claiming ownership of) pages from one cgroup (unused/dead or root cgroup) to the cgroup of the allocating process. Unfortunately, memcg isn’t well documented, so it’s hard to understand what’s causing the slowdown.

    Another unusual thing we spotted (using the slabtop command) was the abnormally high dentry cache, sometimes around 20GB for about 64 pods running on a given Kubernetes Cluster node:

    While investigating if a large dentry cache could be slowing the entire system down, we found this post by sysdig that provided useful information. After further analysis following the steps from the post, we confirmed that it wasn’t the same case as we were experiencing. However, we noticed immediate performance improvements (similar to a restarted Kubernetes cluster node) after dropping the dentry cache using the following command:

    echo 2 > /proc/sys/vm/drop_caches

    Continuing the unusual slab allocation investigation, we ruled out any of its side effects, like memory fragmentation, since enough higher-order free pages were available (which was verified using the output of /proc/buddyinfo). Also, this memory is reclaimable during memory pressure events.
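
    For anyone repeating the /proc/buddyinfo check, the file lists, per memory zone, the number of free blocks of each order (a block of order n is 2^n contiguous pages, starting at order 0). The Ruby sketch below totals the free blocks at or above a chosen order. The cutoff of order 4 and the sample numbers are assumptions for illustration, not values from our nodes.

```ruby
# Total free blocks per order from /proc/buddyinfo text. Each column after
# the zone name is the count of free blocks of 2**order pages, from order 0.
def free_blocks_by_order(buddyinfo_text)
  totals = Hash.new(0)
  buddyinfo_text.each_line do |line|
    m = line.strip.match(/\ANode\s+\d+,\s+zone\s+\S+\s+(.*)\z/)
    next unless m
    m[1].split.each_with_index { |count, order| totals[order] += count.to_i }
  end
  totals
end

# "Higher-order" is taken to mean order >= min_order (an arbitrary cutoff).
def high_order_free_blocks(buddyinfo_text, min_order: 4)
  free_blocks_by_order(buddyinfo_text).sum { |order, count| order >= min_order ? count : 0 }
end

# Made-up sample in the /proc/buddyinfo layout; on a node you would pass
# File.read("/proc/buddyinfo") instead.
sample = <<~BUDDY
  Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
  Node 0, zone   Normal   4381   2345   1200    640    310    150     70     30     12      5      2
BUDDY

puts "free blocks of order >= 4: #{high_order_free_blocks(sample)}"
```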

    A Breakthrough

    Going through various bug reports related to cgroups, we found a command to list the number of cgroups in a system:

    We compared the memory cgroup’s value of a good node and an affected node. We concluded that approximately 50K memory cgroups is more than expected even with some of our short-lived tasks! This indicated to us that there could be a cgroup leak. It also helped make sense of the perf output that we had flamegraphed previously. There could be a performance impact if the cgroup commit charge has to traverse through many cgroups for each page charge. From source code analysis, we also saw that it locks the page cache’s least recently used (LRU) list.
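
    The same comparison is easy to automate. The Ruby sketch below reads the num_cgroups column for the memory controller from text in the /proc/cgroups layout (the file we mention monitoring later in this post). The alert threshold and the sample numbers are hypothetical; we only know that roughly 1800 memory cgroups looked healthy and roughly 50K did not.

```ruby
# Return the num_cgroups column for one controller from /proc/cgroups text.
# Columns are: subsys_name, hierarchy, num_cgroups, enabled.
def cgroup_count(proc_cgroups_text, controller = "memory")
  proc_cgroups_text.each_line do |line|
    next if line.start_with?("#")
    name, _hierarchy, num_cgroups, _enabled = line.split
    return num_cgroups.to_i if name == controller
  end
  nil
end

# The threshold is a hypothetical alerting value, not one from our tooling.
def possible_cgroup_leak?(proc_cgroups_text, threshold: 10_000)
  count = cgroup_count(proc_cgroups_text)
  !count.nil? && count > threshold
end

# Made-up sample in the /proc/cgroups layout; on a node you would pass
# File.read("/proc/cgroups") instead.
sample = <<~CGROUPS
  #subsys_name hierarchy num_cgroups enabled
  cpu 3 212 1
  memory 9 50123 1
CGROUPS

puts "memory cgroups: #{cgroup_count(sample)}"
puts "possible leak:  #{possible_cgroup_leak?(sample)}"
```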

    We evaluated a few more bug reports and articles, especially the following:

    1. A bug report unrelated to Kubernetes but pointing to the underlying cause related to cgroups. This bug report also helped point to the fix that was available for such an issue. 
    2. An article related to almost our exact issue. A must-read!
    3. A related workaround to the problem in Kubernetes.
    4. A thread in the Linux kernel mailing list that helped a lot.

    These posts were a great signal that we were on the right track to understanding the root cause of the problem. To confirm our findings and understand whether we were seeing a symptom of this cgroup leak that hadn’t yet been observed in the Linux community, we met with Linux kernel engineers at Google.

    We evaluated an affected node and the nature of the problem. The Google engineers were able to confirm that we were in fact hitting another side effect of reparenting slab memory on cgroup removal.

    To Prove the Hypothesis

    After evaluating the problem, we tested a possible solution. We invoked a switch file for the kubepods cgroup (the parent cgroup for Kubernetes pods) to force it to empty zombie/dead cgroups:

    $ echo 1 | sudo tee /sys/fs/cgroup/memory/kubepods/memory.force_empty

    This caused the number of memory cgroups to decrease rapidly to approximately 1800, which is in line with the good node we compared against previously:

    We quickly tested a MySQL Pod restart to see if there were any improvements in startup time performance. An 80G InnoDB buffer pool was initialized in five seconds:

    A Possible Workaround and Fixes

    There were multiple workarounds and fixes for this problem that we evaluated with engineers from Google:

    1. Reboot or cordon affected cluster node VMs, identifying them by monitoring /proc/cgroups output.
    2. Set up a cronjob to drop SLAB and page caches. It’s an old school DBA/sysadmin technique that might work but could have a performance penalty on read IO.
    3. Move short-lived Pods to a dedicated nodepool to isolate them from the more critical parts like MySQL Pods.
    4. Run echo 1 > /sys/fs/cgroup/memory/memory.force_empty in a preStop hook of a short-lived Pod.
    5. Upgrade to COS 85, which has upstream fixes to cgroup SLAB re-parenting bugs. Upgrading from GKE 1.16 to 1.18 should get us the Linux kernel 5.4 with the relevant bug fixes.

    Since we were due a GKE version upgrade, we created new GKE clusters with GKE 1.18 and started creating new MySQL Pods on those new clusters. After several weeks of running on GKE 1.18, we saw consistent MySQL InnoDB Buffer Pool initialization time and query performance:

    A table showing values for kube_namespace, kube_container, innodb_buffer_pool_size, and duration.
    Duration in seconds of new and consistent InnoDB Buffer Pool initialization for various KateSQL instances

    This was one of the lengthiest investigations that the Database Platform group has carried out at Shopify. Identifying the root cause took so long because of the nature of the problem, how difficult it was to reproduce, and the absence of any by-the-book methodology to follow. However, there are multiple ways to solve the problem, and that’s a very positive outcome.

    Special thanks to Roman Gushchin from Facebook’s Kernel Engineering team, whom we connected with via LinkedIn to discuss this problem, and Google’s Kernel Engineering team who helped us confirm and solve the root cause of the problem.

    Rodrigo Saito is a Senior Production Engineer in the Database Backend team, where he primarily works on building and improving KateSQL, Shopify's Database-as-a-Service, using his more than a decade of Software Engineering experience. Akshay Suryawanshi is a Staff Production Engineer in the Database Backend team, who helped build KateSQL at Shopify along with Jeremy Cole, who is a Senior Staff Production Engineer in the larger Database Platform group. Both of them bring decades of Database Administration and Engineering experience to manage Petabyte scale infrastructure.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    GitHub Does My Operations Homework: A Ruby Speed Story

    Hey, folks! Some of you may remember me from Rails Ruby Bench, Rebuilding Rails, or a lot of writing about Ruby 3 and Ruby/Rails performance. I’ve joined Shopify, on the YJIT team. YJIT is an open-source Just-In-Time Compiler we’re building as a fork of CRuby. By the time you read this, it should be merged into prerelease CRuby in plenty of time for this year’s Christmas Ruby release.

    I’ve built a benchmarking harness for YJIT and a big site of graphs and benchmarks, updated twice a day.

    I’d love to tell you more about that. But before I do, I know you’re asking, How fast is YJIT? That’s why the top of our new public page of YJIT results looks like this:

    Overall, YJIT is 20.6% faster than interpreted CRuby, or 17.4% faster than MJIT! On Railsbench specifically, YJIT is 18.5% faster than CRuby, 21.0% faster than MJIT!

    After that, there are lots of graphs and, if you click through, giant tables of data. I love giant tables of data.

    And I hate doing constant ops work. So let’s talk about a low effort way to make GitHub do all your operational work for a constantly-updated website, shall we?

    I’ll talk benchmarks along the way, because I am still me.

    By the way, I work on the YJIT team at Shopify. We’re building a Ruby optimizer to make Ruby faster for everybody. That means I’ll be writing about it. If you want to keep up, this blog has a subscribe thing (down below.) Fancy, right?

    The Bare Necessities

    I’ve built a few YJIT benchmarks. So have a lot of other folks. We grabbed some existing public benchmarks and custom-built others. The benchmarks are all open-source so please, have a look. If there’s a type of benchmark you wish we had, we take pull requests!

    When I joined the YJIT team, that repo had a perfectly serviceable runner script that would run benchmarks and print the results to console (which still exists, but isn’t used as much anymore.) But I wanted to compare speed between different Ruby configurations and do more reporting. Also, where do all those reports get stored? That’s where my laziness kicked in.

    GitHub Pages is a great way to have GitHub host your website for free. A custom Jekyll config is a great way to get full control of the HTML. Once we had results to post, I could just commit them to Git, push them, and let GitHub take care of the rest.

    But Jekyll won’t keep it all up to date. That needs GitHub Actions. Between them, the final result is benchmarks run automatically, the site updates automatically, and it won’t email me unless something fails.


    Want to see the gritty details?

    Setting up Jekyll

    GitHub Pages run on Jekyll. You can use something else, but then you have to run it on every commit. If you use Jekyll, GitHub runs it for you and tells you when things break. But you’d like to customise how Jekyll runs and test locally with bundle exec jekyll serve. So you need to set up _config.yml in a way that makes all that happen. GitHub has a pretty good setup guide for that. And here's _config.yml for

    Of course, customising the CSS is hard when it’s in a theme. You need to copy all the parts of the theme into your local repo, like I did, if you want to change how they work (like not supporting <b> for bold and requiring <strong>, I’m looking at you, Slate).

    But once you have that set up, GitHub will happily build for you. And it’s easy! No problem! Nothing can go wrong!

    Oh, uh, I should mention, maybe, hypothetically, there might be something you want to put in more than one place. Like, say, a graph that can go on the front page and on a details page, something like that. You might be interested to know that Jekyll requires anything you include to live under _includes or the current subdirectory, so you have to generate your graph in there. Jekyll makes it really hard to get around the “has to be under _includes” rule. And once you’ve put the file under _includes, if you want to put it onto a page with its own URL, you should probably research Jekyll collections. And an item in a collection gets one page, not one page per graph… Basically, your continuous reporting code, like mine, is going to need to know more about Jekyll than you might wish.

    A snippet of Jekyll _config.yml that adds a collection of benchmark objects which should be output as individual pages

    But once you’ve set Jekyll up, you can have it run the benchmarks, and then you have nice up-to-date data files. You’ll have to generate graphs and reports there too. You can pre-run jekyll build to see if everything looks good. And as a side benefit, since you’re going to need to give it GitHub credentials to check in its data files, you can have it tell you if the performance of any benchmark drops too much.

    AWS and GitHub Actions, a Match Made In… Somewhere

    GitHub Actions are pretty straightforward, and you can set one to run regularly, like a cron job. So I did that. And it works with barely a hiccup! It was easy! Nothing could go wrong.

    Of course, if you’re benchmarking, you don’t want to run your benchmarks in GitHub Actions. You want to do it where you can control the performance of the machine it runs on. Like an AWS instance! Nothing could go wrong.

    I just needed to set up some repo secrets for logging into the AWS instance. Like a private key, passed in an environment variable and put into an SSH identity file, that really has to end with a newline or everything breaks. But it’s fine. Nothing could go wrong!

    Hey, did you know that SSH remembers the host SSH key from any previous time you SSH’d there? And that GitHub Actions uses a shared .known_hosts file for those keys? And AWS re-uses old public IP addresses? So there’s actually a pretty good chance GitHub Actions will refuse to SSH to your AWS instance unless you tell it -oStrictHostKeyChecking=no. Also, SSH doesn’t usually pass environment variables through, so you’re going to need to assign them on its command line.

    So, I mean, okay, maybe something could go wrong.

    If you want to SSH into an AWS instance from GitHub Actions, you may want to steal our code, is what I’m saying.
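
    If you’d rather see the shape of it first, here’s a minimal Ruby sketch of the three gotchas above: the identity file that has to end with a newline, the -oStrictHostKeyChecking=no option, and environment variables assigned on the remote command line. This is not our actual workflow code. The host, script name, and variable names are all made up, and skipping host key checking trades away protection against connecting to the wrong machine, so decide if that’s acceptable for you.

```ruby
require "shellwords"
require "tmpdir"

# Write a private key to an identity file, making sure it ends with a newline.
def write_identity_file(private_key, path)
  key = private_key.end_with?("\n") ? private_key : "#{private_key}\n"
  File.write(path, key, perm: 0o600)
  path
end

# Build an ssh argv that skips host key checking and passes environment
# variables by assigning them at the start of the remote command.
def ssh_command(host:, identity_file:, remote_cmd:, env: {})
  assignments = env.map { |name, value| "#{name}=#{Shellwords.escape(value)}" }
  remote = (assignments + [remote_cmd]).join(" ")
  ["ssh", "-i", identity_file, "-oStrictHostKeyChecking=no", host, remote]
end

# Hypothetical secret name, host, and script; the fallback key is a placeholder.
key_from_secret = ENV.fetch("BENCHMARK_SSH_KEY", "not-a-real-key")
Dir.mktmpdir do |dir|
  id_file = write_identity_file(key_from_secret, File.join(dir, "id_bench"))
  argv = ssh_command(host: "ubuntu@bench.example.com", identity_file: id_file,
                     remote_cmd: "./run_benchmarks.sh", env: { "RUN_LABEL" => "nightly run" })
  puts argv.shelljoin
  # system(*argv) would actually connect; it is left out of this sketch.
end
```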

    For the Love of Graphs

    Of course, none of this gets you those lovely graphs. We all want graphs, right? How does that work? Any way you want, of course. But we did a couple of things you might want to look at.

    A line graph of how four benchmarks’ results have changed over time, with ‘whiskers’ at each point to show the uncertainty of the measurement.

    For the big performance over time graph on the front page, I generated a D3.js graph from Erb. If you’ve used Rails, generating HTML and JS from Ruby should sound pretty reasonable. I’ve had good luck with it for several different projects. D3 is great for auto-generating your X and Y axes, even on small graphs, and there’s lots of great example code out there.
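
    To give a feel for the Erb side of that, here’s a small sketch that renders a data set into a script tag for D3 code to pick up. It’s an illustration of the general approach, not the real template: the series names, dates, timings, and the benchmarkData variable are all invented.

```ruby
require "erb"
require "json"

# Hypothetical results: one series of [date, seconds] pairs per benchmark.
series = {
  "railsbench"   => [["2021-10-01", 2.31], ["2021-10-02", 2.28]],
  "activerecord" => [["2021-10-01", 0.41], ["2021-10-02", 0.40]]
}

# Template that embeds the data as JSON for a D3 script to consume.
template = ERB.new(<<~HTML)
  <div id="timeline"></div>
  <script>
    const benchmarkData = <%= JSON.generate(series) %>;
    // D3 code would bind benchmarkData to axes and lines here.
  </script>
HTML

html = template.result_with_hash(series: series)
puts html
```

    Keeping the data as one JSON blob means the Ruby side only has to get the numbers right, and all the drawing decisions stay in the JavaScript.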

    If you want to embed your results, you can generate static SVGs from Ruby. That takes more code, and you’ll probably have more trouble with finicky bits like the X and Y axis or the legend. Embeddable graphs are hard in general since you can’t really use CSS and everything has to be styled inline, plus you don’t know the styling for the containing page. Avoid it if you can, frankly, or use an iframe to embed. But it’s nice that it’s an option.

    A large bar graph of benchmark results with simpler axis markings and labels.

    Both SVG approaches, D3 and raw SVG, allow you to do fun things with JavaScript like mouseover (we do that on the site) or hiding and showing benchmarks dynamically (like we do on the timeline deep-dive). I wouldn’t try that for embeddable graphs, since they need more JavaScript that may not run inside a random page. It’s more enjoyable to implement interesting features with D3 instead of raw SVG.

    A blocky, larger-font bar graph generated using matplotlib.

    If fixed-sized images work for you, matplotlib also works great. We don’t currently use that for the site, but we have for other YJIT projects.

    Reporting Isn’t Just Graphs

    Although it saddens my withered heart, reporting isn’t just generating pretty graphs and giant, imposing tables. You also need a certain amount of English text designed to be read by “human beings.”

    That big block up-top that says how fast YJIT is? It’s generated from an Erb template, of course. It’s a report, just like the graphs underneath it. In fact, even the way we watch if the results drop is calculated from two JSON files that are both generated as reports—each tripwire report is just a list of how fast every benchmark was at a specific time, and an issue gets filed automatically if any of them drop too fast.
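
    Here’s roughly what that comparison looks like as a sketch. The report layout (benchmark name mapped to seconds per run), the 5% threshold, and the numbers are assumptions I made up for the example, and the real thing files a GitHub issue where this one just prints.

```ruby
require "json"

# Compare two tripwire-style reports ({benchmark name => seconds per run})
# and list the benchmarks that slowed down by more than max_slowdown.
def regressions(previous, current, max_slowdown: 0.05)
  current.filter_map do |name, seconds|
    before = previous[name]
    next if before.nil? || before <= 0
    change = (seconds - before) / before
    next unless change > max_slowdown
    { name: name, change: change.round(4) }
  end
end

# Hypothetical reports from two consecutive runs.
previous = JSON.parse('{"railsbench": 2.30, "activerecord": 0.40, "liquid": 1.00}')
current  = JSON.parse('{"railsbench": 2.32, "activerecord": 0.46, "liquid": 0.98}')

regressions(previous, current).each do |r|
  puts format("%s got %.1f%% slower", r[:name], r[:change] * 100)
end
```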

    So What’s the Upshot?

    There’s a lot of text up there. Here’s what I hope you take away:

    GitHub Actions and GitHub Pages do a lot for you if you’re running a batch-updated dynamic site. There are a few weird subtleties, and it helps to copy somebody else’s code where you can.

    YJIT is pretty fast. Watch this space for more YJIT content in the future. You can subscribe below.

    Graphs are awesome. Obviously.

    Noah Gibbs wrote the ebook Rebuilding Rails and then a lot about how fast Ruby is at various tasks. Despite being a grumpy old programmer in Inverness, Scotland, Noah believes that some day, somehow, there will be a second game as good as Stuart Smith’s Adventure Construction Set for the Apple IIe. Follow Noah on Twitter and GitHub.

    Try Out YJIT for Faster Rubying

    Here at Shopify, we’re building a new just-in-time (JIT) implementation on top of CRuby. Maxime talked about it at RubyKaigi and wrote a piece for our Engineering Blog. If you keep careful watch, you may have even seen our page of benchmark results.

    YJIT is a Just-In-Time Compiler, so it works by converting your most frequently used code into optimized native machine code. Early results are good—for instance, we’re speeding up a simple hello, world Rails benchmark by 20% or more. Even better, YJIT is pretty reliable: it runs without errors on Shopify’s full suite of unit tests for its main Rails application, and similar tests at GitHub. Matz, Ruby’s chief designer, mentioned us in his EuRuKo keynote. By the time you read this, YJIT should be merged into CRuby in time for Ruby 3.1 release at Christmas.

    Maybe you’d like to take YJIT out for a spin. Or maybe you’re just curious about how to compile CRuby (a.k.a. plain Ruby or Matz’s Ruby Interpreter.) Or maybe you just like reading technical blog posts with code listings in them. Reading blog posts feels weirdly productive while you’re waiting for a big test run to finish, doesn’t it? Even if it’s not 100% on-topic for your job.

    YJIT is available in the latest CRuby as 3.1.0-dev if you use ruby-build. But let’s say that you don’t or you want to configure YJIT with debugging options, statistics, or other customizations.

    Since YJIT is now part of CRuby, it builds and installs the same way that CRuby does. So I’ll tell you how you build CRuby and then you’ll know all about building YJIT too. We’ll also talk about:

    • How mature YJIT is or isn’t
    • How stable it is
    • Whether it’s ready for you to use at work
    • Whether this speed will turn into real-world benefits, or is all benchmarks.

    We’re building something new and writing about it. There’s an email subscription widget on this post found at the top-left and bottom. Subscribe and you’ll hear the latest, soonest.

    Start Your Engines

    If you’re going to build Ruby from source, you’ll need a few things: Autoconf, make, OpenSSL, and GCC or Clang. If you’re on a Mac, Xcode and Homebrew will get you these things. I won’t go into full details here, but Google can help out.

    brew install autoconf@2.69 openssl@1.1 # Unless you already have them

    On Linux, you’ll need a package similar to Debian build-essential plus autoconf and libssl-dev. Again, Google can help you here if you include the name of your Linux distribution. These are all common packages. Note that installing only Ruby from a package isn’t enough to build a new Ruby. You’re going to need Autoconf and a development package for OpenSSL. These are things a pre-built Ruby package doesn’t have.

    Now that you have the prerequisites installed, you clone the repo and build:

    And that will build you a local YJIT-enabled Ruby. If you’re using chruby, you can now log out, log back in, and switch to it with chruby ruby-yjit. rbenv is similar. With rvm you’ll need to mount it explicitly with a command like rvm mount ~/.rubies/ruby-yjit/bin/ruby -n ruby-yjit and then switch to it with rvm use ext-ruby-yjit.

    Note: on Mac, we’ve had a report of Autoconf 2.71 not working with Ruby. So you may need to install version 2.69, as shown above. And for Ruby in general you’ll want OpenSSL 1.1 - Ruby doesn’t work with version 3, which Homebrew installs by default.

    How Do I Know if YJIT Is Installed?

    Okay… So YJIT runs the same Ruby code, but faster. How do I know I even installed it?

    First, and simplest, you can ask Ruby. Just run ruby --enable-yjit -v. You should see a message underneath that YJIT is enabled. If you get a warning that enable-yjit isn’t a thing, you’re probably using a different Ruby than you think. Check that you’ve switched to it with your Ruby version manager.

    This message means this Ruby has no YJIT.

    You can also pop into irb and see if the YJIT module exists:

    You may want to export RUBYOPT='--enable-yjit' for this, or export RUBY_YJIT_ENABLE=1 which also enables YJIT. YJIT isn’t on by default, so you’ll need to enable it.
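
    If you want that check as a script rather than an irb session, here’s a hedged sketch. This post refers to the module as YJIT (as in YJIT.runtime_stats), while the released Rubies I’m aware of expose it as RubyVM::YJIT, so the helper looks for either name. The status symbols are my own invention for the example.

```ruby
# Find the YJIT module under either name, or return nil if neither exists.
def yjit_module
  if defined?(RubyVM::YJIT)
    RubyVM::YJIT
  elsif defined?(YJIT)
    YJIT
  end
end

# Report whether YJIT is missing, present but off, or running.
def yjit_status
  mod = yjit_module
  return :not_available if mod.nil?
  return :available_but_disabled unless mod.respond_to?(:enabled?) && mod.enabled?
  :enabled
end

puts RUBY_DESCRIPTION
puts "YJIT status: #{yjit_status}"
```

    Run it once plainly and once with YJIT turned on; if the status doesn’t change, you’re probably not running the Ruby you think you are.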

    Running YJIT

    After you’ve confirmed YJIT is installed, run it on some code. We found it runs fine on our full unit test suites, and a friendly GitHubber verified that it runs without error on theirs. So it’ll probably handle yours without a problem. If you pop into your project of choice and run rake test with YJIT and without, you can compare the times.
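
    Here’s one way to do that comparison from a script, as a rough sketch. The workload is a stand-in one-liner, and the flag probe only checks that this Ruby accepts --enable-yjit without failing, so a Ruby built without YJIT may still run both timings and report similar numbers. A single run each way is noisy; repeat it a few times before believing any difference.

```ruby
require "benchmark"
require "rbconfig"

# Time one run of a command, raising if the command fails.
def time_command(*cmd)
  Benchmark.realtime do
    ok = system(*cmd, out: File::NULL)
    raise "command failed: #{cmd.join(' ')}" unless ok
  end
end

# Stand-in workload; in a real project this might be ["bundle", "exec", "rake", "test"].
script = "200_000.times.map { |i| i.to_s }.sort"

baseline = time_command(RbConfig.ruby, "-e", script)
puts format("without YJIT: %.2fs", baseline)

# Only attempt the YJIT run if this Ruby accepts the flag.
if system(RbConfig.ruby, "--enable-yjit", "-e", "", out: File::NULL, err: File::NULL)
  with_yjit = time_command(RbConfig.ruby, "--enable-yjit", "-e", script)
  puts format("with YJIT:    %.2fs", with_yjit)
end
```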

    If you can’t think of any code to run it on, YJIT has a benchmark suite we like. You could totally use it for that. If you do, you can run things like ruby -Iharness benchmarks/activerecord/benchmark.rb and compare the times. Those are the same benchmarks we use for our page of benchmark results. You may want to read YJIT’s documentation while you’re there. There are some command-line parameters and build-time configurations that do useful and fun things.

    Is YJIT Ready for Production?

    Benchmarks are fine, but YJIT doesn’t always deliver the same real-world speedups. We’ve had some luck on benchmarks and minor speedups with production code, but we’re still very much in-progress. So where is YJIT actually at?

    First, we’ve had good luck running it on our unit tests, our production-app benchmarking code, and one real web app here at Shopify. We get a little bit of speedup, in the neighbourhood of 6%. That can add up when you multiply by the number of servers Shopify runs… But we aren’t doing it everywhere, just on a small percentage of traffic for a real web service, basically a canary deployment.

    Unit tests, benchmarks and a little real traffic are a good start. We’re hoping that early adopters and being included in Ruby 3.1 will give us a lot more real world usage data. If you try YJIT, we’d love to hear from you. File a GitHub issue, good or bad, and let us know!

    Hey, YJIT Crashed! Who Do I Talk to?

    Awesome! I’ve been fuzz-testing YJIT with AFL for days and trying to crash it. If you could file an issue and tell us as much as possible about how it broke, we’d really appreciate that. Similarly, anything you can tell us about how fast it is or isn’t is much appreciated. This is still early days.

    And if YJIT is really slow or gives weird error messages, that’s another great reason to file an issue. If you run your code with YJIT, we’d love to hear what breaks. We’d also love to hear if it speeds you up! You can file an issue and, even if it’s good not bad, I promise we’ll figure out what to do with it.

    What if I Want More?

    Running faster is okay. But maybe you find runtime_stats up above intriguing. If you compile YJIT with CFLAGS='-DRUBY_DEBUG=1' or CFLAGS='-DYJIT_STATS=1' you can get a lot more detail about what it’s doing. Make sure to run it with YJIT_STATS set as an environment variable or --yjit-stats on the command line. And then YJIT.runtime_stats will have hundreds of entries, not just two.

    When you run with statistics enabled, you’ll also get a report printed when Ruby exits showing interesting things like:

    • what percentage of instructions were run by YJIT instead of the regular interpreter 
    • how big the generated code was 
    • how many ISEQs (methods, very roughly) were compiled by YJIT.

    The beginning of a YJIT statistics exit report for a trivial one-line print command.

    What Can I Do Next?

    Right now, it’s really helpful just to have somebody using YJIT at all. If you try it out and let us know what you find, that’s huge! Other than that, just keep watching. Check the benchmark results now and then. Maybe talk about YJIT a little online to your friends. As a famous copyrighted movie character said, "The new needs friends." We’ll all keep trying to be friendly.

    Noah Gibbs wrote the ebook Rebuilding Rails and then a lot about how fast Ruby is at various tasks. Despite being a grumpy old programmer in Inverness, Scotland, Noah believes that some day, somehow, there will be a second game as good as Stuart Smith’s Adventure Construction Set for the Apple IIe. Follow Noah on Twitter and GitHub.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    YJIT: Building a New JIT Compiler for CRuby

    The 1980s and 1990s saw the genesis of Perl, Ruby, Python, PHP, and JavaScript: interpreted, dynamically-typed programming languages which favored ease of use and flexibility over performance. In many ways, these programming languages are a product of the surrounding context. The 90s were the peak of the dot-com hype, and CPU clock speeds were still doubling roughly every 18 months. It looked like the growth was never going to end. You didn’t really have to try to make your software run fast because computers were just going to get faster, and the problem would take care of itself. Today, things are a bit different. We’re reaching the limit of current silicon fabrication technologies, and we can’t rely on single-core performance increases to solve our performance problems. Because of mobile devices and environmental concerns, we’re beginning to realize that energy efficiency matters.

    Last year, during the pandemic, I took a job at Shopify, a company that runs a massive server infrastructure powered by Ruby on Rails. I joined a team with multiple software engineers working on improving the performance of Ruby code in a variety of ways, ranging from optimizing the CRuby interpreter and its garbage collector to the implementation of TruffleRuby, an alternative Ruby implementation. Since then, I’ve been working with a team of skilled engineers from Shopify and GitHub on YJIT, a new Just-in-time (JIT) compiler built inside CRuby.

    This project is important to Shopify and Ruby developers worldwide because speed is an underrated feature. There’s already a JIT compiler inside CRuby, known as MJIT, which has been in the works for three years. And while it has delivered speedups on smaller benchmarks, so far, it’s been less successful at delivering real-world speedups on widely used Ruby applications such as Ruby on Rails. With YJIT, we take a data-driven approach and focus specifically on performance hotspots of larger applications such as Rails and Shopify Core (Shopify’s main Rails monolith).

    What’s YJIT?

    "Shopify loves Ruby! A small team lead by @Love2Code has been working on a new JIT that focuses on web & @rails workloads while also accelerating all ruby code out there. Today @yukihiro_matz gave his thumbs up to merging it into trunk:"
    Tobi Lütke tweeting about YJIT

    YJIT is a project to gradually build a JIT compiler inside CRuby such that more and more of the code is executed by the JIT, which will eventually replace the interpreter for most of the execution. The compiler, which is soon to become officially part of CRuby, is based on Basic Block Versioning (BBV), a JIT compiler architecture I started developing during my PhD. I’ve given talks about YJIT this year at the MoreVMs 2021 workshop and another one at RubyKaigi 2021 if you’re curious to hear more about the approach we’re taking.

    Current Results

    We’re about one year into the YJIT project at this point, and so far, we’re pleased with the results, which have significantly improved since the MoreVMs talk. According to our set of benchmarks, we’ve achieved speedups over the CRuby interpreter of 20% on railsbench, 39% on liquid template rendering, and 37% on activerecord. YJIT also delivers very fast warm up. It reaches near-peak performance after a single iteration of any benchmark and performs at least as well as the interpreter on every benchmark, even on the first iteration.

    A bar graph showing the performance differences between YJIT, MJIT, and No JIT.
    Benchmark speed (iterations/second) scaled to the interpreter’s performance (higher is better)

    Building YJIT inside CRuby comes with some limitations. It means that our JIT compiler has to be written in C and that we have to work with design decisions in the CRuby codebase that weren’t made with a high-performance JIT compiler in mind. However, it has the key advantage that YJIT is able to maintain almost 100% compatibility with existing Ruby code and packages. We pass the CRuby test suite, comprising about 30,000 tests, and we have also been able to pass all of the tests of the Shopify Core CI, a codebase that contains over three million lines of code and depends (directly and indirectly) on over 500 Ruby gems, as well as all the tests in the CI for GitHub’s backend. We also have a working deployment to a small percentage of production servers at Shopify.

    We believe that the BBV architecture that powers YJIT offers some key advantages when compiling dynamically-typed code. Having end-to-end control over the full code generation pipeline will allow us to go farther than what’s possible with the current architecture of MJIT, which is based on GCC. Notably, YJIT can quickly specialize code based on type information and patch code at run time based on the run-time behavior of programs. The advantage in terms of compilation speed and warmup time is also difficult to match.

    Next Steps

    The Ruby core developers have invited the YJIT team to merge the compiler into Ruby 3.1. It’s a great honor for my colleagues and myself to have our work become officially part of Ruby. This means, in a few months, every Ruby developer will have the opportunity to try YJIT by simply passing a command-line option to the Ruby binary. However, our journey doesn’t stop there, and we already have plans in the works to make YJIT and CRuby even faster.

    Currently, only about 79% of instructions in railsbench are executed by YJIT, and the rest run in the interpreter, meaning that there’s still a lot we can do to improve upon our current results. There’s a clear path forward, and we believe YJIT can deliver much better performance than it does now. However, as part of building YJIT, we’ve had to dig through the implementation of CRuby to understand it in detail. In doing so, we’ve identified a few key elements in its architecture that we believe can be improved to unlock higher performance. These improvements won’t just help YJIT, they’ll help MJIT too, and some of them will even make the interpreter faster. As such, we will likely try to upstream some of this work separately from YJIT.

    I may expand on some of these in future blog posts, but here is a tentative list of potential improvements to CRuby that we would like to tackle:

    • Moving CRuby to an object model based on object shapes.
    • Changing the CRuby type tagging scheme to reduce the cost of type checks.
    • Implementing a more fine-grained constant caching mechanism.
    • A faster, more lightweight calling convention.
    • Rewriting C runtime methods in Ruby so that JIT compilers can inline through them.

    Matz (Yukihiro Matsumoto) has stated in his recent talk at Euruko 2021 that Ruby would remain conservative with language additions in the near future. We believe this is a wise decision as rapid language changes can make it difficult for JIT implementations to get off the ground and stay up to date. It makes some sense, in our opinion, for Ruby to focus on internal changes that will make the language more robust and deliver very competitive performance in the future.

    I hope you’re as excited about the future of YJIT and Ruby as we are. If you’re interested in trying YJIT, it’s available on GitHub under the same open source license as CRuby. If you run into bugs, we’d appreciate it if you would open an issue and help us find a simple reproduction. Stay tuned as two additional blog posts about YJIT are coming soon, with details about how you can try YJIT, and the performance tracking system we’ve built for

    Maxime Chevalier-Boisvert obtained a PhD in compiler design at the University of Montreal in 2016, where she developed Basic Block Versioning (BBV), a JIT compiler architecture optimized for dynamically-typed programming languages. She is currently leading a project at Shopify to build YJIT, a new JIT compiler built inside CRuby.

    Winning AI4TSP: Solving the Travelling Salesperson Problem with Self-programming Machines

    Running a business requires making a lot of decisions. To be competitive, they have to be good. There are two complications, though:

    1. Some problems are computationally very hard to solve.
    2. In reality, we are dealing with uncertainty, so we do not even know what exact problem setting we should optimize for.

    The AI4TSP Competition fosters research on the intersection of optimization (addressing the first issue of efficient computation for hard problems) and artificial intelligence (addressing the second issue of handling uncertainty). Shopify optimization expert Meinolf Sellmann collaborated with his former colleagues Tapan Shah at GE Research, Kevin Tierney, Fynn Schmitt-Ulms, and Andre Hottung from the University of Bielefeld to compete and win first prize in both tracks of the competition. The type of problem studied in this competition matters to Shopify as the optimization of our fulfillment system requires making decisions based on estimated data.

    The Travelling Salesperson Problem

    The AI4TSP Competition focuses on the Travelling Salesperson Problem (TSP), one of the most studied routing problems in the optimization community. The task is to determine the order to visit a given set of locations, starting from, and returning to, a given home location, so the total travel time is minimized. In its original form, the travel times between all locations are known upfront. In the competition, these times weren’t known but sampled according to a probability distribution. The objective was to visit as many locations as possible within a given period of time, whereby each location was associated with a specific reward. To complicate matters further, the locations visited on the tour had to be reached within fixed time windows. Arriving too early meant having to wait until the location opened; arriving too late incurred a penalty.

    An image of two solutions to the same TSP instance with the home location in black. The route solutions can be done counterclockwise or clockwise
    Two solutions to the same TSP instance (home location in black)

    When travel times are known, the problem looks innocent enough. However, consider this: the number of possible tours grows more than exponentially and is given by “n! = 1 * 2 * 3 … * n” (n factorial) for a problem instance with n locations. Even if we could:

    1. evaluate, in parallel, one potential tour for every atom in the universe
    2. have each atomic processor evaluate the tours at Planck time (shortest time unit that anything can be measured in)
    3. run that computer from the Big Bang until today.

    It wouldn’t even enumerate all solutions for just one TSP instance with 91 locations. The biggest problems at the competition had 200 locations, with roughly 10^375 potential solutions.
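
    To get a feel for how quickly n! grows, the count can be computed exactly. The following is a minimal Python sketch, not part of the competition material; the helper name tour_count_digits is made up for illustration, and it simply reports how many decimal digits n! has for a few instance sizes.

```python
import math


def tour_count_digits(n: int) -> int:
    """Return how many decimal digits n! has (n! = number of orderings of n locations)."""
    return len(str(math.factorial(n)))


if __name__ == "__main__":
    for n in (10, 20, 91, 200):
        print(f"{n} locations: n! has {tour_count_digits(n)} digits")
```

    Python integers have arbitrary precision, so the factorial is computed exactly rather than approximated.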

    The Competition Tracks

    The competition consisted of two tracks. In the first, participants had to determine a tour for a given TSP instance that would work well in expectation when averaged over all possible travel time scenarios. A tour had to be chosen and participants had to stick to that tour no matter how the travel times turned out when executing the tour. The results were then averaged ten times over 100,000 scenarios to determine the winner.

    A table of results for the final comparison of front runners in Track 1. It shows Meinolf's team as the winner.
    Final Comparison of Front Runners in Track 1 (Shopify and Friends’ Team “Convexers”)

    In the second track, participants were allowed to build the tour on the fly. At every location, they could choose which location to go to next, taking into account how much time had already elapsed. The policy that determined how to choose the next location was evaluated on 100 travel time realizations for each of 1,000 different TSP instances to determine the winner.
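
    The difference between the two tracks can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than the competition's actual setup: the four-location instance, the uniform noise on travel times, the penalty value, and the simple deadline_policy rule are all invented, and the sketch leaves out the return trip and the overall time limit. It only contrasts a tour that is fixed up front with a policy that picks the next stop based on the elapsed time, both scored by averaging over sampled scenarios.

```python
import random

# Hypothetical toy instance; every number here is invented for illustration.
# Location 0 is home, locations 1-3 are customers.
MEAN_TIME = [
    [0, 4, 6, 8],
    [4, 0, 3, 5],
    [6, 3, 0, 4],
    [8, 5, 4, 0],
]
REWARD = [0, 5, 8, 6]
WINDOW = [(0, 99), (0, 6), (5, 12), (8, 20)]  # (opens, closes) per location
LATE_PENALTY = 3


def visit(here, nxt, clock, rng):
    """Travel to nxt under a sampled travel time; return (new clock, score gained)."""
    clock += MEAN_TIME[here][nxt] * rng.uniform(0.5, 1.5)  # assumed noise model
    opens, closes = WINDOW[nxt]
    gained = REWARD[nxt] - (LATE_PENALTY if clock > closes else 0)
    return max(clock, opens), gained  # arriving early means waiting until opening


def run_fixed_tour(tour, rng):
    """Track-1 style: commit to an order up front and follow it no matter what."""
    clock, score, here = 0.0, 0.0, 0
    for nxt in tour:
        clock, gained = visit(here, nxt, clock, rng)
        score, here = score + gained, nxt
    return score


def deadline_policy(here, clock, remaining):
    """Prefer stops still reachable on time (by mean travel time), earliest closing first."""
    on_time = [j for j in remaining if clock + MEAN_TIME[here][j] <= WINDOW[j][1]]
    return min(on_time or remaining, key=lambda j: (WINDOW[j][1], j))


def run_policy(policy, rng):
    """Track-2 style: choose the next stop on the fly, knowing the elapsed time."""
    clock, score, here = 0.0, 0.0, 0
    remaining = {1, 2, 3}
    while remaining:
        nxt = policy(here, clock, remaining)
        remaining.remove(nxt)
        clock, gained = visit(here, nxt, clock, rng)
        score, here = score + gained, nxt
    return score


def average_score(run, arg, scenarios=10_000, seed=0):
    """Monte Carlo estimate of the expected score over sampled scenarios."""
    rng = random.Random(seed)
    return sum(run(arg, rng) for _ in range(scenarios)) / scenarios


if __name__ == "__main__":
    print("fixed tour 1-2-3:", average_score(run_fixed_tour, [1, 2, 3]))
    print("on-the-fly policy:", average_score(run_policy, deadline_policy))
```

    Averaging over many sampled scenarios is what makes the two approaches comparable: a single realization of the travel times says little about how well either one does in expectation.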

    Optimal Decisions Under Uncertainty

    For hard problems like the TSP, optimization requires searching. This search can be systematic, whereby we search in such a manner that we can efficiently keep a record of the solutions that have already been looked at. Alternatively, we can search heuristically, which generally refers to search methods that work non-systematically and may revisit the same candidate solution multiple times during the search. This is a drawback of heuristic search, but it offers much more flexibility as the search controller can guide where to go next opportunistically and isn’t bound by exploring spaces that neatly fit to our existing search record. However, we need to deploy techniques that allow us to escape local regions of attraction, so that we don’t explore the same basin over and over.
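
    As a concrete illustration of heuristic search with an escape mechanism, here is a minimal Python sketch of a generic iterated local search on a plain TSP with known distances. It is not the team's method: it uses a textbook 2-opt descent plus a random perturbation as the escape move, and the function names, the number of rounds, and the random point set are assumptions chosen for the example.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour that visits pts in the given order."""
    return sum(math.dist(pts[tour[k]], pts[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))


def two_opt(tour, pts):
    """Descend to a local optimum by reversing segments while that shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = candidate, True
    return tour


def perturb(tour, rng):
    """Escape move: shuffle a random slice, keeping the home location first."""
    i, j = sorted(rng.sample(range(1, len(tour)), 2))
    middle = tour[i:j + 1]
    rng.shuffle(middle)
    return tour[:i] + middle + tour[j + 1:]


def iterated_local_search(pts, rounds=30, seed=0):
    """Alternate perturbation and descent, keeping the best tour found."""
    rng = random.Random(seed)
    best = two_opt(list(range(len(pts))), pts)
    for _ in range(rounds):
        candidate = two_opt(perturb(best, rng), pts)
        if tour_length(candidate, pts) < tour_length(best, pts):
            best = candidate
    return best


if __name__ == "__main__":
    point_rng = random.Random(42)
    pts = [(point_rng.random(), point_rng.random()) for _ in range(12)]
    best = iterated_local_search(pts)
    print(best, round(tour_length(best, pts), 3))
```

    The descent alone stops at the first tour that no single segment reversal can improve. The perturbation step is what lets the search leave that basin, at the cost of possibly revisiting tours it has already seen.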

    For the solution to both tracks, Shopify and friends used heuristic search, albeit in two very different flavors. For the first track, the team applied a search paradigm called dialectic search. For the second track, they used what’s known in machine learning as deep reinforcement learning.

    The Age of Self-Programming Machines

    Key to making both approaches work is to allow the machine to learn from prior experience and to adjust the program automatically. If the ongoing machine learning revolution had to be summarized in just one sentence, it would be:

    If, for a given task, we fail to develop algorithms with sufficient performance, then shift the focus to building an algorithm that can build this algorithm for us, automatically.

    A recent prominent example where this revolution has led to success is AlphaFold, DeepMind’s self-taught algorithm for predicting the 3D structure of proteins. Humans tried to build algorithms that could predict this structure for decades, but were unable to reach sufficient accuracy to be practically useful. The same was demonstrated for tasks like machine vision, playing board games, and optimization. At another international programming competition, the MaxSAT Evaluation 2016, Meinolf and his team entered a self-tuned dialectic search approach which won four out of nine categories and ten medals overall. 

    These examples show that machine-generated algorithms can vastly outperform human-generated approaches. Particularly when problems become hard to conceptualize in a concise theory and hunches or guesses must be made during the execution of the algorithm, allowing the machine to learn and improve based on prior experience is the modern way to go.

    Meinolf Sellmann, Director for Network Optimization at Shopify, is best known for algorithmic research, with a special focus on self-improving algorithms, automatic algorithm configuration and algorithm portfolios based on artificial intelligence, combinatorial optimization, and the hybridization thereof. He received a doctorate degree (Dr. rer. nat.) from Paderborn University (Germany). Prior to this he was Technical Operations Leader for machine learning and knowledge discovery at General Electric, senior manager for data curation at IBM, Assistant Professor at Brown University, and Postdoctoral Scholar at Cornell University.
    His honors include the Prize of the Faculty of the University of Paderborn (Germany) for his doctoral thesis, an NSF Career Award in 2007, over 20 Gold Medals at international SAT and MaxSAT Competitions, and first places at both tracks of the 2021 AI for TSP Competition. Meinolf has also been invited as keynote speaker and program chair of many international conferences like AAAI, IAAI, Informs, LION and CP.

    Journey Through a Dev Degree Intern’s First Placement

    This past April, I completed my first placement as a Dev Degree student. I was a back-end developer working on the Docs & API Libraries team. The team’s mission is to create libraries, tools, and documentation that help developers build on top of Shopify. Throughout my time on the team, I had many opportunities to solve problems, and in taking them, I learned not only technical skills, but life lessons I hope to carry with me. I’ll share with you how I learned to appreciate the stages of learning, dispel the myth of the “real developer,” combat imposter syndrome by asking for help, and recognize the power of valuing representation.

    Joining the Dev Degree Program

    In February of 2019, I received an email from York University. Apply for Dev Degree! it read. As a high school student with disabilities, I’d been advised to look for a program that would support me until I graduated. With small cohorts and a team of supportive folks from both Shopify and York University, I felt I could be well-accommodated in Dev Degree. Getting work experience, I reasoned, would also help me learn to navigate the workplace, putting me at an advantage once I graduated. With nothing to lose and absolutely no expectations, I hit Apply.

    An illustration of an email that reads "To: you. subject: apply for Dev Degree!" with an "apply!" button at the bottom
    Apply to Dev Degree

    Much to my surprise, I made it through the application and interview process. What followed were eight months of learning: 25 hours a week in a classroom at Shopify and 20 hours a week at York University. Alongside nine other students, I discovered Ruby, databases, front-end development, and much more with a group of knowledgeable instructors at Shopify. In May 2020, we moved on to the next exciting stage of the Dev Degree program: starting our first placement on a real Shopify team! It would be 12 months long, followed by three additional eight-month long placements, all on teams of varying disciplines. During those 12 months, I learned lessons I’ll carry throughout my career.

    What We Consider “Foundational Knowledge” Gets Distorted the More We Learn

    When I first joined the Docs & API Libraries team, there was a steep learning curve. This was expected. Months into the placement, however, I still felt as though I was stagnant in my learning and would never know enough to become impactful. Upon mentioning this to my lead, he suggested we create an Achievement Log—a list of the victories I’d had, large and small, during my time on the team.

    Looking back on my Achievement Log, I see many victories I would now consider rather small, but had been a big deal at the time. To feel this way is a reminder that I have progressed. My bar for what to consider “basic knowledge” moved upward as I learned more, leading to a feeling of never having progressed. One good way of putting these achievements into perspective is to share them with others. So, let’s journey through some highlights from my Achievement Log together!

    An illustration of stairs with label, "today" on closest and largest step, and "before" off in the distance
    Stairs to achievement

    I remember being very frustrated while attempting my first pull request (PR). I was still learning about the Shopify CLI (a command-line tool to help developers create Shopify apps) and kept falling down the wrong rabbit holes. The task: allow the Shopify CLI to take all Rails flags when scaffolding a Rails project; for example, using --db=DB to change the database type. It took a lot of guidance, learning, and mistakes to solve the issue, so when I finally shipped my first PR, it felt like a massive victory!

    Mentoring was another good way to see how far I’d come. When pairing with first-year Dev Degree students on similar tasks, I’d realize I was answering their questions—questions I myself had asked when working on that first PR—and guiding them toward their own solutions. It reminds me of the fact that we all start in the same place: with no context, little knowledge, and much curiosity. I have made progress.

    The Myth of the “Real” Developer Is, Well, a Myth

    After familiarizing myself with the tool through tasks similar to my first PR, I began working on a project that added themes as a project type to the Shopify CLI. The first feature I added was the command to scaffold a theme, followed by a task I always dreaded: testing. I understood what my code did, and what situations needed to have unit tests, but could not figure out how to write the tests.

    My lead gave a very simple suggestion: why not mimic the tests from the Rails project type? I was more familiar with the Rails project type, and the code being tested was very similar between the two test files. This had never occurred to me before, as it didn’t feel like something a “real” developer would do.

    An illustration of a checklist titled, "Real Programmer" with none of the boxes ticked
    Real Programmer checklist

    Maybe “real” developers program in their free time, are really into video games, or have a bunch of world-changing side projects going at all times. Or maybe they only use the terminal, never the GUI, and always in dark mode. I spoke to a few developer interns about what notions they used to have of a “real” developer, and those were some of their thoughts. My own contribution was that “real” developers could solve problems in their own creative ways and didn’t need help; it felt like I needed to be able to implement unique solutions without referring to any examples. So, following existing code felt like cheating at first. However, the more I applied the strategy, the more I realized it was helpful and entirely valid.

    For one, observational learning is a good way of building confidence toward being able to attempt something. My first pairing sessions on this team, before I felt comfortable driving, involved watching an experienced developer work through the problem on their screen. Looking at old code also gives a starting point for what questions to ask or keywords to search. By better understanding the existing tests, and changing them to fit what I needed to test, I was learning to write unrelated tests one bite-sized piece at a time.

    An illustration of the Real Programmer checklist, but criteria has been scratched out and a ticked box added next to checklist title, "Real Programmer"
    The Real Programmer is a dated stereotype

    At the end of the day, developers are problem solvers coming from all different walks of life, all with ways of learning that fit them. There’s no need to play into dated stereotypes!

    Imposter Syndrome Is Real, But It’s Only Part of the Picture

    As I was closing out the Shopify CLI project, I was added to my first team project: the Shopify Admin API Library for Node. This is a library written in TypeScript with support for authentication, GraphQL and REST requests, and webhook handling. It’s used to accelerate Node app development.

    With this being my first time working on a team at Shopify, I found it far too easy to nod along and pretend I knew more than I did. Figuring it would be a relatively simple feature to implement, a teammate suggested I add GraphQL proxy functionality to the library; but, for a very junior developer like me, it wasn’t all that simple. It was difficult to communicate the level of help I needed. Instead, I often seemed to require encouragement, which my team gave readily. I kept being told I was doing a good job, even though I felt I had hardly done anything, and whenever I mentioned to a friend that I felt I had no idea what I was doing, the response was often that they were sure I was doing great. That I needed to have more confidence in my abilities.

    I puzzled over the task for several weeks without admitting I had no idea what I was doing. In the end, my lead had to help me a lot. It wasn’t a matter of simply feeling like I knew nothing; I had been genuinely lost and successful at hiding it. Being surrounded by smart people, imposter syndrome is nearly inevitable. I’m still looking for ways to overcome this. Yet, there are moments when it is not simply a matter of underestimating myself; rather, it’s more about acknowledging how much I have to learn. To attribute the lack of confidence entirely to imposter syndrome would only work to isolate me more.

    Imposter syndrome or not, I still had to work through it. I had to ask for help; a vital challenge in the journey of learning. Throughout my placement, I was told repeatedly, by many team members, that they would like to hear me ask for help more often. And I’ve gotten better at that. Not only does it help me learn for the next time, it gives the team context into the problem I am solving and how. It may slow down progress for a little while, but in the larger picture, it moves the entire project forward. Plus, asking a senior developer for their time isn’t unreasonable, but normal. What would be unreasonable is expecting an intern to know enough not to ask!

    Learning Will Never Feel Fast

    My next project after the Node library was the Shopify Admin API Library for PHP (similar functionality to the Shopify Admin API Library for Node, but in a different language). The first feature I added was the Context class, comparable to a backpack of important information about the user’s app and shop being carried and used across the rest of the library’s functionality. I had never used PHP before and was excited to learn. That is, until I started to fall behind. Then I began to feel frustrated. Since I’d had experience building the previous library, shouldn’t I have been able to hit the ground running with this one?

    An illustration of a backpack with a tag that says "Context" with Context class parameters sticking out
    Shopify backpack full of context

    As a Dev Degree student, I work approximately half the hours of everyone else on my team, and my inexperience means I take longer to grasp certain concepts. Even knowing this, I tend to compare myself to others on my team, so I often feel very behind when it comes to how much I’ve accomplished. And any time I’m told that I’m a quick learner, it’s as if I’ve successfully fooled everyone.

    Determined to practice the lesson I had learned from trying to implement the GraphQL proxy in the Node library, I asked for help. I asked my teammates to review the draft PR occasionally while it was still a work-in-progress. The idea was that, if I strayed from the proper implementation, my mistake would be caught earlier on. Not only did this keep me on track, getting my teammates’ approval throughout the process also helped me realize I was less behind than I’d thought. It made learning a collaborative process. My inexperience and part-time hours didn’t matter, because I had a team to learn alongside.

    Even when I was on track, learning felt like a slow process—and that’s because that is the nature of learning. Not knowing or understanding things can be frustrating in the moment, especially when others seem to understand. But the perspective others have of me is different from mine. It’s not that I’ve fooled them; rather, they’re acknowledging my progress relative to only itself. They’re celebrating my learning for what it is.

    Valuing Representation Isn’t a Weakness

    As a kid, I couldn’t understand why representation mattered; if no one like you had achieved what you wanted to achieve before, then you’d be the first! Yet, this was the same kid who felt she could never become a teacher because her last name “didn’t sound like a teacher’s.” This internalized requirement to appear unbothered by my differences followed me as I got older, applying itself to other facets of me—gender, disability, ethnicity, age. I later faced covert sexism and ableism in high school, making it harder for me to remain unfazed. Consequently, when I first started working on an open source repository, I’d get nervous about how prejudiced a stranger could be just from looking at my GitHub profile picture.

    A doodle of a headshot with a nametag that reads "hello, I'm incompetent". It is annotated with the labels, "Asian", "girl", and "younger = inexperienced"
    Are these possible biases from a stranger?

    This nervousness has since died down. I haven’t been in any predicament of that sort on the team, and on top of that, I’ve been encouraged by everyone I interact with regularly to tell them if they aren’t being properly accommodating. I always have the right to be supported and included, but it’s my responsibility to share how others can help. We even have an autism support Slack channel at Shopify that gives me a space to see people like myself doing all sorts of amazing things and to seek out advice.

    Looking back on all the highlights of my Achievement Log, there’s a story that snakes through the entries. Not just one of learning and achievement, which I’ve shared, but one that’s best encapsulated by one last story.

    My lead sent out feedback forms to each of the team members to fill out about those they worked with. The last question was “Anything else to add?” Hesitantly, I messaged one of my teammates, asking if I could mention how appreciative I was of her openness about her neurodiversity. It was nice to have someone like me on the team, I told her. Imagine my surprise when she told me she’d meant to write the same thing for me!

    Valuing representation isn’t a weakness. It isn’t petty. There’s no need to appear unbothered. Valuing representation helps us value and empower one another—for who we are, what we represent, and the positive impact we have.

    There you have it, a recap of my first placement as a Dev Degree intern, some problems I solved, and some lessons I learned. Of course, I still have much to learn. There’s no quick fix to any challenges that come with learning. However, if I were to take only one lesson with me to my next placement, it would be this: there is so much learning involved, so get comfortable being uncomfortable! Face this challenge head-on, but always remember: learning is a collaborative activity.

    Having heard my story, you may be interested in learning more about Dev Degree, and possibly experiencing this journey for yourself. Applications for the Fall 2022 program are open until February 13, 2022.

    Carmela is a Dev Degree intern attending York University. She is currently on her second placement, working as a front-end developer in the Shopify Admin.

    Reusing Code with React Native Packages at Shopify

    At Shopify, we develop a bunch of different React Native mobile apps: Shop, Inbox, Point of Sale, Shopify Mobile, and Local Delivery. These apps represent different business domains, but they often have shared pieces of functionality like login or foundational blocks they build upon. Wouldn’t it be great to leverage development speed and focus on important product features by reusing code other teams have already written? Sure, but it might be a big and time-consuming effort that discourages teams. Usually, contributing to a new repo is more tedious and error-prone in comparison to contributing to an existing repository. The developer needs to create a new repository, set up continuous integration (CI) and distribution pipelines, and add configs for Jest, ESLint, and Babel. It might be unclear where to start and what to do.

    My team, React Native Foundations, decided to invest in simplifying the process for developers at Shopify. In this post, I'll walk you through the process of extracting those shared elements, the setup we adopted, the challenges we encountered, and future lines of improvement.

    Our Considerations: Monorepo vs Multi-Repo

    When we set out to extract elements from the product repositories, we explored two approaches: multi-repo and monorepo. For us, it was important that the solution had low maintenance costs, allowing us to be consistent without much effort. Of the two, monorepo was the one that helped us achieve that.

    Having one monorepo to support reduces maintenance costs. The team has one process that can be improved and optimized instead of maintaining and providing support for any number of packages and repositories. For example, imagine updating React Native and React versions across 10 repositories. I don’t even want to!

    A monorepo lowers the barrier to entry by offering everything you need to start building a package, including a package template to kick off building your package faster. Documentation and tooling provide the foundation for focusing on what’s important (the content of the package) instead of wasting time on configuring CI pipelines or thinking about the structure and configuration of the package.

    We want contributing to shared foundational code to be convenient and spark joy. Optimizing once, and for everyone, gives the team time and opportunity to improve the developer experience by offering features like generating automatic documentation and providing a fixture app to test changes during development. 

    Our Setup Details

    A repository consists of a set of npm packages that might contain native iOS and Android code, a fixture app that allows testing those packages in a real application, and an internal documentation website for users and contributors to learn how to use and contribute to the packages. The repository has an uncommon setup that makes it possible to hot-reload while editing the packages, to reference one package from another, and to use the packages from the fixture app.

    First, packages are developed in TypeScript but distributed as JavaScript and definition files. We use TypeScript project references so the TypeScript compiler resolves cross-package references. Since the IDE detects it's a TypeScript project, it resolves the imports on the UI. Dependencies between projects are defined in the tsconfig.json of each package.

    When distributing the packages, we use Yarn. It’s language-agnostic and therefore doesn't translate dependencies between TypeScript projects to dependencies between packages. For that, we use Yarn Workspaces. That means besides defining dependencies for TypeScript, we have to define them in the package.json for Yarn and npm. Lerna, the publishing tool we use to push new versions of the packages to the registry, knows how to resolve the dependencies and build them in the right order.

    We extract TypeScript, Babel, Jest, and ESLint configs to the root level to ensure consistent configuration across the packages. Consistency makes contributions easier because packages have a similar setup, and it also leads to a more reliable setup. 

    The fixture app setup is the standard setup of any React Native app using Metro, Babel, CocoaPods, and Gradle. However, it has custom configuration to import and link the packages that live within the same repository:

    • babel.config.js uses the module-resolver plugin to resolve project references. We wouldn't need this if Babel integrated with TypeScript's project references feature.
    • metro.config.js exposes the package directories to Metro so that hot reloading works when modifying the code of the packages.
    • Podfile has logic to locate and include the Pod of the local packages. It’s worth mentioning that we don’t use React Native autolinking for local packages, but install them manually.

    Developers test features by running the fixture app locally. They also have the option to create Shipit Mobile internal builds (which we call Snapshot builds) that they can share internally. Shared builds can be installed via QR code by any person in the company, allowing them to play with available packages.

    CI configuration is one of the things developers get for free when contributing to the monorepo. CI pipelines are auto-generated and therefore standardized across all the packages. Based on the content of the package we define the steps: 

    • build 
    • test 
    • type check 
    • lint TypeScript, Kotlin, and Swift code.
    A CI pipeline run showing all the steps (build, test, run, type check, and lint) run for a package with updates.

    Another interesting thing about our setup is that we generate a dependency graph to determine the dependencies between packages. Also, the pipelines are triggered based on file changes, so we only build the packages with new changes and those that depend on them.
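
    To make the idea concrete, here is a minimal sketch of how changed files could be mapped to the set of packages that need a CI run. It isn't our actual pipeline generator: the function names, the package names, and the directory layout are assumptions made for illustration.

```python
from collections import deque


def packages_for_files(changed_files, package_dirs):
    """Map changed file paths to the packages whose directory contains them."""
    return {
        name
        for name, directory in package_dirs.items()
        for path in changed_files
        if path.startswith(directory.rstrip("/") + "/")
    }


def affected_packages(dependencies, changed):
    """Return the changed packages plus every package that depends on them."""
    # Invert the graph: for each package, which packages depend on it?
    dependents = {name: set() for name in dependencies}
    for name, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk over the inverted graph, starting at the changes.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

    With a hypothetical graph where login depends on ui and ui depends on utils, a change under the utils directory selects all three packages, while a change to an unrelated package selects only that package.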

    Code Generation

    Even with all the infrastructure in place, it might be confusing to start contributing. Documentation describing the process helped up to a point, but we could do better by using automation and code generation to make bootstrapping new packages even easier.

    The React Native packages monorepo offers a script built with PlopJS for adding a new package based on the package template similar to the React Native community one. We took this template but customized it for Shopify. 

    A newly created package is a ready-to-use skeleton that extends the monorepo’s default configuration and has auto-generated CI pipelines in place. The script prompts for answers to some questions and generates the package and pipelines as a result.

    Terminal window showing the script that prompts the user for answers to questions needed to create the packages and CI pipelines.

    Code generation ensures consistency across packages since everything is predefined for contributors. For the React Native Foundations team, it means supporting and improving one workflow, which reduces maintenance costs.


    Documentation is as important as the code we add to the repository, and having great documentation is crucial to provide a great developer experience. Therefore, it shouldn't be an afterthought. To make it easier for contributors not to overlook writing documentation, the monorepo offers auto-generated documentation available in a statically generated website built with Gatsby.

    Screenshot of the package documentation website created by Gatsby. The left hand side shows the list of packages and the right hand side contains the details of the selected package.

    Each package shows up in the sidebar of the documentation website, and its page contains the following information that’s pre-populated by reading metadata from the package.json file:

    • package name 
    • package dependencies
    • installation command (including peer dependencies)
    • dependency graph of the packages in there.
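
    As a rough illustration of pre-populating a page from package.json, the sketch below derives the name, the dependency list, and an installation command that includes peer dependencies. The function name and the returned field names are hypothetical, and the real website is built with Gatsby rather than Python.

```python
import json


def documentation_metadata(package_json_text):
    """Extract the fields a generated documentation page could display."""
    manifest = json.loads(package_json_text)
    name = manifest["name"]
    dependencies = sorted(manifest.get("dependencies", {}))
    peers = sorted(manifest.get("peerDependencies", {}))
    return {
        "name": name,
        "dependencies": dependencies,
        # Peer dependencies have to be installed by the consuming app too,
        # so they are appended to the install command.
        "install_command": " ".join(["yarn", "add", name, *peers]),
    }
```

    Because every page is produced by the same function, each package gets the same sections without the author writing them by hand.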

    Since part of the documentation is auto-generated, it’s also consistent across the packages. Users see the same sections with as much generated content as possible. The website supports extending the documentation with manually written content by creating any of the following files under the documentation/ directory of the package:

    • installation.mdx: include extra installation steps
    • getting-started.mdx: document steps to get started with the package
    • troubleshooting.mdx: document issues developers might run into and how to tackle them.

    Release Process

    I’ve mentioned before that we use Lerna for releasing the packages. During a release, we version independently and only if a package has unreleased changes. Due to how Lerna approaches the release process, all unreleased changes need to be released at the same time. 

    Our standard release workflow includes updating changelogs with the newest version and calling a release script that prompts you to update all the modules touched since the last change. 

    When versioning locally, we run two additional npm lifecycle scripts:

    • preversion ensures that all the changelogs are updated correctly. It gets run before we upgrade the version.
    • version gets run after we've updated the versions but before we make the "Publish" commit. It generates an updated readme and runs pod install considering the bumped versions.

    After that, we get a new release commit with release tags that we need to push to the main branch. Now, the only thing left is to press “Publish”, and the packages will be released to the internal package registry. 

    The release process has a few manual steps and can be improved further. We keep the main branch always shippable but plan on introducing automated releases on every merge to reduce friction. To do that we might need to:

    • start using conventional commits in the repo
    • automate changelog generation
    • configure a GitHub action to prepare a release commit after every merge automatically. This step will generate the changelog automatically, trigger a Lerna release commit, and push that to main
    • schedule an automated release of the package right after.
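
    To show why conventional commits make this automation possible, here is a small sketch that derives a version bump and changelog sections from commit messages. It follows the Conventional Commits rules (fix means patch, feat means minor, a breaking change means major), but it isn't something we run today: the function names and section titles are assumptions.

```python
import re

# Header shape: type, optional (scope), optional "!" for breaking, summary.
COMMIT_PATTERN = re.compile(
    r"^(?P<type>\w+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<summary>.+)$"
)

SECTION_TITLES = {"feat": "Features", "fix": "Bug fixes"}


def next_bump(messages):
    """Return "major", "minor", "patch", or None for a list of commit messages."""
    bump = None
    for message in messages:
        header, _, body = message.partition("\n")
        match = COMMIT_PATTERN.match(header)
        if not match:
            continue  # not a conventional commit; ignore it
        if match["breaking"] or "BREAKING CHANGE:" in body:
            return "major"
        if match["type"] == "feat":
            bump = "minor"
        elif match["type"] == "fix" and bump is None:
            bump = "patch"
    return bump


def changelog_entries(messages):
    """Group commit summaries under changelog section titles."""
    sections = {}
    for message in messages:
        match = COMMIT_PATTERN.match(message.partition("\n")[0])
        if match and match["type"] in SECTION_TITLES:
            title = SECTION_TITLES[match["type"]]
            sections.setdefault(title, []).append(match["summary"])
    return sections
```

    A release job could call both functions on the commits merged since the last tag, then hand the resulting bump to the publishing tool.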

    The Future of Monorepos at Shopify

    In hindsight, we achieved our goal. Extracting and reusing code is easy: you get tooling, infrastructure, and maintenance from the React Native Foundations team, plus other nice things for free. Developers can easily share those internal packages, and product teams have a developer-friendly workflow to contribute to Shopify's foundation. As a result, 17 React Native packages have been developed since June 2020, with 10 of them contributed by product teams.

    Still, we learned some lessons along the way.

    We learned that the React Native tooling isn’t optimized for Shopify’s setup, but thanks to the flexibility of its APIs, we achieved a configuration we’re happy with. Still, the team keeps an eye on any inconveniences that come up and works on improving them.

    Also, we came up with the idea of having multiple monorepos for thematically related packages instead of one huge one. Based on the Web Foundation team’s experience and our impression, it makes sense to introduce a few monorepos for coupled packages. A recent talk from Microsoft at the React Native EU 2021 conference also confirmed that having multiple monorepos is a natural evolutionary step for massive React Native codebases. Now, we have two monorepos: one main monorepo contains loosely coupled packages with utilities and Shopify-specific features, and another contains a few performance-related packages. Still, as the number of monorepos grows, we’ll have to figure out how to reuse pieces across them to retain the benefits of a monorepo.

    Elvira Burchik is a Production Engineer on the React Native Foundations team. Her mission is to create an environment in which developers are highly productive at creating high-quality React Native applications. She lives in Berlin, Germany, and spends her time outside of work chasing the best kebabs and brewing coffee.

    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Shard Balancing: Moving Shops Confidently with Zero-Downtime at Terabyte-scale

    Moving a shop from one shard to another requires engineering solutions around large, interconnected systems. The flexibility to move shops from shard to shard allows Shopify to provide a stable, well-balanced infrastructure for our merchants. With merchants creating their livelihood on the platform, it’s more important than ever that Shopify remains a sturdy backbone. High-confidence shard rebalancing is simply one of the ways we can do this.

    Making Shopify’s Flagship App 20% Faster in 6 Weeks Using a Novel Caching Solution

    Shop is Shopify’s flagship shopping app. It lets anyone track their packages, find new products, and even plant trees to offset the carbon emissions from their purchases. Since launching in 2019, we’ve quickly grown to serve tens of millions of users. One of the most heavily used features of the app is the home page. It’s the first thing users see when they open Shop. The home page keeps track of people’s orders from the time they click the checkout button to when it’s delivered to their door. This feature brings so much value to our users, but we’ve had some technical challenges scaling this globally. As a result of Shop’s growth, the home feed was taking up a significant amount of our total database load, and was starting to have a user-facing impact.

    We prototyped a few solutions to fix this load issue and ended up building a custom write-through cache for the home feed. This was a huge success—after about six weeks of engineering work, we built a custom caching solution that reduced database load by 15% and overall app latency by about 20%.

    Identifying The Problem

    The main screen of the Shop app is the most used feature. Serving the feed is complex as it requires aggregating orders from millions of Shopify and non-Shopify merchants in addition to handling tracking data from dozens of carriers. Due to a variety of factors, loading and merging this data is both computationally expensive and quite slow. Before we started this project, 30% of Shop’s database load was from the home feed. This load didn’t only affect the home feed, it affected performance on all aspects of the application.

    We looked around for simple, straightforward solutions to this problem like introducing IdentityCache, updating our database schema, and adding more database indexes. After some investigation, we learned that we had little database-level optimization left to do and no time to embark on a huge code rewrite. Caching, on the other hand, seemed ideal for this situation. Because users open the home feed every day and the feed is sorted by time, home feed data was usually read shortly after it was written, which made it a good fit for a cache of some sort.

    Finding a Solution

    Because of the structure of the home feed, we couldn't use a plug-and-play caching solution. We think of a given user’s home feed as a sorted list of a user’s purchases, where the list can be large (some people do a lot of shopping!). The list can be updated by a series of concurrent operations that include:

    • adding a new order to display on the home feed (for example, when someone makes a purchase from a Shopify store)
    • updating the details associated with an order (for example, when the order is delivered)
    • removing an order from the list (for example, when a user manually archives the order).

    In order to cache the home feed, we’d need a system that maintains a cached version of a user’s feed, while handling arbitrary updates to the orders in the feed and also maintaining the guarantee that the feed order is correct.

    Due to the quantity of updates that we process, it’s infeasible to use a read-through cache that’s invalidated after every write, as the cache would end up being invalidated so often it would be practically useless. After some research, we didn’t find an existing solution that:

    • wasn’t invalidated after writes
    • could handle failure cases without showing stale data to users.

    So, we built one ourselves.

    Building Shop’s Caching Solution

    A flow diagram showing the state of the Shop app before adding a caching solution
    Before introducing the cache, when a user would make a request to load the home feed, the Rails application would serially execute multiple database queries, which had high latency.
    A flow diagram showing the state of the Shop app after the caching solution is introduced

    After introducing the cache, when a user makes a request to load their home feed, Rails loads their home feed from the cache and makes far fewer (and much faster) database requests.

    Rather than querying the database every time a user requests the home feed, we now cache a copy of their home feed in a fast, distributed, horizontally-scaled caching system (we chose Memcached). Then, provided certain conditions are met, we serve from the cache rather than the database at request time. To keep the cache valid and correct, we mark it as “invalid” before each database update, which ensures the cached data isn’t used while the cache and database are out of sync. After the write is complete, we update the cache with the new data and mark it as “valid” again.

    A flow diagram showing how Shop app updates the cache
    When Shop receives a shipping update from a carrier, we first mark the cache as invalid, then update the database, and then update the cache and mark it as valid.

    Deciding on Memcached

    At Shopify, we use two different technologies for caching: Memcached and Redis. Redis is more powerful than Memcached, supporting more complex operations and storing more complex objects. Memcached is simpler, has less overhead, and is more widely used for caching inside Shop. While we use Redis for managing queues and some caches, we didn’t need Redis’ complexity, so we chose a distributed Memcached. 

    The primary issue we had to solve was ensuring the cache never contained stale records. We minimize the chance of cache invalidation by building the cache using a write-through invalidation policy that invalidates the cache before a database write and revalidates it after the successful write. That led to the next hard question: how do we actually store the data in Memcached and handle concurrent updates?

    The naive approach would be to store a single key for each user in Memcached that maps a user to their home feed. Then, on write, invalidate the cache by evicting the key from the cache, make the database update, and finally revalidate the cache by writing the key again. The issue, unfortunately, is that there’s no support for concurrent writes. At Shop’s scale, multiple worker machines often concurrently process order updates for the same user. Using a delete-then-write strategy introduces race conditions that could lead to an incorrect cache, which is unacceptable.

    To support concurrent writes, we store an additional key/value pair (pending writes key) that tracks the validity of the cache for each user. The key stores the number of active writes to a given user’s home feed. Each time a worker machine is about to write to the database, we increment this value. When the update is complete, we decrement the value. This means the cache is valid when the pending writes key is zero.

    However, there’s one final case. What happens if a machine makes a database update and fails to decrement the pending writes key due to an interrupt or exception? How can we know if the pending writes key is greater than zero because there's currently a database write in progress or a process was interrupted?

    The solution is introducing a key with a short expiry that’s written before any database update. If this key exists, then we know there’s the possibility of a database update, but if it doesn’t and the pending writes key is greater than zero, we know there’s no active database write occurring, so it’s safe to rewarm the cache again.
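
    The sketch below puts the three keys together: the cached feed, the pending writes counter, and the short-lived marker written before a database update. It's a simplified, single-threaded illustration and not our production code. FakeCache is an in-memory stand-in for a Memcached client, and the key names, the five-second expiry, and the function names are all assumptions. It also leaves out how concurrent writers order their cache updates.

```python
import time


class FakeCache:
    """In-memory stand-in for a Memcached client (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def get(self, key):
        value, expires_at = self._data.get(key, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like missing keys
            return None
        return value

    def add_to_counter(self, key, delta):
        # Real Memcached has atomic incr/decr; this stand-in is not atomic.
        value = max((self.get(key) or 0) + delta, 0)
        self.set(key, value)
        return value


WRITE_MARKER_TTL = 5.0  # seconds; an assumed value


def load_feed_from_db(db, user_id):
    # Newest orders first, mirroring the time-based sort of the home feed.
    return sorted(db.get(user_id, []), key=lambda o: o["updated_at"], reverse=True)


def begin_write(cache, user_id):
    """Call before the database update: marks the cache as invalid."""
    cache.set(f"writing:{user_id}", 1, ttl=WRITE_MARKER_TTL)
    cache.add_to_counter(f"pending:{user_id}", +1)


def finish_write(cache, db, user_id):
    """Call after the database update: refreshes the cache, then revalidates."""
    cache.set(f"feed:{user_id}", load_feed_from_db(db, user_id))
    cache.add_to_counter(f"pending:{user_id}", -1)


def read_feed(cache, db, user_id):
    """Return (feed, source), where source says where the feed came from."""
    pending = cache.get(f"pending:{user_id}") or 0
    if pending == 0:
        cached = cache.get(f"feed:{user_id}")
        if cached is not None:
            return cached, "cache"
    elif cache.get(f"writing:{user_id}") is None:
        # The counter is stuck above zero but the marker has expired, so a
        # writer must have died. Rebuild the cache and reset the counter.
        feed = load_feed_from_db(db, user_id)
        cache.set(f"feed:{user_id}", feed)
        cache.set(f"pending:{user_id}", 0)
        return feed, "rewarmed"
    return load_feed_from_db(db, user_id), "database"
```

    In this sketch, a reader falls back to the database while a write may be in flight, and only rewarms the cache once the marker has expired without the counter returning to zero.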

    Another interesting detail is that we needed this code to work with all of our existing code in Shop and interplay seamlessly with that code. We wrote a series of Active Record Concerns that we mixed into the relevant database records. Using Active Record concerns meant that the ORM’s API stayed exactly the same, causing this change to be totally transparent to developers, and ensuring that all of this code was forward compatible. When Shop Pay became available to anyone selling on Google or Facebook, we were able to integrate the caching with minimal overhead.

    The Rollout Strategy

    Another important piece of this project was the rollout. Once we’d built the caching logic and integrated it with the ORM, we had to ship the cache to users. Theoretically sound, unit-tested code is a good first step, but without real world data, we weren’t confident enough in our system to deploy this cache without strict testing. We wanted to validate our hypothesis that it would never serve stale data to users.

    So, over the course of a few weeks, we ran an experiment. First, we turned on all the cache writing and updating logic (but not the logic to serve data from the cache) and tested at scale. Once we knew that system was durable and scalable, we tested its correctness. At home feed serve time, our backend loaded from both the cache and the database, compared their data, and logged to a dashboard if there was a discrepancy. After letting this experiment run for a few weeks and fixing the issues that arose, we were confident in our system’s correctness and scalability. We knew that the cache was always going to be valid and would not serve users stale or incorrect data.
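
    The correctness check can be sketched as a thin wrapper around the serving path. This is an illustration only: the function name and the callback parameters are hypothetical, and the real implementation lives in our Rails backend.

```python
def serve_with_shadow_check(user_id, load_from_db, load_from_cache, on_mismatch):
    """Serve the database result while checking the cache against it."""
    db_feed = load_from_db(user_id)
    cached_feed = load_from_cache(user_id)
    # A cache miss (None) is not a correctness problem; a differing feed is.
    if cached_feed is not None and cached_feed != db_feed:
        on_mismatch(user_id, db_feed, cached_feed)
    # During the experiment, users always get the database result.
    return db_feed
```

    Here on_mismatch stands for whatever reports the discrepancy, such as a logger or a metrics counter. Users are never exposed to a wrong cache entry because the database result is the one returned.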

    After rolling this cache out globally, we saw immediate, impactful results. In addition to the lower database load and faster home feed performance, we also observed a double-digit decrease in overall CPU usage and a 20% decrease in our overall GraphQL latency. Our database servers have a lighter load, our users have a faster experience, and our developers don’t need to worry about high database load. It’s a win-win-win.

    Ryan Ehrlich is a software engineer living in Palo Alto, California. He focuses on solving problems in large scale, distributed systems, and CV/NLP AI research. Outside of work, he’s an avid rock climber, cyclist, and hiker.

    Using Rich Image and Text Data to Categorize Products at Scale

    The last time we discussed product categorization on this blog, Shopify was powering over 1M merchants. We have since grown and currently serve millions of merchants who sell billions of products across a diverse set of industries. With this influx of new merchants, we decided to reevaluate our existing product categorization model to ensure we’re understanding what our merchants are selling, so we can build the best products that help power their sales.

    To do this, we considered two metrics of highest importance:

    1. How often were our predictions correct? To answer this question, we looked at the precision, recall, and accuracy of the model. This should be very familiar to anyone who has prior experience with classification machine learning models. For the sake of simplicity, let us call this set of metrics “accuracy”. These metrics are calculated using a holdout set to ensure an unbiased measurement.
    2. How often do we provide a prediction? Our existing model filters out predictions below a certain confidence threshold to ensure we only provide predictions that we’re confident about. So, we defined a metric called “coverage”: the ratio of the number of products with a prediction to the total number of products.
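
    As a quick illustration of the coverage metric, the sketch below counts how many products keep a prediction once a confidence threshold is applied. The function name and the sample confidence values are made up for the example.

```python
def coverage(confidences, threshold):
    """Share of products whose top prediction clears the confidence threshold."""
    if not confidences:
        return 0.0
    kept = sum(1 for confidence in confidences if confidence >= threshold)
    return kept / len(confidences)
```

    For four products with top-prediction confidences of 0.9, 0.4, 0.75, and 0.2, a threshold of 0.5 keeps two predictions, so coverage is 0.5. Raising the threshold to 0.8 drops coverage to 0.25, which is the trade-off between the two metrics.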

    In addition to these two metrics, we also care about how these predictions are consumed and if we’re providing the right access patterns and SLAs to satisfy all use cases. As an example, we might want to provide low-latency, real-time predictions to our consumers.

    After evaluating our model against these metrics and taking into account the various data products we were looking to build, we decided to build a new model to improve our performance. As we approached the problem, we reminded ourselves of the blind spots of the existing model. These included using only textual features for prediction and understanding products only in the English language.

    In this post, we’ll discuss how we evolved and modernized our product categorization model, increasing our leaf precision by 8% while doubling our coverage. We’ll dive into the challenges of solving this problem at scale and the technical trade-offs we made along the way. Finally, we’ll describe a product that’s currently being used by multiple internal teams and our partner ecosystems to build derivative data products.

    Why Is Product Categorization Important?

    Before we discuss the model, let’s recap why product categorization is an important problem to solve.

    Merchants sell a variety of products on our platform, and these products are sold across different sales channels. We believe that the key to building the best products for our merchants is to understand what they’re selling. For example, by classifying all the products our merchants sell into a standard set of categories, we can build features like better search and discovery across all channels and personalized insights to support merchants’ marketing efforts.

    Our current categorization model uses the Google Product Taxonomy (GPT). The GPT is a list of over 5,500 categories that help us organize products. Unlike a traditional flat list of categories or labels that’s common to most classification problems, the GPT has a hierarchical tree structure. Both the sheer number of categories in the taxonomy and the complex structure and relationship between the different classes make this a hard problem to model and solve.

    Sample branch from the GPT with the example of Animals & Pet Supplies classification

    The Model

    Before we could dive into creating our improved model, we had to take into account what we had to work with by exploring the product features available to us. Below is an example of the product admin page you would see in the backend of a Shopify merchant’s store:

    The product admin page in the backend of a Shopify store

    The image above shows the product admin page in the Shopify admin. We have highlighted the features that can help us identify what the product is. These include the title, description, vendor, product type, collections, tags, and product images.

    Clearly we have a few features that can help us identify what the product is, but nothing in a structured format. For example, multiple merchants selling the same product can use different values for Product Type. While this provides a lot of flexibility for the merchant to organize their inventory internally, it creates a harder problem in categorizing and indexing these products across stores.

    Broadly speaking we have two types of features available to us:


    Text Features

    • Product Title 
    • Product Description
    • Product Type
    • Product Vendor
    • Product Collections 
    • Product Tags

    Visual Features

    • Product Images


    These are the features we worked with to categorize the products.

    Feature Vectorization

    To start off, we had to choose what kind of vectorization approaches our features need since both text and image features can’t be used by most machine learning models in their raw state. After a lot of experimentation, we moved forward with transfer learning using neural networks. We used pre-trained image and text models to convert our raw features into embeddings to be further used for our hierarchical classification. This approach provided us with flexibility to incorporate several principles that we’ll discuss in detail in the next section.

    We horse raced several pre-trained models to decide which models to use for image and text embeddings. The parameters to consider were both model performance and computational cost. As we balanced out these two parameters, we settled on the choice of:

    • Multi-Lingual BERT for text 
    • MobileNet-V2 for images

    Model Architecture

    As explained in our previous post, hierarchical classification presents us with additional challenges beyond a flat multi-class problem. We had two lessons from our previous attempts at solving this problem:

    1. Preserving the multi-class nature of this problem is extremely beneficial in making predictions. For example: Level 1 in the taxonomy has 21 different class labels compared to more than 500 labels at Level 3.
    2. Learning parent nodes helps in predicting the child node. For example, if we look back at the image in our example of the Shopify product admin, it’s easier to predict the product as “Dog Beds”, if we’ve already predicted it as belonging to “Dog Supplies”.

    So, we went about framing the problem as a multi-task, multi-class classification problem in order to incorporate these learnings into our model.

    • Multi-Task: Each level of the taxonomy was treated as a separate classification problem and the output of each layer would be fed back into the next model to make the next level prediction. 
    • Multi-Class: Each level in the taxonomy contains a varying number of classes to choose from, so each task became a single multi-class classification problem. 
    Outline of model structure for the first 2 levels of the taxonomy

    The above image illustrates the approach we took to incorporate these lessons. As mentioned previously, we use pre-trained models to embed the raw text and image features and then feed the embeddings into multiple hidden layers before having a multi-class output layer for the Level 1 prediction. We then take the output from this layer along with the original embeddings and feed it into subsequent hidden layers to predict Level 2 output. We continue this feedback loop all the way until Level 7.

    Some important points to note:

    1. We have a total of seven output layers corresponding to the seven levels of the taxonomy. Each of these output layers has its own loss function associated with it. 
    2. During the forward pass of the model, parent nodes influence the outputs of child nodes.
    3. During backpropagation, the losses of all seven output layers are combined in a weighted fashion to arrive at a single loss value that’s used to calculate the gradients. This means that lower level performances can influence the weights of higher level layers and nudge the model in the right direction.
    4. Although we feed parent node prediction to child node prediction tasks in order to influence those predictions, we don’t impose any hard constraints that the child node prediction should strictly be a child of the previous level prediction. As an example the model is allowed to predict Level 2 as “Pet Supplies” even if it predicted Level 1 as “Arts & Entertainment”. We allow this during training so that accurate predictions at child nodes can nudge wrong predictions at the parent node in the right direction. We’ll revisit this point during the inference stage in a subsequent section.
    5. We can handle imbalance in classes using class weights during the training stage. The dataset we have is highly imbalanced, which makes it difficult to train a classifier that generalizes. Adding class weights lets us mitigate the effects of this imbalance: by weighting errors on classes that have fewer samples more heavily, we compensate for the lack of observations in those classes.
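
Points 3 and 5 can be illustrated with a minimal Python sketch. The inverse-frequency weighting formula, the function names, and the loss values and level weights below are assumptions made for illustration; they aren’t necessarily the settings used in the production model.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Rare classes get a weight above 1, frequent classes a weight below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def combined_loss(level_losses, level_weights):
    """Weighted sum of the per-level losses, giving one scalar to backpropagate."""
    if len(level_losses) != len(level_weights):
        raise ValueError("need exactly one weight per taxonomy level")
    return sum(w * loss for w, loss in zip(level_weights, level_losses))


# Hypothetical, heavily imbalanced label set for one taxonomy level.
labels = ["apparel"] * 8 + ["pet supplies"] * 2
weights = balanced_class_weights(labels)  # apparel: 0.625, pet supplies: 2.5

# Hypothetical losses for three levels, with deeper levels weighted less.
total = combined_loss([0.9, 1.4, 2.0], [1.0, 0.5, 0.25])
```

Because every level contributes to the single combined value, an error at Level 2 or below still produces gradients that flow back through the shared layers feeding Level 1.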

    Model Training

    One of the benefits of Shopify's scale is the availability of large datasets to build great data products that benefit our merchants and their buyers. For product categorization, we have collected hundreds of millions of observations to learn from. But this also comes with its own set of challenges! The model we described above turns out to be massive in complexity: it ends up having over 250 million parameters. Add to this the size of our dataset, and training this model in a reasonable amount of time becomes a challenging task. Training this model on a single machine can take multiple weeks, even with GPU utilization. We needed to bring down training time without sacrificing model performance.

    We decided to go with a data parallelization approach to solve this training problem. It enables us to speed up the training process by chunking up the training dataset and using one machine per chunk to train the model. The model was built and trained with distributed TensorFlow, using multiple workers and GPUs on Google Cloud Platform. We performed multiple optimizations to ensure that we utilized these resources as efficiently as possible.
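
The idea behind data parallelism can be sketched in plain Python. This is a toy illustration only: the one-parameter model, the learning rate, and the function names are assumptions, and it doesn’t use TensorFlow’s distribution APIs. Each “worker” computes a gradient on its own chunk, and the gradients are averaged before a single shared update.

```python
def shard(data, num_workers):
    """Split data into num_workers contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return [chunk for chunk in chunks if chunk]  # drop empty chunks


def gradient(w, chunk):
    """Gradient of the mean squared error of the model y = w * x on one chunk."""
    return sum(2 * (w * x - y) * x for x, y in chunk) / len(chunk)


def data_parallel_step(w, data, num_workers, lr):
    """One synchronous update: per-chunk gradients, then a size-weighted average."""
    chunks = shard(data, num_workers)
    grads = [gradient(w, chunk) for chunk in chunks]  # one machine per chunk in practice
    total = sum(len(chunk) for chunk in chunks)
    averaged = sum(g * len(chunk) for g, chunk in zip(grads, chunks)) / total
    return w - lr * averaged


# Toy dataset generated from y = 3x; training should recover w close to 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)
```

Weighting each chunk’s gradient by its size makes the averaged gradient equal to the gradient over the full batch, so the result matches single-machine training while the per-chunk work runs concurrently.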

    Model Inference and Predictions

    As described in the model architecture section, we don’t constrain the model to strictly follow the hierarchy during training. While this works during training, we can’t allow such behavior during inference time or we jeopardize providing a reliable and smooth experience for our consumers. To solve this problem, we incorporate additional logic during the inference step. The steps during prediction are:

    1. Make raw predictions from the trained model. This will return seven arrays of confidence scores. Each array represents one level of the taxonomy.
    2. Choose the category that has the highest confidence score at Level 1 and designate that as the Level 1 Prediction.
    3. Collect all the immediate descendants of the Level 1 prediction. From among these, choose the child that has the highest confidence score and designate this as the Level 2 prediction.
    4. Continue this process until we reach the Level 7 prediction.
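
The steps above can be sketched in plain Python. The data structures (one dictionary of confidence scores per level and a parent-to-children mapping), the function name, and the category names and scores are hypothetical; the production logic runs as TensorFlow operations, as described next.

```python
def predict_path(level_scores, children, roots):
    """Pick the best category at each level, restricted to children of the previous pick.

    level_scores: list of dicts {category: confidence}, one dict per taxonomy level.
    children: dict mapping a category to its immediate child categories.
    roots: list of Level 1 categories.
    """
    path = []
    candidates = roots
    for scores in level_scores:
        if not candidates:
            break  # the previous pick is a leaf, so there is nothing deeper to predict
        best = max(candidates, key=lambda category: scores.get(category, 0.0))
        path.append((best, scores.get(best, 0.0)))
        candidates = children.get(best, [])
    return path


# Hypothetical three-level slice of a taxonomy.
roots = ["Animals & Pet Supplies", "Arts & Entertainment"]
children = {
    "Animals & Pet Supplies": ["Pet Supplies", "Live Animals"],
    "Arts & Entertainment": ["Hobbies & Creative Arts"],
    "Pet Supplies": ["Dog Supplies", "Cat Supplies"],
}
level_scores = [
    {"Animals & Pet Supplies": 0.7, "Arts & Entertainment": 0.3},
    # The unconstrained Level 2 winner is not a child of the Level 1 pick.
    {"Pet Supplies": 0.4, "Live Animals": 0.1, "Hobbies & Creative Arts": 0.5},
    {"Dog Supplies": 0.8, "Cat Supplies": 0.2},
]

path = predict_path(level_scores, children, roots)
```

In this example, “Hobbies & Creative Arts” has the highest raw Level 2 score, but it’s skipped because it isn’t a descendant of the Level 1 prediction, so the returned path always follows the hierarchy.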

    We perform the above logic as TensorFlow operations and build a Keras subclass model to combine these operations with the trained model. This allows us to have a single TensorFlow model object that contains all the logic used in both batch and online inference.

    Schematic of subclassed model including additional inference logic

    The image above illustrates how we build a Keras subclass model to take the raw trained Keras functional model and attach it to a downstream TensorFlow graph to do the recursive prediction.

    Metrics and Performance

    We collected a suite of different metrics to measure the performance of a hierarchical classification model. These include:

    • Hierarchical accuracy
    • Hierarchical precision
    • Hierarchical recall
    • Hierarchical F1
    • Coverage
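
As a rough sketch of how such metrics can be computed, the Python below uses one common set-based definition of hierarchical precision, recall, and F1 (overlap between the predicted path and the true path), plus coverage as the share of products that receive any prediction. These definitions, the function name, and the sample paths are assumptions; the exact formulas used for the model may differ, and hierarchical accuracy is left out because its definition varies.

```python
def hierarchical_scores(true_paths, predicted_paths):
    """Set-based hierarchical precision, recall, F1, and coverage.

    Each path is a list of category names from Level 1 downward.
    Assumes category names are unique across the taxonomy.
    """
    overlap = predicted_total = true_total = covered = 0
    for true, predicted in zip(true_paths, predicted_paths):
        true_set, predicted_set = set(true), set(predicted)
        overlap += len(true_set & predicted_set)
        predicted_total += len(predicted_set)
        true_total += len(true_set)
        covered += 1 if predicted else 0
    precision = overlap / predicted_total if predicted_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    coverage = covered / len(true_paths) if true_paths else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "coverage": coverage}


# Hypothetical labels and predictions for three products.
true_paths = [
    ["Animals", "Pet Supplies", "Dog Supplies"],
    ["Apparel", "Shoes"],
    ["Home", "Furniture"],
]
predicted_paths = [
    ["Animals", "Pet Supplies", "Cat Supplies"],  # wrong only at the deepest level
    ["Apparel"],                                  # correct but shallow
    [],                                           # no prediction surfaced
]

scores = hierarchical_scores(true_paths, predicted_paths)
```

Here precision is 3/4 while recall is only 3/7, which shows how a model that predicts fewer, safer levels trades recall and coverage for precision.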

    In addition to gains in all the metrics listed above, the new model classifies products in multiple languages and isn’t limited to only products with English text, which is critical for us as we further Shopify's mission of making commerce better for everyone around the world.

    In order to ensure only the highest quality predictions are surfaced, we impose varying thresholds on the confidence scores at different levels to filter out low confidence predictions. This means not all products have predictions at every level.

    An example of this is shown in the image below:

    Smart thresholding

    The image above illustrates how the photo of the dog bed results in four levels of predictions. The first three levels all have a high confidence score and will be exposed. The fourth level prediction has a low confidence score and this prediction won’t be exposed.

    In this example, we don’t expose anything beyond the third level of predictions since the fourth level doesn’t satisfy our minimum confidence requirement.
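
A minimal Python sketch of this filtering step follows. The threshold values, confidence scores, and category names are made up for illustration, and stopping at the first level that misses its threshold is an assumption about how the cut-off is applied.

```python
def apply_thresholds(path, thresholds):
    """Expose predictions level by level until one falls below its threshold.

    path: list of (category, confidence) tuples ordered from Level 1 downward.
    thresholds: minimum confidence required at each level, in the same order.
    """
    exposed = []
    for (category, confidence), minimum in zip(path, thresholds):
        if confidence < minimum:
            break  # nothing at or below this level is surfaced
        exposed.append(category)
    return exposed


# Hypothetical predictions for the dog bed photo.
dog_bed_path = [
    ("Animals & Pet Supplies", 0.98),
    ("Pet Supplies", 0.95),
    ("Dog Supplies", 0.91),
    ("Dog Beds", 0.42),
]
thresholds = [0.80, 0.80, 0.85, 0.90]

exposed = apply_thresholds(dog_bed_path, thresholds)
```

Only the first three levels are returned, so the product still gets a useful but less specific category.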

    One thing we learned during this process was how to tune the model so that these different metrics are balanced in an optimal way. We could, for example, achieve a higher hierarchical precision at the cost of lower coverage. These are hard decisions to make, and making them requires understanding our business use case and its priorities. We can’t emphasize enough how vital it is for us to focus on the business use cases and the merchant experience in order to guide us. We optimized towards reducing negative merchant experience and friction. While metrics are a great indication of model performance, we also conducted spot checks and manual QA on our predictions to identify areas of concern.

    An example is how we paid close attention to model performance on items that belonged to sensitive categories like “Religious and Ceremonial”. While overall metrics might look good, they can also mask model performance in small pockets of the taxonomy that can cause a lot of merchant friction. We manually tuned thresholds for confidences to ensure high performance in these sensitive areas. We encourage the reader to also adopt this practice in rolling out any machine learning powered consumer facing data product.

    Where Do We Go From Here?

    The upgrade from the previous model gave us a boost in both precision and coverage. At a high level, we were able to increase precision by eight percent while also almost doubling the coverage. We have more accurate predictions for a lot more products. While we improved the model and delivered a robust product to benefit our merchants, we believe we can further improve it. Some of the areas for improvement include:

    • Data Quality: While we do have a massive rich dataset of labelled products, it does contain high imbalance. While we can address imbalance in the dataset using a variety of well known techniques like class weights and over/undersampling, we also believe we should be collecting fresh data points in areas where we currently don’t have enough. As Shopify grows, we notice that the products that our merchants sell get more and more diverse by the day. This means we’ll need to keep collecting data in these new categories and sections of the taxonomy.
    • Merchant Level Features: The current model focuses on product level features. While this is the most obvious place to start, there are also a lot of signals that don’t strictly belong at the individual product level but roll up to the merchant level that can help us make better predictions. A simple example of this is a hypothetical merchant called “Acme Shoe warehouse”. The name of this store alone strongly hints at what products the store sells.

    Kshetrajna Raghavan is a data scientist who works on Shopify's Commerce Algorithms team. He enjoys solving complex problems to help use machine learning at scale. He lives in the San Francisco Bay Area with his wife and two dogs. Connect with Kshetrajna on LinkedIn to chat.

    If you’re passionate about solving complex problems at scale, and you’re eager to learn more, we're always hiring! Reach out to us or apply on our careers page.


    A Kotlin Style .copy Function for Swift Structs


    Working in Android using Kotlin, we tend to create classes with immutable fields. This is quite nice when creating state objects, as it prevents parts of the code that interpret state (for rendering purposes, etc.) from modifying the state. This leads to better clarity about where values originate, fewer bugs, and easier, more focused testing.

    We use Kotlin’s data class to create immutable objects. If we need to overwrite existing field values in one of our immutable objects, we use the data class’s .copy function to set a new value for the desired field while preserving the rest of the values. Then we’d store this new copy of the object as the source of truth.

    While trying to bring this immutable object concept to our iOS codebase, I discovered that Swift’s struct isn’t quite as convenient as Kotlin’s data class because Swift’s struct doesn't have a similar copy function. To adopt this immutability pattern in Swift, you’ll have to write quite a lot of boilerplate code. 

    Initializing a New Copy of the Struct

    If you want to change one or more properties for a given struct, but preserve the other property values (as Kotlin’s data class provides), you’ll need an initializer that allows you to specify all the struct’s properties. The default memberwise initializer gives you this ability… until you give a constant (let) property a default value or define your own init in the struct body. In the first case the compiler drops that property from the memberwise initializer; in the second you lose the compiler-provided initializer altogether.

    So the first step is defining an init that captures every field value.

    Overriding Specific Property Values

    Using the init function above, you take your current struct and set every field to the current value, except the values you want to overwrite. This can get cumbersome, especially when your struct has numerous properties, or contains properties that are also structs.

    So the next step is to define a .copy function that accepts new values for its properties, but defaults to using the current values unless specified. The copy function takes optional parameter values and defaults all params to nil unless specified. If the param is non-nil, it sets that value in the new copy of the struct, otherwise it defaults to the current state’s value for the field.

    Not So Fast, What About Optional Properties?

    That works pretty well… until you have a struct that has optional fields. Then things don’t work as expected. What about the case when you have a non-nil value set for an optional property, and you want to set it to nil? Uh-oh, the .copy function will always default to the current value when it receives nil for a param.

    What if, rather than making the params in the copy function optional, we just set the default value to the struct’s current value? That’s how Kotlin solves this problem in its data class. In Swift it would look like this:

    Unfortunately in Swift you can’t reference self in default parameter values, so that’s not an option. I needed an alternate solution. 

    An Alternate Solution: Using a Builder

    I found a good solution on Stack Overflow: using a functional builder pattern to capture the override values for the new copy of the struct, while using the original struct’s values as input for the rest of the properties.

    This works a little differently: instead of a simple copy function that accepts params for our fields, we define a closure that receives the builder as its sole argument and lets you set overrides for selected properties.

    And voilà, it’s not quite as convenient as Kotlin’s data class and its copy function, but it’s pretty close.

    Sourcery—Automating All the Boilerplate Code

    Using the Sourcery code generator for Swift, I wrote a stencil template that generates an initializer for the copy function, as well as the builder for a given struct:

    Scott Birksted is a Senior Development Manager for the Deliver Mobile team that focuses on Order and Inventory Management features in the Shopify Mobile app for iOS and Android. Scott has worked in mobile development since its infancy (pre-iOS/Android) and is passionate about writing testable, extensible mobile code and first-class mobile user experiences.

    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.


    5 Steps for Building Machine Learning Models for Business


    By Ali Wytsma and C. Carquex

    Over the last decade, machine learning underwent a broad democratization. Countless tutorials, books, lectures, and blog articles have been published related to the topic. While the technical aspects of how to build and optimize models are well documented, very few resources are available on how developing machine learning models fits within a business context. When is it a good idea to use machine learning? How to get started? How to update a model over time without breaking the product?

    Below, we’ll share five steps and supporting tips on approaching machine learning from a business perspective. We’ve used these steps and tips at Shopify to help build and scale our suite of machine learning products. They may look simple, but when used together they give a straightforward workflow to help you productionize models that actually drive impact.

    A flow diagram representing the five steps for building machine learning models for business as discussed in the article.
    Guide for building machine learning models

    1. Ask Yourself If It’s the Right Time for Machine Learning?

    Before starting the development of any machine learning model, the first question to ask is: should I invest resources in a machine learning model at this time? It’s tempting to spend lots of time on a flashy machine learning algorithm. This is especially true if the model is intended to power a product that is supposed to be “smart”. Below are two simple questions to assess whether it’s the right time to develop a machine learning model:

    a. Will This Model Be Powering a Brand New Product?

    Launching a new product requires a tremendous amount of effort, often with limited resources. Shipping a first version, understanding product fit, figuring out user engagement, and collecting feedback are critical activities to be performed. Choosing to delay machine learning in these early stages allows resources to be freed up and focused instead on getting the product off the ground.

    We do recommend planning how to set up the data flywheel and how machine learning can improve the product down the line. Data is what makes or breaks any machine learning model, and having a solid strategy for data collection will serve the team and product for years to come. Explore what will be beneficial down the line so that the right foundations are in place from the beginning, but hold off on using machine learning until a later stage.

    Conversely, if the product is already launched and proven to solve the user’s pain points, developing a machine learning algorithm might improve and extend it.

    b. How Are Non-machine Learning Methods Performing?

    Before jumping ahead with developing a machine learning model, we recommend trying to solve the problem with a simple heuristic method. The performance of those methods is often surprising. A benefit to starting with this class of solution is that they’re typically easier and faster to implement, and provide a good baseline to measure against if you decide to build a more complex solution later on. They also allow the practitioner to get familiar with the data and develop a deeper understanding of the problem they are trying to solve.

    In 90 percent of cases, you can create a baseline using heuristics. Here are some of our favorites for various types of business problems:

    • Forecasting: For forecasting with time series data, moving averages are often robust and efficient.
    • Predicting Churn: Using a behavioural cohort analysis to determine user dropoff points is hard to beat.
    • Scoring: For scoring business entities (for example leads and customers), a composite index based on two or three weighted proxy metrics is easy to explain and fast to spin up.
    • Recommendation Engines: Recommending content that’s popular across the platform, with some randomness to increase exploration and content diversity, is a good place to start.
    • Search: Stemming and keyword matching give a solid heuristic.
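
Two of these heuristics fit in a few lines of Python. The sketch below is illustrative only: the function names, the window size, and the metric names and weights are assumptions, not values we recommend.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(series) < window:
        raise ValueError("window must be positive and no longer than the series")
    return sum(series[-window:]) / window


def composite_score(metrics, weights):
    """Score an entity as a weighted sum of a few proxy metrics scaled to 0..1."""
    return sum(weights[name] * metrics[name] for name in weights)


# Hypothetical daily order counts; forecast tomorrow from the last four days.
daily_orders = [120, 132, 128, 140, 150, 146, 158]
forecast = moving_average_forecast(daily_orders, window=4)

# Hypothetical lead with three proxy metrics and hand-picked weights.
lead = {"site_visits": 0.8, "past_orders": 0.5, "account_age": 0.2}
weights = {"site_visits": 0.5, "past_orders": 0.3, "account_age": 0.2}
score = composite_score(lead, weights)
```

Baselines like these take minutes to build, and their error on historical data gives any later machine learning model a concrete number to beat.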

    2. Keep It Simple

    When developing a first model, the excitement of seeking the best possible solution often leads to adding unnecessary complexity early on: engineering extra features or choosing the latest popular model architecture can certainly provide an edge. However, they also increase the time to build, the overall complexity of the system, as well as the time it takes for a new team member to onboard, understand, and maintain the solution.

    On the other hand, simple models enable the team to rapidly build out the entire pipeline and de-risk any surprise that could appear there. They’re the quickest path to getting the system working end-to-end.

    At least for the first iteration of the model, we recommend being mindful of these costs by starting with the simplest approach possible. Complexity can always be added later on if necessary. Below are a few tips that help cut down complexity:

    Start With Simple Models

    Simple models contribute to iteration speed and allow for better understanding of the model. When possible, start with robust, interpretable models that train quickly (shallow decision tree, linear or logistic regression are three good initial choices). These models are especially valuable for getting buy-in from stakeholders and non-technical partners because they’re easy to explain. If this model is adequate, great! Otherwise, you can move to something more complex later on. For instance, when training a model for scoring leads for our Sales Representatives, we noticed that the performance of a random forest model and a more complex ensemble model were on par. We ended up keeping the first one since it was robust, fast to train, and simple to explain.

    Start With a Basic Set of Features

    A basic set of features allows you to get up and running fast. You can defer most feature engineering work until it’s needed. Having a reduced feature space also means that computational tasks run faster with a quicker iteration speed. Domain experts often provide valuable suggestions for where to start. For example at Shopify, when building a system to predict the industry of a given shop, we noticed that the weight of the products sold was correlated with the industry. Indeed, furniture stores tend to sell heavier products (mattresses and couches) than apparel stores (shirts and dresses). Starting with these basic features that we knew were correlated allowed us to get an initial read of performance without going deep into building a feature set.

    Leverage Off-the-shelf Solutions

    For some tasks (in particular tasks related to images, video, audio, or text), it’s essential to use deep learning to get good results. In this case, pre-trained, off-the-shelf models help build a powerful solution quickly and easily. For instance, for text processing, a pre-trained word embedding model that feeds into a logistic regression classifier might be sufficient for an initial release. Fine-tuning the embedding to the target corpus comes in a subsequent iteration, if there’s a need for it.

    3. Measure Before Optimizing

    A common pitfall we’ve encountered is starting to optimize machine learning models too early. While it’s true that thousands of parameters and hyper-parameters have to be tuned (with respect to the model architecture, the choice of a class of objective functions, the input features, etc.), jumping too fast to that stage is counterproductive. Answering the two questions below before diving in helps make sure your system is set up for success.

    a. How is the Incremental Impact of the Model Going to Be Measured?

    Benchmarks are critical to the development of machine learning models. They allow for the comparison of performance. There are two steps to creating a benchmark, and the second one is often forgotten.

    Select a Performance Metric

    The metric should align with the primary objectives of the business. One of the best ways to do so is by building an understanding of what the value means. For instance, what does an accuracy of 98 percent mean in the business context? In the case of a fraud detection system, accuracy would be a poor metric choice, and 98 percent would indicate poor performance as instances of fraud are typically rare. In another situation, 98 percent accuracy could mean great performance on a reasonable metric.

    For comparison purposes, a baseline value for the performance metric can be provided by an initial non-machine learning method, as discussed in the Ask Yourself If It’s the Right Time for Machine Learning? section.

    Tie the Performance Metric Back to the Impact on the Business

    Design a strategy to measure the impact of a performance improvement on the business. For instance, if the metric chosen in step one is accuracy, the strategy chosen in step two should allow the quantification of how each percentage point increment impacts the user of the product. Is an increase from 0.8 to 0.85 a game changer in the industry or barely noticeable to the user? Are those 0.05 extra points worth the potential added time and complexity? Understanding this tradeoff is key to deciding how to optimize the model and drives decisions such as continuing or stopping to invest time and resources in a given model.

    b. Can You Explain the Tradeoffs That the Model Is Making?

    When a model appears to perform well, it’s easy to celebrate too soon and become comfortable with the idea that machine learning is an opaque box with a magical performance. Based on experience, in about 95 percent of cases the magical performance is actually the symptom of an issue in the system. A poor choice of performance metric, data leakage, or an uncaught balancing issue are just a few examples of what could be going wrong.

    Being able to understand the tradeoffs behind the performance of the model will allow you to catch any issues early, and avoid wasting time and compute cycles on optimizing a faulty system. One way to do this is by investigating the output of the model, and not just its performance metrics:

    • Classification System: In a classification system, what does the confusion matrix look like? Does the balancing of classes make sense?
    • Regression Model: When fitting a regression model, what does the distribution of residuals look like? Is there any apparent bias?
    • Scoring System: For a scoring system, what does the distribution of scores look like? Are they all grouped toward one end of the scale?



    Order Dataset (prediction accuracy: 98%)

                                            Actual: order is fraudulent    Actual: order is not fraudulent
    Predicted: order is fraudulent          0                              0
    Predicted: order is not fraudulent      20                             1,000

    Example of a model output with an accuracy of 98%. While 98% may look like a win, there are two issues at play:
    1. The model is consistently predicting “Order is not fraudulent”.
    2. Accuracy isn’t the appropriate metric to measure the performance of the model.

    Optimizing the model in this state doesn’t make sense; the metric needs to be fixed first.
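
The numbers in this example can be checked with a short Python sketch. The function name is an assumption, and returning 0.0 when a ratio is undefined (here, precision with no positive predictions) is a convention chosen for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from the four cells of a confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0  # undefined without positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Cells from the fraud example, treating "fraudulent" as the positive class:
# no orders predicted fraudulent, 20 missed frauds, 1,000 correct non-fraud predictions.
accuracy, precision, recall = classification_metrics(tp=0, fp=0, fn=20, tn=1000)
```

Accuracy comes out at roughly 0.98 while recall on fraudulent orders is 0, which is why a metric such as recall or precision on the fraud class describes this system better than accuracy does.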

    Optimizing the various parameters becomes simpler once the performance metric is set and tied to a business impact: the optimization stops when it doesn’t drive any incremental business impact. Similarly, by being able to explain the tradeoffs behind a model, errors that are otherwise masked by apparently great performance are likely to get caught early.

    4. Have a Plan to Iterate Over Models

    Machine learning models evolve over time. They can be retrained at a set frequency. Their architecture can be updated to increase their prediction power or features can be added and removed as the business evolves. When updating a machine learning model, the roll out of the new model is usually a critical part. We must understand our performance relative to our baseline, and there should be no regression in performance. Here are a few tips that have helped us do this effectively:

    Set Up the Pipeline Infrastructure to Compare Models

    Models are built and rolled out iteratively. We recommend investing in building a pipeline to train and experimentally evaluate two or more versions of the model concurrently. Depending on the situation, there are several ways to evaluate new models. Two great methods are:

    • If it’s possible to run an experiment without surfacing the output in production (for example, for a classifier where you have access to the labels), having a staging flow is sufficient. For instance, we did this in the case of the shop industry classifier, mentioned in the Keep It Simple section. A major update to the model ran in a staging flow for a few weeks before we felt confident enough to promote it to production. When possible, running an offline experiment is preferable because if there are performance degradations, they won’t impact users.
    • An online A/B test works well in most cases. By exposing a random group of users to our new version of the model, we get a clear view of its impact relative to our baseline. As an example, for a recommendation system where our key metric is user engagement, we assess how engaged the users exposed to our new model version are compared to users seeing the baseline recommendations to know if there’s a significant improvement.
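
For the A/B test case, one common way to check whether a difference in engagement is significant is a two-proportion z-test. The Python sketch below is a minimal version under stated assumptions: engagement is a yes/no outcome per user, the groups are independent random samples, and the user counts are invented. It isn’t necessarily the test used in our experiments.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). Assumes the pooled rate is strictly between 0 and 1.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - normal_cdf(|z|))
    return z, p_value


# Hypothetical experiment: 10,000 users per group,
# 1,000 engaged with the baseline and 1,100 with the new model.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

With these made-up numbers the p-value falls below 0.05, so the lift would usually be treated as significant at that level; with smaller groups the same one-point lift might not be.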

    Make Sure Comparisons Are Fair

    Will the changes affect how the metrics are reported? As an example, in a classification problem, if the class balance is different between the set the model variant is being evaluated on and production, the comparison may not be fair. Similarly, if we’re changing the dataset being used, we may not be able to use the same population for evaluating our production model as our variant model. If there is bias, we try to change how the evaluations are conducted to remove it. In some cases, it may be necessary to adjust or reweight metrics to make the comparison fair.

    Consider Possible Variance in Performance Metrics

    One run of the variant model may not be enough to understand its impact. Model performance can vary due to many factors like random parameter initializations or how data is split between training and testing. Verify its performance over time, between runs and based on minor differences in hyperparameters. If the performance is inconsistent, this could be a sign of bigger issues (we’ll discuss those in the next section!). Also, verify whether performance is consistent across key segments in the population. If that’s a concern, it may be worth reweighting the metric to prioritize key segments.
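
A simple way to look at run-to-run variance is to repeat training with different seeds and compare the spread to the observed gain. The scores below are invented, and comparing the gain against one standard deviation is a rough rule of thumb used here for illustration, not a formal test.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of a metric across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical accuracy of each model over five runs with different random seeds.
baseline_runs = [0.81, 0.80, 0.82, 0.81, 0.80]
variant_runs = [0.83, 0.79, 0.85, 0.80, 0.84]

baseline_mean, baseline_std = summarize_runs(baseline_runs)
variant_mean, variant_std = summarize_runs(variant_runs)

gain = variant_mean - baseline_mean
# If the gain is smaller than the variant's own run-to-run spread,
# a single run could easily have pointed in either direction.
gain_is_within_noise = gain < variant_std
```

In this example the variant is better on average, but its spread is larger than the gain, so more runs (or a look at why it’s so unstable) are warranted before rolling it out.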

    Does the Comparison Framework Introduce Bias?

    It’s important to be aware of the risks of overfitting when optimizing, and to account for this when developing a comparison strategy. For example, using a fixed test data set can cause you to optimize your model to those specific examples. Incorporating practices like cross validation, changing the test data set, using a holdout, regularization, and running multiple tests whenever random initializations are involved into your comparison strategy helps to mitigate these problems.
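
As one example of these practices, k-fold cross validation rotates which examples are held out so that no single fixed test set drives the optimization. The sketch below only generates the index splits; the function name and the interleaved fold assignment are assumptions, and ordered data should be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample once as test."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    folds = []
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        folds.append((train, test))
    return folds


# Ten samples split into five folds of two test examples each.
folds = k_fold_indices(n_samples=10, k=5)
```

Each model version is then trained and evaluated once per fold, and the comparison uses the averaged metric rather than the result on one particular split.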

    5. Consider the Stability of the Model Over Time

    One aspect that’s rarely discussed is the stability of prediction as a model evolves over time. Say the model is retrained every quarter, and the performance is steadily increasing. If our metric is good, this means that performance is improving overall. However, individual subjects may have their predictions changed, even as the performance increases overall. That may cause a subset of users to have a negative experience with the product, without the team anticipating it.

    As an example, consider a case where a model is used to decide whether a user is eligible for funding, and that eligibility is exposed to the user. If the user sees their status fluctuate, that could create frustration and destroy trust in the product. In this case, we may prefer stability over marginal performance improvements. We may even choose to incorporate stability into our model performance metric.

    Two graphs side by side representing model Q1 on the left and model Q2 on the right. The graphs highlight the difference between accuracy and how overfitting can change that.
    Example of the decision boundary of a model, at two different points in time. The symbols represent the actual data points and the class they belong to (red division sign or blue multiplication sign). The shaded areas represent the class predicted by the model. Overall the accuracy increased, but two samples out of the eight switched to a different class. It illustrates the case where the eligibility status of a user fluctuates over time. 
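
The effect can be quantified by tracking, alongside accuracy, the share of subjects whose prediction changed between two model versions. The sketch below uses eight invented samples and hypothetical function names; it isn’t a reconstruction of the figure’s data.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the true labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def prediction_churn(old_predictions, new_predictions):
    """Share of subjects whose predicted class differs between two model versions."""
    changed = sum(o != n for o, n in zip(old_predictions, new_predictions))
    return changed / len(old_predictions)


# Hypothetical eligibility labels (1 = eligible) and predictions from two quarterly models.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
q1_predictions = [1, 1, 0, 0, 0, 0, 0, 1]
q2_predictions = [1, 1, 1, 0, 0, 0, 0, 0]

q1_accuracy = accuracy(labels, q1_predictions)
q2_accuracy = accuracy(labels, q2_predictions)
churn = prediction_churn(q1_predictions, q2_predictions)
```

Accuracy rises from 0.625 to 0.875, yet a quarter of the users see their status change, which is the tradeoff to weigh before promoting the new model.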

    Being aware of this effect and measuring it is the first line of defense. The causes vary depending on the context. This issue can be tied to a form of overfitting, though not always. Here’s our checklist to help prevent this:

    • Understand the costs of changing your model. Consider the tradeoff between the improved performance versus the impact of changed predictions, and the work that needs to be done to manage that. Avoid major changes in the model, unless the performance improvements justify the costs.
    • Prefer shallow models to deep models. For instance in a classification problem, a change in the training dataset is more likely to make a deep model update its decision boundary in local spots than a shallow model. Use deep models only when the performance gains are justified.
    • Calibrate the output of the model, especially for classification and ranking systems. Calibration highlights changes in distribution and reduces them.
    • Check for objective function condition and regularization. A poorly conditioned model has a decision boundary that changes wildly even if the training conditions change only slightly.

    The Five Factors That Can Make or Break Your Machine Learning Project

    To recap, there are a lot of factors to consider when building products and tools that leverage machine learning in a business setting. Carefully considering these factors can make or break the success of your machine learning projects. To summarize, always remember to:

    1. Ask yourself if it’s the right time for machine learning? When releasing a new product, it’s best to start with a baseline solution and improve it down the line with machine learning.
    2. Keep it simple! Simple models and feature sets are typically faster to iterate on and easier to understand, both of which are crucial for the first version of a machine learning product.
    3. Measure before optimizing. Make sure that you understand the ins and outs of your performance metric and how it impacts the business objectives. Have a good understanding of the tradeoffs your model is making.
    4. Have a plan to iterate over models. Expect to iteratively make improvements to the model, and make a plan for how to make good comparisons between new model versions and the existing one.
    5. Consider the stability of the model over time. Understand the impact stability has on your users, and take that into consideration as you iterate on your solution. 

    Ali Wytsma is a data scientist leading Shopify's Workflows Data team. She loves using data in creative ways to help make Shopify's admin as powerful and easy to use as possible for entrepreneurs. She lives in Kitchener, Ontario, and spends her time outside of work playing with her dog Kiwi and volunteering with programs to teach kids about technology.

    Carquex is a Senior Data Scientist for Shopify’s Global Revenue Data team. Check out his last blog on 4 Tips for Shipping Data Products Fast.

    We hope this guide helps you in building robust machine learning models for whatever business needs you have! If you’re interested in building impactful data products at Shopify, check out our careers page.

    Diggin’ and Fetchin’ with TruffleRuby

    Sometimes as a software developer you come across a seemingly innocuous piece of code that, when investigated, leads you down a rabbit hole much deeper than you anticipated. This is the story of such a case.

    It begins with some clever Ruby code that we want to refactor, and ends with a prototype solution that changes the language itself. Along the way, we unearth a performance issue in TruffleRuby, an implementation of the Ruby language, and with it, an opportunity to work at the compiler level to even off the performance cliff. I’ll share this story with you.

    A Clever Way to Fetch Values from a Nested Hash

    The story begins with some Ruby code that struck me as a little bit odd. This was production code seemingly designed to extract a value from a nested hash, though it wasn’t immediately clear to me how it worked. I’ve changed names and values, but this is functionally equivalent to the code I found:
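As a rough sketch of the shape being described (two chained `fetch` calls, each with a second argument), the code might look something like this. The `:response` key and `IdentityObject` are named later in this post; the `:identity` key, the contents of `data`, and the empty class body are hypothetical stand-ins, not the production code.

```ruby
# Hypothetical stand-in for the real class used as the final default.
class IdentityObject; end

# Hypothetical payload; the real hash used different names and values.
data = { response: { identity: "customer-123" } }

# Two chained fetch calls, each with a second (default) argument.
identity = data.fetch(:response, {}).fetch(:identity, IdentityObject.new)

# The same expression against a hash that has no :response key at all.
fallback = {}.fetch(:response, {}).fetch(:identity, IdentityObject.new)
```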

    Two things specifically stood out to me. Firstly, when extracting the value from the data hash, we’re calling the same method, fetch, twice and chaining the two calls together. Secondly, each time we call fetch, we provide two arguments, though it isn’t immediately clear what the second argument is for. Could there be an opportunity to refactor this code into something more readable?

    Before I start thinking about refactoring, I have to make sure I understand what’s actually going on here. Let’s do a quick refresher on fetch.

    About Fetch

    The Hash#fetch method is used to retrieve a value from a hash by a given key. It behaves similarly to the more commonly used [ ] syntax, which is itself a method and also fetches values from a hash by a given key. Here’s a simple example of both in action.
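A minimal sketch of the two lookups side by side, using a made-up hash:

```ruby
# Hypothetical data for illustration.
basket = { fruit: "apple" }

# Both return the value stored under :fruit.
by_brackets = basket[:fruit]
by_fetch = basket.fetch(:fruit)
```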

    Like we saw in the production code that sparked our investigation initially, you can chain calls to fetch together, like you would using [ ], to extract a value from nested key-value pairs.

    Now, this works nicely assuming that each key exists and each chained call to fetch returns a hash itself. But what if a key is missing? Well, fetch will raise a KeyError.
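Here is a short sketch of chaining and of the failure case, with hypothetical keys. The rescue is only there so the error can be inspected instead of halting the script.

```ruby
# Hypothetical nested data.
data = { response: { status: "ok" } }

# Each fetch returns a hash that the next fetch can be called on.
status = data.fetch(:response).fetch(:status)

# Asking for a key that isn't there raises KeyError.
error =
  begin
    data.fetch(:response).fetch(:missing)
  rescue KeyError => e
    e
  end
```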

    This is where our optional second argument comes in. Fetch accepts an optional second argument that serves as a default value if a given key can’t be found. If you provide this argument, you get it back instead of the key error being raised.

    Helpfully, you can also pass a block to make the default value more dynamic.
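A small sketch of both kinds of default, again with hypothetical keys and values:

```ruby
# Hypothetical settings hash.
settings = { theme: "dark" }

# Second argument: returned only when the key is missing.
font = settings.fetch(:font, "sans-serif")

# Block: receives the missing key and computes the default on demand.
size = settings.fetch(:size) { |key| "no #{key} configured" }

# When the key exists, the default is ignored.
theme = settings.fetch(:theme, "light")
```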

    Let’s loop back around to the original code and look at it again now that we’ve had a quick refresher on fetch.

    The Refactoring Opportunity

    Now, it makes a little more sense as to what’s going on in the original code we were looking at. Here it is again to remind you:

    The first call to fetch is using the optional default argument in an interesting way. If our data hash doesn’t have a response key, instead of raising a KeyError, it returns an empty hash. In this scenario, by the time we’re calling fetch the second time, we’re actually calling it against an empty hash.

    Since an empty hash has no key-value pairs, this means when we evaluate the second call to fetch, we always get the default value returned to us. In this case, it’s an instance of IdentityObject.

    While a clever workaround, I feel this could look a lot cleaner. What if we reduced a chained fetch into a single call to fetch, like below?

    Well, there’s a precedent for this, actually, in the form of the Hash#dig method. Could we refactor the code using dig? Let’s do a quick refresher on this method before we try.

    About Dig

    Dig acts similarly to the [ ] and fetch methods. It’s a method on Ruby hashes that allows for the traversing of a hash to access nested values. Like [ ], it returns nil when it encounters a missing key. Here’s an example of how it works.
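A minimal sketch of `dig` against a hypothetical nested hash:

```ruby
# Hypothetical nested data.
data = { response: { identity: { id: 42 } } }

# Walks the keys in order and returns the innermost value.
id = data.dig(:response, :identity, :id)

# Stops and returns nil as soon as a key is missing.
missing = data.dig(:response, :profile, :id)
```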

    Now, if we try to refactor our initial code with dig, we can already make it look a lot cleaner and more readable.
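One way such a refactor could look, as a sketch with the same hypothetical `:identity` key and stand-in `IdentityObject` class. Note the assumption baked into `||`: it also substitutes the default when the stored value is `nil` or `false`, which the chained `fetch` version wouldn't do.

```ruby
# Hypothetical stand-in for the real default class.
class IdentityObject; end

# Hypothetical payload with no :identity key.
data = { response: {} }

# dig returns nil for the missing key, so || supplies the default.
identity = data.dig(:response, :identity) || IdentityObject.new
```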

    Nice. With the refactor complete, I’m thinking, mission accomplished. But...

    Versatile Fetch

    One thing continues to bother me. dig just doesn’t feel as versatile as fetch does. With fetch you can choose between raising an error when a key isn’t found, returning nil, or returning a default in a more readable and user-friendly way.

    Let me show you what I mean with an example.

    Fetch is able to handle multiple control flow scenarios handily. With dig, this is more difficult because you’d have to raise a KeyError explicitly to achieve the same behaviour. In fact, you’d also have to add logic to make a determination about whether the key doesn’t exist or has an explicitly set value of nil, something that fetch handles much better.
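To make the difference concrete, here is a sketch using a hypothetical hash where one key is explicitly set to nil. `strict_dig` is a made-up helper showing the extra work `dig` would need to match `fetch`.

```ruby
# Hypothetical data: :name exists but holds nil, :nickname doesn't exist.
data = { name: nil }

# fetch tells the two cases apart.
present = data.fetch(:name, "default")      # the stored nil
absent = data.fetch(:nickname, "default")   # the default

# dig returns nil for both, so the cases look identical.
dig_present = data.dig(:name)
dig_absent = data.dig(:nickname)

# Hypothetical helper: the explicit check needed to raise like fetch does.
def strict_dig(hash, key)
  raise KeyError, "key not found: #{key.inspect}" unless hash.key?(key)

  hash.dig(key)
end
```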

    So, what if Ruby hashes had a method that combined the flexibility of fetch with the ability to traverse nested hashes like dig is able to do? If we could do that, we could potentially refactor our code to the following:

    Of course, if we want to add this functionality, we have a few options. The simplest one is to monkey patch Ruby’s existing Hash class and add our new method to it. This lets me test out the logic with minimal setup required.

    There’s also another option. We can try to add this new functionality to the implementation of the Ruby language itself. Since I’ve never made a language level change before, and because it seems more fun to go with option two, I decided to see how hard such a change might actually be.

    Adding a New Method to Ruby Hashes

    Making a language level change seems like a fun challenge, but it’s a bit daunting. Most of the standard implementation of the Ruby language is written using C. Working in C isn’t something I have experience with, and I know enough to know the learning curve would be steep.

    So, is there an option that lets us avoid having to dive into writing or changing C code, but still allows us to make a language level change? Maybe there’s a different implementation of Ruby we could use that doesn’t use C?

    Enter TruffleRuby.

    TruffleRuby is an alternative implementation of the Ruby programming language built for GraalVM. It uses the Truffle language implementation framework and the GraalVM compiler. One of the main aims of the TruffleRuby language implementation is to run idiomatic Ruby code faster. Currently it isn’t widely used in the Ruby community. Most Ruby apps use MRI or other popular alternatives like JRuby or Rubinius.

    However, the big advantage is that parts of the language are themselves written in Ruby, making working with TruffleRuby much more accessible for folks who are proficient in the language already.

    After getting set up with TruffleRuby locally (you can do the same using the contributor guide), I jumped into trying to make the change.

    Implementing Hash#dig_fetch in TruffleRuby

    The easiest way to prototype our new behaviour is to add a brand new method on Ruby hashes in TruffleRuby. Let’s start with the very simple happy case, fetching a single value from a given hash. We’ll call our method dig_fetch, at least for our prototype.

    Here’s how it works.

    Let’s add a little more functionality. We’ll keep in line with fetch and make this method raise a KeyError if the current key isn’t found. For now, we just format the KeyError the same way that the fetch method has done it.
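A sketch of what this first version might look like, written here as a plain Ruby monkey patch rather than as TruffleRuby source. The body and the error message format are assumptions. It deliberately uses a naive nil check, so a key that holds nil is treated the same as a missing key.

```ruby
class Hash
  # Hypothetical prototype: handles a single key only.
  def dig_fetch(key)
    value = self[key]
    # Naive check: a stored nil is indistinguishable from a missing key.
    raise KeyError, "key not found: #{key.inspect}" if value.nil?

    value
  end
end

found = { a: 1 }.dig_fetch(:a)
```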

    You may have noticed that there’s still a problem here. With this implementation, we won’t be able to handle the scenario where keys are explicitly set to nil, as they raise a KeyError as well. Thankfully, TruffleRuby has a way to deal with this that’s showcased in its implementation of fetch.

    Below is how the body of the fetch method starts in TruffleRuby. You see that it uses a module called Primitive, which exposes the methods hash_get_or_undefined and undefined?. For the purposes of this post we won’t need to go into detail about how this module works, just know that these methods will allow us to distinguish between explicit nil values and keys that are missing from the hash. We can use this same strategy in dig_fetch to get around our problem of keys existing but containing nil values.
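The idea behind those two methods can be approximated in plain Ruby with a unique sentinel object. This is only an analogy: `UNDEFINED`, `get_or_undefined`, and `undefined?` below are hypothetical helpers, not TruffleRuby's Primitive module.

```ruby
# A unique object that no caller could have stored in the hash.
UNDEFINED = Object.new.freeze

# Returns the stored value (even if nil), or the sentinel when the key is missing.
def get_or_undefined(hash, key)
  hash.fetch(key, UNDEFINED)
end

# True only for the sentinel itself, never for nil.
def undefined?(value)
  value.equal?(UNDEFINED)
end

data = { name: nil }
stored_nil = get_or_undefined(data, :name)
not_there = get_or_undefined(data, :nickname)
```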

    Now, when we update our dig_fetch method, it looks like this:

    And here is our updated dig_fetch in action.

    Finally, let’s add the ability to ‘dig’ into the hash. We take inspiration from the existing implementation of dig and write this as a recursive call to our dig_fetch method.

    Here’s the behaviour in action:

    From here, it’s fairly easy to add the logic for accepting a default. For now, we just use blocks to provide our default values.
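Putting the pieces together, a recursive prototype with a block default could be sketched like this in plain Ruby. It uses `key?` in place of TruffleRuby's internal helpers, and it assumes every intermediate value is a hash, so it is a simplification rather than the code from the prototype.

```ruby
class Hash
  # Hypothetical prototype, not part of Ruby or TruffleRuby.
  def dig_fetch(key, *rest, &default)
    unless key?(key)
      # Missing key: use the block if given, otherwise raise like fetch.
      return default.call(key) if default

      raise KeyError, "key not found: #{key.inspect}"
    end

    value = self[key]
    return value if rest.empty?

    # Recurse into the nested hash with the remaining keys.
    value.dig_fetch(*rest, &default)
  end
end

data = { a: { b: { c: 1, d: nil } } }
deep = data.dig_fetch(:a, :b, :c)
stored_nil = data.dig_fetch(:a, :b, :d)
fallback = data.dig_fetch(:a, :x) { |key| "no #{key}" }
```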

    And tada, it works!

    So far, making this change has gone smoothly. But in the back of my mind, I’ve been thinking that any language level change would have to be justified with performance data. Instead of just making sure our solution works, we should make sure it works well. Does our new method hold up, performance-wise, to the other methods which extract values from a hash?

    Benchmarking—A Performance Cliff Is Found

    I figure it makes sense to test the performance of all three methods that we’ve been focusing on, namely, dig, fetch, and dig_fetch. To run our benchmarks, I’m using a popular Ruby library called benchmark-ips. As for the tests themselves, let’s keep them really simple.

    For each method, let's look at two things:

    • How many iterations it can complete in x seconds. Let’s say x = 5.
    • How the depth of the provided hash might impact the performance. Let’s test hashes with three, six, and nine nested keys.

    This example shows how the tests are set up if we were testing all three methods to a depth of three keys.
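For readers who want to try something similar without installing a gem, here is a rough stand-in that uses the standard library's Benchmark module instead of benchmark-ips. It times a fixed number of iterations rather than counting iterations in a fixed window, and it only covers `dig` and chained `fetch`; the hash, the keys, and the iteration count are all hypothetical.

```ruby
require "benchmark"

# Hypothetical hash with three nested keys.
data = { a: { b: { c: "value" } } }
iterations = 100_000

# Time the same number of lookups for each method.
results = {
  "dig" => Benchmark.realtime { iterations.times { data.dig(:a, :b, :c) } },
  "fetch" => Benchmark.realtime { iterations.times { data.fetch(:a).fetch(:b).fetch(:c) } }
}

results.each do |label, seconds|
  puts format("%-6s %.4f s for %d iterations", label, seconds, iterations)
end
```

A prototype such as dig_fetch would simply be a third entry in `results`.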

    Ok, let’s get testing.

    Running the Benchmark Tests

    We start by running the tests against hashes with a depth of three and it looks pretty good. Our new dig_fetch method performs very similarly to the other methods, knocking out about 458.69M iterations every five seconds.

    But uh-oh. When we double the depth to six (as seen below) we already see a big problem emerging. Our method's performance degraded severely. Interestingly, dig degraded in a very similar way. We used this method for inspiration in implementing our recursive solution, and it may have unearthed a problem with both methods.

    Let’s try running these tests on a hash with a depth of nine. At this depth, things have gotten even worse for our new method and for dig. We are now only seeing about 12.7M iterations every five seconds, whereas fetch is still able to clock about 164M.

    When we plot the results on a graph, you see how much more performant fetch is over dig and dig_fetch.

    Line graph of Performance of Hash methods in TruffleRuby

    So, what is going on here?

    Is Recursion the Problem?

    Let’s look at dig, the implementation of which inspired our dig_fetch method, to see if we can find a reason for this performance degradation. Here’s what it looks like, roughly.
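The recursive shape can be sketched in plain Ruby as follows. `recursive_dig` is a hypothetical free-standing function written for illustration, not TruffleRuby's source.

```ruby
# Looks up the first key, then calls itself with the remaining keys.
def recursive_dig(hash, key, *rest)
  value = hash[key]
  # Stop when the chain is broken or there are no more keys to follow.
  return value if value.nil? || rest.empty?

  recursive_dig(value, *rest)
end

data = { a: { b: { c: 1 } } }
found = recursive_dig(data, :a, :b, :c)
missing = recursive_dig(data, :a, :x, :c)
```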


    The thing that really jumps out is that both dig and dig_fetch are implemented recursively. In fact, we used the implementation of dig to inspire our implementation of dig_fetch so we could achieve the same hash traversing behaviour.

    Could recursion be the cause of our issues?

    Well, it could be. An optimizing implementation of Ruby such as TruffleRuby attempts to combine recursive calls into a single body of optimized machine code, but there’s a limit to inlining—we can’t inline forever producing infinite code. By contrast, an iterative solution with a loop starts with the code within a single body of optimized machine code in the first place.

    It seems we’ve uncovered an opportunity to fix the production implementation of dig in TruffleRuby. Can we do it by reimplementing dig with an iterative approach?

    Shipping an Iterative Approach to #dig

    Ok, so we know we want to optimize dig to be iterative and then run the benchmark tests again to test out our theory. I’m still fairly new to TruffleRuby at this point, and because this performance issue is impacting production code, it’s time to inform the TruffleRuby team of the issue. Chris Seaton, founder and maintainer of the language implementation, is available to ship a fix for dig’s performance degradation problem. But first, we need to fix the problem.

    So, let’s look at dig again.

    To simplify things, let’s implement the iterative logic in a new package in TruffleRuby we will call Diggable. To be totally transparent, there’s a good reason for this, though one that we’ve glossed over in this post—dig is also available on Arrays and Structs in Ruby. By pulling out the iterative implementation into a shared package, we can easily update Array#dig, and Struct#dig to share the same behaviour later on. For now though, we focus on the Hash implementation.

    Inside Diggable, we make a method called dig and add a loop that iterates as many times as the number of keys that were passed to dig initially.
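A simplified sketch of that loop in plain Ruby. The module name follows the post, but the method signature and body are assumptions, and it only handles objects that respond to `[]`.

```ruby
module Diggable
  # Hypothetical iterative dig: one loop pass per key, no recursion.
  def self.dig(obj, keys)
    keys.each do |key|
      # A broken chain yields nil, as with Hash#dig.
      return nil if obj.nil?

      obj = obj[key]
    end
    obj
  end
end

data = { a: { b: { c: 1 } } }
found = Diggable.dig(data, [:a, :b, :c])
missing = Diggable.dig(data, [:a, :x, :c])
```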

    With this change, dig continues to work as expected and the refactor is complete.

    #dig Performance Revisited

    Now, let’s have a look at performance again. Things look much better for dig with this new approach.

    Our solution had a big impact on the performance of dig. Previously, dig could only complete ~2.5M iterations per second against a hash with nine nested keys, but after our changes it has improved to ~16M. You can see these results plotted below.

    Line graph of Performance of Hash#dig in TruffleRuby

    Awesome! And we actually ship these changes to see a positive performance impact in TruffleRuby. See Chris’ real PRs #2300 and #2301.

    Now that that’s out of the way, it’s time to apply the same process to our dig_fetch method and see if we get the same results.

    Back to Our Implementation

    Now that we’ve seen the performance of dig improve we return to our implementation and make some improvements. Let’s add to the same Diggable package we created when updating dig.

    The iterative implementation ends up being really similar to what we saw with dig.
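Under the same assumptions as before (plain Ruby, hash-only, `key?` standing in for TruffleRuby's internal helpers), an iterative dig_fetch might be sketched like this. The `:identity` key and the empty `IdentityObject` class are hypothetical stand-ins.

```ruby
module Diggable
  # Hypothetical iterative dig_fetch: raises or falls back on the first missing key.
  def self.dig_fetch(obj, keys, &default)
    keys.each do |key|
      unless obj.key?(key)
        return default.call(key) if default

        raise KeyError, "key not found: #{key.inspect}"
      end
      obj = obj[key]
    end
    obj
  end
end

# Hypothetical stand-in for the real default class.
class IdentityObject; end

found = Diggable.dig_fetch({ response: { identity: "customer-123" } }, [:response, :identity])
fallback = Diggable.dig_fetch({}, [:response, :identity]) { IdentityObject.new }
```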

    After our changes we confirm that dig_fetch works. Now we can return to our benchmark tests and see whether our iterative approach has paid off again.

    Benchmarking, Again

    Performance is looking a lot better! dig_fetch is now performing similarly to dig.

    Below you can see the impact of the changes on performance more easily by comparing the iterative and recursive approaches. Our newly implemented iterative approach is much more performant than the existing recursive one, managing to execute ~15.5M times per second for a hash with nine nested keys when it only hit ~2.5M before.

    Line graph of Performance of Hash#dig in TruffleRuby

    Refactoring the Initial Code

    At this point, we’ve come full circle and can finally swap in our proposed change that set us down this path in the first place.

    One more reminder of what our original code looked like.

    And after swapping in our new method, things look much more readable. Our experimental refactor is complete!

    Final Thoughts

    Of course, even though we managed to refactor the code we found using dig_fetch, we cannot actually change the original production code that inspired this post to use it just yet. That’s because the work captured here doesn’t quite get us to the finish line -- we ignored the interoperability of dig and fetch with two other data structures, Arrays and Structs. On top of that, if we actually wanted to add the method to TruffleRuby, we’d also want to make the same change to the standard implementation, MRI, and we would have to convince the Ruby community to adopt the change.

    That said, I’m happy with the results of this little investigation. Even though we didn’t add our dig_fetch method to the language for everyone to use, our investigation did result in real changes to TruffleRuby in the form of drastically improving the performance of the existing dig method. A little curiosity took us a long way.

    Thanks for reading!

    Julie Antunovic is a Development Manager at Shopify. She leads the App Extensions team and has been with Shopify since 2018.

    If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.

    Modelling Developer Infrastructure Teams

    I’ve been managing Developer Infrastructure teams (alternatively known as Developer Acceleration, Developer Productivity, and other such names) for almost a decade now. Developer Infrastructure (which we usually shorten to “Dev Infra”) covers a lot of ground, especially at a company like Shopify that’s invested heavily in the area, from developer environments and continuous integration/continuous deployment (CI/CD) to frameworks, libraries, and various productivity tools.

    When I started managing multiple teams I realized I’d benefit from creating a mental model of how they all fit together. There are a number of advantages to doing this exercise, both for myself and for all team members, manager and individual contributor alike.

    First, a model helps clarify the links and dependencies between the various teams and domains. This in turn allows a more holistic approach to designing systems that mitigates siloed thinking that affects both our users and our developers. Seeing these links also lets us identify where there are gaps in our solutions and rough transitions.

    Second, it helps everyone feel more connected to a larger vision, which is important for engagement. Many people feel more motivated if they can see how their work fits into the big picture.

    There’s no single perfect model. Indeed, it’s helpful to have different models to highlight different relationships. Team structures also change and that can require rethinking connections. I’m going to discuss one way that I thought about my area last year, reflecting the org structure at that time. In fact, constructing this model actually helped me think through other ways of organizing teams and led to us implementing a new structure. Before we get into the model, though, here’s a very brief description of the teams that reported into me last year:

    • Local Environments: The team responsible for the tooling that helps get new and existing projects set up on a local machine (that is, a MacBook Pro). This includes cloning repositories, installing dependencies, and running backing services, amongst various other common tasks.
    • Cloud Environments: A relatively new team that was created to explore development on remote, on-demand systems.
    • Test Infrastructure: They’re in charge of our CI systems, continually improving them and trying new ideas to accommodate Shopify’s growth.
    • Deploys: These folks handle the final steps in the development process: merging commits into our main branches (we’ve outgrown GitHub’s standard process!), validating them on our canary systems, and promoting them out to production.
    • Web Foundations: We’ve got some big front-end codebases and thus a team dedicated to accelerating the development of React-based apps through various tools and libraries.
    • React Native Foundations: Similar to Web Foundations, but focused specifically on standardizing and improving how we build React Native apps.
    • Mobile Tooling: Mobile apps have quite a few differences from web apps, so this team specializes in building tools for our mobile devs.

    The Development Workflow

    Phases and teams, with areas of responsibility, of the standard development pipeline.

    One way to look at the Developer Infrastructure teams is as parts of the development workflow (or pipeline), which can be split into three discrete phases: develop, validate, and deploy.

    The Local Environments, Cloud Environments, Test Infrastructure, and Deploys teams each map to one phase. The scope of these teams remains broad, although the default support is for Ruby on Rails apps. See above for a graphical representation.

    Map of Web Foundations’ responsibilities to development phases


    By contrast, the applications and systems developed and supported by the Mobile Tooling, Web Foundations, and React Native Foundations teams span multiple phases. In the case of Web Foundations, much of this work focuses on the development phase (frameworks, tools, and libraries), but the team also maintains one application that’s executed as part of the validate phase, to monitor bundle sizes.

    Web Foundations builds on the systems supported by the Local and Cloud Environments, Test Infrastructure, and Deploys teams. Their work is complementary, adding specialized tooling for front-end development.

    Map of Mobile Tooling/React Native Foundations’ responsibilities to development phases

    The work of the Mobile Tooling and React Native Foundations teams spans all three phases. In this case, though, as seen in the image above, the Deployment phase is independent from that of the generic workflow, given the very different release process for mobile apps.

    Horizontal and Vertical Integration

    We can further extend the workflow model by borrowing a concept from the business world to look at the relationships in these teams. In a manufacturing industry, horizontal integration means that the different points in the supply chain have specific, often large companies behind them. The producer, supplier, manufacturer, and so on are all separate entities, providing deep specialization in a particular area.

    One could view Local and Cloud Environments, Test Infrastructure, and Deploys as similarly horizontally integrated. The generic development workflow is the supply chain, and each of these teams is responsible for one part of it, that is, one phase of the workflow. Each specializes in the specific problem area involved in that phase by maintaining the relevant systems, implementing workflow optimizations, and scaling up solutions to meet the increasing amount of development activity.

    By contrast, vertical integration involves one company handling multiple parts of the supply chain. IKEA is an example of this model, as they own everything from forests to retail stores. Their entire supply chain specializes in a particular industry (furniture and other housewares), meaning they can take a holistic approach to their business.

    Mobile Tooling, Web Foundations, and React Native Foundations can be seen as similarly vertically integrated. Each is responsible for systems that collectively span two or all three phases of the workflow. As noted, these teams also rely on systems supported by the generic workflow, with their own specific solutions being either built on or sitting adjacent to them. So, they aren’t fully vertically integrated, but instead of being specialized in a phase of the development pipeline, these teams are subject matter experts in the development workflow of a particular technology. They build solutions along the workflow as required when the generic solutions are insufficient on their own.

    Analyzing Our Model

    Now, we can use the idea of a development workflow and the framework of horizontally and vertically integrated teams as a lens to pull together some interesting observations. First let’s look at the commonalities and contrasts.

    The work of each team in Dev Infra generally fits into one or more of the phases of the development workflow. This gives us a good scope for Dev Infra as a whole and helps distinguish us from other teams in our parent team, Accelerate. This in turn allows us to focus by pushing back on work that doesn’t really fit into this model. We made this Dev Infra’s mission statement: “Improving and scaling the develop–validate–deploy cycle for Shopify engineering.”

    An interesting contrast is that the horizontal teams have broad scale, while the vertical teams have broad scope. Our horizontal teams have to support engineering as a whole: virtually every developer interacts with our development environments, test infrastructure, and deploy systems. As a growing company, this means an increasing amount of usage and traffic. On the other side, our vertical teams specialize in smaller segments of the engineering population: those that develop mainly front-end and mobile apps. However, they’re responsible for specific improvements to the entire development workflow for those technologies, hence a broader scope.

    Further to this point, vertical teams have more opportunities for collaboration given their broad scope. However, there are also more situations where product teams go in their own directions to solve specific problems that Dev Infra can’t prioritize at a given moment. Therefore, it’s imperative for us to stay in close contact with product teams to ensure we aren’t duplicating work and to act as long-term stewards for infra projects that outgrow their teams. On the other side, horizontal teams get fewer outside contributions due to how deep and complex the infrastructure is to support our scale. However, there’s more consistency in its use as there are fewer, if any, ways around these systems.

    From Analysis to Action

    As a result of our study, we’ve started to categorize the work we’re doing and plan to do. For any phase in the development pipeline, there are three avenues for development:

    • Concentration: solidifying and improving systems, improving user experience, and incremental or linear scaling
    • Expansion: pushing outwards, identifying new opportunities within the problem domain, and step-change or exponential scaling
    • Interfacing: improving the points of contact between the development phases, both in terms of data flow and user experience, and identifying gaps into which an existing team could expand or for which a new team could be created

    Horizontal and vertical teams will naturally approach development differently: 

    • Horizontal teams have a more clearly defined scope, and hence prioritization can be easier, but impact is limited to a particular area. Interface development is harder because it spans teams.
    • Vertical teams have a much vaguer scope with more possibilities for impact, but determining where we can have the most impact is thus more difficult. Interface improvement can be more straightforward if it’s between pieces owned by that team.

    We also used this analysis to inform the organizational structure. As I mentioned, we made some changes earlier this year within Accelerate. This included starting a Client Foundations team, which is essentially made up of all the vertically integrated, technology-focused teams specializing in front-end and mobile development. Back in Dev Infra, we have the possibility of pulling in teams that currently exist in other organizations if they help us extend our development workflow model and provide new horizontal integrations. We’re starting to experiment with more active collaboration between teams to expand the context the developers have about our entire workflow.

    Finally, we plan to engage in some user research that spans the development workflow. Most of the time any in-depth research we do is at the team level: what repetitive tasks our mobile devs face, what annoys people about our test infrastructure, or how to make our deploy systems more intuitive. Now we have a way to talk about the journey a developer takes from first writing a patch all the way to getting it out into production. This helps us understand how we can make a more holistic solution and provide the smoothest experience to our developers.

    Mark Côté manages Developer Infrastructure at Shopify. He has worked in the software industry for 20 years, as a developer at a number of start ups and later at the Mozilla Corporation, where he went into management. For half of his career he has been involved in software tooling and developer productivity, leading efforts to bring a product-management mindset into the space.

    We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.
