Upgrading MySQL at Shopify

In early September 2021, we retired our last Shopify database virtual machine (VM) that was running Percona Server 5.7.21, marking the complete cutover to 5.7.32. In this post, I’ll share how the Database Platform team performed the most recent MySQL upgrade at Shopify. I’ll talk about some of the roadblocks we encountered during rollback testing, the internal tooling that we built out to aid upgrading and scaling our fleet in general, and our guidelines for approaching upgrades going forward, which we hope will be useful for the rest of the community.

Why Upgrade and Why Now?

We were particularly interested in upgrading due to the replication improvements that would preserve replication parallelism in a multi-tier replication hierarchy via transaction writesets. However, in a general sense, upgrading our version of MySQL was on our minds for a while and the reasons have become more important over time as we’ve grown:

  • We’ve transferred more load to our replicas over time, and without replication improvements, high load could cause replication lag and a poor merchant and buyer experience.
  • Due to our increasing global footprint, to maintain efficiency, our replication topology can be up to four “hops” deep, which increases the importance of our replication performance.
  • Without replication improvements, in times of high load such as Black Friday/Cyber Monday (BFCM) and flash sales, there’s a greater likelihood of replication lag that in turn heightens the risk to merchants’ data availability in the event of a writer failure.
  • It’s industry best practice to stay current with all software dependencies to receive security and stability patches.
  • We expect to eventually upgrade to MySQL 8.0. Building the upgrade tooling required for this minor upgrade helps us prepare for that.

To the last point, one thing we definitely wanted to achieve as part of this upgrade was, in the words of my colleague Akshay, “Make MySQL upgrades at Shopify a checklist of tasks going forward, as opposed to a full-fledged project.” Ideally, by the end of the project we’d have documentation with steps for performing an upgrade that anyone on the Database Platform team can follow, and a process that takes on the order of weeks, rather than months, to complete.

Database Infrastructure at Shopify

Core

Shopify's Core database infrastructure is horizontally sharded by shop and spread across hundreds of shards, each consisting of a writer and five or more replicas. These shards run on Google Compute Engine virtual machines (VMs) running the Percona Server fork of MySQL, and our backup system makes use of Google Cloud’s persistent disk snapshots. While we run the upstream versions of Percona Server, we maintain an internal fork and build pipeline that allows us to patch it as necessary.

Mason

Without automation, there’s a non-trivial amount of toil involved in just the day-to-day operation of our VM fleet due to its sheer size. VMs can go down for many reasons, including failed GCP live migrations, zone outages, or just run-of-the-mill VM failures. Mason was developed to respond to a VM going down by spinning up a replacement, a task far better suited to a robot than a human, especially in the middle of the night.

Mason was developed as a self-healing service for our VM-based databases, born out of a Shopify Hack Days project in late 2019.

Healing Isn’t All That’s Needed

Shopify’s query workload can differ vastly from shard to shard, which necessitates maintaining different configurations for each. Our minimal configuration is six instances: three in Google Cloud’s us-east1 region and three in us-central1. However, each shard’s configuration can differ in other ways:

  • There may be additional replicas to accommodate higher read workloads or to provide replicas in other locations globally.
  • The VMs for the replicas may have a different number of cores or memory to accommodate differing workloads.

With all of this in mind, you can probably imagine how it would be desirable to have automation built around maintaining these differences—without it, a good chunk of the manual toil involved in on-call tasks would be simply provisioning VMs, which isn’t an enviable set of responsibilities.

Using Mason to Upgrade MySQL

Upgrades at our scale are extremely high effort, as our VM fleet currently numbers in the thousands. We decided that building additional functionality onto Mason would be the way forward to automate our MySQL upgrade, and called it the Declarative Database Topologies project. Where Mason was previously a solely reactive tool that only maintained a hardcoded default configuration, we envisioned its next iteration as a proactive tool: one that allows us to define a per-shard topology and does the provisioning work to reconcile the current state with the desired state. Doing this would allow us to automate the provisioning of upgraded VMs, removing much of the toil involved in upgrading a large fleet, and to automate scale-up provisioning for events such as BFCM or other high-traffic occurrences.

The Project Plan

We had approximately eight months, before BFCM preparations would begin, to achieve the following:

  • pick a new version of MySQL
  • benchmark and test the new version for any regressions or bugs
  • perform rollback testing and create a rollback plan so we could safely downgrade if necessary
  • finally, perform the actual upgrade

At the same time, we also needed to evolve Mason to:

  • increase its stability
  • move from a global hardcoded configuration to a dynamic per-shard configuration
  • have it respond to scale-ups when the configuration changed
  • have it care about Chef configuration, too
  • … do all of that safely.

One of the first things we had to do was pick a version of Percona Server. We wanted to maximize the gains that we would get from an upgrade while minimizing our risk. This led us to choose the highest minor version of Percona Server 5.7, which was 5.7.32 at the start of the project. By doing so, we benefited from the bug and security fixes made since we last upgraded; in the words of one of our directors, “incidents that never happened” because we upgraded. At the same time, we avoided some of the larger risks associated with major version upgrades.

Once we had settled on a version, we made changes in Chef to have it handle an in-place upgrade. Essentially, we created a new Chef role with the existing provisioning code but with the new version specified for the MySQL server version variable and modified the code so that the following happens:

  1. Restore a backup taken from a 5.7.21 VM onto a VM with 5.7.32 installed.
  2. Allow the VM and MySQL server process to start up normally.
  3. Check the contents of the mysql_upgrade_info file in the data directory. If the version differs from that of the installed MySQL server, run mysql_upgrade (via a wrapper script that accounts for the unexpected behaviour of mysql_upgrade exiting with return code 2, instead of the typical 0, when an upgrade wasn’t required); a rough sketch of this logic follows the list.
  4. Perform the necessary replication configuration and proceed with the rest of the MySQL server startup.
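
For illustration, here’s a minimal sketch of that check-and-upgrade logic. Our real implementation lives in Chef, so the file paths, commands, and structure below are assumptions rather than the actual code.

```javascript
// Rough sketch of the mysql_upgrade_info check and mysql_upgrade wrapper
// described in step 3 above. Paths and version parsing are illustrative.
const fs = require('fs');
const {execFileSync, spawnSync} = require('child_process');

const DATA_DIR = '/var/lib/mysql';

// Version recorded by the last mysql_upgrade run against this data directory.
const dataVersion = fs
  .readFileSync(`${DATA_DIR}/mysql_upgrade_info`, 'utf8')
  .trim();

// Version of the installed server binaries, e.g. "mysqld  Ver 5.7.32-35 ...".
const serverVersion = execFileSync('mysqld', ['--version'], {encoding: 'utf8'});

if (!serverVersion.includes(dataVersion)) {
  // mysql_upgrade exits with 2 when no upgrade was required; treat that the
  // same as a successful run (exit code 0).
  const {status} = spawnSync('mysql_upgrade', [], {stdio: 'inherit'});
  if (status !== 0 && status !== 2) {
    throw new Error(`mysql_upgrade failed with exit code ${status}`);
  }
}
```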

After this work was completed, all we had to do to provision an upgraded version was to specify that the new VM be built with the new Chef role.

Preparing for the Upgrade

Performing the upgrade is the easy part, operationally. You can spin up an instance with a backup from the old version, let mysql_upgrade do its thing, have it join the existing replication topology, optionally take backups from this instance with the newer version, populate the rest of the topology, and then perform a takeover. Making sure the newer version performs the way we expect and can be safely rolled back to the old version, however, is the tricky part.

During our benchmarking tests, we didn’t find anything anomalous, performance-wise. However, when testing the downgrade from 5.7.32 back to 5.7.21, we found that the MySQL server wouldn’t start up properly: the error logs complained about unexpected column lengths in the InnoDB statistics tables, and startup then stalled on recalculating transient statistics.

When we allowed the calculation of transient stats at startup to run to completion, it took over a day due to a lengthy table analyze process on some of our shards—not great if we needed to roll back more urgently than that.

A cursory look at the Percona Server source code revealed that the table_name column in the innodb_index_stats and innodb_table_stats tables changed from VARCHAR(64) in 5.7.21 to VARCHAR(199) in 5.7.32. We patched mysql_system_tables_fix.sql in our internal Percona Server fork so that the column lengths were set back to the values that 5.7.21 expected, and re-tested the rollback. This time we didn’t see the errors about the column lengths; however, we still saw the analyze table process causing full table rebuilds, again leading to an unacceptable startup time. It became clear to us that by fixing these column lengths we had merely addressed a symptom of the problem.

At this point, while investigating our options, it occurred to us that one reason this analyze table process might be happening is that we run ALTER TABLE commands as part of MySQL server startup: a startup script sets a minimum AUTO_INCREMENT value on tables (this is needed because the auto_increment counter isn’t persisted across restarts, a long-standing bug that’s addressed in MySQL 8.0).

Investigating the Bug

Once we had our hypothesis, we started to test it. This culminated in a group debugging session where a few members of our team found that the following steps reproduced the bug that resulted in the full table rebuild:

  1. On 5.7.32: A backup previously taken from 5.7.21 is restored.
  2. On 5.7.32: An ALTER TABLE is run on a table that should just be an instantaneous metadata change, for example, ALTER TABLE t AUTO_INCREMENT=n. The table is changed instantaneously, as expected.
  3. On 5.7.32: A backup is taken.
  4. On 5.7.21: The backup taken from 5.7.32 in the previous step is restored.
  5. On 5.7.21: The MySQL server is started up, and mysql_upgrade performs the in-place downgrade.
  6. On 5.7.21: A similar ALTER TABLE statement to the one in step 2 is performed. A full rebuild of the table is performed, unexpectedly and unnecessarily.

Stepping through the above steps with the GNU Debugger (GDB), we found the place in the MySQL server source code where it incorrectly concludes that the indexes have changed in a way that requires a table rebuild: the has_index_def_changed function in sql/sql_table.cc (as of Percona Server 5.7.21).

While inspecting in GDB, we saw that the key flags stored for the old version of the table (table_key->flags) don’t match those of the new version of the table (new_key->flags), despite the fact that only a metadata change was applied, which is what makes has_index_def_changed report that the index definition changed.

Digging deeper, we found past attempts to fix this bug. In the 5.7.23 release notes, there’s the following:

“For attempts to increase the length of a VARCHAR column of an InnoDB table using ALTER TABLE with the INPLACE algorithm, the attempt failed if the column was indexed. If an index size exceeded the InnoDB limit of 767 bytes for COMPACT or REDUNDANT row format, CREATE TABLE and ALTER TABLE did not report an error (in strict SQL mode) or a warning (in nonstrict mode). (Bug #26848813)”

A fix was merged for the bug; however, we saw that there was a second attempt to fix this behaviour. In the 5.7.27 release notes, we see:

“For InnoDB tables that contained an index on a VARCHAR column and were created prior to MySQL 5.7.23, some simple ALTER TABLE statements that should have been done in place were performed with a table rebuild after an upgrade to MySQL 5.7.23 or higher. (Bug #29375764, Bug #94383)”

A fix was merged for this bug as well, but it didn’t fully address the issue of some ALTER TABLE statements that should be simple metadata changes instead leading to a full table rebuild.

My colleague Akshay filed a bug report against this behaviour; however, the included patch ultimately wasn’t accepted by the MySQL team. In order to safely upgrade past this bug, we still needed MySQL to behave in a reasonable way on downgrade, so we ended up patching Percona Server in our internal fork. We tested our patched version successfully in our final rollback tests, unblocking our upgrade.

What are “Packed Keys” Anyway?

The PACK_KEYS feature of the MyISAM storage engine allows keys to be compressed, thereby making indexes much smaller and improving performance. This feature isn’t supported by the InnoDB storage engine as its index layout and expectations are completely different. In MyISAM, when indexed VARCHAR columns are expanded past eight bytes, thus converting from unpacked keys to packed keys, it (rightfully) triggers an index rebuild.

However, we can see that in the first attempt to fix the bug in 5.7.23, the same type of change triggered the same behaviour in InnoDB, even though packed keys aren’t supported. To remedy this, from 5.7.23 onwards, the HA_PACK_KEY and HA_BINARY_PACK_KEY flags weren’t set if the storage engine didn’t support them.

That, however, meant that if a table was created prior to 5.7.23, these flags were unexpectedly set even on storage engines that didn’t support them. So, upon upgrade to 5.7.23 or higher, any metadata-only ALTER TABLE executed on such an InnoDB table incorrectly concluded that a full index rebuild was necessary. This brings us to the second attempt to fix the issue, in which the flags were removed entirely if the storage engine didn’t support them. Unfortunately, that second fix didn’t account for the case where the flags had changed but the difference should be ignored when evaluating whether the indexes need to be rebuilt in earlier versions, and that’s what we addressed in our proposed patch: during a downgrade, if the old version of the table (from 5.7.32) doesn’t specify the flag but the new version of the table (in 5.7.21) does, we bypass the index rebuild.

Meanwhile, in the Mason Project… 

While all of this rollback testing work was in progress, another part of the team was hard at work shipping new features in Mason to let it handle the upgrades. These were some of the requirements we had that guided the project work:

  • The creation of a “priority” lane—self-healing should always take precedence over a scale-up related provisioning request.
  • We needed to throttle the scale-up provisioning queue to limit how much work was done simultaneously.
  • Feature flags were required to limit which shards the scale-up feature was released to, so that we could control which shards were provisioned and roll out the new features carefully.
  • A dry-run mode for scale-up provisioning was necessary to allow us to test these features without making changes to the production systems immediately.

Underlying all of this was an abundance of caution in shipping the new features. Because of our large fleet size, we didn’t want to risk provisioning a lot of VMs we didn’t need, or VMs in the wrong configuration; either mistake would cost us, whether in GCP resource usage or in engineering time spent decommissioning resources.

In the initial stages of the project, stabilizing the service was important since it played a critical role in maintaining our MySQL topology. Over time, it had turned into a key component of our infrastructure that significantly improved our on-call quality of life. Some of the early tasks were simply about making it a first-class citizen among the services we owned: we stabilized the staging environment it was deployed into, created new monitoring and improved what already existed, and started using it to emit metrics to Datadog indicating when the topology was underprovisioned (that is, cases where Mason failed to do its job).

Another challenge was that Mason itself talks to many disparate components in our infrastructure: the GCP API, Chef, the Kubernetes API, ZooKeeper, Orchestrator, as well as the database VMs themselves. It was often a challenge to anticipate failure scenarios—often, the failure experienced was completely new and wouldn’t have been caught in existing tests. This is still an ongoing challenge, and one that we hope to address through improved integration testing.

Later on, as we onboarded new people to the project and started introducing more features, it also became obvious that the application was quite brittle in its current state; adding new features became more and more difficult due to the existing complexity, especially when they were being worked on concurrently. It brought to the forefront the importance of breaking down streams of work that have the potential to become hard blockers, and highlighted how much a well-designed codebase can decrease the chances of this happening.

We faced many challenges, but ultimately shipped the project on time. Now that the project is complete, we’re dedicating time to improving the codebase so it’s more maintainable and developer-friendly.

The Upgrade Itself

Throughout the process of rollback testing, we had already been running 5.7.32 for a few months on several shards reserved for canary testing. A few of those shards are load tested on a regular basis, so we were reasonably confident that this, along with our own benchmarking tests, made it ready for our production workload.

Next, we created a rollback plan in case the new version was unstable in production for unforeseen reasons. One of the early suggestions for risk mitigation was to maintain a 5.7.21 VM per-shard and continue to take backups from them. However, that would have been operationally complex and also would have necessitated the creation of more tooling and monitoring to make sure that we always have 5.7.21 VMs running for each shard (rather toilsome when the number of shards reaches the hundreds in a fleet). Ultimately, we decided against this plan, especially considering the fact that we were confident that we could roll back to our patched build of Percona Server, if we had to.

Our intention was to do everything we could to de-risk the upgrade by performing extensive rollback testing, but ultimately we preferred to fix forward whenever possible. That is, the option to roll back was expected to be taken only as a last resort.

We started provisioning new VMs with 5.7.32 in earnest on August 25th using Mason, after our tooling and rollback plan were in place. We decided to stagger the upgrades by creating several batches of shards. This allowed the upgraded shards to “bake” and not endanger the entire fleet in the event of an unforeseen circumstance. We also didn’t want to provision all the new VMs at once due to the amount of resource churn (at the petabyte-scale) and pressure it would put on Google Cloud.

On September 7th, the final shards were completed, marking the end of the upgrade project.

What Did We Take Away from This Upgrade? 

This upgrade project highlighted the importance of rollback testing. Without the extensive testing that we performed, we would have never known that there was a critical bug blocking a potential rollback. Even though needing to rebuild the fleet with the old version to downgrade would have been toilsome and undesirable, patching 5.7.21 gave us the confidence to proceed with the upgrade, knowing that we had the option to safely downgrade if it became necessary.

Mason, the tooling that we relied on, also became more important over time. In the past, Mason was considered a lower-tier application, and simply turning it off was a band-aid solution when it behaved in unexpected ways; fixing it often wasn’t a priority when bugs were encountered. However, as time has gone by, we’ve recognized how large a role it plays in toil mitigation and in maintaining healthy on-call expectations, especially as the size of our fleet has grown. We’ve invested more time and resources into it by improving test coverage and refactoring key parts of the codebase to reduce complexity and improve readability, and we have future plans to improve the local development environment and streamline its deployment pipeline.

Finally, investing in the documentation and easy repeatability of upgrades has been a big win for Shopify and for our team. When we first started planning for this upgrade, finding out how upgrades were done in the past was a bit of a scavenger hunt and required a lot of institutional knowledge. By developing guidelines and documentation, we paved the way for future upgrades to be done faster, more safely, and more efficiently. Rather than an intense and manual context-gathering process every time that pays no future dividends, we can now treat a MySQL upgrade as simply a series of guidelines to follow using our existing tooling.

Next up: MySQL 8!

Yi Qing Sim is a Senior Production Engineer and brings nearly a decade of software development and site reliability engineering experience to the Database Backend team, where she primarily works on Shopify’s core database infrastructure.



Remote Rendering: Shopify’s Take on Extensible UI

Shopify is one of the world's largest e-commerce platforms. With over 1.7 million merchants worldwide, we support an increasingly diverse set of use cases, and we wouldn't be successful at it without our developer community. Developers build apps that add immense value to Shopify and its merchants, and solve problems such as marketing automation, sales channel integrations, and product sourcing.

In this post, we will take a deep dive into the latest generation of our technology that allows developers to extend Shopify’s UI. With this technology, developers can better integrate with the Shopify platform and offer native experiences and rich interactions that fit into users' natural workflow on the platform.

3rd party extension adding a post-purchase page directly into the Shopify checkout

To put the technical challenges into context, it's important to understand our main objectives and requirements:

  • The user experience of 3rd party extensions must be consistent with Shopify's native content in terms of look & feel, performance, and accessibility features.
  • Developers should be able to extend Shopify using standard technologies they are already familiar with.
  • Shopify needs to run extensions in a secure and reliable manner, and prevent them from negatively impacting the platform (naively or maliciously).
  • Extensions should offer the same delightful experience across all supported platforms (web, iOS, Android).

With these requirements in mind, it's time to peel the onion.

Remote Rendering

At the heart of our solution is a technique we call remote rendering. With remote rendering, we separate the code that defines the UI from the code that renders it, and have the two communicate via message passing. This technique fits our use case very well because extensions (code that defines UI) are typically 3rd party code that needs to run in a restricted sandbox environment, while the host (code that renders UI) is part of the main application.

Separating extensions (3rd party code) from host (1st party code)

Communication between an extension and a host is done via a MessageChannel. Using message passing for all communication means that hosts and extensions are completely agnostic of each other’s implementation and can be implemented using different languages. In fact, at Shopify, we have implemented hosts in JavaScript, Kotlin, and Swift to provide cross-platform support.

The remote-ui Library

Remote rendering gives us the flexibility we need, but it also introduces non-trivial technical challenges such as defining an efficient message-passing protocol, implementing function calls using message passing (aka remote procedure call), and applying UI updates in a performant way. These challenges (and more) are tackled by remote-ui, an open-source library developed at Shopify.

Let's take a closer look at some of the fundamental building blocks that remote-ui offers and how these building blocks fit together.

RPC

At the lower level, the @remote-ui/rpc package provides a powerful remote procedure call (RPC) abstraction. The key feature of this RPC layer is the ability for functions to be passed (and called) across a postMessage interface, supporting the common need for passing event callbacks.

Making remote procedure calls using endpoint.call (script1.js) and endpoint.expose (script2.js)

@remote-ui/rpc introduces the concept of an endpoint for exposing functions and calling them remotely. Under the hood, the library uses Promise and Proxy objects to abstract away the details of the underlying message-passing protocol.
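
As a rough sketch of what those two scripts look like (based on the @remote-ui/rpc API; the worker setup and function names here are illustrative):

```javascript
// script2.js — runs inside a web worker and exposes a function over RPC
import {createEndpoint} from '@remote-ui/rpc';

const endpoint = createEndpoint(self);

endpoint.expose({
  async greet(name) {
    return `Hello, ${name}!`;
  },
});
```

```javascript
// script1.js — runs on the host and calls the exposed function
import {createEndpoint, fromWebWorker} from '@remote-ui/rpc';

const worker = new Worker(new URL('./script2.js', import.meta.url), {type: 'module'});
const endpoint = createEndpoint(fromWebWorker(worker));

// The call is proxied over postMessage and resolves as a promise.
endpoint.call.greet('Shopify').then((greeting) => console.log(greeting));
```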

It's also worth mentioning that remote-ui’s RPC has very smart automatic memory management. This feature is especially useful when rendering UI, since properties (such as event handlers) can be automatically retained and released as UI component mount and unmount. 

Remote Root

After RPC, the next fundamental building block is the RemoteRoot which provides a familiar DOM-like API for defining and manipulating a UI component tree. Under the hood, RemoteRoot uses RPC to serialize UI updates as JSON messages and send them to the host.

UI is defined with a DOM-like API and gets converted to a JSON message
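
A minimal sketch of what that looks like in practice (the component name and the channel wiring are illustrative; see the @remote-ui/core docs for the full API):

```javascript
// extension.js — defines UI with a DOM-like API; updates are serialized
// as JSON messages and sent over `channel` to the host.
import {createRemoteRoot} from '@remote-ui/core';

export function render(channel) {
  const root = createRemoteRoot(channel, {components: ['Button']});

  const button = root.createComponent('Button', {
    onPress: () => console.log('Pressed!'),
  });
  button.appendChild(root.createText('Press me'));

  root.appendChild(button);
  root.mount();
}
```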

For more details on the implementation of RemoteRoot, see the documentation and source code of the @remote-ui/core package.

Remote Receiver

The "opposite side" of a RemoteRoot is a RemoteReceiver. It receives UI updates (JSON messages sent from a remote root) and reconstructs the remote component tree locally. The remote component tree can then be rendered using native components.


Basic example setting up a RemoteRoot and RemoteReceiver to work together (host.jsx and extension.js)
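
In rough form, that setup looks something like this (shown as one combined snippet for brevity; the createRemoteReceiver helper name may differ between remote-ui versions, and in practice the remote root lives in the sandboxed extension with the channel passed across the RPC boundary):

```javascript
import {createRemoteRoot, createRemoteReceiver} from '@remote-ui/core';

// Host side: the receiver rebuilds the component tree from JSON updates.
const receiver = createRemoteReceiver();

// Extension side: the remote root streams its updates to `receiver.receive`.
const root = createRemoteRoot(receiver.receive, {components: ['Button']});
root.appendChild(root.createComponent('Button', {onPress: () => {}}));
root.mount();
```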

With RemoteRoot and RemoteReceiver we are very close to having an implementation of the remote rendering pattern. Extensions can define the UI as a remote tree, and that tree gets reconstructed on the host. The only missing thing is for the host to traverse the tree and render it using native UI components.

DOM Receiver

remote-ui provides a number of packages that make it easy to convert a remote component tree to a native component tree. For example, a DomReceiver can be initialized with minimal configuration and render a remote root into the DOM. It abstracts away the underlying details of traversing the tree, converting remote components to DOM elements, and attaching event handlers.
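
The configuration described in the next paragraph looks roughly like this; the option names below are assumptions rather than the library’s confirmed API (the standalone example linked further down shows the real thing):

```javascript
// Hypothetical sketch: option names are assumptions, not the exact
// @remote-ui/dom API.
import {DomReceiver} from '@remote-ui/dom';

const receiver = new DomReceiver({
  // Where to render the remote tree.
  bind: document.getElementById('container'),
  // Remote component name -> DOM element it's rendered as.
  customElement: {
    Button: 'button',
    LineBreak: 'br',
  },
});
```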

In the snippet above, we create a receiver that will render the remote tree inside a DOM element with the id container. The receiver will convert Button and LineBreak remote components to button and br DOM elements, respectively. It will also automatically convert any prop starting with on into an event listener.

For more details, check out this complete standalone example in the remote-ui repo.

Integration with React

The DomReceiver provides a convenient way for a host to map between remote components and their native implementations, but it’s not a great fit for our use case at Shopify. Our frontend application is built using React, so we need a receiver that manipulates React components (instead of manipulating DOM elements directly).

Luckily, the @remote-ui/react package has everything we need: a receiver (that receives UI updates from the remote root), a controller (that maps remote components to their native implementations), and the RemoteRenderer React component to hook them up.

There's nothing special about the component implementations passed to the controller; they are just regular React components:
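
For example (a rough sketch; the createController helper comes from @remote-ui/react’s host entry point, and the Button implementation itself is illustrative):

```jsx
import React from 'react';
import {createController} from '@remote-ui/react/host';

// A plain React component that implements the remote "Button".
function Button({onPress, children}) {
  return (
    <button type="button" onClick={() => onPress()}>
      {children}
    </button>
  );
}

// Maps remote component names to their native (React) implementations.
const controller = createController({Button});
```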

However, there’s a part of the host code that’s worth a closer look: the place where the 3rd party script is run in a sandbox environment, with the receiver acting as the communication channel.
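
A rough sketch of that hookup, assuming `sandbox` is the RPC endpoint for the sandbox worker (fleshed out in the next section):

```jsx
import React, {useEffect, useMemo} from 'react';
import {createRemoteReceiver, RemoteRenderer} from '@remote-ui/react/host';

export function Extension({sandbox, script, api, controller}) {
  const receiver = useMemo(() => createRemoteReceiver(), []);

  useEffect(() => {
    // Run 3rd party script in a sandbox environment
    // with the receiver as a communication channel.
    sandbox.call
      .load(script)
      .then(() => sandbox.call.render(receiver.receive, api));
  }, [sandbox, script, api, receiver]);

  return <RemoteRenderer receiver={receiver} controller={controller} />;
}
```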

Sandboxing

When we introduced the concept of remote rendering, our high-level diagram included only two boxes, extension and host. In practice, the diagram is slightly more complex.

The sandbox is an additional layer of indirection between the host and the extension

The sandbox, an additional layer of indirection between the host and the extension, provides platform developers with more control. The sandbox code runs in an isolated environment (such as a web worker) and loads extensions in a safe and secure manner. In addition to that, by keeping all boilerplate code as part of the sandbox, extension developers get a simpler interface to implement.

Let's look at a simple sandbox implementation that allows us to run 3rd party code and acts as “the glue” between 3rd party extensions and our host.
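
Something along these lines (a minimal sketch; error handling, bundling details, and the registration global are illustrative):

```javascript
// sandbox.js — runs inside a web worker and acts as the glue between
// 3rd party extensions and the host.
import {createEndpoint} from '@remote-ui/rpc';

let renderCallback;

// Extensions call this global to register their render callback when loaded.
self.shopify = {
  extend(callback) {
    renderCallback = callback;
  },
};

createEndpoint(self).expose({
  // Load 3rd party extension code from an external URL.
  load(scriptUrl) {
    importScripts(scriptUrl);
  },
  // Render the loaded extension: hand it a channel for UI updates and an
  // `api` object with whatever native functionality the host wants to expose.
  render(remoteChannel, api) {
    return renderCallback(remoteChannel, api);
  },
});
```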

The sandbox allows a host to load extension code from an external URL. When the extension is loaded, it will register itself as a callback function. After the extension finishes loading, the host can render it (that is, call the registered callback).

Arguments passed to the render function (from the host) provide it with everything it needs. remoteChannel is used for communicating UI updates with the host, and api is an arbitrary object containing any native functionality that the host wants to make available to the extension.

Let's see how a host can use this sandbox:
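
Roughly like this (helper names and the extension URL are illustrative):

```javascript
import {createEndpoint, fromWebWorker} from '@remote-ui/rpc';
import {createRemoteReceiver} from '@remote-ui/core';

export async function mountExtension() {
  const worker = new Worker(new URL('./sandbox.js', import.meta.url));
  const sandbox = createEndpoint(fromWebWorker(worker));
  const receiver = createRemoteReceiver();

  await sandbox.call.load('https://extensions.example.com/my-extension.js');

  // `receiver.receive` is the UI-update channel; `api` is the native
  // functionality this host chooses to make available to the extension.
  await sandbox.call.render(receiver.receive, {
    setTitle(title) {
      document.title = title;
    },
  });

  return receiver;
}
```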

In the code snippet above, the host makes a setTitle function available for the extension to use. Here is what the corresponding extension script might look like:
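
For example (a sketch that matches the hypothetical registration global used in the sandbox sketch above):

```javascript
// extension.js — a 3rd party extension; it only knows about the `api`
// object and the remote channel that the host passes in.
import {createRemoteRoot} from '@remote-ui/core';

self.shopify.extend((remoteChannel, api) => {
  api.setTitle('Hello from an extension');

  const root = createRemoteRoot(remoteChannel, {components: ['Button']});
  const button = root.createComponent('Button', {
    onPress: () => api.setTitle('Button pressed!'),
  });
  button.appendChild(root.createText('Press me'));
  root.appendChild(button);
  root.mount();
});
```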

Notice that 3rd party extension code isn't aware of any underlying aspects of RPC. It only needs to know that the api (that the host will pass) contains a setTitle function.

Implementing a Production Sandbox

The implementation above can give you a good sense of our architecture. For the sake of simplicity, we omitted details such as error handling and support for registering multiple extension callbacks.

In addition to that, our production sandbox restricts the JavaScript environment where untrusted code runs. Some globals (such as importScripts) are made unavailable and others are replaced with safer versions (such as fetch, which is restricted to specific domains). Also, the sandbox script itself is loaded from a separate domain so that the browser provides extra security constraints.

Finally, to have cross-platform support, we implemented our sandbox on three different platforms using web workers (web), web views (Android), and JavaScriptCore (iOS).

What’s Next?

The technology we presented in this blog post is relatively new and is currently used to power two types of extensions, product subscriptions and post-purchase, in two different platform areas.

We are truly excited about the potential we’re unlocking, and we also know that there’s a lot of work ahead of us. Our plans include improving the experience for 3rd party developers, supporting new UI patterns as they come up, and making more areas of the platform extensible.

If you are interested in learning more, you might want to check out the remote-ui comprehensive example and this recent React Summit talk.

Special thanks to Chris Sauve, Elana Kopelevich, James Woo, and Trish Ta for their contribution to this blog post.

Joey Freund is a manager on the core extensibility team, focusing on building tools that let Shopify developers extend our platform to make it a perfect fit for every merchant.



Building an App Clip with React Native

When the App Clip was introduced in iOS 14, we immediately realized that it could be a big opportunity for the Shop app. Because an App Clip is designed to be a lightweight version of an app that users can download on the fly, we wanted to investigate what it could mean for us. Being able to instantly show users the power of the Shop app, without having them download it from the App Store and go through onboarding, was something we thought could have huge growth potential.

One of the key features, and restrictions, of an App Clip is the size limitation. To make things even more interesting, we wanted to build it in React Native, something that, to our knowledge, had never been done at this scale before.

Being the first to build an App Clip in React Native that was going to be surfaced to millions of users each day proved to be a challenging task.

What’s an App Clip?

An App Clip is a miniature version of an app that’s meant to be lightweight and downloadable “on the go.” To provide a smooth download experience, the App Clip can’t exceed 10MB in size. For comparison, the iOS Shop app is 51MB.

An App Clip can’t be downloaded from the App Store; it can only be “invoked.” An invocation means that a user performs an action that opens the App Clip on their phone: scanning a QR code or an NFC tag, clicking a link in the Messages app, or tapping a Smart App Banner on a webpage. After the invocation, iOS displays a prompt asking the user to open the App Clip while its binary is downloaded in the background, which lets it launch almost instantly. The invocation URL is passed to the App Clip, which enables you to provide a contextual experience for the user.

What Are We Trying to Solve?

The Shop app helps users track all of their packages in one place with ease. When a buyer installs the app, their order is automatically imported, and they’re kept up to date about its status without having to ask the seller.

However, we noticed a big drop-off of users in the funnel between the “Thank you” page and opening the app. Despite the Shop app having a 4.8 star rating, the few added steps of going through an App Store meant some buyers chose not to complete the process. The App Clip would solve all of this.

When the user landed on the “Thank you” page on their computer and invoked the App Clip by scanning a QR code, or for mobile checkouts by simply tapping the Open button, they would instantly see their order tracked. No App Store, no onboarding, just straight into the order details with the option to receive push notifications for the whole package journey.

Why React Native?

React Native apps aren’t famous for being small, so we knew building an App Clip below the 10MB limit would pose some interesting challenges. However, as one of the most popular apps on the app stores and champions of React Native, we really wanted to see if it was possible.

Since the Shop app is built in React Native, all our developers could contribute to the App Clip, not just Swift developers, and we would potentially be able to maintain code sharing and feature parity with the App Clip as we do across Android and iOS.

In short, it was an interesting challenge that aligned with our technology choices and our values about building reusable systems designed for the long-term.

Building a Proof of Concept–Failing Fast

Since the App Clip was a very new piece of technology, there was a huge list of unknowns. We weren’t sure if it was going to be possible to build it with React Native and go below the 10MB limit. So we decided to set up a technical plan where if we failed, we would fail fast.

The plan looked something like this:

  1. Build a “Hello World” App Clip in React Native and determine its size
  2. Build a very scrappy, not even functional, version of the actual App Clip, containing all the code and dependencies we estimated we would need and determine its size
  3. Clean up the code, make everything work

We also wanted to fail fast product-wise. App Clips are a brand new technology that few people have been exposed to. We weren’t sure if our App Clip would benefit our users, so our goal was to get an App Clip out for testing, and get it out fast. If it proved to be successful, we would go back and iterate.

Hello World

When we started building the App Clip, there were a lot of unknowns. So to determine if this was even possible, we started off by creating a “Hello World” App Clip using just React Native’s <View /> and <Text /> components.

The “Hello World” App Clip weighed in at a staggering 28MB. How could a barebones App Clip be this big? We investigated and realized that the App Clip was including all the native dependencies that the Shop app used, even though it only needed a subset of the React Native ones. We realized that we had to explicitly define exactly which native dependencies the App Clip needed in the Podfile.

Defining the dependencies was done by looking through React Native’s node_modules/react-native/scripts/react_native_pods to determine the bare minimum native dependencies that React Native needed. After determining the list, we calculated the App Clip size again. The result was 4.3MB. This was good news, but we still didn’t know if adding all the features we wanted would push us beyond the 10MB limit.

Building a Scrappy Version

Building an App Clip with React Native is almost identical to building a React Native app, with one big difference: we need to explicitly define the App Clip’s dependencies in the Podfile. Auto-linking wouldn’t work in this case, since it would scan all the installed packages for the ones compatible with auto-linking and add them; instead, we needed to cherry-pick only the pods used by the App Clip.

The process was pretty straightforward: add a dependency in a React component, and if it had a native dependency, add it to the “Shop App Clip” target in the Podfile. But the consequences of this would be quite substantial later on.

So the baseline size was 4.3MB; now it was time to start adding the functionality we needed. Since we were still exploring the design in this phase, we didn’t know exactly what the end result would be (other than displaying information about the user's order), but we could make some assumptions. For one, we wanted to share as much code with the app as possible. The Shop app has a very robust UI library that we wanted to leverage, as well as a lot of business logic that handles user and order creation. Secondly, we knew that we needed basic functionality like:

  • Network calls to our GraphQL service
  • Error reporting
  • Push notifications

Since we only wanted to determine the build size, and in the spirit of failing fast, we implemented these features without them even working. The code was added, as well as the dependencies, but the App Clip wasn’t functional at all.

We calculated the App Clip size once again, and the result was 6.5MB. Even though it was a scrappy implementation to say the least, and there were still quite a few unknowns regarding the functionality, we knew that building it in React Native was theoretically possible and something we wanted to pursue.

Building the App Clip

We knew that building our App Clip with React Native was possible, our proof of concept was 6.5MB, giving us some leeway for unknowns. And with a React Native App Clip there sure were a lot of unknowns. Will sharing code between the app and the App Clip affect its size or cause any other issues? What do we do if the App Clip requires a dependency that pushes us over the 10MB limit?

Technology Drives Design

Given the very rigid constraints, we decided that unlike most projects, where the design leads the technology, we would approach this from the opposite direction: while developing the App Clip, the technology would drive the design. If something caused us to go over, or close to, the 10MB limit, we would go back to the drawing board and find alternative solutions.

Code Sharing Between Shop App and App Clip

With the App Clip, we wanted to give the user a quick overview of their order and the ability to receive shipping updates through push notifications. We were heavily inspired by the order view in Shop app, and the final App Clip design was a reorganized version of that.

App Clip versus Shop App

The Shop app is structured to share as much code as possible, and we wanted to incorporate that in the App Clip. Sharing code between the two makes sense, especially when the App Clip had similar functionality as the order view in the app.

Our first exploration was to see if it was viable to share all the code in the order view between the app and the App Clip, and modify the layout with props passed from the App Clip.

App Clip and Shop App share all the code in the <OrderView /> component

We quickly realized this wasn’t viable. For one, it would add too much complexity to the order view, but mainly, any change to the order view would affect the App Clip. If a developer adds a feature to the order view, with a big dependency, the 10MB App Clip limit could be at risk.

For a small development team, it might have been a valid approach, but at our scale we couldn’t. The added responsibility every developer would have for the App Clip’s size limit while making changes to the app’s main order view would go against our values around autonomy.

We then considered building our own version of the order view in the App Clip, but sharing its subcomponents. This could be a viable compromise where all the logic-heavy code would live in the <OrderView /> but the simple presentational components could still be shared.

App Clip and Shop App share subcomponents of the <OrderView /> component

The first component we wanted to import into the App Clip was <ProductRow />; its job is to display the product title, price, and image:

<ProductRow /> displaying product title, price and image

The code for this component looks like this (simplified):
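
Something like the following (a simplified sketch; the import path, Restyle props, and prop names are assumptions):

```jsx
import React from 'react';
// Shared UI primitives from the Shop app; <Image /> wraps react-native-fast-image.
import {Box, Image, Text} from '../shared/components';

export function ProductRow({title, price, imageUrl}) {
  return (
    <Box flexDirection="row" alignItems="center" padding="m">
      <Image source={{uri: imageUrl}} style={{width: 48, height: 48}} />
      <Box marginLeft="m">
        <Text>{title}</Text>
        <Text color="textSubdued">{price}</Text>
      </Box>
    </Box>
  );
}
```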

But when we imported this component into the App Clip, it crashed. After some digging, we realized that our <Image /> component uses a library called react-native-fast-image. It’s a library, built with native Swift code, that we use to display large lists of images in a very performant way. As mentioned previously, to keep the App Clip size down we need to explicitly define all of its native dependencies in the Podfile. We hadn’t defined the native dependency for react-native-fast-image, and therefore it crashed. The fix was easy, though: adding the dependency to the Podfile let us use the <ProductRow /> component.

However, our proof-of-concept App Clip weighed in at 6.5MB, meaning we only had 3.5MB to spare. We knew we only wanted to add the absolutely necessary dependencies, and since the App Clip would only display a handful of images, we didn’t deem this library an absolute necessity.

With this in mind, we briefly went through all the components we wanted to share with the order view: maybe this was just a one-time thing we could create a workaround for? We discovered that the majority of the subcomponents of the <OrderView /> somewhere down the line had a native dependency. Upon analyzing how they would affect the App Clip size, we discovered that they would push the App Clip far north of 10MB, with one single dependency weighing in at a staggering 2.5MB.

Standing at a Crossroad

We now realized that sharing components between the order view in the app and the App Clip wasn’t possible, but was that true for all code? At this stage we were standing at a crossroads. Did we want to duplicate everything? Some things? Nothing?

To answer this question we decided to base the decision on the following principles:

  • The App Clip is an experiment: we didn’t know if it would be successful or not, so we wanted to validate the idea as fast as possible.
  • Minimal impact on other developers: we were a small team working on the App Clip, and we didn’t want to add any responsibility to the rest of the developers working on the Shop app.
  • Easy to delete: due to the many unknowns around the success of the experiment, we wanted to double down on writing code that was easy to delete.

With this in mind, we decided that the similarities between the order view in the app and the App Clip are purely coincidental. This change of mindset helped us move forwards very quickly.

Build Phase

Building the App Clip was very similar to building any other React Native app; the only real difference was that we constantly needed to keep track of its size. Since checking the size of the App Clip was very time consuming, around 25 minutes each time on our local machines, we decided to only do this when new dependencies were added, along with some ad hoc checks from time to time.

All the components for the App Clip were created from scratch, with the exception of the shared components and functions we use within the Shop app. Inside our shared/ directory there are a lot of powerful foundational tools we wanted to use in the App Clip: <Box />, <Text />, and a few others that we rely on heavily to structure our UI in the Shop app with the help of our Restyle library. We also wanted to reuse the shared hooks for setting up push notifications, creating a user, and so on. As mentioned earlier, sharing code between the app and the App Clip could potentially cause issues: if a developer decides to add a new native dependency to <Box /> or <Text />, they would, often unknowingly, affect the App Clip as well.

However, we deemed these shared components mature enough that no large changes were likely to be made to them. To catch any new dependencies being added to these shared components, we wrote a CI script to detect this and notify the pull request author.

The script did three things:

  1. Go through the Podfile and create a list of all the native dependencies.
  2. Traverse through all imports the App Clip made and create a list of the ones that have native dependencies.
  3. Finally, compare the two lists. If they don’t match, the CI job fails with instructions on how to proceed.
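
A rough, simplified sketch of what such a check could look like (file paths, the target name, and the “package ships a .podspec” heuristic are assumptions; the real script also needs to walk the import graph transitively):

```javascript
// check-app-clip-pods.js — sketch of a CI check comparing the Podfile's
// App Clip target against the native packages imported by the App Clip code.
const fs = require('fs');
const path = require('path');

// 1. Native dependencies declared for the App Clip target in the Podfile.
function declaredAppClipPods(podfilePath) {
  const podfile = fs.readFileSync(podfilePath, 'utf8');
  const target = podfile.split("target 'Shop App Clip' do")[1] ?? '';
  const body = target.split(/\bend\b/)[0];
  return new Set([...body.matchAll(/pod '([^']+)'/g)].map((m) => m[1]));
}

// 2. Imported packages that ship native code (approximated as "the package
//    directory contains a .podspec file").
function requiredAppClipPods(sourceDir) {
  const pods = new Set();
  for (const file of fs.readdirSync(sourceDir)) {
    if (!/\.(js|jsx|ts|tsx)$/.test(file)) continue;
    const source = fs.readFileSync(path.join(sourceDir, file), 'utf8');
    for (const [, pkg] of source.matchAll(/from '([^'./][^']*)'/g)) {
      const dir = path.join('node_modules', pkg);
      if (!fs.existsSync(dir)) continue;
      const podspec = fs.readdirSync(dir).find((f) => f.endsWith('.podspec'));
      if (podspec) pods.add(path.basename(podspec, '.podspec'));
    }
  }
  return pods;
}

// 3. Compare the two lists and fail the CI job if they don't match.
const declared = declaredAppClipPods('ios/Podfile');
const missing = [...requiredAppClipPods('src/app-clip')].filter(
  (pod) => !declared.has(pod),
);

if (missing.length > 0) {
  console.error(`App Clip Podfile is missing native pods: ${missing.join(', ')}`);
  process.exit(1);
}
```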

A few times we stumbled upon issues with dependencies, either shared or external ones, adding weight to the App Clip. This could be a third-party library for animations, async storage, or getting information about the device. With our “technology drives design” principle in mind, we often removed the dependencies for non-critical features, as we did with the animation library.

We now felt more confident on how to think while building an App Clip and we moved fast, continuously creating and merging pull requests.

Support Invocation URLs in the App

The app always takes precedence over the App Clip, meaning that if you invoke the App Clip by scanning a QR code but already have the app installed, the app opens instead of the App Clip. We had to build support for invocations in the app as well, so that even if the user has the app installed, scanning the QR code would automatically import the order.

React Native enables us to do this through the Linking module:
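
In rough form (the importOrderFromUrl helper is hypothetical):

```javascript
import {Linking} from 'react-native';

// Handle the invocation URL both when the app is cold-started from a QR code
// and when it's already running in the background.
export async function handleInvocationUrl(importOrderFromUrl) {
  const initialUrl = await Linking.getInitialURL();
  if (initialUrl != null) {
    importOrderFromUrl(initialUrl);
  }

  Linking.addEventListener('url', ({url}) => importOrderFromUrl(url));
}
```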

The module allowed us to fetch the invocation URL inside the app and create the order for already existing app users. With this, we now supported importing an order by scanning a QR code both in the App Clip and the app.

Smooth Transition to the App

The last feature we wanted to implement was a smooth transition to the app. If the user decides to upgrade from the App Clip to the full app experience, we wanted to provide a simpler onboarding experience and also magically have their order ready for them in the app. Apple provides a very nice solution to this with shared data containers which both the App Clip and the app have access to.

Now we can store user data in the App Clip that the app has access to, providing an optimal onboarding experience if the user decides to upgrade.

Testing the App Clip

Throughout the development and launch of the App Clip, testing was difficult. Apple provides a great way to mock an invocation of the App Clip by hardcoding the invocation URL in Xcode, but there was no way to test the full end-to-end flow of scanning the QR code, invoking the App Clip, and downloading the app. This wasn’t possible on our local machines or via TestFlight. To verify that the flow would work as expected, we decided to release a first version of the App Clip extremely early. With the help of beta flags, we made sure the App Clip could only be invoked by the team. This early release had no functionality; it only verified that the App Clip received the invocation URL and passed along the proper data to the app for a great onboarding experience. Once this flow was working, and we could trust that our local mockups worked the same as in production, testing the App Clip got a lot easier.

After extensive testing, we felt ready to release the App Clip. The release process was very similar to a regular app release since the App Clip is bundled into the app; the only thing we needed to add was copy and image assets in App Store Connect for the invocation modal.

The App Store Connect screen for uploading copy and image assets

We approached this project with a lot of unknowns: the technology was new, and new to us. We were trying to build an App Clip with React Native, which isn’t typical! Our approach, to fail fast and iterate, worked well. Having a developer with native iOS development experience was very helpful, because App Clips, even ones written in React Native, involve a lot of Apple’s tooling.

One challenge we didn’t anticipate was how difficult it would be to share code. It turned out that sharing code introduced too much complexity into the main application, and we didn’t want to impact the development process for the entire Shop team. So we copied code where it made sense.

Our final App Clip size was 9.1MB, just shy of the 10MB limit. Having such a hard constraint was a fun challenge. We managed to build most of what we initially had in mind, and there are further optimizations we can still make.

Sebastian Ekström is a Senior Developer based in Stockholm who has been with Shopify since 2018. He’s currently working in the Shop Retention team.



Five Tips for Growing Your Engineering Career

The beginning stages of a career in engineering can be daunting. You’re trying to make the most of the opportunity at your new job and learning as much as you can, and as a result, it can be hard to find time and energy to focus on growth. Here are five practical tips that can help you grow as you navigate your engineering career.


Using Propensity Score Matching to Uncover Shopify Capital’s Effect on Business Growth

By Breno Freitas and Nevena Francetic

Five years ago, we introduced Shopify Capital, our data-powered product that enables merchants to access funding from right within the Shopify platform. We built it using a version of a recurrent neural network (RNN)—analyzing more than 70 million data points across the Shopify platform to understand trends in merchants’ growth potential and offer cash advances that make sense for their businesses. To date, we’ve provided our merchants with over $2.7 billion in funding.

But how much of an impact was Shopify Capital having, really? Our executives wanted to know—and as a Data team, we were invested in this question too. We were interested in validating our hypothesis that our product was having a measurable, positive impact on our merchants.

We’ve already delved into the impact of the program in another blog post, Digging Through the Data: Shopify Capital's Effect on Business Growth, but today, we want to share how we got our results. In this post, we’re going behind the scenes to show you how we investigated whether Shopify Capital does what we intended it to do: help our merchants grow.

The Research Question

What’s the impact on future cumulative gross merchandise value (for example, sales) of a shop after they take Shopify Capital for the first time?

To test whether Shopify merchants who accepted Capital were more successful than those who didn’t, we needed to compare their results against an alternative future (the counterfactual) in which merchants who desired Capital didn’t receive it. In other words, an A/B test.

Unfortunately, in order to conduct a proper A/B test, we would need to randomly and automatically reject half of the merchants who expressed interest in Capital for some period of time in order to collect data for proper analysis. While this makes for good data collection, it would be a terrible experience for our users and undermine our mission to help merchants grow, which we were unwilling to do.

With Shopify Capital only being active in the US in 2019, an alternative solution would be to use Canadian merchants who didn’t yet have access to Shopify Capital (Capital launched in Canada and the UK in Spring 2020) as our “alternate reality.” We needed to seek out Canadian shops who would have used Shopify Capital if given the opportunity, but weren’t able to because it wasn’t yet available in their market.

We can do this comparison through a method called “propensity score matching” (PSM).

Matchmaker, Matchmaker, Make Me a Match

In the 1980s, researchers Rosenbaum and Rubin proposed PSM as a method to reduce bias in the estimation of treatment effects with observational data sets. This is a method that has become increasingly popular in medical trials and in social studies, particularly in cases where it isn’t possible to complete a proper random trial. A propensity score is defined as the likelihood of a unit being assigned to the treatment group. In this case: What are the chances of a merchant accepting Shopify Capital if it were offered to them?

It works like this: After propensity scores are estimated, the participants are matched with similar counterparts on the other set, as depicted below.

Depiction of matching performed on two sets of samples based on their propensity scores.

We’re looking for similar likelihoods of taking treatment, and we only analyze samples from the two sets that are close enough to get a match, while respecting any other constraints imposed by the selected matching methodology. This means we could even drop samples from the treatment group during matching if their scores fall outside the parameters we’ve set.

Once matched, we’ll be able to determine the difference in gross merchandise value (GMV), that is, sales, between the control and treatment groups in the six months after they take Shopify Capital for the first time.

Digging into the Data Sets

As previously discussed, in order to do the matching, we needed two sets of participants in the experiment, the treatment group, and the control group. We decided to set our experiment for a six-month period, starting in January 2019 to remove any confounding effect of COVID-19.

We segment our two groups as follows:

  • Treatment Group: American shops that were first-time Capital adopters in January 2019, had been on the platform for at least three months prior (to ensure they were established), and were still Shopify customers in April 2020.
  • Control Group: Canadian shops that had been customers for at least three months prior to January 2019 and pre-qualified for Capital in Canada when we launched it in April 2020.

Ideally, we would have recreated the underwriting criteria from January 2019 to see which Canadian shops would have pre-qualified for Capital at that time. To proxy for this, we looked at shops in the US and Canada that remained stable until at least April 2020, and then went backwards to analyze their 2019 data.

Key assumptions:

  • Shops in Canada didn’t take an offer for the sole reason that Capital didn’t exist in Canada at that time.
  • Shops in the US and Canada have equal access to external financing sources we can’t control (for example, small business loans).
  • The environments that Canadian and US merchants operate in are more or less the same.

Matchmaking Methodology

    We began our matching process with approximately 8,000 control shops and about 600 treated shops. At the end of the day, our goal was to make the distributions of the propensity scores for each group of shops match as closely as possible.

    Foundational Setup

    For the next stage in our matching, we set up some features, using characteristics from within the Shopify platform to describe a shop. The literature says there’s no right or wrong way to pick characteristics—just use your discernment to choose whichever ones make the most sense for your business problem.

    We opted to use merchants’ (which we’ll refer to as shops) sales and performance in Shopify. While we have to keep the exact characteristics a secret for privacy reasons, we can say that some of the characteristics we used are the same ones the model would use to generate a Shopify Capital offer.

At this stage, we also logarithmically transformed many of the covariates. We did this because some of the features we were using have extreme variance. Transforming them to logarithmic space shrinks those variances (for example, large disparities in revenue), reduces skew, and makes the linear regressions behave better.

    It’s a Match!

    There are many ways we could match the participants on both sets—the choice of algorithm depends on the research objectives, desired analysis, and cost considerations. For the purpose of this study, we chose a caliper matching algorithm.

A caliper matching algorithm is basically a nearest neighbors (NN) greedy matching algorithm where, starting from the largest score, the algorithm tries to find the closest match in the other set. It differs from a regular NN greedy algorithm in that it only allows matches within a certain threshold. The caliper defines the maximum distance the algorithm is allowed to have between matches—this is key because if the caliper is infinite, you’ll always find a neighbor, but that neighbor might be pretty far away. This means not all shops will necessarily find matches, but the matches we end up with will be fairly close. We followed Austin’s recommendation to choose our caliper width.
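To make the matching step concrete, here’s a minimal sketch of greedy caliper matching in TypeScript. It assumes propensity scores have already been estimated elsewhere (for example, with a logistic regression); the Shop shape and the caliper helper are illustrative, not the actual model or code used in the study.

```ts
interface Shop {
  id: string;
  propensityScore: number; // estimated elsewhere, for example by a logistic regression
}

// Austin's rule of thumb: a caliper of 0.2 times the standard deviation of the
// logit of the propensity score, computed over both groups combined.
function austinCaliper(allScores: number[]): number {
  const logits = allScores.map((p) => Math.log(p / (1 - p)));
  const mean = logits.reduce((sum, x) => sum + x, 0) / logits.length;
  const variance = logits.reduce((sum, x) => sum + (x - mean) ** 2, 0) / logits.length;
  return 0.2 * Math.sqrt(variance);
}

// Greedy nearest-neighbour matching with a caliper: starting from the highest
// treated score, pair each treated shop with its closest unmatched control
// shop, but only keep the pair if the distance is within the caliper.
function caliperMatch(treated: Shop[], control: Shop[], caliper: number): Array<[Shop, Shop]> {
  const matches: Array<[Shop, Shop]> = [];
  const available = new Set(control);
  const byScoreDesc = [...treated].sort((a, b) => b.propensityScore - a.propensityScore);

  for (const t of byScoreDesc) {
    let best: Shop | null = null;
    let bestDistance = Infinity;
    for (const c of available) {
      const distance = Math.abs(t.propensityScore - c.propensityScore);
      if (distance < bestDistance) {
        best = c;
        bestDistance = distance;
      }
    }
    if (best && bestDistance <= caliper) {
      matches.push([t, best]);
      available.delete(best); // each control shop can only be matched once
    }
    // Otherwise the treated shop is dropped from the analysis.
  }
  return matches;
}
```

The greedy, one-to-one pass is why some treated shops can end up without a match, exactly as described above.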

After computing the caliper and running the greedy NN matching algorithm, we found a Canadian counterpart for all but one US first-time Capital adopter.

    Matching Quality

    Before jumping to evaluate the impact of Capital, we need to determine the quality of our matching. We used the following three techniques to assess balance:

    1. Standardized mean differences: This methodology compares the averages of the distributions for the covariates for the two groups. When close to zero, it indicates good balance. Several recommended thresholds have been published in the literature with many authors recommending 0.1. We can visualize this using a “love plot,” like so:

      Love plot comparing feature absolute standardized differences before and after matching.
    2. Visual Diagnostics: Visual diagnostics such as empirical cumulative distribution plots (eCDF), quantile-quantile plots, and kernel density plots can be used to see exactly how the covariate distributions differ from each other (that is, where in the distribution are the greatest imbalances). We plot their distributions to check visually how they look pre and post matching. Ideally, the distributions are superimposed on one another after matching.

      Propensity score plots before matching - less overlapping, indicating fewer matches were found between groups.
      Propensity score plots after matching - increased overlapping, indicating good matches between groups.
    3. Variance Ratios: The variance ratio is the ratio of the variance of a covariate in one group to that in the other. Variance ratios close to one indicate good balance because they imply the variances of the samples are similar, whereas values close to two are sometimes considered extreme.

Across these three checks, only one of our covariates hit the 0.1 threshold in the standardized mean differences method, the visual comparison (see above) showed great improvement and good alignment in covariate distributions for the matched sets, and all of our variance ratios were below 1.3.

These checks cover most of the steps recommended in the literature for verifying that a matching is suitable for further analysis. While we could go further and keep tweaking covariates and testing different methods until a perfect matching is achieved, that would risk introducing bias and wouldn’t guarantee the assumptions would be any stronger. So, we decided to proceed with assessing the treatment effect.
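As an illustration of the first check, here’s a minimal sketch of the absolute standardized mean difference for a single covariate; the sample arrays are placeholders, and 0.1 is the commonly cited threshold mentioned above.

```ts
// Absolute standardized mean difference for one covariate:
// |mean(treated) - mean(control)| / sqrt((var(treated) + var(control)) / 2)
function standardizedMeanDifference(treated: number[], control: number[]): number {
  const mean = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const pooledSd = Math.sqrt((variance(treated) + variance(control)) / 2);
  return Math.abs(mean(treated) - mean(control)) / pooledSd;
}

// A covariate is usually considered well balanced when its SMD is below ~0.1.
const isBalanced = standardizedMeanDifference([1.2, 0.8, 1.1], [1.0, 0.9, 1.3]) < 0.1;
```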

    How We Evaluated Impact

At this point, the shops were matched, we had the counterfactual and treatment groups, and we knew the matching was balanced. We’d arrived at the real question: Is Shopify Capital impacting merchants’ sales? What’s the difference in GMV between shops who did and didn’t receive Shopify Capital?

    In order to assess the effect of the treatment, we set up a simple binary regression: y’ = β₀ + β₁ * T.

Here T is a binary indicator of whether the data point is a US (treatment) or Canadian (control) shop, β₀ is the intercept for the regression, and β₁ is the coefficient that shows how receiving treatment influences our target on average. The target, y′, is the logarithm of the cumulative six-month GMV, from February to July 2019, plus one (that is, a log1p transform of six-month sales).
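Because T is a single binary regressor, the fitted β₁ is just the difference between the group means of the log1p-transformed GMV, and exp(β₁) − 1 is the relative lift in the geometric average. Here’s a minimal sketch of that calculation; the GMV arrays are placeholders, not real data.

```ts
// Treatment effect for the binary regression y' = b0 + b1 * T, where
// y' = log1p(six-month GMV). With one binary regressor, the OLS estimate of
// b1 equals mean(y' | treated) - mean(y' | control).
function relativeLift(treatedGmv: number[], controlGmv: number[]): number {
  const meanLog1p = (xs: number[]) =>
    xs.reduce((sum, gmv) => sum + Math.log1p(gmv), 0) / xs.length;
  const b1 = meanLog1p(treatedGmv) - meanLog1p(controlGmv);
  return Math.expm1(b1); // exp(b1) - 1: relative lift in the geometric average
}

// A result of 0.36 would correspond to a ~36% higher geometric average of
// cumulative six-month GMV for the treated group (placeholder data only).
```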

    Using this methodology, we found that US merchants on average had a 36% higher geometric average of cumulative six-month GMV after taking Capital for the first time than their peers in Canada.

    How Confident Are We in Our Estimated Treatment Effect? 

    In order to make sure we were confident in the treatment effect we calculated, we ran several robustness checks. We won’t get into the details, but we used the margins package, simulated an A/A test to validate our point estimate, and followed Greifer’s proposed method for bootstrapping.

    Cumulative geometric average of sales between groups before and after taking their first round of Capital.

    Our results show that the 95% confidence interval for the average increase in the target, after taking Capital for the first time, is between 13% and 65%. The most important takeaway is that the lower bound is positive—so we can say with high confidence that Shopify Capital has a positive effect on merchants’ sales.

    Final Thoughts

    With high statistical significance, backed by robustness checks, we concluded that the average difference in the geometric mean of GMV in the following six months after adopting Shopify Capital for the first time is +36%, bounded by +13% and +65%. We can now say with confidence that Shopify Capital does indeed help our merchants—and not only that, but it validates the work we’re doing as a data team. Through this study, we were able to prove that one of our first machine learning products has a significant real-world impact, making funding more accessible and helping merchants grow their businesses. We look forward to continuing to create innovative solutions that help our merchants achieve their goals.

    Breno Freitas is a Staff Data Scientist working on Shopify Capital Data and a machine learning researcher at Federal University of Sao Carlos, Brazil. Breno has worked with Shopify Capital for over four years and currently leads a squad within the team. Currently based in Ottawa, Canada, Breno enjoys kayaking and working on DIY projects in his spare time.

    Nevena Francetic is a Senior Data Science Manager for Money at Shopify. She’s leading teams that use data to power and transform financial products. She lives in Ottawa, Ontario and in her spare time she spoils her little nephews. To connect, reach her on LinkedIn.


    If you’re passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

    Building Blocks of High Performance Hydrogen-powered Storefronts

    The future of commerce is dynamic, contextual, and personalized. Hydrogen is a React-based framework for building custom and creative storefronts giving developers everything they need to start fast, build fast, and deliver the best personalized and dynamic buyer experiences powered by Shopify’s platform and APIs. We’ve built and designed Hydrogen to meet the three needs of commerce:

    1. fast user experience: fast loading and responsive
    2. best-in-class merchant capabilities: personalized, contextual, and dynamic commerce
    3. great developer experience: easy, maintainable, and fun.
    A visualization of a .tsx file showing the ease of adding an Add to Cart button to a customized storefront
    Hydrogen provides optimized React components enabling you to start fast.

These objectives have inherent tension that’s important to acknowledge. You can achieve fast loading through static generation and edge delivery, but then you must either forgo personalization or make it a client-side concern, which defers the display of critical content. Conversely, rendering dynamic responses from the server implies a slower initial render but, when done correctly, can deliver a better commerce and shopping experience. However, delivering efficient streaming server-side rendering for React-powered storefronts, along with smart server and client caching, is a non-trivial and unsolved developer experience hurdle for most teams.

Hydrogen is built and optimized to power personalized, contextual, and dynamic commerce. Fast and efficient server-side rendering with high-performance storefront data access is the prerequisite for such experiences. To optimize the user experience, we leverage a collection of strategies that work together:

    • streaming server-side rendering
    • React Server Components
    • efficient data fetching, colocation, and caching
    • combining dynamic and edge serving.

There’s a lot to unpack here, so let’s take a closer look at each one.

    Streaming Server-side Rendering

    Consider a product page that contains a significant amount of buyer personalized content: a localized description and price for a given product, a dynamic list of recommended products powered by purchase and navigation history, a custom call to action (CTA) or promotion banner, and the assignment to one or several multivariate A/B tests.

    A client-side strategy would, likely, result in a fast render of an empty product page skeleton, with a series of post-render, browser-initiated fetches to retrieve and render the required content. These client-initiated roundtrips quickly add up to a subpar user experience.

    A visualization showing the differences between Client-side Rendering and Server-side Rendering
    Client-side rendering vs. server-side rendering

The client-side rendering (CSR) strategy typically results in a delayed display of critical page content—that is, a slow largest contentful paint (LCP). An alternative strategy is server-side rendering (SSR)—fetch the data on the server and return it in the response—which helps eliminate round trips and allows the first and largest contentful paints to fire close together, but at the cost of a slow time-to-first-byte (TTFB) because the server is blocked on the data. This is where and why streaming SSR is a critical optimization.

    A visualization showing how Streaming Server-side Rendering unlocks critical performance benefits.
    Streaming server-side rendering unlocks fast, non-blocking first render

    Hydrogen adopts the new React 18 alpha streaming SSR API powered by Suspense that unlocks critical performance benefits:

    • Fast TTFB: the browser streams the HTML page shell without blocking the server-side data fetch. This is in contrast to “standard” SSR where TTFB is blocked until all data queries are resolved.
    • Progressive hydration: as server-side data fetches are resolved, the data is streamed within the HTML response, and the React runtime progressively hydrates the state of each component, all without extra client round trips or blocking on rendering the full component tree. This also means that individual components can show custom loading states as the page is streamed and constructed by the browser.

    The ability to stream and progressively hydrate and render the application unlocks fast TTFB and eliminates the client-side waterfall of CSR—it’s a perfect fit for the world of dynamic and high-performance commerce.
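As a rough sketch of what that looks like (using plain React 18 Suspense rather than Hydrogen’s actual components, and with placeholder ProductDetails and ProductRecommendations components), critical content streams immediately while a slower, personalized section resolves behind a Suspense boundary:

```tsx
import { Suspense } from 'react';
// Placeholder components for this sketch: ProductDetails renders critical,
// immediately available content; ProductRecommendations suspends while its
// personalized data is fetched on the server.
import { ProductDetails, ProductRecommendations } from './product-components';

export default function ProductPage({ productId }: { productId: string }) {
  return (
    <main>
      {/* Critical content is rendered and streamed right away. */}
      <ProductDetails productId={productId} />

      {/* The data-heavy section is wrapped in Suspense so the HTML shell isn't
          blocked on its fetch; React streams it in once the data resolves and
          progressively hydrates the component. */}
      <Suspense fallback={<p>Loading recommendations…</p>}>
        <ProductRecommendations productId={productId} />
      </Suspense>
    </main>
  );
}
```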

    React Server Components

    “Server Components allow developers to build apps that span the server and client, combining the rich interactivity of client-side apps with the improved performance of traditional server rendering.”
        —RFC: React Server Components

Server components are another building block that we believe (and have been collaborating on with the React core team) is critical to delivering high-performance storefronts. React Server Components (RSC) enable a separation of concerns between client and server logic and components, which unlocks a host of downstream benefits:

    • server-only code that never ships to the client, reducing bundle sizes
    • server-side access to custom and private server-side data sources
    • seamless integration and well-defined protocol for server+client components
    • streaming rendering and progressive hydration
    • subtree and component-level updates that preserve client-state
    • server and client code sharing where appropriate.
    A home.server.jsx file that has been highlighted to show where code sharing happens, the server-side data fetch, and the streaming server-side response.

    Server components are a new building block for most React developers and have a learning curve, but, after working with them for the last ten months, we’re confident in the architecture and performance benefits that they unlock. If you haven’t already, we encourage you to read the RFC, watch the overview video, and dive into Hydrogen docs on RSC.

    Efficient Data Fetching, Colocation, and Caching

    Delivering fast server-side responses requires fast and efficient first party (Shopify) and third party data access. When deployed on Oxygen—a distributed, Shopify hosted V8 Isolate-powered worker runtime—the Hydrogen server components query the Storefront API with localhost speed: store data is colocated and milliseconds away. For third party fetches, the runtime exposes standard Fetch API enhanced with smart cache defaults and configurable caching strategies:

    • smart default caching policy: key generation and cache TTLs
    • ability to override and customize cache keys, TTLs, and caching policies
    • built-in support for asynchronous data refresh via stale-while-revalidate.

    To learn more, see our documentation on useShopQuery for accessing Shopify data, and fetch policies and options for efficient data fetching.
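The exact fetch options live in the Hydrogen docs linked above. To illustrate the idea behind stale-while-revalidate itself, here’s a minimal, framework-agnostic in-memory sketch; the cache keying and TTL handling are simplified examples, not Hydrogen’s or Oxygen’s implementation.

```ts
interface CacheEntry {
  body: string;
  storedAt: number; // epoch milliseconds
}

const cache = new Map<string, CacheEntry>();

// Serve from cache when possible; if the entry is older than maxAgeMs, return
// the stale copy immediately and refresh it asynchronously in the background.
async function swrFetch(url: string, maxAgeMs: number): Promise<string> {
  const refresh = async () => {
    const response = await fetch(url);
    const body = await response.text();
    cache.set(url, { body, storedAt: Date.now() });
    return body;
  };

  const entry = cache.get(url);
  if (!entry) {
    return refresh(); // cache miss: block on the network this one time
  }
  if (Date.now() - entry.storedAt > maxAgeMs) {
    void refresh(); // stale: kick off revalidation without awaiting it
  }
  return entry.body; // fresh, or stale but acceptable: respond immediately
}
```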

    Combining the Best of Dynamic and Edge Serving

Adopting Hydrogen doesn’t mean all data must be fetched from the server. On the contrary, it’s good practice to defer or lazy-load non-critical content on the client. Below-the-fold or non-critical content can be loaded on the client using regular React patterns and browser APIs—for example, using IntersectionObserver to determine when content is on or soon to be on screen and load it on demand.
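For instance, here’s a minimal sketch of deferring a below-the-fold section with IntersectionObserver; the selector and endpoint are placeholders.

```ts
// Defer loading a below-the-fold section until it's about to scroll into view.
const section = document.querySelector<HTMLElement>('#reviews'); // placeholder selector

if (section) {
  const observer = new IntersectionObserver(
    async (entries, obs) => {
      if (entries.some((entry) => entry.isIntersecting)) {
        obs.disconnect(); // only load the content once
        const response = await fetch('/api/reviews'); // placeholder endpoint
        section.innerHTML = await response.text();
      }
    },
    { rootMargin: '200px' } // start fetching slightly before the section is visible
  );
  observer.observe(section);
}
```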

    Similarly, there’s no requirement that all requests are server-rendered. Pages and subrequests with static or infrequently updated content can be served from the edge. Hydrogen is built to give developers the flexibility to deliver the critical personalized and contextual content, rendered by the server, with the best possible performance while still giving you full access to the power of client-side fetching and interactivity of any React application.

    The important consideration isn’t which architecture to adopt, but when you should be using server-side rendering, client-side fetching, and edge delivery to provide the best commerce experience—a decision that can be made at a page and component level.

For example, an about or a marketing page that’s typically static can and should be safely cached, served directly from the CDN edge, and asynchronously revalidated with the help of a stale-while-revalidate strategy. The opt-in to edge serving is a few keystrokes away for any response on a Hydrogen storefront. This capability, combined with granular and optimized subrequest caching—powered by the fetch API we covered above—gives full control over data freshness and the revalidation strategy.

    Putting It All Together

Delivering a high-performance, dynamic, contextual, and personalized commerce experience requires optimizations at each layer of the stack. Historically, this has been the domain of a few well-resourced engineering teams. The goal of Hydrogen and Oxygen is to level the playing field:

    • the framework abstracts all the streaming
    • the components are tuned to speak to Shopify APIs
    • the Oxygen runtime colocates and distributes rendering around the globe.

    Adopting Hydrogen and Oxygen should, we hope, enable developers to focus on building amazing commerce experiences, instead of the undifferentiated technology plumbing and production operations to power a modern and resilient storefront.

    Take Hydrogen out for a spin, read the docs, leave feedback. Let’s build.

    Ilya Grigorik is a Principal Engineer at Shopify and author of High Performance Browser Networking (O'Reilly), on a mission to supercharge commerce and empower entrepreneurs around the world.


    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    The Vitality of Core Web Vitals

In 2020, Google introduced unified guidance for great user experience (UX) on the web called Core Web Vitals. It proposes specific metrics and numerical thresholds for a multifaceted discipline. The current metrics focus on loading, interactivity, and visual stability. You might think, “Nice stuff, thank you, Google. I’ll save this to my bookmarks and look into it once the time comes for nice-to-have investigations!” But before deciding, take a closer look at the following: this year, Google incorporated the Core Web Vitals metrics into its search ranking algorithm as one of the factors. To be precise, the rollout of page experience in ranking systems began in mid-June of 2021 and completed at the end of August.

Does that mean we should notice a completely different ranking of Google Search results in September already? Or, the horror case, our websites shown on the s-e-c-o-n-d Search Engine Results Page (SERP)? Drastic changes won’t appear overnight, but the update will undoubtedly influence the future of ranking.

First of all, the usability of web pages is only one factor that influences ranking. The meaning of the query, the relevance of a page, the quality of sources, context, and settings are other “big influencers” deciding the final results. Secondly, most websites are in the same boat, getting “not great, not terrible” grades. According to the Google Core Web Vitals Study from April 2021, only four percent of all studied websites are prepared for the update and score well on all three metrics. It’s a good time for companies to invest in the necessary improvements and easily stand out among other websites. Lastly, user expectations continue to rise, and Google has a responsibility to help users reach relevant results. At the same time, Google pushes the digital community to prioritize UX because that helps keep users on their websites. Google’s study shows that visitors are 24% less likely to abandon websites that meet the proposed metric thresholds.

Your brain is most likely filled with dopamine just thinking about possible UX improvements to your website. Let’s use that momentum and dig deeper into each Core Web Vitals metric.

    Core Web Vitals Metrics

Core Web Vitals is a subset of Web Vitals, Google’s unified guidance for indicating great UX. The core metrics highlight what matters most. These metrics aren’t written in stone! They represent the best indicators available to developers today, so be welcoming of future improvements or additions.

    The current set of metrics is largest contentful paint (LCP), first input delay (FID), and cumulative layout shift (CLS).

    An image showing the three Core Web Vitals and the four Other Web Vitals
    Listed metrics of Web Vitals: mobile-friendly, safe browsing, HTTPS, no intrusive interstitials, loading, visual stability, and interactivity. The last three are the Core Web Vitals.

    Largest Contentful Paint

LCP measures the time to render the largest element in the currently visible part of a page. The purpose is to capture how quickly the main content is ready for the user. Discussion and research settled on treating the largest image or text block in the viewport as the main content. The elements considered are:

    • <img>
    • <image> inside an <svg> (Note: <svg> itself currently is not considered as a candidate)
    • <video>
    • an element with a background image loaded via the url()
    • block-level elements containing text nodes or other inline-level text element children.

    During the page load, the largest element in a viewport is detected as a candidate of LCP. It might change until the page is fully loaded. In example A below, the candidate changed three times since larger elements were found. Commonly, the LCP is the last loaded element, but that’s not always the case. In example B below, the paragraph of text is the largest element displayed before a page loads an image. Comparing the two, example B has a better LCP score than example A.

    LCP detection in two examples: LCP is the last loaded element on a page (A), LCP occurs before the page is fully loaded (B).
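In browsers that implement the Largest Contentful Paint API, each candidate is exposed as a performance entry. A minimal sketch of observing them:

```ts
// Log each LCP candidate as the browser discovers larger elements during load.
// The last candidate reported before the user interacts is the page's LCP.
const lcpObserver = new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    console.log('LCP candidate at', entry.startTime, 'ms:', entry);
  }
});

lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });
```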

Websites should aim for an LCP of 2.5 seconds or less to get a score that indicates good UX. But… why 2.5? The inspiration was taken from studies by Stuart K. Card and Robert B. Miller, which found that a user will wait roughly 0.3 to 3 seconds before losing focus. In addition, data gathered about top-performing sites across the Web showed that such a limit is consistently achievable for well-optimized sites.

    Metric   Good        Poor
    LCP      <= 2.5s     > 4s
    FID      <= 100ms    > 300ms
    CLS      <= 0.1      > 0.25

    The thresholds of “good” and “poor” Core Web Vitals scores. Scores in between are considered “needs improvement”.

    First Input Delay

FID quantifies the user’s first impression of the responsiveness and interactivity of a page. To be precise, it measures how long the browser takes to become available to respond to the user’s first interaction on a page—for instance, the time between when the user clicks the “open modal” button and when the browser is ready to trigger the modal opening. You may wonder, shouldn’t the code be executed immediately after the user’s action? Not necessarily: during page load, the browser’s main thread is busy parsing and executing loaded JavaScript (JS) files, so incoming events might have to wait to be processed.

    A visualization of a browser loading a webpage showing that FID is the time between the user's first interaction and when they can respond
    FID represents the time between when a browser receives the first user’s interaction and can respond to that.

FID measures only the delay in event processing. The time to process the event and update the UI afterwards was deliberately excluded to avoid workarounds like moving event processing logic to asynchronous callbacks. That workaround would improve the metric score because it separates processing from the task associated with the event. Sadly, it wouldn’t bring any benefit to the user—likely the opposite.
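A minimal sketch of capturing that delay with the Event Timing API (production setups usually rely on a RUM script or the web-vitals library, but the underlying entry looks like this; the small interface only exists because the base PerformanceEntry type doesn’t expose processingStart):

```ts
// Minimal typing for the part of the Event Timing entry we need.
interface FirstInputEntry extends PerformanceEntry {
  processingStart: number;
}

// FID is the gap between when the browser received the first input and when it
// could start processing it: processingStart - startTime.
const fidObserver = new PerformanceObserver((entryList) => {
  const firstInput = entryList.getEntries()[0] as FirstInputEntry | undefined;
  if (firstInput) {
    console.log('FID:', firstInput.processingStart - firstInput.startTime, 'ms');
    fidObserver.disconnect(); // only the first input matters for FID
  }
});

fidObserver.observe({ type: 'first-input', buffered: true });
```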

A user’s interaction requires the main thread to be idle even when no event listener is registered. For example, the main thread might delay the user’s interaction with the following HTML elements until it completes ongoing tasks:

    • text fields, checkboxes, and radio buttons (<input>, <textarea>)
    • select dropdowns (<select>)
    • links (<a>).

Jakob Nielsen wrote in the book Usability Engineering: “0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result”. Despite being first described in 1993, the same limit is still considered good by Core Web Vitals today.

    Cumulative Layout Shift

CLS measures visual stability and counts how much the visible content shifts around. Layout shifts occur when existing elements change their start position, as defined by the Layout Instability API. Note that when a new element is added to the DOM or an existing element changes size, it doesn’t count as a layout shift!

    The metric is named “cumulative” because the score of each shift is summed. In June 2021, the duration of CLS was improved for long-lived pages (for example SPAs and infinite scroll apps) by grouping layout shifts and ensuring the score doesn’t grow unbounded.

Are all layout shifts bad? No. CLS focuses only on unexpected ones. Expected layout shifts occur within 500 milliseconds after a user’s interaction (that is, clicking a link, typing in a search box, and so on) and are excluded from CLS score calculations. This knowledge may encourage you to create extra space immediately after the user’s input, with a loading state for tasks that take longer to complete.
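Here’s a minimal sketch of how that exclusion plays out with the Layout Instability API: entries flagged with hadRecentInput are skipped and the rest are summed. (This is a simplified running total; as noted above, the official metric now groups shifts into session windows. The local interface exists because TypeScript’s DOM typings may not include LayoutShift.)

```ts
// The Layout Instability API reports each shift as a 'layout-shift' entry.
interface LayoutShiftEntry extends PerformanceEntry {
  value: number;           // the score of this individual shift
  hadRecentInput: boolean; // true if the shift closely followed user input
}

let cumulativeLayoutShift = 0;

const clsObserver = new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries() as LayoutShiftEntry[]) {
    if (!entry.hadRecentInput) {
      // Shifts that closely follow a user interaction are expected and excluded.
      cumulativeLayoutShift += entry.value;
    }
  }
  console.log('CLS so far:', cumulativeLayoutShift);
});

clsObserver.observe({ type: 'layout-shift', buffered: true });
```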

    Let’s use a tiny bit of math to calculate the layout shift score of the following example:

    1. Impact fraction describes the amount of space an unstable element takes up of a viewport. When an element covers 60% of a viewport, its impact fraction is 0.6.
    2. Distance fraction defines the amount of space that an unstable element moves from the original to the final position. When an element moves by 25% of a viewport height, its distance fraction is 0.25.
    3. Having layout_shift_score = impact_fraction * distance_fraction formula, the layout shift score of this example is 0.6 * 0.25 = 0.15.
    The example of a 0.15 layout shift score in a mobile view.

A good CLS score is considered to be 0.1 or less for a page. Evaluations of real-world pages revealed that shifts at such good scores are still detectable but not excessively disruptive, while shifts scoring 0.15 and above are consistently the opposite.

    How to Measure the Score of My Web Page?

There are many different tools to measure Core Web Vitals for a page. They reflect two main measurement techniques: in the lab or in the field.

    In the Lab

Lab data, also known as synthetic data, is collected from a simulated environment without a real user. Measurements in such an environment can be taken before features are released to production. Be aware that FID can’t be measured this way because lab data doesn’t contain the required real user input! As an alternative, it’s suggested to track its proxy—Total Blocking Time (TBT).

    Tooling:

    • Lighthouse: I think it is the most comprehensive tool using lab data. It can be executed on either public or authenticated web pages. The generated report indicates the scores and suggests personalised opportunities to improve the performance. The best part is that Chrome users already have this tool ready to be used under DevTools. The drawback I noticed while using the tool is that the page’s screen has to stay visible during the measurement process, so the same browser doesn’t support analyzing several pages in parallel. Lastly, Lighthouse can be incorporated into continuous integration workflows via Lighthouse CI.
    • WebPageTest: The tool can perform analyses for public pages. I was tricked by the tool when I provided the URL of an authenticated page for the first time. I got results. Better than I expected. Just before patting myself on the back, I decided to dig deeper into the waterfall view. The view showed clearly that the authenticated page wasn’t even reached and the tool had navigated to a public login page instead. Despite that, the tool has handy options to test against different locations, browsers, and device emulators. It might help to identify which country or region struggles the most and start you thinking about a Content Delivery Network (CDN). Finally, be aware that the report includes detailed analyses but doesn’t provide advice for improvements.
    • Web Vitals extension: It's the most concrete tool of all. It contains only metrics and scores for the currently viewed page. In addition, the tool shows how it calculates scores in real-time. For example, FID is shown as “disabled” until your interaction happens on a page.

    In the Field 

    A site’s performance can vary dramatically based on a user’s personalized content, device capabilities, and network conditions. Real User Monitoring (RUM) captures the reality of page performance, including the mentioned differences. Monitoring data shows the performance experienced by a site’s actual users. On the other hand, there’s a way to check the real-world performance of a site without a RUM setup. Chrome User Experience Report gathers and aggregates UX metrics across the public web from opted-in users. Such findings power the following tools:

    • Chrome UX Report Compare Tool (CRUX): As the name dictates, the tool is meant for pages’ comparison. The report includes metrics and scores of selected devices’ groups: desktop, tablet, or mobile. It is a great option to compare your site with similar pages of your competitors.
    • PageSpeed Insights: The tool provides detailed analyses for URLs that are known by Google’s web crawlers. In addition, it highlights the opportunities for improvements.
    • Search Console: The tool reports performance data per page, including historical data. Before using it—verification of ownership is mandatory.
    • Web Vitals extension: The tool was mentioned for lab toolings, but there’s one more feature to reveal. For pages in which field data is available via Chrome UX Report, lab data (named “local” in the extension) is combined with real-user data from the field. This integration might indicate how similar your individual experiences are to other website users.

CRUX-based tools are great quick starters for investigations. That said, your own RUM data can provide more detailed and immediate feedback. Setting up RUM for a website might look scary at the beginning, but it usually takes these steps (a minimal sketch follows the list):

    1. In order to send data from a website, a developer adds a RUM JavaScript snippet to the site’s source code.
    2. Once the user interacts with or leaves the website, data about the experience is sent to a collector, where it’s processed and stored in a database that anyone can view via convenient dashboards.
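Here’s the minimal sketch promised above for step 1. The /rum endpoint and payload shape are hypothetical, and real RUM snippets (or the web-vitals library) handle far more edge cases.

```ts
declare const cumulativeLayoutShift: number; // for example, the running CLS from the earlier sketch

// Send a collected metric to the collector when the page is hidden.
function reportMetric(name: string, value: number): void {
  const payload = JSON.stringify({ name, value, page: location.pathname });
  // sendBeacon queues the request even if the user is navigating away;
  // fall back to a keepalive fetch if the beacon couldn't be queued.
  if (!navigator.sendBeacon('/rum', payload)) {
    void fetch('/rum', { method: 'POST', body: payload, keepalive: true });
  }
}

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    reportMetric('cls', cumulativeLayoutShift);
  }
});
```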

    How to Improve

Core Web Vitals provides insights into what’s hurting the UX. For example, setting up RUM even for a few hours can reveal where the most significant pain points exist. The worst-scoring metrics and pages can indicate where to start searching for improvements. Other tools mentioned in the previous section might suggest how to fix specific issues. The great thing is that all scores will likely increase by applying changes to improve one metric.

Many indications and bits of advice may sound like coins in a Super Mario game, hanging there for you to grab. That isn’t the case. The hard and sweaty work remains on your table! Not all opportunities are straightforward to implement. Some might involve big, long-lasting refactoring that can’t be done in one go or that requires preparation. Here are several strategies to start your explorations:

    1. Update third-party libraries. After reviewing your application libraries, you might find that some are no longer used or that lighter alternatives (covering the same use case) exist. Also, sometimes only a part of an included library is actually used, which means a portion of JS code is loaded without any purpose. Tree-shaking could solve this issue: it loads only the specific features registered from a library instead of loading everything. Be aware that not all libraries support tree-shaking yet, but it’s getting more and more popular. Updating application dependencies may sound like a small win, but let me lead by example. During Shopify’s internal Hack Days, my team executed these updates for our dropshipping app Oberlo. It decreased the compressed bundle size of the application by 23%! How long did it take for research and development? Less than three days.
      This improves FID and LCP.
    2. Preload critical assets. The loading process might be extended due to the late discovery of crucial page resources by the browser. By noting which resources can be fetched as soon as possible, the loading can be improved drastically. For example, Shopify noticed a 50% (1.2 seconds) improvement in time-to-text-paint by preloading Web Fonts.
      This improves FID and LCP.
    3. Review your server response time. If you’re experiencing severe delays, you may try the following:
      a) use a dedicated server instead of a shared one for web hosting
      b) route the user to a nearby CDN
      c) cache static assets
      d) use service workers to reduce the amount of data users need to request from a server.
      This improves FID and LCP
    4. Optimize heavy elements. Firstly, shorten the loading and rendering of critical resources by implementing a lazy-loading strategy. It defers the loading of large elements, like images below the page viewport, until they’re required by the user. Don’t add lazy loading for elements in the initial viewport, because the LCP element should be loaded as fast as possible! Secondly, compress images to have fewer bytes to download. Images don’t always require high quality or resolution and can be downgraded intentionally without affecting the user. Lastly, provide dimensions as width and height attributes or aspect-ratio boxes. This ensures a browser can allocate the correct amount of space in the document while the image is loading, avoiding unexpected layout shifts (see the sketch after this list).
      This improves FID, LCP, and CLS.
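Here’s the small sketch referenced in the last point: a below-the-fold image added with native lazy loading and explicit dimensions. The URL, dimensions, and container selector are placeholders.

```ts
// A below-the-fold image: lazily loaded, with explicit dimensions so the
// browser reserves its space up front and avoids an unexpected layout shift.
const img = document.createElement('img');
img.src = '/images/product-gallery-2.jpg'; // placeholder URL
img.alt = 'Second product gallery photo';
img.loading = 'lazy';   // defer the download until the image nears the viewport
img.decoding = 'async'; // don't block rendering on image decode
img.width = 800;        // intrinsic dimensions let the browser allocate space
img.height = 600;

document.querySelector('#gallery')?.appendChild(img); // placeholder container
```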

    To sum everything up, Google introduced Core Web Vitals to help us to improve the UX of websites. In this article, I’ve shared clarifications of each core metric, the motives of score thresholds, tools to measure the UX scores of your web pages, and strategies for improvements. Loading, interactivity, and visual stability are the metrics highlighted today. Future research and analyses might reveal different Core Web Vitals to focus on. Be prepared!

    Meet the author of this article—Laura Silvanavičiūtė. Laura is a Web Developer who is making drop-shipping better for everyone together with the Oberlo app team. Laura loves Web technologies and is thrilled to share the most exciting parts with others via tech talks at conferences or articles on medium.com/@laurasilvanavi.



    Always on Calibration with a Quarterly Summary

Being always-on means we don’t wait until the end of a review period to reach alignment around expectations and performance. We’re always calibrating and aligning. This removes any surprise in how an individual on the team is doing and recognizes the specific impact they had on the business. It also lets us find and act on specific just-in-time growth opportunities for each individual.

    Today, continuous deployment is an accepted and common best practice when it comes to building and releasing software. But, where’s the continuous deployment for the individuals? Just like the software we write, we’re iterating on ourselves and deploying new versions all the time. We learn, grow, and try new ideas, concepts, and approaches in our work and interactions. But how are we supposed to iterate and grow when, generally, we calibrate so infrequently? Take a minute and think about your experiences:

    • What do calibrations look like for you? 
    • When was the last time you got feedback? 
    • How often do you have performance conversations? Do you have them at all? 
    • Do you always know how your work connects to your job expectations? Do you know how your work contributes to the organizational goals?
    • Are you ever surprised by your performance evaluation? 
    • Were you able to remember, in detail, all of your contributions for the performance review window? 

    If you receive feedback, have performance reviews, and answer "no" or "I don’t know" to any of the above questions, then unfortunately, you’re not alone. I’ve answered "no" and "I don’t know" to each of these questions at one point in time or another in my career. I went years without any sort of feedback or review. I have also worked on teams where performance reviews were yearly, and only at the end of the year did we try and collect a list of the feedback and contributions made over that year.

I have found great value in performance reviews with my team, if and only if they are continuous and always-on. Anything short of always-on leads to surprises, missed opportunities for growth and for corrective feedback to be effective, and an incomplete view of individuals’ contributions. This culminates in missed opportunities for the individual when it comes to their performance, promotion, and compensation reviews and conversations.

    Ye Old Performance Review

    Ye Old performance reviews are plagued by the same problems which surrounded software development before continuous deployments. Once every X number of months, we’d get all the latest features and build release candidates, and try to ship our code to production. These infrequent releases never worked out and were always plagued with issues. Knowing that this didn’t work so well for software deployments, why are so many still applying the same thinking to the individuals in their company and on their team?

    The Goal of Performance Reviews 

    Performance reviews intend to calibrate (meaning to capture and align) with an individual on the impact they had on the business over a given period, reward them for that impact, provide feedback, and stretch them into new opportunities (when they’re ready).

    Some Problems with Ye Old Performance Reviews 

    I don’t know about you, but I can’t recall what I had for dinner a week ago, let alone what I worked on months ago. Sure, I can give you the high-level details just like I can for that dinner I had last week. I probably had protein and some vegetables. Which one? Who knows. The same applies with impact conversations. How are we supposed to have a meaningful impact conversation if we can’t recall what we did and the impact it had? How are we supposed to provide feedback, so you can do better next time on that meal, if we don’t even recall what we ate? The specifics around those contributions are important, and those specifics are available now. 

Always-on calibration reduces surprises and disconnects. It helps ensure specificity when reviewing contributions and helps individuals grow into better versions of themselves. What if, after that meal I made last week, I got feedback right away from my partner? That meal was good, but it could have used a little more spice. Great. Now the next time I make that meal, I can add some extra spice. If she didn’t tell me for a year, then every time I made it this year, it would be lacking that spice. I would have missed the opportunity to refine my palate and grow. Let’s look more closely at some of the problems which arise with these infrequent calibrations.

    Contributions

    When reviews are infrequent or non-existent, we’re unable to know the full scope and impact of the work performed by the individual. Their work isn’t fully realized, captured, and rewarded. If we do try to capture these contributions at the end of a review window, we run into a few problems:

    • Our memory fades and we forget our contributions and the impact we had. 
    • Our contributions often end up taking the form of general statements that lack specificity.
    • We suffer from a recency bias that colours how we see the contributions over the entire review period. This results in more weight being given to recent contributions and less to those which happened closer to the start of the review period. 
    • Managers and individuals can have a different or an incomplete view of the contributions made.

    If we are unable to see the full scope of work contributed by the individual, we’re missing opportunities to reward them for this effort and grow them towards their long-term goals.

    Growth

    When reviews are infrequent, we’re missing out on the opportunity to grow the individuals on our team. With reviews come calibration, feedback, and alignment. This leads to areas for growth in terms of new opportunities and feedback as to where they can improve. If we try to capture these things at the end of review windows, we have a few problems:

    • We’re unable to grow in our careers as quickly because 
      • We can’t quickly try out new things and receive early and frequent feedback.
      • We’re infrequently looking for new opportunities that benefit our career growth. 
    • We won’t be able to provide early feedback or support when something isn’t working. 
    • There’s a lack of specificity on how team members can achieve the next level in their careers.
    • Individuals can overstay on a team when there’s a clear growth opportunity for them elsewhere in the organization.

    Frequent calibrations allow individuals to grow faster as they can find opportunities when they are ready, iterate on their current skills, and pivot towards a more successful development and contribution path. 

    The Quarterly Calibration Document

    Every quarter, each member of my team makes a copy of this quarterly objectives template. At present, this template consists of six sections:

    1. Intended Outcomes: what do they intend to accomplish going into this quarter?
    2. Top Accomplishments: what are their most impactful accomplishments this quarter?
    3. Other Accomplishments: what other impactful work did they deliver this quarter?
    4. Opportunities for the next three to six months: what opportunities have we identified for them that aren’t yet available to work on but will be available in the upcoming quarters?
    5. Feedback: what feedback did we receive from coworkers?
    6. Quarterly Review: a table that connects the individual’s specific impact for a given quarter to the organization’s expectations for their role and level.

As you can probably guess by looking at these sections, this is a living document that gets updated over a given quarter. Each individual creates a new one at the start of the quarter and outlines their intended outcomes for the quarter. After this is done, we discuss and align on those intended outcomes during our first one-on-one of the quarter. After we’ve aligned, the individual updates this document every week before our one-on-one to ensure it contains their accomplishments and any feedback they’ve received from their coworkers. With this document and the organization’s role expectations, we can calibrate weekly and state where they’re meeting, exceeding, or missing our expectations of them in their role. We can also call out areas for development and look for opportunities in the current and upcoming work for them to develop. There’s never any surprise as to how they’re performing and what opportunities are coming up for their growth.

    A Few Key Points

    There are a few details I’ve learned to pay close attention to with these calibrations. 

    Review Weekly During Your One-on-one

    I’m just going to assume you are doing weekly one-on-ones. These are a great opportunity for coaching and mentorship. For always-on calibrations to be successful, you need to include and dedicate time as part of these weekly meetings. Calibrating weekly lets you:

    • Provide feedback on their work and its impact
    • Recognize their contributions regularly
    • Show them the growth opportunity available to them that connects to their development plan
    • Identify deviations from the agreed upon intended outcomes and take early action to ensure they have a successful quarter

    Who Drives and Owns the Quarterly Calibration Document?

    These calibration documents are driven by the individual as they’re mentored and coached by their manager. Putting the ownership of this document on the individual means they see how their objectives align with the expectations your organization has for someone in this role and level. They know what work they completed and the impact around that work. They also have a development plan and know what they’re working towards. Now that’s not to say their manager doesn’t have a place in this document. We’re there to mentor and coach them. If the intended outcomes aren’t appropriate for their level or are unrealistic for the provided period, we provide them with this feedback and help them craft appropriate intended outcomes and objectives. We also know about upcoming work, and how it might interest or grow them towards their long-term goals. 

    Always Incorporate Team Feedback

    Waiting to collect and share feedback on an infrequent basis results in vague, non-actionable, and non-specific feedback that hinders the growth of the team. Infrequently collected feedback often takes the form of "She did great on this project," "They're a pleasure to work with." This isn’t helpful to anyone's growth. Feedback needs to be candid, specific, timely, and focused on the behaviour. Most importantly, it needs to come from a place of caring. By discussing feedback in the quarterly document and during weekly one-on-ones, you can:

    • Collect highly specific and timely feedback.
    • Identify timely growth opportunities.
    • Provide a reminder to each individual on the importance of feedback to everyone’s growth. 

    During our weekly one-on-ones, we discuss any feedback that we’ve received during the previous week from their coworkers. I also solicit feedback for teammates at this time that they may or may not have shared with their coworkers. We take time to break down this feedback together and discuss the specifics. If the feedback is positive, we’ll make sure to note it as it’s useful in future promotion and compensation conversations. If the feedback is constructive, we discuss the specifics and highlight future opportunities where we can apply what was learned. Where appropriate, we also incorporate new intended outcomes into our quarterly calibration document. 

What Managers Can Do When Things Aren’t on Track

This topic deserves its own post, but the short of it is: hard conversations don’t get easier with time, they get more difficult. Let’s say we don’t appear to be on track for reaching our intended outcomes and the reason is performance-related. This is where we need to act immediately. Don’t wait. Do it now. The longer you wait, the harder it is to have these difficult conversations. For me, this is akin to howlers in Harry Potter. For those who don’t know Harry Potter, these are letters from parents to seemingly misbehaving students that yell at the student for whatever misbehaving occurred. If you don’t open these letters right away and get the yelling over with, they get worse and worse. They smoke in the corner, and eventually the yelling from the letter is far worse when you finally do open it. This is what I think of whenever I have difficult feedback to provide. I know it’s going to be difficult, but all parties benefit from providing it early and giving the recipient a chance to course correct. The good news is that you’re having weekly calibration sessions and not yearly ones, so the individual has plenty of time to correct any performance issues before they become a serious problem—but only if the manager jumps in.

    What Happens When We Aren’t Hitting Our Intended Outcomes and Objectives 

    First, it’s important to be clear as to which intended outcomes aren’t being hit. Are they specific to personal development or their current role and level?

    Personal Development Intended Outcomes

In addition to achieving the expectations of their role and level, they’re hopefully working towards their long-term aims (see The AWARE Development Plan for more on this topic). Working with your development plan, you’ve worked back from your long-term aims to set a series of short-term intended outcomes that move you towards these goals. Missing these intended outcomes results in a delay in reaching your long-term aim, but it doesn’t affect your performance in your current role and level. When personal development aims are missed, we should reaffirm the value of these intended outcomes and, if they’re still appropriate, prioritize them in the future. We need to discuss and acknowledge the delay that missing these outcomes has on their long-term aims, so all parties remain aligned on development goals.

    Role or Level Intended Outcomes

    At the start of a review period, we’ve agreed to a set of intended outcomes for an individual based on the role and level. Once we learn that the intended outcomes for the role or level aren’t going to be achieved, we need to understand why. If the reason is that priorities have changed, we need to refocus our efforts elsewhere. We acknowledge the work and impact they’ve had to date and set new intended outcomes that are appropriate for the time that remains in the review period. If, on the other hand, they’re unable to meet the intended outcomes because they aren’t up to the task, we may need to set up a performance improvement plan to help them gain the skills needed to execute at the level expected of them.

    The Quarterly Review

We roll up and calibrate our impact every quarter. This period can be adjusted to work for your organization, but I’d recommend something less than a year and more than a month. Waiting a year is too long and leaves too much data for you to work with when creating your evaluation. Breaking down these yearly reviews into smaller windows has a few advantages. It:

    • Allows you to highlight impactful windows of contribution (you may have a great quarter and just an ok year, breaking it down by quarter gives you the chance to celebrate that great period).
    • Allows you to snapshot smaller windows (this is your highlight reel for a given period and when looking back over the year you can look at these snapshots).
    • Allows you to assign a performance rating for this period and course correct early.

    If we want to be sure we have a true representation of each individual's contributions to an organization, ensure those contributions are meeting your organizations' expectations for their role and level, and provide the right opportunities for growth for each individual, then we need to be constantly tracking and discussing the specifics of their work, its impact, and where they can grow. It’s not enough to look back at the end of the year and collect feedback and a list of accomplishments. Your list will, at best, be incomplete, but more importantly you’ll have missed out on magnifying the growth of each individual. Worse yet, that yearly missed growth opportunity compounds, so the impact you are missing out on by not continuously calibrating with your team is huge.

    David Ward is a Development Manager who has been with Shopify since 2018 and is working to develop: Payment Flexibility, Draft Orders, and each member of the team. Connect with David on Twitter, GitHub and LinkedIn.



    The AWARE Development Plan

A development plan helps everyone aim for and achieve their long-term goals in an intentional manner. Wherever you are in your career, there’s room for growth and a goal for you to work towards. Maybe your goal is a promotion? Maybe your goal is to acquire a given proficiency in a language or framework? Maybe your goal is to start your own company? Whatever it is, the development plan is your roadmap to that goal. Once you have that long-term goal, create a development plan to achieve it. I’m assuming there won’t be one single thing you can do to make it happen; if there were, you wouldn’t need a plan. When picking a long-term goal, make sure you pick one that’s far enough out to be ambitious but not so far out that you can’t define a sequence of steps to reach it. If you can’t define the steps to reach the goal, then talk with a mentor or set a goal along a path whose steps you can define. After you have that goal, you’ll iterate towards it with the AWARE Development Plan.

    Today, my job is that of an Engineering Manager at Shopify. Outside of this position I've always been a planner: an individual iterating towards one long-term goal or another. I didn’t refine or articulate my process for how I iterate towards my long-term goals until I saw members of my team struggling to define their own paths. In an effort to help and accelerate their growth, I started breaking down and articulating how I’ve iterated towards the goals in my life. The result of this work is this AWARE Development plan.

Having a goal means we have an aim or purpose. But a goal on its own isn’t enough:

    “A goal without a plan is just a wish.”
    - Antoine de Saint-Exupéry

Once you have a goal, you need to create a plan to understand how to reach it. Setting off towards your goal without a plan is akin to setting off on a road trip without a GPS or a clear sense of how to get to your destination. You don’t know what route you’ll take, nor when you’ll arrive. You may get to your destination, but it’s much more likely that you’ll get lost along the way. En route to your destination, you’ll undoubtedly take note of reference points along the way. These reference points are akin to our short- and medium-term goals. Our plan has these reference points built in, so we can constantly check in to make sure we’re still on course. If we’re off course, then we want to know sooner rather than later, so we can adjust, ask for directions, and update the plan. The short- and medium-term goals created in your AWARE plan provide regular checkpoints along the way to your long-term goals. If we’re unable to achieve short- and medium-term goals, we have the opportunity to lean in and course correct. Failing to course correct in these moments will result in, at the very least, substantial detours on the way to our goals.

     

    Iterating Towards My Long-Term Goals

    When I grow up I want to be a… Well, what did you want to be? I wanted to be a fighter jet pilot, a police officer, a lawyer, and then, when things got serious, a developer. For me, long-term goal setting started with a career goal when I was younger. I set a long-term goal, figured out what goals needed to be achieved along the way, achieved those goals, reflected on them, and then looked for the next goal.

    I was 12 years old when I got serious about wanting a career with computers. I had spent a bunch of time on my dad’s computer and I was hooked. But the question was: how do I make this happen? To figure this out, I started with the long-term goal of becoming a computer programmer and worked my way back. I took what I knew about my goals and figured out what I needed to do along the way to reach them. I made a plan.

    A summary of my 12-year-old self’s plan:

    An outline for working towards the long-term goal of a career in Computer Science.

    I looked ahead and set a long-term goal: I wanted a career as a computer programmer and to attend the University of Waterloo. I worked back from that goal to set various shorter-term goals. I needed to get good grades, get my own computer, and earn some money to pay for all of this. I had a specific long-term goal and a set of medium- and short-term goals, and those medium- and short-term goals moved me closer to the long-term goal one step at a time. Some goals, such as the purchase of a computer, weren't on this plan immediately; they were added once their value to the long-term goal became apparent.

    Helping Others Reach their Goals

    I've always been interested in setting and reaching long-term goals. I've set and reached long-term goals in education, in my software engineering career, and in my dance career. Working as both an Engineering Manager and a dance instructor, I've had the opportunity to help individuals set and achieve their goals. What I've learned is that the process for achieving those goals is the same irrespective of the area of interest: the AWARE process works just as well for dance as it does for software development. I'm sharing this with you because you're the one in control of reaching your long-term goals, and yes, you can reach them.

    As a manager and instructor in both dance and software development, I've heard questions like “What do I need to do to get to the next level?” or “What do I need to do to get better at this?” more times than I can count. In either case, my answer is the same.

    • Start by building your development plan and work backwards from that goal to figure out what must be true for you to reach that goal. 
    • Sit down and critically think about the goal as well as the goals you must achieve along the way. 
    • Sit down and critically think about the result of your short-term goals after you achieve them.

    In my experience, these are the concepts that other approaches to reaching long-term goals often leave less clear-cut. The AWARE Development Plan:

    • Puts the individual in the driver's seat and makes them accountable for their long-term goals, which is key to them taking responsibility and achieving those goals.
    • Has the individual develop an understanding of what it means to achieve their goal, which helps them understand where they need to grow and what opportunities they need to seek out.
    • Creates a clear set of areas of development for which both they and I continuously look for opportunities.
    • Creates a clear answer to the “What's next?” question: we know what's next because we can consult the Development Plan.
    • Gives the individual (to an extent) control over the speed at which they achieve their goals.

    If you're unsure of what's next, or unsure how you can reach your long-term goal, I'd encourage you to give this a shot. Sit down and write a Development Plan for yourself. Once you're done, share it with your lead or manager and ask for their feedback and opinion. They're your partner: they can work with you to find opportunities and provide feedback as you work towards your goal.

    AWARE: Aim, Work Back, Achieve, Reflect, Encore

    AWARE is a way for each of us to iterate towards our long-term goals. Its strength comes from breaking long-term goals down into actionable steps and from reflecting on completed steps to build confidence that you're on track. How does it work? Start at the end and imagine you've achieved your goal. This is your long-term Aim. Now work your way back and define the steps you need to take along the way to that outcome. You're unlikely to be able to achieve all of these Aims right away, and they're often too big to achieve all at once. When you find these, break them down into smaller Aims. From here, start Achieving your Aims. As you complete your Aims, take time to Reflect: What did you learn? What new goals did you uncover? Was that Aim valuable? Then take your next Aim and do it all over again. Let's dig into each step a little more.

    Overview of each step: Aim, Work Back, Achieve, Reflect, Encore.

     

    Aim

    At this stage, you'll write down your aim (that is, your goal). When writing your aim, make sure:

    • You time box this aim appropriately.
    • That there’s a clear definition of success.
    • You have a clear set of objectives.
    • You have a clear way to measure this aim.  

    Work Back

    Can the current aim be broken down into smaller aims or can you immediately execute towards it? If you can’t immediately execute towards it then identify smaller aims that will lead to it. A goal is seldom achieved through the accomplishment of a single aim. Write down the aim and then work back from there to identify other aims you can perform as you iterate and move towards this larger aim. Each of these aims should:

    • have clear objectives
    • be measurable
    • be time-boxed
    • be reflected upon once it’s completed.

    Achieve

    Now that you have a time-boxed aim with a clear objective and a way to measure success, execute and accomplish it.

    Reflect

    After you have completed your aim, it’s important to reflect upon the aim and its outcomes:

    • Did this provide the value we expected it to? 
    • What did we learn? 
    • What new aims are we now aware of that we should add to our Development Plan?
    • What would you do differently if you were to do it again?

    Encore

    Now that you have completed an aim, repeat the process. Aim again and continue to work towards your long-term goal.

    Aim and Work Back Example

    I recently had a team member who expressed that their goal was to run their own software company one day. Awesome! They have a long-term goal. One specific worry they had was onboarding new developers to this company; specifically, they wanted to know how to make those developers impactful in a timely manner. They identified that having impactful developers early on was important for their startup and could be the difference between success and failure for their company. What a great insight! So, working back from their long-term goal, we identified “being able to onboard new developers in an organization promptly” as an aim in the plan. In breaking down this aim, we found two aims they could execute over the next eight months to grow this skill set:

    1. One area we identified right away was mentoring an intern and new hires, so we added this to their plan. This aim was easy to set up: they'll onboard an intern in one of the upcoming terms and be responsible for onboarding them to the team and mentoring them while they're here.
    2. This team member presented the idea of taking four months away from our team to join an onboarding group that was working to decrease the onboarding time for new developers joining Shopify. This opportunity was perfect: they could practice onboarding developers to our company and see firsthand what works and what doesn't, all while helping develop the process here.

    With this individual, we have:

    • A long-term goal: start your own software company.
    • A medium-term aim working towards that long-term goal: learn how to accelerate the onboarding of new hires.
    • Two short-term aims that work towards the medium-term aim:
      • Buddy an intern or new hire to the team. 
      • Mentor for four months as part of the company’s onboarding program.

    By setting the long-term goal and working back, we were able to work together to find opportunities in the short term that would grow this individual towards their long-term goal. Once they complete any one of these aims, they'll reflect on how it went and what was learned, and then decide whether there's anything to add to the Development Plan as they continue on the path towards the long-term goal. In this example, it looks like we have all of the shorter-term Aims required to reach the longer-term goal fleshed out from the start. That won't always be the case, and it may not even be true here. That's OK. Take it step by step, execute on the things you know, and append to the plan when you reflect.

    Bonus Tip: Sharing

    Once you have an aim, share it! Share it with your lead or manager and, if you're comfortable with the idea, your colleagues. Sharing your goals with others helps ensure the right growth opportunities are available when you're ready for them. When you share your plan:

    1. New opportunities will present themselves because others are aware of your goals and can therefore share relevant opportunities with you.
    2. You will see opportunities for growth that may not have initially been obvious to you.

    Sharing your aim will let you progress towards your goals faster than if you were working on this alone.

    Tips for Using the AWARE Development Plan

    I provide this development plan template to team members when they join my team. I also ask them to write down their long-term goals and to time box each long-term goal to at most five years. That's usually far enough out that there's some uncertainty, but not so far that it prevents developing a plan. We then work back from that long-term goal and identify shorter-term aims (usually over the next one to two years) that move them towards it. The work of identifying these aims can be done together, but I also encourage them to drive the plan as much as possible: they have to do some thinking and bring ideas to the table as well.

    Aim

    To help with their aim I ask questions like: 

    • What does success look like for you in these time frames? 
    • What do we need to do to achieve this aim?
    • How will we know if we’ve achieved this aim? 
    • What opportunities are available today to start progressing towards this aim?
    • How does some of the current work you’re doing tie into this aim?

    Work Back

    Once we have an aim, we need to ask: is this something I can achieve now? If yes, you can move on to Achieve. If it's a “no, not at this time,” it should be a situation where you and your lead/manager agree that you're ready for this aim, but you're unable to locate an opportunity to achieve it. Finally, if it's a “no, I'm unable to achieve this aim at this precise moment,” you'll need to work back from it to find other aims that are along the right path. For instance, the aim may require a certain degree or set of experiences that I'm missing. If that's the situation, simply work back until you find the aims that are achievable. Document all discovered aims in your development plan so you and your lead can pursue them in due time.

    Achieve 

    You have an aim that you can execute now. It has a clear set of objectives, and its success is measurable. Do it!

    Reflect

    You completed your aim. Amazing! Once you've completed an aim, it's important to take some time to reflect on it. This is a step most folks want to skip because they don't see the value, but it's the most important one. Don't skip it! Ask yourself:

    • What did we learn? 
    • Did we get what we thought we would out of it?
    • What would we do differently next time?
    • What new aims should we add to our plan based on what we know now?

    Maybe there's a follow-up aim that you didn't know about before you started this aim. Maybe you learned something and you want or need to pivot your aim or long-term goal. This is your chance to get feedback and course correct should you need or want to detour. Going back to our driving example from earlier: when we're driving towards a destination and aren't paying attention to our surroundings, or checking on our route, we're far more likely to get lost. So take a minute when you complete an aim and check in on your surroundings. Am I still on course? Did I take a wrong turn? Is there another route available to me now?

    Encore: Maintenance of the Development Plan

    We have to maintain the Development Plan after we create it. Aims change and opportunities arise; when they do, we need to adjust. My team and I check in on our plans during our weekly one-on-ones. At these meetings, we ask:

    • If an aim has been completed, what’s the next aim that you will work on?
    • What opportunities have come up, and how do they tie into your plan? 
    • Is there something in another task that came up that contributes to an aim? 
    • Are we still on track to reach our aims in the timeframe we set?
    • Did we complete an aim? Let’s make sure we reflect.

    If the answer to any of these is yes, that's great! Call it out! Be explicit about it. Learning opportunities are magnified when you're intentional about them.

    The Path to Success Doesn’t Have to Be Complicated

    The path to success is often said to look something like this:

    An arrow whose shaft is tied in many loops and knots.

    But it doesn't have to be quite so complex a ride. Get a mentor, get some directions, and form a plan. Make sure this plan iterates towards your goal and has signposts along the way so you can check in and know you're on track. That way, if you miss a turn, it's no big deal. It'll happen. The difference is that you'll know sooner rather than later that you're off track, and so you'll be able to course-correct earlier. At the end of the day, your path to success may look like this:

    An easier path to success: the arrow shaft has fewer knots than in the previous image.

    If you haven't already done so, sit down with your lead or manager, write a plan, and figure out what you need to reach your long-term goal. Then start iterating towards it.

    David Ward is a Development Manager who has been with Shopify since 2018 and is working to develop: Payment Flexibility, Draft Orders, and each member of the team. Connect with David on Twitter, GitHub and LinkedIn.




    Shopify's Path to a Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling


    Data scientists at Shopify expect fast results when querying large datasets across multiple data sources. We use Trino (a distributed SQL query engine) to provide quick access to our data lake and recently, we’ve invested in speeding up our query execution time.

    On top of handling over 500 Gbps of data, we strive to deliver p95 query results in five seconds or less. To achieve this, we’re constantly tuning our infrastructure. But with each change comes a risk to our system. A disruptive change could stall the work of our data scientists and interrupt our engineers on call.

    That’s why Shopify’s Data Reliability team built custom verification, benchmarking, and profiling tooling for testing and analyzing Trino. Our tooling is designed to minimize the risk of various changes at scale. 

    Below we’ll walk you through how we developed our tooling. We’ll share simple concepts to use in your own Trino deployment or any other complex system involving frequent iterations.

    The Problem

    A diagram showing the Trino upgrade tasks over time: Merge update to trino, Deploy candidate cluster, Run through Trino upgrade checklist, and Promote candidate to Prod. The steps include two places to Roll back.
    Trino Upgrade Tasks Over Time

    As Shopify grows, so does our data and our team of data scientists. To handle the increasing volume, we’ve scaled our Trino cluster to hundreds of nodes and tens of thousands of virtual CPUs.

    Managing our cluster gives way to two main concerns:

    1. Optimizations: We typically have several experiments on the go for optimizing some aspect of our configuration and infrastructure.
    2. Software updates: We must keep up-to-date with new Trino features and security patches.

    Both of these concerns involve changes that need constant vetting, but the changes are further complicated by the fact that we run a fork of Trino. Our fork allows us more control over feature development and release schedules. The tradeoff is that we’re sometimes the first large organization to test new code “in the field,” and if we’re contributing back upstream to Trino, the new code must be even more thoroughly tested.

    Changes also have a high cost of failure because our data scientists are interrupted and our engineers must manually roll back Trino to a working version. Due to increasing uncertainty on how changes might negatively affect our cluster, we decided to hit the brakes on our unstructured vetting process and go back to the drawing board. We needed a tool that could give us confidence in potential changes and increase the reliability of our system as a whole.

    Identifying the Missing Tool

    To help identify our missing tool, we looked at our Trino deployment as a Formula 1 race car. We need to complete each lap (or query) as fast as possible and shorten the timeline between research and production of our engine, all the while considering safety.

    The highest-ranked Formula 1 teams have their own custom simulators. They put cars in real-life situations and use data to answer critical questions like, “How will the new steering system handle on the Grand Prix asphalt?” Teams also use simulations to prevent accidents from happening in real life.

    Taking inspiration from this practice, we iterated on a few simulation prototypes:

    1. Replaying past queries. First, we built a tool to pluck a previously run query from our logs and “replay” it in an isolated environment. 
    2. Replicating real life. Next, we wrote a tool that replicated traffic from a previous work day. Think of it like travelling back in time in our infrastructure. 
    3. Standardizing our simulations. We also explored an official benchmarking framework to create controlled simulations. 

    Another thing we were missing in our “garage” was a good set of single-purpose gauges for evaluating a possible change to Trino. We had some manual checks in our heads, but they were undocumented, and some checks caused significant toil to complete. These could surely be formalized and automated.

    A Trino upgrade checklist of reminders to our engineers. These reminders include verifying user-defined functions, connectivity, performance heuristics, resource usage, and security
    Our undocumented Trino upgrade checklist (intentionally vague)

    We were now left with a mixed bag of ideas and prototypes that lacked structure.

    The Solution 

    We decided to address all our testing concerns within a single framework to build the structure that was lacking in our solution. Initially, we had three use cases for this framework: verification, benchmarking, and profiling of Trino.

    The framework materialized as a lightweight Python library. It eliminates toil by extracting undocumented tribal knowledge into code, with familiar, intuitive interfaces. An interface may differ depending on the use case (verification, benchmarking, or profiling), but all rely on the same core code library.

    The core library is a set of classes for Trino query orchestration. It’s essentially an API that has a shared purpose for all of our testing needs. The higher level Library class handles connections, query states, and multithreaded execution or cancellation of queries. The Query class handles more low level concerns, such as query annotations, safety checks, and fetching individual results. Our library makes use of the open source repository trino-python-client which implements the Python Database API Specification for Trino. 
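    To make the division of labour concrete, here's a minimal sketch of that kind of query orchestration layer. The class and parameter names are illustrative rather than Shopify's actual library; only the trino.dbapi calls are real trino-python-client API.

    from concurrent.futures import ThreadPoolExecutor

    import trino


    class Query:
        """Low-level concerns: one (optionally annotated) query and its results."""

        def __init__(self, sql, annotation=""):
            self.sql = f"-- {annotation}\n{sql}" if annotation else sql

        def run(self, conn):
            cur = conn.cursor()
            cur.execute(self.sql)
            return cur.fetchall()


    class Library:
        """Higher-level concerns: connections and multithreaded execution."""

        def __init__(self, host, user, catalog="hive", schema="default", port=443):
            self._conn_args = dict(host=host, port=port, user=user,
                                   catalog=catalog, schema=schema, http_scheme="https")

        def run_queries(self, queries, max_workers=8):
            def _run(query):
                conn = trino.dbapi.connect(**self._conn_args)
                try:
                    return query.run(conn)
                finally:
                    conn.close()

            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                return list(pool.map(_run, queries))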

    Verification: Accelerating Deployment

    Verification consists of simple checks to ensure Trino still works as expected after a change. We use verification to accelerate the deployment cycle for a change to Trino.

    A diagram showing the new Trino upgrade flow: Merge update to Trino, Deploy candidate cluster, Run Tests, Promote candidate to Prod. There is only one place for Rollback
    New Trino Upgrade Tasks Over Time (with shadow of original tasks)

    We knew the future users of our framework (Data Platform engineers) have a development background and are very likely to know Python. Associating verification with unit testing, we decided to leverage an existing testing framework as our main developer interface. Conceptually, PyTest's features fit our verification needs quite well.

    We wrote a PyTest interface on top of our query orchestration library that abstracted away all the underlying Trino complications into a set of test fixtures. Now, all our verification concerns are structured into a suite of unit tests, in which the fixtures initialize each test in a repeatable manner and handle the cleanup after each test is done. We put a strong focus on testing standards, code readability, and ease of use. 

    Here’s an example block of code for testing our Trino cluster:
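    The snippet below is a hedged reconstruction rather than the exact code: the mark, the candidate_cluster fixture, and the run_query helper come from the description that follows, while the query (against Trino's built-in TPC-H tiny schema, assuming that catalog is configured) and its expected rows are illustrative.

    import pytest


    @pytest.mark.correctness
    def test_candidate_returns_exact_rows(candidate_cluster):
        # The fixture creates the connection, runs the query, fetches the rows,
        # and closes the connection for us.
        rows = candidate_cluster.run_query(
            "SELECT nationkey, name FROM tpch.tiny.nation WHERE name = 'CANADA'"
        )
        assert rows == [[3, "CANADA"]]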

    First, the test is marked. We've established a series of marks, so a user can run all “correctness” tests at once, all “performance” tests at once, or every test except the “production_only” ones. In our case, “correctness” means we're expecting an exact set of rows to be returned given our query. “Correctness” and “verification” have interchangeable meanings here.

    Next, a fixture (in this case, candidate_cluster, which is a Trino cluster with our change applied) creates the connection, executes a query, fetches results, and closes the connection. Now, our developers can focus on the logic of the actual test. 

    Lastly, we call run_query and run a simple assertion. With this familiar pattern, we can already check off a handful of our items on our undocumented Trino upgrade checklist.

    A diagram showing a query being run on a single candidate Trino cluster
    Running a query on a candidate cluster

    Now, we increase the complexity:
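    Again, this is a sketch under stated assumptions: multi_cluster is the fixture named below, while the timed helper and the 10% tolerance are illustrative.

    import pytest


    @pytest.mark.performance
    def test_candidate_not_slower_than_standby(multi_cluster):
        # multi_cluster runs the same query on the candidate and standby clusters
        candidate_s, standby_s = multi_cluster.run_query_timed(
            "SELECT count(*) FROM tpcds.sf100.store_sales"
        )
        # Allow some jitter: fail only if the candidate is >10% slower than the control
        assert candidate_s <= standby_s * 1.10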

    First, notice @pytest.mark.performance. Although performance testing is a broad concept, by asserting on a simple threshold comparison we can verify performance of a given factor (for example, execution time) isn’t negatively impacted.

    In this test, we call multi_cluster, which runs the same query on two separate Trino clusters. We look for any differences between the “candidate” cluster we’re evaluating and a “standby” control cluster at the same time.

    A diagram showing the same query being run on two separate Trino clusters
    Running the same query on two different clusters

    We used the multi_cluster pattern during a Trino upgrade when verifying our internal Trino User Defined Functions (which are utilized in domains such as finance and privacy). We also used this pattern when assessing a candidate cluster’s proposed storage caching layer. Although our suite didn’t yet have the ability to assert on performance automatically, our engineer evaluated the caching layer with some simple heuristics and intuition after kicking off some tests.

    We plan to use containerization and scheduling to automate our use cases further. In this scenario, we’d run verification tests at regular intervals and make decisions based on the results.

    So far, this tool covers the “gauges” in our race car analogy. We can take a pit stop, check all the readings, and analyze all the changes.

    Benchmarking: Simplifying Infrastructure

    Benchmarking is used to evaluate the performance of a change to Trino under a standardized set of conditions. Our testing solution has a lightweight benchmarking suite, so we can avoid setting up a separate system.

    Formula 1 cars need to be aerodynamic, and they must direct air to the back engine for cooling. Historically, Formula 1 cars are benchmarked in a wind tunnel, and every design is tested in the same wind tunnel with all components closely monitored. 

    We took some inspiration from the practice of Formula 1 benchmarking. Our core library runs TPC-DS queries on sample datasets that are highly relevant to the nature of our business. Fortunately, Trino generates these datasets deterministically and makes them easily accessible.

    The benchmarking queries are parametrized to run on multiple scale factors (sf). We repeat the same test on increasing magnitudes of dataset size with corresponding amounts of load on our system. For example, sf10 represents a 10 GB database, while sf1000 represents 1 TB.

    PyTest is just one interface that can be swapped out for another. But in the benchmarking case, it continues to work well:
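    A plausible shape for such a parametrized benchmark test (the query and fixture usage are assumptions; the scale factors are the ones described above):

    import pytest


    @pytest.mark.performance
    @pytest.mark.parametrize("sf", ["sf10", "sf100", "sf1000"])
    def test_store_sales_aggregation_scales(candidate_cluster, sf):
        # Same query, increasing dataset size: sf10 is roughly 10 GB, sf1000 roughly 1 TB
        rows = candidate_cluster.run_query(
            f"SELECT ss_store_sk, sum(ss_net_paid) "
            f"FROM tpcds.{sf}.store_sales GROUP BY ss_store_sk"
        )
        assert rows  # timing thresholds or deeper result checks would go here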

    A diagram showing multiple queries executing on a candidate Trino cluster
    Running the same set of queries on multiple scale factors

    This style of benchmarking is an improvement over our team members’ improvised methodologies. Some used Spark or Jupyter Notebooks, while others manually executed queries with SQL consoles. This led to inconsistencies, which was against the point of benchmarking.

    We’re not looking to build a Formula 1 wind tunnel. Although more advanced benchmarking frameworks do exist, their architectures are more time-consuming to set up. At the moment, we’re using benchmarking for a limited set of simple performance checks.

    Profiling and Simulations: Stability at Scale

    Profiling refers to the instrumentation of specific scenarios for Trino, in order to optimize how the situations are handled. Our library enables profiling at scale, which can be utilized to make a large system more stable.

    In order to optimize our Trino configuration even further, we need to profile highly specific behaviours and functions. Luckily, our core library enables us to write some powerful instrumentation code.

    Notably, it executes queries and ignores individual results (we refer to these as background queries). When kicking off hundreds of parallel queries, which could return millions of rows of data, we’d quickly run out of memory on our laptops (or external Trino clients). With background queries, we put our cluster into overdrive with ease and profile on a much larger scale than before.

    We formalized all our prototyped simulation code and brought it into our library, illustrated with the following samples:
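    A hedged illustration of those two entry points; the import path, argument names, and profile values are assumptions about the interface, not its actual signature.

    from datetime import datetime, timedelta

    from trino_testing.simulation import generate_traffic, replay_queries  # hypothetical module path

    # Ramp from heavy to light concurrency over an hour, ignoring individual results
    # (background queries), to mimic an end-of-day traffic pattern.
    generate_traffic(profile="high_to_low", duration=timedelta(hours=1))

    # Replay a previous workday's queries in real time against the candidate cluster.
    replay_queries(
        start=datetime(2021, 11, 1, 9, 0),
        end=datetime(2021, 11, 1, 17, 0),
        target_cluster="candidate",
    )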

    generate_traffic is called with a custom profile to target a specific behaviour. replay_queries plays back queries in real time to see how a modified cluster handles them. These methodologies cover edge cases that a standard benchmark test will miss.

    A diagram showing simulated "high to low traffic" being generated and sent to the candidate Trino cluster
    Generating traffic on a cluster

    This sort of profiling was used when evaluating an auto-scaling configuration for cloud resources during peak and off-hours. Although our data scientists live around the world, most of our queries happen between 9-5 PM EST, so we’re overprovisioned outside of these hours. One of our engineers experimented with Kubernetes’ horizontal pod autoscaling, kicking off simulated traffic to see how our count of Trino workers adjusted to different load patterns (such as “high to low” and “low to high”). 

    The Results and What’s Next

    Building a faster and safer Trino is a platform effort supported by multiple teams. With this tooling, the Data Foundations team wrote an extensive series of correctness tests to prepare for the next Trino upgrade. The tests helped iron out issues and led to a successful upgrade! To bring back our analogy, we made an improvement to our race car, and when it left the garage, it didn't crash. Plus, it maintained its speed: our p95 query execution time has remained stable over the past few months (the upgrade occurred during this window).

    A diagram showing p95 query execution time holding steady over three months
    95th percentile of query execution time over the past 3 months (one minute rolling window)

    Key Lessons

    By using this tool, our team learned about the effectiveness of our experimental performance changes, such as storage caching or traffic-based autoscaling. We were able to make more informed decisions about what (or what not) to ship.

    Another thing we learned along the way is that performance testing is complicated. Here are a few things to consider when creating this type of tooling:

    1. A solid statistics foundation is crucial. This helps ensure everyone is on the same page when sharing numbers, interpreting reports, or calculating service level indicators. 
    2. Many nuances of an environment can unintentionally influence results. To avoid this, understand the system and usage patterns on a deep level, and identify all differences in environments (for example, “prod” vs. “dev”).
    3. Ensure you gather all the relevant data. Some of the data points we care about, such as resource usage, are late-arriving, which complicates things. Automation, containerization, and scheduling are useful for this sort of problem.

    In the end, we scoped out most of our performance testing and profiling goals from this project, and focused specifically on verification. By ensuring that our framework is extensible and that our library is modular, we left an opportunity for more advanced performance, benchmarking, and profiling interfaces to be built into our suite in the future.

    What’s Next

    We’re excited to use this tool in several gameday scenarios to prepare our Data Reliability team for an upcoming high-traffic weekend—Black Friday and Cyber Monday—where business-critical metrics are needed at a moment’s notice. This is as good a reason as ever for us to formalize some repeatable load tests and stress tests, similar to how we test the Shopify system as a whole.

    We’re currently evaluating how we can open-source this suite of tools and give back to the community. Stay tuned!

    Interested in tackling challenging problems that make a difference? Visit our Data Science & Engineering career page to browse our open positions.

    Noam is a Hacker in both name and spirit. For the past three years, he’s worked as a data developer focused on infrastructure and platform reliability projects. From Toronto, Canada, he enjoys biking and analog photography (sometimes at the same time).


    Debugging Systems in the Cloud: MySQL, Kubernetes, and Cgroups


    By Rodrigo Saito, Akshay Suryawanshi, and Jeremy Cole

    KateSQL, Shopify's custom-built Database-as-a-Service platform running on top of Google Kubernetes Engine (GKE), currently manages several hundred production MySQL instances across different Google Cloud regions and many GKE clusters.

    Earlier this year, we found a performance related issue with KateSQL: some Kubernetes Pods running MySQL would start up and shut down slower than other similar Pods with the same data set. This partially impaired our ability to replace MySQL instances quickly when executing maintenance tasks like config changes or upgrades. While investigating, we found several factors that could be contributing to this slowness.

    The root cause was a bug in the Linux kernel memory cgroup controller.  This post provides an overview of how we investigated the root cause and leveraged Shopify’s partnership with Google Cloud Platform to help us mitigate it.

    The Problem

    KateSQL has an operational procedure called instance replacement. It involves creating a new MySQL replica (a Kubernetes Pod running MySQL) and then stopping the old one, repeating this process until all the running MySQL instances in the KateSQL Platform are replaced. KateSQL’s instance replacement operations revealed inconsistent MySQL Pod creation times, ranging from 10 to 30 minutes. The Pod creation time includes the time needed to:

    • spin up a new GKE node (if needed)
    • create a new Persistent Disk with data from the most recent snapshot
    • create the MySQL Kubernetes Pod
    • initialize the mysql-init container (which completes InnoDB crash recovery)
    • start the mysqld in the mysql container.

    We started by measuring the time of the mysql-init container and the MySQL startup time, and then compared the times between multiple MySQL Pods. We noticed a huge difference between these two MySQL Pods that had the exact same resources (CPU, memory, and storage) and dataset:

    KateSQL instance    Initialization    Startup
    katesql-n4sx0       2120 seconds      1104 seconds
    katesql-jxijq       74 seconds        17 seconds

    Later, we discovered that the MySQL Pods with slow creation times also showed a gradual decrease in performance. Evidence of that was an increased number of slow queries for queries that utilized temporary memory tables:

    A line graph showing queries per second over time. A purple line shows a slower MySQL Pod with the queries taking long. A blue line shows a faster pod where queries are much shorter.
    Purple line shows an affected MySQL Pod while the Blue line shows a fast MySQL Pod

    Immediate Mitigation

    A quick spot-check analysis revealed that newly provisioned Kubernetes cluster nodes performed better than those that had been up and running for a few months. With this information in hand, our first mitigation strategy was to replace the older Kubernetes cluster nodes with new ones using the following steps (a kubectl sketch follows the list):

    1. Cordon (disallow any new Pods) the older Kubernetes cluster nodes.
    2. Replace instances using KateSQL to move MySQL Pods to new Kubernetes nodes, allowing GKE to autoscale the cluster by adding new cluster nodes as necessary.
    3. Once instances are moved to new Kubernetes nodes, drain the cordoned cluster nodes to scale down the cluster (automatically, through GKE autoscaler).
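    Steps 1 and 3 boil down to standard kubectl operations; the node name below is illustrative, and the KateSQL replacement in step 2 is internal tooling that isn't shown.

    # Step 1: stop scheduling new Pods onto an old node
    kubectl cordon gke-katesql-default-pool-1a2b3c4d-xyz1

    # Step 3: once KateSQL has moved the MySQL Pods, evict whatever remains so the
    # GKE autoscaler can scale the node away
    kubectl drain gke-katesql-default-pool-1a2b3c4d-xyz1 \
      --ignore-daemonsets --delete-emptydir-data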

    This strategy was applied to production KateSQL instances, and we observed performance improvements on the new MySQL Pods.

    Further Investigation

    We began a deeper analysis to understand why newer cluster nodes performed better than older cluster nodes. We ruled out differences in software versions like kubelet, the Linux kernel, Google’s Container-optimized OS (COS), etc. Everything was the same, except their uptimes.

    Next, we started a resource analysis of each resource subsystem to narrow down the problem. We ruled out the storage subsystem, as the MySQL error log provided a vital clue as to where the slowness was. So we examined timestamps from InnoDB's Buffer Pool initialization:

    We analyzed the MySQL InnoDB source code to understand the operations involved during InnoDB's Buffer Pool initialization. Most importantly, memory allocation during initialization is single-threaded, which we confirmed using top, where it showed approximately 100% CPU usage. We subsequently captured strace output of the mysqld process while it was starting up:
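    The capture was roughly of this form (the flags are standard strace options; the output file name is illustrative):

    # Follow forks, print time spent in each syscall plus a timestamp, and trace only mmap
    strace -f -T -tt -e trace=mmap -p "$(pgrep -x mysqld)" -o mysqld-startup.strace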

    We see that each mmap() system call took around 100 ms to allocate approximately 128MB sized chunks, which in our opinion is terribly slow for the memory allocation process.

    We also did an On-CPU perf capture during this initialization process, below is the snapshot of the flamegraph:

    A flamegraph showing the On-CPU perf capture during initialization
    Flamegraph of the perf output collected of a MySQL process container from Kubernetes Cluster node

    Quick analysis of the flamegraph shows how the MySQL (InnoDB) buffer pool initialization task is delegated to the memory allocator (jemalloc in our case), which then spends most of its time in a kernel function called mem_cgroup_commit_charge.

    We further analyzed what the mem_cgroup_commit_charge function does: it appears to be part of memcg (the memory control group subsystem) and is responsible for charging (claiming ownership of) pages from one cgroup (an unused/dead or root cgroup) to the cgroup of the allocating process. Unfortunately, memcg isn't well documented, so it's hard to understand what's causing the slowdown.

    Another unusual thing we spotted (using the slabtop command) was the abnormally high dentry cache, sometimes around 20GB for about 64 pods running on a given Kubernetes Cluster node:

    While investigating if a large dentry cache could be slowing the entire system down, we found this post by sysdig that provided useful information. After further analysis following the steps from the post, we confirmed that it wasn’t the same case as we were experiencing. However, we noticed immediate performance improvements (similar to a restarted Kubernetes cluster node) after dropping the dentry cache using the following command:

    # Free reclaimable slab objects such as dentries and inodes
    echo 2 > /proc/sys/vm/drop_caches

    Continuing the unusual slab allocation investigation, we ruled out any of its side effects, like memory fragmentation, since enough higher-order free pages were available (which we verified using the output of /proc/buddyinfo). Also, this memory is reclaimable during memory pressure events.

    A Breakthrough

    Going through various bug reports related to cgroups, we found a command to list the number of cgroups in a system:
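    One such place to look is /proc/cgroups (the same interface suggested for monitoring in the workarounds below):

    # Per-controller cgroup counts; the num_cgroups column of the memory row is what
    # we compared between a good node and an affected node
    cat /proc/cgroups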

    We compared the memory cgroup counts of a good node and an affected node, and concluded that approximately 50K memory cgroups is far more than expected, even accounting for some of our short-lived tasks. This indicated to us that there could be a cgroup leak. It also helped make sense of the perf output we had flamegraphed previously: there could be a performance impact if the cgroup commit charge has to traverse many cgroups for each page charge. We also saw from source code analysis that it locks the page cache least recently used (LRU) list.

    We evaluated a few more bug reports and articles, especially the following:

    1. A bug report unrelated to Kubernetes but pointing to the underlying cause related to cgroups. This bug report also helped point to the fix that was available for such an issue. 
    2. An article on lwn.net related to almost our exact issue. A must read!
    3. A related workaround to the problem in Kubernetes.
    4. A thread in the Linux kernel mailing list that helped a lot.

    These posts were a great signal that we were on the right track to understanding the root cause of the problem. To confirm our findings, and to understand whether we were hitting a symptom of this cgroup leak that hadn't yet been observed in the Linux community, we met with Linux kernel engineers at Google.

    Together, we evaluated an affected node and the nature of the problem. The Google engineers were able to confirm that we were in fact hitting another side effect of slab memory being reparented on cgroup removal.

    To Prove the Hypothesis

    After evaluating the problem, we tested a possible solution. We wrote to a switch file for the kubepods cgroup (the parent cgroup for Kubernetes Pods) to force it to empty zombie/dead cgroups:

    $ echo 1 | sudo tee /sys/fs/cgroup/memory/kubepods/memory.force_empty

    This caused the number of memory cgroups to decrease rapidly to only approximately 1,800, which is in line with a good node from our earlier comparison:

    We quickly tested a MySQL Pod restart to see if there were any improvements in startup time performance. An 80G InnoDB buffer pool was initialized in five seconds:

    A Possible Workaround and Fixes

    There were multiple workarounds and fixes for this problem that we evaluated with engineers from Google:

    1. Rebooting or cordoning affected cluster node VMs, identifying them by monitoring /proc/cgroups output.
    2. Setting up a cronjob to drop SLAB and page caches. It's an old-school DBA/sysadmin technique that might work, but could carry a read IO performance penalty.
    3. Moving short-lived Pods to a dedicated node pool to isolate them from more critical workloads like the MySQL Pods.
    4. Running echo 1 > /sys/fs/cgroup/memory/memory.force_empty in a preStop hook of short-lived Pods.
    5. Upgrading to COS 85, which has upstream fixes for the cgroup SLAB reparenting bugs. Upgrading from GKE 1.16 to 1.18 would get us Linux kernel 5.4 with the relevant bug fixes.

    Since we were due a GKE version upgrade, we created new GKE clusters with GKE 1.18 and started creating new MySQL Pods on those new clusters. After several weeks of running on GKE 1.18, we saw consistent MySQL InnoDB Buffer Pool initialization time and query performance:

    A table showing values for kube_namespace, kube_container, innodb_buffer_pool_size, and duration.
    Duration in seconds of new and consistent InnoDB Buffer Pool initialization for various KateSQL instances

    This was one of the lengthiest investigations the Database Platform group has carried out at Shopify. The time it took to identify the root cause was due to the nature of the problem, its difficult reproducibility, and the absence of any out-of-the-book methodology to follow. However, there are multiple ways to solve the problem, and that's a very positive outcome.

    Special thanks to Roman Gushchin from Facebook’s Kernel Engineering team, whom we connected with via LinkedIn to discuss this problem, and Google’s Kernel Engineering team who helped us confirm and solve the root cause of the problem.

    Rodrigo Saito is a Senior Production Engineer on the Database Backend team, where he primarily works on building and improving KateSQL, Shopify's Database-as-a-Service, drawing on more than a decade of software engineering experience. Akshay Suryawanshi is a Staff Production Engineer on the Database Backend team who helped build KateSQL at Shopify, along with Jeremy Cole, a Senior Staff Production Engineer in the larger Database Platform group. Both bring decades of database administration and engineering experience to managing petabyte-scale infrastructure.




    GitHub Does My Operations Homework: A Ruby Speed Story


    Hey, folks! Some of you may remember me from Rails Ruby Bench, Rebuilding Rails, or a lot of writing about Ruby 3 and Ruby/Rails performance. I’ve joined Shopify, on the YJIT team. YJIT is an open-source Just-In-Time Compiler we’re building as a fork of CRuby. By the time you read this, it should be merged into prerelease CRuby in plenty of time for this year’s Christmas Ruby release.

    I've built a benchmarking harness for YJIT and a big site of graphs and benchmarks, updated twice a day.

    I'd love to tell you more about that. But before I do, I know you're asking: how fast is YJIT? That's why the top of our new public page of YJIT results looks like this:

    Overall, YJIT is 20.6% faster than interpreted CRuby, or 17.4% faster than MJIT! On Railsbench specifically, YJIT is 18.5% faster than CRuby, 21.0% faster than MJIT!

    After that, there are lots of graphs and, if you click through, giant tables of data. I love giant tables of data.

    And I hate doing constant ops work. So let’s talk about a low effort way to make GitHub do all your operational work for a constantly-updated website, shall we?

    I’ll talk benchmarks along the way, because I am still me.

    By the way, I work on the YJIT team at Shopify. We’re building a Ruby optimizer to make Ruby faster for everybody. That means I’ll be writing about it. If you want to keep up, this blog has a subscribe thing (down below.) Fancy, right?

    The Bare Necessities

    I’ve built a few YJIT benchmarks. So have a lot of other folks. We grabbed some existing public benchmarks and custom-built others. The benchmarks are all open-source so please, have a look. If there’s a type of benchmark you wish we had, we take pull requests!

    When I joined the YJIT team, that repo had a perfectly serviceable runner script that would run benchmarks and print the results to console (which still exists, but isn’t used as much anymore.) But I wanted to compare speed between different Ruby configurations and do more reporting. Also, where do all those reports get stored? That’s where my laziness kicked in.

    GitHub Pages is a great way to have GitHub host your website for free. A custom Jekyll config is a great way to get full control of the HTML. Once we had results to post, I could just commit them to Git, push them, and let GitHub take care of the rest.

    But Jekyll won’t keep it all up to date. That needs GitHub Actions. Between them, the final result is benchmarks run automatically, the site updates automatically, and it won’t email me unless something fails.

    Perfect.

    Want to see the gritty details?

    Setting up Jekyll

    GitHub Pages runs on Jekyll. You can use something else, but then you have to run it on every commit. If you use Jekyll, GitHub runs it for you and tells you when things break. But you'd like to customise how Jekyll runs and test locally with bundle exec jekyll serve. So you need to set up _config.yml in a way that makes all that happen. GitHub has a pretty good setup guide for that. And here's _config.yml for speed.yjit.org.

    Of course, customising the CSS is hard when it's in a theme. You need to copy all the parts of the theme into your local repo, like I did, if you want to change how they work (like not supporting <b> for bold and requiring <strong>, I'm looking at you, Slate).

    But once you have that set up, GitHub will happily build for you. And it’s easy! No problem! Nothing can go wrong!

    Oh, uh, I should mention, maybe, hypothetically, there might be something you want to put in more than one place. Like, say, a graph that can go on the front page and on a details page, something like that. You might be interested to know that Jekyll requires anything you include to live under _includes or the current subdirectory, so you have to generate your graph in there. Jekyll makes it really hard to get around the has to be under _includes rule. And once you’ve put the file under _includes, if you want to put it onto a page with its own URL, you should probably research Jekyll collections. And an item in a collection gets one page, not one page per graph… Basically, your continuous reporting code, like mine, is going to need to know more about Jekyll than you might wish.

    A snippet of Jekyll _config.yml that adds a collection of benchmark objects which should be output as individual pages
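    The gist of that snippet, as a hedged reconstruction (the collection name is assumed rather than copied from the real config), is a Jekyll collection with output enabled:

    collections:
      benchmarks:
        output: true                      # give each benchmark item its own page
        permalink: /:collection/:name/    # e.g. /benchmarks/railsbench/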

    But once you’ve set Jekyll up, you can have it run the benchmarks, and then you have nice up-to-date data files. You’ll have to generate graphs and reports there too. You can pre-run jekyll build to see if everything looks good. And as a side benefit, since you’re going to need to give it GitHub credentials to check in its data files, you can have it tell you if the performance of any benchmark drops too much.

    AWS and GitHub Actions, a Match Made In… Somewhere

    GitHub actions are pretty straightforward, and you can set one to run regularly, like a cron job. So I did that. And it works with barely a hiccup! It was easy! Nothing could go wrong.

    Of course, if you’re benchmarking, you don’t want to run your benchmarks in GitHub Actions. You want to do it where you can control the performance of the machine it runs on. Like an AWS instance! Nothing could go wrong.

    I just needed to set up some repo secrets for logging into the AWS instance. Like a private key, passed in an environment variable and put into an SSH identity file, that really has to end with a newline or everything breaks. But it’s fine. Nothing could go wrong!

    Hey, did you know that SSH remembers the host SSH key from any previous time you SSH’d there? And that GitHub Actions uses a shared .known_hosts file for those keys? And AWS re-uses old public IP addresses? So there’s actually a pretty good chance GitHub Actions will refuse to SSH to your AWS instance unless you tell it -oStrictHostKeyChecking=no. Also, SSH doesn’t usually pass environment variables through, so you’re going to need to assign them on its command line.

    So, I mean, okay, maybe something could go wrong.

    If you want to SSH into an AWS instance from GitHub Actions, you may want to steal our code, is what I’m saying.
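    In rough shape, the Actions step ends up running something like this (the user, host variable, and script name are illustrative; the key and host would come from repo secrets):

    # printf guarantees the trailing newline the SSH key needs
    printf '%s\n' "$AWS_INSTANCE_KEY" > id_benchmark && chmod 600 id_benchmark
    ssh -i id_benchmark -o StrictHostKeyChecking=no "ubuntu@$AWS_INSTANCE_HOST" \
      "BENCH_TYPE=default ./run_benchmarks.sh"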

    For the Love of Graphs

    Of course, none of this gets you those lovely graphs. We all want graphs, right? How does that work? Any way you want, of course. But we did a couple of things you might want to look at.

    A line graph of how four benchmarks’ results have changed over time, with ‘whiskers’ at each point to show the uncertainty of the measurement.

    For the big performance over time graph on the front page, I generated a D3.js graph from Erb. If you’ve used Rails, generating HTML and JS from Ruby should sound pretty reasonable. I’ve had good luck with it for several different projects. D3 is great for auto-generating your X and Y axis, even on small graphs, and there’s lots of great example code out there.

    If you want to embed your results, you can generate static SVGs from Ruby. That takes more code, and you’ll probably have more trouble with finicky bits like the X and Y axis or the legend. Embeddable graphs are hard in general since you can’t really use CSS and everything has to be styled inline, plus you don’t know the styling for the containing page. Avoid it if you can, frankly, or use an iframe to embed. But it’s nice that it’s an option.

    A large bar graph of benchmark results with simpler axis markings and labels.

    Both SVG approaches, D3 and raw SVG, allow you to do fun things with JavaScript like mouseover (we do that on speed.yjit.org) or hiding and showing benchmarks dynamically (like we do on the timeline deep-dive). I wouldn’t try that for embeddable graphs, since they need more JavaScript that may not run inside a random page. It’s more enjoyable to implement interesting features with D3 instead of raw SVG.

    A blocky, larger-font bar graph generated using matplotlib.

    If fixed-sized images work for you, matplotlib also works great. We don’t currently use that for speed.yjit.org, but we have for other YJIT projects.

    Reporting Isn’t Just Graphs

    Although it saddens my withered heart, reporting isn’t just generating pretty graphs and giant, imposing tables. You also need a certain amount of English text designed to be read by “human beings.”

    That big block up-top that says how fast YJIT is? It’s generated from an Erb template, of course. It’s a report, just like the graphs underneath it. In fact, even the way we watch if the results drop is calculated from two JSON files that are both generated as reports—each tripwire report is just a list of how fast every benchmark was at a specific time, and an issue gets filed automatically if any of them drop too fast.
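    As a tiny sketch of that ERB-to-HTML pattern (the result numbers are the ones from the summary at the top of this post; the file name and hash keys are made up for illustration):

    require "erb"

    results = { yjit_vs_cruby: 20.6, yjit_vs_mjit: 17.4 }

    template = ERB.new(<<~HTML)
      <div class="headline">
        Overall, YJIT is <%= results[:yjit_vs_cruby] %>% faster than interpreted CRuby,
        or <%= results[:yjit_vs_mjit] %>% faster than MJIT!
      </div>
    HTML

    File.write("_includes/headline.html", template.result(binding))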

    So What’s the Upshot?

    There’s a lot of text up there. Here’s what I hope you take away:

    GitHub Actions and GitHub Pages do a lot for you if you’re running a batch-updated dynamic site. There are a few weird subtleties, and it helps to copy somebody else’s code where you can.

    YJIT is pretty fast. Watch this space for more YJIT content in the future. You can subscribe below.

    Graphs are awesome. Obviously.

    Noah Gibbs wrote the ebook Rebuilding Rails and then a lot about how fast Ruby is at various tasks. Despite being a grumpy old programmer in Inverness, Scotland, Noah believes that some day, somehow, there will be a second game as good as Stuart Smith’s Adventure Construction Set for the Apple IIe. Follow Noah on Twitter and GitHub.




    Try Out YJIT for Faster Rubying


    Here at Shopify, we’re building a new just-in-time (JIT) implementation on top of CRuby. Maxime talked about it at RubyKaigi and wrote a piece for our Engineering Blog. If you keep careful watch, you may have even seen our page of benchmark results.

    YJIT is a Just-In-Time Compiler, so it works by converting your most frequently used code into optimized native machine code. Early results are good—for instance, we’re speeding up a simple hello, world Rails benchmark by 20% or more. Even better, YJIT is pretty reliable: it runs without errors on Shopify’s full suite of unit tests for its main Rails application, and similar tests at GitHub. Matz, Ruby’s chief designer, mentioned us in his EuRuKo keynote. By the time you read this, YJIT should be merged into CRuby in time for Ruby 3.1 release at Christmas.

    Maybe you’d like to take YJIT out for a spin. Or maybe you’re just curious about how to compile CRuby (a.k.a. plain Ruby or Matz’s Ruby Interpreter.) Or maybe you just like reading technical blog posts with code listings in them. Reading blog posts feels weirdly productive while you’re waiting for a big test run to finish, doesn’t it? Even if it’s not 100% on-topic for your job.

    YJIT is available in the latest CRuby as 3.1.0-dev if you use ruby-build. But let's say that you don't, or that you want to configure YJIT with debugging options, statistics, or other customizations.

    Since YJIT is now part of CRuby, it builds and installs the same way that CRuby does. So I’ll tell you how you build CRuby and then you’ll know all about building YJIT too. We’ll also talk about:

    • How mature YJIT is (or isn't)
    • How stable it is
    • Whether it's ready for you to use at work
    • Whether this speed turns into real-world benefits, or it's all benchmarks

    We’re building something new and writing about it. There’s an email subscription widget on this post found at the top-left and bottom. Subscribe and you’ll hear the latest, soonest.

    Start Your Engines

    If you're going to build Ruby from source, you'll need a few things: Autoconf, make, OpenSSL, and GCC or Clang. If you're on a Mac, Xcode and Homebrew will get you these things. I won't go into full details here, but Google can help out.

    brew install autoconf@2.69 openssl@1.1 # Unless you already have them

    On Linux, you’ll need a package similar to Debian build-essential plus autoconf and libssl-dev. Again, Google can help you here if you include the name of your Linux distribution. These are all common packages. Note that installing only Ruby from a package isn’t enough to build a new Ruby. You’re going to need Autoconf and a development package for OpenSSL. These are things a pre-built Ruby package doesn’t have.

    Now that you have the prerequisites installed, you clone the repo and build:
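    Roughly, assuming you’re building from the upstream ruby/ruby repository (where YJIT now lives) and installing under ~/.rubies/ruby-yjit, the prefix the version-manager commands below expect:

        git clone https://github.com/ruby/ruby.git ruby-yjit
        cd ruby-yjit
        ./autogen.sh
        ./configure --prefix="$HOME/.rubies/ruby-yjit" --disable-install-doc
        make -j && make install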

    And that will build you a local YJIT-enabled Ruby. If you’re using chruby, you can now log out, log back in, and switch to it with chruby ruby-yjit. rbenv is similar. With rvm you’ll need to mount it explicitly with a command like rvm mount ~/.rubies/ruby-yjit/bin/ruby -n ruby-yjit and then switch to it with rvm use ext-ruby-yjit.

    Note: on Mac, we’ve had a report of Autoconf 2.71 not working with Ruby, so you may need to install version 2.69, as shown above. And for Ruby in general you’ll want OpenSSL 1.1: Ruby doesn’t work with version 3, which Homebrew installs by default.

    How Do I Know if YJIT Is Installed?

    Okay… So YJIT runs the same Ruby code, but faster. How do I know I even installed it?

    First, and simplest, you can ask Ruby. Just run ruby --enable-yjit -v. You should see a message underneath that YJIT is enabled. If you get a warning that enable-yjit isn’t a thing, you’re probably using a different Ruby than you think. Check that you’ve switched to it with your Ruby version manager.
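    With a YJIT-enabled build, the version string itself should mention YJIT; something like this, with your own date and platform:

        $ ruby --enable-yjit -v
        ruby 3.1.0dev (…) +YJIT [x86_64-darwin20]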

    A warning like that means the Ruby you’re running has no YJIT.

    You can also pop into irb and see if the YJIT module exists:
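    Something like this works, assuming the module name used at the time of this post (in released Rubies the module has since moved under RubyVM::YJIT):

        defined?(YJIT)        # => "constant" when this Ruby has YJIT built in, nil otherwise
        YJIT.runtime_stats    # => a hash with only a couple of entries unless you build with stats (see below)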

    You may want to export RUBYOPT='--enable-yjit' for this, or export RUBY_YJIT_ENABLE=1, which also enables YJIT. YJIT isn’t on by default, so you’ll need to enable it.

    Running YJIT

    After you’ve confirmed YJIT is installed, run it on some code. We found it runs fine on our full unit test suites, and a friendly GitHubber verified that it runs without error on theirs. So it’ll probably handle yours without a problem. If you pop into your project of choice and run rake test with YJIT and without, you can compare the times.

    If you can’t think of any code to run it on, YJIT has a benchmark suite we like. You could totally use it for that. If you do, you can run things like ruby -Iharness benchmarks/activerecord/benchmark.rb and compare the times. Those are the same benchmarks we use for speed.yjit.org. You may want to read YJIT’s documentation while you’re there. There are some command-line parameters and build-time configurations that do useful and fun things.
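    At the time of writing, the suite lives in its own repository, and you may need to install a given benchmark’s gems before it will run; roughly:

        git clone https://github.com/Shopify/yjit-bench
        cd yjit-bench
        ruby -Iharness benchmarks/activerecord/benchmark.rb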

    Is YJIT Ready for Production?

    Benchmarks are fine, but YJIT doesn’t always deliver the same real-world speedups. We’ve had some luck on benchmarks and minor speedups with production code, but this is still very much a work in progress. So where is YJIT actually at?

    First, we’ve had good luck running it on our unit tests, our production-app benchmarking code, and one real web app here at Shopify. We get a little bit of speedup, in the neighbourhood of 6%. That can add up when you multiply by the number of servers Shopify runs… But we aren’t doing it everywhere, just on a small percentage of traffic for a real web service, basically a canary deployment.

    Unit tests, benchmarks, and a little real traffic are a good start. We’re hoping that early adopters and being included in Ruby 3.1 will give us a lot more real-world usage data. If you try YJIT, we’d love to hear from you. File a GitHub issue, good or bad, and let us know!

    Hey, YJIT Crashed! Who Do I Talk to?

    Awesome! I’ve been fuzz-testing YJIT with AFL for days and trying to crash it. If you could file an issue and tell us as much as possible about how it broke, we’d really appreciate that. Similarly, anything you can tell us about how fast it is or isn’t is much appreciated. This is still early days.

    And if YJIT is really slow or gives weird error messages, that’s another great reason to file an issue. If you run your code with YJIT, we’d love to hear what breaks. We’d also love to hear if it speeds you up! You can file an issue either way, and even if the news is good rather than bad, I promise we’ll figure out what to do with it.

    What if I Want More?

    Running faster is okay. But maybe you find runtime_stats up above intriguing. If you compile YJIT with CFLAGS='-DRUBY_DEBUG=1' or CFLAGS='-DYJIT_STATS=1', you can get a lot more detail about what it’s doing. Make sure to run it with YJIT_STATS set as an environment variable or --yjit-stats on the command line. And then YJIT.runtime_stats will have hundreds of entries, not just two.
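    For example, assuming a Ruby built with one of those flags, and with your own script in place of my_script.rb:

        YJIT_STATS=1 ruby --enable-yjit my_script.rb
        ruby --enable-yjit --yjit-stats my_script.rb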

    When you run with statistics enabled, you’ll also get a report printed when Ruby exits showing interesting things like:

    • what percentage of instructions were run by YJIT instead of the regular interpreter 
    • how big the generated code was 
    • how many ISEQs (methods, very roughly) were compiled by YJIT.

    The beginning of a YJIT statistics exit report for a trivial one-line print command.

    What Can I Do Next?

    Right now, it’s really helpful just to have somebody using YJIT at all. If you try it out and let us know what you find, that’s huge! Other than that, just keep watching. Check the benchmarks results now and then. Maybe talk about YJIT a little online to your friends. As a famous copyrighted movie character said, "The new needs friends." We’ll all keep trying to be friendly.

    Noah Gibbs wrote the ebook Rebuilding Rails and then a lot about how fast Ruby is at various tasks. Despite being a grumpy old programmer in Inverness, Scotland, Noah believes that some day, somehow, there will be a second game as good as Stuart Smith’s Adventure Construction Set for the Apple IIe. Follow Noah on Twitter and GitHub.





    YJIT: Building a New JIT Compiler for CRuby

    The 1980s and 1990s saw the genesis of Perl, Ruby, Python, PHP, and JavaScript: interpreted, dynamically-typed programming languages which favored ease of use and flexibility over performance. In many ways, these programming languages are a product of the surrounding context. The 90s were the peak of the dot-com hype, and CPU clock speeds were still doubling roughly every 18 months. It looked like the growth was never going to end. You didn’t really have to try to make your software run fast because computers were just going to get faster, and the problem would take care of itself. Today, things are a bit different. We’re reaching the limit of current silicon fabrication technologies, and we can’t rely on single-core performance increases to solve our performance problems. Because of mobile devices and environmental concerns, we’re beginning to realize that energy efficiency matters.

    Last year, during the pandemic, I took a job at Shopify, a company that runs a massive server infrastructure powered by Ruby on Rails. I joined a team with multiple software engineers working on improving the performance of Ruby code in a variety of ways, ranging from optimizing the CRuby interpreter and its garbage collector to the implementation of TruffleRuby, an alternative Ruby implementation. Since then, I’ve been working with a team of skilled engineers from Shopify and GitHub on YJIT, a new Just-in-time (JIT) compiler built inside CRuby.

    This project is important to Shopify and Ruby developers worldwide because speed is an underrated feature. There’s already a JIT compiler inside CRuby, known as MJIT, which has been in the works for three years. And while it has delivered speedups on smaller benchmarks, so far, it’s been less successful at delivering real-world speedups on widely used Ruby applications such as Ruby on Rails. With YJIT, we take a data-driven approach and focus specifically on performance hotspots of larger applications such as Rails and Shopify Core (Shopify’s main Rails monolith).

    What’s YJIT?

    ""Shopify loves Ruby! A small team lead by  @Love2Code  has been working on a new JIT that focuses on web &  @rails  workloads while also accelerating all ruby code out there. Today  @yukihiro_matz  gave his thumbs up to merging it into trunk:
    Tobi Lütke tweeting about YJIT

    YJIT is a project to gradually build a JIT compiler inside CRuby such that more and more of the code is executed by the JIT, which will eventually replace the interpreter for most of the execution. The compiler, which is soon to become officially part of CRuby, is based on Basic Block Versioning (BBV), a JIT compiler architecture I started developing during my PhD. I’ve given talks about YJIT this year at the MoreVMs 2021 workshop and another one at RubyKaigi 2021 if you’re curious to hear more about the approach we’re taking.

    Current Results

    We’re about one year into the YJIT project at this point, and so far, we’re pleased with the results, which have significantly improved since the MoreVMs talk. According to our set of benchmarks, we’ve achieved speedups over the CRuby interpreter of 20% on railsbench, 39% on liquid template rendering, and 37% on activerecord. YJIT also delivers very fast warm up. It reaches near-peak performance after a single iteration of any benchmark and performs at least as well as the interpreter on every benchmark, even on the first iteration.

    A bar graph showing the performance differences between YJIT, MJIT, and No JIT.
    Benchmark speed (iterations/second) scaled to the interpreter’s performance (higher is better)

    Building YJIT inside CRuby comes with some limitations. It means that our JIT compiler has to be written in C and that we have to work with design decisions in the CRuby codebase that weren’t made with a high-performance JIT compiler in mind. However, it has the key advantage that YJIT is able to maintain almost 100% compatibility with existing Ruby code and packages. We pass the CRuby test suite, comprising about 30,000 tests, and we have also been able to pass all of the tests of the Shopify Core CI, a codebase that contains over three million lines of code and depends (directly and indirectly) on over 500 Ruby gems, as well as all the tests in the CI for GitHub’s backend. We also have a working deployment to a small percentage of production servers at Shopify.

    We believe that the BBV architecture that powers YJIT offers some key advantages when compiling dynamically-typed code. Having end-to-end control over the full code generation pipeline will allow us to go farther than what’s possible with the current architecture of MJIT, which is based on GCC. Notably, YJIT can quickly specialize code based on type information and patch code at run time based on the run-time behavior of programs. The advantage in terms of compilation speed and warmup time is also difficult to match.

    Next Steps

    The Ruby core developers have invited the YJIT team to merge the compiler into Ruby 3.1. It’s a great honor for my colleagues and myself to have our work become officially part of Ruby. This means, in a few months, every Ruby developer will have the opportunity to try YJIT by simply passing a command-line option to the Ruby binary. However, our journey doesn’t stop there, and we already have plans in the works to make YJIT and CRuby even faster.

    Currently, only about 79% of instructions in railsbench are executed by YJIT, and the rest run in the interpreter, meaning that there’s still a lot we can do to improve upon our current results. There’s a clear path forward, and we believe YJIT can deliver much better performance than it does now. However, as part of building YJIT, we’ve had to dig through the implementation of CRuby to understand it in detail. In doing so, we’ve identified a few key elements in its architecture that we believe can be improved to unlock higher performance. These improvements won’t just help YJIT, they’ll help MJIT too, and some of them will even make the interpreter faster. As such, we will likely try to upstream some of this work separately from YJIT.

    I may expand on some of these in future blog posts, but here is a tentative list of potential improvements to CRuby that we would like to tackle:

    • Moving CRuby to an object model based on object shapes.
    • Changing the CRuby type tagging scheme to reduce the cost of type checks.
    • Implementing a more fine-grained constant caching mechanism.
    • Implementing a faster, more lightweight calling convention.
    • Rewriting C runtime methods in Ruby so that JIT compilers can inline through them.

    Matz (Yukihiro Matsumoto) has stated in his recent talk at Euruko 2021 that Ruby would remain conservative with language additions in the near future. We believe this is a wise decision as rapid language changes can make it difficult for JIT implementations to get off the ground and stay up to date. It makes some sense, in our opinion, for Ruby to focus on internal changes that will make the language more robust and deliver very competitive performance in the future.

    I hope you’re as excited about the future of YJIT and Ruby as we are. If you’re interested in trying YJIT, it’s available on GitHub under the same open source license as CRuby. If you run into bugs, we’d appreciate it if you would open an issue and help us find a simple reproduction. Stay tuned as two additional blog posts about YJIT are coming soon, with details about how you can try YJIT, and the performance tracking system we’ve built for speed.yjit.org.

    Maxime Chevalier-Boisvert obtained a PhD in compiler design at the University of Montreal in 2016, where she developed Basic Block Versioning (BBV), a JIT compiler architecture optimized for dynamically-typed programming languages. She is currently leading a project at Shopify to build YJIT, a new JIT compiler built inside CRuby.





    Winning AI4TSP: Solving the Travelling Salesperson Problem with Self-programming Machines

    Running a business requires making a lot of decisions. To be competitive, they have to be good. There are two complications, though:

    1. Some problems are computationally very hard to solve.
    2. In reality, we are dealing with uncertainty, so we do not even know what exact problem setting we should optimize for.

    The AI4TSP Competition fosters research on the intersection of optimization (addressing the first issue of efficient computation for hard problems) and artificial intelligence (addressing the second issue of handling uncertainty). Shopify optimization expert Meinolf Sellmann collaborated with his former colleagues Tapan Shah at GE Research, Kevin Tierney, Fynn Schmitt-Ulms, and Andre Hottung from the University of Bielefeld to compete and win first prize in both tracks of the competition. The type of problem studied in this competition matters to Shopify as the optimization of our fulfillment system requires making decisions based on estimated data.

    The Travelling Salesperson Problem

    The AI4TSP Competition focuses on the Travelling Salesperson Problem (TSP), one of the most studied routing problems in the optimization community. The task is to determine the order to visit a given set of locations, starting from, and returning to, a given home location, so the total travel time is minimized. In its original form, the travel times between all locations are known upfront. In the competition, these times weren’t known upfront but were sampled according to a probability distribution. The objective was to visit as many locations as possible within a given period of time, with each location associated with a specific reward. To complicate matters further, the locations visited on the tour had to be reached within fixed time windows. Arriving too early meant having to wait until the location opened; arriving too late incurred penalties.

    An image of two solutions to the same TSP instance with the home location in black. The route solutions can be done counterclockwise or clockwise
    Two solutions to the same TSP instance (home location in black)

    When travel times are known, the problem looks innocent enough. However, consider this: the number of possible tours grows more than exponentially and is given by “n! = 1 * 2 * 3 … * n” (n factorial) for a problem instance with n locations. Even if we could:

    1. evaluate, in parallel, one potential tour for every atom in the universe
    2. have each atomic processor evaluate the tours at Planck time (shortest time unit that anything can be measured in)
    3. run that computer from the Big Bang until today.

    It still wouldn’t be enough to enumerate all the solutions for just one TSP instance with 91 locations. The biggest problems at the competition had 200 locations—with over 10^375 potential solutions.

    The Competition Tracks

    The competition consisted of two tracks. In the first, participants had to determine a tour for a given TSP instance that would work well in expectation when averaged over all possible travel time scenarios. A tour had to be chosen up front, and participants had to stick to it no matter how the travel times turned out when executing the tour. The results were then averaged ten times over 100,000 scenarios to determine the winner.

    A table of results for the final comparison of front runners in Track 1. It shows Meinolf’s team as the winner.
    Final Comparison of Front Runners in Track 1 (Shopify and Friend’s Team “Convexers”)

    In the second track, it was allowed to build the tour on the fly. At every location, participants could choose which location to go to next, taking into account how much time had already elapsed. The policy that determined how to choose the next location was evaluated on 100 travel time realizations for each of 1,000 different TSP instances to determine the winner.

    Optimal Decisions Under Uncertainty

    For hard problems like the TSP, optimization requires searching. This search can be systematic, whereby we search in such a manner that we can efficiently keep a record of the solutions that have already been looked at. Alternatively, we can search heuristically, which generally refers to search methods that work non-systematically and may revisit the same candidate solution multiple times during the search. This is a drawback of heuristic search, but it offers much more flexibility, as the search controller can guide where to go next opportunistically and isn’t bound to exploring spaces that neatly fit our existing search record. However, we need to deploy techniques that allow us to escape local regions of attraction, so that we don’t explore the same basin over and over.

    For the solution to both tracks, Shopify and friends used heuristic search, albeit in two very different flavors. For the first track, the team applied a search paradigm called dialectic search. For the second track, they used what’s known in machine learning as deep reinforcement learning.

    The Age of Self-Programming Machines

    Key to making both approaches work is to allow the machine to learn from prior experience and to adjust the program automatically. If the ongoing machine learning revolution had to be summarized in just one sentence, it would be:

    If, for a given task, we fail to develop algorithms with sufficient performance, then we shift the focus to building an algorithm that can build this algorithm for us, automatically.

    A recent prominent example where this revolution has led to success is AlphaFold, DeepMind’s self-taught algorithm for predicting the 3D structure of proteins. Humans tried to build algorithms that could predict this structure for decades, but were unable to reach sufficient accuracy to be practically useful. The same was demonstrated for tasks like machine vision, playing board games, and optimization. At another international programming competition, the MaxSAT Evaluation 2016, Meinolf and his team entered a self-tuned dialectic search approach which won four out of nine categories and ten medals overall. 

    These examples show that machine-generated algorithms can vastly outperform human-generated approaches. Particularly when problems become hard to conceptualize in a concise theory and hunches or guesses must be made during the execution of the algorithm, allowing the machine to learn and improve based on prior experience is the modern way to go.

    Meinolf Sellmann, Director for Network Optimization at Shopify, is best known for algorithmic research, with a special focus on self-improving algorithms, automatic algorithm configuration and algorithm portfolios based on artificial intelligence, combinatorial optimization, and the hybridization thereof. He received a doctorate degree (Dr. rer. nat.) from Paderborn University (Germany). Prior to this he was Technical Operations Leader for machine learning and knowledge discovery at General Electric, senior manager for data curation at IBM, Assistant Professor at Brown University, and Postdoctoral Scholar at Cornell University.
    His honors include the Prize of the Faculty of the University of Paderborn (Germany) for his doctoral thesis, an NSF Career Award in 2007, over 20 Gold Medals at international SAT and MaxSAT Competitions, and first places at both tracks of the 2021 AI for TSP Competition. Meinolf has also been invited as keynote speaker and program chair of many international conferences like AAAI, IAAI, Informs, LION and CP.





    Journey Through a Dev Degree Intern’s First Placement

    This past April, I completed my first placement as a Dev Degree student. I was a back-end developer working on the Docs & API Libraries team. The team’s mission is to create libraries, tools, and documentation that help developers build on top of Shopify. Throughout my time on the team, I had many opportunities to solve problems, and in taking them, I learned not only technical skills, but life lessons I hope to carry with me. I’ll share with you how I learned to appreciate the stages of learning, dispel the myth of the “real developer,” combat imposter syndrome by asking for help, and recognize the power of valuing representation.

    Joining the Dev Degree Program

    In February of 2019, I received an email from York University. Apply for Dev Degree! it read. As a high school student with disabilities, I’d been advised to look for a program that would support me until I graduated. With small cohorts and a team of supportive folks from both Shopify and York University, I felt I could be well-accommodated in Dev Degree. Getting work experience, I reasoned, would also help me learn to navigate the workplace, putting me at an advantage once I graduated. With nothing to lose and absolutely no expectations, I hit Apply.

    An illustration of an email that reads "To: you. subject: apply for Dev Degree!" with an "apply!" button at the bottom
    Apply to Dev Degree

    Much to my surprise, I made it through the application and interview process. What followed were eight months of learning: 25 hours a week in a classroom at Shopify and 20 hours a week at York University. Alongside nine other students, I discovered Ruby, databases, front-end development, and much more with a group of knowledgeable instructors at Shopify. In May 2020, we moved on to the next exciting stage of the Dev Degree program: starting our first placement on a real Shopify team! It would be 12 months long, followed by three additional eight-month long placements, all on teams of varying disciplines. During those 12 months, I learned lessons I’ll carry throughout my career.

    What We Consider “Foundational Knowledge” Gets Distorted the More We Learn

    When I first joined the Docs & API Libraries team, there was a steep learning curve. This was expected. Months into the placement, however, I still felt as though I was stagnant in my learning and would never know enough to become impactful. Upon mentioning this to my lead, he suggested we create an Achievement Log—a list of the victories I’d had, large and small, during my time on the team.

    Looking back on my Achievement Log, I see many victories I would now consider rather small, but had been a big deal at the time. To feel this way is a reminder that I have progressed. My bar for what to consider “basic knowledge” moved upward as I learned more, leading to a feeling of never having progressed. One good way of putting these achievements into perspective is to share them with others. So, let’s journey through some highlights from my Achievement Log together!

    An illustration of stairs with label, "today" on closest and largest step, and "before" off in the distance
    Stairs to achievement

    I remember being very frustrated while attempting my first pull request (PR). I was still learning about the Shopify CLI (a command-line tool to help developers create Shopify apps) and kept falling down the wrong rabbit holes. The task: allow the Shopify CLI to take all Rails flags when scaffolding a Rails project; for example, using --db=DB to change the database type. It took a lot of guidance, learning, and mistakes to solve the issue, so when I finally shipped my first PR, it felt like a massive victory!

    Mentoring was another good way to see how far I’d come. When pairing with first-year Dev Degree students on similar tasks, I’d realize I was answering their questions—questions I myself had asked when working on that first PR—and guiding them toward their own solutions. It reminds me of the fact that we all start in the same place: with no context, little knowledge, and much curiosity. I have made progress.

    The Myth of the “Real” Developer Is, Well, a Myth

    After familiarizing myself with the tool through tasks similar to my first PR, I began working on a project that added themes as a project type to the Shopify CLI. The first feature I added was the command to scaffold a theme, followed by a task I always dreaded: testing. I understood what my code did, and what situations needed to have unit tests, but could not figure out how to write the tests.

    My lead gave a very simple suggestion: why not mimic the tests from the Rails project type? I was more familiar with the Rails project type, and the code being tested was very similar between the two test files. This had never occurred to me before, as it didn’t feel like something a “real” developer would do.

    An illustration of a checklist titled, "Real Programmer" with none of the boxes ticked
    Real Programmer checklist

    Maybe “real” developers program in their free time, are really into video games, or have a bunch of world-changing side projects going at all times. Or maybe they only use the terminal—never the GUI—and always dark mode. I spoke to a few developer interns about what notions they used to have of a “real” developer, and those were some of their thoughts. My own contribution was that “real” developers could solve problems in their own creative ways and didn’t need help; it felt like I needed to be able to implement unique solutions without referring to any examples. So, following existing code felt like cheating at first. However, the more I applied the strategy, the more I realized it was helpful and entirely valid.

    For one, observational learning is a good way of building confidence toward being able to attempt something. My first pairing sessions on this team, before I felt comfortable driving, involved watching an experienced developer work through the problem on their screen. Looking at old code also gives a starting point for what questions to ask or keywords to search. By better understanding the existing tests, and changing them to fit what I needed to test, I was learning to write unrelated tests one bite-sized piece at a time.

    An illustration of the Real Programmer checklist, but criteria has been scratched out and a ticked box added next to checklist title, "Real Programmer"
    The Real Programmer is a dated stereotype

    At the end of the day, developers are problem solvers coming from all different walks of life, all with ways of learning that fit them. There’s no need to play into dated stereotypes!

    Imposter Syndrome Is Real, But It’s Only Part of the Picture

    As I was closing out the Shopify CLI project, I was added to my first team project: the Shopify Admin API Library for Node. This is a library written in TypeScript with support for authentication, GraphQL and REST requests, and webhook handling. It’s used to accelerate Node app development.

    With this being my first time working on a team at Shopify, I found it far too easy to nod along and pretend I knew more than I did. Figuring it would be a relatively simple feature to implement, a teammate suggested I add GraphQL proxy functionality to the library; but, for a very junior developer like me, it wasn’t all that simple. It was difficult to communicate the level of help I needed. Instead, I often seemed to require encouragement, which my team gave readily. I kept being told I was doing a good job, even though I felt I had hardly done anything, and whenever I mentioned to a friend that I felt I had no idea what I was doing, the response was often that they were sure I was doing great. That I needed to have more confidence in my abilities.

    I puzzled over the task for several weeks without admitting I had no idea what I was doing. In the end, my lead had to help me a lot. It wasn’t a matter of simply feeling like I knew nothing; I had been genuinely lost and successful at hiding it. When you’re surrounded by smart people, imposter syndrome is nearly inevitable. I’m still looking for ways to overcome this. Yet, there are moments when it is not simply a matter of underestimating myself; rather, it’s more about acknowledging how much I have to learn. To attribute the lack of confidence entirely to imposter syndrome would only work to isolate me more.

    Imposter syndrome or not, I still had to work through it. I had to ask for help: a vital challenge in the journey of learning. Throughout my placement, I was told repeatedly, by many team members, that they would like to hear me ask for help more often. And I’ve gotten better at that. Not only does it help me learn for the next time, it gives the team context into the problem I am solving and how. It may slow down progress for a little while, but in the larger picture, it moves the entire project forward. Plus, asking a senior developer for their time isn’t unreasonable; it’s normal. What would be unreasonable is expecting an intern to know enough not to ask!

    Learning Will Never Feel Fast

    My next project after the Node library was the Shopify Admin API Library for PHP (similar functionality as the Shopify Admin API Library for Node but a different language). The first feature I added was the Context class, comparable to a backpack of important information about the user’s app and shop being carried and used across the rest of the library’s functionality. I had never used PHP before and was excited to learn. That is, until I started to fall behind. Then I began to feel frustrated. Since I’d had experience building the previous library, shouldn’t I have been able to hit the ground running with this one?

    An illustration of a backpack with a tag that says "Context" with Context class parameters sticking out
    Shopify backpack full of context

    As a Dev Degree student, I work approximately half the hours of everyone else on my team, and my inexperience means I take longer to grasp certain concepts. Even knowing this, I tend to compare myself to others on my team, so I often feel very behind when it comes to how much I’ve accomplished. And any time I’m told that I’m a quick learner, it’s as if I’ve successfully fooled everyone.

    Determined to practice the lesson I had learned from trying to implement the GraphQL proxy in the Node library, I asked for help. I asked my teammates to review the draft PR occasionally while it was still a work-in-progress. The idea was that, if I strayed from the proper implementation, my mistake would be caught earlier on. Not only did this keep me on track, getting my teammates’ approval throughout the process also helped me realize I was less behind than I’d thought. It made learning a collaborative process. My inexperience and part-time hours didn’t matter, because I had a team to learn alongside.

    Even when I was on track, learning felt like a slow process—and that’s because that is the nature of learning. Not knowing or understanding things can be frustrating in the moment, especially when others seem to understand. But the perspective others have of me is different from mine. It’s not that I’ve fooled them; rather, they’re acknowledging my progress relative to only itself. They’re celebrating my learning for what it is.

    Valuing Representation Isn’t a Weakness

    As a kid, I couldn’t understand why representation mattered; if no one like you had achieved what you wanted to achieve before, then you’d be the first! Yet, this was the same kid who felt she could never become a teacher because her last name “didn’t sound like a teacher’s.” This internalized requirement to appear unbothered by my differences followed me as I got older, applying itself to other facets of me—gender, disability, ethnicity, age. I later faced covert sexism and ableism in high school, making it harder for me to remain unfazed. Consequently, when I first started working on an open source repository, I’d get nervous about how prejudiced a stranger could be just from looking at my GitHub profile picture.

    A doodle of a headshot with a nametag that reads "hello, I'm incompetent". It is annotated with the labels, "Asian", "girl", and "younger = inexperienced"
    Are these possible biases from a stranger?

    This nervousness has since died down. I haven’t been in any predicament of that sort on the team, and on top of that, I’ve been encouraged by everyone I interact with regularly to tell them if they aren’t being properly accommodating. I always have the right to be supported and included, but it’s my responsibility to share how others can help. We even have an autism support Slack channel at Shopify that gives me a space to see people like myself doing all sorts of amazing things and to seek out advice.

    Looking back on all the highlights of my achievement log, there’s a story that snakes through the entries. Not just one of learning and achievement, which I’ve shared, but one that’s best encapsulated by one last story.

    My lead sent out feedback forms to each of the team members to fill out about those they worked with. The last question was “Anything else to add?” Hesitantly, I messaged one of my teammates, asking if I could mention how appreciative I was of her openness about her neurodiversity. It was nice to have someone like me on the team, I told her. Imagine my surprise when she told me she’d meant to write the same thing for me!

    Valuing representation isn’t a weakness. It isn’t petty. There’s no need to appear unbothered. Valuing representation helps us value and empower one another—for who we are, what we represent, and the positive impact we have.

    There you have it, a recap of my first placement as a Dev Degree intern, some problems I solved, and some lessons I learned. Of course, I still have much to learn. There’s no quick fix to any challenges that come with learning. However, if I were to take only one lesson with me to my next placement, it would be this: there is so much learning involved, so get comfortable being uncomfortable! Face this challenge head-on, but always remember: learning is a collaborative activity.

    Having heard my story, you may be interested in learning more about Dev Degree, and possibly experiencing this journey for yourself. Applications for the Fall 2022 program are open until February 13, 2022.

    Carmela is a Dev Degree intern attending York University. She is currently on her second placement, working as a front-end developer in the Shopify Admin.



    Reusing Code with React Native Packages at Shopify

    At Shopify, we develop a bunch of different React Native mobile apps: Shop, Inbox, Point of Sale, Shopify Mobile, and Local Delivery. These apps represent different business domains, but they often have shared pieces of functionality like login or foundational blocks they build upon. Wouldn’t it be great to leverage development speed and focus on important product features by reusing code other teams have already written? Sure, but it might be a big and time-consuming effort that discourages teams. Usually, contributing to a new repo is more tedious and error prone in comparison to contributing to an existing repository. The developer needs to create a new repository, set up continuous integration (CI) and distribution pipelines, and add configs for Jest, ESLint, and Babel. It might be unclear where to start and what to do.

    My team, React Native Foundations, decided to invest in simplifying the process for developers at Shopify. In this post, I'll walk you through the process of extracting those shared elements, the setup we adopted, the challenges we encountered, and future lines of improvement.

    Our Considerations: Monorepo vs Multi-Repo

    When we set out to extract elements from the product repositories, we explored two approaches: multi-repo and monorepo. For us, it was important that the solution had low maintenance costs, allowing us to be consistent without much effort. Of the two, monorepo was the one that helped us achieve that.

    Having one monorepo to support reduces maintenance costs. The team has one process that can be improved and optimized instead of maintaining and providing support for any number of packages and repositories. For example, imagine updating React Native and React versions across 10 repositories. I don’t even want to!

    A monorepo decreases entrance barriers by offering everything you need to start building a package, including a package template to kick off building your package faster. Documentation and tooling provide the foundation for focusing on what’s important—the content of the package—instead of wasting time on configuring CI pipelines or thinking about the structure and configuration of the package.

    We want contributing to shared foundational code to be convenient and spark joy. Optimizing once, and for everyone, gives the team time and opportunity to improve the developer experience by offering features like generating automatic documentation and providing a fixture app to test changes during development. 

    Our Setup Details

    A repository consists of a set of npm packages that might contain native iOS and Android code, a fixture app that allows testing those packages in a real application, and an internal documentation website for users and contributors to learn how to use and contribute to the packages. This repository has an uncommon setup that makes it possible to hot-reload while editing the packages, resolve references between packages, and use them from the fixture app.

    First, packages are developed in TypeScript but distributed as JavaScript and definition files. We use TypeScript project references so the TypeScript compiler resolves cross-package references. Since the IDE detects it's a TypeScript project, it resolves the imports on the UI. Dependencies between projects are defined in the tsconfig.json of each package.

    When distributing the packages, we use Yarn. It’s language-agnostic and therefore doesn't translate dependencies between TypeScript projects to dependencies between packages. For that, we use Yarn Workspaces. That means besides defining dependencies for TypeScript, we have to define them in the package.json for Yarn and npm. Lerna, the publishing tool we use to push new versions of the packages to the registry, knows how to resolve the dependencies and build them in the right order.

    We extract TypeScript, Babel, Jest, and ESLint configs to the root level to ensure consistent configuration across the packages. Consistency makes contributions easier because packages have a similar setup, and it also leads to a more reliable setup. 

    The fixture app setup is the standard setup of any React Native app using Metro, Babel, CocoaPods, and Gradle. However, it has custom configuration to import and link the packages that live within the same repository:

    • babel.config.js uses module-resolver plugin to resolve project references. We wouldn't need this if Babel integrated with TypeScript's project references feature.
    • metro.config.js exposes the package directories to Metro so that hot reloading works when modifying the code of the packages.
    • Podfile has logic to locate and include the Pod of the local packages. It’s worth mentioning that we don’t use React Native autolinking for local packages, but install them manually.

    Developers test features by running the fixture app locally. They also have the option to create Shipit Mobile internal builds (which we call Snapshot builds) that they can share internally. Shared builds can be installed via QR code by any person in the company, allowing them to play with available packages.

    CI configuration is one of the things developers get for free when contributing to the monorepo. CI pipelines are auto-generated and therefore standardized across all the packages. Based on the content of the package we define the steps: 

    • build 
    • test 
    • type check 
    • lint TypeScript, Kotlin, and Swift code.
    A CI pipeline run showing all the steps (build, test, run, type check, and lint) run for a package with updates.

    Another interesting thing about our setup is that we generate a dependency graph of the packages to determine the dependencies between them. Also, the pipelines are triggered based on file changes, so we only build the package with new changes and those that depend on it.

    Code Generation

    Even with all the infrastructure in place, it might be confusing to start contributing. Documentation describing the process helped up to a point, but we could do better by using automation and code generation to streamline bootstrapping new packages further.

    The React Native packages monorepo offers a script built with PlopJS for adding a new package, based on a package template similar to the React Native community one. We took this template but customized it for Shopify.

    A newly created package is a ready-to-use skeleton that extends the monorepo’s default configuration and has auto-generated CI pipelines in place. The script prompts for answers to some questions and generates the package and pipelines as a result.

    Terminal window showing the script that prompts the user for answers to questions needed to create the packages and CI pipelines.

    Code generation ensures consistency across packages since everything is predefined for contributors. For the React Native Foundations team, it means supporting and improving one workflow, which reduces maintenance costs.

    Documentation

    Documentation is as important as the code we add to the repository, and having great documentation is crucial to provide a great developer experience. Therefore, it shouldn't be an afterthought. To make it easier for contributors not to overlook writing documentation, the monorepo offers auto-generated documentation available in a statically generated website built with Gatsby.

    Screenshot of the package documentation website created by Gatsby. The left hand side shows the list of packages and the right hand side contains the details of the selected package.

    Each package shows up in the sidebar of the documentation website, and its page contains the following information that’s pre-populated by reading metadata from the package.json file:

    • package name 
    • package dependencies
    • installation command (including peer dependencies)
    • dependency graph of the packages in the repository.

    Since part of the documentation is auto-generated, it’s also consistent across the packages. Users see the same sections with as much generated content as possible. The website supports extending the documentation with manually written content by creating any of the following files under the documentation/ directory of the package:

    • installation.mdx: include extra installation steps
    • getting-started.mdx: document steps to get started with the package
    • troubleshooting.mdx: document issues developers might run into and how to tackle them.

    Release Process

    I’ve mentioned before that we use Lerna for releasing the packages. During a release, we version independently and only if a package has unreleased changes. Due to how Lerna approaches the release process, all unreleased changes need to be released at the same time. 

    Our standard release workflow includes updating changelogs with the newest version and calling a release script that prompts you to update all the modules touched since the last change. 

    When versioning locally, we run two additional npm lifecycle scripts:

    • preversion ensures that all the changelogs are updated correctly. It gets run before we upgrade the version.
    • version gets run after we've updated the versions but before we make the "Publish" commit. It generates an updated readme and runs pod install considering the bumped versions.

    After that, we get a new release commit with release tags that we need to push to the main branch. Now, the only thing left is to press “Publish”, and the packages will be released to the internal package registry. 

    The release process has a few manual steps and can be improved further. We keep the main branch always shippable but plan on introducing automating releases on every merge to reduce friction. To do that we might need to:

    • start using conventional commits in the repo
    • automate changelog generation
    • configure a GitHub action to prepare a release commit after every merge automatically. This step will generate the changelog automatically, trigger a Lerna release commit, and push that to main
    • schedule an automated release of the package right after.

    The Future of Monorepos at Shopify

    In hindsight, we achieved our goal. Extracting and reusing code is easy: you get tooling, infrastructure, and maintenance from the React Native Foundations team, plus other nice things for free. Developers can easily share those internal packages, and product teams have a developer-friendly workflow to contribute to Shopify's foundation. As a result, 17 React Native packages have been developed since June 2020, with 10 of them contributed by product teams.

    Still, we learned some lessons along the way.

    We learned that the React Native tooling isn’t optimized for Shopify’s setup, but thanks to the flexibility of its APIs, we achieved a configuration we’re happy with. Still, the team keeps an eye on any inconveniences that come up and works on improving them.

    Also, we came up with the idea of having multiple monorepos for thematically-related packages instead of one huge one. Based on the Web Foundation team’s experience and our impression, it makes sense to introduce a few monorepos for coupled packages. A recent talk from Microsoft at the React Native EU 2021 conference also confirmed that having multiple monorepos is a natural evolutionary step for massive React Native codebases. Now, we have two monorepos: the main monorepo contains loosely coupled packages with utilities and Shopify-specific features, and another contains a few performance-related packages. Still, once we have a few monorepos, we’ll have to figure out how to reuse pieces across them to retain the benefits of a monorepo.

    Elvira Burchik is a Production Engineer on the React Native Foundations team. Her mission is to create an environment in which developers are highly productive at creating high-quality React Native applications. She lives in Berlin, Germany, and spends her time outside of work chasing the best kebabs and brewing coffee.





    Shard Balancing: Moving Shops Confidently with Zero-Downtime at Terabyte-scale

    Moving a shop from one shard to another requires engineering solutions around large, interconnected systems. The flexibility to move shops from shard to shard allows Shopify to provide a stable, well-balanced infrastructure for our merchants. With merchants creating their livelihood on the platform, it’s more important than ever that Shopify remains a sturdy backbone. High-confidence shard rebalancing is simply one of the ways we can do this.



    Making Shopify’s Flagship App 20% Faster in 6 Weeks Using a Novel Caching Solution

    Shop is Shopify’s flagship shopping app. It lets anyone track their packages, find new products, and even plant trees to offset the carbon emissions from their purchases. Since launching in 2019, we’ve quickly grown to serve tens of millions of users. One of the most heavily used features of the app is the home page. It’s the first thing users see when they open Shop. The home page keeps track of people’s orders from the time they click the checkout button to when it’s delivered to their door. This feature brings so much value to our users, but we’ve had some technical challenges scaling this globally. As a result of Shop’s growth, the home feed was taking up a significant amount of our total database load, and was starting to have a user-facing impact.

    We prototyped a few solutions to fix this load issue and ended up building a custom write-through cache for the home feed. This was a huge success—after about six weeks of engineering work, we built a custom caching solution that reduced database load by 15% and overall app latency by about 20%.

    Identifying The Problem

    The main screen of the Shop app is the most used feature. Serving the feed is complex as it requires aggregating orders from millions of Shopify and non-Shopify merchants in addition to handling tracking data from dozens of carriers. Due to a variety of factors, loading and merging this data is both computationally expensive and quite slow. Before we started this project, 30% of Shop’s database load was from the home feed. This load didn’t only affect the home feed, it affected performance on all aspects of the application.

    We looked around for simple, straightforward solutions to this problem like introducing IdentityCache, updating our database schema, and adding more database indexes. After some investigation, we learned that we had little database-level optimization left to do and no time to embark on a huge code rewrite. Caching, on the other hand, seemed ideal for this situation. Because users check the home feed every day and the feed is sorted by time, home feed data was usually read shortly after it was written, making it ideal for a cache of some sort.

    Finding a Solution

    Because of the structure of the home feed, we couldn't use a plug-and-play caching solution. We think of a given user’s home feed as a sorted list of a user’s purchases, where the list can be large (some people do a lot of shopping!). The list can be updated by a series of concurrent operations that include:

    • adding a new order to display on the home feed (for example, when someone makes a purchase from a Shopify store)
    • updating the details associated with an order (for example, when the order is delivered)
    • removing an order from the list (for example, when a user manually archives the order).

    In order to cache the home feed, we’d need a system that maintains a cached version of a user’s feed, while handling arbitrary updates to the orders in the feed and also maintaining the guarantee that the feed order is correct.

    Due to the quantity of updates that we process, it’s infeasible to use a read-through cache that’s invalidated after every write, as the cache would end up being invalidated so often it would be practically useless. After some research, we didn’t find an existing solution that:

    • wasn’t invalidated after writes
    • could handle failure cases without showing stale data to users.

    So, we built one ourselves.

    Building Shop’s Caching Solution

    A flow diagram showing the state of the Shop app before adding a caching solution
    Before introducing the cache, when a user would make a request to load the home feed, the Rails application would serially execute multiple database queries, which had high latency.
    A flow diagram showing the state of the Shop app after the caching solution is introduced

    After introducing the cache, when a user makes a request to load their home feed, Rails loads their home feed from the cache and makes far fewer (and much faster) database requests.

    Rather than querying the database every time a user requests the home feed, we now cache a copy of their home feed in a fast, distributed, horizontally-scaled caching system (we chose Memcached). Then we serve from the cache rather than the database at request time, provided certain conditions are met. To keep the cache valid and correct, before each database update we mark the cache as “invalid” so the cached data isn’t used while the cache and database are out of sync. After the write is complete, we update the cache with the new data and mark it as “valid” again.

    A flow diagram showing how Shop app updates the cache
    When Shop receives a shipping update from a carrier, we first mark the cache as invalid, then update the database, and then update the cache and mark it as valid.

    Deciding on Memcached

    At Shopify, we use two different technologies for caching: Memcached and Redis. Redis is more powerful than Memcached, supporting more complex operations and storing more complex objects. Memcached is simpler, has less overhead, and is more widely used for caching inside Shop. While we use Redis for managing queues and some caches, we didn’t need Redis’ complexity, so we chose a distributed Memcached. 

    The primary issue we had to solve was ensuring the cache never contained stale records. We minimize the window in which the cache could be stale by using a write-through invalidation policy that invalidates the cache before a database write and revalidates it after the write succeeds. That led to the next hard question: how do we actually store the data in Memcached and handle concurrent updates?

    The naive approach would be to store a single key for each user in Memcached that maps a user to their home feed. Then, on write, invalidate the cache by evicting the key from the cache, make the database update, and finally revalidate the cache by writing the key again. The issue, unfortunately, is that there’s no support for concurrent writes. At Shop’s scale, multiple worker machines often concurrently process order updates for the same user. Using a delete-then-write strategy introduces race conditions that could lead to an incorrect cache, which is unacceptable.

    To support concurrent writes, we store an additional key/value pair (pending writes key) that tracks the validity of the cache for each user. The key stores the number of active writes to a given user’s home feed. Each time a worker machine is about to write to the database, we increment this value. When the update is complete, we decrement the value. This means the cache is valid when the pending writes key is zero.

    However, there’s one final case. What happens if a machine makes a database update and fails to decrement the pending writes key due to an interrupt or exception? How can we tell whether the pending writes key is greater than zero because a database write is currently in progress, or because a process was interrupted?

    The solution is introducing a key with a short expiry that’s written before any database update. If this key exists, we know a database update may be in progress. If it doesn’t and the pending writes key is greater than zero, we know there’s no active database write occurring, so it’s safe to rewarm the cache.
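    To make the scheme concrete, here’s a minimal sketch of that bookkeeping. It assumes a Memcached-backed Rails.cache; the class name, key names, and TTL are illustrative rather than Shop’s actual implementation.

        class HomeFeedCache
          WRITE_MARKER_TTL = 30.seconds # hypothetical short expiry for the "write in progress" marker

          def initialize(user_id)
            @feed_key        = "home_feed:#{user_id}"
            @pending_key     = "home_feed:#{user_id}:pending_writes"   # counter, assumed seeded to 0 (raw: true) when the feed is first cached
            @in_progress_key = "home_feed:#{user_id}:write_in_progress"
          end

          # Called before a database update: mark the cache invalid.
          def mark_invalid
            Rails.cache.write(@in_progress_key, true, expires_in: WRITE_MARKER_TTL)
            Rails.cache.increment(@pending_key) # anything above zero means "don't trust the cache"
          end

          # Called after the update commits: refresh the feed and mark the cache valid again.
          def revalidate(fresh_feed)
            Rails.cache.write(@feed_key, fresh_feed)
            Rails.cache.decrement(@pending_key)
          end

          # Serve from the cache only when no writes are pending.
          def valid?
            Rails.cache.read(@pending_key).to_i.zero?
          end

          # A counter stuck above zero with no live write marker means a worker died
          # mid-update, so it's safe to recompute and rewarm the feed.
          def safe_to_rewarm?
            !valid? && Rails.cache.read(@in_progress_key).nil?
          end
        end

    At read time, the application serves the cached feed only when valid? returns true; otherwise it falls back to the database and, if safe_to_rewarm? allows it, rewarms the cache.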

    Another interesting detail is that this code needed to integrate seamlessly with all of our existing code in Shop. We wrote a series of Active Record concerns that we mixed into the relevant database records. Using concerns meant the ORM’s API stayed exactly the same, making the change completely transparent to developers and ensuring that all of this code was forward compatible. When Shop Pay became available to anyone selling on Google or Facebook, we were able to integrate the caching with minimal overhead.
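    As a rough illustration of how such an integration could look, here’s a hypothetical concern built on the HomeFeedCache sketch above; the module, callback, and helper names are invented for the example.

        module HomeFeedCacheable
          extend ActiveSupport::Concern

          included do
            before_save  :invalidate_home_feed_cache  # mark the cache invalid before the write
            after_commit :revalidate_home_feed_cache  # refresh and revalidate once the write lands
          end

          private

          def home_feed_cache
            HomeFeedCache.new(user_id) # assumes the record knows which user's feed it belongs to
          end

          def invalidate_home_feed_cache
            home_feed_cache.mark_invalid
          end

          def revalidate_home_feed_cache
            home_feed_cache.revalidate(HomeFeedBuilder.build_for(user_id)) # hypothetical feed builder
          end
        end

        class Order < ApplicationRecord
          include HomeFeedCacheable
        end

    Because the callbacks live inside the concern, models that include it keep their normal Active Record API, and developers never have to think about the cache when writing to the database.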

    The Rollout Strategy

    Another important piece of this project was the rollout. Once we’d built the caching logic and integrated it with the ORM, we had to ship the cache to users. Theoretically sound, unit-tested code is a good first step, but without real world data, we weren’t confident enough in our system to deploy this cache without strict testing. We wanted to validate our hypothesis that it would never serve stale data to users.

    So, over the course of a few weeks, we ran an experiment. First, we turned on all the cache writing and updating logic (but not the logic to serve data from the cache) and tested at scale. Once we knew that system was durable and scalable, we tested its correctness. At home feed serve time, our backend loaded from both the cache and the database, compared their data, and logged to a dashboard if there was a discrepancy. After letting this experiment run for a few weeks and fixing the issues that arose, we were confident in our system’s correctness and scalability. We knew the cache would always be valid and would never serve users stale or incorrect data.

    After rolling this cache out globally, we saw immediate, impactful results. In addition to the lower database load and faster home feed performance, we observed a double-digit decrease in overall CPU usage and a 20% decrease in our overall GraphQL latency. Our database servers have a lighter load, our users have a faster experience, and our developers don’t need to worry about high database load. It’s a win-win-win.

    Ryan Ehrlich is a software engineer living in Palo Alto, California. He focuses on solving problems in large scale, distributed systems, and CV/NLP AI research. Outside of work, he’s an avid rock climber, cyclist, and hiker.


    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.


    Introducing LinNét: Using Rich Image and Text Data to Categorize Products at Scale

    The last time we discussed product categorization on this blog, Shopify was powering over 1M merchants. We have since grown and currently serve over 1.7 million merchants who sell billions of products across a diverse set of industries. With this influx of new merchants, we decided to reevaluate our existing product categorization model to ensure we’re understanding what our merchants are selling, so we can build the best products that help power their sales.

    To do this, we considered two metrics of highest importance:

    1. How often were our predictions correct? To answer this question, we looked at the precision, recall, and accuracy of the model. This should be familiar to anyone with prior experience with classification machine learning models. For the sake of simplicity, let’s call this set of metrics “accuracy”. These metrics are calculated using a holdout set to ensure an unbiased measurement.
    2. How often do we provide a prediction? Our existing model filters out predictions below a certain confidence threshold to ensure we only provide predictions we’re confident about. So, we defined a metric called “coverage”: the ratio of the number of products with a prediction to the total number of products.

    In addition to these two metrics, we also care about how these predictions are consumed and whether we’re providing the right access patterns and SLAs to satisfy all use cases. For example, we might want to provide low-latency, real-time predictions to our consumers.

    After evaluating our model against these metrics and taking into account the various data products we were looking to build, we decided to build a new model to improve our performance. As we approached the problem, we reminded ourselves of the existing model’s blind spots, such as using only textual features for prediction and only understanding products written in English.

    In this post, we’ll discuss how we evolved and modernized our product categorization model, increasing our leaf precision by 8% while doubling our coverage. We’ll dive into the challenges of solving this problem at scale and the technical trade-offs we made along the way. Finally, we’ll describe a product that’s currently being used by multiple internal teams and our partner ecosystem to build derivative data products. We named this version of the model LinNét, after Carl von Linné, who is considered the father of modern taxonomies.

    Why Is Product Categorization Important?

    Before we discuss the model, let’s recap why product categorization is an important problem to solve.

    Merchants sell a variety of products on our platform, and these products are sold across different sales channels. We believe that the key to building the best products for our merchants is to understand what they’re selling. For example, by classifying all the products our merchants sell into a standard set of categories, we can build features like better search and discovery across all channels and personalized insights to support merchants’ marketing efforts.

    Our current categorization model uses the Google Product Taxonomy (GPT). The GPT is a list of over 5,500 categories that help us organize products. Unlike a traditional flat list of categories or labels that’s common to most classification problems, the GPT has a hierarchical tree structure. Both the sheer number of categories in the taxonomy and the complex structure and relationship between the different classes make this a hard problem to model and solve.

    Sample branch from the GPT with the example of Animals & Pet Supplies classification

    The Model

    Before we could dive into creating our improved model, we had to take into account what we had to work with by exploring the product features available to us. Below is an example of the product admin page you would see in the backend of a Shopify merchant’s store:

    The product admin page in the backend of a Shopify store

    The image above shows the product admin page in the Shopify admin. We’ve highlighted the features that can help us identify what the product is: the title, description, vendor, product type, collection, tags, and the product images.

    Clearly we have a few features that can help us identify what the product is, but nothing in a structured format. For example, multiple merchants selling the same product can use different values for Product Type. While this provides a lot of flexibility for the merchant to organize their inventory internally, it creates a harder problem in categorizing and indexing these products across stores.

    Broadly speaking we have two types of features available to us:

     

    Text Features

    • Product Title 
    • Product Description
    • Product Type
    • Product Vendor
    • Product Collections 
    • Product Tags

    Visual Features

    • Product Images

     

    These are the features we worked with to categorize the products.

    Feature Vectorization

    To start off, we had to choose vectorization approaches for our features, since neither text nor image features can be used by most machine learning models in their raw state. After a lot of experimentation, we moved forward with transfer learning using neural networks. We used pre-trained image and text models to convert our raw features into embeddings to be used for our hierarchical classification. This approach provided us with the flexibility to incorporate several principles that we’ll discuss in detail in the next section.

    We horse raced several pre-trained models to decide which models to use for image and text embeddings. The parameters to consider were both model performance and computational cost. As we balanced out these two parameters, we settled on the choice of:

    • Multi-Lingual BERT for text 
    • MobileNet-V2 for images

    Model Architecture

    As explained in our previous post, hierarchical classification problems present additional challenges beyond a flat multi-class problem. We had two lessons from our previous attempts at solving this problem:

    1. Preserving the multi-class nature of this problem is extremely beneficial in making predictions. For example: Level 1 in the taxonomy has 21 different class labels compared to more than 500 labels at Level 3.
    2. Learning parent nodes helps in predicting the child node. For example, if we look back at the image in our example of the Shopify product admin, it’s easier to predict the product as “Dog Beds”, if we’ve already predicted it as belonging to “Dog Supplies”.

    So, we went about framing the problem as a multi-task, multi-class classification problem in order to incorporate these learnings into our model.

    • Multi-Task: Each level of the taxonomy was treated as a separate classification problem and the output of each layer would be fed back into the next model to make the next level prediction. 
    • Multi-Class: Each level in the taxonomy contains a varying number of classes to choose from, so each task became a single multi-class classification problem. 
    Outline of model structure for the first 2 levels of the taxonomy

    The above image illustrates the approach we took to incorporate these lessons. As mentioned previously, we use pre-trained models to embed the raw text and image features and then feed the embeddings into multiple hidden layers before having a multi-class output layer for the Level 1 prediction. We then take the output from this layer along with the original embeddings and feed it into subsequent hidden layers to predict Level 2 output. We continue this feedback loop all the way until Level 7.

    Some important points to note:

    1. We have a total of seven output layers corresponding to the seven levels of the taxonomy. Each of these output layers has its own loss function associated with it. 
    2. During the forward pass of the model, parent nodes influence the outputs of child nodes.
    3. During backpropagation, the losses of all seven output layers are combined in a weighted fashion to arrive at a single loss value that’s used to calculate the gradients. This means that lower level performances can influence the weights of higher level layers and nudge the model in the right direction.
    4. Although we feed parent node prediction to child node prediction tasks in order to influence those predictions, we don’t impose any hard constraints that the child node prediction should strictly be a child of the previous level prediction. As an example the model is allowed to predict Level 2 as “Pet Supplies” even if it predicted Level 1 as “Arts & Entertainment”. We allow this during training so that accurate predictions at child nodes can nudge wrong predictions at the parent node in the right direction. We’ll revisit this point during the inference stage in a subsequent section.
    5. We can handle class imbalance using class weights during the training stage. The dataset we have is highly imbalanced, which makes it difficult to train a classifier that generalizes. Providing class weights lets us penalize errors more heavily on classes that have fewer samples, thereby overcoming the lack of observations in those classes.

    Model Training

    One of the benefits of Shopify's scale is the availability of large datasets to build great data products that benefit our merchants and their buyers. For product categorization, we have collected hundreds of millions of observations to learn from. But this also comes with its own set of challenges! The model we described above turns out to be massive in complexity, with over 250 million parameters. Combined with the size of our dataset, training this model in a reasonable amount of time is a challenging task: on a single machine it can run into multiple weeks, even with GPUs. We needed to bring down training time without sacrificing model performance.

    We decided to go with a data parallelization approach to solve this training problem. It would enable us to speed up the training process by chunking up the training dataset and using one machine per chunk to train the model. The model was built and trained using distributed Tensorflow using multiple workers and GPUs on Google Cloud Platform. We performed multiple optimizations to ensure that we utilized these resources as efficiently as possible.

    Model Inference and Predictions

    As described in the model architecture section, we don’t constrain the model to strictly follow the hierarchy during training. While this works during training, we can’t allow such behavior at inference time without jeopardizing a reliable and smooth experience for our consumers. To solve this problem, we incorporate additional logic during the inference step. The steps during prediction are as follows (a small sketch of the selection rule appears after the list):

    1. Make raw predictions from the trained model. This will return seven arrays of confidence scores. Each array represents one level of the taxonomy.
    2. Choose the category that has the highest confidence score at Level 1 and designate that as the Level 1 Prediction.
    3. Collect all the immediate descendants of the Level 1 prediction. From among these, choose the child that has the highest confidence score and designate this as the Level 2 prediction.
    4. Continue this process until we reach the Level 7 prediction.
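    To make the selection rule concrete, here’s a tiny, framework-free sketch of the same greedy, hierarchy-constrained decoding, using a made-up two-level slice of the taxonomy and made-up confidence scores (in production this logic runs as Tensorflow operations, as described below):

        # Toy slice of the taxonomy: each root category maps to its immediate children.
        TAXONOMY = {
          "Animals & Pet Supplies" => ["Pet Supplies", "Live Animals"],
          "Arts & Entertainment"   => ["Hobbies & Creative Arts", "Party & Celebration"],
        }

        level1_scores = { "Animals & Pet Supplies" => 0.7, "Arts & Entertainment" => 0.3 }
        level2_scores = {
          "Pet Supplies" => 0.5, "Live Animals" => 0.1,
          "Hobbies & Creative Arts" => 0.3, "Party & Celebration" => 0.1,
        }

        # Level 1: pick the highest-confidence root category.
        level1 = level1_scores.max_by { |_category, score| score }.first

        # Level 2: restrict the choice to the children of the Level 1 prediction,
        # then pick the highest-confidence child.
        children = TAXONOMY.fetch(level1)
        level2   = level2_scores.slice(*children).max_by { |_category, score| score }.first

        p [level1, level2] # => ["Animals & Pet Supplies", "Pet Supplies"]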

    We perform the above logic as Tensorflow operations and build a Keras subclass model to combine these operations with the trained model. This allows us to have a single Tensorflow model object that contains all the logic used in both batch and online inference.

    Schematic of subclassed model including additional inference logic

    The image above illustrates how we build a Keras subclass model to take the raw trained Keras functional model and attach it to a downstream Tensorflow graph to do the recursive prediction.

    Metrics and Performance

    We collected a suite of different metrics to measure the performance of a hierarchical classification model. These include:

    • Hierarchical accuracy
    • Hierarchical precision
    • Hierarchical recall
    • Hierarchical F1
    • Coverage

    In addition to gains in all the metrics listed above, the new model classifies products in multiple languages and isn’t limited to only products with English text, which is critical for us as we further Shopify's mission of making commerce better for everyone around the world.

    In order to ensure only the highest quality predictions are surfaced, we impose varying thresholds on the confidence scores at different levels to filter out low confidence predictions. This means not all products have predictions at every level.

    An example of this is shown in the image below:

    Smart thresholding

    The image above illustrates how the photo of the dog bed results in four levels of predictions. The first three levels all have high confidence scores and will be exposed. The fourth level prediction falls below our minimum confidence requirement, so we don’t expose anything beyond the third level.

    One thing we learned during this process was how to tune the model so that these different metrics were balanced in an optimal way. We could, for example, achieve higher hierarchical precision at the cost of lower coverage. These are hard decisions that require understanding our business use cases and priorities. We can’t emphasize enough how vital it is to focus on the business use cases and the merchant experience to guide us; we optimized towards reducing negative merchant experiences and friction. While metrics are a great indication of model performance, we also conducted spot checks and manual QA on our predictions to identify areas of concern.

    An example is how we paid close attention to model performance on items that belonged to sensitive categories like “Religious and Ceremonial”. While overall metrics might look good, they can also mask model performance in small pockets of the taxonomy that can cause a lot of merchant friction. We manually tuned thresholds for confidences to ensure high performance in these sensitive areas. We encourage the reader to also adopt this practice in rolling out any machine learning powered consumer facing data product.

    Where Do We Go From Here?

    The upgrade from the previous model gave us a boost in both precision and coverage. At a high level, we increased precision by eight percent while almost doubling coverage, which means we have more accurate predictions for many more products. While we improved the model and delivered a robust product to benefit our merchants, we believe we can improve it further. Some areas of improvement include:

    • Data Quality: While we have a massive, rich dataset of labelled products, it’s highly imbalanced. We can address imbalance using well-known techniques like class weights and over/undersampling, but we also believe we should collect fresh data points in areas where we currently don’t have enough. As Shopify grows, the products our merchants sell get more diverse by the day, so we’ll need to keep collecting data in these new categories and sections of the taxonomy.
    • Merchant Level Features: The current model focuses on product-level features. While this is the most obvious place to start, there are also many signals that don’t strictly belong to an individual product but roll up to the merchant level and can help us make better predictions. A simple example is a hypothetical merchant called “Acme Shoe Warehouse”: the store’s name alone strongly hints at what it sells.

    Kshetrajna Raghavan is a data scientist on Shopify's Commerce Algorithms team. He enjoys solving complex problems with machine learning at scale. He lives in the San Francisco Bay Area with his wife and two dogs. Connect with Kshetrajna on LinkedIn to chat.


    If you’re passionate about solving complex problems at scale, and you’re eager to learn more - we are always hiring! Reach out to us or apply on our careers page.


    A Kotlin Style .copy Function for Swift Structs

    Working in Android using Kotlin, we tend to create classes with immutable fields. This is quite nice when creating state objects, as it prevents parts of the code that interpret state (for rendering purposes, etc.) from modifying it. This lends itself to better clarity about where values originate, fewer bugs, and easier, more focused testing.

    We use Kotlin’s data class to create immutable objects. If we need to overwrite existing field values in one of our immutable objects, we use the data class’s .copy function to set a new value for the desired field while preserving the rest of the values. Then we’d store this new copy of the object as the source of truth.

    While trying to bring this immutable object concept to our iOS codebase, I discovered that Swift’s struct isn’t quite as convenient as Kotlin’s data class because Swift’s struct doesn't have a similar copy function. To adopt this immutability pattern in Swift, you’ll have to write quite a lot of boilerplate code. 

    Initializing a New Copy of the Struct

    If you want to change one or more properties of a given struct but preserve the other property values (as Kotlin’s data class provides), you’ll need an initializer that allows you to specify all the struct’s properties. The default initializer gives you this ability… until you set a default value for a property in the struct or define your own init. Once you do either, you lose that default init provided by the compiler.

    So the first step is defining an init that captures every field value.

    Overriding Specific Property Values

    Using the init function above, you take your current struct and set every field to the current value, except the values you want to overwrite. This can get cumbersome, especially when your struct has numerous properties, or contains properties that are also structs.

    So the next step is to define a .copy function that accepts new values for its properties, but defaults to using the current values unless specified. The copy function takes optional parameter values and defaults all params to nil unless specified. If the param is non-nil, it sets that value in the new copy of the struct, otherwise it defaults to the current state’s value for the field.

    Not So Fast, What About Optional Properties?

    That works pretty well… until you have a struct with optional fields. Then things don’t work as expected. What about the case where you have a non-nil value set for an optional property and you want to set it to nil? Uh-oh, the .copy function will always default to the current value when it receives nil for a param.

    What if, rather than making the params in the copy function optional, we just set each default value to the struct’s current value? That’s how Kotlin solves this problem in its data class; in Swift, it would look like this:

    Unfortunately in Swift you can’t reference self in default parameter values, so that’s not an option. I needed an alternate solution. 

    An Alternate Solution: Using a Builder

    I found a good solution on Stack Overflow: using a functional builder pattern to capture the override values for the new copy of the struct, while using the original struct’s values as input for the rest of the properties.

    This works a little differently: instead of a simple copy function that accepts params for our fields, we define a closure that receives the builder as the sole argument and allows you to set overrides for selected properties.

    And voilà, it’s not quite as convenient as Kotlin’s data class and its copy function, but it’s pretty close.

    Sourcery—Automating All the Boilerplate Code

    Using the Sourcery code generator for Swift, I wrote a stencil template that generates an initializer for the copy function, as well as the builder for a given struct:

    Scott Birksted is a Senior Development Manager for the Deliver Mobile team that focuses on Order and Inventory Management features in the Shopify Mobile app for iOS and Android. Scott has worked in mobile development since its infancy (pre-iOS/Android) and is passionate about writing testable extensible mobile code and first class mobile user experiences.


    We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.


    5 Steps for Building Machine Learning Models for Business

    By Ali Wytsma and C. Carquex

    Over the last decade, machine learning underwent a broad democratization. Countless tutorials, books, lectures, and blog articles have been published related to the topic. While the technical aspects of how to build and optimize models are well documented, very few resources are available on how developing machine learning models fits within a business context. When is it a good idea to use machine learning? How to get started? How to update a model over time without breaking the product?

    Below, we’ll share five steps and supporting tips on approaching machine learning from a business perspective. We’ve used these steps and tips at Shopify to help build and scale our suite of machine learning products. They may look simple, but when used together they give a straightforward workflow to help you productionize models that actually drive impact.

    A flow diagram representing the five steps for building machine learning models for business as discussed in the article.
    Guide for building machine learning models

    1. Ask Yourself If It’s the Right Time for Machine Learning?

    Before starting the development of any machine learning model, the first question to ask is: should I invest resources in a machine learning model at this time? It’s tempting to spend lots of time on a flashy machine learning algorithm. This is especially true if the model is intended to power a product that is supposed to be “smart”. Below are two simple questions to assess whether it’s the right time to develop a machine learning model:

    a. Will This Model Be Powering a Brand New Product?

    Launching a new product requires a tremendous amount of effort, often with limited resources. Shipping a first version, understanding product fit, figuring out user engagement, and collecting feedback are critical activities to be performed. Choosing to delay machine learning in these early stages allows resources to be freed up and focused instead on getting the product off the ground.

    Plans for setting up the data flywheel and how machine learning can improve the product down the line are recommended. Data is what makes or breaks any machine learning model, and having a solid strategy for data collection will serve the team and product for years to come. We recommend exploring what will be beneficial down the line so that the right foundations are put in place from the beginning, but holding off on using machine learning until a later stage.

    On the contrary, if the product is already launched and proven to solve the user’s pain points, developing a machine learning algorithm might improve and extend it.

    b. How Are Non-machine Learning Methods Performing?

    Before jumping ahead with developing a machine learning model, we recommend trying to solve the problem with a simple heuristic method. The performance of those methods is often surprising. A benefit to starting with this class of solution is that they’re typically easier and faster to implement, and provide a good baseline to measure against if you decide to build a more complex solution later on. They also allow the practitioner to get familiar with the data and develop a deeper understanding of the problem they are trying to solve.

    In 90 percent of cases, you can create a baseline using heuristics. Here are some of our favorites for various types of business problems:

    • Forecasting: For forecasting with time series data, moving averages are often robust and efficient.
    • Predicting Churn: Using a behavioural cohort analysis to determine user dropoff points is hard to beat.
    • Scoring: For scoring business entities (for example, leads and customers), a composite index based on two or three weighted proxy metrics is easy to explain and fast to spin up.
    • Recommendation Engines: Recommending content that’s popular across the platform, with some randomness to increase exploration and content diversity, is a good place to start.
    • Search: Stemming and keyword matching give a solid heuristic.

     2. Keep It Simple

    When developing a first model, the excitement of seeking the best possible solution often leads to adding unnecessary complexity early on: engineering extra features or choosing the latest popular model architecture can certainly provide an edge. However, they also increase the time to build, the overall complexity of the system, as well as the time it takes for a new team member to onboard, understand, and maintain the solution.

    On the other hand, simple models enable the team to rapidly build out the entire pipeline and de-risk any surprise that could appear there. They’re the quickest path to getting the system working end-to-end.

    At least for the first iteration of the model, we recommend being mindful of these costs by starting with the simplest approach possible. Complexity can always be added later on if necessary. Below are a few tips that help cut down complexity:

    Start With Simple Models

    Simple models contribute to iteration speed and allow for better understanding of the model. When possible, start with robust, interpretable models that train quickly (a shallow decision tree, linear regression, or logistic regression are three good initial choices). These models are especially valuable for getting buy-in from stakeholders and non-technical partners because they’re easy to explain. If such a model is adequate, great! Otherwise, you can move to something more complex later on. For instance, when training a model for scoring leads for our sales representatives, we noticed that the performance of a random forest model and a more complex ensemble model were on par. We ended up keeping the first one since it was robust, fast to train, and simple to explain.

    Start With a Basic Set of Features

    A basic set of features allows you to get up and running fast. You can defer most feature engineering work until it’s needed. Having a reduced feature space also means that computational tasks run faster with a quicker iteration speed. Domain experts often provide valuable suggestions for where to start. For example at Shopify, when building a system to predict the industry of a given shop, we noticed that the weight of the products sold was correlated with the industry. Indeed, furniture stores tend to sell heavier products (mattresses and couches) than apparel stores (shirts and dresses). Starting with these basic features that we knew were correlated allowed us to get an initial read of performance without going deep into building a feature set.

    Leverage Off-the-shelf Solutions

    For some tasks (in particular, tasks related to images, video, audio, or text), it’s essential to use deep learning to get good results. In this case, pre-trained, off-the-shelf models help build a powerful solution quickly and easily. For instance, for text processing, a pre-trained word embedding model that feeds into a logistic regression classifier might be sufficient for an initial release. Fine-tuning the embedding on the target corpus can come in a subsequent iteration, if there’s a need for it.

    3. Measure Before Optimizing

    A common pitfall we’ve encountered is starting to optimize machine learning models too early. While it’s true that thousands of parameters and hyper-parameters have to be tuned (with respect to the model architecture, the choice of a class of objective functions, the input features, etc), jumping too fast to that stage is counterproductive. Answering the two questions below before diving in helps make sure your system is set up for success.

    a. How is the Incremental Impact of the Model Going to Be Measured?

    Benchmarks are critical to the development of machine learning models. They allow for the comparison of performance. There are two steps to creating a benchmark, and the second one is often forgotten.

    Select a Performance Metric

    The metric should align with the primary objectives of the business. One of the best ways to do so is by building an understanding of what the value means. For instance, what does an accuracy of 98 percent mean in the business context? In the case of a fraud detection system, accuracy would be a poor metric choice, and 98 percent would indicate poor performance as instances of fraud are typically rare. In another situation, 98 percent accuracy could mean great performance on a reasonable metric.

    For comparison purposes, a baseline value for the performance metric can be provided by an initial non-machine learning method, as discussed in the Ask Yourself If It’s the Right Time for Machine Learning? section.

    Tie the Performance Metric Back to the Impact on the Business

    Design a strategy to measure the impact of a performance improvement on the business. For instance, if the metric chosen in step one is accuracy, the strategy chosen in step two should allow the quantification of how each percentage point increment impacts the user of the product. Is an increase from 0.8 to 0.85 a game changer in the industry or barely noticeable to the user? Are those 0.05 extra points worth the potential added time and complexity? Understanding this tradeoff is key to deciding how to optimize the model and drives decisions such as continuing or stopping to invest time and resources in a given model.

    b. Can You Explain the Tradeoffs That the Model Is Making?

    When a model appears to perform well, it’s easy to celebrate too soon and become comfortable with the idea that machine learning is an opaque box with magical performance. Based on experience, in about 95 percent of cases the magical performance is actually the symptom of an issue in the system. A poor choice of performance metric, data leakage, or an uncaught balancing issue are just a few examples of what could be going wrong.

    Being able to understand the tradeoffs behind the performance of the model will allow you to catch any issues early, and avoid wasting time and compute cycles on optimizing a faulty system. One way to do this is by investigating the output of the model, and not just its performance metrics:

    • Classification system: What does the confusion matrix look like? Does the balancing of classes make sense?
    • Regression model: What does the distribution of residuals look like? Is there any apparent bias?
    • Scoring system: What does the distribution of scores look like? Are they all grouped toward one end of the scale?

     

    Example

    Order dataset, prediction accuracy: 98%

                                      Actual: fraudulent    Actual: not fraudulent
    Predicted: fraudulent                      0                       0
    Predicted: not fraudulent                 20                   1,000

    Example of a model output with an accuracy of 98%. While 98% may look like a win, there are two issues at play:
    1. The model is consistently predicting “Order isn’t fraudulent”.
    2. Accuracy isn’t the appropriate metric to measure the performance of the model.

    Optimizing the model in this state doesn’t make sense; the metric needs to be fixed first.

    Optimizing the various parameters becomes simpler once the performance metric is set and tied to a business impact: the optimization stops when it doesn’t drive any incremental business impact. Similarly, by being able to explain the tradeoffs behind a model, errors that are otherwise masked by an apparent great performance are likely to get caught early.

    4. Have a Plan to Iterate Over Models

    Machine learning models evolve over time. They can be retrained at a set frequency. Their architecture can be updated to increase their prediction power or features can be added and removed as the business evolves. When updating a machine learning model, the roll out of the new model is usually a critical part. We must understand our performance relative to our baseline, and there should be no regression in performance. Here are a few tips that have helped us do this effectively:

    Set Up the Pipeline Infrastructure to Compare Models

    Models are built and rolled out iteratively. We recommend investing in building a pipeline to train and experimentally evaluate two or more versions of the model concurrently. Depending on the situation, there are several ways to evaluate new models. Two great methods are:

    • If it’s possible to run an experiment without surfacing the output in production (for example, for a classifier where you have access to the labels), having a staging flow is sufficient. For instance, we did this in the case of the shop industry classifier, mentioned in the Keep It Simple section. A major update to the model ran in a staging flow for a few weeks before we felt confident enough to promote it to production. When possible, running an offline experiment is preferable because if there are performance degradations, they won’t impact users.
    • An online A/B test works well in most cases. By exposing a random group of users to the new version of the model, we get a clear view of its impact relative to our baseline. As an example, for a recommendation system where our key metric is user engagement, we assess how engaged the users exposed to the new model version are compared to users seeing the baseline recommendations, to know if there’s a significant improvement.

    Make Sure Comparisons Are Fair

    Will the changes affect how the metrics are reported? As an example, in a classification problem, if the class balance is different between the set the model variant is being evaluated on and production, the comparison may not be fair. Similarly, if we’re changing the dataset being used, we may not be able to use the same population for evaluating our production model as our variant model. If there is bias, we try to change how the evaluations are conducted to remove it. In some cases, it may be necessary to adjust or reweight metrics to make the comparison fair.

    Consider Possible Variance in Performance Metrics

    One run of the variant model may not be enough to understand its impact. Model performance can vary due to many factors, like random parameter initialization or how data is split between training and testing. Verify its performance over time, between runs, and with minor differences in hyperparameters. If the performance is inconsistent, this could be a sign of bigger issues (we’ll discuss those in the next section!). Also, verify whether performance is consistent across key segments of the population. If that’s a concern, it may be worth reweighting the metric to prioritize key segments.

    Does the Comparison Framework Introduce Bias?

    It’s important to be aware of the risks of overfitting when optimizing, and to account for this when developing a comparison strategy. For example, using a fixed test data set can cause you to optimize your model to those specific examples. Incorporating practices like cross validation, changing the test data set, using a holdout, regularization, and running multiple tests whenever random initializations are involved into your comparison strategy helps to mitigate these problems.

    5. Consider the Stability of the Model Over Time

    One aspect that’s rarely discussed is the stability of prediction as a model evolves over time. Say the model is retrained every quarter, and the performance is steadily increasing. If our metric is good, this means that performance is improving overall. However, individual subjects may have their predictions changed, even as the performance increases overall. That may cause a subset of users to have a negative experience with the product, without the team anticipating it.

    As an example, consider a case where a model is used to decide whether a user is eligible for funding, and that eligibility is exposed to the user. If the user sees their status fluctuate, that could create frustration and destroy trust in the product. In this case, we may prefer stability over marginal performance improvements. We may even choose to incorporate stability into our model performance metric.

    Two graphs side by side representing model Q1 on the left and model Q2 on the right. The graphs highlight the difference between accuracy and how overfitting can change that.
    Example of the decision boundary of a model, at two different points in time. The symbols represent the actual data points and the class they belong to (red division sign or blue multiplication sign). The shaded areas represent the class predicted by the model. Overall the accuracy increased, but two samples out of the eight switched to a different class. It illustrates the case where the eligibility status of a user fluctuates over time. 

    Being aware of this effect and measuring it is the first line of defense. The causes vary depending on the context. This issue can be tied to a form of overfitting, though not always. Here’s our checklist to help prevent this:

    • Understand the costs of changing your model. Consider the tradeoff between the improved performance versus the impact of changed predictions, and the work that needs to be done to manage that. Avoid major changes in the model, unless the performance improvements justify the costs.
    • Prefer shallow models to deep models. For instance in a classification problem, a change in the training dataset is more likely to make a deep model update its decision boundary in local spots than a shallow model. Use deep models only when the performance gains are justified.
    • Calibrate the output of the model. Especially for classification and ranking systems. Calibration highlights changes in distribution and reduces them.
    • Check the conditioning of the objective function and the regularization. A poorly conditioned model has a decision boundary that changes wildly even if the training conditions change only slightly.

    The Five Factors That Can Make or Break Your Machine Learning Project

    To recap, there are a lot of factors to consider when building products and tools that leverage machine learning in a business setting. Carefully considering these factors can make or break the success of your machine learning projects. To summarize, always remember to:

    1. Ask yourself if it’s the right time for machine learning? When releasing a new product, it’s best to start with a baseline solution and improve it down the line with machine learning.
    2. Keep it simple! Simple models and feature sets are typically faster to iterate on and easier to understand, both of which are crucial for the first version of a machine learning product.
    3. Measure before optimizing. Make sure that you understand the ins and outs of your performance metric and how it impacts the business objectives. Have a good understanding of the tradeoffs your model is making.
    4. Have a plan to iterate over models. Expect to iteratively make improvements to the model, and make a plan for how to make good comparisons between new model versions and the existing one.
    5. Consider the stability of the model over time. Understand the impact stability has on your users, and take that into consideration as you iterate on your solution. 

    Ali Wytsma is a data scientist leading Shopify's Workflows Data team. She loves using data in creative ways to help make Shopify's admin as powerful and easy to use as possible for entrepreneurs. She lives in Kitchener, Ontario, and spends her time outside of work playing with her dog Kiwi and volunteering with programs to teach kids about technology.

    Carquex is a Senior Data Scientist for Shopify’s Global Revenue Data team. Check out his last blog on 4 Tips for Shipping Data Products Fast.


    We hope this guide helps you in building robust machine learning models for whatever business needs you have! If you’re interested in building impactful data products at Shopify, check out our careers page.


    Diggin’ and Fetchin’ with TruffleRuby

    Sometimes as a software developer you come across a seemingly innocuous piece of code that, when investigated, leads you down a rabbit hole much deeper than you anticipated. This is the story of such a case.

    It begins with some clever Ruby code that we want to refactor, and ends with a prototype solution that changes the language itself. Along the way, we unearth a performance issue in TruffleRuby, an implementation of the Ruby language, and with it, an opportunity to work at the compiler level to smooth out the performance cliff. I’ll share this story with you.

    A Clever Way to Fetch Values from a Nested Hash

    The story begins with some Ruby code that struck me as a little bit odd. This was production code seemingly designed to extract a value from a nested hash, though it wasn’t immediately clear to me how it worked. I’ve changed names and values, but this is functionally equivalent to the code I found:
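    Its shape was roughly the following; the hash contents and the :identity key are invented for illustration, and IdentityObject stands in for the real default object.

        # Stand-ins for the real objects and keys.
        class IdentityObject; end

        data = { response: { identity: IdentityObject.new } }

        # Falls back to an empty hash, then to a default IdentityObject,
        # rather than ever raising a KeyError.
        value = data.fetch(:response, {}).fetch(:identity, IdentityObject.new)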

    Two things specifically stood out to me. Firstly, when extracting the value from the data hash, we’re calling the same method, fetch, twice and chaining the two calls together. Secondly, each time we call fetch, we provide two arguments, though it isn’t immediately clear what the second argument is for. Could there be an opportunity to refactor this code into something more readable?

    Before I start thinking about refactoring, I have to make sure I understand what’s actually going on here. Let’s do a quick refresher on fetch.

    About Fetch

    The Hash#fetch method is used to retrieve a value from a hash by a given key. It behaves similarly to the more commonly used [ ] syntax, which is itself a method and also fetches values from a hash by a given key. Here’s a simple example of both in action.
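    For example, with a small illustrative hash:

        order = { id: 123, status: "shipped" }

        order[:status]       # => "shipped"
        order.fetch(:status) # => "shipped"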

    Like we saw in the production code that sparked our investigation initially, you can chain calls to fetch together, like you would using [ ], to extract a value from nested key-value pairs.

    Now, this works nicely assuming that each chained call to fetch returns a hash itself. But what if it doesn’t? Well, fetch will raise a KeyError.

    This is where our optional second argument comes in. Fetch accepts an optional second argument that serves as a default value if a given key can’t be found. If you provide this argument, you get it back instead of a KeyError being raised.

    Helpfully, you can also pass a block to make the default value more dynamic.
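    Putting those behaviours side by side with an illustrative hash:

        user = { profile: { name: "Jane" } }

        user.fetch(:profile).fetch(:name)            # => "Jane"
        user.fetch(:settings).fetch(:theme)          # raises KeyError: key not found: :settings
        user.fetch(:settings, {}).fetch(:theme, nil) # => nil, thanks to the defaults
        user.fetch(:settings) { |key| "no #{key}" }  # => "no settings", via the block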

    Let’s loop back around to the original code and look at it again now that we’ve had a quick refresher on fetch.

    The Refactoring Opportunity

    Now, it makes a little more sense as to what’s going on in the original code we were looking at. Here it is again to remind you:

    The first call to fetch is using the optional default argument in an interesting way. If our data hash doesn’t have a response key, instead of raising a KeyError, it returns an empty hash. In this scenario, by the time we’re calling fetch the second time, we’re actually calling it against an empty hash.

    Since an empty hash has no key-value pairs, this means when we evaluate the second call to fetch, we always get the default value returned to us. In this case, it’s an instance of IdentityObject.

    While a clever workaround, I feel this could look a lot cleaner. What if we reduced a chained fetch into a single call to fetch, like below?

    Well, there’s a precedent for this, actually, in the form of the Hash#dig method. Could we refactor the code using dig? Let’s do a quick refresher on this method before we try.

    About Dig

    Dig acts similarly to the [ ] and fetch methods. It’s a method on Ruby hashes that allows for the traversing of a hash to access nested values. Like [ ], it returns nil when it encounters a missing key. Here’s an example of how it works.
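    For example:

        user = { profile: { address: { city: "Ottawa" } } }

        user.dig(:profile, :address, :city) # => "Ottawa"
        user.dig(:profile, :phone, :area)   # => nil (missing keys return nil instead of raising)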

    Now, if we try to refactor our initial code with dig, we can already make it look a lot cleaner and more readable.
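    Reusing the stand-ins from the earlier snippet, the chained fetch collapses into something like:

        value = data.dig(:response, :identity) || IdentityObject.new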

    Nice. With the refactor complete, I’m thinking, mission accomplished. But...

    Versatile Fetch

    One thing continues to bother me. dig just doesn’t feel as versatile as fetch does. With fetch you can choose between raising an error when a key isn’t found, returning nil, or returning a default in a more readable and user-friendly way.

    Let me show you what I mean with an example.
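    For instance, with an illustrative settings hash:

        settings = { theme: "dark", locale: nil }

        settings.fetch(:timezone)        # raises KeyError: the key really is missing
        settings.fetch(:timezone, nil)   # => nil, without raising
        settings.fetch(:timezone, "UTC") # => "UTC", a readable default
        settings.fetch(:locale)          # => nil: the key exists with an explicit nil value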

    Fetch is able to handle multiple control flow scenarios handily. With dig, this is more difficult because you’d have to raise a KeyError explicitly to achieve the same behaviour. In fact, you’d also have to add logic to make a determination about whether the key doesn’t exist or has an explicitly set value of nil, something that fetch handles much better.

    So, what if Ruby hashes had a method that combined the flexibility of fetch with the ability to traverse nested hashes like dig is able to do? If we could do that, we could potentially refactor our code to the following:
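    Reusing the same stand-ins, the call we’d like to be able to write is something like:

        value = data.dig_fetch(:response, :identity) { IdentityObject.new }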

    Of course, if we want to add this functionality, we have a few options. The simplest one is to monkey patch the existing implementation of Ruby#Hash and add our new method to it. This lets me test out the logic with minimal setup required.
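    A monkey patch along these lines is enough to try the idea out; it’s a plain-Ruby sketch, not the eventual TruffleRuby implementation, and the example hashes are made up.

        class Hash
          def dig_fetch(*keys)
            keys.reduce(self) do |hash, key|
              if hash.is_a?(Hash) && hash.key?(key)
                hash[key]
              else
                return yield(key) if block_given?
                raise KeyError, "key not found: #{key.inspect}"
              end
            end
          end
        end

        { shop: { name: "Snowdevil" } }.dig_fetch(:shop, :name)  # => "Snowdevil"
        { shop: {} }.dig_fetch(:shop, :name) { "unknown" }       # => "unknown"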

    There’s also another option. We can try to add this new functionality to the implementation of the Ruby language itself. Since I’ve never made a language level change before, and because it seems more fun to go with option two, I decided to see how hard such a change might actually be.

    Adding a New Method to Ruby Hashes

    Making a language level change seems like a fun challenge, but it’s a bit daunting. Most of the standard implementation of the Ruby language is written using C. Working in C isn’t something I have experience with, and I know enough to know the learning curve would be steep.

    So, is there an option that lets us avoid having to dive into writing or changing C code, but still allows us to make a language level change? Maybe there’s a different implementation of Ruby we could use that doesn’t use C?

    Enter TruffleRuby.

    TruffleRuby is an alternative implementation of the Ruby programming language built for GraalVM. It uses the Truffle language implementation framework and the GraalVM compiler. One of the main aims of the TruffleRuby language implementation is to run idiomatic Ruby code faster. Currently it isn’t widely used in the Ruby community. Most Ruby apps use MRI or other popular alternatives like JRuby or Rubinius.

    However, the big advantage is that parts of the language are themselves written in Ruby, making working with TruffleRuby much more accessible for folks who are proficient in the language already.

    After getting set up with TruffleRuby locally (you can do the same using the contributor guide), I jumped into trying to make the change.

    Implementing Hash#dig_fetch in TruffleRuby

    The easiest way to prototype our new behaviour is to add a brand new method on Ruby hashes in TruffleRuby. Let’s start with the very simple happy case, fetching a single value from a given hash. We’ll call our method dig_fetch, at least for our prototype.

    Here’s how it works.
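    A sketch of that first happy-path version might be as simple as the following (in TruffleRuby it lives in the Ruby-level Hash implementation):

        class Hash
          # Happy path only: fetch a single key, no error handling yet.
          def dig_fetch(key)
            self[key]
          end
        end

        { shop: "Snowdevil" }.dig_fetch(:shop) # => "Snowdevil"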

    Let’s add a little more functionality. We’ll keep in line with fetch and make this method raise a KeyError if the current key isn’t found. For now, we just format the KeyError the same way that the fetch method has done it.
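    Something along these lines (still a sketch, and deliberately using a nil check for now):

        class Hash
          def dig_fetch(key)
            value = self[key]
            # Mirror fetch's error for a missing key. Note that an explicitly nil
            # value trips this too, which is the problem discussed next.
            raise KeyError, "key not found: #{key.inspect}" if value.nil?
            value
          end
        end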

    You may have noticed that there’s still a problem here. With this implementation, we won’t be able to handle the scenario where keys are explicitly set to nil, as they raise a KeyError as well. Thankfully, TruffleRuby has a way to deal with this that’s showcased in its implementation of fetch.

    Below is how the body of the fetch method starts in TruffleRuby. You see that it uses a module called Primitive, which exposes the methods hash_get_or_undefined and undefined?. For the purposes of this post we won’t go into detail about how this module works; just know that these methods allow us to distinguish between explicit nil values and keys that are missing from the hash. We can use this same strategy in dig_fetch to get around our problem of keys existing but containing nil values.

    Now, when we update our dig_fetch method, it looks like this:
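    Roughly like the following; this sketch only works inside TruffleRuby’s core library, where the Primitive module is available.

        class Hash
          def dig_fetch(key)
            value = Primitive.hash_get_or_undefined(self, key)
            # "Undefined" means the key is truly missing, as opposed to present with a nil value.
            raise KeyError, "key not found: #{key.inspect}" if Primitive.undefined?(value)
            value
          end
        end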

    And here is our updated dig_fetch in action.
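    For example:

        inventory = { sku: nil }

        inventory.dig_fetch(:sku)     # => nil (the key exists, so the explicit nil comes back)
        inventory.dig_fetch(:barcode) # raises KeyError: key not found: :barcode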

    Finally, let’s add the ability to ‘dig’ into the hash. We take inspiration from the existing implementation of dig and write this as a recursive call to our dig_fetch method.
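    Sketched out, the recursive version looks something like this (nested values are assumed to be hashes, to keep the sketch short):

        class Hash
          def dig_fetch(key, *more)
            value = Primitive.hash_get_or_undefined(self, key)
            raise KeyError, "key not found: #{key.inspect}" if Primitive.undefined?(value)
            # Recurse into the nested hash for any remaining keys.
            more.empty? ? value : value.dig_fetch(*more)
          end
        end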

    Here’s the behaviour in action:
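    With an illustrative nested hash:

        data = { shop: { products: { count: 3 } } }

        data.dig_fetch(:shop, :products, :count) # => 3
        data.dig_fetch(:shop, :orders, :count)   # raises KeyError: key not found: :orders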

    From here, it’s fairly easy to add the logic for accepting a default. For now, we just use blocks to provide our default values.
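    That version might look something like this, with a small usage example:

        class Hash
          def dig_fetch(key, *more, &default)
            value = Primitive.hash_get_or_undefined(self, key)
            if Primitive.undefined?(value)
              # Fall back to the caller's block instead of raising, if one was given.
              return yield(key) if block_given?
              raise KeyError, "key not found: #{key.inspect}"
            end
            more.empty? ? value : value.dig_fetch(*more, &default)
          end
        end

        { shop: {} }.dig_fetch(:shop, :products) { :no_products } # => :no_products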

    And tada, it works!

    So far, making this change has gone smoothly. But in the back of my mind, I’ve been thinking that any language level change would have to be justified with performance data. Instead of just making sure our solution works, we should make sure it works well. Does our new method hold up, performance-wise, to the other methods which extract values from a hash?

    Benchmarking—A Performance Cliff Is Found

    I figure it makes sense to test the performance of all three methods that we’ve been focusing on, namely, dig, fetch, and dig_fetch. To run our benchmarks, I’m using a popular Ruby library called benchmark-ips. As for the tests themselves, let’s keep them really simple.

    For each method, let’s look at two things:

    • How many iterations it can complete in x seconds. Let’s say x = 5.
    • How the depth of the provided hash might impact the performance. Let’s test hashes with three, six, and nine nested keys.

    This example shows how the tests are set up if we were testing all three methods to a depth of three keys.
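    The setup looked something like the following; the nested hash and exact configuration are illustrative.

        require 'benchmark/ips'

        HASH_3 = { a: { b: { c: 42 } } }

        Benchmark.ips do |x|
          x.config(time: 5, warmup: 2)
          x.report("fetch")     { HASH_3.fetch(:a).fetch(:b).fetch(:c) }
          x.report("dig")       { HASH_3.dig(:a, :b, :c) }
          x.report("dig_fetch") { HASH_3.dig_fetch(:a, :b, :c) }
          x.compare!
        end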

    Ok, let’s get testing.

    Running the Benchmark Tests

    We start by running the tests against hashes with a depth of three and it looks pretty good. Our new dig_fetch method performs very similarly to the other methods, knocking out about 458.69M iterations every five seconds.

    But uh-oh. When we double the depth to six (as seen below) we already see a big problem emerging. Our method's performance degraded severely. Interestingly, dig degraded in a very similar way. We used this method for inspiration in implementing our recursive solution, and it may have unearthed a problem with both methods.

    Let’s try running these tests on a hash with a depth of nine. At this depth, things have gotten even worse for our new method and for dig. We are now only seeing about 12.7M iterations every five seconds, whereas fetch is still able to clock about 164M.

    When we plot the results on a graph, you see how much more performant fetch is over dig and dig_fetch.

    Line graph of Performance of Hash methods in TruffleRuby

    So, what is going on here?

    Is Recursion the Problem?

    Let’s look at dig, the implementation of which inspired our dig_fetch method, to see if we can find a reason for this performance degradation. Here’s what it looks like, roughly.
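    Simplified, the recursive shape is something like:

        class Hash
          def dig(key, *more)
            value = self[key]
            if value.nil? || more.empty?
              value
            else
              raise TypeError, "#{value.class} does not have #dig" unless value.respond_to?(:dig)
              value.dig(*more) # recurse for the remaining keys
            end
          end
        end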

     

    The thing that really jumps out is that both dig and dig_fetch are implemented recursively. In fact, we used the implementation of dig to inspire our implementation of dig_fetch so we could achieve the same hash traversing behaviour.

    Could recursion be the cause of our issues?

    Well, it could be. An optimizing implementation of Ruby such as TruffleRuby attempts to combine recursive calls into a single body of optimized machine code, but there’s a limit to inlining—we can’t inline forever, or we’d produce infinite code. By contrast, an iterative solution with a loop starts out as a single body of optimized machine code in the first place.

    It seems we’ve uncovered an opportunity to fix the production implementation of dig in TruffleRuby. Can we do it by reimplementing dig with an iterative approach?

    Shipping an Iterative Approach to #dig

    Ok, so we know we want to optimize dig to be iterative and then run the benchmark tests again to test out our theory. I’m still fairly new to TruffleRuby at this point, and because this performance issue is impacting production code, it’s time to inform the TruffleRuby team of the issue. Chris Seaton, founder and maintainer of the language implementation, is available to help ship a fix for dig’s performance degradation problem. But first, we need to fix the problem.

    So, let’s look at dig again.

    To simplify things, let’s implement the iterative logic in a new package in TruffleRuby we will call Diggable. To be totally transparent, there’s a good reason for this, though one that we’ve glossed over in this post—dig is also available on Arrays and Structs in Ruby. By pulling out the iterative implementation into a shared package, we can easily update Array#dig, and Struct#dig to share the same behaviour later on. For now though, we focus on the Hash implementation.

    Inside Diggable, we make a method called dig and add a loop that iterates as many times as the number of keys that were passed to dig initially.
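    A sketch of that loop (the real Diggable module in TruffleRuby also handles Arrays and Structs, so details differ):

    module Diggable
      def self.dig(receiver, keys)
        current = receiver
        # Walk the keys with a loop instead of recursing.
        keys.each do |key|
          current = current[key]
          return nil if current.nil?
        end
        current
      end
    end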

    With this change, dig continues to work as expected and the refactor is complete.

    #dig Performance Revisited

    Now, let’s have a look at performance again. Things look much better for dig with this new approach.

    Our solution had a big impact on the performance of dig. Previously, dig could only complete ~2.5M iterations per second against a hash with nine nested keys, but after our changes it has improved to ~16M. You can see these results plotted below.

    Line graph of Performance of Hash#dig in TruffleRuby

    Awesome! And we actually ship these changes and see a positive performance impact in TruffleRuby. See Chris’ real PRs #2300 and #2301.

    Now that that’s out of the way, it’s time to apply the same process to our dig_fetch method and see if we get the same results.

    Back to Our Implementation

    Now that we’ve seen the performance of dig improve we return to our implementation and make some improvements. Let’s add to the same Diggable package we created when updating dig.

    The iterative implementation ends up being really similar to what we saw with dig.
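    A sketch, mirroring the iterative dig above (again simplified):

    module Diggable
      def self.dig_fetch(receiver, keys)
        current = receiver
        keys.each do |key|
          value = Primitive.hash_get_or_undefined(current, key)
          if Primitive.undefined?(value)
            raise KeyError, "key not found: #{key.inspect}"
          end
          current = value
        end
        current
      end
    end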

    After our changes we confirm that dig_fetch works. Now we can return to our benchmark tests and see whether our iterative approach has paid off again.

    Benchmarking, Again

    Performance is looking a lot better! dig_fetch is now performing similarly to dig.

    Below you can see the impact of the changes on performance more easily by comparing the iterative and recursive approaches. Our newly implemented iterative approach is much more performant than the existing recursive one, managing to execute ~15.5M times per second for a hash with nine nested keys when it only hit ~2.5M before.

    Line graph of Performance of Hash#dig in TruffleRuby

    Refactoring the Initial Code

    At this point, we’ve come full circle and can finally swap in our proposed change that set us down this path in the first place.

    One more reminder of what our original code looked like.

    And after swapping in our new method, things look much more readable. Our experimental refactor is complete!
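    The original snippet isn’t reproduced here, but as a purely hypothetical illustration of the shape of the change (the hash and keys are made up):

    # Before: chained fetches to keep the raise-on-missing-key behaviour.
    amount = order.fetch(:payment).fetch(:details).fetch(:amount)

    # After: one call that digs and still raises KeyError on a missing key.
    amount = order.dig_fetch(:payment, :details, :amount)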

    Final Thoughts

    Of course, even though we managed to refactor the code we found using dig_fetch, we cannot actually change the original production code that inspired this post to use it just yet. That’s because the work captured here doesn’t quite get us to the finish line: we ignored the interoperability of dig and fetch with two other data structures, Arrays and Structs. On top of that, if we actually wanted to add the method to TruffleRuby, we’d also want to make the same change to the standard implementation, MRI, and we would have to convince the Ruby community to adopt the change.

    That said, I’m happy with the results of this little investigation. Even though we didn’t add our dig_fetch method to the language for everyone to use, our investigation did result in real changes to TruffleRuby in the form of drastically improving the performance of the existing dig method. A little curiosity took us a long way.

    Thanks for reading!

    Julie Antunovic is a Development Manager at Shopify. She leads the App Extensions team and has been with Shopify since 2018.



    Modelling Developer Infrastructure Teams

    I’ve been managing Developer Infrastructure teams (alternatively known as Developer Acceleration, Developer Productivity, and other such names) for almost a decade now. Developer Infrastructure (which we usually shorten to “Dev Infra”) covers a lot of ground, especially at a company like Shopify that’s invested heavily in the area, from developer environments and continuous integration/continuous deployment (CI/CD) to frameworks, libraries, and various productivity tools.

    When I started managing multiple teams I realized I’d benefit from creating a mental model of how they all fit together. There are a number of advantages to doing this exercise, both for myself and for all team members, manager and individual contributor alike.

    First, a model helps clarify the links and dependencies between the various teams and domains. This in turn allows a more holistic approach to designing systems that mitigates siloed thinking that affects both our users and our developers. Seeing these links also lets us identify where there are gaps in our solutions and rough transitions.

    Second, it helps everyone feel more connected to a larger vision, which is important for engagement. Many people feel more motivated if they can see how their work fits into the big picture.

    There’s no single perfect model. Indeed, it’s helpful to have different models to highlight different relationships. Team structures also change and that can require rethinking connections. I’m going to discuss one way that I thought about my area last year, reflecting the org structure at that time. In fact, constructing this model actually helped me think through other ways of organizing teams and led to us implementing a new structure. Before we get into the model, though, here’s a very brief description of the teams that reported into me last year:

    • Local Environments: The team responsible for the tooling that helps get new and existing projects set up on a local machine (that is, a MacBook Pro). This includes cloning repositories, installing dependencies, and running backing services, amongst various other common tasks.
    • Cloud Environments: A relatively new team that was created to explore development on remote, on-demand systems.
    • Test Infrastructure: They’re in charge of our CI systems, continually improving them and trying new ideas to accommodate Shopify’s growth.
    • Deploys: These folks handle the final steps in the development process: merging commits into our main branches (we’ve outgrown GitHub’s standard process!), validating them on our canary systems, and promoting them out to production.
    • Web Foundations: We’ve got some big front-end codebases and thus a team dedicated to accelerating the development of React-based apps through various tools and libraries.
    • React Native Foundations: Similar to Web Foundations, but focused specifically on standardizing and improving how we build React Native apps.
    • Mobile Tooling: Mobile apps have quite a few differences from web apps, so this team specializes in building tools for our mobile devs.

    The Development Workflow

    Phases and teams, with areas of responsibility, of the standard development pipeline.

    One way to look at the Developer Infrastructure teams is as parts of the development workflow (or pipeline), which can be split into three discrete phases: develop, validate, and deploy.

    The Local Environments, Cloud Environments, Test Infrastructure, and Deploys teams each map to one phase. The scope of these teams remains broad, although the default support is for Ruby on Rails apps. See above for a graphical representation.

    Map of Web Foundations’ responsibilities to development phases

     

    By contrast, the applications and systems developed and supported by the Mobile Tooling, Web Foundations, and React Native Foundations teams span multiple phases. In the case of Web Foundations, much of this work focuses on the development phase (frameworks, tools, and libraries), but the team also maintains one application that’s executed as part of the validate phase, to monitor bundle sizes.

    Web Foundations builds on the systems supported by the Local and Cloud Environments, Test Infrastructure, and Deploys teams. Their work is complementary, adding specialized tooling for front-end development.

    Map of Mobile Tooling/React Native Foundations’ responsibilities to development phases

    The work of the Mobile Tooling and React Native Foundations teams spans all three phases. In this case, though, as seen in the image above, the deployment phase is independent of the generic workflow, given the very different release process for mobile apps.

    Horizontal and Vertical Integration

    We can further extend the workflow model by borrowing a concept from the business world to look at the relationships in these teams. In a manufacturing industry, horizontal integration means that the different points in the supply chain have specific, often large companies behind them. The producer, supplier, manufacturer, and so on are all separate entities, providing deep specialization in a particular area.

    One could view Local and Cloud Environments, Test Infrastructure, and Deploys as similarly horizontally integrated. The generic development workflow is the supply chain, and each of these teams is responsible for one part of it, that is, one phase of the workflow. Each specializes in the specific problem area involved in that phase by maintaining the relevant systems, implementing workflow optimizations, and scaling up solutions to meet the increasing amount of development activity.

    By contrast, vertical integration involves one company handling multiple parts of the supply chain. IKEA is an example of this model, as they own everything from forests to retail stores. Their entire supply chain specializes in a particular industry (furniture and other housewares), meaning they can take a holistic approach to their business.

    Mobile Tooling, Web Foundations, and React Native Foundations can be seen as similarly vertically integrated. Each is responsible for systems that collectively span two or all three phases of the workflow. As noted, these teams also rely on systems supported by the generic workflow, with their own specific solutions being either built on or sitting adjacent to them. So, they aren’t fully vertically integrated, but instead of being specialized in a phase of the development pipeline, these teams are subject matter experts in the development workflow of a particular technology. They build solutions along the workflow as required when the generic solutions are insufficient on their own.

    Analyzing Our Model

    Now, we can use the idea of a development workflow and the framework of horizontally and vertically integrated teams as a lens to pull together some interesting observations. First let’s look at the commonalities and contrasts.

    The work of each team in Dev Infra generally fits into one or more of the phases of the development workflow. This gives us a good scope for Dev Infra as a whole and helps distinguish us from other teams in our parent team, Accelerate. This in turn allows us to focus by pushing back on work that doesn’t really fit into this model. We made this Dev Infra’s mission statement: “Improving and scaling the develop–validate–deploy cycle for Shopify engineering.”

    An interesting contrast is that the horizontal teams have broad scale, while the vertical teams have broad scope. Our horizontal teams have to support engineering as a whole: virtually every developer interacts with our development environments, test infrastructure, and deploy systems. As a growing company, this means an increasing amount of usage and traffic. On the other side, our vertical teams specialize in smaller segments of the engineering population: those that develop mainly front-end and mobile apps. However, they’re responsible for specific improvements to the entire development workflow for those technologies, hence a broader scope.

    Further to this point, vertical teams have more opportunities for collaboration given their broad scope. However, there are also more situations where product teams go in their own directions to solve specific problems that Dev Infra can’t prioritize at a given moment. Therefore, it’s imperative for us to stay in close contact with product teams to ensure we aren’t duplicating work and to act as long-term stewards for infra projects that outgrow their teams. On the other side, horizontal teams get fewer outside contributions due to how deep and complex the infrastructure is to support our scale. However, there’s more consistency in its use as there are fewer, if any, ways around these systems.

    From Analysis to Action

    As a result of our study, we’ve started to categorize the work we’re doing and plan to do. For any phase in the development pipeline, there are three avenues for development:

    • Concentration: solidifying and improving systems, improving user experience, and incremental or linear scaling
    • Expansion: pushing outwards, identifying new opportunities within the problem domain, and step-change or exponential scaling
    • Interfacing: improving the points of contact between the development phases, both in terms of data flow and user experience, and identifying gaps into which an existing team could expand or a new team is created

    Horizontal and vertical teams will naturally approach development differently: 

    • Horizontal teams have a more clearly defined scope, and hence prioritization can be easier, but impact is limited to a particular area. Interface development is harder because it spans teams.
    • Vertical teams have a much vaguer scope with more possibilities for impact, but determining where we can have the most impact is thus more difficult. Interface improvement can be more straightforward if it’s between pieces owned by that team.

    We also used this analysis to inform the organizational structure. As I mentioned, we made some changes earlier this year within Accelerate. This included starting Client Foundations, which brings together the vertically integrated, technology-focused teams specializing in front-end and mobile development. Back in Dev Infra, we have the possibility of pulling in teams that currently exist in other organizations if they help us extend our development workflow model and provide new horizontal integrations. We’re starting to experiment with more active collaboration between teams to expand the context the developers have about our entire workflow.

    Finally, we plan to engage in some user research that spans the development workflow. Most of the time any in-depth research we do is at the team level: what repetitive tasks our mobile devs face, what annoys people about our test infrastructure, or how to make our deploy systems more intuitive. Now we have a way to talk about the journey a developer takes from first writing a patch all the way to getting it out into production. This helps us understand how we can make a more holistic solution and provide the smoothest experience to our developers.

    Mark Côté manages Developer Infrastructure at Shopify. He has worked in the software industry for 20 years, as a developer at a number of start ups and later at the Mozilla Corporation, where he went into management. For half of his career he has been involved in software tooling and developer productivity, leading efforts to bring a product-management mindset into the space.



    Bridging the Gap Between Developers and End Users

    In every business, regardless of industry, it is important to keep an open communication line between the consumer and the producer. In the tech world, this line of communication comes in the form of customer support, chatbots, or contact forms. There’s often no direct line of communication between the developers who build these products and the end users, especially as a company grows. Developers receive requirements of the products to build, focus on shipping their projects, and, if need be, fixing bugs that surface after the end users have interacted with the product.

    Typical line of communication between developers and end users.

    This creates a situation where developers become out of touch with the products and who they’re building for. As with most things in life, when you’re out of touch with something, it can often result in a lack of empathy for the people whose lives involve using that thing. This lack of empathy can affect the quality of products that are subsequently built.

    The ideal line of communication between developers and end users.

    This is why at Shopify, we have developed a program to make sure our customers (our merchants) and their problems are accessible to everyone at Shopify. Communicating and connecting with customers provides multiple benefits for developers, including:

    • Providing perspective on how customers interact with products
    • Creating new experiences from customer workarounds
    • Expanding domain knowledge
    • Developing valuable interpersonal skills

    Providing Perspective on How Customers Interact with Products

    “Listening to customers describe their problems helps me as a developer relate to them and build more context.”

    The development process doesn’t usually have room for interaction between the developers and customers. As a result, developers might ultimately deliver the requirements list they are given for a product, but in the end, the customers might interact with the product in a different way than expected. Developers must fully understand how and why these differences arise, which might not be accurately articulated by Support Advisors (think of the game of telephone where messages get misunderstood and passed along). By interacting with the end users directly, developers are able to see the products they have built in a new light and through their customers’ fresh eyes. More importantly, developers will get to see that their customers are human too, and in the case of Shopify, more than just shop IDs or accounts.

    Creating New Experiences from Customer Workarounds

    As developers interact with the customers, they can see how they use the products and create workarounds to get the functionality they need. This is very important in informing developers about what the customers really need and what to build next. Before you know it, you start to roll out bug fixes, newer product versions, and possibly new products that address these workarounds and make customers’ workflows a little more seamless.

    Expanding Domain Knowledge

    End users have limited knowledge of the different parts that make up the product, so they tend to ask an array of questions when they contact support advisors. Support Advisors are trained to be generalists, having a lot of knowledge on different (if not all) parts of the company’s product. While interacting with customers, a developer may get curious about new domains that are totally out of their regular workflow. This curiosity is very beneficial because it fosters the growth of more T-shaped developers, that is, developers who are specialists in a particular area but have general knowledge about a lot of other areas. After listening in on a support phone call, one developer said that it forced them out of their comfort zone to explore new product parts.

    Curiosity breeds T-shaped developers who are both generalists and specialists.

    Developing Interpersonal Skills

    Technical Support Advisors encounter different people and temperaments all the time. They’re trained and experienced to listen, communicate and help customers resolve their issues. Patience and empathy are required when dealing with frustrated customers. Developers can foster growth and improve their craft by listening more, communicating effectively, and setting realistic expectations. These skills are important to build because sometimes, the solution to a problem might not be immediately apparent or fixable. A frustrated customer wants a solution, and they want it immediately. When this happens, you shouldn’t make false promises to appease the customer; instead, you should navigate the situation to realistically set expectations to put the customer at ease.

    Given the advantages listed above, there’s a case to be made for companies to develop a system that reduces the distance between developers and customers. This helps build empathy in the developers as they begin to see the actual users of their products, as human as they come, rather than the ideal personas created during the development process.

    How We Build Empathy at Shopify

    At Shopify, we have found an effective system of getting developers (and non-developers) closer to our merchants, appropriately called Bridge the Gap.

    We launched the Bridge the Gap program in 2014 to connect Shopify employees to Shopify merchants via the Support teams. The mission of this program is to ensure that everyone working at Shopify has the opportunity to connect to our merchants in a human way. There are three channels in this program:

    • Workshops
    • Internal Podcast
    • Shadowing Sessions

    Workshops

    Workshops are organized around specific topics and areas of our platform where we take a deep dive and learn more about our merchants’ experience in these areas. Participants can ask questions throughout to gain context and clarity about how merchants use our products.

    Internal Podcast

    Our internal podcast focuses on merchant experiences, an insightful channel for allowing everyone to learn more asynchronously and at their own pace. It contains clips from a support call and a discussion between the Support Advisor who took the call and the podcast host.

    Shadowing Sessions

    In these sessions, you shadow a Support Advisor for one hour as they take questions in real-time from merchants through chat, phone or email. These sessions may be for a broad audience or focus on a specific part of the platform like marketplaces, partners, Plus merchants, etc. Participation in these shadowing sessions is voluntary but strongly encouraged.

    Our mission at Shopify is to make commerce better for everyone. Regardless of the discipline you’re working in, our mission remains to have the end user in mind, and Bridge the Gap gives us all the opportunity to build awareness and empathy.

    Ebun Segun is a software developer working primarily in Shopify’s Orders and Fulfillment domain. Since starting out as a backend developer in 2019 mainly writing Ruby code, she has transitioned into frontend development, involving the use of React Native and GraphQL. When she’s not working, she takes care of her numerous house plants, dabbles in fashion designing, or volunteers for programs that encourage females in high school to pursue careers in STEM. You can reach out to her on LinkedIn or Twitter.



    Understanding GraphQL for Beginners–Part Three

    In Part 2 of Understanding GraphQL for Beginners, we created queries that fetched us data. In Part 3, we learn what are mutations and create mutation queries to modify data. Before we begin, let’s recap what we learned so far in this series:

    • GraphQL is a data manipulation and query language for APIs.
    • There’s no more over and under fetching of data.
    • Root fields define the structure of your response fields based on the object fields selected. They’re the entry points (similar to endpoints) to your GraphQL server.
    • Object fields are attributes of an object.

    Learning Outcomes

    • Create a GraphQL mutation to modify data in the database.
    • Use a GraphQL mutation to create a new ActiveRecord.
    • Modify data using a GraphQL mutation.
    • Use a GraphQL mutation to destroy an existing ActiveRecord.

    Before You Start

    We’re using the same repository as Part 2 for this tutorial. As a reminder, the repository is set up with models and gems needed for GraphQL.

    The following models are set up:

    Food

    Attribute         Type
    id                Bigint
    name              String
    place_of_origin   String
    image             String
    created_at        Timestamp
    updated_at        Timestamp

     

    Nutrition

    Attribute            Type
    id                   Bigint
    food_id              Bigint
    serving_size         String
    calories             String
    total_fat            String
    trans_fat            String
    saturated_fat        String
    cholesterol          String
    sodium               String
    potassium            String
    total_carbohydrate   String
    dietary_fiber        String
    sugars               String
    protein              String
    vitamin_a            String
    vitamin_c            String
    calcium              String
    iron                 String
    created_at           Timestamp
    updated_at           Timestamp

    What Are Mutations?

    Mutations are queries that create, modify, or delete objects in your database, similar to a PUT, POST, or DELETE request in REST. Mutation requests are sent to the same endpoint as query requests. Mutation queries have the following structure:

    • The query starts with mutation.
    • Any required arguments are under input.
    • The mutation field name contains the action it’s trying to perform, for example, foodCreate.

    We’re naming our mutation query with an object first, followed by the action. This is useful to order the mutations alphabetically, as seen below.

    Ordered by object first      Ordered by action first
    food_create.rb               create_food.rb
    food_delete.rb               create_nutrition.rb
    food_update.rb               delete_food.rb
    nutrition_create.rb          delete_nutrition.rb
    nutrition_delete.rb          update_food.rb
    nutrition_update.rb          update_nutrition.rb
    Left: Naming a mutation with the object first, followed by the action. Right: Naming a mutation with the action first, followed by the object.

    There’s no strong preference for which naming convention you go with, but we’re using the convention shown on the left. You can find all mutations under the mutations directory and mutation_type.rb under the types directory.

    Creating Your First Mutation Query

    We create a foodCreate mutation to create new food items. To create a mutation query, enter the following in your terminal: rails g graphql:mutation foodCreate

    The rails generator does the following things:

    1. Check whether the base mutation files, base_mutation.rb and mutation_type.rb, exist. If they don’t, create them.
    2. Add the root field food_create to mutation_type.rb.
    3. Create a mutation class in food_create.rb.

    Let’s go to the mutation_type.rb class and remove the field and method called test_field.

    Notice how we don’t need to write a method here like in query_type.rb. The mutation: Mutations::FoodCreate part of the field definition executes a method called resolve in that mutation class. We’ll learn what the resolve method is soon. We then go to food_create.rb, where the class looks like this:

    The first thing we’re going to do is add input arguments. Remove all the comments, then add the following:
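    As a sketch of those arguments in food_create.rb (the attribute list follows the Food model above; the required flags are assumptions):

    module Mutations
      class FoodCreate < BaseMutation
        argument :name, String, required: true
        argument :place_of_origin, String, required: false
        argument :image, String, required: false
      end
    end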

    GraphQL uses camel case (placeOfOrigin) by default. To be consistent with Shopify’s style guide, we’ll use snake case (place_of_origin) instead. The field’s snake case is converted to camel case by GraphQL automatically!

    Next, we need to add in a resolve method. The resolve method fetches the data for its field (food_create from mutation_type.rb) and returns a response. A GraphQL server only has one resolve method per field in its schema.

    You might be wondering what ** is. It’s an operator called a double splat, and it passes a hash of arguments into the resolve method. This allows us to pass in as many arguments as we like. As a best practice, use a double splat when there are more than three parameters; for the sake of simplicity, we use it here for our three arguments.

    We then add in type Types::FoodCreate to indicate our response fields, and inside the resolve method, create a new ActiveRecord.
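    A minimal sketch of that part of food_create.rb (the null: option and the create! call are assumptions about the exact implementation):

    type Types::FoodCreate, null: false

    def resolve(**attributes)
      # The double splat collects name, place_of_origin, and image into one hash.
      Food.create!(attributes)
    end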

    Now, let’s go test it out on GraphiQL! Go to http://localhost:3000/graphiql to test out our new mutation query!

    Write the following query:

    When you execute the query, you get the following response.

    Try It Yourself #1

    Create a mutation called nutritionCreate to create a new Nutrition ActiveRecord. As there are a lot of attributes for the Nutrition class, copy the input arguments from this gist: https://gist.github.com/ShopifyEng/bc31c9fc0cc13b9d7be04368113b49d4

    If you want to see the solution, check out nutrition_create.rb as well as its query and response.

    Creating the Mutation to Update an Existing Food Item

    Create a new mutation called foodUpdate using rails g graphql:mutation foodUpdate. Inside the food_update class, we need to add in the arguments to update. ID will be part of the arguments.

    The only required argument here is ID. We need an ID to look up an existing record. This allows the resolve method to find the food item and update it.

    Next, we write the resolve method and response back.
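    A sketch of what that resolve method could look like (error handling omitted):

    def resolve(id:, **attributes)
      food = Food.find(id)
      food.update!(attributes)
      food
    end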

    Let’s test this new mutation query out. We rename our new food item from Apple Pie to Pumpkin Pie.

    Try It Yourself #2

    Create a mutation called nutritionUpdate to update an existing Nutrition ActiveRecord.

    As there are a lot of attributes for the Nutrition class, copy the input arguments from this gist: https://gist.github.com/ShopifyEng/406065ab6c6ce68da6f3a2918ffbeaab.

    If you would like to see the solution, check out nutrition_update.rb as well as its query and response.

    Creating the Mutation to Delete an Existing Food Item

    Create a new mutation called foodDelete using rails g graphql:mutation foodDelete. The only argument needed in food_delete.rb is ID.

    Next, we need to add the return type and the resolve method. For simplicity, we just use Types::FoodType as the response.
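    Under those assumptions, a minimal sketch of food_delete.rb:

    module Mutations
      class FoodDelete < BaseMutation
        type Types::FoodType, null: false

        argument :id, ID, required: true

        def resolve(id:)
          food = Food.find(id)
          food.destroy!
          food # return the deleted record's fields as the response
        end
      end
    end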

    Let’s test this mutation in GraphiQL.

    Try It Yourself #3

    Create a mutation called nutritionDelete to delete an existing Nutrition ActiveRecord. Similar to foodDelete, we use Types::NutritionType as the response.

    If you would like to see the solution, check out nutrition_delete.rb as well as its query and response.

    We’ve reached the end of this tutorial. I hope you enjoyed creating mutations to create, modify, or delete data.

    Let’s recap what we learned!

    1. Mutations are queries that create, modify, or delete objects in your database.
    2. To generate a mutation, we use rails g graphql:mutation nameAction.
    3. A resolve method fetches the data for its field (food_create from mutation_type.rb) and returns a response.
    4. A GraphQL server only has one resolve method per field in its schema.

    GraphQL is a powerful data manipulation and query language for APIs that offers lots of flexibility. At times, it can seem very daunting to add GraphQL to your application. I hope the Understanding GraphQL for Beginners series helps you implement GraphQL in your personal projects or convince your team at work to adopt the GraphQL ecosystem.

    If you would like to see the finished code for part three, check out the branch called part-3-solution.

    Often mistaken as an intern, Raymond Chung is building a new generation of software developers through Shopify’s Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you’ll find Raymond exploring for the best bubble tea shop.



    Connecting with Mob Programming

    We were a team of six people from three different teams with half of us completely new to the Shipping domain and some of us new to the company. We had six weeks to complete one ambitious project. Most of us had never met in person, let alone worked together. We worked like any other team: picked items off the backlog and worked on them. Everyone said they were open to pairing but very few paired, or if they paired it was with the one person they knew very well. Pull requests (PRs) came in but feedback was scarce. Everyone was new to each other and the domain, so opinions were rare, and when present, weak. PRs had a couple of nits, but otherwise they’d just go through. We may have been shipping, but whether we were growing, connecting, and learning was unclear.

    Until one day, our teammate, Sheldon Nunes, introduced us to mob programming. Mob programming builds on pair programming but instead of two people pairing, it's an entire mob—more than two people—pairing on the same problem.

    What is Mob Programming?

    It started innocently enough, as none of us had done mob programming before. Six of us joined a video call and the driver shared their screen. It was ineffective. It wasn’t inclusive or engaging and ended up as pair programming with observers. We had 30 minutes of two individuals contributing, and everyone else had zoned out. Until someone asked, "How do we make sure that everyone doesn't fall asleep?!" Surely enough, mobbing has a solution for that: fast rotations of the driver.

    Sheldon, our mob programming expert, suggested we switch to a 10-minute rotation of one driver. At the end of every rotation, the driver pushes the changes, and the next person pulls the changes and takes over. It worked like magic. By taking turns and having a short duration, everyone was forced into engagement or they would be lost on their turn. We made programming a game.

    A mob of five people rotating every 10 minutes is 50 minutes of work per full round of rotations. Though the 10 minutes passed quickly, we also moved swiftly and kept tight alignment. The fast rotation also meant that we made decisions quickly—nobody wanted a turn to end without having shipped anything—and every decision was reversible, so it hardly made sense not to be decisive. We saw the same with how much context one shared with the group. There was no risk of a 30-minute context dump by one individual who had high context because the short rotation forced people to share just enough context to get something done. Code reviews also became moot—everyone wrote the code together, so there was little back and forth, allowing us to ship even faster.

    The most valuable benefit we saw with mob programming was the strength of our relationships after we started doing it. It was so effective, we noticed it immediately following the first session. Feedback was easier to give and receive because it wasn’t a judgement but a collaboration. While collaborating so closely, we were able to learn from watching each other’s unique workflows, laugh at each other’s scribbles and animal drawings, and engage in spontaneous games of tic tac toe.

    The Five Lessons of Mob Programming

    For three months, the team performed mob programming almost daily. That intensive experience taught us a few things.

    1. Pointing and Communicating

    Being able to point with a crudely drawn arrow is important. Drawing increases the ways you can interact, changing from verbal only to verbal and visual, but most importantly, it keeps everyone engaged. When mobbing, a 30-second build feels like an eternity, and being able to doodle or even see someone else draw doodles on the screen changes the engagement level of the group.

    We tried one session without drawing, and while it can work, it’s an exercise in frustration as you try to explain to the driver exactly where to scroll, which character on a line the bracket is missing from, and where exactly the spelling error is.

    2. Discoverability Matters

    Our first mobbing session came out of an informal coffee chat. We used Slack’s /call feature for pairing so members of the team who weren’t in the informal coffee chat could join at a later time. We started this in a private group with a direct message, but faced challenges, such as not being able to add “guests” who had context on what we were trying to solve and whom we wanted to add to our mob. A call in a small private group also puts pressure on the whole team to join, irrespective of their availability. So we moved it to a public channel.

    An active Slack huddle window that shows the profile photos of the attendees and a Show Call button

    A mob that’s discoverable, so people can drop in and drop out, ensures that the mob doesn't "die off" and people can take a break. For us, this means using Slack huddles with screen share and Slack /call in a public channel. Give it a fun name or an obvious name, but keep it public.

    3. The Right Problem in the Right Environment

    A mob that’s rotating the driver constantly, like ours, requires a problem where people can set up the environment quickly. Have one branch and a simple setup. A single rotation should involve:

    git pull
    <mob>
    git commit -a -m 'WIP!!!'
    git push

    Yes, the good commit messages get ditched here. It's very possible to end your rotation with debugging statements in code. That's OK. Add a good commit message when a task is complete, not necessarily at every push. This reduces how long a hand off takes and allows rotations to happen without waiting for a "clean exit."

    Writing tests (or even this article!) is a poor experience for mobbing. For tests, the runtime is too long to be effective for a mob. These tasks work better as pairing or solo activities, so often someone would volunteer to own the task and take it to completion. For documentation, it’s pretty hard to write a sentence together.

    4. Invite Other Disciplines

    The nature of mob programming means that non-developers can mob with developers. Sometimes it’s Directors who rarely get to code in their day to day or a Product Manager who’s curious. The point is that anyone can mob because the mob is available to help. The driver is expected to not know what to do, and by making that the default experience, mobbing becomes welcoming for developers of all skill levels.

    5. Take a Break

    Time passes fast in a mob. We found two hours is the maximum length. Mobbing sessions can drain the introverts in the team. Timebox it and set a limit to minimize the feeling of "missing out" for members of the team who are not able to participate.

    Remote work changed for all of us permanently that day. Gone were the days of lamenting over the loss of learning through osmosis. In person, we learned from each other by overhearing conversations, but with remote work that quickly went away as conversations moved into private messages or meetings—unless you asked the question, you didn't get to hear the answer. There was no learning new shortcuts and tricks from your coworkers by happening to walk by. However, with mobbing, all of that was back. Arguably pairing should've done this too, but the key with mobbing is that you don't have to ask the questions or give the answers—you can learn from the conversations of others.

    An ended Slack huddle window that shows the profile photos of the attendees and the amount of time the huddle lasted.

    Before, we were suffering from isolation and feeling disconnected from the team; now we were over-socialized and had to introduce no-pairing days to give people a chance to recharge. We’re now able to onboard newcomers as mob programming welcomes low context—you have an entire mob to help you, after all.

    Swati Swoboda is a Development Manager at Shopify. She has been at Shopify for 3.5 years and currently leads the Shipping Platform team.



    A Guide to Running an Engineering Program

    In 2020, Shopify celebrated 15 years of coding, building, and changing the global commerce landscape forever. In that time, the team built an enormous amount of tooling, products, features, and an insanely scalable architecture.

    With the velocity and pace we’ve been keeping, through growing surface areas that enable commerce for 8% of the world’s adult population and with an evolving architecture to support our rapid rate of development, the platform complexity has increased exponentially. Despite this, we still look hard problems straight in the face and decide that, however complex they seem, they’re what we want as a business. We can and will do this.

    We’ve taken on huge engineering programs in the past that cross over multiple areas of complexity on the platform. These teams deliver features such as the Storefront Renderer, world class performance for events like BFCM, platform wide capacity planning and testing, and efficient tooling for engineers at Shopify like static typing.

    Shopify has a playbook for running these engineering programs and we’d like to share it with you.

    Defining the Program

    All programs require a clearly defined definition of done or the North Star if you will. This clarity and alignment is absolutely essential to ensure the team is all going in the same direction. For us, a number of documented assets are produced that enable company wide alignment and provide a framework for contextual status updates, risk mitigation, and decisions.

    Being aligned means all stakeholders agree on the Program Plan in its entirety:

    • the length of time
    • the scope of the Program
    • the staffing assigned
    • the outcomes of the program. 

    The program stakeholders are typically Directors and VPs for the area the program will affect. Any technical debt or decisions made along the way are critical for these stakeholders as they inherit it as leaders of that area. The Program Steering Committee includes the Program Stakeholders and the selected Program Lead(s) who together define the program and set the stage for the team to move into action. Here’s how we frame it out:

    Problem Statement

    Exactly what is the issue? This is necessary for buy-in across executives and organizational leaders. Be clear, be specific. Questions we ask include:

    • What can our users or the company do after this goal is achieved?
    • Why should we solve this problem now?
    • What aren’t we doing in order to prioritize this? 
    • Is that the right tradeoff for the company?

    Objectives of Program

    Objectives of the program become a critical drum for motivation and negotiation when resources become scarce or other programs are gaining momentum. To come up with good objectives, consider answers to these questions:

    • When we’re done, what’s now possible? 
    • As a company, what do we gain for choosing to invest in this effort over the countless other investments?  
    • What can merchants, developers, and Shopify do after this goal is achieved? 
    • What aren’t we doing in order to prioritize this? 
    • Is that the right tradeoff for the company?

    Guiding Principles of the Program

    What guiding principles govern the solution? Spend time on this as it’s a critical tool for making tradeoff decisions or forcing constraints in solutions.

    Definition of Done

    Define exactly what needs to be true for this problem to be solved and what constitutes a passing grade for this definition, both at the program level and for each workstream. The definition of done for the program is specifically what the stakeholders group is responsible for delivering. The definition of done for each workstream is what the program contributors are responsible for delivering. Program Leads are responsible for managing both. It includes

    • a checklist for each team to complete their contribution
    • a performance baseline 
    • a resiliency plan and gameday
    • complete internal documentation
    • external documentation. 

    Defining these expectations early and upfront makes it clear for all contributors and workstream leads what their objective is while also supporting strong parallelization across the team.

    Top Risks and Mitigation

    What are the top realistic things that could prevent you from attaining the definition of done, and what’s the plan of action to de-risk them? This is an active evolution. As soon as you mitigate one, it's likely another is nipping at your heels. These risks can be and often are technical, but sometimes they’re resource risks if the program doesn’t get the staffing needed to start on the planned schedule.

    Path to Done

    You know where you are and you know where you need to be. How are you going to get there, and what’s the timeline? Based on the goals for the program’s definition of done, project scope and staffing become the levers for adjusting the plans, looking holistically at the other company initiatives to ensure total company success rather than over-optimizing for a single program.

    These assets become the framework for aligning, staffing, and executing the program. The Program Stakeholders use them to decide exactly how the Program is executed, what it will achieve, how much staff is needed, and for how long. With this clarity, it then becomes a matter of execution, which is one of the most nebulous parts of large, high-stakes programs; there’s no perfect formula. Eisenhower said, “Plans are worthless, but planning is indispensable.” He’s so accurate. It’s the implementation of the plan that creates value and gets the company to the North Star.

    Executing the Program

    Here’s what works for my team of 200 that crosses nine sub-organizations that each have their own senior leadership team, roadmap, and goals.

    Within Shopify R&D, we run in six week cycles company wide. The Definition of Done defined as part of the Program Plan acts as a primary reinforcing loop of the program until the goal is attained, that is, have we reached the North Star? Until that answer is true, we continue working in cycles to deliver on our program. Every new cycle factors in:

    • unachieved goals or regressions from the previous cycle 
    • any new risks that need to be addressed 
    • the goals expected based on the Path to Done. 

    Each cycle kicks off with clarity on goals, then moves to get it done or in Shopify terms: GSD (get shit done).

    Flow diagram showing the inputs considered to start a cycle and begin GSD to the inputs needed to attain the definition of done and finally implementation
    Six week cycle structure

    This execution structure takes time to absorb and learn. In practice, it’s not this flow of working or even aligning on the Program Plan that’s the hardest part; rather, it’s all the things that hold the program together along the way. It’s how you work within the cycle and what rituals you implement that are critical aspects of keeping huge programs on track. Here’s a look at the rituals my team implements within our 200-person, 92-project program. We have many types of rituals: Weekly Rituals, Cycle Rituals performed every six weeks, and ad hoc rituals.

    Weekly Rituals

    Company Wide Program Update

    Frequency: The Company Wide Program Update is done weekly. We like starting the week by reflecting on the previous one.

    Why this is important: The update informs stakeholders and others on progress in every active workstream based on our goals. It supports Program Leads by providing an async, detailed pulse of the project.

    Trigger: A recurring calendar item that’s scheduled as two events:

    1. Draft the update.
    2. Review the update. 

    It holds us accountable to ensure we communicate on a predictable schedule.

    Approach: Program Leads have a shared template that adds to the predictability our stakeholders can expect from the update. The template matches the format used cycle to cycle, specifically the exact language and layout of the cycle goals presented in the Program Stakeholder Review. A consistent format allows everyone to interpret the plan easily.

    We follow the same template every week and update the template cycle to cycle as the goals for the next cycle changes. This is prioritized on Monday mornings. The update is a mechanism to dive into things like the team’s risks, blockers, concerns, and celebrations.

    Throughout the day, Program Leads collaborate on the document identifying tripwires that signal our ad hoc rituals or unknowns that require reaching out to teams. If the Program Leads are able to review and seek out any missing information by the review meeting, we cancel and get the time back. If not, we have the time and can wrap the document up together.

    Deliverable: A weekly communication delivered via email to stakeholders on the status against the current cycle.

    Program Lead and Project Lead Status Check-in

    Frequency: Lead and Champion Status Check-in is done weekly at the start of the week to ensure the full team is aligned.

    Why this is important: Team Leads and Champions complete project updates by end of day Friday to inform the Company Wide Program Update for Monday. This status check-in is dedicated space, if and when we need it, for cycle updates, program updates, or housekeeping.

    Trigger: A recurring calendar item.

    Approach: The recurring calendar item has all Project Leads on the attendee list. Often, the sync is cancelled as we finish the Company Wide Program Update. If there are a few missing updates, all but those Leads are excused from the sync. By completing the Company Wide Program Update, Program Leads identify which projects are selected for a Program Touch Base ritual.

    Deliverable: Accurate status of each project and the likelihood of reaching the defined cycle goal. It informs the weekly Company Wide Program Update.

    Escalation Triage

    Frequency: Held at a minimum weekly, though mostly ad hoc.

    Why this is important: This is how we ensure we’re removing blockers, making decisions, and mitigating risks that could affect our velocity.

    Trigger: A recurring calendar item called Weekly Program Lead Sync.

    Approach: A GitHub project board is used to manage and track the Program. Tags are used to sort urgency and when things need to be done. Decisions are often the outcome of escalations. These are added to the decision log once key stakeholders are fully aligned.

    Escalations are added by Program Leads as they come up, allowing us to put them into GitHub with the right classifications for active management. As Program Leads, we tune into all technical designs, project updates, and team demos for many reasons, but one advantage is we can proactively identify escalations or blockers.

    Deliverable: Delegate ownership to ensure a solution is prioritized among the program team. The escalations aggregate into relevant items in the Program Stakeholder Review as highlights of blockers or solutions to blockers.

    Risk Triage

    Frequency: Held at a minimum weekly, though also ad hoc.

    Why this is important: This is also how we ensure we’re removing blockers, making decisions and mitigating risks that could affect our velocity. This is how we proactively clear the runway.

    Trigger: A recurring calendar item called Weekly Program Lead Sync.

    Approach: In our planning spreadsheet, we have a ranking formula to prioritize the risks. This means we’ve identified which risks need mitigation first, where the risk lives within the entire program, and which Lead is assigned a mitigation strategy. We also include a last-updated date for the status of the mitigation. This allows us to jump in and out at any cadence without accidentally putting undue pressure on the team to have a strategy immediately or poking them repeatedly. The spreadsheet shows who has had time to develop a mitigation strategy and allows us to monitor its implementation. Once there’s a plan, we update the sheet with the mitigation plan and status of implementation. It’s only once the plan is implemented that we change the ranking and actually mitigate our top risks.

    Updating and collaborating is done with comments in the spreadsheet. Between Slack channels and the spreadsheet, you can see progress on these strategies. This is a great opportunity for Program Leads to be proactive and pop in these channels to celebrate and remind the team we just mitigated a big risk. Then, the spreadsheet is updated either by the Team Lead or the Program Lead, depending on who's more excited.

    Deliverable: Delegate ownership to ensure a mitigation plan is prioritized among the program team. The escalations aggregate into relevant items in the Program Stakeholder Review as highlights of blockers or solutions to blockers.

    Program Lead Sync

    Frequency: Held weekly and ad hoc as needed.

    Why this is important: This is where the Program Leads strategize, collaborate, and divide and conquer. Program Leads are accountable for the Program’s success. These Leads partner to run the Program and are accountable for delivering on the definition of done by planning cycle after cycle. They must work together and stay highly aligned.

    Trigger: A recurring calendar item.

    Approach: We have an automatic agenda for this to ensure we tighten our end of week rituals, but also to stay close to the challenges, risks, and wins of the team. We try to minimize redundancy in coverage. Our agenda starts with four basic bullets:

    • Real Talk: What’s on your mind, and what’s keeping you up at night? It’s how we stay close and ensure we’re partnering and not just ships passing in the night.
    • Demo Plan: What messaging if any should we deliver during the demo since the entire Program team has joined?
    • Divide and Conquer: What meetings can we drop to reduce redundancy?
    • Risk Review: What are the top risks, and how are the mitigation plans shaping up? 

    Throughout the week, agenda items are added by either Program Lead to ensure we have a well-rounded conversation about the aspects of the Program that are top of mind for the week. Often these items tend to be escalations that could affect the Program velocity or a Project’s scope.

    Deliverable: A communication and messaging plan for weekly demos where the team fully gathers, risk levels, mitigation plans based on time passed, and new information or tooling changes.

    Weekly Demos

    Frequency: Held weekly.

    Why this is important: Weeks one to five are mainly time for the team to share and show off their hard work and contribution to the goals. Week six is to show off the progress on the planned goals to our stakeholders.

    Trigger: Scheduled in the calendar for the end of day on Fridays.

    Approach: There are two things that happen in prep for this ritual:

    1. planning demo facilitation 
    2. planning demos. 

    Planning demos: Any Champion can sign up for a weekly demo. A call for demos is made about two days in advance, and teams inform their intention on the planning spreadsheet: a weekly check mark if they will demo.

    Planning demo facilitation: Program Leads and domain area leadership facilitate the event and deliver announcements, appropriately timed messaging, and demos. Of course, we also have fun and carve out a team vibe. We do jokes and welcome everyone with music. The teams that signed up are called on one by one to demo, answer team questions, and share any milestones achieved.

    Deliverable: A recorded session available to the whole company to review and ask further questions. It’s included in the weekly Company Wide Program Updates.

    Cycle Rituals Performed Every Six Weeks

    Cycle Kick Off

    Frequency: Held every new cycle start: day one of week one.

    Why this is important: This aligns the team and reminds us what we’re all working towards. We share goals, progress, and welcome new team members or workstreams. It also allows the team to understand what other projects are active in parallel to theirs, allowing them to proactively anticipate changes and collaborate on shared patterns and approaches.

    Trigger: A recurring calendar item.

    Approach: We host a team sync up, and the entire program team is invited to participate. We try to keep it short, exciting, and inspiring. We raise any reminders on things that have changed, like the definition of done, and repeat the support in place for the whole team, like office hours.

    Deliverable: A presentation to the team delivered in real time that highlights the cycle’s investment plan, overall progress on the Program, and some of the biggest areas of risk for the team over the next six weeks.

    Mid-Cycle Goal Iteration

    Frequency: Held between weeks one and three in a given cycle but no more than once per project.

    Why this is important: Goals aren’t always realistic when they’re set, but that’s often only realized after the work starts. Goals aren’t a jail cell; they’re flexible and iterative. Leads are empowered in weeks one to three to change their cycle goal so long as they communicate why and provide a new goal that’s attainable within the remaining time.

    Trigger: Week three

    Approach: In weeks one to three, Leads leverage Slack to share how their goal is evolving. This evolution and its effect on the subsequent cycles left in the program plan need to be understood. Leads do this as needed; however, in week three there’s a reminder paired with Goal Setting Office Hours.

    Deliverable: A detailing of the change in cycle goals since kick off, and its impact on the overall project workstream and program path to be done.

    Goal Setting Office Hours

    Frequency: Held between weeks three to five in a given cycle. 

    Why this is important: In week three, time is carved out for reviewing current cycle goals. In weeks four and five, the time is focused on next cycle goals. This is how we build a plan for the next cycle’s goals intentionally rather than rushing once the Program Stakeholder meeting is booked. It's how we’re aligned for the week one kick off.

    Trigger: Week three

    Approach: This is done with a recurring calendar on the shared program calendar and paired with a sign up sheet. Individuals then add themselves to the calendar invite.

    This isn’t a frequently used process, but it does give comfort to Leads that the support is there and the time is carved out. The Program Touch Base ritual tends to catch risks and velocity changes in advance of Goal Setting Office Hours, but we have yet to determine if they should be removed altogether.

    Deliverable: A change in the cycle’s current goal, the overall project workstream, and program path to be done, including staffing changes.

    Cycle Report Card

    Frequency: Held every six weeks.

    Why this is important: This is a moment of gratitude and reflection on what we've achieved, and how we did so together as a team.

    Trigger: Week Six

    Approach: In week five, Slack reminds Leads to start thinking about this. Over the next week, the team drips in nominations to highlight some of the best wins from the team on performance and values we hold such as being collaborative, merchant/partner/developer obsessed, and resourceful.

    This is done in a templated report card where we reflect back on what we wanted to achieve and update the team so they can see the velocity and impact of their work. Then, we celebrate.

    This is delivered and facilitated by Program Leads, with Team Leads being the ones delivering the praise in a full team sync up. We believe this not only helps create a great team and working environment, but also helps demonstrate alignment among the Program Leads. It helps us all get to know our large team and its strengths better.

    Deliverable: A celebratory section in the Cycle Kick off presentation reflecting back on individual contributions and team collaborations aligned to the company values.

    Program Lead Retro of the Previous Cycle

    Frequency: Held every six weeks, skipping the first cycle of the program.

    Why this is important: This enables recurring optimization of how the Program is run, the rituals and the team’s health. It ensures that we’re tracking towards a great working experience for staff while balancing critical goals for the company. It’s how Program Leads and Project leads get better at executing the Program, leading the team and managing the Program stakeholders.

    Trigger: A new cycle in a program. Typically the retro is held in week one, after Project Leads have shared their retro feedback.

    Approach: This retro is facilitated through a stop, start, and continue workshop. It’s a simple, effective way to reflect on recent experiences and decide on what should change moving forward. Decisions are based on what we learned in the cycle: what we'll stop doing, start doing, and continue doing.

    A few questions are typically added to get the team thinking about aspects of feedback that should be provided:

    • How are Program Leads working together as a team?
    • How are Program Leads managing up to Program Stakeholders?
    • How are Project Leads managing up to Program Leads?
    • What feedback are our Team Leads giving us?
    • How is the execution of the Program going within each team?

    This workshop produces a number of lessons that drive change on the current rituals. Starting in week two, the Lead Sync is held to review and discuss how we’re iterating rituals in this cycle. Program Leads aim to implement the changes and communicate to the broader team by the end of week two so we have four weeks of change in place to power the next cycle’s retro.

    Deliverable: A documented summary of each aspect of the retro described above available company wide and included in the Program Stakeholder Company Wide Update.

    Project Lead Retro of Previous Cycle

    Frequency: Held every six weeks, skipping the first cycle of the program.

    Why this is important: Project Leads have the option to run the retro as part of their rituals.

    This enables recurring optimization of how a Project is run within the Program, the rituals, and the team’s health. It’s how Project Leads get better at executing Projects, leading the team, and working within a larger Program.

    Trigger: A new cycle in a program.

    Typically the retro is held in week six or week one while the cycle is fresh. Even if the Project Lead has decided not to run a retro, they may still run one at the request of a Program Lead.

    Approach: Project Leads are not prescribed an approach beyond the general Get Shit Done recommendations that already exist within Shopify. The main point of the exercise is not how it's run, but the outcome of the exercise.

    Program Leads share an anonymous feedback form in advance of week six. It asks what the team is going to stop, start, and continue, not just within the Project but also at the Program level. Then, we include an open-ended section to distill lessons learned. These lessons are shared back with all Project Leads so we’re learning from each other. This generates a first-team vibe for all Project Leads who have teams contributing to the program. First team is a concept from Patrick Lencioni where true leaders prioritize supporting their fellow leaders over their direct reports.

    It’s important for teams who want to go far and fast as this mindset is transformational in creating high performing teams and organizations. This is because a strong foundation of trust and understanding makes it significantly easier for the team to manage change, be vulnerable, and solve problems. At the end of the day, ideas or plans don’t solve problems; teams do.

    Deliverable: Iteration plan on the rituals, communication approaches, and tooling that continues to remove as many barriers and as much complexity from the team’s path.

    Program Stakeholder Review

    Frequency: Held every six weeks, often in early week six.

    Why this is important: This is where Program Stakeholders review the goals for the upcoming cycle, set expectations, escalate any risks necessary, or discuss scope changes based on other goals and decisions. This is viewed in context to the cycle ahead, but also the overall Program Plan.

    Trigger: Booked by the VP office.

    Approach: Program Leads provide a status update of the previous cycle and the goals for the upcoming cycle in visual format. Program Leads leverage the Weekly Sync to make a plan on how we’d like to use this time with the stakeholders so we’re aligned on the most important discussion points and can effectively use the Program Steering Committee's time.

    Deliverable: A presentation that highlights progress, the remaining program plan, open decisions, and escalations that require additional support to manage.

    Program Stakeholder Company Wide Update

    Frequency: We aim to do this at least once a cycle, often at the beginning of week four right in time to clarify the program changes following Mid-Cycle Goal Iteration.

    Why is this important: Shopify is a very transparent company internally; it's one of the ways we work that allows us to move so fast. Sharing the Program status and its evolution cycle to cycle creates an intense collaboration environment, ensuring teams have the right information to spot dependencies and risks as an outcome of this program. It supports Program Leads as well by helping clarify exactly where their team fits in the larger picture by providing an async, detailed pulse of the program.

    Trigger: A recurring calendar item that’s scheduled in two events:

    1. Draft the update.
    2. Review the update. 

    It holds us accountable to ensure we communicate on a predictable schedule.

    Approach: Program Leads have a shared template that adds to the predictability our stakeholders can expect from the update. The template matches the overall program layout, using the exact language and structure the Program was framed with at the time of kickoff. A consistent format allows everyone to interpret the plan easily. The update is a mechanism to dive into things like the program risks, blockers, concerns, and milestones.

    Throughout the day, Program Leads collaborate on the document identifying areas that could use more attention and support, highlighting changes to the overall plan, updating forecasting numbers, and most often, celebrating being on track!

    Deliverable: A communication delivered via email to stakeholders on the status of the overall program.

    Ad Hoc Rituals

    The ad hoc rituals are the ones that somehow hold the whole thing together, even through the most turbulent situations. They are the rituals triggered by intuition, experience, and context that pull out the minor technical and operational details that have the potential to significantly affect the scope, trajectory or velocity of the Program. These rituals navigate the known unknowns and unknown unknowns hidden in every project in the program.

    Assumption Slam Workshop

    Frequency: Held ad hoc, but critical within the first month of kick off.

    Why this is important: The nature of these programs is a complex intersection of Product, UX, and Engineering. This is a workshop to align the team and decrease unclear communications or decisions rooted in assumptions. The workshop is a mechanism to surface those assumptions and the resulting lift, so they’re well managed and don’t become blockers.

    Trigger: Ad hoc

    Approach: In weeks one to three the Program Leads facilitate a guided workshop that we call an Assumption Slam. The group should be small as you’ll want to have a meaningful and actionable discussion. The workshop should be facilitated by someone who has enough context on the program to ask questions that lead the team to the root of the assumption and the underlying impacts that require management or mitigation. You’ll also want to include the right team members so you’re challenging plans at the right intersections.

    Deliverable: The key items identified in this section shift to action items. Mitigate the risk, finalize a decision, or complete a deeper investigation.

    Program Touch Base

    Frequency: Ad Hoc

    Why this is important: This is a conversational sync allowing the Project Lead to flag anything they feel relevant. This is how we stay highly aligned with the Leads and help them stay on course as much as possible.

    Trigger: If something doesn’t seem right like:

    • A workstream's goals are off track for more than one week in a row and we haven’t heard from them.
    • A workstream's goal status moves from green to red without passing through yellow.
    • A workstream isn’t making their team updates on a regular cadence.
    • A workstream’s Lead hasn't talked with us in a full cycle.

    Otherwise, it’s triggered when we have new information that we need to talk about, like another initiative that affects scope, dependencies, or staffing changes.

    Approach: We keep the format light here and call the meeting a Program Touch Base. Once it’s booked, an agenda is automatically added with the following items:

    • Real Talk: What’s on your mind, and what’s keeping you up at night? It's how we stay close and ensure we’re partnering and not just coordinated ships in the night.
    • Confidence in cycle and full workback:
      • Based on the goal you have for this cycle, are you confident you can deliver on it?
      • What about your full schedule for the program?
      • What is your confidence in that estimate, including time, staffing, and scope?
    • What challenges or risks are in your way?
    • Performance: Speed, Scale, Resiliency: 
      • How is the performance of your project shaping up? 
      • Any concerns worth noting that would risk you attaining the definition of done for your workstream?
    • What aren’t you doing? Program stakeholders will typically inherit any technical debt from these decisions. By asking this, Project Leads can identify items for the roadmap backlog.

    Deliverable: This engagement often leads to action items such as dependency clarification, risk identification and decision making.

    Engineering Request for Comments (RFC)

    Frequency: Held ad hoc but critical during technical design phases or after performance testing analysis.

    Why is this important: Technical Design is good for rapid consensus building. In Engineering Programs, we need to align quickly on small technical areas, rather than align on the entire project direction. There’s significant overlap between the changes being shipped and the design exploration.

    Trigger: Ad hoc

    Approach: Using an async-friendly approach in GitHub, Engineers follow a template and rules of engagement. If alignment isn’t reached by the deadline date and no one has explicitly “vetoed” the approach, how to proceed becomes the decision of the RFC author.

    Deliverable: A technical, architectural decision that is documented.

    Performance Testing

    Frequency: Held ad hoc, both on individual components and on the integrated system.

    Why is this important: This is critical to the Program and to Shopify. Performance is a core attribute of the product and, at a minimum, can’t be regressed on. It can, however, also be improved on. Those improvements are key call outs used in the Cycle Report Card.

    Trigger: Deploying to Production.

    Approach: Teams design a load test for their component by configuring a shop and writing a Lua script that’s orchestrated through our internal tooling named Genghis. Teams validate the performance against the Service Level Indicators we are optimizing for as part of the program, and if it’s a pass, aim to fold their service into a complete end to end test where the complexity of the system will rear its head.

    This is done through async discussion as well as office hours hosted by the Program’s performance team. The Performance team documents the context being shared and inherits the end to end testing flows and associated shops. Yes, multiple shops and multiple flows. This is because services are tested at the happy path, but also with the maximum complexity to understand how the system behaves, and what to expect or fix.

    Deliverable: First and foremost, it's a feedback loop validating to teams that the service meets the performance expectations. Additionally, the Performance team can now run recurring tests to monitor for regressions and stress on any dimension desired.

    Engineering Program Management is still an early craft and evolves to the specific needs of the program, the organization of the company, and the management structure among the teams involved. We hope a glimpse into how we’ve run a 200+ person engineering program of 90 projects helps you define how your Engineering Program ought to be designed. As you start that journey, remember that not all rituals are necessary. In our experience, we find they’re important to getting as close as possible to the objectives, and doing so with a happy and healthy team. It’s the combination of all of these calendar-based and ad hoc rituals that has allowed Shopify to achieve our goals quarter after quarter.

    You heard directly about some of these outcomes at Unite 2021: custom storefronts, checkout app extensions, and Online Store 2.0.

    We’d love to hear how you are running engineering programs and how our approaches contrast! Reach out to us on Twitter at @ShopifyEng.

    Carla Wright is an Engineering Program Manager with a focus on Scalability. She's been at Shopify for five years working across the organization to guide technical transformation, focusing on product and infrastructure bottlenecks that impact a merchant’s ability to scale and grow their business.




    Perspectives on React Native from Three Shopify Developers


    By Ash Furrow, AJ Robidas, and Michelle Fernandez

    From the perspective of web developers, React Native expands the possibilities of what a developer can create. Using the familiar React paradigm, a web developer can build user interfaces for iOS and Android apps (among other platforms). With React Native, web developers can use familiar tools to solve unfamiliar problems—what a delight!

    From the perspective of native developers, both Android and iOS, React Native (RN) helps them build user interfaces much faster. And since native developers typically focus on either Android or iOS (but usually not both), React Native also offers a wider audience for developers to reach with less effort.

    As we can see, React Native offers benefits to both web and native mobile developers. Another benefit is that developers get to work together with programmers from other backgrounds. Web, Android, and iOS developers can work together using React Native in a way that they couldn’t before. We know that teams staffed with a variety of backgrounds perform better than monocultures, so the result of using React Native can be better apps, built faster, and for a wider audience. Great!

    Even as we see the benefits of React Native, we also need to acknowledge the challenges and concerns from developers of web and native backgrounds. It’s our hope that this blog post (written by a web developer, an Android developer, and an iOS developer) can help contextualize the React Native technology. We hope that by sharing our experiences learning React Native, we can help soothe your anxieties and empower you to explore this exciting technology.

    Were You Excited to Start Using React Native?

    AJ: Yes, definitely. Coming from a web dev background, I was always interested in mobile development, but had no interest in going back to Java just to make Android apps. I had been using React for a while already, so when I heard there was a way to write mobile apps using React I was immediately interested, though I struggled to get into it on my own because I learn better with others. When I was offered a job working with React Native, I jumped at the opportunity.

    Michelle: I was hesitant and thought that all that I knew about native android development was going to be thrown out the window! The easier choice would have been to branch off and stay close to my native development roots doing iOS development, but I’m always up for a challenge and saying YES to new things.

    Ash: I wasn’t! My previous team started using it in 2015 and I resisted. I kind of stuck my head in the sand about it because I wanted to use Swift instead. But since the company didn’t need a lot of Swift to be written, I started working on web back-ends and front-ends. That’s when I learned React and got really excited. I finally understood the value in React Native: you get to use React.

    What surprised you most about React Native?

    AJ: The simplicity of the building blocks. Now I know that sounds a little crazy, but in the web world there are just so many base semantic elements. <button>, <a>, <h1> to <h6>, <p>, <input>, <footer>, <img>, <ol>, etc. So when I started learning React Native, I was looking for the RN equivalents for all of these, but it just isn’t structured the same way. RN doesn’t have heading levels and paragraphs, all text is handled by the <Text> component. Buttons, links, tabs, checkboxes, and more can all be handled with <Touchable> components. Even though the structure of writing a custom component is almost exactly the same as React, it feels very different because the number of semantic building blocks goes from more than 100 down to a little more than 20.
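
    To make that contrast concrete, here’s a minimal sketch (the component and props are illustrative, not taken from any Shopify codebase) of how a card that would use several distinct semantic elements on the web collapses down to View, Text, and Pressable in React Native:

        import React from "react";
        import { Pressable, Text, View } from "react-native";

        // On the web this might be a <section> containing an <h1>, a <p>, and a
        // <button>. In React Native, headings and paragraphs are both just <Text>,
        // and most tappable things are built from Pressable/Touchable primitives.
        export function WelcomeCard({ onGetStarted }: { onGetStarted: () => void }) {
          return (
            <View>
              <Text accessibilityRole="header" style={{ fontSize: 24 }}>
                Welcome
              </Text>
              <Text>Everything text-like goes through the Text component.</Text>
              <Pressable accessibilityRole="button" onPress={onGetStarted}>
                <Text>Get started</Text>
              </Pressable>
            </View>
          );
        }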

    Michelle: I was surprised at how quick it was to build an app! The instant feedback when updating the UI is incomparable to the delay you get with native development, and the data that informs that UI is easy to retrieve using tools like GraphQL and Apollo. I was also very surprised at how painless it was to create the native bridge module and integrate existing SDKs into the app and then using those methods from the JavaScript side. The outcome of it all is a solid cross-platform app that still allows you to use the native layer when you need it! (And it’s especially needed for our Point of Sale app)
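
    As a rough illustration of what “using those methods from the JavaScript side” can look like, here’s a hedged sketch of the TypeScript half of a native bridge module. The module and method names (PaymentTerminal, connectReader) are hypothetical stand-ins rather than the actual POS SDK; only NativeModules itself comes from React Native:

        import { NativeModules } from "react-native";

        // Hypothetical native module: the Kotlin/Swift side would register a module
        // named "PaymentTerminal" that exposes a promise-returning connectReader method.
        type PaymentTerminalModule = {
          connectReader(serialNumber: string): Promise<boolean>;
        };

        const { PaymentTerminal } = NativeModules as {
          PaymentTerminal: PaymentTerminalModule;
        };

        export async function connectToReader(serialNumber: string): Promise<void> {
          // This call crosses the React Native bridge into the existing native SDK.
          const connected = await PaymentTerminal.connectReader(serialNumber);
          if (!connected) {
            throw new Error(`Could not connect to reader ${serialNumber}`);
          }
        }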

    Ash: I was surprised by how good a React Native app could be. Previous attempts at cross-platform frameworks, like PhoneGap, always felt like PhoneGap apps. They never felt like they really belonged on the OS. Software written in React Native is hard to distinguish from software written in Swift or Objective-C. I thought that the value proposition of React Native was the ability to write cross-platform apps with a single codebase, but it was only used on iOS during the five years I used it at Artsy. React Native’s declarative model is just a better way to create user interfaces, and I think we’ve seen the industry come to this consensus as SwiftUI and Jetpack Compose play catch-up.

    Let’s start by exploring React Native from a web developer’s perspective. React Native uses a lot of the tooling that you, as a web developer, are already familiar with. But it also uses some brand new tools. In some ways, you might feel like you’re starting from scratch. You might struggle with the new tools to accomplish simple tasks, and that’s normal. The discomfort that comes from feeling like a beginner is normal, and it’s mitigated with help from your peers.

    Android Studio and Xcode can both be intimidating, even for experienced developers who use them day-to-day. Ideally, your team has enough Android and iOS developers to build solid tooling foundations and to help you when you get stuck. At Shopify, we rarely use the Android Studio and Xcode IDEs to write React Native apps. Instead, we use Visual Studio Code for most of our work. Our mobile tooling teams created command line abstractions for tools like adb, xcodebuild, and xcrun. So you could clone a React Native repository and get a simulator running with the code without ever opening Android Studio or Xcode.

    What was the most challenging part about getting used to RN?

    AJ: For me it was the uncertainty. I came in confident in my React skills, but I found myself never knowing what mobile-specific concerns existed, and when they might come into play. Since everything needs to be run over the RN Bridge, some aspects of web development, like CSS animations for example, just don’t really translate in a way that’s performant enough. So with no mobile development background, any of those mobile-specific concerns were a blind spot for me. This is an area where having coworkers from a mobile background has been a huge benefit.

    Michelle: Understanding the framework and metro server and node and packages and hooks and state management and and and, so... everything?! Although if you create analogies to native development, you’ll find something similar. One quote I like is: “You’re not starting from scratch, you’re starting from experience.” This helps me to put in perspective that although it’s a new language and framework and the syntax may be different—the semantics are the same, meaning that if I wanted to create something like I would using native android development, I just had to figure out how I could achieve the same thing using a bit of JavaScript (TypeScript) and if needed, leverage my native development skills and the React bridge to do it.

    Ash: I was really sceptical about the node ecosystem, it felt like a house of cards. Having over a thousand dependencies in a fresh React Native project feels… almost irresponsible? At least from a native background in Swift and Objective-C. It’s a different approach to be sure, to work around the unique constraints of JavaScript. I’ve come to appreciate it, but I don’t blame anyone for feeling overwhelmed by the massive amount of tools that your work sits atop of.

    Your experience as a web developer offers a perspective on how to build user interfaces that’s new to native developers. While you may be familiar with tools like node and yarn, these are very different from the tools that native developers are used to. Your familiarity, from the basics of JSX and React components to your intuition of React best practices and software patterns, will be a huge help to your native developer colleagues.

    Offer your guidance and support, and ask questions. Android and iOS developers will tend to use tools they are already familiar with, even if a better cross-platform solution exists. Challenge your teammates to come up with cross-platform abstractions instead of platform-specific implementations.

    What do you think would be painful about RN but turned out to be really friendly?

    AJ: That's a difficult question for me, I didn’t really have anything in particular that I expected to be painful. I can say that during the little bit of time I tried to learn RN on my own before I started at Shopify, I found getting the simulators and emulators set up to be painful. I was grateful when I got started here to find that the onboarding documentation and RN tutorial helped me breeze through the setup way faster than expected. I was up and running with a test app in the simulator within minutes, which let me actually start learning RN right away instead of struggling with the tech.

    Michelle: Coming from a native background using a powerhouse of an IDE, I thought the development process would slow me down. Thankfully, I’ve got my IDE (IntelliJ IDEA) now set up so that I can write code in React and at the same time write and inspect native kotlin code. You’d never think that a good search and replace and refactoring system would speed up your dev process by 10x but it really does.

    Ash: I was worried that writing JavaScript would be painful, because no one I knew really liked JavaScript. At the time, CoffeeScript was still around, so no one really liked JavaScript, especially iOS developers. But instead, I found that JavaScript had grown a lot since I’d last used it. Furthermore, TypeScript provided a lot of the compile-time advantages of Swift but with a much more humane approach to compilers. I can’t think of a reason I would ever use React Native without TypeScript, it makes the whole experience so much more friendly.

    Next, let’s explore the Android and iOS perspectives. Although the Android and iOS platforms are often seen to have an antagonistic relationship with one another, the experiences of Android and iOS developers coming to React Native are remarkably similar. As a native developer, you might feel like you’re turning your back on all the experience you’ve gained so far. But your experience building native applications is going to be a huge benefit to your React Native team! For example, you have a deep familiarity with your platform’s user interface idioms; you should use this intuition to help your team build user interfaces that “feel” like native applications.

    What do you wish was better about working in RN?

    AJ: Accessibility. I’m a huge accessibility advocate, I push for implementation of accessibility in anything I work on. But this is a bit of a struggle with React Native. Accessibility is an area of RN that doesn’t yet have a lot of educational material out there. A lot of the principles for web still apply, but the correct way to implement accessibility in some areas isn’t yet well established and with fewer semantic building blocks very little gets built in by default. So developers need to be even more aware and intentional about what they create.
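
    To show the kind of intentional work AJ describes, here’s a small sketch (a hypothetical component using standard React Native accessibility props) of the annotations a tappable element needs because so little is built in by default:

        import React from "react";
        import { Pressable, Text } from "react-native";

        // Unlike a web <button>, a Pressable carries no implicit semantics, so the
        // role, label, and state all have to be declared explicitly.
        export function AddToCartButton(props: { disabled: boolean; onPress: () => void }) {
          const { disabled, onPress } = props;
          return (
            <Pressable
              accessible
              accessibilityRole="button"
              accessibilityLabel="Add to cart"
              accessibilityState={{ disabled }}
              disabled={disabled}
              onPress={onPress}
            >
              <Text>Add to cart</Text>
            </Pressable>
          );
        }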

    Michelle: React Native land seems like the wild wild west after coming from languages with well established patterns and libraries as well as the documentation to support it. These do currently exist for RN but because of how new this framework is and the slow (but increasing!) adoption of it, there's still a long way to go to make it accessible for more people by providing more examples and resources to refer to.

    Ash: I wish that the tools were more cohesive. Having worked in the Apple developer ecosystem for so long, I know how empowering a really polished developer tool can be. Not that Apple’s tools are perfect, far from it, but they are cohesive in a way that I miss. There’s usually one way to accomplish a task, but in React Native, I’m often left figuring things out on my own.

    React Native apps are still apps and, consequently, they operate under the same conditions as native apps. Mobile devices have specific constraints and capabilities that web developers aren’t used to working with. Native developers are used to thinking about an app’s user experience as more than only the user interface. For example, native developers are keenly aware of how cellular and GPS radios impact battery life. They also know the value of integrating deeply with the operating system, like using rich push notifications or native share sheets. The same skills that help native developers ensure apps are “good citizens” of their platform are still critical to building great React Native applications.

    When did opinions about React Native change?

    AJ: I’m not sure I’d say I’ve had a change of opinion. I went into React Native curious and uncertain of what to expect. I’d heard good things from other web devs that I knew who had tried RN. So I felt pretty confident that I’d be able to pick it up and that I would enjoy it. If anything I would say the learning process went even smoother than anticipated.

    Michelle: My opinions changed when I found that a React Native app allows us to integrate with the native SDKs we've developed at Shopify and still results in a performant app. I realized that Kotlin over the React bridge works and performs well together and still keeps up my skills in native Android development.

    Ash: They changed when I built my first feature from the ground-up in React, for the web. The component model just kind of “clicked” for me. The next time I worked in Swift, everything felt so cumbersome and awkward. I was spending a lot of time writing code that didn’t make the software unique or valuable, it was just boilerplate.

    Native developers are also familiar with mobile devices’ native APIs for geofencing, augmented reality, push notifications, and more. All these APIs are accessible to React Native applications, either through existing open source node modules or custom native modules that you can write. It’s your job to help your team make full and appropriate use of the device’s capabilities. A purely React Native app can be good, but it takes collaborating with native developers to make an app that’s really great.

    How would you describe your experiences with React Native at Shopify?

    AJ: I’ve had a great experience working with React Native at Shopify. I came in as a React dev with absolutely no mobile experience of any kind. I was pointed towards a coworker’s day long “Introduction to React Native” workshop, and it gave me a better understanding than I’d gotten from the self learning I’d attempted previously. On top of that, I have knowledgeable and supportive coworkers that are always willing to take the time out of their day to lend a hand and help fill in the gaps. Additionally the tooling created by the React Native Foundations team takes away the majority of the pain involved in getting started with React Native to begin with.

    Michelle: Everything goes at super speed at Shopify—this includes learning React Native! There are so many resources within Shopify to support you including internal workshops providing a crash course to RN. Other teams are also using RN so there’s opportunity to learn from them and the best practices they’re following. Shopify also has specific mobile tooling teams to support RN in our CI environment and automation to ship to production. In addition to the mobile tooling team, there’s a specific React Native Foundations team that builds internal tools to help others get familiar and quickly spin up RN apps. We have monthly mobile team meetups to share and gain visibility into the different mobile projects built across Shopify.

    Ash: I’m still very new to the company, but my experience here is really positive so far. There’s a lot of time spent on the foundations of our React Native apps—fast reload, downloadable bundles built for each pull request, lint rules that keep developers from making common mistakes—that all empower developers to move very, very quickly. In React Native, there is no compile step to interrupt a developer’s workflow. We get to develop at the speed of thought. Since Shopify invests so much in developer tooling, getting up to speed with the Shop app took no time at all.

    Learning anything new, including RN, can feel intimidating, but you can learn RN. Your existing skills will help you learn, and learning it is best done in a team environment with many perspectives (which we have at Shopify, apply today!).

    We see now that both web developers and native developers have different perspectives on building mobile apps with React Native, and how those perspectives complement each other. React Native teams at Shopify are generally staffed with developers from web, Android, and iOS backgrounds because the teams produce the best software when they can draw from these perspectives.

    Whether you’re a web developer looking to expand the possibilities of what you can create, or you’re a native developer looking to move faster with a cross-platform technology, React Native can be a great solution. But just like any new skill, learning React Native can be intimidating. The best approach is to learn in a team environment where you can draw from the strengths of your colleagues. And if you’re keen to learn React Native in a supportive, collaborative environment, Shopify is hiring! You can also view a presentation on How We Write React Native Apps at Shopify from our Shipit! Presents series.

    AJ Robidas is a developer from Ontario, Canada, with a specialization in accessibility. She has a varied background spanning C++ and Perl, some Python backend work, and multiple web frameworks (AngularJS, Angular, StencilJS, and React). For the past year she has been a React Native developer on the Shop team implementing new and updated experiences for the Shop App.

    Michelle Fernandez is a senior software developer from Toronto, Canada with nearly a decade of experience in the mobile applications world. She has been working on Shopify’s Android Point of Sale app since its redesign with Shopify Polaris and has contributed to its rebuild as a React Native app from inception to launch. The Shopify POS app is now in production and used by Shopify merchants around the world.

    Ash Furrow is a developer from New Brunswick, Canada, with a decade of experience building native iOS applications. He has written books on software development, has a personal blog, and currently contributes to the Shop team at Shopify.


    Shopify-Made Patterns in Our Rails Apps


    At Shopify, we’ve developed our own patterns in order to support our global platform. Before coming here, I developed multiple Ruby (and Rails) applications at multiple growth stages. Because of that, I quickly came to appreciate some of the workarounds and automation that were created to support the large codebase of Shopify.

    If there’s something I appreciate about Ruby on Rails, it’s the principle of convention over configuration it’s been built with. This enables junior developers to build higher quality code than in other languages, simply by following conventions. Conventions are also great when moving to a new Rails application: the file structure is always familiar.

    But this makes it harder to go outside conventions. When people ask me about the biggest challenges of Ruby, I usually say it’s easy to start, but hard to become an expert. Everything is so abstracted, so one must be really curious and take the time to understand how Ruby and specifically Rails actually work.

    Our monolith, Shopify Core, employs many of the common Rails conventions. This ranges from the default application structure, to the usage of in-built libraries like the Active Record ORM, Active Model, or Ruby gems like FrozenRecord.

    At Shopify, we implement what most merchants need, most of the time. Similarly, the Rails framework also provides the infrastructure that most developers need, most of the time. Therefore, we had to find creative ways to make the largest Rails monolithic application maintainable.

    If you’re getting ready to join Shopify as a developer, my goal is that this blog post will be useful to you, whether you’re new to Ruby or you’ve worked with Ruby on other projects in the past.

    Dev

    I would like to give the first mention to our command line developer tool, dev. At Shopify, we have thousands of developers working on hundreds of active projects. In the past, many of these projects had their own workflows and instructions on setup, how to run tests, and so on.

    We created dev to provide us with a unified workflow across a variety of projects. It gives us a way to specify and automate the installation of all the dependencies and includes the workflow items required to boot the project on macOS, from Xcode to bin/rails db:migrate. This is probably the first Shopify-made infrastructure you’ll use when starting at Shopify. It’s easy to take it for granted, but dev is doing so much towards increasing our productivity.

    Time is money and automations are one time efforts.

    We believe consistency is important across development environments. Inconsistencies can lead to debugging nightmares and incorrect local behaviour. Even with the existing tools like chruby, bundler, and homebrew to manage dependencies, setup can be a multi-step tedious process, and it can be difficult to outline the processes that achieve the desired consistency. So, we standardise many of the commands we use at Shopify through dev.

    One of the most powerful features of dev is the ability to spin up services in multiple programming languages. That means each repo has the same base configuration, structure, and libraries. Our infrastructure team is constantly working to make dev better to ultimately increase developer productivity. Dev also abstracts environment variables. When joining smaller companies, one could spend days “fishing” for environment variables before getting a few connected systems up and running.

    Dev also enables Shopify developers to enable and disable integrations with interconnected services. This is usually manually changed through environment variables or configuration types.

    Lastly, dev even abstracts command aliases! Ruby is already pretty good on commands, but when looking at tools, the commands can get super long. And this is where aliases help us developers save time, as we can make shortcuts for longer commands. So Shopify took this to the next level: why let developers set up their own environment if they can get a baseline configuration right through dev? This also helps standardise commands across projects, regardless of the programming language. For example, before I'd use the Hub package for opening PRs. Now, I just use dev open pr.

    Pods

    Shopify Core has a podded architecture, which means that the database is split into an arbitrary number of subsets, each containing a distinct subset of shops. Each pod runs Shopify independently, with a database containing a portion of our shops. The concept is based on the shard database infrastructure pattern. The Rails framework already has the pod/shard structure built in. It was implemented with Shopify’s usage in mind and in collaboration with GitHub. In comparison with the shard database pattern, we extend it to the full infrastructure. That includes provisioning, deployment, load balancers, caching, and servers. If one pod shuts down temporarily, the other pods aren’t affected. If you’d like to learn more about the infrastructure behind this, check out our blog post about running Kafka on Kubernetes at Shopify.

    Horizontally scaling out our monolith was the fastest solution to handling our load.

    Shopify is not just a software as a service company. It’s a platform able to generate full websites for over 1.7 million merchants. Whenever we deliver our services to merchants, we look at data in the context of the merchant's store. And that’s why we split everything by shop, including:

    • Incoming HTTP requests
    • Background jobs
    • Asynchronous events

    That’s why every table in a podded database is connected to a shop. The shop is necessary for podding—our solution for horizontal scaling. And the link helps us avoid having data leaks between shops.
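
    As a conceptual sketch of that idea (written in TypeScript purely for illustration, with hypothetical names, since Shopify Core itself is a Rails application), every unit of work resolves its shop to a pod before touching the database:

        // Hypothetical sketch: every unit of work is scoped to a shop, and the shop
        // deterministically maps to exactly one pod (an independent slice of the
        // database and its supporting infrastructure).
        type PodId = number;

        interface DatabaseConnection {
          query(sql: string, params: unknown[]): Promise<unknown[]>;
        }

        interface PodCatalog {
          podForShop(shopId: number): PodId; // e.g. a small routing table replicated everywhere
          connectionFor(pod: PodId): DatabaseConnection;
        }

        export async function loadOrders(catalog: PodCatalog, shopId: number) {
          const pod = catalog.podForShop(shopId);
          const db = catalog.connectionFor(pod);
          // Every table carries a shop_id, which both enables podding and guards
          // against data leaking across shops.
          return db.query("SELECT * FROM orders WHERE shop_id = ?", [shopId]);
        }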

    For a more detailed overview of pods, check out A Pods Architecture to Allow Shopify to Scale.

    Domain Driven Design

    At Shopify, we love monoliths. The same way microservices have their challenges, so do monoliths, but these are challenges we're excited to try and solve.

    Splitting concerns became a necessity to support delivery in our growing organization.

    Monoliths can serve our business purpose very well—if they aren’t a mess. And this is where domain driven architecture comes into play. This concept wasn’t invented by Shopify, but it was definitely tweaked to work in our domain. If you’d like to learn more about how we deconstructed our monolith through components, check out Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity and Under Deconstruction: The State of Shopify’s Monolith.

    We did split our code into domains, but that’s about all we split. Traditionally, we’d see no link between domains besides public or internal APIs. But our database is still common for all domains, and everything is still linked to the Shop. This means we’re breaking domain boundaries every time we call Shop from another domain. As mentioned earlier, this is a necessity for our podded architecture. This is where it becomes trickier: every time we instantiate a model outside our domain, we’re ignoring component boundaries and we receive a warning for it. But, because the shop is already part of every table, the shop is practically part of every domain.

    Something else you may be surprised by is we don’t enforce any relationships between tables on the database layer. This means the foreign keys are enforced only at the code level through models.

    And, even though we use ActiveRecord migrations (not split by pods), running all historical migrations wouldn’t be feasible. Because of that, we only use migrations in the short term. Every month or so, we merge our migrations into a raw SQL file which holds our database structure. This avoids the platform spending hours running migrations going back 10 years. This blog post, Pros and Cons of Using structure.sql in Your Ruby on Rails Application, explains in more detail the benefits of using a structure.sql file.

    Standardizing How We Write Code

    We expect to hire over 2,000 people this year. How can we control the quality of the code written? We do it by detecting repetitive mistakes. There are so many systems Shopify created to address this, ranging from gems to generators.

    We built safeguards to keep quality levels up in a fast scaling organization.

    One of the frequently used tools we implemented is the translation platform: a system handling creation, translation, and publication of translations directly through git.

    In smaller companies, you’d just receive translations from the marketing team to embed in the app, or get them through a CRM. This is certainly not enough when it comes to globalizing such a large application. The goal is to enable anyone to release their work while translations are being handled asynchronously, and it definitely saves us a lot of time. All we need to do is push the English version, and all the strings are automatically sent to a third-party system where translators can add their translations. Without any input from the developers, the translations are committed directly back to our repos. The idea was first developed during Shopify Hack Days back in 2017. To learn more, check out this blog post about our translation platform.

    Our maintenance task system also deserves a memorable mention. It’s built on the Rails Active Job library, but has been adapted to work with our podded infrastructure. In a nutshell, it’s a Rails engine for queuing and managing maintenance tasks. In case you’d like to look into it, we’ve made this project open source.

    In our monolith, we’ve also set up tons of automatic tests letting us know when we’re taking the wrong approach, and we’ve put limits in place to avoid overloading our system when spawning jobs.

    Another system that standardizes how we do things is Monorail. Initially inspired by Airbnb’s Jitney, Monorail enforces schemas for widely used events. It creates contracts between Kafka producers and consumers through a defined structure for the data sent as JSON (see the conceptual sketch after the list below). Some benefits are:

    1. With unstructured events, events with different structures would end up as part of the same data warehouse table. Monorail creates a contract between developers and data scientists through schemas. If a schema changes, it has to be done through versioning.
    2. It also avoids Personally Identifiable Information (PII) leaks. Schemas all go through privacy review, ensuring PII fields are annotated as such and are automatically scrubbed (redacted or tokenized).
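
    Here’s the conceptual sketch mentioned above. Monorail is internal tooling, so the event name, fields, and publish function below are hypothetical; the point is the shape of the contract: a fixed structure, an explicit schema version, and PII fields annotated so they can be scrubbed.

        // Hypothetical sketch of a schema'd, versioned analytics event. Monorail itself
        // is internal; the point is the contract: a fixed shape, an explicit schema
        // version, and PII fields called out so they can be scrubbed.
        interface CheckoutCompletedEventV2 {
          schema: "checkout_completed";
          version: 2;
          shopId: number;
          totalPriceCents: number;
          // Annotated as PII: redacted or tokenized before it reaches the warehouse.
          buyerEmail: string;
        }

        function publish(event: CheckoutCompletedEventV2): void {
          // A real producer would validate against the registered schema and send the
          // payload to Kafka; a breaking change would require bumping `version`.
          console.log(JSON.stringify(event));
        }

        publish({
          schema: "checkout_completed",
          version: 2,
          shopId: 42,
          totalPriceCents: 1999,
          buyerEmail: "buyer@example.com",
        });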

    I’ve covered many different topics here in this introduction to all of the awesome features we’ve set up to increase our productivity levels and focus on what matters: shipping great features. If you decide to join us, this overview should give you enough background to help you take the right approach at Shopify from the beginning.

    Ioana Surdu-Bob is a Developer at Shopify, working on the Shopify Payments team. She’s passionate about personal finance and investing. She’s trying to help everyone build for financial independence through Konvi, a crowdfunding platform for alternative assets.




    Shopify's Path to a Faster Trino Query Execution: Infrastructure


    By Matt Bruce & Bruno Deszczynski

    Driving down the amount of time data scientists are waiting for query results is a critical focus (and necessity) for every company with a large data lake. However, handling and analyzing high-volume data within split seconds is complicated. One of the biggest hurdles to speed is whether you have the proper infrastructure in place to efficiently store and query your data.

    At Shopify, we use Trino to provide our data scientists with quick access to our data lake, via an industry standard SQL interface that joins and aggregates data across heterogeneous data sources. However, our data has scaled to the point where we’re handling 15 Gbps and over 300 million rows of data per second. With this volume, greater pressure was put on our Trino infrastructure, leading to slower query execution times and operational problems. We’ll discuss how we scaled our interactive query infrastructure to handle the rapid growth of our datasets, while enabling a query execution time of less than five seconds.

    Our Interactive Query Infrastructure 

    At Shopify, we use Trino and multiple client apps as our main interactive query tooling, where the client apps are the interface and Trino is the query engine. Trino is a distributed SQL query engine, designed to query large data sets distributed over heterogeneous data sources. The main reason we chose Trino is that it gives you optionality in your choice of database engines. However, it’s important to note that Trino isn’t a database itself, as it lacks the storage component. Rather, it's optimized to perform queries across one or more large data sources.

    Our architecture consists of two main Trino clusters:

    • Scheduled cluster: runs reports from Interactive Analytics apps configured on a fixed schedule.
    • Adhoc cluster: runs any on-demand queries and reports, including queries from our experiments platform.

    We use a fork of Lyft’s Trino Gateway to route queries to the appropriate cluster by inspecting header information in the query. Each of the Trino clusters runs on top of Kubernetes (Google GKE) which allows us to scale the clusters and perform blue-green deployments easily.
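
    As a rough sketch of header-based routing (the routing header name below is hypothetical, since our gateway fork is internal; /v1/statement and X-Trino-User come from Trino’s standard HTTP client protocol), a client tags each query so the gateway can pick a cluster:

        // Conceptual sketch: the client tags each query with a routing header, and the
        // gateway inspects it to choose a cluster. The routing header name is
        // hypothetical; X-Trino-User and /v1/statement come from Trino's standard
        // HTTP client protocol.
        export async function submitQuery(
          gatewayUrl: string,
          sql: string,
          routingGroup: "adhoc" | "scheduled"
        ): Promise<unknown> {
          const response = await fetch(`${gatewayUrl}/v1/statement`, {
            method: "POST",
            headers: {
              "X-Trino-User": "data-scientist",
              "X-Routing-Group": routingGroup, // hypothetical header read by the gateway
            },
            body: sql,
          });
          if (!response.ok) {
            throw new Error(`Query submission failed: ${response.status}`);
          }
          // The first response includes a nextUri that the client polls for results.
          return response.json();
        }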

    While our Trino deployment managed to process an admirable amount of data, our users had to deal with inconsistent query times depending on the load of the cluster, and occasional situations where the cluster became so bogged down that almost no queries could complete. We had to get to work to identify what was causing these slow queries, and speed up Trino for our users.

    The Problem 

    When it comes to querying data, Shopify data scientists (rightfully) expect to get results within seconds. However, we encounter scenarios like interactive analytics, A/B testing (experiments), and reporting all in one place. In order to improve our query execution times, we focused on speeding up Trino, as it offers a larger share of the optimization potential for the final performance of queries executed via any SQL client.

    We wanted to achieve a P95 query latency of less than five seconds, which would be a significant decrease (approximately 30 times). That was a very ambitious target, as approximately five percent of our queries were running for around one to five minutes. To achieve this we started by analyzing these factors:

    • Query volumes
    • Most often queried datasets
    • Queries consuming most CPU wall time
    • Datasets that are consuming the most resources
    • Failure scenarios.

    When analyzing the factors above, we discovered that it wasn’t necessarily the query volume itself that was driving our performance problems. We noticed a correlation between certain types of queries and the datasets consuming the most resources, and that combination was creating a lot of error scenarios for us. So we decided to zoom in and look into the errors.

    We started looking at error classes in particular:

    A dashboard showing 0.44% Query Execution Failure rate and a 0.35% Highly relevant error rate. The dashboard includes a breakdown of the types of Presto errors.
    Trino failure types breakdown

    It can be observed that our resource relevant error rate (related to exceeding resource use) was around 0.35 percent, which was acceptable due to the load profile that was executed against Trino. What was most interesting for us was the ability to identify the queries that were timing out or causing a degradation in the performance of our Trino cluster. At first it was hard for us to properly debug our load specific problems, as we couldn’t recreate the state of Trino during the performance degradation scenarios. So, we created a Trino Query Replicator that allowed us to recreate any load from the past.
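
    The Query Replicator itself is internal tooling, but the core idea can be sketched in a few lines: replay a captured log of queries against a test cluster while preserving their original relative timing, so overlapping load overlaps again. The types and function below are hypothetical:

        // Conceptual sketch of load replay: take a log of (offset, sql) pairs captured
        // from production and re-issue them against a test cluster with the same
        // relative pacing, so a past degradation scenario can be recreated.
        type LoggedQuery = { offsetMs: number; sql: string };

        const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

        export async function replay(
          log: LoggedQuery[],
          submit: (sql: string) => Promise<unknown>
        ): Promise<void> {
          const start = Date.now();
          for (const entry of log) {
            // Wait until this query's original offset relative to the start of capture.
            const wait = entry.offsetMs - (Date.now() - start);
            if (wait > 0) await sleep(wait);
            // Fire and forget so queries that overlapped in production overlap here too.
            void submit(entry.sql).catch((err) => console.error("replayed query failed", err));
          }
        }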

    Recreating the state of Trino during performance degradation scenarios enabled us to drill down deeper on the classes of errors, and identify that the majority of our problems were related to:

    • Storage type: especially compressed JSON format of messages coming from Kafka.
    • Cluster Classes: using the ad-hoc server for everything, and not just what was scheduled.
    • CPU & Memory allocation: both on the coordinator and workers. We needed to scale up together with the number of queries and data.
    • JVM settings: we needed to tune our virtual machine options.
    • Dataset statistics: allowing for better query execution via cost based optimization available in Trino.

    While we could write a full book diving into each problem, for this post we’ll focus on how we addressed problems related to JVM settings, CPU and Memory allocation, and cluster classes.

    A line graph showing the P95 execution time over the month of December. The trend line shows that execution time was steadily increasing.
    Our P95 Execution time and trend line charts before we fine tuned our infrastructure

    The Solution

    In order to improve Trino query execution times and reduce the number of errors caused by timeouts and insufficient resources, we first tried to “money scale” the current setup. By “money scale” we mean we scaled our infrastructure horizontally and vertically: we doubled the size of our worker pods to 61 cores and 220GB of memory, while also increasing the number of workers we were running. Unfortunately, this alone didn’t yield stable results. For that reason, we dug deeper into the query execution logs, stack traces, and the Trino codebase, and consulted the Trino creators. From this exploration, we discovered that we could try the following:

    • Creating separate clusters for applications with predictable heavy compute requirements.
    • Lowering the number of concurrent queries to reduce coordinator lock contention.
    • Ensuring the recommended JVM recompilation settings are applied.
    • Limiting the maximum number of drivers per query task to prevent compute starvation.

    Workload Specific Clusters

    As outlined above, we initially had two Trino clusters: a Scheduled cluster and an Adhoc cluster. The cluster shared between users’ ad hoc queries and the experiment queries was causing frustrations on both sides. The experiment queries were adding a lot of excess load, causing users’ queries to have inconsistent query times: a query that might take seconds to run could take minutes if there were experiment queries running. Correspondingly, the users’ queries were making the runtime of the experiment queries unpredictable. To make Trino better for everyone, we added a new cluster just for the experiment queries, leveraging our existing deployment of Trino Gateway to route experiment queries there based on an HTTP header.

    We also took this opportunity to write some tooling that allows users to create their own ephemeral clusters for temporary heavy-duty processing or investigations with a single command (these clusters are torn down automatically by an Airflow job after a defined TTL).

    A system diagram showing the Trino infrastructure before changes. Mode and internal SQL clients feed into the Trino Gateway. The Gateway feeds into scheduled reports and adhoc queries.
    Trino infrastructure before
    A system diagram of the Trino infrastructure after changes. Mode and internal clients both feed into the Trino Gateway. The Gateway feeds into Scheduled Reports, Ad hoc queries, and experimental queries. In addition, the Internal SQL clients feed into Short-Term clusters
    Trino infrastructure after

    Lock Contention

    After exhausting the conventional scaling-up options, we moved on to the most urgent problem: when the Trino cluster was overloaded and work wasn’t progressing, what was happening? By analyzing metrics output to Datadog, we were able to identify a few situations that would arise. One problem we identified was that the Trino cluster’s queued work would continue to increase, but no queries or splits were being dispatched. In this situation, we noticed that the Trino coordinator (the server that handles incoming queries) was running, but it stopped outputting metrics for minutes at a time. We originally assumed that this was due to CPU load on the coordinator (those metrics were also unavailable). However, after logging into the coordinator’s host and looking at the CPU usage, we saw that the coordinator wasn’t busy enough to explain why it couldn’t report statistics. We proceeded to capture and analyze multiple stack traces and determined that the issue wasn’t an overloaded CPU, but lock contention on the internal resource group object from all the active queries and tasks.

    We set hardConcurrencyLimit to 60 in our root resource group to limit the number of running parallel queries and reduce the lock contention on the coordinator.

    "rootGroups": [
        {
        "hardConcurrencyLimit": "60",

    Resource group configuration

    This setting is a bit of a balancing act between allowing enough queries to run to fully utilize the cluster, and capping the amount running to limit the lock contention on the coordinator.

    A line graph showing Java Lang:System CPU load in percent over a period of 5 hours before the change. The graph highlights the spikes where metric dropouts happened six times.
    Pre-change CPU graph showing Metrics dropouts due to lock contention
    A line graph showing Java Lang:System CPU load in percent over a period of 5 hours after the change. The graph highlights there were no more metric dropouts.
    Post change CPU graph showing no metrics dropouts

    JVM Recompilation Settings

    After the coordinator lock contention was reduced, we noticed that we would have a reasonable number of running queries, but the cluster throughput would still be lower than expected. This caused queries to eventually start queuing up. Datadog metrics showed that a single worker’s CPU was running at 100%, but most of the others were basically idle.

    A line graph showing Java Lang:System CPU load by worker in percent over a period of 5 hours. It highlights that a single worker's CPU was running at 100%
    CPU Load distribution by worker

    We investigated this behaviour by doing some profiling of the Trino process with jvisualvm while the issue was occurring. What we found was that almost all the CPU time was spent either: 

    1. Doing GCM AES decryption of the data coming from GCS.
    2. JSON deserialization of that data.

    What was curious to us is that the datasets the affected workers were processing were no different from those being processed by the other workers. Why were these workers using more CPU time to do the same work? After some trial and error, we found that setting the following JVM options prevented our users from being put in this state:

    -XX:PerMethodRecompilationCutoff=10000
    -XX:PerBytecodeRecompilationCutoff=10000

    JVM settings

    It’s worth noting that these settings were added to the recommended JVM options in a later version of Trino than we were running at the time. There’s a good discussion about those settings in the Trino GitHub repo! It seems that we were hitting a condition that was causing the JVM to no longer attempt compilation of some methods, which caused them to run in the JVM interpreter rather than as compiled code, which is much, much slower.

    In the graph below, the CPU load of the workers is much more evenly aligned, without the “long tail” of a single worker running at 100 percent.

    A line graph showing Java Lang:System CPU load by worker after the change, with the load spread evenly across workers.
    CPU Load distribution by worker

    Amount of Splits Per Stage Per Worker

    In the process of investigating the performance of queries, we happened to come across an interesting query via the Trino Web UI:

    A screenshot showing the details displayed on the Trino WebUI. It includes query name, execution time, size, etc.
    Trino WebUI query details

    What we found was that one query had a massive number of running splits: approximately 29,000. This was interesting because, at that time, our cluster only had 18,000 available worker threads, and our Datadog graphs showed a maximum of 18,000 concurrent running splits. We’ll chalk that up to an artifact of the WebUI. Doing some testing with this query, we discovered that a single query could monopolize the entire Trino cluster, starving out all the other queries. After hunting around the Slack and forum archives, we came across an undocumented configuration option: `task.max-drivers-per-task`. This option enabled us to limit the maximum number of splits that can be scheduled per stage, per query, per worker. We set this to 16, which limited this query to around 7,200 active splits.
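
    Expressed as a Trino configuration property, this is simply the option and value mentioned above (where exactly it’s set in our deployment isn’t shown here):

    task.max-drivers-per-task=16

    Split concurrency configuration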

    The Results and What’s Next

    Without leveraging a storage upgrade, and by tapping into cluster node sizing, cluster classes, Trino configs, and JVM tuning, we managed to bring our execution latency down to 30 seconds and provide a stable environment for our users. The charts below present the final outcome:

    A bar graph showing the large decrease in execution time before the change and after the change.
    Using log scale binned results for execution time before and after
    A line graph showing the P95 execution time over a 3 month period.  The trend line shows that execution time reduces.
    P95 Execution time and trendline over 3 month period

    The changes in the distribution of queries across the execution time bins show that we managed to move more queries into the zero-to-five-second bucket and (most importantly) capped the execution time of the heaviest queries. Our execution time trendline speaks for itself, and as we’re writing this post, our P95 query execution time is under 30 seconds.

    By creating separate clusters, lowering the number of concurrent queries, ensuring the recommended JVM recompilation settings were applied, and limiting the maximum number of drivers per query task, we were able to scale our interactive query infrastructure.

    While addressing the infrastructure was an important step to speed up our query execution, it’s not our only step. We still think there is room for improvement and are working to make Trino our primary interactive query engine. We’re planning to put further efforts into:

    • Making our storage more performant (JSON -> Parquet).
    • Introducing an Alluxio cache layer.
    • Creating load profiling tooling.
    • Enhancing our dataset statistics to improve the Trino query optimizer’s ability to choose the most optimal execution strategy, not just the overall performance of user queries.
    • Improving our Trino Gateway by rolling out Shopify Trino Conductor (a Shopify-specific gateway), improving UI and infrastructure, and introducing weighted query routing.

    Matt Bruce: Matt is a four-year veteran at Shopify serving as Staff Data Developer for the Foundations and Orchestration team. He’s previously helped launch many open source projects at Shopify, including Apache Druid and Apache Airflow, and helped migrate Shopify’s Hadoop and Presto infrastructure from physical data centers into cloud-based services.

    Bruno Deszczynski: Bruno is a Data Platform EPM working with the Foundations team. He is obsessed with making Trino execute interactive analytics queries in under five seconds (P95) at Shopify.


    High Availability by Offloading Work Into the Background


    Unpredictable traffic spikes, slow requests to a third-party payment gateway, or time-consuming image processing can easily overwhelm an application, making it respond slowly or not at all. Over Black Friday Cyber Monday (BFCM) 2020, Shopify merchants made sales of over 5 Billion USD, with peak sales of over 100 Million USD per hour. On such a massive scale, high availability and short response times are crucial. But even for smaller applications, availability and response times are important for a great user experience.

    BFCM by the numbers globally 2020: Total Sales: $5.1B USD, Average cart prices $89.20 USD, Peak sales hour $102M+ at 12 pm EST, 50% more consumers buying from Shopify Merchant
    BFCM by the numbers

    High Availability

    High availability is often conflated with high server uptime, but it’s not sufficient that the server hasn’t crashed or shut down. In the case of Shopify, our merchants need to be able to make sales, so a buyer needs to be able to interact with the application. A banner saying “come back later” isn’t sufficient, and serving only one buyer at a time isn’t good enough either. To consider an application available, the community of users needs to be able to have meaningful interactions with the application. Availability can be considered high if these interactions are possible whenever the users need them to be.

    Offloading Work

    In order to be available, the application needs to be able to accept incoming requests. If the external-facing part of the application (the application server) is also doing the heavy lifting required to process the requests, it can quickly become overwhelmed and unavailable for new incoming requests. To avoid this, we can offload some of the heavy lifting into a different part of the system, moving it outside of the main request response cycle to not impact the application server’s availability to accept and serve incoming requests. This also shortens response times, providing a better user experience.

    Commonly offloaded tasks include:
    • sending emails
    • processing images and videos
    • firing webhooks or making third party requests
    • rebuilding search indexes
    • importing large data sets
    • cleaning up stale data

    The benefits of offloading a task are particularly large if the task is slow, consumes a lot of resources, or is error-prone.

    For example, when a new user signs up for a web application, the application usually creates a new account and sends them a welcome email. Sending the email is not required for the account to be usable, and the user doesn’t receive the email immediately anyways. So there’s no point in sending the email from within the request response cycle. The user shouldn’t have to wait for the email to be sent, they should be able to start using the application right away, and the application server shouldn’t be burdened with the task of sending the email.

    Any task not required to be completed before serving a response to the caller is a candidate for offloading. When uploading an image to a web application, the image file needs to be processed and the application might want to create thumbnails in different sizes. Successful image processing is often not required for the user to proceed, so it’s generally possible to offload this task. However, the server can no longer respond, saying “the image has been successfully processed” or “an image processing error has occurred.” Now, all it can respond with is “the image has been uploaded successfully, it will appear on the website later if things go as planned.” Given the very time-consuming nature of image processing, this trade-off is often well worth it, given the massive improvement of response time and the benefits of availability it provides.

    Background Jobs

    Background jobs are an approach to offloading work. A background job is a task to be done later, outside of the request response cycle. The application server delegates the task, for example, the image processing, to a worker process, which might even be running on an entirely different machine. The application server needs to communicate the task to the worker. The worker might be busy and unable to take on the task right away, but the application server shouldn’t have to wait for a response from the worker. Placing a message queue between the application server and worker solves this dilemma, making their communication asynchronous. Sender and receiver of messages can interact with the queue independently at different points in time. The application server places a message into the queue and moves on, immediately becoming available to accept more incoming requests. The message is the task to be done by the worker, which is why such a message queue is often called a task queue. The worker can process messages from the queue at its own speed. A background job backend is essentially some task queues along with some broker code for managing the workers.
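
    To make this concrete, here’s a minimal sketch using Rails Active Job (the framework referenced later in this post); the job, model, and service names (ProcessImageJob, Image, ThumbnailGenerator) are illustrative, not Shopify’s actual code.

    # app/jobs/process_image_job.rb
    class ProcessImageJob < ApplicationJob
      queue_as :default

      # Runs later on a worker process, outside the request response cycle.
      def perform(image_id)
        image = Image.find(image_id)    # hypothetical model
        ThumbnailGenerator.call(image)  # hypothetical service doing the heavy lifting
      end
    end

    # In the controller, the application server only places the task on the queue and moves on:
    ProcessImageJob.perform_later(image.id)

    Offloading image processing with a background job (illustrative sketch)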

    Features

    Shopify queues tens of thousands of jobs per second in order to leverage a variety of features.

    Response times

    Using background jobs allows us to decouple the external-facing request (served by the application server) from any time-consuming backend tasks (executed by the worker), thus improving response times. What improves response times for individual requests also improves the overall availability of the system.

    Spikeability

    A sudden spike in, say, image uploads doesn’t hurt if the time-consuming image processing is done by a background job. The availability and response time of the application server are constrained only by the speed with which it can queue image processing jobs, and the speed of queueing more jobs is not constrained by the speed of processing them. If the worker can’t keep up with the increased amount of image processing tasks, the queue grows. But the queue serves as a buffer between the worker and the application server, so users can continue uploading images as usual. With Shopify facing traffic spikes of up to 170k requests per second, background jobs are essential for maintaining high availability despite unpredictable traffic.

    Retries and Redundancy

    When a worker encounters an error while running a job, the job is requeued and retried later. Since all of that happens in the background, it doesn’t affect the availability or response times of the external-facing application server. This makes background jobs a great fit for error-prone tasks like requests to an unreliable third party.

    Parallelization

    Several workers might process messages from the same queue, allowing more than one task to be worked on simultaneously and distributing the workload. We can also split a large task into several smaller tasks and queue them as individual background jobs so that several of these subtasks are worked on simultaneously.

    Prioritization

    Most background job backends allow for prioritizing jobs. They might use priority queues that don’t follow the first-in, first-out approach, so that high-priority jobs end up cutting the line. Or they set up separate queues for jobs of different priorities and configure workers to prefer jobs from the higher-priority queues. No worker needs to be fully dedicated to high-priority jobs, so whenever there’s no high-priority job in the queue, the worker processes lower-priority jobs. This is resourceful, reducing the idle time of workers significantly.

    Event-based and Time-based Scheduling

    Background jobs aren’t always queued by an application server; a worker processing a job can also queue another job. While application servers and workers queue jobs based on events like user interaction or mutated data, a scheduler might queue jobs based on time (for example, for a daily backup).

    Simplicity of Code

    The background job backend encapsulates the asynchronous communication between the client requesting the job and the worker executing the job. Placing this complexity into a separate abstraction layer keeps the concrete job classes simple. A concrete job class only implements the task to be done (for example, sending a welcome email or processing an image). It’s not aware of being run in the future, being run on one of several workers, or being retried after an error.

    Challenges

    Asynchronous communication poses some challenges that don’t disappear by encapsulating some of its complexity. Background jobs aren’t any different.

    Breaking Changes to Job Parameters

    The client queuing the job and the worker processing it don’t always run the same software version; one of them might already have been deployed with a newer version. This situation can last for a significant amount of time, especially when practicing canary deployments. Changes to the job parameters can break the application if a job has been queued with a certain set of parameters, but the worker processing the job expects a different set. Breaking changes to the job parameters need to roll out through a sequence of changes that preserve backward compatibility until all legacy jobs from the queue have been processed.

    No Exactly-once Delivery

    When a worker completes a job, it reports back that it’s now safe to remove the job from the queue. But what if the worker processing the job remains silent? We can allow other workers to pick up such a job and run it. This ensures that the job runs even if the first worker has crashed. But if the first worker is simply a little slower than expected, our job runs twice. On the other hand, if we don’t allow other workers to pick up the job, the job won’t run at all if the first worker did crash. So we have to decide what’s worse: not running the job at all, or running it twice. In other words, we have to choose between at-least-once and at-most-once delivery.

    For example, not charging a customer is not ideal, but charging them twice might be worse for some businesses. At-most-once delivery sounds right in this scenario. However, if every charge is carefully tracked and the job checks that state before attempting a charge, running the job a second time doesn’t result in a second charge. The job is idempotent, allowing us to safely choose at-least-once delivery.
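
    A rough sketch of such an idempotent job in Active Job could look like the following; Order, PaymentGateway, and the charged flag are hypothetical names, and real payment systems usually rely on idempotency keys rather than a simple boolean.

    class ChargeCustomerJob < ApplicationJob
      def perform(order_id)
        order = Order.find(order_id)
        return if order.charged?           # a previous run already recorded the charge

        charge = PaymentGateway.charge(    # hypothetical third-party client
          customer: order.customer,
          amount: order.total
        )
        order.update!(charged: true, charge_id: charge.id)
      end
    end

    An idempotent charge job (illustrative sketch)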

    Non-Transactional Queuing

    The job queue is often in a separate datastore. Redis is a common choice for the queue, while many web applications store their operational data in MySQL or PostgreSQL. When a transaction for writing operational data is open, queuing the job will not be part of this enclosing transaction - writing the job into Redis isn’t part of a MySQL or PostgreSQL transaction. The job is queued immediately and might end up being processed before the enclosing transaction commits (or even if it rolls back).

    When accepting external input from user interaction, it’s common to write some operational data with very minimal processing, and queue a job performing additional steps processing that data. This job may not find the data it needs unless we queue it after committing the transaction writing the operational data. However, the system might crash after committing the transaction and before queuing the job. The job will never run, the additional steps for processing the data won’t be performed, leaving the system in an inconsistent state.

    The outbox pattern can be used to create transactionally staged job queues. Instead of queuing the job right away, the job parameters are written into a staging table in the operational data store. This can be part of a database transaction writing operational data. A scheduler can periodically check the staging table, queue the jobs, and update the staging table when the job is queued successfully. Since this update to the staging table might fail even though the job was queued, the job is queued at least once and should be idempotent.
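
    A simplified sketch of this pattern in Rails might look like the following; StagedJob is a hypothetical model backed by the staging table, and the draining loop would typically be run by a scheduler process.

    # Inside the request: stage the job in the same transaction as the operational data.
    ApplicationRecord.transaction do
      order = Order.create!(order_params)
      StagedJob.create!(job_class: "FulfillOrderJob", arguments: [order.id])
    end

    # The scheduler periodically drains the staging table and enqueues the real jobs.
    StagedJob.where(enqueued_at: nil).find_each do |staged|
      staged.job_class.constantize.perform_later(*staged.arguments)
      staged.update!(enqueued_at: Time.current) # may fail after enqueueing, hence at-least-once
    end

    A transactionally staged job queue (illustrative sketch)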

    Depending on the volume of jobs, transactionally staged job queues can result in quite some load on the database. And while this approach guarantees the queuing of jobs, it can’t guarantee that they will run successfully.

    Local Transactions

    A business process might involve database writes from the application server serving a request and from workers running several jobs, each committing its own local database transaction. Eventual consistency is reached when the last local transaction commits. But if one of the jobs fails to commit its data, the system is again in an inconsistent state. The SAGA pattern can be used to guarantee eventual consistency: in addition to queuing jobs transactionally, the jobs also report back to the staging table when they succeed. A scheduler can check this table and spot inconsistencies. This results in an even higher load on the database than a transactionally staged job queue alone.

    Out of Order Delivery

    The jobs leave the queue in a predefined order, but they can end up on different workers and it’s unpredictable which one completes faster. And if a job fails and is requeued, it’s processed even later. So if we’re queueing several jobs right away, they might run out of order. The SAGA pattern can ensure jobs are run in the correct order if the staging table is also used to maintain the job order.

    A more lightweight alternative can be used if consistency guarantees are not a concern: once a job has completed its tasks, it can queue another job as a follow-up, which ensures the jobs run in the predefined order (see the sketch below). The approach is quick and easy to implement since it doesn’t require a staging table or a scheduler, and it doesn’t generate any extra load on the database. But the resulting system can become hard to debug and maintain, since it pushes all its complexity down a potentially long chain of jobs queueing other jobs, with little observability into where exactly things might have gone wrong.
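
    The chaining approach can be sketched like this (the job names are illustrative): each job enqueues its successor only after its own work is done, which preserves the order without a staging table.

    class ExportRecordsJob < ApplicationJob
      def perform(export_id)
        # ... write the export file ...
        NotifyExportReadyJob.perform_later(export_id) # the follow-up runs only after this step succeeds
      end
    end

    Chaining jobs to preserve order (illustrative sketch)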

    Long Running Jobs

    A job doesn’t have to be as fast as a user-facing request, but long-running jobs can cause problems. For example, the popular Ruby background job backend Resque prevents workers from being shut down while a job is running, which blocks deployments. It’s also not very cloud-friendly, since resources are required to be available for long, uninterrupted stretches of time. Another popular Ruby background job backend, Sidekiq, aborts and requeues the job when a shutdown of the worker is initiated. However, the next time the job runs, it starts over from the beginning, so it might be aborted again before completion. If deployments happen faster than the job can finish, the job has no chance to succeed. With the core of Shopify deploying about 40 times a day, this is not an academic discussion but an actual problem we needed to address.

    Luckily, many long-running jobs are similar in nature: they iterate over a huge dataset. Shopify has developed and open sourced an extension to Ruby on Rails’s Active Job framework that makes this kind of job interruptible and resumable. It sets a checkpoint after each iteration and requeues the job. The next time the job is processed, work continues at the checkpoint, allowing for safe and easy interruption of the job. With interruptible and resumable jobs, workers can be shut down at any time, which makes them more cloud-friendly and allows for frequent deployments. Jobs can be throttled or halted for disaster prevention, for example, if there’s a lot of load on the database. Interrupting jobs also allows for safely moving data between database shards.
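
    The extension referred to here is Shopify’s job-iteration gem. The sketch below follows the shape of the examples in the gem’s README, so treat the details as approximate rather than production code.

    class BackfillProductsJob < ApplicationJob
      include JobIteration::Iteration

      # Builds an enumerator over the dataset, resuming from the last checkpointed cursor.
      def build_enumerator(params, cursor:)
        enumerator_builder.active_record_on_records(
          Product.all,
          cursor: cursor,
        )
      end

      # Called once per record; a checkpoint is set after each iteration,
      # so the worker can be shut down and the job requeued safely at any point.
      def each_iteration(product, params)
        product.touch # illustrative per-record work
      end
    end

    An interruptible, resumable job (illustrative sketch based on job-iteration)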

    Distributed Background Jobs

    Background job backends like Resque and Sidekiq in Ruby usually queue a job by placing a serialized object, an instance of the concrete job class, into the queue. This implies that both the client queuing the job and the worker processing it need to be able to work with this object and have an implementation of this class. This works great in a monolithic architecture where clients and workers run the same codebase. But if we would like to, say, extract the image processing into a dedicated image processing microservice, maybe even written in a different language, we need a different approach to communication.

    It is possible to use Sidekiq with separate services, but the workers still need to be written in Ruby and the client has to choose the right Redis queue for a job. So this approach isn’t easily applied to a large-scale microservices architecture, but it avoids the overhead of adding a message broker like RabbitMQ.

    A message-oriented middleware like RabbitMQ places a purely data-based interface, such as a JSON payload, between the producer and consumer of messages. The message broker can serve as a distributed background job backend where a client can offload work onto workers running an entirely different codebase.

    Instead of simple point-to-point queues, topics layered on top of task queues add powerful routing. In contrast to HTTP, this routing is not limited to 1:1. Beyond delegating specific tasks, messaging can be used for any event messages whenever communication between microservices is needed. However, with messages removed after processing, there’s no way to replay the stream of messages and no source of truth for a system-wide state.

    Event streaming systems like Kafka take an entirely different approach: events are written into an append-only event log. All consumers share the same log and can read it at any time. The broker itself is stateless; it doesn’t track event consumption. Events are grouped into topics, which provides some publish-subscribe capabilities that can be used for offloading work to different services. These topics aren’t based on queues, and events are not removed. Since the log of events can be replayed, it can serve, for example, as a source of truth for event sourcing. With a stateless broker and append-only writing, throughput is incredibly high, a great fit for real-time applications and data streaming.

    Background jobs allow the user-facing application server to delegate tasks to workers. With less work on its plate, the application server can serve user-facing requests faster and maintain a higher availability, even when facing unpredictable traffic spikes or dealing with time-consuming and error-prone backend tasks. The background job backend encapsulates the complexity of asynchronous communication between client and worker into a separate abstraction layer, so that the concrete code remains simple and maintainable.

    High availability and short response times are necessary for providing a great user experience, making background jobs an indispensable tool regardless of the application’s scale.

    Kerstin is a Staff Developer transforming Shopify’s massive Rails code base into a more modular monolith, building on her prior experience with distributed microservices architectures.


    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.


    Understanding GraphQL for Beginners–Part Two


    Welcome back to part two of the Understanding GraphQL for Beginners series. In this tutorial, we’ll build GraphQL fields about food! If you did not read part one of this series, please read it before reading this part.

    As a refresher, GraphQL is a data manipulation and query language for APIs. The two main benefits of implementing GraphQL are

    1. The ability to describe the structure you want back as your response.
    2. Only needing one endpoint to consume one or more resources.

    Learning Outcomes

    • Examine the file directory of GraphQL.

    • Identify the difference between root fields and object fields.

    • Create a GraphQL object based on an existing Ruby on Rails model.

    • Create a GraphQL root field to define the structure of your response.

    • Use a GraphQL root field to query data within a database.

    • Examine how the GraphQL endpoint works.

    Before You Start

    Download the repository to follow along in this tutorial. The repository has been set up with models and gems needed for GraphQL. Once downloaded, seed the database.

    The following models are included:

    Food

    Attribute Type
    id Bigint
    name String
    place_of_origin String
    image String
    created_at Timestamp
    updated_at Timestamp

    Nutrition

    Attribute Type
    id Bigint
    food_id Bigint
    serving_size String
    calories String
    total_fat String
    trans_fat String
    saturated_fat String
    cholesterol String
    sodium String
    potassium String
    total_carbohydrate String
    dietary_fiber String
    sugars String
    protein String
    vitamin_a String
    vitamin_c String
    calcium String
    iron String
    created_at Timestamp
    updated_at Timestamp

    GraphQL File Structure

    Everything GraphQL related is found in the folder called “graphql” under “/app”. Open up your IDE and look at the file structure under “graphql”.

    A screenshot of the directory structure of the folder graphql in an IDE. Under the top folder graphql is mutations and types and they are surrounded by a yellow box.  Underneath them is foo_app_schema.rb.
    Directory structure of the folder graphql

    In the yellow highlighted box, there are two directories here:

    1. “Mutations”
      This folder contains classes that will modify (create, update, or delete) data.
    2. “Types”
      This folder contains classes that define what will be returned, as well as the types of queries (mutation_type.rb and query_type.rb) that can be called.

    In the red highlighted box, there’s one important file to note.

    A screenshot of the directory structure of the folder graphql in an IDE. Under the top folder graphql is mutations and types. Underneath them is foo_app_schema.rb which is surrounded by a red box.
    Directory structure of the folder graphql

    The class food_app_schema.rb defines the queries you can make.

    Creating Your First GraphQL Query all_food

    We’re creating a query that returns us a list of all the food. To do that, we need to create fields. There are two kinds of fields:

    1. Root fields define the structure of your response fields based on the object fields selected. They’re the entry points (similar to endpoints) to your GraphQL server.
    2. Object fields are attributes of an object.

    We create a new GraphQL object called food. On your terminal run rails g graphql:object food. This will create the file food_type.rb filled with all the attributes found in your foods table in db/schema.rb. Your generated food_type.rb will look like:
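
    The generated class isn’t reproduced here; based on the foods table above, it should look roughly like this (the exact null: options and timestamp types depend on your generator version and schema):

    # app/graphql/types/food_type.rb
    module Types
      class FoodType < Types::BaseObject
        field :id, ID, null: false
        field :name, String, null: true
        field :place_of_origin, String, null: true
        field :image, String, null: true
        field :created_at, GraphQL::Types::ISO8601DateTime, null: false
        field :updated_at, GraphQL::Types::ISO8601DateTime, null: false
      end
    end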

    This class contains all the object fields, each exposing a specific piece of data on the object. Next, we need to create the root field that allows us to ask the GraphQL server for what we want. Go to the file query_type.rb, the class that contains all root fields.

    Remove the field test_field and its method, then create a field called all_food as shown below. As food is both a singular and a plural term, we use all_food to make the field clearly plural.
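
    Based on the format described below, the field and a minimal resolver method would look roughly like this (the Food.all resolver body is the simplest possible implementation, not necessarily the repository’s exact code):

    # app/graphql/types/query_type.rb
    field :all_food, [Types::FoodType], null: false,
      description: "Get all the food items."

    def all_food
      Food.all
    end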

    The format for a field is as follows:

    1. The field name (:all_food).
    2. The return type for the field ([Types::FoodType]).
    3. Whether the field can ever be null (null: false). Setting this to false means the field will never be null.
    4. The description of the field (description: "Get all the food items.").

    Congratulations, you’ve created your first GraphQL query! Let’s go test it out!

    How to Write and Run Your GraphQL Query

    To test your newly created query, we use GraphiQL, a playground for executing queries such as all_food. To access GraphiQL, go to localhost:3000/graphiql in your browser.

    You will see this page:

     A screenshot of the GraphiQL playground.  There are two large text boxes side by side. The left text box is editable and the right isn't. The menu item at the top shows the GraphiQL name, a play button, Prettify button, and History button.
    GraphiQL playground

    The left side of the page is where we will write our query. The right side will return the response to that query.

    Near the top left corner, next to the GraphiQL text, there are three buttons:

    A  screenshot of the navigation menu of the GraphiQL. The menu item shows the a play button, a Prettify button, and a History button.
    GrapiQL playground menus
    1. This button will execute your query.
    2. This button will reformat your query to look pretty.
    3. This button will show all previous queries you ran.
    4. On the right corner of the menu bar is a button called “< Docs”.
    A screenshot of the <Docs menu item. There is a large red arrow pointing to the menu item and it says click here.
    Docs menu item

    If you click on the “< Docs” button in the top right corner, you can find all the possible queries based on your schema.

    A screenshot of the <Docs menu item. The page is called "Document Explorer" and displays a search field allowing the user to search for a schema.  Underneath the search the screen lists two Root Types in a list: "query:Query" and "mutation: Mutation." There is a red box around "query:Query" and the words "Click here." to its right.
    Document explorer

    The queries are split into two groups: query and mutation. “query” contains all queries that don’t modify data, while queries that do modify data can be found under “mutation”. Click on “query: Query” to find the “all_food” query we just created.

    A screenshot of the Query screen with a result displaying the field allFood: [Food!]!
    Query screen

    After clicking on “query: Query”, you will find all the possible queries you can make. If you click on [Food!]!, you will see all the possible fields we can ask for.

    A screenshot listing the all fields contained in the all_food query.
    Fields in the all_food query

    These are all the possible fields you can use within your all_food query. Remember, GraphQL allows us to describe exactly what we need. Let’s say we only want the ids and names of all food items. We write the query as:
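
    The query isn’t reproduced here, but it should look roughly like this (note that graphql-ruby camelizes field names by default, so all_food becomes allFood):

    {
      allFood {
        id
        name
      }
    }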

    Click the execute button to run the query. You get the following response back:

    Awesome job! Now, create another query to get the image and place_of_origin fields back:

    You will get this response back.

    What’s Happening Behind the Scenes?

    Recall from part one, GraphQL has this single “smart” endpoint that bundles all different types of RESTful actions under one endpoint. This is what happens when you make a request and get a response back.

    A flow diagram showing the steps to execute a query between the client and the food app's server.

    When you execute the query:

    1. You call the graphql endpoint with your request (for example, query and variables).
    2. The graphql endpoint then calls the execute method from the graphql_controller to process your request.
    3. The method renders a JSON containing a response catered to your request.
    4. You get a response back.

    Try it Yourself #1

    Try to implement the root field called nutrition. Like all_food, it returns all nutrition facts.

    If you need any assistance, please refer to this gist that includes a sample query and response: https://gist.github.com/ShopifyEng/7c196bf443bdf26e55f827d65ee490a6

    Adding a Field to an Existing Query

    You may have noticed that the nutrition table contains a foreign key, where a food item has one nutrition fact. Currently, it’s associated at the model level but not used at the GraphQL level. For someone to query food and get the nutrition fact as well, we need to add a nutrition field to food.

    Add the following field to food_type.rb:

    field :nutrition, Types::NutritionType, null: true

    Let’s execute the following query where we want to know the serving size and calories of each food item:
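
    The query should look roughly like this (including name is just an illustrative choice to make the results easier to read):

    {
      allFood {
        name
        nutrition {
          servingSize
          calories
        }
      }
    }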

    You will get this response back:

    Hooray! We now know the serving size and calories of each food item!

    So far, we learned how to create root fields to query all data of a specific resource. Let’s write a query to look at data based on id.

    Writing a Query with an Argument

    In query_type.rb, we need to add another root field called food that requires and takes an argument called id:
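
    The original snippet isn’t reproduced here; a sketch of the root field and its resolver might look like this (the description text is illustrative):

    field :food, Types::FoodType, null: false,
      description: "Get a food item by its id." do
      argument :id, ID, required: true
    end

    def food(id:)
      Food.find(id)
    end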

    On GraphiQL, let’s execute this query:
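
    A query of that shape, with an illustrative id and field selection, would look like:

    {
      food(id: 1) {
        name
        placeOfOrigin
      }
    }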

    You will get this response back:

    Try it Yourself #2

    This time, create a root field called find_food, which returns a set of data based on place_of_origin.

    If you need any assistance, please refer to this gist that includes a sample query and response: https://gist.github.com/ShopifyEng/1f92cee91f2932a0ef665594418764d3

    As we’ve reached the end of this tutorial, let’s recap what we learned!

    1. GraphQL generates and populates an object if the model with the same name exists.
    2. Root fields define the structure of your response and are entry points to your GraphQL server.
    3. Object fields are an object’s attributes.
    4. All requests are processed by the graphql_controller’s execute method and return a JSON response back.

    I hope you enjoyed creating some GraphQL queries! One thing you might still be wondering is how do we update these ActiveRecord objects? In part 3 of Understanding GraphQL for Beginners, we’ll continue creating queries called mutations that create, update, or delete data.

    If you would like to see the finished code for part two, check out the branch called part-2-solution.

    Often mistaken as an intern, Raymond Chung is building a new generation of software developers through Shopify's Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you'll find Raymond exploring for the best bubble tea shop.


    We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.


    Understanding GraphQL for Beginners–Part One


    As developers, we’re always passionate about learning new things! Whether it’s a new framework or a new language, our curiosity takes us places! One term you may have heard of is REST. REST stands for REpresentational State Transfer, a software architecture style introduced by Roy Fielding in the year 2000 with a set of principles on how a web application should behave. Think of this as a guideline of operations, like how to put together a meal. One of the principles is that one endpoint should only do one CRUD action (create, read, update, or delete). As well, each RESTful endpoint returns a fixed set of data. I like to think of this as a cookie-cutter response, where you get the same shape back every time. Sometimes you may need less data, and other times you may need more data. This can lead to the issue of calling additional APIs to get more data. How can we get exactly the right amount of data in one call?

    As technology evolves, one alternative to REST that is gaining popularity is GraphQL. But what is it exactly? In this blog, we will learn what GraphQL is all about!

    Learning Outcomes

    • Explain what GraphQL is.
    • Use an analogy to deepen your understanding of GraphQL.
    • Explain the difference between REST and GraphQL.

    Before You Start

    If you are new to API development, here are some terminologies for reference. Otherwise, continue ahead.

    API

    What is API?

    An Application Programming Interface (API) allows two machines to talk to each other. You can think of it as the cashier who takes your request to the kitchen to prepare and gives you your meal when ready.

    Why are APIs important?

    APIs allow multiple devices like your laptop and phone to talk to the same backend server. APIs that use REST are RESTful.

    REST

    What is REST?

    REpresentational State Transfer (REST) is a software architecture style on how a web application should behave. Think of this as a guideline of operations, like how to put a meal together.

    Why is REST important?

    REST offers a great deal of flexibility, like handling different types of calls and responses. It breaks down a resource into CRUD services, making it easier to organize what each endpoint should be doing. One of REST’s key principles is the client-server separation of concerns: any issues that happen on the server only concern the server. All the client cares about is getting a response back based on its request to the server.

    Latency Time

    What is latency time?

    Latency time is the time it takes for your request to travel to the server to be processed. You can think of this like driving from point A to B. Sometimes there are delays due to traffic congestion.

    Why is latency time important?

    The lower the latency, the faster the request can be processed by the server. The higher the latency, the longer it takes for your request to be processed.

    Response Time

    What is response time?

    Response time is the sum of latency time and the time it takes to process your request. Think of this as the time from when you ordered your meal to when it arrives.

    Why is response time important?

    Like latency, the faster the response time, the more seamless the overall experience feels for users. The slower the response time, the less seamless it feels for users, and they may quit your application.

    What Is GraphQL?

    GraphQL is an open-source data query and manipulation language for APIs, released publicly by Facebook in 2015.

    Unlike REST, GraphQL offers the flexibility for clients to describe the structure of the data they need in the form of a query, no more and no less. The best part is that it’s all under one endpoint! The response will be exactly what you described, not a cookie-cutter response.

    For example, provided below, we have three API responses about the Toronto Eagles, their championships, and their players. If we want to look at the year the Toronto Eagles were founded, the first and last name of the team captain and their last championship, we need to make three separate RESTful calls.

    Call 1:

    Call 2:

    Call 3:

    When you make an API call, it’s ideal to get a response back within a second. The response time is made up of latency time and processing time. With three API calls, we are making three trips to the server and back. You might expect that the latency times for all three calls would be the same, but that is never the case. You can think of latency like driving in traffic: sometimes it’s fast, and sometimes it’s slow due to rush hour. If one of the calls is slow, the overall response time is slow!

    Luckily, with GraphQL, we can combine the three requests and get the exact amount of data back in a single trip!

    GraphQL query:
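
    The original query example isn’t reproduced here; a sketch of what such a combined query could look like follows, with every field and type name being purely illustrative:

    {
      team(name: "Toronto Eagles") {
        foundedYear
        captain {
          firstName
          lastName
        }
        championships(last: 1) {
          year
        }
      }
    }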

    GraphQL response:

    GraphQL Analogies

    Here are two analogies to help describe how GraphQL compares to REST.

    Analogy 1: Burgers

    Imagine you are a customer at a popular burger restaurant, and you order their double cheeseburger. Regardless of how many times you order (calling your RESTful API), you get every ingredient in that double cheeseburger every time. It will always be the same shape and size (what’s returned in a RESTful response).

    An image of a two pattie hamburger on a sesame seed bun with cheese, bacon, pickles, red pepper, and secret sauce
    Photo by amirali mirhashemian on Unsplash.

    With GraphQL, you can “have it your way” by describing exactly how you want that double cheeseburger to be. I’ll take a double cheeseburger with fewer pickles, cheese not melted, bacon on top, sautéed onions on the bottom, and finally no sesame seeds on the bottom bun.

    Your GraphQL response is shaped and sized to be exactly how you describe it.

    A two pattie hamburger on a sesame seed bun with cheese, bacon, pickles, red pepper, and secret sauce
    Photo by amirali mirhashemian on Unsplash.

    Analogy 2: Banks

    You are going to the bank to make a withdrawal of $200. Using the RESTful way, you won’t be able to describe how you want your money broken down. The teller (response) will always give you two $100 bills.

    RESTful response:

    An image of two rectangles side by side. Each rectangle represents $100 and that text is contained within each rectangle.
    Two $100 bills

    By using GraphQL, you can describe exactly how you want your denominations to be. You can request one $100 bill and five $20 bills.

    GraphQL response:

    An image of six rectangles arranged in a grid. The first rectangle, starting from the top left, represents $100 and the other five represent $20, based on the text contained within each rectangle.
    One $100 bill and five $20 bills

    REST Vs GraphQL

    Compared to RESTful APIs, GraphQL provides more flexibility on how to ask for data from the server. It provides four main benefits over REST:

    1. No more over-fetching extra data.
      With REST APIs, a fixed set of data (a response of the same size and shape) is returned. Sometimes, a client doesn’t need all of that data. GraphQL solves this by having clients grab only what they need.
    2. No more under-fetching data.
      Sometimes, a client may need more data, and additional calls must be made to get data that an endpoint doesn’t provide.
    3. Rapid product iterations on the frontend.
      The structure is flexible and catered to clients, so frontend developers can make UI changes without asking backend developers to make changes to accommodate frontend design changes.
    4. Fewer endpoints.
      Calling too many endpoints can get confusing really fast. GraphQL’s single “smart” endpoint bundles all the different types of RESTful actions under one.

    By leveraging GraphQL’s principle of describing the structure of the data you want back, you don’t need to make multiple trips for cookie-cutter responses. Read part two of Understanding GraphQL for Beginners, where we’ll add GraphQL to a Ruby on Rails application and create and execute queries!

    Often mistaken as an intern, Raymond Chung is building a new generation of software developers through Shopify's Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you'll find Raymond exploring for the best bubble tea shop.




    Let’s Encrypt x Shopify: Securing the Web 4.5 Million Domains at a Time


    On June 30, 2021 Shipit!, our monthly event series, presented Let’s Encrypt and Shopify: Securing Shopify’s 4.5 Million Domains. Learn about how we secure over 4.5M Shopify domains and team up to foster a safer Internet for everyone. The video is now available.

    It’s already been six years since Shopify became a sponsor of Let’s Encrypt.

    In 2016, the SSL team started transitioning all of our merchants’ stores to HTTPS. When we started exploring the concept a few years earlier, it was a daunting task. There were few providers that would let us integrate a certificate authority programmatically, and the few that did had names like “Reseller API.” The idea that you would give away certificates for free and that no human would be involved was completely alien in this market. Everything was designed with the idea that a user would be purchasing the certificate, downloading it, and installing it somehow. That’s a lot more problematic than you might think. For example, a lot of those APIs returned human-readable error messages instead of defined error codes. Normally, the implementor would be expected to send the message back to the user trying to purchase a certificate, but in a fully automated system there is no user to read anything. For Shopify, all 650,000 domains would get a certificate, and they would be provisioned and renewed without any interaction from our merchants.

    I first heard about Let’s Encrypt in 2014. A lot of the chatter online was around the fact that they would become a certificate authority providing free certificates (they were pretty expensive until then), but a bit less about the other part of the project, the ACME protocol. The idea was to fully automate certificate authorities using standardized APIs.

    In the summer of 2015 they still hadn’t launched, but I started writing a Ruby implementation of the ACME client protocol on weekends to get a feel for it. I’d already been through this exercise a few times with other providers, and working from a specification was pretty refreshing. Specifications are boring documents, but when trying to automate hundreds of thousands of domains that you don’t really control, you want to know that you have all your exceptions accounted for. That’s when we reached out to them to figure out how Shopify could help, and we agreed on a sponsorship. We didn’t intend to make use of their service, at least not in the immediate future, but we shared values around the open web and the importance of removing barriers to entry using technology.

    Interacting with a small organization that does its work fully in the open was also quite refreshing. My experience dealing with certificate authorities had been working with an account manager who forwarded my questions to a technical team. The software they ran was usually not implemented by them, so there was a limit to how much they could answer. Let’s Encrypt being fully open changes the dynamic. I asked questions on IRC and they answered me with GitHub links pointing at the actual implementation. I reported bugs or inconsistencies in the specification, and they tagged me in the pull requests that fixed them.

    In late November, we started rolling out our shiny new automated provisioning system. We immediately ran into some scalability issues with our initial providers: doing some napkin math with the throttling they were imposing on us, we would need about 100 days to provision every domain. We let it run over the holidays and launched in February 2016.

    The team was already engaged in its next mission, but in the back of our minds we knew we needed to revisit this. Now that the bulk of the domains were done, new domains would come at a slower pace, followed eventually by renewals, which would be manageable for a while at our current growth projection. Our main concern was emergency rotation: if for some reason we had to rotate our private keys, or the certificate chain was compromised somehow, we’d be in trouble. One hundred days is too slow to react to an incident.

    We needed to be more responsive for our merchants, and that’s why we decided to add Let’s Encrypt as a backup option. We were able to roll Let’s Encrypt out in a few hours, compared to months with our original providers. The errors we ran into were predictable because their specification and server implementation are open source, so we could refer directly to them to debug unexpected behaviour. It was so reliable that we decided to make them our main certificate authority.

    Let’s Encrypt is a game changer for the industry. For a big software-as-a-service company like Shopify, it saves time because the implementation is built around an open specification. You can even change or add a new certificate authority that supports the ACME protocol without redesigning or changing your entire infrastructure. It’s more reliable than the APIs of the past because it’s designed to be fully automated from the beginning.

    Shipit! Presents Let’s Encrypt and Shopify: Securing Shopify’s 4.5 Million Domains

    Shipit! welcomes Josh Aas, co-founder and Executive Director of Let’s Encrypt and Shopify’s Charles Barbier, Application Security Development Manager, to talk about securing over 4.5 million Shopify domains and teaming up to foster a safer Internet for everyone.

    Additional Information

    Charles Barbier is a Developer Lead for the Application security team. You can connect with him on Twitter.




    Rate Limiting GraphQL APIs by Calculating Query Complexity


    Rate limiting is a system that protects the stability of APIs, and GraphQL opens new possibilities for it. I’ll show you Shopify’s rate limiting system for the GraphQL Admin API and how it addresses some limitations of the methods commonly used in REST APIs. I’ll also show you how we calculate query costs that adapt to the data clients need while providing a more predictable load on servers.

    What Is Rate Limiting and Why Do APIs Need It?

    To ensure developers have a reliable and stable API, servers need to enforce reasonable API usage. The most common cases that can affect platform performance are

    • Bad actors abusing the API by sending too many requests.
    • Clients unintentionally sending requests in infinite loops or sending a high number of requests in bursts.

    The traditional way of rate limiting APIs is request-based and widely used in REST APIs. Some implementations have a fixed rate (that is, clients are allowed to make a certain number of requests per second). The Shopify Admin REST API provides credits that clients spend every time they make a request, and those credits are refilled every second. This allows clients to keep a request pace that never limits their API usage (for example, two requests per second) and to make occasional request bursts when needed (for example, 10 requests per second).

    Despite being widely used, the request-based model has two limitations:

    • Clients spend the same amount of credits regardless of whether they need all the data in an API response.
    • POST, PUT, PATCH, and DELETE requests produce side effects that put more load on servers than GET requests, which only read existing data. Despite the difference in resource usage, all of these requests consume the same amount of credits in the request-based model.

    The good news is that we leveraged GraphQL to overcome these limitations and designed a rate limiting model that better reflects the load each request causes on a server.

    The Calculated Query Cost Method for GraphQL Admin API Rate Limiting

    In the calculated query cost method, clients receive 50 points per second up to a limit of 1,000 points. The main difference from the request-based model is that every GraphQL request has a different cost.
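
    As a rough illustration of how that budget behaves, here is a minimal sketch, not Shopify's implementation, of a points bucket that refills at 50 points per second up to a 1,000 point cap; the class and method names are made up.

```python
import time

class PointsBucket:
    """Toy model of the calculated query cost budget described above."""

    def __init__(self, capacity=1000, refill_rate=50):
        self.capacity = capacity        # maximum points a client can bank
        self.refill_rate = refill_rate  # points restored per second
        self.available = capacity
        self.last_refill = time.monotonic()

    def try_spend(self, cost):
        """Refill based on elapsed time, then spend `cost` points if the budget allows."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.available = min(self.capacity, self.available + elapsed * self.refill_rate)
        self.last_refill = now
        if cost > self.available:
            return False  # the request would be throttled
        self.available -= cost
        return True

bucket = PointsBucket()
print(bucket.try_spend(7))  # True: a 7-point query fits comfortably in a fresh budget
```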

    Let’s get started with our approach to challenges faced by the request-based model. 

    Defining the Query Cost for Types Based on the Amount of Data They Request

    The server performs static analysis on the GraphQL query before executing it. By identifying each type used in a query, we can calculate its cost.

    Objects: One Point

    The object is our base unit and worth one point. Objects usually represent a single server-side operation such as a database query or a request to an internal service.

    Scalars and Enums: Zero points

    You might be wondering: why do scalars and enums have no cost? Scalars are types that return a final value. Some examples of scalar types are strings, integers, IDs, and booleans. An enum is a special kind of scalar that returns one of a predefined set of values. These types live within objects that have already had their cost calculated, so querying additional scalars and enums within an object comes at virtually no extra cost.

    In this example, shop is an object, costing 1. id, name, timezoneOffsetMinutes, and customerAccounts are scalar types that cost 0. The total query cost is 1.

    Connections: Two Points Plus the Number of Returned Objects

    Connections express one-to-many relationships in GraphQL. Shopify uses Relay-compliant connections, meaning they follow certain conventions, such as being composed of edges, node, cursor, and pageInfo fields.

    The edges object contains the fields describing the one-to-many relationship:

    • node: the list of objects returned by the query.
    • cursor: our current position on that list.

    pageInfo holds the hasPreviousPage and hasNextPage boolean fields that help navigating through the list.

    The cost for connections is two plus the number of objects the query expects to return. In this example, a connection that expects to return five objects has a cost of seven points:

    cursor and pageInfo come free of charge as they're the result of the heavy lifting already done by the connection object.

    This query costs seven points just like the previous example:

    Interfaces and Unions: One point

    Interfaces and unions behave as objects that return different types, therefore they cost one point just like objects do.

    Mutations: 10 points

    Mutations are requests that produce side effects on databases and indexes, and can even trigger webhooks and email notifications. A higher cost is necessary to account for this increased server load so they’re 10 points. 
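
    Putting these rules together, here is a minimal sketch of how a requested cost could be estimated from an already-parsed query. The nested-dict representation, the field kinds, and the assumption that a mutation's nested selections add to its base 10 points are simplifications for illustration, not Shopify's implementation.

```python
def estimate_cost(field):
    """Estimate the requested cost of one parsed field using the rules above."""
    kind = field["kind"]
    children = sum(estimate_cost(child) for child in field.get("selections", []))
    if kind in ("scalar", "enum"):
        return 0                          # scalars and enums are free
    if kind == "connection":
        expected = field.get("first", 0)  # two points plus the objects it expects to return
        return 2 + expected * children
    if kind == "mutation":
        return 10 + children              # assumption: nested selections add to the base 10
    return 1 + children                   # objects, interfaces, and unions cost one point

# products(first: 5) { edges { node { title } } } -> 2 + 5 * 1 = 7 points
products = {
    "kind": "connection",
    "first": 5,
    "selections": [{"kind": "object", "selections": [{"kind": "scalar"}]}],
}
print(estimate_cost(products))  # 7
```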

    Getting Query Cost Information in GraphQL Responses

    You don’t need to calculate query costs by yourself. The API responses include an extension object that includes the query cost. You can try running a query on Shopify Admin API GraphiQL explorer and see its calculated cost in action.

    The request:

    The response with the calculated cost displayed by the extension object:

    Getting Detailed Query Cost Information in GraphQL Responses

    You can get detailed per-field query costs in the extension object by adding the X-GraphQL-Cost-Include-Fields: true header to your request:
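
    For illustration, here is a hedged Python sketch of requesting a query's cost details. The X-GraphQL-Cost-Include-Fields header is the one described above; the endpoint path, API version, and access-token header follow the public Admin API conventions and may need adjusting for your shop and app.

```python
import json
import requests

SHOP = "your-shop.myshopify.com"  # hypothetical shop domain
TOKEN = "shpat_XXXX"              # hypothetical Admin API access token
URL = f"https://{SHOP}/admin/api/2021-01/graphql.json"  # API version is an assumption

QUERY = """
{
  shop {
    id
    name
  }
}
"""

response = requests.post(
    URL,
    json={"query": QUERY},
    headers={
        "X-Shopify-Access-Token": TOKEN,
        "X-GraphQL-Cost-Include-Fields": "true",  # ask for per-field cost details
    },
)

# The extensions object carries the requested cost, the actual cost, and throttle status.
print(json.dumps(response.json()["extensions"]["cost"], indent=2))
```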

    Understanding Requested Vs Actual Query Cost

    Did you notice two different types of costs on the queries above?

    • The requested query cost is calculated before executing the query using static analysis.
    • The actual query cost is calculated while we execute the query.

    Sometimes the actual cost is smaller than the requested cost. This usually happens when you query for a specific number of records in a connection, but fewer are returned. The good news is that any difference between the requested and actual cost is refunded to the API client.

    In this example, we query the first five products with a low inventory. Only one product matches this query, so even though the requested cost is seven, you are only charged for the four points calculated by the actual cost:

    Measuring the Effectiveness of the Calculated Query Cost Model

    The calculated query complexity and execution time have a linear correlation

    By using the query complexity calculation rules, we have a query cost that’s proportional to the server load measured by query execution time. This gives Shopify the predictability needed to scale our infrastructure, giving partners a stable platform for building apps. We can also detect outliers on this correlation and find opportunities for performance optimization.

    Rate limiting GraphQL APIs by calculating the amount of data clients query or modify adapts to the use case of each API client better than the request-based model commonly used by REST APIs. Our calculated query cost method benefits clients with good API usage because it encourages them to request only the data they need, providing servers with a more predictable load.

    Additional Information

    Guilherme Vieira is a software developer on the API Patterns team. He loves building tools to help Partners and Shopifolk turn their ideas into products. He grew up a few blocks from a Formula 1 circuit and has been a fan of this sport ever since.


    10 Lessons Learned From Online Experiments

    Controlled experiments are considered the gold standard approach for determining the true effect of a change. Despite that, there are still many traps, biases, and nuances that can spring up as you experiment, and lead you astray. Last week, I wrapped up my tenth online experiment, so here are 10 hard-earned lessons from the past year.

    1. Think Carefully When Choosing Your Randomized Units

    After launching our multi-currency functionality, our team wanted to measure the impact of the feature on the user experience, so we decided to run an experiment. I quickly decided that we would randomly assign each session to test or control and measure the impact on conversion (purchase) rates. We had a nicely modelled dataset at the session grain that was widely used for analyzing conversion rates, so the decision to randomize by session seemed obvious.

    Online A/B test with session level randomization

    When you map out this setup, it looks something like the image below. Each session comes from some user, and we randomly assign their session to an experience. Our sessions are evenly split 50/50.

    User experience with session level randomization

    However, if you look a little closer, this setup could result in some users getting exposed to both experiences. It’s not uncommon for a user to have multiple sessions on a single store before they make a purchase.

    If we look at our user Terry below, it’s not unrealistic to think that the “Version B” experience they had in Session 2 influenced their decision to eventually make a purchase in Session 3, which would get attributed to the “Version A” experience.

    Carryover effects between sessions may violate the independent randomization units assumption

    This led me to my first very valuable lesson: think carefully when choosing your randomization unit. Randomization units should be independent, and if they aren't, you may not be measuring the true effect of your change. Another factor in choosing your randomization unit comes from the desired user experience. You can imagine that it's confusing for some users if something significant was visibly different each time they came to a website. For a more involved discussion on choosing your randomization unit, check out my post: Choosing your randomization unit in online A/B tests.

    2. Simulations Are Your Friend

    With the above tip in mind, we decided to switch to user level randomization for the experiment, while keeping the session conversion rate as our primary success metric since it was already modelled and there was a high degree of familiarity with the metric internally.

    However, after doing some reading I discovered that having a randomization unit (user) that’s different from your analysis unit (session) could lead to issues. In particular, there were some articles claiming that this could result in a higher rate of false positives. One of them showed a plot like this:

    Distribution of p-values for session level conversion with users as a randomization unit

    The intuition is that your results could be highly influenced by which users land in which group. If you have some users with a lot of sessions, and a really high or low session conversion rate, that could heavily influence the overall session conversion rate for that group.

    Rather than throwing my hands up and changing the strategy, I decided to run a simulation to see if we would actually be impacted by this. The idea behind the simulation was to take our population and simulate many experiments where we randomized by user and compared session conversion rates, just as we were planning to do in our real experiment. I then checked whether we saw a higher rate of false positives; it turned out we didn't, so we decided to stick with our existing plan.
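
    For flavour, here is a heavily simplified sketch of that kind of simulation on synthetic data (the population model, parameters, and test are all made up and don't reflect our real traffic): it repeatedly splits users into two groups that receive identical experiences, compares session conversion rates, and counts how often the difference looks significant purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_users, n_sims, alpha = 5_000, 1_000, 0.05
false_positives = 0

for _ in range(n_sims):
    # Synthetic population: each user has some sessions and their own conversion propensity.
    sessions = rng.poisson(3, n_users) + 1
    conv_rate = rng.beta(2, 30, n_users)
    conversions = rng.binomial(sessions, conv_rate)

    # Randomize by user, analyze at the session grain (the plan described above).
    in_test = rng.random(n_users) < 0.5
    test_conv = conversions[in_test].sum() / sessions[in_test].sum()
    ctrl_conv = conversions[~in_test].sum() / sessions[~in_test].sum()

    # Naive two-proportion z-test on session conversion; both groups had identical "treatment".
    pooled = conversions.sum() / sessions.sum()
    se = np.sqrt(pooled * (1 - pooled) * (1 / sessions[in_test].sum() + 1 / sessions[~in_test].sum()))
    if 2 * stats.norm.sf(abs(test_conv - ctrl_conv) / se) < alpha:
        false_positives += 1

rate = false_positives / n_sims
print(f"False positive rate: {rate:.3f} (well above {alpha} would mean this setup is a problem)")
```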

    The key lesson here was that simulations are your friend. If you’re ever unsure about some statistical effect, it’s very quick (and fun) to run a simulation to see how you’d be affected, before jumping to any conclusions.

    3. Data Can and Should Inform Big and Small Decisions

    Data is commonly used to influence big decisions with an obvious need for quantitative evidence. Does this feature positively impact our users? Should we roll it out to everyone? But there’s also a large realm of much smaller decisions that can be equally influenced by data.

    Around the time we were planning an experiment to test a new design for our geolocation recommendations, the system responsible for rendering the relevant website content was in the process of being upgraded. The legacy system (“Renderer 1”) was still handling approximately 15 percent of the traffic, while the new system (“Renderer 2”) was handling the other 85 percent. This posed a question to us: do we need to implement our experiment in the two different codebases for each rendering system? Based on the sizable 15 percent still going to “Renderer 1”, our initial thinking was yes. However, we decided to dig a bit deeper.

    Flow of web requests to two different content rendering codebases

    With our experiment design, we’d only be giving the users the treatment or control experience on the first request in a given session. With that in mind, the question we actually needed to answer changed. Instead of asking what percent of all requests across all users are served by “Renderer 2”, we needed to look at what percent of first requests in a session are served by “Renderer 2” for the users we planned to include in our experiment.

    Flow of web requests to two different content rendering codebases after filtering out irrelevant requests

    By reframing the problem, we learned that almost all of the relevant requests were being served by the new system, so we were safe to only implement our experiment in one code base.

    A key lesson learned from this was that data can and should inform both big and small decisions. Big decisions like “should we roll out this feature to all users,” and small decisions like “should we spend a few days implementing our experiment logic in another codebase.” In this case, two hours of scoping saved at least two days of engineering work, and we learned something useful in the process.

    This lesson wasn’t necessarily unique to this experiment, but it’s worth reinforcing. You can only identify these opportunities when you’re working very closely with your cross-discipline counterparts (engineering in this case), attending their standups, and hearing the decisions they’re trying to make. They usually won’t come to you with these questions as they may not think that this is something data can easily or quickly solve.

    4, 5, and 6. Understand Your System, Log Generously, and Run More A/A Tests

    For an experiment that involved redirecting the treatment group to a different URL, we decided to first run an A/A test to validate that redirects were working as expected and not having a significant impact on our metrics.

    The A/A setup looked something like this:

    • A request for a URL comes into the back-end
    • The user, identified using a cookie, is assigned to control/treatment
      • The user and their assigned group is asynchronously logged to Kafka
    • If the user is in the control group, they receive the rendered content (html, etc.) they requested
    • If the user is in the treatment group, the server instead responds with a 302 redirect to the same URL
    • This causes the user in the treatment group to make another request for the same URL
      • This time, the server responds with the rendered content originally requested (a cookie is set in the previous step to prevent the user from being redirected again)

    This may look like a lot, but for users this is virtually invisible. You’d only know if you were redirected if you opened your browser developer tools (under the “Network” tab you’ll see a request with a 302 status).

    A/A experiment set up with a redirect to the same page

    Shortly into the experiment, I encountered my first instance of sample ratio mismatch (SRM). SRM is when the number of subjects in each group doesn’t match your expectations.

    After “inner joining” the assigned users to our sessions system of record, we were seeing a slightly lower fraction of users in the test group compared to the control group instead of the desired 50/50 split.
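
    A quick way to check whether an imbalance like this is a genuine sample ratio mismatch rather than noise is a chi-square goodness-of-fit test against the designed 50/50 split; the counts below are made up for illustration.

```python
from scipy.stats import chisquare

# Hypothetical counts of users observed in each group after joining to the sessions model.
observed = [50_421, 49_123]
expected = [sum(observed) / 2, sum(observed) / 2]  # the designed 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.2e}")
# A very small p-value means the imbalance is unlikely to be random noise: investigate for SRM.
```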

    We asked ourselves why this could be happening. And in order to answer that question, we needed to understand how our system worked. In particular, how do records appear in the sessions data model, and what could be causing fewer records from our test group to appear in there?

    Sample ratio mismatch in an A/A experiment with a redirect to the same page

    After digging through the source code of the sessions data model, I learned that it’s built by aggregating a series of pageview events. These pageview events are emitted client side, which means that the “user” needs to download the html and javascript content our servers return, and then they will emit the pageview events to Kafka.

    With this understanding in place, I now knew that some users in our test group were likely dropping off after the redirect and consequently not emitting the pageview events.

    Data flows for an A/A experiment with a redirect to the same page

    To better understand why this was happening, we added some new server-side logging for each request to capture some key metadata. Our main hypothesis was that this was being caused by bots, since they may not be coded to follow redirects. Using this new logging, I tried removing bots by filtering out different user agents and requests coming from certain IP addresses. This helped reduce the degree of SRM, but didn’t entirely remove it. It’s likely that I wasn’t removing all bots (as they’re notoriously hard to identify) or there were potentially some real users (humans) who were dropping off in the test group. Based on these results, I ended up changing the data sources used to compute our success metric and segment our users.

    Despite the major head scratching this caused, I walked away with some really important lessons: 

    • Develop a deep understanding of your system. By truly understanding how redirects and our sessions data model worked, we were able to understand why we were seeing SRM and come up with alternatives to get rid of it.
    • Log generously. Our data platform team made it incredibly simple and low effort to add new Kafka instrumentation, so we took advantage. The new request logging we initially added for investigative purposes ended up being used in the final metrics.
    • Run more A/A tests. By running the A/A test, I was able to identify the sample ratio mismatch issues and update our metrics and data sources prior to running the final experiment. We also learned the impact of redirection alone that helped with the final results interpretation in the eventual A/B test where we had redirection to a different URL.

    7 and 8. Beware of User Skew and Don’t Be Afraid to Peek

    In one experiment where we were testing the impact of translating content into a buyer’s preferred language, I was constantly peeking at the results each day as I was particularly interested in the outcome. The difference in the success metric between the treatment and control groups had been holding steady for well over a week, until it took a nose dive in the last few days of the experiment.

    Towards the end of the experiment, the results suddenly changed due to abnormal activity

    After digging into the data, I found that this change was entirely driven by a single store with abnormal activity and very high volumes, causing it to heavily influence the overall result. This served as a pleasant reminder to beware of user skew. With any rate based metric, your results can easily be dominated by a set of high volume users (or in this case, a single high volume store).

    And despite the warnings you'll hear, don't be afraid to peek. I encourage you to look at your results throughout the course of the experiment. Peeking is only a problem if you act on it: follow a strict experiment plan and collect your predetermined sample size (that is, don't get excited by the results and end the experiment early). Peeking at the results each day allowed me to spot the sudden change in our metrics and subsequently identify and remove the offending outlier.

    9. Go Down Rabbit Holes

    In another experiment involving redirects, I was once again experiencing SRM. There was a higher than expected number of sessions in one group. In past experiments, similar SRM issues were found to be caused by bots not following redirects or weird behaviour with certain browsers.

    I was ready to chalk up this SRM to the same causes and call it a day, but there was some evidence that hinted something else may be at play. As a result, I ended up going down a big rabbit hole. The rabbit hole eventually led me to review the application code and our experiment qualification logic. I learned that users in one group had all their returning sessions disqualified from the experiment due to a cookie that was set in their first session.

    For an ecommerce experiment, this has significant implications since returning users (buyers) are much more likely to purchase. It's not a fair comparison if one group contains all sessions and the other contains only buyers' first sessions. The results of the experiment changed from negative to positive overall after switching the analysis unit from session to user so that all of a user's sessions were considered.

    Another important lesson learned: go down rabbit holes. In this case, the additional investigation turned out to be incredibly valuable as the entire outcome of the experiment changed after discovering the key segment that was inadvertently excluded. The outcome of a rabbit hole investigation may not always be this rewarding, but at minimum you’ll learn something you can keep in your cognitive backpack.

    10. Remember, We’re Measuring Averages

    Oftentimes it may be tempting to look at your overall experiment results across all segments and call it a day. Your experiment is positive overall and you want to move on and roll out the feature. This is a dangerous practice, as you can miss some really important insights.

    Example experiment results across different segments

    As we report results across all segments, it's important to remember that we're measuring averages. Positive overall doesn't mean positive for everyone, and vice versa. Always slice your results across key segments and examine them. This can identify key issues like a certain browser or device where your design doesn't work, or a buyer demographic that's highly sensitive to the changes. These insights are just as important as the overall result, as they can drive product changes or decisions to mitigate these effects.

    Going Forward…

    So as you run more experiments remember:

    1. Think carefully when choosing your randomization unit
    2. Simulations are your friend
    3. Data can, and should, inform both big and small decisions
    4. Understand your system
    5. Log generously
    6. Run more A/A tests
    7. Beware of user skew
    8. Don’t be afraid to peek
    9. Go down rabbit holes
    10. Remember, we’re measuring averages

    I certainly will.

    Ian Whitestone: Ian joined Shopify in 2019 and currently leads a data science team working to simplify cross border commerce and taxes. Connect with Ian on LinkedIn to hear about work opportunities or just to chat.


    Are you passionate about data discovery and eager to learn more? We're always hiring! Reach out to us or apply on our careers page.

    Querying Strategies for GraphQL Clients

    As more clients rely on GraphQL to query data, we're seeing performance and scalability issues emerge. Queries are getting bigger and slower, and net-new roll-outs are challenging. The web and mobile development teams working on Orders & Fulfillments spent some time exploring and documenting our approaches. On mobile, our goal was to consistently achieve a sub one second page load on a reliable network. After two years of scaling up our Order screen in terms of features, it was time to re-think the foundation we were operating on to achieve that goal. We ran a few experiments in mobile and web clients to develop strategies around those pain points. These strategies are still a very open conversation internally, but we wanted to share what we've learned and encourage more developers to play with GraphQL at scale in their web and mobile clients. In this post, I'll go through some of those strategies based on an example query, building upon it to scale it up.

    1. Designing Our Base Query

    Let’s take the case of a client loading a list of products. To power our list screen we use the following query:
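
    The query itself is shown below as a rough sketch, written as the string a Python client might send; the exact field names are assumptions based on the description that follows.

```python
# A sketch of the kind of query described below; the field names are assumptions.
PRODUCTS_LIST_QUERY = """
query ProductsList {
  products(first: 100) {
    edges {
      node {
        id
        title
        price
        image {
          url
        }
      }
    }
  }
}
"""
```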

    Using this query, we can load the first 100 products and their details (name, price, and image). This might work great, as long as we have fewer than 100 products. As our app grows we need to consider scalability:

    • How can we prepare for the transition to a paginated list?
    • How can we roll out experiments and new features?
    • How can we make this query faster as it grows?

    2. Loading Multiple Product Pages

    Good news: our products endpoint is paginated on Shopify's back-end side, so we can now implement the change on our clients! The main concern on the client side is finding the right page size, because it can also have UX and product implications. The right page size will likely change from one platform to another, because we're likely to display fewer products at a time on the mobile client (due to less space). This weighs on performance as the query grows.

    In this step, a good strategy is to set performance tripwires, that is, create some kind of score (based on loading times) to monitor our paginated query. Implementing pagination within our query immediately reduces the load on both the back end and the front end if we opt for a lower number than the initial 100 products:
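
    A sketch of what the paginated version could look like follows; the variable names are assumptions, with a cursor standing in for the page "index" mentioned below.

```python
# A sketch of the paginated version; variable names and the page size are assumptions.
PAGINATED_PRODUCTS_QUERY = """
query ProductsList($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges {
      cursor
      node {
        id
        title
        price
      }
    }
    pageInfo {
      hasNextPage
    }
  }
}
"""
variables = {"first": 25, "after": None}  # a smaller page than the original 100
```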

    We add two parameters to control the page size and index. We also need to know if the next page is available to show, hence the hasNextPage field. Now that we have support for an unlimited amount of products in our query, we can focus on how we roll out new fields.

    3. Controlling New Field Rollouts

    Our product list is growing in terms of features, and we run multiple projects at the same time. To make sure we have control over how changes are rolled out in our ProductList query, we use the @include and @skip directives to make some of the net-new fields we're rolling out optional. It looks like this:
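
    Here is a rough reconstruction of the kind of query described next; the field and variable names are assumptions.

```python
# A sketch of gating a net-new field behind a flag; field and variable names are assumptions.
PRODUCTS_LIST_QUERY = """
query ProductsList($first: Int!, $featureFlag: Boolean!) {
  products(first: $first) {
    edges {
      node {
        id
        title
        price
        description @include(if: $featureFlag)
      }
    }
  }
}
"""
variables = {"first": 25, "featureFlag": False}  # flipped to True as the rollout progresses
```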

    In the example above the description field is hidden behind the $featureFlag parameter. It becomes optional, and you need to unwrap its value when parsing the response. If the value of $featureFlag is false, the response will return it as null.

    The @include and @skip directives require any new field to keep the same naming and nesting level, as renaming or deleting those fields will likely break the query. A way around this problem is to dynamically build the query at runtime based on the feature flag value.

    Other rollout strategies can involve duplicating queries and running a specific query based on feature flags or working off a side branch until rollout and deployment. Those strategies are likely project and platform specific and come with more trade-offs like complexity, redundant code, and scalability.

    The @include and @skip directives are handy for flags you have on hand locally, but what about conditional loading based on remote flags? Let's have a look at chained queries!

    4. Chaining Queries

    From time to time you’ll need to chain multiple queries. A few scenarios where this might happen are

    • Your query relies on a remote flag that comes from another query. This makes rolling out features easier as you control the feature release remotely. On mobile clients with many versions in production, this is useful.
    • A part of your query relies on a remote parameter. Similar to the scenario above, you need the value of a remote parameter to power your field. This is usually tied to back-end limitations.
    • You’re running into pagination limitations with your UX. You need to load all pages on screen load and chain your queries until you reach the last page. This mostly happens in clients where the current UX doesn’t allow for pagination and is out of sync with the back-end updates. In this specific case solve the problem at a UX level if possible.

    We transform our local feature flag into a remote flag and this is what our query looks like:
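
    A hedged sketch of that chained flow follows; the query and field names come from the description below, while the transport and response shapes are assumptions.

```python
# A sketch of the chained flow; query and field names follow the description below,
# the transport and response shapes are assumptions.
REMOTE_DESCRIPTION_FLAG_QUERY = """
query RemoteDescriptionFlag {
  remoteFlag: descriptionEnabled
}
"""

PRODUCTS_LIST_QUERY = """
query ProductsList($first: Int!, $remoteFlag: Boolean!) {
  products(first: $first) {
    edges {
      node {
        id
        title
        description @include(if: $remoteFlag)
      }
    }
  }
}
"""

def execute(query, variables=None):
    """Stand-in for the client's GraphQL transport; returns canned data for illustration."""
    if "RemoteDescriptionFlag" in query:
        return {"data": {"remoteFlag": True}}
    return {"data": {"products": {"edges": []}}}

def load_products(first=25):
    # The second query can't start until the first resolves: two round trips per screen load.
    flag = execute(REMOTE_DESCRIPTION_FLAG_QUERY)["data"]["remoteFlag"]
    return execute(PRODUCTS_LIST_QUERY, {"first": first, "remoteFlag": flag})

print(load_products())
```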

    In the example above, the RemoteDescriptionFlag query is executed first, and we wait for its results to start the ProductsList query. The descriptionEnabled (aliased to remoteFlag) powers the @include inside our ProductsList query. This means we’re now waiting for two queries at every page or screen load to complete before we can display our list of products. It significantly slows down our performance. A way to work around this scenario is to move the remote flag query outside of this context, probably at an app-wide level.

    The TL;DR of chained queries: only do it if necessary.

     

    5. Using Parallel Queries

    Our products list query is growing significantly with new features:

    We added search filters, user permissions, and banners. Those three parts aren't tied to the product list's pagination: if they were included in the ProductsList query, we'd have to re-query those three endpoints every time we ask for a new page, which slows down performance and returns redundant information. This doesn't scale well with new features and endpoints, so this sounds like a good time to leverage parallel querying!

    Parallel querying is exactly what it sounds like: running multiple queries at the same time. Splitting the query into scalable parts, and leaving the "core" query of the screen on its own, brings these benefits to our client:

    • Faster screen load: since we’re querying those endpoints separately, the load is transferred to the back-end side instead. Fragments are resolved and queried simultaneously instead of being queued on the server-side. It’s also easier to scale server-side than client-side in this scenario.
    • Easier to contribute as the team grows: by having one endpoint per query, we diminish the risk of code conflict (for example, fixtures) and flag overlapping for new features. It also makes it easier to remove some endpoints.
    • Easier to introduce the possibility of incremental and partial rendering: As queries are completed, you can start to render content to create the illusion of a faster page load for users.
    • Removes the redundant querying by leaving our paginated endpoint in its own query: we only query for product pages after the initial query cycle.

    Here’s an example of what our parallel queries look like:
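
    A sketch of what running those queries in parallel could look like follows; the query names mirror the features mentioned above, while the fields and the async transport are assumptions.

```python
import asyncio

# Query names mirror the features mentioned above; the fields and transport are assumptions.
QUERIES = {
    "ProductsList": "query ProductsList { products(first: 25) { edges { node { id title } } } }",
    "SearchFilters": "query SearchFilters { searchFilters { key label } }",
    "UserPermissions": "query UserPermissions { currentUser { permissions } }",
    "Banners": "query Banners { banners { id message } }",
}

async def execute(name, query):
    """Stand-in for an async GraphQL transport call."""
    await asyncio.sleep(0)  # pretend network I/O
    return {"query": name, "data": {}}

async def load_screen():
    # All four queries are fired at once; only ProductsList is re-run when paginating.
    results = await asyncio.gather(*(execute(name, q) for name, q in QUERIES.items()))
    return dict(zip(QUERIES, results))

print(asyncio.run(load_screen()))
```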

    Whenever one of those queries becomes too big, we apply the same principles and split again to accommodate logic and performance. What's too big? As a client developer, it's up to you to answer this question by setting up goals and tripwires. Creating some kind of trackable score for loading time can help you decide when to cut the query into multiple parts. This way the GraphQL growth in our product list is more organic (an outcome that accounts for scalability and developer happiness) and doesn't impact performance: each query can grow independently, reducing the number of potential rollout and code merge conflicts.

    Just a warning when using parallel queries: when transferring the load server-side, make sure you set tripwires to avoid overloading your server. Consult with site reliability experts (SREs, or at Shopify, production engineers) and back-end developers; they can help monitor server-side performance when using parallel querying.

    Another challenge tied to parallel queries is plugging the partial data responses into the screen's state. This is likely to require some refactoring of the existing implementation. It could be a good opportunity to support partial rendering at the same time.

    Over the past four years, I have worked on shipping and scaling features in the Orders mobile space at Shopify. Being at the core of our Merchants workflow gave me the opportunity to develop empathy for their daily work. Empowering them with better features meant that we had to scale our solutions. I have been using those patterns to achieve that, and I’m still discovering new ones. I love how flexible GraphQL is on the client-side! I hope you’ll use some of these querying tricks in your own apps. If you do, please reach out to us, we want to hear how you scale your GraphQL queries!

    Additional Information on GraphQL

    Théo Ben Hassen is a development manager on the Merchandising mobile team. His focus is on enabling mobile to reach its maximum impact through mobile capabilities. He's interested in anything related to mobile analytics and user experience.


    Deleting the Undeletable

    At Shopify, we analyze a variety of events from our buyers and merchants to improve their experience and the platform, and empower their decision making. These events are collected via our streaming platform (Kafka) and stored in our data warehouse at a rate of tens of billions of events per day. Historically, these events were collected by a variety of teams in an ad-hoc manner and lacked any guaranteed structure, leading to a variety of usability and maintainability issues, and difficulties with fulfilling data subject rights requests. The image below depicts how these events were collected, stored in our data warehouse, and used in other online dashboards.

    How events were collected, stored in our data warehouse, and used in other online dashboards in the old system.

    Some of these events contained Personally Identifiable Information (PII), and in order to comply with regulation, such as the European General Data Protection Regulation (GDPR), we needed to find data subjects' PII within our data warehouse and access or delete it (via privacy requests) in a timely manner. This quickly escalated into a very challenging task due to:

    • Lack of guaranteed structure and ownership: Most of these events were only meaningful to, and parsable by, their creators and didn't have a fixed schema. Further, there was no easy way to figure out ownership of all of them. Hence, it was near impossible to automatically parse and search these events, let alone access and delete PII within them.

    • Missing data subject context: Even knowing where PII resided in this dataset isn’t enough to fulfill a privacy request. We needed a reliable way to know to whom this PII belongs and who is the data controller. For example, we act as a processor for our merchants when they collect customer data, and so we are only able to process customer deletion requests when instructed by the merchant (the controller of that personal data).

    • Scale: The size of the dataset (on the order of petabytes) made any full search difficult, costly, and time consuming. In addition, it continuously grows at billions of events per day. Hence, any solution needs to be highly scalable to keep up with incoming online events as well as the processing of historic ones.

    • Missing dependency graph: Some of these events and datasets power critical tasks and jobs. Any disruption or change to them can severely affect our operations. However, due to lack of ownership and missing lineage information readily and easily available for each event group, it was hard to determine the full scope of disruption should any change to a dataset happen.

    So we were left with finding a needle in an ever growing haystack. These challenges, as well as other maintainability and usability issues with this platform, brought up a golden opportunity for the Privacy team and the Data Science & Engineering team to collaborate and address them together. The rest of this blog post focuses on our collaboration efforts and the technical challenges we faced when addressing these issues in a large organization such as Shopify.

    Context Collection

    Lack of guaranteed schemas for events was the root cause of a lot of our challenges. To address this, we designed a schematization system that specified the structure of each event, including the types of each field, evolution (versions) context, ownership, as well as privacy context. The privacy context specifically includes marking sensitive data, identifying data subjects, and handling PII (that is, what to do with PII).

    Schemas are designed by data scientists or developers interested in capturing a new kind of event (or changing an existing one). They’re proposed in a human readable JSON format and then reviewed by team members for accuracy and privacy reasons. As of today, we have more than 4500 active schemas. This schema information is then used to enforce and guarantee the structure of every single event going through our pipeline at generation time.

    Above shows a trimmed signup event schema. Let’s read through this schema and see what we learn from it:

    The privacy_setting section specifies whose PII this event includes by defining a data controller and a data subject. The data controller is the entity that decides why and how personal data is processed (Shopify in this example). The data subject designates whose data is being processed; in this schema, they're tracked via the email address of the person in question. It's worth mentioning that, generally, when we deal with buyer data, merchants are the data controller and Shopify plays the data processor role (a third party that processes personal data on behalf of a data controller).

    Every field in a schema has a data-type and doc field, and a privacy block indicating if it contains sensitive data. The privacy block indicates what kind of PII is being collected under this field and how to handle that PII.
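
    As a purely hypothetical reconstruction of the kind of schema described above (the field names, types, and exact keys are assumptions, not the real schema), a trimmed signup event schema could look something like this:

```python
# Hypothetical signup event schema following the description above; keys and values are assumptions.
signup_schema = {
    "name": "signup_event",
    "version": 1,
    "owner": "identity-team",
    "privacy_setting": {
        "data_controller": "shopify",  # who decides why and how the personal data is processed
        "data_subject": "email",       # whose data it is, tracked via their email address
    },
    "fields": {
        "email": {
            "data-type": "string",
            "doc": "Email address used to sign up",
            "privacy": {"sensitive": True, "pii_type": "email", "handling": "tokenize"},
        },
        "ip_address": {
            "data-type": "string",
            "doc": "IP address the signup request came from",
            "privacy": {"sensitive": True, "pii_type": "ip", "handling": "obfuscate"},
        },
        "shop_plan": {
            "data-type": "string",
            "doc": "Plan selected during signup",
            "privacy": {"sensitive": False},
        },
    },
}
```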

    Our new schematization platform was successful in capturing the aforementioned context and it significantly increased privacy education and awareness among our data scientists and developers because of discussions on schema proposals about identifying personal data fields. In the vast majority of cases, the proposed schema contained all the proper context, but when required, or in doubt, privacy advice was available. This exemplified that when given accurate and simple tooling, everyone is inclined to do the right thing and respect privacy. Lastly, this platform helped with reusability, observability, and streamlining common tasks for the data scientists too. Our schematization platform signified the importance of capitalizing on shared goals across different teams in a large organization.

    Personal Data Handling

    At this point, we have flashy schemas that gather all the context we need regarding structure, ownership, and privacy for our analytical events. However, we still haven’t addressed the problem of how to handle and track personal information accurately in our data warehouse. In other words, after having received a deletion or access privacy request, how do we fetch and remove PII from our data warehouse?

    The short answer: we won’t store any PII in our data warehouse. To facilitate this, we perform two types of transformation on personal data before entering our data warehouse. These transformations convert personal (identifying) data to non-personal (non-identifying) data, hence there’s no need to remove or report them anymore. It sounds counterintuitive since it seems data might be rendered useless at this point. We preserve analytical knowledge without storing raw personal data through what GDPR calls pseudonymisation, “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information”. In particular, we employ two types of pseudonymisation techniques: Obfuscation and Tokenization.

    It's important to stress that personal data that's undergone pseudonymisation and could be attributed to a natural person by the use of additional information, directly or indirectly, is still considered personal information under GDPR and requires proper safeguards. Hence, when we said we won't have any PII in our data warehouse, it wasn't entirely precise. However, this approach allows us to control personal data, reduce risk, and truly anonymize or remove PII when requested.

    Obfuscation and Enrichment

    In obfuscation, identifying parts of data are either masked or removed so the people whom the data describe remain anonymous. Our obfuscation operators don't just remove identifying information, they enrich data with non-personal data as well. This often removes the need for storing personal data at all. However, a crucial point is to preserve the analytical value of these records in order for them to stay useful.

    When we obfuscate an IP address, for example, we mask half of the bytes but include geolocation data at the city and country level. In most cases, this is what the raw IP address was intended to be used for in the first place. This had a big impact on adoption of our new platform, and in some cases offered added value too.

    Looking at different types of PII and how they’re used, we quickly observed patterns. For instance, the main use case of a full user agent string is to determine operating system, device type, and major version that are shared among many users. But a user agent can contain very detailed identifying information including screen resolution, fonts installed, clock skew, and other bits that can identify a data subject, hence they’re considered PII. So, during obfuscation, all identifying bits are removed and replaced with generalized aggregate level data that data analysts seek. The table below shows some examples of different PII types and how they’re obfuscated and enriched.

    PII Type: IP Address
    Raw form: 207.164.33.12
    Obfuscated: { "masked": "207.164.0.0", "geo_country": "Canada" }

    PII Type: User agent
    Raw form: CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 Instagram 8.4.0 (iPhone7,2; iPhone OS 9_3_2; nb_NO; nb-NO; scale=2.00; 750x1334
    Obfuscated: { "Family": "Instagram", "Major": "8", "Os.Family": "iOS", "Os.Major": "9", "Device.Brand": "Apple", "Device.Model": "iPhone7" }

    PII Type: Latitude/Longitude
    Raw form: 45.4215° N, 75.6972° W
    Obfuscated: 45.4° N, 75.6° W

    PII Type: Email
    Raw form: john@gmail.com
    Obfuscated: REDACTED@gmail.com

    PII Type: Email
    Raw form: behrooz@example.com
    Obfuscated: REDACTED@REDACTED.com

    A keen observer might realize some of the obfuscated data might still be unique enough to identify individuals. For instance, when a new device like an iPhone is released, there might be few people who own that device, hence, leading to identification especially combined with other obfuscated data. To address these limitations, we hold a list of allowed devices, families, or versions that we’re certain have enough unique instances (more than a set threshold) and gradually add to this list (as more unique individuals are part of that group). It’s important to note that this still isn’t perfect anonymization, and it’s possible that an attacker combines enough anonymized and other data to identify an individual. However, that risk and threat model isn’t as significant within an organization where access to PII is more easily available.
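
    To make the IP example concrete, here is a minimal sketch of an obfuscation-plus-enrichment operator; the geolocation lookup is a stand-in for a real IP-to-geo database, and none of this is Shopify's actual operator.

```python
import ipaddress

def lookup_country(ip: str) -> str:
    """Stand-in for a real IP-to-geo lookup (for example, a MaxMind-style database)."""
    return "Canada" if ip.startswith("207.164.") else "Unknown"

def obfuscate_ip(ip: str) -> dict:
    """Mask the lower half of an IPv4 address and enrich it with country-level geolocation."""
    country = lookup_country(ip)                 # enrich before the raw value is discarded
    octets = ipaddress.ip_address(ip).exploded.split(".")
    masked = ".".join(octets[:2] + ["0", "0"])   # keep the /16, drop the identifying half
    return {"masked": masked, "geo_country": country}

print(obfuscate_ip("207.164.33.12"))
# {'masked': '207.164.0.0', 'geo_country': 'Canada'}
```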

    Tokenization

    Obfuscation is irreversible (the original PII is gone forever) and doesn't suit every use case. There are times when data scientists require access to the actual raw values of PII (imagine preparing a list of emails to send a promotional newsletter). To address these needs, we built a tokenization engine that exchanges PII for a consistent random token. We then store only tokens in the data warehouse, not the raw PII. A separate secured vault service is in charge of storing the token-to-PII mapping. This way, if there's a delete request, only the mapping in the vault service needs removing, and all copies of the corresponding token across the data warehouse become effectively non-detokenizable (in other words, just random strings).

    To understand the tokenization process better let’s go through an example. Let’s say Hooman is a big fan of AllBirds and GymShark products, and he purchases two pairs of shoes from AllBirds and a pair of shorts from GymShark to hit the trails! His purchase data might look like the table below before tokenization:

    Email               Shop        Product          ...
    hooman@gmail.com    allbirds    Sneaker
    hooman@gmail.com    Gymshark    Shorts
    hooman@gmail.com    allbirds    Running Shoes
     
    After tokenization is applied, the table above looks like the one below:

    Email       Shop        Product          ...
    Token123    allbirds    Sneaker
    Token456    Gymshark    Shorts
    Token123    allbirds    Running Shoes

    There are two important observations in the after tokenization table:

    1. The same PII (hooman@gmail.com) was replaced by the same token (Token123) under the same data controller (the allbirds shop) and data subject (Hooman). This is the consistency property of tokens.
    2. On the other hand, the same PII (hooman@gmail.com) got a different token (Token456) under a different data controller (merchant shop) even though the actual PII remained the same. This is the multi-controller property of tokens and allows data subjects to exercise their rights independently among different data controllers (merchant shops). For instance, if Hooman wants to be forgotten or deleted from allbirds, that shouldn’t affect his history with Gymshark.

    Now let’s take a look inside how all of this information is stored within our tokenization vault service shown in table below.

    Data Subject        Controller    Token       PII
    hooman@gmail.com    allbirds      Token123    hooman@gmail.com
    hooman@gmail.com    Gymshark      Token456    hooman@gmail.com
    ...                 ...           ...         ...

    The vault service holds the token-to-PII mapping and the privacy context, including data controller and data subject. It uses this context to decide whether to generate a new token for the given PII or reuse the existing one. The consistency property of tokens allows data scientists to perform analysis without requiring access to the raw values of PII. For example, all of Hooman's orders from Gymshark can be tracked just by looking for Token456 across the tokenized orders dataset.

    Now back to our original goal: let's review how all of this helps with the deletion of PII in our data warehouse (access and reporting requests work similarly, except that instead of deleting the target records, they're reported back). If we store only obfuscated and tokenized PII in our datasets, essentially there will be nothing left in the data warehouse to delete after removing the mapping from the tokenization vault. To understand this, let's go through some examples of deletion requests and how they affect our datasets as well as the tokenization vault.

    Data Subject        Controller    Token       PII
    hooman@gmail.com    allbirds      Token123    hooman@gmail.com
    hooman@gmail.com    Gymshark      Token456    hooman@gmail.com
    hooman@gmail.com    Gymshark      Token789    222-333-4444
    eva@hotmail.com     Gymshark      Token011    IP 76.44.55.33

    Assume the table above shows the current content of our tokenization vault, and these tokens are stored across our data warehouse in multiple datasets. Now Hooman sends a deletion request to Gymshark (the controller) and subsequently Shopify (the data processor) receives it. At this point, all that's required to delete Hooman's PII under Gymshark is to locate rows with the following condition:

    DataSubject == ‘hooman@gmail.com’ AND Controller == Gymshark

    Which results in the rows identified with a star (*) in the table below:

    Data Subject          Controller    Token       PII
      hooman@gmail.com    allbirds      Token123    hooman@gmail.com
    * hooman@gmail.com    Gymshark      Token456    hooman@gmail.com
    * hooman@gmail.com    Gymshark      Token789    222-333-4444
      eva@hotmail.com     Gymshark      Token011    IP 76.44.55.33

    Similarly, if Shopify needed to delete all Hooman’s PII across all controllers (shops), it would need to only look for rows that have Hooman as the data subject, highlighted below:
    Data Subject          Controller    Token       PII
    * hooman@gmail.com    allbirds      Token123    hooman@gmail.com
    * hooman@gmail.com    Gymshark      Token456    hooman@gmail.com
    * hooman@gmail.com    Gymshark      Token789    222-333-4444
      eva@hotmail.com     Gymshark      Token011    IP 76.44.55.33

    Last but not least, the same theory applies to merchants too. For instance, assume (let's hope that never happens!) Gymshark, as a data subject itself, decides to close its shop and asks Shopify (the data controller in this relationship) to delete all PII controlled by the shop. In this case, we could do a search with the following condition:

    Controller == Gymshark

    Which will result in the rows indicated in the table below:

    Data Subject          Controller    Token       PII
      hooman@gmail.com    allbirds      Token123    hooman@gmail.com
    * hooman@gmail.com    Gymshark      Token456    hooman@gmail.com
    * hooman@gmail.com    Gymshark      Token789    222-333-4444
    * eva@hotmail.com     Gymshark      Token011    IP 76.44.55.33

    Notice that in all of these examples, there was nothing to do in the actual data warehouse: once the token ↔ PII mapping is deleted, the tokens effectively become just random strings. In addition, all of these operations can be done in fractions of a second, whereas doing any task in a petabyte-scale data warehouse can be very challenging, time consuming, and resource intensive.
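
    To tie these examples together, here is a minimal in-memory sketch of a vault with the consistency and multi-controller properties and the deletion paths described above; the real vault is a separate secured service, and every name here is hypothetical.

```python
import secrets

class TokenizationVault:
    """Toy in-memory vault: token <-> PII mappings keyed by data subject and controller."""

    def __init__(self):
        self._by_key = {}    # (data_subject, controller, pii) -> token
        self._by_token = {}  # token -> pii

    def tokenize(self, data_subject, controller, pii):
        key = (data_subject, controller, pii)
        if key not in self._by_key:                 # consistency: same context, same token
            token = "Token" + secrets.token_hex(4)  # different controller, different token
            self._by_key[key] = token
            self._by_token[token] = pii
        return self._by_key[key]

    def detokenize(self, token):
        return self._by_token.get(token)            # None once the mapping has been deleted

    def delete(self, data_subject=None, controller=None):
        """Drop every mapping matching the given data subject and/or controller."""
        doomed = [key for key in self._by_key
                  if (data_subject is None or key[0] == data_subject)
                  and (controller is None or key[1] == controller)]
        for key in doomed:
            self._by_token.pop(self._by_key.pop(key), None)

vault = TokenizationVault()
t_allbirds = vault.tokenize("hooman@gmail.com", "allbirds", "hooman@gmail.com")
t_gymshark = vault.tokenize("hooman@gmail.com", "Gymshark", "hooman@gmail.com")
assert t_allbirds != t_gymshark                     # multi-controller property

vault.delete(data_subject="hooman@gmail.com", controller="Gymshark")
print(vault.detokenize(t_gymshark))  # None: copies of this token are now just random strings
print(vault.detokenize(t_allbirds))  # still resolves: the allbirds mapping was untouched
```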

    Schematization Platform Overview

    So far we’ve learned about details of schematization, obfuscation, and tokenization. Now it’s time to put all of these pieces together in our analytical platform. The image below shows an overview of the journey of an event from when it’s fired until it’s stored in the data warehouse:

    The journey of an event from the messaging pipeline (Kafka), via the Schema Repository and the Tokenization Vault, to the data warehouse.

    In this example:

    1. A SignUp event is triggered into the messaging pipeline (Kafka).
    2. A tool, the Scrubber, intercepts the message in the pipeline and applies pseudonymisation to the content using the predefined schema fetched from the Schema Repository for that message.
    3. The Scrubber identifies that the SignUp event contains tokenization operations too. It then sends the raw PII and privacy context to the Tokenization Vault.
    4. The Tokenization Vault exchanges the PII and privacy context for a token and sends it back to the Scrubber.
    5. The Scrubber replaces the PII in the content of the SignUp event with the token.
    6. The new anonymized and tokenized SignUp event is put back onto the message pipeline.
    7. The PII-free SignUp event is stored in the data warehouse.

    In theory, this schematization platform can allow a PII free data warehouse for all new incoming events; however, in practice, there still exists some challenges to be addressed.
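
    As a rough sketch of what steps 2 through 5 could look like in code (this is not the actual Scrubber; the schema shape follows the hypothetical example from earlier, and the vault and obfuscation helpers are stand-ins):

```python
def obfuscate(pii_type: str, value: str) -> str:
    """Stand-in for the obfuscation operators (IP masking, user agent generalization, and so on)."""
    return f"OBFUSCATED<{pii_type}>"

class FakeVault:
    """Stand-in for the tokenization vault service."""
    def tokenize(self, data_subject, controller, pii):
        return f"Token{hash((data_subject, controller, pii)) & 0xFFFF:04x}"

def scrub(event: dict, schema: dict, vault) -> dict:
    """Replace every sensitive field in an event according to its privacy block (steps 2 to 5 above)."""
    privacy_setting = schema["privacy_setting"]
    controller = privacy_setting["data_controller"]
    subject = event[privacy_setting["data_subject"]]  # for example, the signup email
    clean = dict(event)
    for name, spec in schema["fields"].items():
        privacy = spec.get("privacy", {})
        if not privacy.get("sensitive") or name not in clean:
            continue
        if privacy["handling"] == "tokenize":
            clean[name] = vault.tokenize(subject, controller, clean[name])
        else:
            clean[name] = obfuscate(privacy["pii_type"], clean[name])
    return clean

# Example with a minimal schema in the shape sketched earlier:
schema = {
    "privacy_setting": {"data_controller": "shopify", "data_subject": "email"},
    "fields": {
        "email": {"privacy": {"sensitive": True, "pii_type": "email", "handling": "tokenize"}},
        "ip_address": {"privacy": {"sensitive": True, "pii_type": "ip", "handling": "obfuscate"}},
        "shop_plan": {"privacy": {"sensitive": False}},
    },
}
event = {"email": "hooman@gmail.com", "ip_address": "207.164.33.12", "shop_plan": "basic"}
print(scrub(event, schema, FakeVault()))
```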

    Lessons from Managing PII at Shopify Scale

    Despite having a sound technical solution for classifying and handling PII in our data warehouse, Shopify's scale made adoption, and the reprocessing of our historic data, a difficult task. Here are some lessons that helped us in this journey.

    Adoption

    Having a solution and adopting it are two different problems. Initially, with a sound prototype ready, we struggled to get approval and commitment from all stakeholders to implement this new tooling, and rightly so. Looking back at all of the proposed changes and tools to an existing platform, it does seem like open heart surgery, and of course you'd likely face resistance. There's no bulletproof solution to this problem, or at least not one that we knew of! Let's review a few factors that significantly helped us.

    Make the Wrong Thing the Hard Thing

    Make the right thing the default option. A big factor in the success and adoption of our tooling was to make our tooling the default and easy option. Nowadays, creating and collecting unstructured analytical events at Shopify is difficult and goes through a tedious process with several layers of approval. Whereas creating structured privacy-aware events is a quick, well documented, and automated task.

    “Trust Me, It Will Work” Isn’t Enough!

    Proving the scalability and accuracy of the proposed tooling was critical to building trust in our approach. We used the same tooling and mechanisms that the Data Science & Engineering team uses to prove correctness: reconciliation. We showed the scalability of our tooling by testing it on real datasets and by stress testing it under loads orders of magnitude higher than normal.

    Make Sure the Tooling Brings Added Value

    Our new tooling isn't only the default and easy way to collect events, it also offers added value and benefits such as:

    • Streamlined workflow: No need for multiple teams to worry about compliance and fulfilling privacy requests
    • Increased data enrichment: For instance, geolocation data derived from IP addresses, or family and device info derived from user agent strings, is often the information data scientists are after in the first place.
    • Shared privacy education: Our new schematization platform encourages asking about and discussing privacy concerns. They range from what’s PII to other topics like what can or can’t be done with PII. It brings clarity and education that wasn’t easily available before.
    • Increased dataset discoverability: Schemas for events allow us to automatically integrate with query engines and existing tooling, making datasets quick to be used and explored.

    These examples were a big driver of, and encouragement for, the adoption of our new tooling.

    Capitalizing on Shared Goals

    Schematization isn’t only useful for privacy reasons, it helps with reusability and observability, reduces storage cost, and streamlines common tasks for the data scientists too. Both privacy and data teams are important stakeholders in this project and it made collaboration and adoption a lot easier because we capitalized on shared goals across different teams in a large organization.

    Historic Datasets

    Several petabytes of historic events had been collected in our data warehouse prior to the schematization platform. Even after implementing the new platform, the challenge of dealing with these large historic datasets remained. What made it formidable was the sheer amount of data for which it was hard to identify an owner, and which had to be reprocessed and migrated without disturbing the production platform. In addition, it's not particularly the most exciting kind of work, so it's easy for it to get deprioritized.

    Intricate interdependencies between some of the analytical jobs depending on these datasets

    The above image shows a partial view of the intricate interdependency between some of the analytical jobs depending on these datasets. Similar to adoption challenges, there’s no easy solution for this problem, but here are some practices that helped us in mitigating this challenge.

    Organizational Alignment

    Any task of this scale goes beyond the affected individuals, projects, or even teams. Hence, an organizational commitment and alignment is required to get it done. People, teams, priorities, and projects might change, but if there's organizational support and commitment for addressing privacy issues, the task can survive. Organizational alignment helped us put out consistent messaging to various team leads, which meant everyone understood the importance of the work. With this alignment in place, it was usually just a matter of working with leads to find the right balance of completing their contributions in a timely fashion without completely disrupting their roadmap.

    Dedicated Task Force

These kinds of projects are slow and time consuming. In our case, it took over a year, and there were several changes at the individual and team level along the way. We understood the importance of anchoring the work in a team and a project rather than depending on individuals. People come and go, but the project must carry on.

    Tooling, Documentation, and Support

One of our goals was to minimize the amount of effort individual dataset owners and users needed to migrate their datasets to the new platform. We documented the required steps, built automation for tedious tasks, and created integrations with tooling that data scientists and librarians were already familiar with. In addition, having engineering support available for hurdles was important: on many occasions when performance or other technical issues came up, an engineer was there to solve the problem. Time spent on building the tooling, documentation, and support procedures easily paid off in the long run.

    Regular Progress Monitoring

Regularly questioning dependencies, priorities, and blockers paid off because we often found better ways forward. For instance, in a situation where x is considered a blocker for y, maybe:

    • we can ask the team working on x to reprioritize and unblock y earlier.
    • both x and y can happen at the same time if the teams owning them align on some shared design choices.
    • there's a way to reframe x or y or both so that the dependency disappears.

We were able to do this kind of reevaluation because we monitored progress regularly and could identify blockers early.

    New Platform Operational Statistics

Our new platform has been in production use for over two years. Today, we have over 4,500 distinct analytical schemas for events, each designed to capture certain metrics or analytics and each with its own privacy context. On average, these schemas generate roughly 20 billion events per day, or approximately 230K events per second, with peaks of over 1 million events per second during busy times. Every single one of these events is processed by our obfuscation and tokenization tools in accordance with its privacy context before becoming accessible in the data warehouse or anywhere else.

Our tokenization vault holds more than 500 billion distinct PII-to-token mappings (approximately 200 terabytes), of which tens to hundreds of millions are deleted daily in response to privacy or shop purge requests. The magical part of this platform is that deletion happens instantaneously in the tokenization vault alone, without requiring any operation in the data warehouse or anywhere else tokens are stored. This is the superpower that lets us delete data that used to be very difficult to even identify: the undeletable. These metrics, and the ease of fulfilling privacy requests, proved the efficiency and scalability of our approach and new tooling.
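To make that property concrete, here's a minimal, hypothetical sketch of how deleting a vault mapping renders every stored token meaningless without touching the warehouse. The class and method names are illustrative, not our actual implementation.

```python
import secrets
from typing import Optional


class TokenizationVault:
    """Toy in-memory vault holding PII-to-token mappings (illustrative only)."""

    def __init__(self):
        self._pii_to_token = {}
        self._token_to_pii = {}

    def tokenize(self, pii_value: str) -> str:
        # Reuse the existing token so the same PII always maps to one token.
        if pii_value in self._pii_to_token:
            return self._pii_to_token[pii_value]
        token = "tok_" + secrets.token_hex(8)
        self._pii_to_token[pii_value] = token
        self._token_to_pii[token] = pii_value
        return token

    def detokenize(self, token: str) -> Optional[str]:
        # Returns None once the mapping has been deleted.
        return self._token_to_pii.get(token)

    def forget(self, pii_value: str) -> None:
        # Deleting the mapping here is enough: tokens stored elsewhere become meaningless.
        token = self._pii_to_token.pop(pii_value, None)
        if token is not None:
            del self._token_to_pii[token]


vault = TokenizationVault()
# The warehouse only ever stores the token, never the raw PII.
warehouse_row = {"event": "checkout_completed", "email": vault.tokenize("buyer@example.com")}

print(vault.detokenize(warehouse_row["email"]))  # buyer@example.com
vault.forget("buyer@example.com")                # fulfil a deletion request
print(vault.detokenize(warehouse_row["email"]))  # None: the stored token now points to nothing
```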

As part of onboarding our historic datasets onto the new platform, we rebuilt roughly 100 distinct datasets (tens of petabytes of data in total) feeding hundreds of jobs in our analytical platform. Development, rollout, and reprocessing of our historical data altogether took about three years, with help from 94 different individuals, which signals the scale of effort and commitment we put into this project.

We believe sharing the story of this metamorphosis in our data analytics platform is valuable because, when we looked for industry examples of facilitating privacy requests at this scale, there were very few available. In our experience, schematization, and a platform that captures context such as privacy and schema evolution, is beneficial in analytical event collection systems. It enables a variety of opportunities for treating sensitive information and for educating developers and data scientists on data privacy. In fact, our adoption story showed that people are highly motivated to respect privacy when they have the right tooling at their disposal.

Tokenization and obfuscation proved to be effective tools for handling, tracking, and deleting personal information. They enabled us to efficiently delete the undeletable at very large scale.

Finally, we learned that solving the technical challenges isn't the entire problem. Organizational challenges, such as adoption and dealing with historic datasets, remain tough to address. While we didn't have a bulletproof solution, we learned that bringing new value, capitalizing on shared goals, streamlining and automating processes, and having a dedicated task force to champion big cross-team initiatives are effective and helpful techniques.

    Additional Information

Behrooz is a staff privacy engineer at Shopify, where he works on building scalable privacy tooling and helps teams respect privacy. He received his MSc in Computer Science from the University of Waterloo in 2015. Outside of the binary world, he enjoys being upside down (gymnastics) 🤸🏻, on a bike 🚵🏻‍♂️, on skis ⛷, or in the woods. Twitter: @behroozshafiee

    Shipit! Presents: Deleting the Undeletable

    On September 29, 2021, Shipit!, our monthly event series, presented Deleting the Undeletable. Watch Behrooz Shafiee and Jason White as they discuss the new schematization platform and answer your questions.




Updating Illustrations at Scale

The Polaris team creates tools, education, and documentation that help unite teams across Shopify to build better experiences. We created Polaris, our design system, and continue to maintain it. We are a multidisciplinary team with a range of experience. Some people have been at Shopify for over 6 years and others, like me, are a part of our Dev Degree program.


Shipit! Presents: How We Write React Native Apps

On May 19, 2021, Shipit!, our monthly event series, presented How We Write React Native Apps. Watch Colin Gray, Haris Mahmood, Guil Varandas, and Michelle Fernandez, the developers setting the React Native standards at Shopify, as they share how we write performant React Native apps.

     

    Q: What best practices can we follow when we’re building an app, like for accessibility, theming, typography, and so on?
    A: Our Restyle and Polaris documentation cover a lot of this and is worth reading through to reference, or to influence your own decisions on best practices.

    Q: How do you usually handle running into crashes or weird bugs that are internal to React Native? In my experience some of these can be pretty mysterious without good knowledge of React Native internals. Pretty often issues on GitHub for some of these "rare" bugs might stall with no solution, so working on a PR for a fix isn't always a choice (after, of course, submitting a well documented issue).
    A: We rely on various debugging and observability tools to detect crashes and bug patterns. That being said, running into bugs that are internal to React Native isn’t a scenario that we have to handle often, and if it ever happens, we rely on the technical expertise of our teams to understand it and communicate, or fix it through the existing channels. That’s the beauty of open source!

    Q: Do you have any guide on what needs to be flexible and what fixed size... and where to use margin or paddings? 
    A: Try to keep most things flexible unless absolutely necessary. This results in your UI being more fluid and adaptable to various devices. We use fixed sizes mostly for icons and imagery. We utilize padding to create spacing within a component and margins to create spacing between components.

    Q: Does your team use React Studio?
A: No, but a few native Android developers coming from the IntelliJ suite of editors have set up their IDE so they can code in both React and Kotlin, with code resolution, in a single IDE.

    Q: Do you write automated tests using protractor/cypress or jest?
    A: Jest is our go-to library for writing and running unit and integration tests.

Q: Is Shopify a brownfield app? If it is, how are you handling navigation between React Native and native?
A: Shop and POS are both React Native from the ground up, but we do have a brownfield app in the works. We're adding React Native views piecemeal, so navigation is handled by the existing navigation controllers. Wiring this up is work, no getting around that.

Q: How do you synchronize native (KMM) and React Native state?
    A: We try to treat React Native state as the “Source of Truth”. At startup, we pass in whatever is necessary for the module to begin its work, and any shared state is managed in React Native, and updated via the native module (updates from the native module are sent via EventEmitter). This means that the native module is only responsible for its internal state and shared state is kept in React Native. One exception to this in the Point of Sale app is the SQLite database. We access that entirely via a native module. But again there’s only one source of truth.


    Q: How do you manage various screen sizes and responsive layouts in React Native? (Polaris or something else)
    A: We try not to use fixed sizing values whenever possible resulting in UIs more able to adjust to various device sizes. The Restyle library allows you to define breakpoints and pass in different values for each breakpoint when defining styles. For example, you can pass in different font sizes or spacing values depending on the breakpoints you define.

    Q: Are you using Reanimated 2 in production at Shopify?
    A: We are! The Shop app uses Reanimated 2 in production today.

    Q: What do you use to configure and manage your CI builds?
A: We use Buildkite. Check out these two posts to learn more.


    Q: In the early stage of your React Native apps did you use Expo, or it was never an option?
A: We explored it, but most of our apps quickly needed to "eject" from that workflow. We eventually decided to create our React Native applications as "vanilla" applications. Expo is great though, and we encourage people to use it for their own side projects.

    Q: Are the nightly QAs automatic? How is the QA cycle?
    A: Nightly builds are created automatically on our main branch. These builds automatically get uploaded to a test distribution platform and the Shopifolk (product managers, designers, developers) who have the test builds installed can opt in to always be updated to the latest version. Thanks to the ShipIt tool, any feature branches with failing tests will never be allowed to be merged to main.

    All our devs are responsible for QA of the app and ensuring that no regressions occur before release.

    Q: Have you tried Loki?
    A: Some teams have tried it, but Loki doesn’t work with our CI constraints.



How Shopify Built An In-Context Analytics Experience

    Federica Luraschi & Racheal Herlihy 

Whether determining which orders to fulfill first or assessing how their customer acquisition efforts are performing, our merchants rely on data to make informed decisions about their business. In 2013, we made easy access to data a reality for Shopify merchants with the Analytics section. The Analytics section lives within the Shopify admin (where merchants log in to manage their business) and gives them access to data that helps them understand how their business is performing.

While the Analytics section is a great way to get data into merchants' hands, we realized there was an opportunity to bring insights right into their workflows. That's why we've launched a brand new merchant analytics experience that surfaces data in context within the most visited page in the Shopify admin: the Orders page.

Below, we'll discuss the motivation behind bringing analytics in-context and the data work that went into launching this new product. We'll walk you through the exploratory analysis, data modeling, instrumentation, and success metrics work we conducted to bring analytics into our merchants' workflows.

    Moving Towards In-Context Analytics

    Currently, almost all the data surfaced to merchants lives in the Analytics section in the Shopify admin. The Analytics section contains:

    • Overview Dashboard: A dashboard that captures a merchant’s business health over time, including dozens of metrics ranging from total sales to online store conversion rates.
    • Reports: Enables merchants to explore their data more deeply and view a wider, more customizable range of metrics.
    • Live View: Gives merchants a real-time view of their store activity by showcasing live metrics and a globe to highlight visitor activity around the world. 
Shopify Analytics section Live View


    From user research, as well as quantitative analysis, we found merchants were often navigating back and forth between the Analytics section and different pages within the Shopify admin in order to make data-informed decisions. To enable merchants to make informed decisions faster, we decided to bring data to where it is most impactful to them—right within their workflow.

    Our goal was to insert analytics into the most used page in the Shopify admin, where merchants view and fulfill their orders—aka the Orders page. Specifically, we wanted to surface real-time and historical metrics that give merchants insight into the health of their orders and fulfillment workflows.

    Step 1: Choosing (and Validating) Our Metrics

    First, we collaborated with product managers to look at merchant workflows on the Orders Page (for example how merchants fulfill their orders) and, just as important, the associated goals the merchant would have for that workflow (e.g. they want to reduce the time required for fulfillment). We compared these to the available data to identify:

    • The top-level metrics (for example median fulfillment time)
    • The dimensions for the metrics (for example location)
    • Complementary visualizations or data points to the top-level metrics that we would surface in reports (for example the distribution of the fulfillment time)

    For every metric we identified, we worked through specific use cases to understand how a merchant would use the metric as part of their workflow. We wanted to ensure that seeing a metric would push merchants to a follow-up action or give them a certain signal. For example, if a merchant observes that their median fulfillment time for the month is higher than the previous month, they could explore this further by looking at the data over time to understand if this is a trend. This process validated that the metrics being surfaced would actually improve merchant workflows by providing them actionable data.

    Step 2: Understanding the Shape of the Data

    Shopify has over 1.7 million merchants all at different stages of their entrepreneurial journey. In order to build the best analytics experience it was important for our team to have an understanding of what the data would look like for different merchant segments. Some of the ways we segmented our merchants were by:
    • Order volumes (for example low, medium or high volume stores)
    • The stage of their entrepreneurial journey (for example stores that just started making sales to stores that have been running their business for years)
    • Different fulfillment and delivery processes (for example merchants that deliver orders themselves to merchants that use shipping carriers)
    • Geographic region
    • Industry

    After segmenting our merchants, we looked at the “shape of the data” for all the metrics we wanted to surface. More specifically, it was important for us to answer the following questions for each metric:

    • How many merchants would find this metric useful?
    • What merchant segments would find it most useful?
    • How variable is this metric over time for different merchants?
    • What time period is this metric most valuable for?

These explorations helped us understand the data available to surface and also validate (or invalidate) our proposed metrics. Below are some examples of how the shape of the data affected our product:

• What we saw: We don't have the data necessary to compute a metric, or a metric is always 0 for a merchant segment. Action we took: Only show the metric to the stores where it's applicable.
• What we saw: A metric stays constant over time. Action we took: Don't show the metric on the Orders page, since it isn't a sensitive enough health indicator.
• What we saw: A metric is most useful for longer time periods. Action we took: Make the metric available only in reports, where a merchant can look at those longer time periods.

    These examples highlight that looking at the real data for different merchant segments was crucial for us to build an analytics experience that was useful, relevant, and fit the needs of all merchants.
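As a rough illustration of this kind of per-segment check, the sketch below groups a synthetic merchant table by a made-up segment column and summarizes one metric's coverage and variability per segment. The column names and distributions are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for a merchant-level metrics table (column names are illustrative).
df = pd.DataFrame({
    "shop_id": np.arange(1_000),
    "segment": rng.choice(["low_volume", "medium_volume", "high_volume"],
                          size=1_000, p=[0.6, 0.3, 0.1]),
    "median_fulfillment_hours": rng.gamma(shape=2.0, scale=12.0, size=1_000),
})
# Some shops simply don't have the metric yet (e.g. no fulfillments so far).
df.loc[df.sample(frac=0.15, random_state=0).index, "median_fulfillment_hours"] = np.nan

# For each segment: how many shops have the metric at all, and how variable is it?
summary = df.groupby("segment")["median_fulfillment_hours"].agg(
    shops="size",
    with_metric="count",
    median="median",
    std="std",
)
summary["coverage"] = summary["with_metric"] / summary["shops"]
print(summary.round(2))
```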

The data highlights template slide

    As you can imagine, these data explorations meant we were producing a lot of analyses. By the end of our project, we had collected hundreds of data points. To communicate these analyses clearly to our collaborators we produced a slide deck that contained one data point per slide, along with its implications and product recommendations. This enabled us to share our analyses in a consistent, digestible format, keeping everyone on our team informed.

    Step 3: Prototyping the Experience with Real Data

    Once we had explored the shape of the data, aligned on the metrics we wanted to surface to our merchants, and validated them, we worked with the UX team to ensure that they could prototype with data for the different merchant segments we outlined above.

Example of prototyped data: the in-context analytics fulfillment, shipping, and delivery times report

When we started exploring data for real stores, we often found ourselves rethinking our visualizations or introducing metrics complementary to the ones we'd already identified. For example, we initially considered a report that displayed the median fulfillment, delivery, and in-transit times by location. When looking at the data and prototyping the report, we noticed a large spread in the event durations and identified that a histogram of the distribution of event durations would be the most informative visualization for merchants. Because we could prototype the graphs with real data, we were able to explore new visualizations and provide them to our product, engineering, and UX collaborators, influencing the final designs.

    Step 4: Finalizing Our Findings and Metrics

    Every metric in the experience (on the Orders page and reports) was powered by a different query, meaning that there were a lot of queries to keep track of. We wanted to make sure that all clients (Web, Android, and iOS) were using the same query with the same data source, fields, and aggregations. To make sure there was one source of truth for the queries powering the experience, the final step of this exploratory phase was to produce a data specification sheet.

    This data specification sheet contained the following information for each metric in-context and in reports:

    • The merchant goal
    • The qualification logic (we only surface a metric to a merchant if relevant to them)
    • The query (this was useful to our developer team when we started building)
Data specification sheet template example

    Giving our whole team access to the queries powering the experience meant that anyone could look at data for stores when prototyping and building.
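For illustration only, a single entry of such a specification sheet could be captured as structured data that every client reads, something like the snippet below. The metric, qualification logic, and pseudo-SQL are hypothetical, not the real spec.

```python
# One hypothetical data specification entry, kept in a shared file so Web,
# Android, and iOS all build the metric from the same definition.
MEDIAN_FULFILLMENT_TIME_SPEC = {
    "metric": "median_fulfillment_time",
    "merchant_goal": "Reduce the time it takes to fulfill an order",
    "qualification_logic": "Only surfaced to shops with at least one fulfillment in the selected period",
    "query": """
        -- pseudo-SQL for illustration; table and column names are invented
        SELECT shop_id, MEDIAN(fulfillment_minutes) AS median_fulfillment_time
        FROM fulfillment_events
        WHERE event_date BETWEEN :start_date AND :end_date
        GROUP BY shop_id
    """,
}
```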

    Step 5: Building and Productionizing Our Models

    Once we had identified the metrics we wanted to surface in-context, we worked on the dataset design for all the new metrics that weren’t already modelled. This process involved a few different steps:

1. Identifying the business processes that our model would need to support (fulfilling an order, marking an order in transit, marking an order as delivered)
2. Selecting the grain of our data (we chose the atomic grain of one row per fulfillment, in-transit, or delivery event, to ensure our model would be compatible with any future product or dimension we wanted to add)
3. Identifying the dimensions we would need to include

The last step was to finalize a schema for the model that would support the product's needs.
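A minimal sketch of what a schema at that event grain might look like; the field names are illustrative, not the production model:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class OrderLifecycleEvent:
    """One row per fulfillment, in-transit, or delivery event (illustrative fields)."""

    event_id: str
    shop_id: int
    order_id: int
    event_type: str               # "fulfilled", "in_transit", or "delivered"
    occurred_at: datetime         # when the event happened
    location_id: Optional[int] = None         # dimension: fulfillment location, if known
    delivery_method: Optional[str] = None     # dimension: e.g. "shipping" or "local_delivery"
```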

    Beyond the dataset design, one of our requirements was that we wanted to surface real-time metrics, meaning we needed to build a streaming pipeline. However, for a few different reasons, we decided to start by building a batch pipeline.

First, modelling our data in batch meant we could produce the model in our usual development environment, iron out its details, and iterate on any data cleaning. The model was available for internal analysis before we sent it to our production analytics service, which made it easy to run sanity checks.

Second, because we were already familiar with building models in the batch environment, we were able to produce the batch model quickly. That let us iterate on it behind the scenes and gave the engineers on our team the ability to start querying the model and using its data as a placeholder while they built the experience.

Third, the batch model allowed us to backfill the historic data and eventually use the streaming pipeline only for the most recent data. This lambda architecture approach, where historical data comes from the batch model and the streaming model powers the most recent data not yet captured in batch, helped limit undue pressure on our streaming infrastructure.
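Conceptually, serving a metric from this setup can look like the sketch below: historical rows come from the batch model, and anything newer than the batch watermark comes from the streaming model. The frame and column names are made up for the example.

```python
import pandas as pd


def orders_over_time(batch: pd.DataFrame, streaming: pd.DataFrame) -> pd.DataFrame:
    """Combine batch history with streaming rows newer than the batch watermark.

    Both frames are expected to have 'event_date' and 'orders' columns (illustrative names).
    """
    watermark = batch["event_date"].max()                 # last date covered by the batch model
    recent = streaming[streaming["event_date"] > watermark]
    return (
        pd.concat([batch, recent], ignore_index=True)
          .groupby("event_date", as_index=False)["orders"].sum()
          .sort_values("event_date")
    )


batch = pd.DataFrame({"event_date": pd.to_datetime(["2021-01-01", "2021-01-02"]), "orders": [120, 140]})
streaming = pd.DataFrame({"event_date": pd.to_datetime(["2021-01-02", "2021-01-03"]), "orders": [141, 95]})
# The overlapping 2021-01-02 row is served from batch; only 2021-01-03 comes from streaming.
print(orders_over_time(batch, streaming))
```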

    Step 6: Measuring Success

    At Shopify, our primary success metrics always come back to: what is the merchant problem we’re solving? We use a framework that helps us define what we intend to answer and consider any potential factors that could influence our findings. The framework goes over the following questions:

    1. What is the problem?
    2. What is our hypothesis or expected outcome for resolving the problem?
    3. What signals can we use to determine if we are successful?
    4. What factors could contribute to seeing this signal go up or down, and which are good or bad?
    5. What is our baseline or goal?
    6. What additional context or segments should we focus on?

    Here’s an example for this project using the above framework:

    1. Problem: as a merchant, I don’t want to have to dig for the data I need to inform decisions. I want data to be available right within my workflow.
    2. Hypothesis: providing in-context analytics within a merchant’s workflow will enable them to make informed decisions, and not require them to move between pages to find relevant data.
    3. Signals: merchants are using this data while completing operational tasks on the Orders page, and we see a decrease in their transitioning between the Orders page and Analytics section.
    4. Considerations: the signal and adoption may be lower than expected due to this being a new pattern. This may not necessarily be because the data wasn’t valuable to the merchant, but simply because merchants aren’t discovering it.
    5. Baseline: merchants transitioning between the Orders Page and Analytics section prior to release compared to post-release.
6. Context: explore usage by segments like merchant business size (for example, the larger the business, the more likely they are to hire staff to fulfill orders, which may mean they aren't interested in analyzing the data).

This is just one example, and each point in the framework carries a long list of things to consider. It's also important to note that for this project there are other audiences with questions of their own. For instance, our data engineers have goals around the performance of the new data model. Because this project has multiple goals and audiences, combining all of these success metrics into one dashboard would be chaotic. That's why we decided to create a dashboard for each goal and audience, documenting the key questions each dashboard would answer. If you're interested in how we approach making dashboards at Shopify, check out our blog!

    As for understanding how this feature impacts users, we’re still working on that last step to ensure there are no unintended negative impacts.

    The Results

From the ideation and validation of the metrics, to prototyping and building the data models, to measuring success, data science was truly involved end-to-end in this project. With our new in-context analytics experience, merchants can see the health of their orders and fulfillment workflows right on their Orders page. More specifically, we surface in-context data about their:

    • Total orders (overall and over time)
    • Number of ordered items
    • Number of returned items
    • Fulfilled orders (overall and over time)
    • Delivered orders
    • Median fulfillment time

These metrics capture data for the day, the last seven days, and the last thirty days. For every time period, merchants see a green (positive change) or red (negative change) indicator informing them of the metric's health compared to the previous period.

The in-context analytics experience on the Orders page

    We also gave merchants the functionality to click on a metric to view reports that give them a more in-depth view of the data:

    • The Orders Over Time Report: Displays the total number of orders that were received over the selected time period. It includes total orders, average units (products) per transaction, average order value, and returned items.
    • Product Orders and Returns Report: Helps merchants understand which products are their best sellers and which get returned the most often.
    • Fulfillment, Shipping, and Delivery Times Report: Shows how quickly orders move through the entire fulfillment process, from order receipt to delivery to the customer.
    • Fulfillments Over Time Report: Showcases the total number of orders that were either fulfilled, shipped, or delivered over the selected time period.
In-context analytics fulfillments over time report

    What We Learned

    There were a lot of key takeaways from this project that we plan to implement in future analytics projects, including:

1. Collaborating with different disciplines and creating central documents as a source of truth. Working and communicating effectively with various teams was an essential part of enabling data-informed decision making. Creating documents like the data highlights slide deck and the data specification sheet kept our full team up to date.

2. Exploring and prototyping our experience with real, segmented data. We can't stress this enough: our merchants come in all shapes and sizes, so it was critical for us to look at various segments and prototype with real data to ensure we were creating the best experience for all of them.

    3. Prototyping models in the batch environment before making them streaming. This was effective in derisking the modelling efforts and unblocking engineers.

So, what's next? We plan to continue putting data into the hands of our merchants, when and where they need it most. We aim to make data accessible to merchants on more surfaces in their day-to-day workflows, beyond the Orders page.

    If you’re interested in building analytic experiences that help entrepreneurs around the world make more data-informed decisions, we’re looking for talented data scientists and data engineers to join our team.

    Federica Luraschi:  Federica is a data scientist on the Insights team. In her last 3+ years at Shopify, she has worked on building and surfacing insights and analytics experiences to merchants. If you’d like to connect with Federica, reach out here.

Racheal Herlihy: Racheal has been a data scientist at Shopify for nearly three years. She works on the Insights team, whose focus is empowering merchants to make informed business decisions. Previously, Racheal helped co-found a social enterprise protecting New Zealand's native birds through sensor-connected pest traps. If you'd like to get in touch with Racheal, you can reach out on LinkedIn.


Other Driven Developments

    Mental models within an industry, company, or even a person, change constantly. As methodologies mature, we see the long term effects our choices have wrought and can adjust accordingly. As a team or company grows, methodologies that worked well for five people may not work as well for 40 people. If all employees could keep an entire app in their head, we’d need fewer rules and checks and balances on our development, but that is not the case. As a result, we summarize things we notice have been implicit in our work.


Three Ways We Share Context at Shopify Engineering

    To do your best work as a developer, you need context. A development community thrives when its members share context as a habit. That's why Shopify Engineering believes that sharing context is vital to our growth and success—as a company, as a team, and as individuals.

    Context is an abstract term, and we use it to refer to the why, what, and how of our development philosophies, approaches, and choices. Although sharing context comes in a myriad of forms, three of the ways we do it at Shopify are through:

    • Our in-house Development Handbook
    • A vibrant developer presentation program, called Dev Talks
    • Podcasts that go deep on technical subjects.

    Each of these programs is by and for developers at Shopify, and each has strong support from leadership. Let's take a brief look at each of these ways that we share context and how they benefit our development community.

    Shopify Development Handbook

Managing content is always a challenge. In high-tech organizations, where tools and technologies change frequently, products ship quickly, and projects can pivot from day to day, having access to up-to-date and relevant information is critical. In that whirlwind of change and information volatility, we try to keep a central core of context in our Development Handbook.

    Origins

    The origins of the Handbook go back to 2015, when Jean-Michel Lemieux (JML), our current Chief Technology Officer, joined Shopify. In his first role at the company, he needed to learn about the Shopify platform—quickly. The company and the product were both growing rapidly, and the documentation that existed was scattered. Much of the knowledge and expertise was only in the heads of Shopifolk, not written down anywhere.

    So he started a simple Google Doc to try to get a grip on Shopify both technically and architecturally. As the doc got longer, it was clear that the content was valuable not just to him—this was a resource that every developer could contribute to and benefit from.

    Soon, the doc took shape. It contained a history of Shopify development, an overview of the architecture, our development philosophy, and the technology stack we had chosen. It also included information about our production environment and guidance on how we handled incidents.

    Becoming a Website

    The next logical step was to evolve that document into an internal website that could be more easily searched and navigated, better maintained, and more visible across the whole engineering organization. A small team was formed to build the site, and in 2018 the Development Handbook site went live.

    Since then, developers from all disciplines across Shopify have contributed hundreds of topics to the Handbook. Now, it also contains information about our development cultures and practices, using our key technologies, our principles of development, and a wealth of detailed content on how we deploy, manage, and monitor our code.

    The process for adding content is developer-friendly, using the same GitHub processes of pull requests and reviews that developers use while coding. We use Markdown for easy content entry and formatting. The site runs on Middleman, and developers contribute to operations of the site itself, like a recent design refresh (including dark mode), improvements to search, and even adding client-side rendering of KaTeX equations.

Example topic from the internal Shopify Development Handbook

    Handling Oversight and Challenges

    The Handbook is essentially crowd-sourced, but it's also overseen, edited, and curated by the small but mighty Developer Context team. This team defines the site's information architecture, evangelizes the Handbook, and works with developers to assist them as they contribute and update content.

    The Development Handbook allows our team to push knowledge about Sorbet, static-typing in Ruby, and our best practices to the rest of the company. It's a necessary resource for all our newcomers and a good one even for our experts. A must read.

    Alexandre Terrasa, Staff Production Engineer

Despite its success as a central repository for technical content and context in Shopify's Engineering org, the Handbook always faces challenges. Some content is better off residing in other locations, like GitHub, where it's closest to the code. Some content with a limited audience might be better off on a standalone site. There are constant, opposing pressures to either add content to the Handbook or to move content out of it to other places.

    Keeping content fresh and up-to-date is also a never-ending job. To try to ensure content isn't created and then forgotten about, we have an ownership model that asks teams to be responsible for any topics that are closely related to their mandates and projects. However, this isn't sufficient, as engineering teams are prone to being reorganized and refocused on new areas.

    We haven't found the sweet spot yet for addressing these governance challenges. However, sharing is in the DNA of Shopify developers, and we have a great community of contributors who update content proactively, while others update anything they encounter that needs to change. We're exploring a more comprehensive proactive approach to content maintenance, but we're not sure what final form that will take yet.

    In all likelihood, there won't be a one-time fix. Just like the rest of the company, we'll adapt and change as needed to ensure that the Handbook continues to help Shopify developers get up to speed quickly and understand how and why we do the things we do.

    Dev Talks

    Similar to the Development Handbook, the Dev Talks program has a long history at Shopify. It started in 2014 as a weekly open stage to present demos, prototypes, experiments, technical findings, or any other idea that might resonate with fellow developers.

    Although the team managing this program has changed over the years, and its name has gone through several iterations, the primary goals remain the same: it's a place for developers to share their leading-edge work, technology explorations, wins, or occasional failures, with their peers. The side benefits are the opportunity for developers to build their presentation skills and to be recognized and celebrated for their work. The Developer Context team took over responsibility for the program in 2018.

    Before Shopify shifted to a fully remote work model, talks were usually presented in our large cafeteria spaces, which lent an informal and casual atmosphere to the proceedings, and where big screens were used to show off one's work. Most presenters used slides as a way of organizing their thoughts, but many talks were off the cuff, or purely demos.

Developers gather for a Dev Talk in the former Shopify headquarters in Ottawa, Canada

    With the company's shift to Digital by Design in 2020, we had to change the model and decided on an on-demand approach. Now, developers record their presentations so they can be watched at any time by their colleagues. Talks don't have prescribed time limits, but most tend to be about 15 to 20 minutes long.

To ensure we get eyes on the talks, the Developer Context team promotes the presentations via email and Slack. The team set up a process to simplify signing up to do a talk and getting it promoted, and created branding like a Dev Talks logo and a presentation template. Dev Talks is a channel on the internal Shopify TV platform, which helps ensure the talks have a consistent home and are easy to find.

    Dev Talks have given me the opportunity to share my excitement about the projects I've worked on with other developers. They have helped me develop confidence and improve my public speaking skills while also enabling me to build my personal brand at Shopify.

    Adrianna Chang, Developer

    Our on-demand model is still very new, so we'll monitor how it goes and determine if we need to change things up again. What's certain is that the desire for developers to share their technical context is strong, and the appetite to learn from colleagues is equally solid.

    Internal Podcasts

The third way we share context is through podcasts. Podcasts aren't new, but they have surged in popularity in recent years. In fact, the founder and CEO of Shopify, Tobi Lütke, has hosted his own internal podcast, called Context, since 2017. This podcast has a wide thematic scope: most episodes, as we might expect from our CEO, have a technology spin, but they're geared toward a Shopify-wide audience.

    To provide an outlet for technical conversations focused squarely on our developer community, the Technical Leadership Team (TLT)—a group of senior developers who help to ensure that Shopify makes great technical decisions—recently launched their own internal podcast, Shift. The goal of these in-depth conversations is to unpack technical decisions and dig deep into the context around them.

    The Shift podcast is where we talk about ideas that are worth reinforcing in Shopify Engineering. All systems degrade over time, so this forum lets us ensure that the best parts are properly oiled.

    Alex Topalov, Senior Development Manager

    About once a month, several members of the TLT sit down virtually with a senior engineering leadership member to probe for specifics around technologies or technical concepts. And the leaders who are interviewed are in the hot seat for a while—these recorded podcasts can last up to an hour. Recent episodes have focused on conversations about machine learning and artificial intelligence at Shopify, the resiliency of our systems, and how we approach extensibility.

    To ensure everyone can take advantage of the podcasts, they're made available in both audio and video formats, and a full transcript is provided. Because they’re lengthy deep dives, developers can listen to partial segments at the time of their choosing. The on-demand nature of these podcasts is valuable, and the data shows that uptake on them is strong.

    We'll continue to measure the appetite for this type of detailed podcast format to make sure it's resonating with the developer community over the long run.

    Context Matters

    The three approaches covered here are just a sample of how we share context at Shopify Engineering. We use many other techniques including traditional sharing of information through email newsletters, Slack announcement channels, organization town halls with ask me anything (AMA) segments, video messages from executives, and team demos—not to mention good old-fashioned meetings.

We expect that, post-pandemic, we'll reintroduce in-person gatherings where technical teams can come together for brief but intense periods of context sharing, hand in hand with prototyping, team building, and deep development explorations.

    Our programs are always iterating and evolving to target what works best, and new ideas spring up regularly to complement our more formal programs like the Development Handbook, Dev Talks, and the Shift podcast. What matters most is that an environment is in place to promote, recognize, and celebrate the benefits of sharing knowledge and expertise, with solid buy-in from leadership.

    Christopher writes and edits internal content for developers at Shopify and is on the Developer Context team. He’s a certified copy editor with expertise in content development and technical editing. He enjoys playing the piano and has recently been exploring works by Debussy and Rachmaninov.




How I Define My Boundaries to Prevent Burnout

    There’s a way to build a model for your life and the other demands on you. It doesn’t have to be 100 hours a week. It might not even be 40 hours a week. Whatever sustainable model you come up with, prioritize fiercely and communicate expectations accordingly. I’ve worked this way for three decades through different phases of my life. The hours have changed, but the basic model and principles remain the same: Define the time, prioritize the work, and communicate those two effectively.


A Five-Step Guide for Conducting Exploratory Data Analysis

    Have you ever been handed a dataset and then been asked to describe it? When I was first starting out in data science, this question confused me. My first thought was “What do you mean?” followed by “Can you be more specific?” The reality is that exploratory data analysis (EDA) is a critical tool in every data scientist’s kit, and the results are invaluable for answering important business questions.

Simply put, an EDA involves creating visualizations and identifying significant patterns, such as correlated features, missing data, and outliers. EDAs are also essential for forming hypotheses about why these patterns occur. The EDA itself most likely won't appear in your data product, data highlight, or dashboard, but it will help inform all of them.

    Below, I’ll walk you through key tips for performing an effective EDA. For the more seasoned data scientists, the contents of this post may not come as a surprise (rather a good reminder!), but for the new data scientists, I’ll provide a potential framework that can help you get started on your journey. We’ll use a synthetic dataset for illustrative purposes. The data below does not reflect actual data from Shopify merchants. As we go step-by-step, I encourage you to take notes as you progress through each section!

    Before You Start

    Before you start exploring the data, you should try to understand the data at a high level. Speak to leadership and product to try to gain as much context as possible to help inform where to focus your efforts. Are you interested in performing a prediction task? Is the task purely for exploratory purposes? Depending on the intended outcome, you might point out very different things in your EDA.

With that context, it's now time to look at your dataset. It's important to identify how many samples (rows) and how many features (columns) are in your dataset. The size of your data helps you anticipate any computational bottlenecks down the road. For instance, computing a correlation matrix on large datasets can take quite a bit of time. If your dataset is too big to work with in a Jupyter notebook, I suggest subsampling so you have something that represents your data but isn't too big to work with.
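For instance, a quick way to get a representative, reproducible subsample with pandas might look like this (the synthetic frame here just stands in for your real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic stand-in for a large merchant snapshot table (columns are illustrative).
big = pd.DataFrame({
    "shop_id": rng.integers(1, 5_000, size=1_000_000),
    "gmv": rng.gamma(2.0, 500.0, size=1_000_000),
})
print(big.shape)  # (rows, columns): the first thing to note down in your EDA

# Too big to explore comfortably? Work on a random subsample; a fixed
# random_state keeps the EDA reproducible.
sample = big.sample(n=100_000, random_state=42)
print(sample.shape)
print(sample["gmv"].describe())  # quick sanity check that the sample resembles the population
```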

The first 5 rows of our synthetic dataset. The dataset above does not reflect actual data from Shopify merchants.

    Once you have your data in a suitable working environment, it’s usually a good idea to look at the first couple rows. The above image shows an example dataset we can use for our EDA. This dataset is used to analyze merchant behaviour. Here are a few details about the features:

    • Shop Cohort: the month and year a merchant joined Shopify
    • GMV (Gross Merchandise Volume): total value of merchandise sold.
    • AOV (Average Order Value): the average value of customers' orders since their first order.
    • Conversion Rate: the percentage of sessions that resulted in a purchase.
    • Sessions: the total number of sessions on your online store.
    • Fulfilled Orders: the number of orders that have been packaged and shipped.
    • Delivered Orders: the number of orders that have been received by the customer.

    One question to address is “What is the unique identifier of each row in the data?” A unique identifier can be a column or set of columns that is guaranteed to be unique across rows in your dataset. This is key for distinguishing rows and referencing them in our EDA.
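A hedged sketch of verifying a candidate key with pandas, using a tiny synthetic frame with the same column names as the post:

```python
import pandas as pd

# Tiny synthetic example: one row per shop per day (column names follow the post).
df = pd.DataFrame({
    "Snapshot Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Shop ID": [1, 2, 1, 2],
    "GMV": [100.0, 250.0, 90.0, 300.0],
})

key = ["Snapshot Date", "Shop ID"]
duplicates = df.duplicated(subset=key).sum()
assert duplicates == 0, f"{duplicates} rows share the same {key}"

print(f"{key} uniquely identifies all {len(df)} rows")
print(df["Shop ID"].nunique(), "unique Shop IDs;", df["Snapshot Date"].nunique(), "unique Snapshot Dates")
```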

    Now, if you’ve been taking notes, here’s what they may look like so far:

    • The dataset is about merchant behaviour. It consists of historic information about a set of merchants collected on a daily basis
    • The dataset contains 1500 samples and 13 features. This is a reasonable size and will allow me to work in Jupyter notebooks
    • Each row of my data is uniquely identified by the “Snapshot Date” and “Shop ID” columns. In other words, my data contains one row per shop per day
    • There are 100 unique Shop IDs and 15 unique Snapshot Dates
    • Snapshot Dates range from ‘2020-01-01’ to ‘2020-01-15’ for a total of 15 days

    1. Check For Missing Data

Now that we've decided how we're going to work with the data, we can begin to look at the data itself. Checking your data for missing values is usually a good place to start. For this analysis, and those that follow, I suggest examining features one at a time and ranking them with respect to the specific analysis. For example, in the missing values analysis below, we simply count the number of missing values for each feature, then rank the features from the largest amount of missing values to the smallest. This is especially useful when there are a large number of features.
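With pandas, that ranking is a one-liner plus some presentation; here's a small sketch on a synthetic frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Small synthetic frame with some gaps (column names are illustrative).
df = pd.DataFrame({
    "GMV": [62894.85, np.nan, 25890.06, 6446.36, np.nan],
    "Sessions": [11, 119, np.nan, 147, 45],
    "Shop Currency": ["CAD", "USD", "USD", np.nan, "GBP"],
})

# Count and rank missing values per feature, largest first.
missing = (
    df.isna().sum()
      .sort_values(ascending=False)
      .to_frame("missing_count")
      .assign(missing_pct=lambda m: (100 * m["missing_count"] / len(df)).round(1))
)
print(missing)
```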

Feature ranking by missing value counts

Let's look at an example of something that might occur in your data. Suppose a feature has 70 percent of its values missing. With such a large amount of missing data, some may suggest simply removing the feature from the dataset. However, before we do anything, we should try to understand what this feature represents and why we're seeing this behaviour. After further analysis, we may discover that the feature represents a response to a survey question and that, in most cases, it was left blank. A possible hypothesis is that a large proportion of the population didn't feel comfortable providing an answer. If we simply remove this feature, we introduce bias into our data. Therefore, this missing data is a feature in its own right and should be treated as such.

    Now for each feature, I suggest trying to understand why the data is missing and what it can mean. Unfortunately, this isn’t always so simple and an answer might not exist. That’s why an entire area of statistics, Imputation, is devoted to this problem and offers several solutions. What approach you choose depends entirely on the type of data. For time series data without seasonality or trend, you can replace missing values with the mean or median. If the time series does contain a trend but not seasonality, then you can apply a linear interpolation. If it contains both, then you should adjust for the seasonality and then apply a linear interpolation. In the survey example I discussed above, I handle missing values by creating a new category “Not Answered” for the survey question feature. I won’t go into detail about all the various methods here, however, I suggest reading How to Handle Missing Data for more details on Imputation.
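As a small sketch of the simpler options mentioned above, here's how a median fill and a linear interpolation compare on a synthetic daily series:

```python
import numpy as np
import pandas as pd

# Daily series with a couple of gaps (synthetic).
idx = pd.date_range("2020-01-01", periods=10, freq="D")
sessions = pd.Series([20, 22, np.nan, 26, 28, np.nan, 32, 34, 36, 38], index=idx)

# No trend or seasonality? A median (or mean) fill is often enough.
median_filled = sessions.fillna(sessions.median())

# A trend but no seasonality? Linear interpolation preserves the trend.
interpolated = sessions.interpolate(method="time")

print(pd.DataFrame({"raw": sessions, "median_fill": median_filled, "interpolated": interpolated}))
```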

    Great! We’ve now identified the missing values in our data—let’s update our summary notes:

    • ...
    • 10 features contain missing values
    • “Fulfilled Orders” contains the most missing values at 8% and “Shop Currency” contains the least at 6%

    2. Provide Basic Descriptions of Your Sample and Features

    At this point in our EDA, we’ve identified features with missing values, but we still know very little about our data. So let’s try to fill in some of the blanks. Let’s categorize our features as either:

    Continuous: A feature that is continuous can assume an infinite number of values in a given range. An example of a continuous feature is a merchant’s Gross Merchandise Value (GMV).

    Discrete: A feature that is discrete can assume a countable number of values and is always numeric. An example of a discrete feature is a merchant’s Sessions.

Categorical: A feature that is categorical can only assume a finite number of values. An example of a categorical feature is a merchant’s Shopify plan type.

    The goal is to classify all your features into one of these three categories.

    GMV AOV Conversion Rate
    62894.85 628.95 0.17
    NaN 1390.86 0.07
    25890.06  258.90 0.04
    6446.36 64.46 0.20
    47432.44 258.90 0.10

     Example of continuous features

    Sessions Products Sold Fulfilled Orders
    11 52 108
    119 46 96
    182 47 98
    147 44 99
    45 65 125

     Example of discrete features

Plan              Country  Shop Cohort
Plus              UK       NaN
Advanced Shopify  UK       2019-04
Plus              Canada   2019-04
Advanced Shopify  Canada   2019-06
Advanced Shopify  UK       2019-06

     Example of categorical features

You might be asking yourself: how does classifying features help us? This categorization helps us decide which visualizations to choose in our EDA and which statistical methods we can apply to the data. Some visualizations only work for certain feature types, so we have to treat each group of features differently. We'll see how this works in later sections.

Let's focus on continuous features first. Record any characteristics that you think are important, such as the maximum and minimum values for each feature, and do the same for discrete features. For categorical features, I like to check the number of unique values and the number of occurrences of each unique value. Let's add our findings to our summary notes:
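A short sketch of those basic descriptions using pandas, with one synthetic column per feature type:

```python
import pandas as pd

# Synthetic columns standing in for one feature of each type.
df = pd.DataFrame({
    "GMV": [62894.85, 1390.86, 25890.06, 6446.36, 47432.44],                          # continuous
    "Sessions": [11, 119, 182, 147, 45],                                              # discrete
    "Plan": ["Plus", "Advanced Shopify", "Plus", "Advanced Shopify", "Basic Shopify"],  # categorical
})

# Continuous and discrete features: ranges and summary statistics.
print(df[["GMV", "Sessions"]].describe())

# Categorical features: number of unique values and how often each occurs.
print(df["Plan"].nunique(), "unique plans")
print(df["Plan"].value_counts())
```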

    •  ...
    • There are 3 continuous features, 4 discrete features, and 4 categorical features
    • GMV:
      • Continuous Feature
      •  Values are between $12.07 and $814468.03
  • Data is missing for one day… I should check to make sure this isn’t a data collection error
    •  Plan:
      • Categorical Feature
      • Assumes four values: “Basic Shopify”, “Shopify”, “Advanced Shopify”, and “Plus”.
      • The value counts of “Basic Shopify”, “Shopify”, “Advanced Shopify”, and “Plus” are 255, 420, 450, and 375, respectively
      • There seems to be more merchants on the “Advanced Shopify” plan than on the “Basic” plan. Does this make sense?

    3. Identify The Shape of Your Data

    The shape of a feature can tell you so much about it. What do I mean by shape? I’m referring to the distribution of your data, and how it can change over time. Let’s plot a few features from our dataset:

GMV and Sessions behaviour across samples

If the dataset is a time series, then we investigate how each feature changes over time. Perhaps there's seasonality to the feature, or a positive or negative linear trend. These are all important things to consider in your EDA. In the graphs above, we can see that AOV and Sessions have positive linear trends and that Sessions exhibits seasonality (a distinct behaviour that occurs at regular intervals). Recall that Snapshot Date and Shop ID uniquely define our data, so the seasonality we observe can be due to particular shops having more sessions than other shops. In the line graph below, we see that the Sessions seasonality was the result of two specific shops: Shop 1 and Shop 51. Perhaps these shops have a higher GMV or AOV?

Sessions over time by shop, showing that the seasonality comes from Shop 1 and Shop 51

    Next, you’ll calculate the mean and variance of each feature. Does the feature hardly change at all? Is it constantly changing? Try to hypothesize about the behaviour you see. A feature that has a very low or very high variance may require additional investigation.

    Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are your friends. To understand the shape of your features, PMFs are used for discrete features and PDFs for continuous features.

    Example of feature density functions
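
    Continuing the same hypothetical pandas setup from earlier, one way to sketch a PMF for a discrete feature and an approximate PDF for a continuous one (matplotlib for plotting, scipy needed for the KDE):

```python
import matplotlib.pyplot as plt

# PMF of a discrete feature: the normalized frequency of each observed value.
pmf = df["Sessions"].value_counts(normalize=True).sort_index()
plt.bar(pmf.index, pmf.values)
plt.title("PMF of Sessions")
plt.show()

# Approximate PDF of a continuous feature: density histogram plus a KDE curve.
plt.hist(df["GMV"].dropna(), bins=50, density=True, alpha=0.5)
df["GMV"].plot(kind="kde")  # KDE via pandas (uses scipy under the hood)
plt.title("PDF of GMV")
plt.show()
```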

    Here are a few things that PMFs and PDFs can tell you about your data:

    • Skewness
    • Is the feature heterogeneous (multimodal)?
    • If the PDF has a gap in it, the feature may be disconnected.
    • Is it bounded?

    We can see from the example feature density functions above that all three features are skewed. Skewness measures the asymmetry of your data. This might deter us from using the mean as a measure of central tendency. The median is more robust, but it comes with additional computational cost.
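
    A quick way to check this numerically, again on the hypothetical dataframe from earlier, is to compute the skewness of each continuous feature and compare the mean with the median:

```python
# Skewness per continuous feature; values far from 0 indicate asymmetry.
print(df[continuous].skew())

# For a heavily skewed feature, the mean gets pulled toward the tail
# while the median stays closer to the bulk of the data.
print(df["GMV"].mean(), df["GMV"].median())
```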

    Overall, there are a lot of things you can consider when visualizing the distribution of your data. For more great ideas, I recommend reading Exploratory Data Analysis. Don’t forget to update your summary notes:

    • ...
    • AOV:
      • Continuous Feature
      • Values are between $38.69 and $8994.58
      • 8% of values missing
      • Observe a large skewness in the data.
      • Observe a positive linear trend across samples
    • Sessions:
      • Discrete Feature
      • Contains count data (can assume non-negative integer data)
      • Values are between 2 and 2257
      • 7.7% of values missing
      • Observe a large skewness in the data
      • Observe a positive linear trend across samples
      • Shops 1 and 51 have larger daily sessions and show a more rapid growth of sessions compared to other shops in the data
    • Conversion Rate:
      • Continuous Feature
      • Values are between 0.0 and 0.86
      • 7.7% of values missing
      • Observe a large skewness in the data

    4. Identify Significant Correlations

    Correlation measures the relationship between two variable quantities. Suppose we focus on the correlation between two continuous features: Delivered Orders and Fulfilled Orders. The easiest way to visualize correlation is by plotting a scatter plot with Delivered Orders on the y axis and Fulfilled Orders on the x axis. As expected, there’s a positive relationship between these two features.

    Scatter plot showing positive correlation between features “Delivered Orders” and “Fulfilled Orders”

    If you have a high number of features in your dataset, then you can’t create this plot for all of them—it takes too long. So, I recommend computing the Pearson correlation matrix for your dataset. It measures the linear correlation between features in your dataset and assigns a value between -1 and 1 to each feature pair. A positive value indicates a positive relationship and a negative value indicates a negative relationship.

    Correlation matrix for continuous and discrete features
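
    A rough sketch of computing and visualizing such a matrix in pandas with seaborn, reusing the hypothetical feature lists from earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix over the numeric (continuous and discrete) features.
corr = df[continuous + discrete].corr(method="pearson")

# A heatmap is easier to scan than a grid of scatter plots when there are many features.
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()
```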

    It’s important to take note of all significant correlation between features. It’s possible that you might observe many relationships between features in your dataset, but you might also observe very little. Every dataset is different! Try to form hypotheses around why features might be correlated with each other.

    In the correlation matrix above, we see a few interesting things. First of all, we see a large positive correlation between Fulfilled Orders and Delivered Orders, and between GMV and AOV. We also see a slight positive correlation between Sessions, GMV, and AOV. These are significant patterns that you should take note of.

    Since our data is a time series, we also look at the autocorrelation of shops. Computing the autocorrelation reveals relationships between a signal’s current value and its previous values. For instance, there could be a positive correlation between a shop’s GMV today and its GMV from 7 days ago. In other words, customers like to shop more on Saturdays compared to Mondays. I won’t go into any more detail here since Time Series Analysis is a very large area of statistics, but I suggest reading A Very Short Course on Time Series Analysis.
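
    As a minimal illustration, and assuming hypothetical Shop ID and Snapshot Date columns, a lag-7 autocorrelation for a single shop’s GMV might look like this in pandas:

```python
# Daily GMV series for a single shop (one row per shop per day in this sketch).
shop_gmv = (
    df[df["Shop ID"] == 1]
    .sort_values("Snapshot Date")
    .set_index("Snapshot Date")["GMV"]
)

# Correlation between today's GMV and GMV from 7 days ago;
# a value near 1 would suggest weekly seasonality.
print(shop_gmv.autocorr(lag=7))
```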

    Plan \ Shop Cohort   2019-01  2019-02  2019-03  2019-04  2019-05  2019-06
    Adv. Shopify              71      102       27       71       87       73
    Basic Shopify             45       44       42       59        0       57
    Plus                      29       55       69       44       71       86
    Shopify                   53       72      108       72       60       28

    Contingency table for categorical features “Shop Cohort” (columns) and “Plan” (rows)

    The methodology outlined above only applies to continuous and discrete features. There are a few ways to measure association between categorical features; however, I’ll only discuss one, the Pearson chi-square test. This involves taking pairs of categorical features and computing their contingency table, where each cell shows the frequency of observations for that combination of values. In the Pearson chi-square test, the null hypothesis is that the categorical variables in question are independent and, therefore, not related. In the table above, we compute the contingency table for two categorical features from our dataset: Shop Cohort and Plan. We then perform a hypothesis test using the chi-square distribution at a specified alpha level of significance and determine whether the categorical features are independent or dependent.
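
    A small sketch of this test using pandas and SciPy, with an illustrative 0.05 alpha level:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of observation counts for the two categorical features.
table = pd.crosstab(df["Plan"], df["Shop Cohort"])

# Null hypothesis: Plan and Shop Cohort are independent.
chi2, p_value, dof, expected = chi2_contingency(table)
alpha = 0.05  # illustrative significance level
if p_value < alpha:
    print("Reject independence: Plan and Shop Cohort appear related")
else:
    print("Cannot reject independence")
```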

    5. Spot Outliers in the Dataset

    Last, but certainly not least, spotting outliers in your dataset is a crucial step in EDA. Outliers are significantly different from other samples in your dataset and can lead to major problems when performing statistical tasks following your EDA. There are many reasons why an outlier might occur. Perhaps there was a measurement error for that sample and feature, but in many cases outliers occur naturally.

    Continuous feature box plots

    The box plot visualization is extremely useful for identifying outliers. In the above figure, we observe that all features contain quite a few outliers because we see data points that are distant from the majority of the data.

    There are many ways to identify outliers. It doesn’t make sense to plot all of our features one at a time to spot outliers, so how do we systematically accomplish this? One way is to compute the 1st and 99th percentile for each feature, then classify data points that are greater than the 99th percentile or less than the 1st percentile as outliers. For each feature, count the number of outliers, then rank the features from most outliers to least outliers. Focus on the features that have the most outliers and try to understand why that might be. Take note of your findings.
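
    A compact way to express this percentile rule in pandas, reusing the hypothetical feature lists from earlier:

```python
# Flag values outside the [1st, 99th] percentile band for each numeric feature,
# then rank features by how many outliers they contain.
numeric = continuous + discrete
low, high = df[numeric].quantile(0.01), df[numeric].quantile(0.99)
outlier_mask = (df[numeric] < low) | (df[numeric] > high)
print(outlier_mask.sum().sort_values(ascending=False))
```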

    Unfortunately, the aforementioned approach doesn’t work for categorical features since there needs to be an ordering to compute percentiles. An outlier can mean many things here. Suppose our categorical feature can assume one of three values: apple, orange, or pear. If 99 percent of samples are either apple or orange and only 1 percent are pear, we might classify the pear values as outliers for this feature. For more advanced methods on detecting anomalies in categorical data, check out Outlier Analysis.
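
    And a similarly rough sketch of that rare-category check, with the 1 percent threshold chosen purely for illustration:

```python
# Flag categories that cover less than 1% of samples as potential outliers.
for col in categorical:
    freq = df[col].value_counts(normalize=True)
    rare = freq[freq < 0.01]
    if not rare.empty:
        print(col, "rare categories:", rare.to_dict())
```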

    What’s Next?

    We’ve reached the finish line and completed our EDA. Let’s review the main takeaways:

    • Missing values can plague your data. Make sure to understand why they are there and how you plan to deal with them.
    • Provide a basic description of your features and categorize them. This will drastically change the visualizations you use and the statistical methods you apply.
    • Understand your data by visualizing its distribution. You never know what you will find! Get comfortable with how your data changes across samples and over time.
    • Your features have relationships! Make note of them. These relationships can come in handy down the road.
    • Outliers can dampen your fun only if you don’t know about them. Make known the unknowns!

    But what do we do next? Well, that all depends on the problem. It’s useful to present a summary of your findings to leadership and product. By performing an EDA, you might answer some of those crucial business questions we alluded to earlier. Going forward, does your team want to perform a regression or classification on the dataset? Do they want it to power a KPI dashboard? So many wonderful options and opportunities to explore!

    It’s important to note that an EDA is a very large area of focus. The steps I suggest are by no means exhaustive and if I had the time I could write a book on the subject! In this post, I share some of the most common approaches, but there’s so much more that can be added to your own EDA.

    If you’re passionate about data at scale, and you’re eager to learn more, we’re hiring! Reach out to us or apply on our careers page.

    Cody Mazza-Anthony is a Data Scientist on the Insights team. He is currently working on building a Product Analytics framework for Insights. Cody enjoys building intelligent systems to enable merchants to grow their business. If you’d like to connect with Cody, reach out on LinkedIn.

    Continue reading

    Dynamic ProxySQL Query Rules

    Dynamic ProxySQL Query Rules

    ProxySQL comes with a powerful feature called query rules. The main use of these rules at Shopify is to reroute, rewrite, or reject queries matching a specified regex. However, with great power comes great responsibility. These rules are powerful and can have unexpected consequences if used incorrectly. At Shopify’s scale, we’re running thousands of ProxySQL instances, so applying query rules to each one is a painful and time consuming process, especially during an incident. We’ve built a tool to help us address these challenges and make deploying new rules safe and scalable.

    Continue reading

    How Shopify Dynamically Routes Storefront Traffic

    How Shopify Dynamically Routes Storefront Traffic

    In 2019 we set out to rewrite the Shopify storefront implementation. Our goal was to make it faster. We talked about the strategies we used to achieve that goal in a previous post about optimizing server-side rendering and implementing efficient caching. To build on that post, I’ll go into detail on how the Storefront Renderer team tweaked our load balancers to shift traffic between the legacy storefront and the new storefront.

    First, let's take a look at the technologies we used. For our load balancer, we’re running nginx with OpenResty. We previously discussed how scriptable load balancers are our secret weapon for surviving spikes of high traffic. We built our storefront verification system with Lua modules and used that system to ensure our new storefront achieved parity with our legacy storefront. The system to permanently shift traffic to the new storefront, once parity was achieved, was also built with Lua. Our chatbot, spy, is our front end for interacting with the load balancers via our control plane.

    At the beginning of the project, we predicted the need to constantly update which requests were supported by the new storefront as we continued to migrate features. We decided to build a rule system that allows us to add new routing rules easily.

    Starting out, we kept the rules in a Lua file in our nginx repository, and kept the enabled/disabled status in our control plane. This allowed us to quickly disable a rule without having to wait for a deploy if something went wrong. It proved successful, and at this point in the project, enabling and disabling rules was a breeze. However, our workflow for changing the rules was cumbersome, and we wanted this process to be even faster. We decided to store the whole rule as a JSON payload in our control plane. We used spy to create, update, and delete rules in addition to the previous functionality of enabling and disabling the rules. We only needed to deploy nginx to add new functionality.

    The Power of Dynamic Rules

    Fast continuous integration (CI) time and deployments are great ways to increase the velocity of getting changes into production. However, for time-critical use cases like ours, removing the CI time and deployment altogether is even better. Moving the rule system into the control plane and using spy to manipulate the rules removed the entire CI and deployment process. We still require a “code review” on enabled spy commands or before enabling a new command, but that’s a trivial amount of time compared to the full deploy process used prior.

    Before diving into the different options available for configuration, let’s look at what it’s like to create a rule with spy. Below are three images showing creating a rule, inspecting it, and then deleting it. The rule was never enabled, as it was an example, but that process requires getting approval from another member of the team. We are affecting a large share of real traffic on the Shopify platform when running the command spy storefront_renderer enable example-rule, so the rules of good code review still apply.

    Adding a rule with spy
    Displaying a rule with spy
    Removing a rule with spy

    Configuring New Rules

    Now let’s review the different options available when creating new rules.

    Option Name  Description                                                    Default  Example
    rule_name    The identifier for the rule.                                   —        products-json
    filters      A comma-separated list of filters.                             —        is_product_list_json_read
    shop_ids     A comma-separated list of shop ids to which the rule applies.  all      —

    The rule_name is the identifier we use. It can be any string, but it’s usually descriptive of the type of request it matches.

    The shop_ids option lets us choose to have a rule target all shops or target a specific shop for testing. For example, test shops allow us to test changes without affecting real production data. This is useful to test rendering live requests with the new storefront because verification requests happen in the background and don’t directly affect client requests.

    Next, the filters option determines which requests would match that rule. This allows us to partition the traffic into smaller subsets and target individual controller actions from our legacy Ruby on Rails implementation. A change to the filters list does require us to go through the full deployment process. They are kept in a Lua file, and the filters option is a comma-separated list of function names to apply to the request in a functional style. If all filters return true, then the rule will match that request.

    Above is an example of a filter, is_product_list_path, implemented in Lua, that lets us target HTTP GET requests to the storefront products JSON API.

    Option Name          Description                                                                               Default  Example
    render_rate          The rate at which we render allowed requests.                                             0        1
    verify_rate          The rate at which we verify requests.                                                     0        0
    reverse_verify_rate  The rate at which requests are reverse-verified when rendering from the new storefront.   0        0.001

    Both render_rate and verify_rate allow us to target a percentage of traffic that matches a rule. This is useful for doing gradual rollouts of rendering a new endpoint or verifying a small sample of production traffic.

    The reverse_verify_rate is the rate used when a request is already being rendered with the new storefront. It lets us first render the request with the new storefront and then sends the request to the legacy implementation asynchronously for verification. We call this scenario a reverse-verification, as it’s the opposite or reverse of the original flow where the request was rendered by the legacy storefront then sent to the new storefront for verification. We call the opposite flow forward-verification. We use forward-verification to find issues as we implement new endpoints and reverse-verifications to help detect and track down regressions.

    Option Name       Description                                                   Default  Example
    self_verify_rate  The rate at which we verify requests in the nearest region.   0        0.001

    Now is a good time to introduce self-verification and the associated self_verify_rate. One limitation of the legacy storefront implementation was that our Shopify pod architecture allowed only one region to access the MySQL writer at any given time, so all requests had to go to the active region of a pod. With the new storefront, we decoupled the storefront rendering service from the database writer and now serve storefront requests from any region where a MySQL replica is present.

    However, as we started decoupling dependencies on the active region, we found ourselves wanting to verify requests not only against the legacy storefront but also against the active and passive regions with the new storefront. This led us to add the self_verify_rate that allows us to sample requests bound for the active region to be verified against the storefront deployment in the local region.

    We’ve found the routing rules flexible, and they made it easy to add new features or prototypes that are usually quite difficult to roll out. You might be familiar with how we generate load for testing the system’s limits. However, these load tests often fall victim to our load shedder if the system gets overwhelmed: in that case, we drop any request coming from our load generator to avoid negatively affecting a real client experience.

    Before BFCM 2020, we wondered how the application would behave if the connections to our dependencies, primarily Redis, went down. We wanted to be as resilient as possible to those types of failures. This isn’t quite the same as testing with a load generation tool because these tests could affect real traffic. The team decided to stand up a whole new storefront deployment and, instead of routing any traffic to it, used the verifier mechanism to send duplicate requests to it. We then disconnected the new deployment from Redis and turned our load generator up to max. Now we had data about how the application performed under partial outages and were able to dig into and improve the resiliency of the system before BFCM. These are just some of the ways we leveraged our flexible routing system to quickly and transparently change the underlying storefront implementation.

    Implementation

    I’d like to walk you through the main entry point for the storefront Lua module to show more of the technical implementation. First, here is a diagram showing where each nginx directive is executed during the request processing.

    Order in which nginx directives are run in the request lifecycle (source: github.com/openresty/lua-nginx-module)

    During the rewrite phase, before the request is proxied upstream to the rendering service, we check the routing rules to determine which storefront implementation the request should be routed to. Then, during the header filter phase, we check whether the request should be forward-verified (if it went to the legacy storefront) or reverse-verified (if it went to the new storefront). Finally, if we’re verifying the request (forward or reverse), in the log phase we queue a copy of the original request to be made to the opposite upstream after the request cycle has completed.

    In the above code snippet, the renderer module (referenced in the rewrite and header filter phases) and the verifier module (referenced in the header filter and log phases) use the same function, find_matching_rule, from the storefront rules module below to get the matching rule from the control plane. The routing_method parameter is passed in to determine whether we’re looking for a rule to match for rendering or for verifying the current request.

    Lastly, the verifier module uses nginx timers to send the verification request out of band of the original request so we don’t leave the client waiting for both upstreams. The send_verification_request_in_background function shown below is responsible for queuing the verification request to be sent. To duplicate the request and verify it, we need to keep track of the original request arguments and the response state from either the legacy storefront (in the case of a forward verification request) or the new storefront (in the case of a reverse verification request). This will then pass them as arguments to the timer since we won’t have access to this information in the context of the timer.

    The Future of Routing Rules

    At this point, we’re starting to simplify this system because the new storefront implementation is serving almost all storefront traffic. We’ll no longer need to verify or render traffic with the legacy storefront implementation once the migration is complete, so we’ll be undoing the work we’ve done and going back to the hardcoded rules approach of the early days of the project. That doesn’t mean the routing rules are going away completely, though: the flexibility they provided allowed us to build the verification system and stand up a separate deployment for load testing. These features weren’t possible before with the legacy storefront implementation. While we won’t be changing the routing between storefront implementations, the rule system will evolve to support new features.


    We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

    Derek Stride is a Senior Developer on the Storefront Renderer team. He joined Shopify in 2016 as an intern and transitioned to a full-time role in 2018. He has worked on multiple projects and has been involved with the Storefront Renderer rewrite since the project began in 2019.

    Continue reading

    Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms

    Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms

    By Jodi Sloan and Selina Li

    Over 2 million users visit Shopify’s Help Center every month to find help solving a problem. They come to the Help Center with different motives: learn how to set up their store, find tips on how to market, or get advice on troubleshooting a technical issue. Our search product helps users narrow down what they’re looking for by surfacing what’s most relevant for them. Algorithms empower search products to surface the most suitable results, but how do you know if they’re succeeding at this mission?

    Below, we’ll share the three-step framework we built for evaluating new search algorithms. From collecting data using Kafka and annotation, to conducting offline and online A/B tests, we’ll share how we measure the effectiveness of a search algorithm.

    The Challenge

    Search is a difficult product to build. When you input a search query, the search product sifts through its entire collection of documents to find suitable matches. This is no easy feat as a search product’s collection and matching result set might be extensive. For example, within the Shopify Help Center’s search product live thousands of articles, and a search for “shipping” could return hundreds.

    We use an algorithm we call Vanilla Pagerank to power our search. It boosts results by the total number of views an article has across all search queries. The problem is that it also surfaces non-relevant results. For example, if you conduct a search for “delete discounts”, the algorithm may prefer results with the keywords “delete” or “discounts”, but not necessarily results on “deleting discounts”.

    We’re always trying to improve our users’ experience by making our search algorithms smarter. That’s why we built a new algorithm, Query-specific Pagerank, which aims to boost results with high click frequencies (a popularity metric) from historic searches containing the search term. It basically boosts the most frequently clicked-on results from similar searches.

    The challenge is any change to the search algorithms might improve the results for some queries, but worsen others. So to have confidence in our new algorithm, we use data to evaluate its impact and performance against the existing algorithm. We implemented a simple three-step framework for evaluating experimental search algorithms.

    1. Collect Data

    Before we evaluate our search algorithms, we need a “ground truth” dataset telling us which articles are relevant for various intents. For Shopify’s Help Center search, we use two sources: Kafka events and annotated datasets.

    Events-based Data with Kafka

    Users interact with search products by entering queries, clicking on results, or leaving feedback. By using a messaging system like Kafka, we collect all live user interactions in schematized streams and model these events in an ETL pipeline. This culminates in a search fact table that aggregates facts about a search session based on behaviours we’re interested in analyzing.

    A simplified model of a search fact generated by piecing together three raw Kafka events: a search query, a search result click, and a contacted-support event
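
    As a toy illustration only (not our actual ETL pipeline), here’s how a few hypothetical event rows could be rolled up into one search fact per session with pandas; the event names and columns are assumptions:

```python
# `events` is a hypothetical dataframe of raw events with columns
# session_id and event_type; the event names are assumptions.
search_facts = (
    events.groupby("session_id")
    .agg(
        queries=("event_type", lambda e: (e == "search_query").sum()),
        clicked=("event_type", lambda e: (e == "search_result_click").any()),
        contacted_support=("event_type", lambda e: (e == "contacted_support").any()),
    )
    .reset_index()
)
```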

    With data generated from Kafka events, we can continuously evaluate our online metrics and see near real-time change. This helps us to:

    1. Monitor product changes that may be having an adverse impact on the user experience.
    2. Assign users to A/B tests in real time. We can set some users’ experiences to be powered by search algorithm A and others by algorithm B.
    3. Ingest interactions in a streaming pipeline for real-time feedback to the search product.

    Annotation

    Annotation is a powerful tool for generating labeled datasets. To ensure high-quality datasets within the Shopify Help Center, we leverage the expertise of the Shopify Support team to annotate the relevance of our search results to the input query. Manual annotation can provide high-quality and trustworthy labels, but can be an expensive and slow process, and may not be feasible for all search problems. Alternatively, click models can build judgments using data from user interactions. However, for the Shopify Help Center, human judgments are preferred since we value the expert ratings of our experienced Support team.

    The process for annotation is simple:

    1. A query is paired with a document or a result set that might represent a relevant match.
    2. An annotator judges the relevance of the document to the question and assigns the pair a rating.
    3. The labels are combined with the inputs to produce a labelled dataset.
    Annotation Process

    There are numerous ways we annotate search results:

    • Binary ratings: Given a user’s query and a document, an annotator can answer the question Is this document relevant to the query? by providing a binary answer (1 = yes, 0 = no).
    • Scale ratings: Given a user’s query and a document, a rater can answer the question How relevant is this document to the query? on a scale (1 to 5, where 5 is the most relevant). This provides interval data that you can turn into categories, where a rating of 4 or 5 represents a hit and anything lower represents a miss.
    • Ranked lists: Given a user’s query and a set of documents, a rater can answer the question Rank the documents from most relevant to least relevant to the query. This provides ranked data, where each query-document pair has an associated rank.

    The design of your annotation options depends on the evaluation measures you want to employ. We used Scale Ratings with a worded list (bad, ok, good, and great) that provides an explicit description for each label to simplify human judgment. These labels are then quantified and used for calculating performance scores. For example, when conducting a search for “shipping”, our Query-specific Pagerank algorithm may return documents that are more relevant to the query, with the majority of them labelled “great” or “good”.
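
    A tiny sketch of that quantification step, where the dataframe, column names, and numeric scores are assumptions chosen for illustration rather than the exact values we use:

```python
# Hypothetical mapping from worded labels to numeric gains; the exact values
# are an assumption for illustration.
label_to_score = {"bad": 0, "ok": 1, "good": 2, "great": 3}
annotations["relevance"] = annotations["label"].map(label_to_score)
```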

    One thing to keep in mind with annotation is dataset staleness. Our document set in the Shopify Help Center changes rapidly, so datasets can become stale over time. If your search product is similar, we recommend re-running annotation projects on a regular basis or augmenting existing datasets that are used for search algorithm training with unseen data points.

    2. Evaluate Offline Metrics

    After collecting our data, we wanted to know whether our new search algorithm, Query-specific Pagerank, had a positive impact on our ranking effectiveness without impacting our users. We did this by running an offline evaluation that uses relevance ratings from curated datasets to judge the effectiveness of search results before launching potentially-risky changes to production. By using offline metrics, we run thousands of historical search queries against our algorithm variants and use the labels to evaluate statistical measures that tell us how well our algorithm performs. This enables us to iterate quickly since we simulate thousands of searches per minute.

    There are a number of measures we can use, but we’ve found Mean Average Precision and Normalized Discounted Cumulative Gain to be the most useful and approachable for our stakeholders.

    Mean Average Precision

    Mean Average Precision (MAP) measures the relevance of each item in the results list to the user’s query with a specific cutoff N. As queries can return thousands of results, and only a few users will read all of them, only the top N returned results need to be examined. The top N number is usually chosen arbitrarily or based on the number of paginated results. Precision@N is the percentage of relevant items among the first N recommendations. Average Precision (AP) averages the Precision@k values at each rank k (up to N) where a relevant document appears, and MAP is calculated by averaging the AP scores across all queries in our dataset. The result is a measure that penalizes returning irrelevant documents before relevant ones. Here is an example of how MAP is calculated:

    Example of MAP calculation for an algorithm, given two search queries as inputs
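
    Here’s a minimal Python sketch of one common formulation of AP and MAP over binary relevance labels; the cutoff and the toy queries are illustrative:

```python
def average_precision(relevances, n=10):
    """AP@n for one query, given binary relevance labels in ranked order."""
    relevances = relevances[:n]
    hits, precisions = 0, []
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # Precision@k at each relevant position
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(queries, n=10):
    """MAP: the mean of AP@n across all labelled queries."""
    return sum(average_precision(r, n) for r in queries) / len(queries)


# Two toy queries with binary relevance labels for their top 5 results.
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]], n=5))
```

    For the two toy queries above, the AP scores are roughly 0.83 and 0.64, so the MAP works out to about 0.74.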

    Normalized Discounted Cumulative Gain

    To compute MAP, you need to classify whether a document is a good recommendation by determining a cutoff score. For example, document and search query pairs that have ok, good, and great labels (that is, scores greater than 0) will be categorized as relevant. But the differences in relevancy between ok and great pairs will be neglected.

    Discounted Cumulative Gain (DCG) addresses this drawback by keeping the non-binary score and adding a logarithmic reduction factor (the denominator of the DCG function) that penalizes recommended items appearing lower in the list. DCG is a measure of ranking quality that takes into account the position of documents in the results set.

    Example of DCG calculation comparing two search algorithms, with the scale ratings of each query and document pair determined by annotators

    One issue with DCG is that the length of search results differs depending on the query provided. This is problematic because the more results a query set has, the better the DCG score, even if the ranking doesn’t make sense. Normalized Discounted Cumulative Gain (NDCG) solves this problem: NDCG is calculated by dividing DCG by the maximum possible DCG score, that is, the DCG of the results sorted into their ideal order by relevance.
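
    A small sketch of DCG and NDCG using the quantified scale ratings directly as gains (some formulations use 2^relevance - 1 instead; this version is illustrative):

```python
import math


def dcg(gains):
    """DCG: each gain is discounted by log2 of (position + 1), positions starting at 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains):
    """NDCG: DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Toy example using quantified scale ratings (e.g. bad=0 ... great=3) as gains.
print(ndcg([3, 2, 0, 1]))  # roughly 0.99: close to the ideal ordering
```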

    Comparing the numeric values of offline metrics is great when the differences are significant: the higher the value, the more successful the algorithm. However, this only tells us the ending of the story. When the results aren’t significantly different, the insights we get from the metrics comparison are limited. Therefore, understanding how we got to the ending is also important for future model improvements and developments. You gather these insights by breaking down the composition of queries to look at:

    • Frequency: How often do our algorithms return worse results than annotation data?
    • Velocity: How far off are our rankings in different algorithms?
    • Commonalities: Understanding the queries that consistently have positive impacts on the algorithm performance, and finding the commonality among these queries, can help you determine the limitations on an algorithm.

    Offline Metrics in Action

    We conducted a deep dive analysis on evaluating offline metrics using MAP and NDCG to assess the success of our new Query-specific Pagerank algorithm. We found that our new algorithm returned higher scored documents more frequently and had slightly better scores in both offline metrics.

    3. Evaluate Online Metrics

    Next, we wanted to see how our users interact with our algorithms by looking at online metrics. Online metrics use search logs to determine how real search events perform. They’re based on understanding users’ behaviour with the search product and are commonly used to evaluate A/B tests.

    The metrics you choose to evaluate the success of your algorithms depends on the goals of your product. Since the Shopify Help Center aims to provide relevant information to users looking for assistance with our platform, metrics that determine success include:

    • How often users interact with search results
    • How far they had to scroll to find the best result for their needs
    • Whether they had to contact our Support team to solve their problem.

    When running an A/B test, we need to define these measures and determine how we expect the new algorithm to move the needle. Below are some common online metrics, as well as product success metrics, that we used to evaluate how search events performed:

    • Click-through rate (CTR): The portion of users who clicked on a search result when surfaced. Relevant results should be clicked, so we want a high CTR.
    • Average rank: The average rank of clicked results. Since we aim for the most relevant results to be surfaced first, and clicked, we want to have a low average rank.
    • Abandonment: When a user has intent to find help but doesn’t interact with search results, doesn’t contact us, and isn’t seen again. We always expect some level of abandonment due to bots and spam messages (yes, search products get spam too!), so we want this to be moderately low.
    • Deflection: Our search is a success when users don’t have to contact support to find what they’re looking for. In other words, the user was deflected from contacting us. We want this metric to be high, but in reality we want people to contact us when that’s what’s best for their situation, so deflection is a bit of a nuanced metric.

    We use the Kafka data collected in our first step to calculate these metrics and see how successful our search product is over time, across user segments, or between different search topics. For example, we study CTR and deflection rates between users in different languages. We also use A/B tests to assign users to different versions of our product to see if our new version significantly outperforms the old.
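
    As a rough sketch, assuming a hypothetical search-fact dataframe with one row per search event, these metrics reduce to simple aggregations:

```python
# `facts` is a hypothetical search-fact dataframe with one row per search event
# and columns clicked (bool), clicked_rank, abandoned (bool), contacted_support (bool).
ctr = facts["clicked"].mean()
average_rank = facts.loc[facts["clicked"], "clicked_rank"].mean()
abandonment = facts["abandoned"].mean()
deflection = 1 - facts["contacted_support"].mean()

print(f"CTR={ctr:.2%} avg_rank={average_rank:.2f} "
      f"abandonment={abandonment:.2%} deflection={deflection:.2%}")
```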

    Flow of Evaluating Online Metrics

    A/B testing search is similar to other A/B tests you may be familiar with. When a user visits the Help Center, they are assigned to one of our experiment groups, which determines which algorithm their subsequent searches will be powered by. Over many visits, we can evaluate our metrics to determine which algorithm outperforms the other (for example, whether algorithm A has a significantly higher click-through rate with search results than algorithm B).
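
    For example, comparing CTR between the two groups can be framed as a two-proportion z-test; the counts below are made up for illustration:

```python
from statsmodels.stats.proportion import proportions_ztest

# Users who clicked a search result in groups A and B, out of the users assigned
# to each group; all counts here are made up for illustration.
clicks = [5400, 5900]
searchers = [10000, 10000]

stat, p_value = proportions_ztest(clicks, searchers)
print(f"z={stat:.2f}, p={p_value:.4f}")  # a small p-value suggests a real CTR difference
```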

    Online Metrics in Action

    We conducted an online A/B test to see how our new Query-specific Pagerank algorithm measured against our existing algorithm, with half of our search users assigned to group A (powered by Vanilla Pagerank) and half assigned to group B (powered by Query-specific Pagerank). Our results showed that users in group B were:

    • Less likely to click past the first page of results
    • Less likely to conduct a follow-up search
    • More likely to click results
    • More likely to click the first result shown
    • Clicking results that appeared higher in the list (a lower average rank of clicked results)

    Essentially, group B users were able to find helpful articles with less effort when compared to group A users.

    The Results

    After using our evaluation framework to measure our new algorithm against our existing algorithm, it was clear that our new algorithm outperformed the former in a way that was meaningful for our product. Our metrics showed our experiment was a success, and we were able to replace our Vanilla Pagerank algorithm with our new and improved Query-specific Pagerank algorithm.

    If you’re using this framework to evaluate your search algorithms, it’s important to note that even a failed experiment can help you learn and identify new opportunities. Did your offline or online testing show a decrease in performance? Is a certain subset of queries driving the performance down? Are some users better served by the changes than other segments? Wherever your analysis points, don’t be afraid to dive deeper to understand what’s happening. You’ll want to document your findings for future iterations.

    Key Takeaways for Evaluating a Search Algorithm

    Algorithms are the secret sauce of any search product. The efficacy of a search algorithm has a huge impact on a user’s experience, so having the right process to evaluate a search algorithm’s performance ensures you’re making confident decisions with your users in mind.

    To quickly summarize, the most important takeaways are:

    • A high-quality and reliable labelled dataset is key for a successful, unbiased evaluation of a search algorithm.
    • Online metrics provide valuable insights on user behaviours, in addition to algorithm evaluation, even if they’re resource-intensive and risky to implement.
    • Offline metrics are helpful for iterating on new algorithms quickly and mitigating the risks of launching new algorithms into production.

    Jodi Sloan is a Senior Engineer on the Discovery Experiences team. Over the past 2 years, she has used her passion for data science and engineering to craft performant, intelligent search experiences across Shopify. If you’d like to connect with Jodi, reach out on LinkedIn.

    Selina Li is a Data Scientist on the Messaging team. She is currently working to build intelligence in conversation and improve merchant experiences in chat. Previously, she was with the Self Help team, where she contributed to delivering better search experiences for users in the Shopify Help Center and Concierge. If you would like to connect with Selina, reach out on LinkedIn.


    If you’re a data scientist or engineer who’s passionate about building great search products, we’re hiring! Reach out to us or apply on our careers page.

    Continue reading

    How to Build a Web App with and without Rails Libraries

    How to Build a Web App with and without Rails Libraries

    How would we build a web application only using the standard Ruby libraries? In this article, I’ll break down the key foundational concepts of how a web application works while building one from the ground up. If we can build a web application only using Ruby libraries, why would we need web server interfaces like Rack and web applications like Ruby on Rails? By the end of this article, you’ll gain a new appreciation for Rails and its magic.

    Continue reading

    Start your free 14-day trial of Shopify