What Is Nix


Over the past year and a bit, Shopify has been progressively rebuilding parts of our developer tooling with Nix. I initially planned to write about how we're using Nix now, and what we're going to do with it in the future (spoiler: everything?). However, I realized that most readers won't have a clear handle on what Nix is, and I haven't found much introductory material that conveys a clear impression quickly. So this article is a crash course in what Nix is, how to think about it, and why it's such a valuable and paradigm-shifting piece of technology.

There are a few places in this post where I will lie to you in subtle ways to gloss over all of the small nuances and exceptions to rules. I'm not going to call these out. I'm just trying to build a general understanding. At the end of this post, you should have the basic conceptual scaffolding you need in order to think about Nix. Let's dive in!

What is Nix?

The most basic, fundamental idea behind Nix is this:

Everything on your computer implicitly depends on a whole bunch of other things on your computer.

  • All software exists in a graph of dependencies.
  • Most of the time, this graph is implicit.
  • Nix makes this graph explicit.

Four Building Blocks

Let's get this out of the way up front: Nix is a hard thing to explain.

There are a few components that you have to understand in order to really get it, and all of their explanations are somewhat interdependent; and, even after explaining all of these building blocks, it still takes a bit of mulling over the implications of how they compose in order for the magic of Nix to really click. Nevertheless, we'll try, one block at a time.

The major building blocks, at least in my mental model of Nix, are:

  1. The Nix Store
  2. Derivations
  3. Sandboxing
  4. The Nix Language

The Nix Store

The easiest place to start is the Nix Store. Once you've installed Nix, you'll wind up with a directory at /nix/store, containing a whole bunch of entries that look something like this (illustrative):

/nix/store/h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3
/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23
...

This directory, /nix/store, is a kind of Graph Database. Each entry (each file or directory directly under /nix/store) is a Node in that Graph Database, and the relationships between them constitute Edges.

The only thing that's allowed to write directories and files into /nix/store is Nix itself, and after Nix writes a Node into this Graph Database, it's completely immutable forever after: Nix guarantees that the contents of a Node don't change after it's been created. Further, due to magic that we'll discuss later, the contents of a given Node are guaranteed to be functionally identical to a Node with the same name in some other Graph, regardless of where they're built.

What, then, is a "relationship between them?" Put another way, what is an Edge? Well, the first part of a Store path (the 32-character-long alphanumeric blob) is a cryptographic hash (of what, we'll discuss later). If a file in some other Store path includes the literal text "h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3", that constitutes a graph Edge pointing from the Node containing that text to the Node referred to by that path. Nodes in the Nix store are immutable after they're created, and the Edges they originate are scanned and cached elsewhere when they're first created.

To demonstrate this linkage, if you run otool -L (or ldd on Linux) on the nix binary, you'll see a number of libraries referenced, and these look like:


That's extracted by otool or ldd, but ultimately comes from text embedded in the binary, and Nix sees this too when it determines the Edges directed from this Node.

Highly astute readers may be skeptical that scanning for literal path references in a Node after it's created is a reliable way to determine a dependency. For now, just take it as given that this, surprisingly, works almost flawlessly in practice.
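To see the mechanism, here's a rough sketch of that scan in Python. This is not Nix's actual scanner (which looks only for the hash portion); it's a simplified illustration of searching build output for literal store-path references:

```python
import re

# Nix store hashes are 32 characters drawn from a base-32 alphabet that
# omits e, o, u, and t. (Simplified sketch; not Nix's real implementation.)
STORE_REF = re.compile(r"/nix/store/([0-9abcdfghijklmnpqrsvwxyz]{32}-[^\"'\s/]+)")

def scan_for_references(contents: str) -> set:
    """Return the store entries mentioned literally in some build output."""
    return set(STORE_REF.findall(contents))

# A binary's bytes typically embed its dependencies' paths as plain text.
binary_text = (
    "...ELF...\x00"
    "/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23/bin/bash\x00..."
)
assert scan_for_references(binary_text) == {
    "5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23"
}
```

Each match found this way becomes an Edge in the graph, recorded when the Node is first created.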

To put this into practice, we can demonstrate just how much of a Graph Database this actually is using nix-store --query. nix-store is a tool built into Nix that interacts directly with the Nix Store, and the --query mode has a multitude of flags for asking different questions of the Graph Database that is the Store.

Let's find all of the Nodes that <hash>-nix-2.3.3 has Edges pointing to:

$ nix-store --query --references /nix/store/h9bvv0qpiygnqykn4bf7r3xrxmvqpsrd-nix-2.3.3/
...(and 21 more)...

Similarly, we could ask for the Edges pointing to this Node using --referrers, or we could ask for the full transitive closure of Nodes reachable from the starting Node using --requisites.

The transitive closure is an important concept in Nix, but you don't really have to understand the graph theory: An Edge directed from a Node is logically a dependency: if a Node includes a reference to another Node, it depends on that Node. So, the transitive closure (--requisites) also includes those dependencies' dependencies, and so on recursively, to include the total set of things depended on by a given Node.

For example, a Ruby application may depend on the result of bundling together all the rubygems specified in the Gemfile. That bundle may depend on the result of installing the Gem nokogiri, which may depend on libxml2 (which may depend on libc or libSystem). All of these things are present in the transitive closure of the application (--requisites), but only the gem bundle is a direct reference (--references).
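That Ruby example can be sketched as a toy graph, with a simple reachability walk standing in for what nix-store --query computes (the node names here are illustrative):

```python
# Toy dependency graph: edges point from a node to the nodes it references.
graph = {
    "app": {"gem-bundle"},
    "gem-bundle": {"nokogiri"},
    "nokogiri": {"libxml2"},
    "libxml2": {"libc"},
    "libc": set(),
}

def references(node):
    """Direct dependencies (like --references)."""
    return graph[node]

def requisites(node):
    """Transitive closure (like --requisites): everything reachable from node."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph[current])
    return seen

assert references("app") == {"gem-bundle"}
assert requisites("app") == {"app", "gem-bundle", "nokogiri", "libxml2", "libc"}
```

As with the real tool, the closure includes the starting node itself along with everything it transitively depends on.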

Now here's the key thing: This transitive closure of dependencies always exists, even outside of Nix: these things are always dependencies of your application, but normally, your computer is just trusted to have acceptable versions of acceptable libraries in acceptable places. Nix removes these assumptions and makes the whole graph explicit.

To really drive home the "graphiness" of software dependencies, we can install Ruby via nix (nix-env -iA nixpkgs.ruby) and then build a graph of all of its dependencies:

nix-store --query --graph $(which ruby) \
| nix run nixpkgs.graphviz -c dot > ruby.svg

Graphiness of Software Dependencies



Derivations

The second building block is the Derivation. Above, I offhandedly mentioned that only Nix can write things into the Nix Store, but how does it know what to write? Derivations are the key.

A Derivation is a special Node in the Nix store, which tells Nix how to build one or more other Nodes.

If you list your /nix/store, you'll most likely see a whole lot of items, but some of them will end in .drv:

/nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv

This is a Derivation. It's a special format written and read by Nix, which gives build instructions for anything in the Nix store. Just about everything (except Derivations) in the Nix store is put there by building a Derivation.

So what does a Derivation look like?

$ cat /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
Derive([("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo","","")],[],[],"x86_64-darwin","/bin/sh",["-c","echo 'hello world' > $out"],[("builder","/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23/bin/bash"),("name","demo"),("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo"),("system","x86_64-darwin")])

That's not especially readable, but there are a couple of important concepts to communicate here:

  • Everything required to build this Derivation is explicitly listed in the file by path (you can see "bash" here, for example).
  • The hash component of the Derivation's path in the Nix Store is essentially a hash of the contents of the file.

Since every direct dependency is mentioned in the contents, and the path is a hash of the contents, it follows that if the dependencies (and whatever other information the Derivation contains) don't change, the hash won't change; but if a different version of a dependency is used, the hash changes.

There are a few different ways to build Derivations. Let's use nix-build:

$ nix-build /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo

This ran whatever the build instructions were and generated a new path in the Nix Store (a new Node in the Graph Database).

Take a close look at the hash in the newly-created path. You'll see the same hash in the Derivation contents above. That output path was pre-defined, but not pre-generated. The output path is also a stable hash. You can essentially think of it as being a hash of the derivation and also the name of the output (in this case: "out"; the default output).

So, if a dependency of the Derivation changes, that changes the hash of the Derivation. It also changes the hashes of all of that Derivation's outputs. This means that changing a dependency of a dependency of a dependency ripples all the way through the graph, changing the hashes of every Derivation, and of all those Derivations' outputs, that depend on the changed thing, directly or indirectly.
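As a toy model of how a change to a deep dependency re-hashes everything that depends on it (using a truncated SHA-256 in place of Nix's real store hash, and made-up node names):

```python
import hashlib

def store_hash(name: str, contents: str) -> str:
    """Toy stand-in for Nix's store hash: a digest of the node's name and contents."""
    return hashlib.sha256(f"{name}:{contents}".encode()).hexdigest()[:32]

def store_path(name: str, contents: str) -> str:
    return f"/nix/store/{store_hash(name, contents)}-{name}"

# A chain of nodes, each embedding the literal path of the node it depends on,
# just as a real build output embeds its dependencies' store paths.
def build_chain(libc_contents: str) -> dict:
    libc = store_path("libc", libc_contents)
    libxml2 = store_path("libxml2", f"links against {libc}")
    nokogiri = store_path("nokogiri", f"links against {libxml2}")
    return {"libc": libc, "libxml2": libxml2, "nokogiri": nokogiri}

before = build_chain("libc, version 1")
after = build_chain("libc, version 2")  # change only the deepest dependency

# Every node downstream of the change gets a new path; nothing mutates in place.
assert before["libc"] != after["libc"]
assert before["libxml2"] != after["libxml2"]
assert before["nokogiri"] != after["nokogiri"]
```

Because each path embeds its dependencies' paths, the hashing is effectively recursive, and the same inputs always produce the same path.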

Let's break apart that unreadable blob of Derivation content from above a little bit.

  • outputs: What nodes can this build?
  • inputDrvs: Other Derivations that must be built before this one
  • inputSrcs: Things already in the store on which this build depends
  • platform: Is this for macOS? Linux?
  • builder: What program gets run to do the build?
  • args: Arguments to pass to that program
  • env: Environment variables to set for that program

Or, to dissect that Derivation:



outputs

[("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo","","")]

This Derivation has one output, named "out" (the default name), with some path that would be generated if we built it.


inputDrvs

[ ]

This is a simple toy derivation, with no inputDrvs. What this really means is that there are no dependencies, other than the builder. Normally, you would see something more like:


[("/nix/store/<hash>-openssh.drv",["out"])]

This indicates a dependency on the OpenSSH Derivation's default output (the entry shown here is illustrative).


inputSrcs

[ ]

Again, we have a very simple toy Derivation! Commonly, you will see:


["/nix/store/<hash>-builder.sh"]

It's not really critical to the mental model, but Nix can also copy static files into the Nix Store in some limited ways, and these aren't really constructed by Derivations. This field just lists any static files in the Nix store on which this Derivation depends (the builder script shown above is illustrative).



platform

"x86_64-darwin"

Nix runs on multiple platforms and CPU architectures, and often the output of compilers will only work on one of these, so the Derivation needs to indicate which architecture it's intended for.

There's actually an important point here: Nix Store entries can be copied around between machines without concern, because all of their dependencies are explicit. The CPU details are a dependency in many cases.



builder

"/bin/sh"

This program is executed with args and env, and is expected to generate the output(s).


["-c","echo 'hello world' > $out"]

You can see that the output name ("out") is being used as a variable here. We're running, basically, bash -c "echo 'hello world' > $out". This should just be writing the text "hello world" into the Derivation output.



env

[("builder","/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23/bin/bash"),("name","demo"),("out","/nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo"),("system","x86_64-darwin")]

Each of these is set as an Environment Variable before calling the builder, so you can see how we got that $out variable above, and note that it's the same as the path given in outputs above.

Derivation in Summary

So let's build that Derivation and see what the output is:

$ nix-build /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv
$ cat /nix/store/76gxh82dqh6gcppm58ppbsi0h5hahj07-demo
hello world

As we expected, it's "hello world".

A Derivation is a recipe to build some other path in the Nix Store.


Sandboxing

After walking through that Derivation in the last section, you may be starting to develop a feel for how explicitly-declared dependencies make it into the build, and how that Graph structure comes together—but what prevents builds from referring to things at undeclared paths, or things that aren't in the Nix store at all?

Nix does a lot of work to make sure that builds can only see the Nodes in the Graph which their Derivation has declared, and also, that they don't access things outside of the store.

A Derivation build simply cannot access anything not declared by the Derivation. This is enforced in a few ways:

  • For the most part, Nix uses patched versions of compilers and linkers that don't try to look in the default locations (/usr/lib, and so on).
  • Nix typically builds Derivations in an actual sandbox that denies access to everything that the build isn't supposed to access.

A Sandbox is created for a Derivation build that gives filesystem read access to—and only to—the paths explicitly mentioned in the Derivation.

What this amounts to is that artifacts in the Nix Store essentially can't depend on anything outside of the Nix Store.
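The effect of that rule can be sketched as a path check. This is purely illustrative: the real sandbox is enforced by the operating system (namespaces, seccomp, and the like), not by a function like this:

```python
from pathlib import PurePosixPath

def may_read(path: str, declared_inputs: set) -> bool:
    """Allow reads only at or under store paths the Derivation declared."""
    p = PurePosixPath(path)
    return any(
        p == PurePosixPath(d) or PurePosixPath(d) in p.parents
        for d in declared_inputs
    )

# The build declared exactly one input: the bash store path.
declared = {"/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23"}

assert may_read("/nix/store/5arhyyfgnfs01n1cgaf7s82ckzys3vbg-bash-4.4-p23/bin/bash", declared)
assert not may_read("/usr/lib/libSystem.dylib", declared)  # outside the store
```

Anything not reachable through a declared input simply doesn't exist as far as the build is concerned.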

The Nix Language

And finally, the block that brings it all together: the Nix Language.

Nix has a custom language used to construct derivations. There's a lot we could talk about here, but there are two major aspects of the language's design to draw attention to. The Nix Language is:

  1. lazy-evaluated
  2. (almost) free of side-effects.

I'll try to explain these by example.

Lazy Evaluation

Take a look at this code:

data = {
  a = 1;
  b = functionThatTakesMinutesToRun 1;
};

This is Nix code. You can probably figure out what's going on here: we're creating something like a hash table containing keys "a" and "b", and "b" is the result of calling an expensive function.

In Nix, this code takes approximately no time to run, because the value of "b" isn't actually evaluated until it's needed. We could even:

let
  data = {
    a = 1;
    b = functionThatTakesMinutesToRun 1;
  };
in data.a

Here, we're creating the table (technically called an Attribute Set in Nix), and extracting "a" from it.

This evaluates to "1" almost instantly, without ever running the code that generates "b".
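To make those semantics concrete, here's a rough model of lazy evaluation in Python (Nix does this natively; the Lazy class and the names here are just an illustration):

```python
class Lazy:
    """A thunk: wraps a computation, runs it at most once, on first access."""
    def __init__(self, fn):
        self.fn = fn
        self.evaluated = False
        self.value = None

    def force(self):
        if not self.evaluated:
            self.value = self.fn()
            self.evaluated = True
        return self.value

calls = []
def function_that_takes_minutes_to_run(x):
    calls.append(x)  # record that the expensive work actually ran
    return x * 100

data = {
    "a": Lazy(lambda: 1),
    "b": Lazy(lambda: function_that_takes_minutes_to_run(1)),
}

# Extracting "a" never runs the expensive code behind "b".
assert data["a"].force() == 1
assert calls == []
```

Only when something forces "b" does its computation run, and the result is then memoized, just as a Nix attribute is evaluated at most once.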

Conspicuously absent in the code samples above is any sort of actual work getting done, other than just pushing data around within the Nix language. The reason for this is that the Nix language can’t actually do very much.

Free of Side Effects (almost)

The Nix language lacks a lot of features you would expect in normal programming languages. It has:

  • No networking
  • No user input
  • No file writing
  • No output (except limited debug/tracing support).

It doesn't actually do anything at all in terms of interacting with the world…well, except for when you call the derivation function.

One Side Effect

The Nix Language has precisely one function with a side effect. When you call derivation with the right set of arguments, Nix writes out a new <hash>-<name>.drv file into the Nix Store as a side effect of calling that function.

For example:

derivation {
  name = "demo";
  builder = "${bash}/bin/bash";
  args = [ "-c" "echo 'hello world' > $out" ];
  system = "x86_64-darwin";
}

If you evaluate this in nix repl, it will print something like:

«derivation /nix/store/ynzfmamryf6lrybjy1zqp1x190l5yiy5-demo.drv»

That returned object is just the object you passed in (with name, builder, args, and system keys), plus a few extra fields (including drvPath, which is what got printed after the call to derivation). Importantly, though, that path in the Nix Store was actually created.

It's worth emphasizing again: This is basically the only thing that the Nix Language can actually do. There's a whole lot of pushing data and functions around in Nix code, but it all boils down to calls to derivation.

Note that we referred to ${bash} in that Derivation. This is actually the Derivation from earlier in this article, and that variable substitution is actually how Derivations depend on each other. The variable bash refers to another call to derivation, which generates instructions to build bash when it's evaluated.

The Nix Language doesn't ever actually build anything. It creates Derivations, and later, other Nix tools read those Derivations and build the outputs. The Nix Language is just a Domain Specific Language for creating Derivations.

Nixpkgs: Derivation and Lazy Evaluation

Nixpkgs is the global default package repository for Nix, but it's very unlike what you probably think of when you hear "package repository."

Nixpkgs is a single Nix program. It makes use of the fact that the Nix Language is Lazy Evaluated, and includes many, many calls to derivation. The (simplified but) basic structure of Nixpkgs is something like:

{
  ruby = derivation { ... };
  python = derivation { ... };
  nodejs = derivation { ... };
  # ...and many, many more...
}

In order to build "ruby", various tools just force Nix to evaluate the "ruby" attribute of that Attribute Set, which calls derivation, writing the Derivation for Ruby into the Nix Store and returning its path. Then, the tool runs something like nix-build on that path to generate the output.

ShipIt! Presents: How Shopify Uses Nix

Well, it takes a lot more words than I can write here—and probably some amount of hands-on experimentation—to let you really, viscerally, feel the paradigm shift that Nix enables, but hopefully I’ve given you a taste.

If you’re looking for more Nix content, I’m currently re-releasing a series of screencasts I recorded for developers at Shopify to the public. Check out Nixology on YouTube.

You can also join me for a discussion about how Shopify is using Nix to rebuild our developer tooling. I’ll cover some of this content again, and show off some of the tooling we actually use on a day-to-day basis.

What: ShipIt! Presents: How Shopify Uses Nix

Date: May 25, 2020 at 1:00 pm EST

Please view the recording at


If you want to work on Nix, come join my team! We're always hiring, so visit our Engineering career page to find out about our open positions. 


A Brief History of TLS Certificates at Shopify


Transport Layer Security (TLS) encryption may be commonplace in 2020, but this wasn’t always the case. Back in 2014, our business owner storefront traffic wasn’t encrypted. We manually provisioned the few TLS certificates that were in production. In this post, we’ll cover Shopify’s journey from manually provisioning TLS certificates to the fully automated system that supports over 1M business owners today.

In the Beginning

Up to 2014, only business owner shop administration and checkout traffic were encrypted. All checkouts were on the checkout.shopify.com domain. Secured shop administration functions used the *.myshopify.com certificate and a single-domain certificate for checkout.shopify.com. Our Operations team renewed the certificates manually as needed. During this time, teams began research on what it would take for us to offer TLS encryption for all business owners in an automated fashion.

Shopify Plus

We launched Shopify Plus in early 2014. One of Plus’s earliest features was TLS encrypted storefronts. We manually provisioned certificates, adding new domains to the Subject Alternative Name (SAN) list as required. As our certificate authority placed a limit on the number of domains per certificate, certificates were added to support the new domains being onboarded. At the time, Internet Explorer on Windows XP was still used by a significant number of users, which prevented our use of the Server Name Indication (SNI) extension.

While this addressed our immediate needs, there were several drawbacks:

  • Manual certificate updates and provisioning were labor-intensive and needed to be handled with care.
  • Additional IP addresses were needed to support new certificates.
  • Having domains for non-related shops in a single certificate wasn’t ideal.

The pace of onboarding was manageable at first. As we onboarded more merchants, it was apparent that this process wasn’t sustainable. At this point, there were dozens of certificates that all had to be manually provisioned and renewed. For each Plus account onboarded, the new domains had to be manually added. This was labor-intensive and error-prone. We worked on a fully automated system during Shopify’s Hack Days, and it became a fully staffed project in May 2015.

Shopify’s Notary System

Automating TLS certificates had to address multiple facets of the process, including:

  • How are the certificates provisioned from the certificate authority?
  • How do we serve the certificates at scale?
  • What other considerations are there for offering encrypted storefronts?

Shopify's Notary System

Provisioning Certificates

Our Notary system provisions certificates. When a business owner adds a domain to their shop, the system receives a request for a certificate to be provisioned. The certificate provisioning is fully automated via Application Programming Interface (API) calls to the certificate authority. This includes the order request, domain ownership verification, and certificate/private key pair delivery. Certificate renewals are performed automatically in the same fashion.

While it might make sense to group all of a shop's domains into one certificate, the system handles each domain separately for simplicity. Each certificate covers a single domain and has its own unique private key. The certificate and private key are stored in a relational database, which is accessible by the load balancers for terminating TLS connections.

Scaling Up Certificate Provisioning

At the time, we hosted our nginx load balancers at our datacenters. Storing the TLS certificates on disk and reloading nginx when certificates changed wasn’t feasible. In a past article, we talked about our use of nginx and OpenResty Lua modules. Using OpenResty allowed us to script nginx to serve dynamic content outside of the nginx configuration. In addition, browser support for the TLS SNI extension was almost universal. By leveraging the TLS SNI extension, we dynamically load TLS certificates from our database in a Lua middleware via the ssl_certificate_by_lua module. Certificates and private keys are directly accessible from the relational database via a single SQL query. An in-memory Least Recently Used (LRU) cache reduced the latency of TLS handshakes for frequently accessed domains.
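The caching layer amounts to something like the following sketch (the names, capacity, and Python are illustrative; the real middleware is Lua running inside OpenResty, backed by a SQL query):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache in front of a slower backing store."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

db_lookups = []
def fetch_cert_from_db(domain):
    db_lookups.append(domain)  # stands in for the SQL query
    return f"cert-for-{domain}"

cache = LRUCache(capacity=2)
def certificate_for(domain):
    """Serve from cache when possible; fall back to the database."""
    cert = cache.get(domain)
    if cert is None:
        cert = fetch_cert_from_db(domain)
        cache.put(domain, cert)
    return cert

certificate_for("shop-a.example")
certificate_for("shop-a.example")  # served from cache, no second DB hit
assert db_lookups == ["shop-a.example"]
```

Frequently accessed domains stay in memory, so repeat TLS handshakes skip the database round trip entirely.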

Solving Mixed Content Warnings

With TLS certificates in place for business owner shop domains, we could offer encrypted storefronts for all shops. However, there was still a significant hurdle to overcome. Each shop’s theme could have images or assets referencing non-encrypted Uniform Resource Locators (URLs). Mixing of encrypted and unencrypted content would cause the browser to display a Mixed Content warning, denoting that some resources on the page are not encrypted. To resolve this problem, we had to process all the shop themes to replace references to HTTP with HTTPS.
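In spirit, the rewrite is a scheme substitution over theme content. This is a minimal sketch under that assumption (the real migration processed themes at scale and handled many more cases):

```python
import re

# Match an http:// scheme that is followed by the start of a host name.
HTTP_URL = re.compile(r"http://(?=[\w.-]+)")

def upgrade_to_https(theme_html: str) -> str:
    """Rewrite insecure URL references so pages avoid Mixed Content warnings."""
    return HTTP_URL.sub("https://", theme_html)

theme = '<img src="http://cdn.example.com/logo.png">'
print(upgrade_to_https(theme))  # <img src="https://cdn.example.com/logo.png">
```

References that are already https:// are left untouched, since "http://" never appears inside "https://".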

With all the infrastructure in place, we realized the goal of supporting encrypted storefronts for all merchants in February 2016. The same system is still in place and has scaled to provide TLS certificates for all of our 1M+ merchants.

Let’s Encrypt!

Let’s Encrypt is a non-profit certificate authority that provides TLS certificates at no charge. Shopify has been and is currently a sponsor. The service launched in April 2016, shortly after our Notary went into production. With the exception of Extended Verification (EV) certificates and other special cases, we’ve migrated away from our paid certificate authority in favor of Let’s Encrypt.

Move to the Cloud

In June 2019, our network edge moved from our datacenter to a cloud provider. Our requirement to support a huge number of TLS certificates drastically reduced the list of viable vendors. Once the cloud provider was selected, our TLS provisioning system had to be adapted to work with their system. There were two paths forward: using the cloud provider's managed certificates, or continuing to provision Let's Encrypt certificates and uploading them. The initial migration leveraged the provider's certificate provisioning.

Using managed certificates from the cloud provider has the advantage of being maintenance-free after they’ve been provisioned. There are no storage concerns for certificates and private keys. In addition, certificates are automatically renewed by the vendor. Administrative work was required during the migration to guide merchants to modify their domain’s Certification Authority Authorization (CAA) Domain Name System (DNS) records as needed. Backfilling the certificates for our 1M+ merchants took several weeks to complete.

After the initial successful migration to our cloud provider, we revisited the certificate provisioning strategy. As we maintain an alternate edge network for contingency, the Notary infrastructure is still in place to provide certificates for that infrastructure. The intent had been for provider-managed certificates to serve as a stepping stone toward deprecating Notary in the future. While the cloud provider-provisioned certificates worked well for us, there were now two sets of certificates to keep synchronized. To simplify certificate state and operational load, we now use the Notary-provisioned certificates for both edge networks: instead of provisioning certificates on our cloud provider, certificates from Notary are uploaded as new ones are required.

Outside of our business owner shop storefronts, we rely on nginx for other services that are part of our cloud infrastructure. Some of our Lua middleware, including the dynamic TLS certificate loading code, was contributed to the ingress-nginx Kubernetes project.

Our TLS certificate journey took us from a handful of manually provisioned certificates to a fully automated system that can scale up to support over 1M merchants. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions, and learn about the actions we’re taking as we continue to hire during COVID‑19.


Dev Degree: Behind the Scenes


On April 24th, we proudly celebrated the graduation of our first Dev Degree program cohort. This milestone holds a special place in Shopify history because it’s a day born out of trial and error, experimentation, iteration, and hustle. The 2016 cohort had the honor and challenge of being our first class, lived through the churn and pivots of a newly designed program, and completed their education during a worldwide pandemic, thrust into remote learning and work. The students’ success is a testament to their dedication, adaptability, and grit. It’s also the product of a thoughtfully-designed program and a high-functioning Dev Degree team.

What does it take to create an environment where students can thrive and develop into work-ready employees in four years?

We’ve achieved this with the Dev Degree program. The key to our success is our learning structure and multidisciplinary team. With our model, students master development skills faster than they would through traditional methods.

The Dev Degree Program Structure

When we set out to shake up software education in 2016, we had no prescriptive blueprint to guide us and no tried-and-true best practices. Still, we embraced the opportunity to forge a new path in partnership with trusted university advisors and experienced internal educators at Shopify.

Our vision was to create an alternative to the traditional co-op model that alternates between university studies and work placements. In Dev Degree, students receive four years of continual hands-on developer experience at Shopify, through skills training and team placements, in parallel to attending university classes. This model accelerates understanding and allows students to apply classroom theory to real-life product development across a breadth of technology.

Dev Degree Timeline. Year 1: developer skills training at Shopify. Years 2-4: a new development team and mentor every 8 or 12 months.

While computer science and technology are at the core of our learning model, what elevates the program is the focus on personal growth, actionable feedback loops, and the opportunity to make an impact on the company, coworkers, and merchants.

University Course Curriculum

The Dev Degree program leads to an accredited Computer Science degree, which is a deciding factor for many students and their families exploring post-secondary education opportunities. All required core theoretical concepts, computer science courses, math courses, and electives are defined by and taught at the universities. Students take three university courses per semester while working 25 hours per week at Shopify throughout the four-year program. All formal assessments, grading, and final exams for university courses are carried out by the universities.

Dev Degree Structure: 20 hrs/week at Carleton or York and 25 hrs/week at Shopify.

While the universities set the requirements for the courses, we work collaboratively to define the course sequencing to ensure the students are exposed to computer science content as early as possible in their program before they start work on team placements at Shopify.

In addition to the core university courses, there are internship courses that teach software development concepts applicable to the technology industry. The universities assess the learning outcomes of the internship courses through practicum reports and meetings with university supervisors or advisors.

The courses and concepts taught at Shopify build on the university courses and teach students hands-on development skills, communication skills, developer tools training, and how to contribute to a real-world product development team effectively.

Developer Skills Training: Building a Strong Foundation

One of the lessons we learned early in the program was that students need a solid foundation of developer skills before being placed on development teams to feel confident and ready to contribute. The first year at Shopify sets the Dev Degree program apart from other work-integrated learning programs because we immerse the students in our Shopify-led developer skills training.

In the first year, we introduce the students to skills and tools that form the foundation for how they work at Shopify and other companies. There are skills they need to develop before moving into teams, such as working with code repositories and committing code, using a command line, front-end development, working with data, and more.

The breadth of technical skills that students learn in their first year in Dev Degree goes beyond the traditional university curriculum. This foundation allows students to confidently join their first placement team and have an immediate impact.

We teach this way on purpose. Universities often choose a bottom-up learning model, front-loading theory and concepts. We designed our program to immerse students somewhere in the middle of top-down and bottom-up, allowing them to discover the fundamentals gradually after they develop base skills and code a bit every day.

Due to the ever-evolving nature of software development, we update the developer skills training path often. Our current courses include the following technologies:

  • Command Line Interface (CLI)
  • Vim
  • Git & GitHub
  • Ruby
  • HTML, CSS, JavaScript
  • Databases
  • Ruby on Rails
  • React
  • TypeScript
  • GraphQL

Team Placements: Working on Merchant-Facing Projects

After they’ve completed their developer skills training courses, students spend the next three years on team placements. This is a big deal. On team placements, students get to apply what they learn in their developer skills training at Shopify and from their university courses to meaningful, real-world software development work. Our placements are purposefully-designed to expose students to a wide range of disciplines and teams to make them well-rounded developers, give them new perspectives, and introduce them to new people.

Working with their placement specialist, students interview with teams from back-end, front-end, data, security, and production engineering disciplines.

Over the course of the Dev Degree program, each student receives:

  • One 12-month team placement
  • Three 8-month team placements
  • Four different technical mentors
  • A dedicated Life@Shopify mentor
  • Twenty hours per week working on a Shopify team
  • Actionable feedback on a regular cadence
  • Evaluations from mentors and leads
  • High-touch support from the Dev Degree team
  • Access to new people and workplace culture

By the time students complete the program, they’ve been on four back-to-back team placements in their final three years at Shopify. This experience makes them a valuable asset to their future company. It also allows students to launch their careers with the confidence that they are well-prepared to make a positive contribution.

It Takes a Village: Building an Impactful Program Team

Creating a successful work-integrated learning program requires a significant commitment of time and resources from a team that spans multiple disciplines and functions. While the Dev Degree program team is responsible for the bulk of the heavy-lifting, including logistics, mentorship, and support, the program doesn’t happen without expertise and time from other Shopify subject matter experts and university stakeholders.

Dev Degree Program Team

The Dev Degree team is the most actively involved in all aspects of the program and with the students from onboarding to graduation. They are responsible for ensuring that the program meets the needs of the students, the university, and Shopify.

Program Leads

The Dev Degree program leads are the liaison between Shopify and our university partners. We have a program lead to represent each partnership, and they keep this ambitious program on the rails, including working with educators to define the curriculum and developer skills training courses. They are also responsible for hiring and evaluating student performance.

Student Success Specialists

Many of the Dev Degree students come to Shopify straight from high school, which can be daunting. In traditional co-op programs, students have a couple of years of university experience before starting their internships and being dropped into a professional workplace setting. To ease the transition to Shopify, our student success specialists are responsible for supporting students’ well-being, connecting them with other mentors, helping them learn how to become effective communicators, and being the voice of the students at Shopify. This nurturing environment helps protect first-year students from being overwhelmed and underprepared for team placements.

Placement Specialists

Team placements are an integral part of the applied learning in the Dev Degree program. Placement specialists are responsible for coordinating and overseeing all four 8-month placements for each cohort. This high-touch role requires extensive relationship-building with development teams and a deep understanding of the goals and interests of the students to ensure appropriate compatibility. To ensure that development teams get the return on investment (ROI) from investing in mentorship, placement specialists place students on teams where they can be impactful. They also support the leads and mentors on the development teams and play an active role in advocating for an influential culture of mentorship within Shopify.


Educators

The courses we teach at Shopify are foundational for students to prepare them for their team placements. Dev Degree educators have an extensive background in education and computer science and are responsible for building out the curriculum for all the developer skills training courses taught at Shopify. They design and deliver the course material and evaluate the students on technical expertise and subject knowledge. The instructors create courses for a wide range of development skills relevant to development at Shopify and other companies.

Recruitment Team

As with all recruitment at Shopify, we aim to recruit a diverse mix of students to the Dev Degree program. Our talent team is actively involved in helping us create a recruitment strategy to engage and attract top talent from a variety of schools and programs, including university meetups and mentorship in youth programs like Technovation.


Development Teams

After four years of having Dev Degree students on teams across twelve disciplines, the program is woven into the Shopify culture, and mentorship plays a big role.

Development Team Mentors

Development team mentors are critical to helping students build confidence and technical skills, and gain the experience needed to become an asset to the team. Mentors are responsible for guiding, evaluating, and providing actionable feedback to students throughout their 8-month placements. Mentorship requires a strong commitment and takes up about 10% of a developer’s time, but we feel it’s worth the investment to develop mentors and build a culture of mentorship. It’s a challenging but rewarding role, and especially helpful to developers looking to grow leadership skills and level up in their roles.

Life@Shopify Mentors

In addition to placement mentors, we also have experienced Shopify employees who volunteer to mentor students as they navigate through the program, their placements, and the company on the whole. These Life@Shopify mentors act as a trusted guide and help round out the mentorship experience at Shopify.

University Stakeholders

Close relationships between the universities and Shopify help integrate the theory and development practices and deepen both the understanding of concepts and work experience. We’re fortunate to have both Carleton and York University as part of the Dev Degree program and fully engaged in the model that we’ve built. The faculty advisors play an active role in working with the students to guide them on their course selections, navigate the program, and evaluate their internship courses and practicums. Without university buy-in and support, a program like Dev Degree doesn’t happen.

Dev Degree is Worth the Investment

Building a new work-integrated learning program requires a big commitment of company time, resources, and cost, but we are reaping the benefits of our gamble.

  • Graduates are well-rounded developers with a rich development experience across a range of teams and disciplines.
  • 88% of the 2016 cohort have been offered, and have accepted, full-time positions here at Shopify.
  • Students who accept positions at Shopify have already built four years of relationships and have acquired vast knowledge and skills that will help them make an immediate impact on their teams.
  • We are building future leaders through mentorship. 

While we are excited about how far we’ve come, we still have room to grow. We are looking at metrics and data to help us quantify the success of the program and to drive program improvements to take computer science education to a new level. When we started this ambitious endeavor, we wanted to mature it to a point where we could create a blueprint of the Dev Degree program for other companies and universities to adopt it and evolve it. There’s interest in what we’re doing here. It’s just a matter of time before we help make the Dev Degree model of computer science education the norm rather than the exception.

Additional Information

We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions, and learn about the actions we’re taking as we continue to hire during COVID‑19.

How to Fix Slow Code in Ruby

By Jay Lim and Gannon McGibbon

At Shopify, we believe in highly aligned, loosely coupled teams to help us move fast. Since we have many teams working independently on a large monolithic Rails application, inefficiencies in code are sometimes inadvertently added to our codebase. Over time, these problems can add up to serious performance regressions.

By the time such performance regressions are noticeable, it might already be too late to track offending commits down. This can be exceedingly challenging on codebases with thousands of changes being committed each day. How do we effectively find out why our application is slow? Even if we have a fix for the slow code, how can we prove that our new code is faster?

It all starts with profiling and benchmarking. Last year, we wrote about writing fast code in Ruby on Rails. Knowing how to write fast code is useful, but insufficient without knowing how to fix slow code. Let’s talk about the approaches that we can use to find slow code, fix it, and prove that our new solution is faster. We’ll also explore some case studies that feature real world examples on using profiling and benchmarking.


Profiling

Before we dive into fixing slow code, we need to find it first. Identifying the code that causes performance bottlenecks can be challenging in a large codebase, and profiling helps us do it easily.

What is Profiling?

Profiling is a type of program analysis that collects metrics about the program at runtime, such as the frequency and duration of method calls. It’s carried out using a tool known as a profiler, and a profiler’s output can be visualized in various ways. For example, flat profiles, call graphs, and flamegraphs.

Why Should I Profile My Code?

Some issues are challenging to detect by just looking at the code (static analysis, code reviews, etc.). One of the main goals of profiling is observability. By knowing what is going on under the hood during runtime, we gain a better understanding of what the program is doing and reason about why an application is slow. Profiling helps us to narrow down the scope of a performance bottleneck to a particular area.

How Do I Profile?

Before we figure out what to profile, we need to first figure out what we want to know: do we want to measure elapsed time for a specific code block, or do we want to measure object allocations in that code block? In terms of granularity, do we need elapsed time for every single method call in that code block, or do we just need the aggregated value? Elapsed time here can be further broken down into CPU time or wall time.

For measuring elapsed time, a simple solution is to measure the start time and the end time of a particular code block, and report the difference. If we need a higher granularity, we do this for every single method. To do this, we use the TracePoint API in Ruby to hook into every single method call made by Ruby. Similarly, for object allocations, we use the ObjectSpace module to trace object allocations, or even dump the Ruby heap to observe its contents.
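As a hedged sketch (not from the original post) of what such a homegrown TracePoint-based tool might look like, here is a toy profiler that counts every Ruby-level method call made inside a block; real tools also record timing and call stacks:

```ruby
# Count every Ruby-level method call made while the block runs.
def profile_calls
  counts = Hash.new(0)
  tracer = TracePoint.new(:call) do |tp|
    counts["#{tp.defined_class}##{tp.method_id}"] += 1
  end
  tracer.enable { yield }
  counts
end

# Hypothetical subject method.
def do_work; end

puts profile_calls { 3.times { do_work } }
# e.g. {"Object#do_work"=>3}
```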

However, instead of building custom profiling solutions, we can use one of the available profilers out there, and each has its own advantages and disadvantages. Here are a few options:

1. rbspy

rbspy samples stack frames from a Ruby process over time. The main advantage is that it can be used as a standalone program without needing any instrumentation code.

Once we know the Ruby Process Identifier (PID) that we want to profile, we start the profiling session like this:

rbspy record --pid $PID

2. stackprof

Like rbspy, stackprof samples stack frames over time, but from a block of instrumented Ruby code. Stackprof is used as a profiling solution for custom code blocks:

profile = StackProf.run(mode: :cpu) do
  # Code to profile
end

3. rack-mini-profiler

The rack-mini-profiler gem is a fully-featured profiling solution for Rack-based applications. Unlike the other profilers described in this section, it includes a memory profiler in addition to call-stack sampling. The memory profiler collects data such as Garbage Collection (GC) statistics, number of allocations, etc. Under the hood, it uses the stackprof and memory_profiler gems.

4. app_profiler

app_profiler is a lightweight alternative to rack-mini-profiler. It contains a Rack middleware that supports call-stack profiling for web requests. In addition to that, block-level profiling is also available to any Ruby application. These profiles can be stored in a configurable storage backend such as Google Cloud Storage, and can be visualized through a configurable viewer such as Speedscope, a browser-based flamegraph viewer.

At Shopify, we collect performance profiles in our production environments. Rack Mini Profiler is a great gem, but it comes with a lot of extra features such as database and memory profiling, and it seemed too heavy for our use case. As a result, we built App Profiler that similarly uses Stackprof under the hood. Currently, this gem is used to support our on-demand remote profiling infrastructure for production requests.

Case Study: Using App Profiler on Shopify

An example of a performance problem that was identified in production was related to unnecessary GC cycles. Last year, we noticed that a cart item with a very large quantity used a ridiculous amount of CPU time and resulted in slow requests. It turns out, the issue was related to Ruby allocating too many objects, triggering the GC multiple times.

The figure below illustrates a section of the flamegraph for a similar slow request, and the section corresponds to approximately 500ms of CPU time.

A section of the flamegraph for a similar slow request

The highlighted chunks correspond to the GC operations, and they interleave with the regular operations. From this section, we see that GC itself consumed about 35% of CPU time, which is a lot! We inferred that we were allocating too many Ruby objects. Without profiling, it’s difficult to identify these kinds of issues quickly.


Benchmarking

Now that we know how to identify performance problems, how do we fix them? While the right solution is largely context-sensitive, validating the fix isn’t. Benchmarking helps us prove performance differences between two or more code paths.

What is Benchmarking?

Benchmarking is a way of measuring the performance of code. Often, it’s used to compare two or more similar code paths to see which code path is the fastest. Here’s what a simple Ruby benchmark looks like:
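A minimal sketch, with do_work standing in (hypothetically) for the code under test:

```ruby
require "benchmark"

# Hypothetical stand-in for the code we want to measure.
def do_work
  sleep(0.1)
end

elapsed = Benchmark.realtime { do_work }
puts "do_work took #{elapsed.round(2)} seconds"
```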

This code snippet is benchmarking at its simplest. We’re measuring how long a method takes to run in seconds. We could extend the example to measure a series of methods, a complex math equation, or anything else that fits into a block. This kind of instrumentation is useful because it can unveil regression or improvement in speed over time.

While wall time is a fairly reliable measure of “performance”, there are other ways to measure code besides realtime: the Ruby standard library’s Benchmark module also includes bm and bmbm.

The bm method shows a more detailed breakdown of timing measurements. Let’s take a look at a script with some output:
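A sketch of such a script (the report labels and workloads are illustrative); each report prints user, system, total, and real columns:

```ruby
require "benchmark"

Benchmark.bm(8) do |x|
  x.report("sleep:") { sleep(0.1) }
  x.report("sum:")   { (1..1_000_000).sum }
end
```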

User, system, and total are all different measurements of CPU time. User refers to time spent working in user space. Similarly, system denotes time spent working in kernel space. Total is the sum of CPU timings, and real is the same wall time measurement we saw from Benchmark.realtime.

What about bmbm? Well, it’s exactly the same as bm, with one key difference. Here’s what the output looks like:
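A sketch of a bmbm script; the output prints the timing table twice, once under a “Rehearsal” header and once for the measured run:

```ruby
require "benchmark"

Benchmark.bmbm do |x|
  x.report("concat") { 100_000.times { "a" + "b" } }
  x.report("format") { 100_000.times { format("%s%s", "a", "b") } }
end
```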

The rehearsal, or warmup, step is what makes bmbm useful. It runs the benchmark code blocks once before measuring, priming any caching or similar mechanisms to produce more stable, reproducible results.

Lastly, let’s talk about the benchmark-ips gem. This is the most common method of benchmarking Ruby code, and you’ll see it a lot in the wild. This is what a simple script looks like:

Here, we’re benchmarking the same method using familiar syntax with the ips method. Notice the inline bundler and gemfile code. We need this in a scripting context because benchmark-ips isn’t part of the standard library. In a normal project setup, we add gem entries to the Gemfile as usual.

The output of this script is as follows:

Ignoring the bundler output, we see the warmup iterations per 100 milliseconds (collected over the default 2-second warmup), followed by how many times the code block was able to run per second during the 5-second measurement. It’ll become more apparent later why benchmark-ips is so popular.

Why Should I Benchmark My Code?

So, now we know what benchmarking is and some tools available to us. But why even bother benchmarking at all? It may not be immediately obvious why benchmarking is so valuable.

Benchmarks are used to quantify the performance of one or more blocks of code. This becomes very useful when there are performance questions that need answers. Often, these questions boil down to “which is faster, A or B?”. Let’s look at an example:

In this script, we’re doing most of what we did in the first benchmark-ips example. Pay attention to the addition of another method, and how it changes the benchmark block. When benchmarking more than one thing at once, simply add another report block. Additionally, the compare! method prints a comparison of all reports:

Wow, that’s pretty snazzy! compare! is able to tell us which benchmark is slower and by how much. Given the amount of thread sleeping we’re doing in our benchmark subject methods, this aligns with our expectations.

Benchmarking can be a means of proving how fast a given code path is. It’s not uncommon for developers to propose a code change that makes a code path faster without any evidence.

Depending on the change, comparison can be challenging. As in the previous example, benchmark-ips may be used to benchmark individual code paths. Running the same single-report benchmark on both versions of the code easily tests pre- and post-patch performance.

How Do I Benchmark My Code?

Now we know what benchmarking is and why it is important. Great! But how do you get started benchmarking in an application? Trivial examples are easy to learn from but aren’t very relatable.

When developing in a framework like Ruby on Rails, it can be difficult to understand how to set up and load framework code for benchmark scripts. Thankfully, one of the latest features of Ruby on Rails can generate benchmarks automatically. Let’s take a look:

This benchmark can be generated by running bin/rails generate benchmark my_benchmark, placing a file in script/benchmarks/my_benchmark.rb. Note the inline gemfile isn’t required because we piggyback off of the Rails app’s Gemfile. The benchmark generator is slated for release in Rails 6.1.

Now, let’s look at a real world example of a Rails benchmark:

In this example, we’re subclassing Order and caching the calculation it does to find the total price of all line items. While it may seem obvious that this would be a beneficial code change, it isn’t obvious how much faster it is compared to the base implementation. A fuller version of the script provides the complete context.

Running the script reveals a ~50x improvement for a simple order of 4 line items. For orders with more line items, the payoff only gets better.

One last thing to know about effective benchmarking is being aware of micro-optimizations: optimizations so small that the performance improvement isn’t worth the code change. While these are sometimes acceptable for hot code paths, it’s best to tackle larger-scale performance issues first.

Case Study: Rails Contributions

As with many open source projects, Ruby on Rails usually requires performance optimization pull requests to include benchmarks. The same is common for new features to performance sensitive areas like Active Record query building or Active Support’s cache stores. In the case of Rails, most benchmarks are made with benchmark-ips to simplify comparison.

For example, https://github.com/rails/rails/pull/36052 changes how primary keys are accessed in Active Record instances. Specifically, refactoring class method calls to instance variable references. It includes before and after benchmark results with a clear explanation of why the change is necessary.

https://github.com/rails/rails/pull/38401 changes model attribute assignment in Active Record so that key stringification of attribute hashes is no longer needed. A benchmark script with multiple scenarios is provided with results. This is a particularly hot codepath because creating and updating records is at the heart of most Rails apps.

Another example, https://github.com/rails/rails/pull/34197 reduces object allocations in ActiveRecord#respond_to?. It provides a memory benchmark that compares total allocations before and after the patch, with a calculated diff. Reducing allocations delivers better performance because the less Ruby allocates, the less time Ruby spends assigning objects to blocks of memory.

Final Thoughts

Slow code is an inevitable facet of any codebase. It isn’t important who introduces performance regressions, but how they are fixed. As developers, it’s our job to leverage profiling and benchmarking to find and fix performance problems.

At Shopify, we’ve written a lot of slow code, often for good reasons. Ruby itself is optimized for the developer, not the servers we run it on. As Rubyists, we write idiomatic, maintainable code that isn’t always performant, so profile and benchmark responsibly, and be wary of micro-optimizations!





Categorizing Products at Scale

By: Jeet Mehta and Kathy Ge

With over 1M business owners now on Shopify, there are billions of products being created and sold across the platform. Just like those business owners, the products that they sell are extremely diverse! Even when selling similar products, they tend to describe products very differently. One may describe their sock product as a “woolen long sock,” whereas another may have a similar sock product described as a “blue striped long sock.”

How can we identify similar products, and why is that even useful?

Applications of Product Categorization

Business owners come to our platform for its multiple sales/marketing channels, app and partner ecosystem, brick and mortar support, and so much more. By understanding the types of products they sell, we provide personalized insights to help them capitalize on valuable business opportunities. For example, when business owners try to sell on other channels like Facebook Marketplace, we can leverage our product categorization engine to pre-fill category related information and save them time.

In this blog post, we’re going to step through how we implemented a model to categorize all our products at Shopify, and in doing so, enabled cross-platform teams to deliver personalized insights to business owners. The system is used by 20+ teams across Shopify to power features like marketing recommendations for business owners (imagine: “t-shirts are trending, you should run an ad for your apparel products”), identification of brick-and-mortar stores for Shopify POS, market segmentation, and much more! We’ll also walk through problems, challenges, and technical tradeoffs made along the way.

Why is Categorizing Products a Hard Problem?

To start off, how do we even come up with a set of categories that represents all the products in the commerce space? Business owners are constantly coming up with new, creative ideas for products to sell! Luckily, Google has defined their own hierarchical Google Product Taxonomy (GPT) which we leveraged in our problem.

The particular task of classifying over a large-scale hierarchical taxonomy presented two unique challenges:

  1. Scale: The GPT has over 5000 categories and is hierarchical. Binary classification or multi-class classification can be handled well with most simple classifiers. However, these approaches don’t scale well as the number of classes increases to the hundreds or thousands. We also have well over a billion products at Shopify and growing!
  2. Structure: Common classification tasks don’t share structure between classes (i.e. distinguishing between a dog and a cat is a flat classification problem). In this case, there’s an inherent tree-like hierarchy which adds a significant amount of complexity when classifying.

Sample visualization of the GPT

Representing our Products: Featurization 👕

With all machine learning problems, the first step is featurization, the process of transforming the available data into a machine-understandable format.

Before we begin, it’s worth answering the question: What attributes (or features) distinguish one product from another? Another way to think about this is if you, the human, were given the task of classifying products into a predefined set of categories: what would you want to look at?

Some attributes that likely come to mind are

  • Product title
  • Product image
  • Product description
  • Product tags.

These are the same attributes that a machine learning model would need access to in order to perform classification successfully. With most problems of this nature though, it’s best to follow Occam’s Razor when determining viable solutions.

Among competing hypotheses, the one with the fewest assumptions should be selected.

In simpler language, Occam’s razor essentially states that the simplest solution or explanation is preferable to ones that are more complex. Based on the computational complexities that come with processing and featurizing images, we decided to err on the simpler side and stick with text-based data. Thus, our classification task included features like

  • Product title
  • Product description
  • Product collection
  • Product tags
  • Product vendor
  • Merchant-provided product type.

There are a variety of ways to vectorize text features like the above, including TF-IDF, Word2Vec, GloVe, etc. Optimizing for simplicity, we chose a simple term-frequency hashing featurizer using PySpark that works as follows:

HashingTF toy example

Given the vast size of our data (and the resulting size of the vocabulary), advanced featurization methods like Word2Vec didn’t scale since they involved storing an in-memory vocabulary. In contrast, the HashingTF provided fixed-length numeric features which scaled to any vocabulary size. So although we’re potentially missing out on better semantic representations, the upside of being able to leverage all our training data significantly outweighed the downsides.
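The hashing trick itself is simple; as a plain-Ruby illustration (the real implementation is PySpark’s HashingTF, and the bucket count here is illustrative), each token is hashed into a fixed-size bucket and counted:

```ruby
# Map each token to a fixed-size bucket and count occurrences.
# Collisions are possible, but the vector length never grows with
# the vocabulary.
def hashing_tf(tokens, num_features: 16)
  vector = Array.new(num_features, 0)
  tokens.each { |token| vector[token.hash % num_features] += 1 }
  vector
end

hashing_tf(["blue", "striped", "long", "sock"], num_features: 8)
# => a length-8 array whose entries sum to 4
```

One caveat of this sketch: Ruby’s String#hash is randomized per process, whereas HashingTF uses a stable hash so features are reproducible across runs.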

Before performing the numeric featurization via HashingTF, we also performed a series of standard text pre-processing steps, such as:

  • Removing stop words (i.e. “the”, “a”, etc.), special characters, HTML, and URLs to reduce vocabulary size
  • Performing tokenization: splitting a string into an array of individual words or “tokens”.
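A minimal Ruby sketch of these pre-processing steps (the stop-word list and regexes are illustrative, not the production ones):

```ruby
STOP_WORDS = %w[the a an and or of].freeze

def preprocess(text)
  text.downcase
      .gsub(%r{https?://\S+}, " ") # strip URLs
      .gsub(/<[^>]+>/, " ")        # strip HTML tags
      .gsub(/[^a-z0-9\s]/, " ")    # strip special characters
      .split                       # tokenize on whitespace
      .reject { |token| STOP_WORDS.include?(token) }
end

preprocess("Check out <b>The Blue</b> Striped Sock! https://example.com")
# => ["check", "out", "blue", "striped", "sock"]
```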

The Model 📖

With our data featurized, we can now move towards modelling. Maintaining a simple, interpretable solution while tackling the earlier-mentioned challenges of scale and structure was difficult.

Learning Product Categories

Fortunately, during the process of solution discovery, we came across a method known as Kesler’s Construction [PDF]. This is a mathematical maneuver that enables the conversion of n one-vs-all classifiers into a single binary classifier. As shown in the figure below, this is achieved by exploding the training data with respect to the labels, and manipulating feature vectors with target labels to turn a multi-class training dataset into a binary training dataset.

Figure 3: Kesler’s Construction formulation

Applying this formulation to our problem implied pre-pending the target class to each token (word) in a given feature vector. This is repeated for each class in the output space, per feature vector. The pseudo-code below illustrates the process, and also showcases how the algorithm leads to a larger, binary-classification training dataset.

  1. Create a new empty dataset called modified_training_data
  2. For each feature_vector in the original_training_data:
    1. For each class in the taxonomy:
      1. Prepend the class to each token in the feature_vector, called modified_feature_vector
      2. If the feature_vector is an example of the class, append (modified_feature_vector, 1) to modified_training_data
      3. If the feature_vector is not an example of the class, append (modified_feature_vector, 0) to modified_training_data
  3. Return modified_training_data

Note: In the algorithm above, a vector can be an example of a class if its ground truth category belongs to a class that’s a descendant of the category being compared to. For example, a feature vector that has the label Clothing would be an example of the Apparel & Accessories class, and as a result would be assigned a binary label of 1. Meanwhile, a feature vector that has the label Cell Phones would not be an example of the Apparel & Accessories class, and as a result would be assigned a binary label of 0.
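The pseudo-code can be sketched compactly (names are illustrative, and the descendant-aware membership check from the note is reduced to simple equality):

```ruby
# training_data: [[tokens, label], ...]; taxonomy: list of class names.
def keslerize(training_data, taxonomy)
  training_data.flat_map do |tokens, label|
    taxonomy.map do |klass|
      modified = tokens.map { |token| "#{klass}_#{token}" }
      # The real check also counts descendants of `klass` as positives.
      [modified, label == klass ? 1 : 0]
    end
  end
end

keslerize([[["sock"], "Apparel"]], ["Apparel", "Phones"])
# => [[["Apparel_sock"], 1], [["Phones_sock"], 0]]
```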

Combining the above process with a simple Logistic Regression classifier allowed us to:

  • Solve the problem of scale - Kesler’s construction allowed a single model to scale to n classes (in this case, n was into the thousands)
  • Leverage taxonomy structure - By embedding target classes into feature vectors, we’re also able to leverage the structure of the taxonomy and allow information from parent categories to permeate into features for child categories. 
  • Reduce computational resource usage - Training a single model as opposed to n individual classifiers (albeit on a larger training data-set) ensured a lower computational load/cost.
  • Maintain simplicity - Logistic Regression is one of the simplest classification methods available. Its coefficients allow interpretability, and it comes with little hyperparameter-tuning friction.

Inference and Predictions 🔮

Great, we now have a trained model, how do we then make predictions to all products on Shopify? Here’s an example to illustrate. Say we have a sample product, a pair of socks, below:

Figure 4: Sample product entry for a pair of socks

We aggregate all of its text (title, description, tags, etc.) and clean it up, resulting in the string:

“Check out these socks”

We take this sock product and compare it to all categories in the available taxonomy we trained on. To avoid computations on categories that will likely be low in relevance, we leverage the taxonomy structure and use a greedy approach in traversing the taxonomy.

Figure 5: Sample traversal of taxonomy at inference time

For each product, we prepend a target class to each token of the feature vector, and do so for every category in the taxonomy. We score the product against each root level category by multiplying this prepended feature vector against the trained model coefficients. We start at the root level and keep track of the category with the highest score. We then score the product against the children of the category with the highest score. We continue in this fashion until we’ve reached a leaf node. We output the full path from root to leaf node as a prediction for our sock product.
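The greedy descent can be sketched as follows, where score_fn stands in for “prepend the class to the tokens and multiply against the trained coefficients” (the taxonomy and scorer here are toys):

```ruby
# Walk the taxonomy greedily: at each level, descend into the
# highest-scoring child until a leaf is reached.
def predict_path(node, tokens, score_fn)
  path = []
  until node[:children].empty?
    node = node[:children].max_by { |child| score_fn.call(child[:name], tokens) }
    path << node[:name]
  end
  path
end

taxonomy = {
  name: "root",
  children: [
    { name: "Apparel & Accessories", children: [
      { name: "Clothing", children: [] },
      { name: "Shoes",    children: [] },
    ] },
    { name: "Electronics", children: [] },
  ],
}

# Toy scorer: favour categories that share words with the product text.
score = ->(category, tokens) { (category.downcase.split & tokens).size }

predict_path(taxonomy, ["blue", "clothing", "sock"], score)
# => ["Apparel & Accessories", "Clothing"]
```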

Evaluation Metrics & Performance ✅

The model is built. How do we know if it’s any good? Luckily, the machine learning community has an established set of standards around evaluation metrics for models, and there are good practices around which metrics make the most sense for a given type of task.

However, the uniqueness of hierarchical classification adds a twist to these best practices. For example, commonly used evaluation metrics for classification problems include accuracy, precision, recall, and F1 Score. These metrics work great for flat binary or multi-class problems, but there are several edge cases that show up when there’s a hierarchy of classes involved.

Let’s take a look at an illustrative example. Suppose for a given product, our model predicts the following categorization: Apparel & Accessories > Clothing > Shirts & Tops. There are a few cases that can occur, based on what the product actually is:

Product is a shirt - Model example

1. Product is a Shirt: In this case, we’re correct! Everything is perfect.

Figure 7. Product is a dress - Model example

2. Product is a Dress: Clearly, our model is wrong here. But how wrong is it? It still correctly recognized that the item is a piece of apparel and is clothing.

Figure 8. Product is a watch - Model example

3. Product is a Watch: Again, the model is wrong here. It’s more wrong than the above answer, since it believes the product to be an accessory rather than apparel.

Figure 9. Product is a phone - Model example

4. Product is a Phone: In this instance, the model is the most incorrect, since the categorization is completely outside the realm of Apparel & Accessories.

The flat metrics discussed above would punish each of the above predictions equally, when it’s clear that they aren’t equally wrong. To rectify this, we leveraged work done by Costa et al. on hierarchical evaluation measures [PDF], which use the structure of the taxonomy (output space) to punish incorrect predictions accordingly. These include:

  • Hierarchical accuracy
  • Hierarchical precision
  • Hierarchical recall
  • Hierarchical F1

As shown below, the calculation of the metrics largely remains the same as their original flat form. The difference is that these metrics are regulated by the distance to the nearest common ancestor. In the examples provided, Dresses and Shirts & Tops are only a single level away from having a common ancestor (Clothing). In contrast, Phones and Shirts & Tops are in completely different sub-trees, and are four levels away from having a common ancestor.

Example hierarchical metrics for “Dresses” vs. “Shirts & Tops”

This distance serves as a proxy for the magnitude of incorrectness of our predictions, and allows us to present, and better assess, the performance of our models. The lesson here is to always question conventional evaluation metrics, ensure that they fit your use case, and measure what matters.
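The hierarchical metrics can be computed directly from the predicted and true category paths. In this sketch, categories are represented as root-to-node tuples, which is our assumption for illustration; the precision/recall/F1 formulas follow the Costa et al. formulation over shared path nodes.

```python
# Sketch of hierarchical precision, recall, and F1 over root-to-node paths.
def hierarchical_prf(pred_path, true_path):
    pred, true = set(pred_path), set(true_path)
    overlap = len(pred & true)                 # shared ancestors (incl. the node)
    hp = overlap / len(pred)                   # hierarchical precision
    hr = overlap / len(true)                   # hierarchical recall
    hf1 = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf1

truth = ("Apparel & Accessories", "Clothing", "Dresses")
shirt = ("Apparel & Accessories", "Clothing", "Shirts & Tops")
phone = ("Electronics", "Phones")
print(hierarchical_prf(shirt, truth))  # two of three path nodes match: ≈0.67 each
print(hierarchical_prf(phone, truth))  # disjoint sub-trees: (0.0, 0.0, 0.0)
```

Unlike flat accuracy, which scores both wrong predictions identically at 0, the shirt prediction retains most of its credit while the phone prediction gets none.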

When Things Go Wrong: Incorrect Classifications ❌

Like all probabilistic models, our model is bound to be incorrect on occasion. While the goal of model development is to reduce these misclassifications, it’s important to note that 100% accuracy is unattainable (and it shouldn’t be the gold standard that teams drive towards).

Instead, given that the data product is delivering downstream impact to the business, it's best to determine feedback mechanisms for misclassification instances. This is exactly what we implemented through a unique setup of schematized Kafka events and an in-house annotation platform.

Feedback system design

This flexible human-in-the-loop setup gives any downstream consumer a plug-in system to leverage, leading to reliable, accurate data additions to the model. It also extends beyond misclassifications to entirely new streams of data: new business owner-facing products and features that collect category information can feed it directly back into our models.

Back to the Future: Potential Improvements 🚀

Having established a baseline product categorization model, we’ve identified a number of changes that could significantly improve the model’s performance, and therefore its downstream impact on the business.

Data Imbalance ⚖️

Much like other e-commerce platforms, Shopify has large sets of merchants selling certain types of products. As a result, our training dataset is skewed towards those product categories.

At the same time, we don’t want that to preclude merchants in other industries from receiving strong, personalized insights. While we’ve taken some efforts to improve the data balance of each product category in our training data, there’s a lot of room for improvement. This includes experimenting with different re-balancing techniques, such as minority class oversampling (e.g. SMOTE [PDF]), majority class undersampling, or weighted re-balancing by class size.
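As a sketch of the last option, weighted re-balancing by class size can be as simple as assigning each class a loss weight inversely proportional to its frequency. The category counts below are made up for illustration; only the weighting scheme itself is the point.

```python
from collections import Counter

# Minimal sketch of weighted re-balancing by class size: rare classes receive
# proportionally larger loss weights, so the model can't ignore them.
def class_weights(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

labels = ["Apparel"] * 80 + ["Electronics"] * 15 + ["Toys"] * 5
print(class_weights(labels))  # Apparel ≈ 0.42, Electronics ≈ 2.22, Toys ≈ 6.67
```

With these weights, a single misclassified "Toys" example contributes as much to the loss as roughly sixteen "Apparel" examples, counteracting the skew in the training data.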

Translations 🌎

As Shopify expands to international markets, it’s increasingly important to make sure we’re providing equal value to all business owners, regardless of their background. While our model currently only supports English language text (that being the primary source available in our training data), there’s a big opportunity here to capture products described and sold in other languages. One of the simplest ways we can tackle this is by leveraging multi-lingual pre-trained models such as Google’s Multilingual Sentence Embeddings.

Images 📸

Product images are a rich data source: a universal language in which products of all countries and types can be represented and categorized. This is something we’re looking to incorporate into our model in the future; however, images come with increased engineering costs. While training image models from scratch is very expensive, one strategy we’ve experimented with is using pre-trained image embeddings like Inception v3 [PDF] and developing a CNN for this classification problem.

Our simple model design gave us interpretability and kept computational resource usage low, enabling us to solve this problem at Shopify’s scale. Building out a shared language for products unlocked tons of opportunities to create better experiences for business owners and buyers, like identifying trending products, flagging product industries prone to fraud, or improving storefront search experiences.

If you’re passionate about building models at scale, and you’re eager to learn more - we’re always hiring! Reach out to us or apply on our careers page.

Additional Information



Continue reading

Software Release Culture at Shopify

Software Release Culture at Shopify

A recording of the event and the additional questions are now available in the Release Culture @ Shopify Virtual Event section at the end of the post.

By Jack Li, Kate Neely, and Jon Geiger

At the end of last year, we shared the Merge Queue v2 in our blog post Successfully Merging the Work of 1000+ Developers. One question that we often get is, “why did you choose to build this yourself?” The short answer is that nothing we found could quite solve the problem in the way we wanted. The long answer is that it’s important for us to build an optimized experience for how our developers want to work and to continually shape our tooling and process around our “release culture”.

Shopify defines culture as:

“The sum of beliefs and behavior of everyone at Shopify.”

We approach the culture around releasing software the exact same way. We have important goals, like making sure that bad changes don’t deploy to production and break for our users, and that our changes can make it into production without compromises in security. But there are many ways of getting there and a lot of right answers on how we can do things.

As a team, we try to find the path to those goals that our developers want to take. We want to create experiences through tooling that can make our developers feel productive, and we want to do our best to make shipping feel like a celebration and not a chore.

Measuring Release Culture at Shopify

When we talk about measuring culture, we’re talking about a few things.

  • How do developers want to work?
  • What is important to them?
  • How do they expect the tools they use to support them?
  • How much do they want to know about what’s going on behind the scenes or under the hood of the tools they use?

Often, there isn’t one single answer to these questions, especially given the number and variety of people who deploy every day at Shopify. There are a few active and passive ways we can get a sense of the culture around shipping code. One method isn’t more important than the others, but all of them together paint a clearer picture of what life is like for the people who use our tools.

Passive and active methods of measurement

The passive methods we use really don’t require much work from our team, except to manage and aggregate the information that comes in. The developer happiness survey is a biannual survey of developers across the company. Devs are asked to self-report on everything from their satisfaction with the tools they use to where they feel most of their time is wasted or lost.

In addition, we have Slack channels dedicated to shipping that are open to anyone. Users can get support from our team or each other, and report problems they’re having. Our team is active in these channels to help foster a sense of community and encourage developers to share their experiences, but we don’t often use these channels to directly ask for feedback.

That said, we do want to be proactive about identifying pain points, and we know we can’t rely too much on users to provide that direction, so there are also active things we do to make sure we’re solving the most important problems.

The first thing is dogfooding. Just like other developers at Shopify, our team ships code every day using the same tools that we build and maintain. This helps us identify gaps in our service and empathize with users when things don’t go as planned.

Another valuable resource is our internal support team. They take on the huge responsibility of helping users and supporting our evolving suite of internal tools. They diagnose issues and help users find the right team to direct their questions. And they are invaluable in terms of identifying common pain points that users experience in current workflows, as well as potential pitfalls in concepts and prototypes. We love them.

Finally, especially when it comes to adding new features or changing existing workflows, we do UX research throughout our process:

  • to better understand user behavior and expectations
  • to test out concepts and prototypes as we develop them

We shadow developers as they ship PRs to see what else they’re looking at and what they’re thinking about as they make decisions. We talk to people, like designers and copywriters, who might not ship code at other companies (but they often do at Shopify) and ask them to walk us through their processes and how they learned to use the tools they rely on. We ask interns and new hires to test out prototypes to get fresh perspectives and challenge our assumptions.

All of this together ensures that, throughout the process of building and launching, we’re getting feedback from real users to make things better.

Feedback is a Gift

At Shopify, we often say feedback is a gift, but that doesn’t always make it less intimidating for users to share their frustrations, or easier for us to hear when things go wrong. Our goal with all measuring is to create a feedback loop where users feel comfortable talking about what’s not working for them (knowing that we care and will try to act on it), and we feel energized and inspired by what we learn from users instead of disheartened and bitter. We want them to know that their feedback is valuable and helpful for us to make both the tools and culture around shipping supportive of everyone.

Shopify’s Release Process

Let’s look at what Shopify’s actual release process looks like and how we’re working to improve it.

Release Pipeline

Happy path of the release pipeline

This is what the release pipeline looks like on the happy path. We go from Pull Request (PR) to Continuous Integration (CI)/Merge to Canary deployment and finally Production.

Release pipeline process starts with a PR and a /shipit command

Developers start the process by creating a PR and then issue a /shipit command when ready to ship. From here, the Merge Queue system tries to integrate the PR with the trunk branch, Master.

PR merged to Master and then deployed to Canary

When the Merge Queue determines the changes can be integrated successfully, the PR is merged to Master and deployed to our Canary infrastructure. The Canary environment receives a random 5% of all incoming requests.

Changes deployed to Production

Developers have tooling that allows them to test their changes in the Canary environment for 10 minutes. If there’s no manual intervention and the automated canary analysis doesn’t trigger any alerts, the changes are deployed to Production.


Developers want to be trusted and have autonomy over their work. Developers should be able to own the entire release process for their PRs.

Developers own the whole process

Developers own the whole process. There are no release managers, sign-offs, or release windows.

We have infrastructure to limit the blast radius of bad changes

Unfortunately, sometimes things will break, and that’s ok. We have built our infrastructure to limit the blast radius of bad changes. Most importantly, we trust each of our developers to be responsible and own the recovery if their change goes bad.

Developers can fast track a fix using the /shipit --emergency command

Once a fix has been prepared (either a fix-forward or revert), developers can fast track their fix to the front of the line with a single /shipit --emergency command. To help our developers make decisions quickly, we don’t have multiple recovery protocols, and instead, just have a single emergency feature that takes the quickest path to recovery.


Developers want to ship fast.

A quick release process allows us a quick path to recovery

Speed of release is a crucial element to most apps at Shopify. It’s a big productivity boost for developers to ship their code multiple times a day and have it reach end-users immediately. But more importantly, having a quick-release process allows us a quick path to recovery.

We’re willing to make tradeoffs in cost for a fast release process

In order to invest in making our release process really fast, we’re willing to make tradeoffs in cost. In addition to dedicated infrastructure teams, we also manage our own CI cluster with thousands of nodes at its daily peak.

Automate as Much as Possible

Developers don’t want to perform repetitive tasks, and computers are still better at them than humans, so we automate as much as possible. We use automation in places like continuous deployment and canary analysis.

Developers don't have to press deploy, we automate that

We automated away the need for developers to press Deploy: we continuously and automatically deploy to Canary and Production.

Developers can override automation

It’s still important for developers to be able to override the machinery. Developers can lock the automatic deployments and deploy manually, in cases like emergencies.

Release Culture @ Shopify Virtual Event

We held Shipit! Presents a Q&A about Release Culture @ Shopify with our guests Jack Li, Kate Neely, and Jon Geiger on April 20, 2020. We discussed the culture around shipping code at Shopify and answered your questions. We weren’t able to answer every question during the event, so we’ve included answers to the ones we didn’t get to below.

How has automation and velocity maintained uptime?

Automation has helped maintain uptime by providing assurances on a level of quality for changes on the way out, such as the automated canary analysis guaranteeing that the changes meet a certain level of production quality. Velocity has helped maintain uptime by reducing downtime; when things break, the high velocity of release means that problems are resolved quicker.

For an active monolith with many merges happening throughout the day, deploys to canary must be happening very frequently. How do you identify the "bad" merge, if there have been many recent merges on canary, and how do you ensure that bad merges don't block the release of other merges while there's a "bad" merge in the canary environment?

Our process here is still relatively immature, so triaging failures is still a manual process. Our velocity helps us keep changelists small, which makes triaging failures easier. As for reducing the impact of bad changes, I’ll defer to our blog post about the Merge Queue, which helps ensure that we aren’t completely stalled when bad changes happen.

How do you as a tooling organization handle sprawl? How do you balance enabling and controlling? That is, can teams choose to use their own languages and frameworks within those languages, or are they restricted to a set of things they're allowed to use?

Generally, we are more restrictive in technology choices. This is mostly because we want to be strategic about the technologies we use: we’re open to experimentation, but we have battle-tested technologies at Shopify that we encourage as recommended defaults (e.g. Ruby, Rails). React Native is the Future of Mobile at Shopify is an interesting article that talks about a recent technology change we made.

What were you using before /shipit? How did that transition look? How did you measure its success?

Successfully Merging the Work of 1000+ Developers tells the story of how we got to the current /shipit system. Our two measures of success here have been feedback from our developer happiness survey and metrics around average pull request time-to-production.

How many different tools comprise the CI/CD pipeline and are they all developed in house, and are they owned by a specific team or does everyone contribute?

We work with a variety of vendors! Our biggest partners are Buildkite, which we use for scheduling CI, and GitHub, which we build our development workflow around. Some more info about our tooling can be found at Stackshare.io. The tools we build are developed and owned by our Developer Acceleration team, but everyone is free to contribute, and we get tons of contributions all the time. Our CD system, Shipit, is actually open source, and we frequently see contributions from community members as well.

How is performance on production monitored after a feature release?

Typically this is something that teams themselves will strategize around and monitor. Performance is a big deal to us, and we have an internal dashboard to help teams track their performance metrics. We trust each team to take this component of their product seriously. Aside from that, the Production Engineering team has monitors and dashboards around performance metrics of the entire system.

How did you get into creating dev tooling for in-house teams? Are there languages/systems you would recommend learning to someone who is interested?

(Note from Jack: Interpreting this as a kind of career question) Personally, I’ve always gravitated towards the more “meta” parts of software development, focusing on long-term productivity and maintainability of previous projects, so working on dev tooling full-time felt like a perfect fit. In my opinion, the most important skill to be successful in this problem space is to be adaptable, both in adapting to new technologies and to new ideas. Languages like Ruby and Python, which allow you to focus more on the ideas behind your code, can be good enablers for this. Docker and Kubernetes knowledge is valuable in this area as well.

Is development done on feature branches and entire features merged all at once, or are partial/incomplete features merged into master, but guarded by a feature flag?

Very good question, I think certain teams/features will do slightly different things, but typically releases happen via feature flags that we call “Beta Flags” in our system. This allows changes to be rolled out on a per-shop basis, or a percentage-of-shops basis.

Do you guys use Crystalball?

We forwarded this question to our test infrastructure team. Their response was that we don’t use Crystalball: there was some brief exploration into it, but it wasn’t fast enough to trace through our codebase, and the test suite in our main monolith is written in minitest.

Additional Information

If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions. Learn about the actions we’re taking as we continue to hire during COVID‑19.



Continue reading

Building Arrive's Confetti in React Native with Reanimated

Building Arrive's Confetti in React Native with Reanimated

Shopify is investing in React Native as our primary choice of mobile technology moving forward. As a part of this we’ve rewritten our package tracking app Arrive with React Native and launched it on Android—an app that previously only had an iOS version.

One of the most cherished features by the users of the Arrive iOS app is the confetti that rains down on the screen when an order is delivered. The effect was implemented using the built-in CAEmitterLayer class in iOS, producing waves of confetti bursting out with varying speeds and colors from a single point at the top of the screen.

When we on the Arrive team started building the React Native version of the app, we included the same native code that produced the confetti effect through a Native Module wrapper. This would only work on iOS however, so to bring the same effect to Android we had two options before us:

  1. Write a counterpart to the iOS native code in Android with Java or Kotlin, and embed it as a Native Module.
  2. Implement the effect purely in JavaScript, allowing us to share the same code on both platforms.

As you might have guessed from the title of this blog post, we decided to go with the second option. To keep the code as performant as the native implementation, the best option would be to write it in a declarative fashion with the help of the Reanimated library.

I’ll walk you through, step by step, how we implemented the effect in React Native, while also explaining what it means to write an animation declaratively.

When we worked on this implementation, we also decided to make some visual tweaks and improvements to the effect along the way. These changes make the confetti spread out more uniformly on the screen and behave more like paper by rotating along all three dimensions.

Laying Out the Confetti

To get our feet wet, the first step will be to render a number of confetti on the screen with different colors, positions and rotation angles.

Initialize the view of 100 confetti

We initialize the view of 100 confetti with a couple of randomized values and render them out on the screen. To prepare for animations further down the line, each confetto (singular form of confetti, naturally) is wrapped with Reanimated's Animated.View. This works just like the regular React Native View, but accepts declaratively animated style properties as well, which I’ll explain in the next section.

Defining Animations Declaratively

In React Native, you generally have two options for implementing an animation:

  1. Write a JavaScript function called by requestAnimationFrame on every frame to update the properties of a view.
  2. Use a declarative API, such as Animated or Reanimated, that allows you to declare instructions that are sent to the native UI-thread to be run on every frame.

The first option might seem the most attractive at first for its simplicity, but there’s a big problem with the approach. You need to be able to calculate the new property values within 16 milliseconds every time to maintain a consistent 60 FPS animation. In a vacuum, this might seem like an easy goal to accomplish, but because of JavaScript's single threaded nature you’ll also be blocked by anything else that needs to be computed in JavaScript during the same time period. As an app grows and needs to be able to do more things at once, it quickly becomes implausible to always be able to finish the computation within the strict time limit.

With the second option, you only rely on JavaScript at the beginning of the animation to set it all up, after which all computation happens on the native UI-thread. Instead of relying on a JavaScript function to answer where to move a view on each frame, you assemble a set of instructions that the UI-thread itself can execute on every frame to update the view. When using Reanimated these instructions can include conditionals, mathematical operations, string concatenation, and much more. These can be combined in a way that almost resembles its own programming language. With this language, you write a small program that can be sent down to the native layer, that is executed once every frame on the UI-thread.

Animating the Confetti

We’re now ready to apply animations to the confetti we laid out in the previous step. Let’s start by updating our createConfetti function:

Instead of randomizing x, y, and angle, we give all confetti the same initial values and instead randomize the velocities that we’re going to apply to them. This creates the effect of all confetti starting out inside an imaginary confetti cannon and shooting out in different directions and at different speeds. Each velocity expresses how much a value changes per full second of animation.
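A hypothetical Python-flavored analogue of that setup might look like the following; the value ranges, field names, and colors are all assumptions for illustration, not the code from the actual createConfetti function.

```python
import random

# Every confetto starts at the same origin (the "cannon") with its own
# randomized velocities for position and rotation.
def create_confetti(count, colors=("#e67e22", "#2ecc71", "#3498db")):
    return [{
        "x": 0.0, "y": 0.0, "angle": 0.0,   # shared starting point
        "xVel": random.uniform(-200, 200),  # px per second, left or right
        "yVel": random.uniform(150, 250),   # px per second, downward
        "angleVel": random.uniform(-6, 6),  # radians per second of tumble
        "color": random.choice(colors),
    } for _ in range(count)]

confetti = create_confetti(100)
```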

We need to wrap each value that we're intending to animate with Animated.Value, to prepare them for declarative instructions. The Animated.Clock value is what's going to be the driver of all our animations. As the name implies it gives us access to the animation's time, which we'll use to decide how much to move each value forward on each update.

Further down, next to where we’re mapping over and rendering the confetti, we add our instructions for how the values should be animated:

Before anything else, we set up our dt (delta time) value that will express how much time has passed since the last update, in seconds. This decides the x, y, and angle delta values that we're going to apply.

To get our animation going we need to start the clock if it's not already running. To do this, we wrap our instructions in a condition, cond, which checks the clock state and starts it if necessary. We also need to call our timeDiff (time difference) value once to set it up for future use, since the underlying diff function returns its value’s difference since the last frame it evaluated, and the first call will be used as the starting reference point.

The declarative instructions above roughly translate to the following pseudo code, which runs on every frame of the animation:
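A rough Python-flavored rendition of that per-frame update is below; the field names echo the confetti values from earlier, but they are assumptions rather than the actual gist contents.

```python
# Each frame advances every animated value by its velocity times the elapsed
# time, so motion stays smooth regardless of the actual frame rate.
def update_confetto(c, dt):
    c["x"] += c["xVel"] * dt          # drift sideways
    c["y"] += c["yVel"] * dt          # fall at constant speed (air resistance)
    c["angle"] += c["angleVel"] * dt  # paper-like tumble

c = {"x": 0.0, "y": 0.0, "angle": 0.0, "xVel": 60.0, "yVel": 150.0, "angleVel": 3.0}
update_confetto(c, 1 / 60)            # one frame at 60 FPS
```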

Considering the nature of confetti falling through the air, moving at constant speed makes sense here. If we were to simulate more solid objects that aren't slowed down by air resistance as much, we might want to add a yAcc (y-axis acceleration) variable that would also increase the yVel (y-axis velocity) within each frame.

Everything put together, this is what we have now:

Staggering Animations

The confetti is starting to look like the original version, but our React Native version is blurting out all the confetti at once, instead of shooting them out in waves. Let's address this by staggering our animations:

We add a delay property to our confetti, with increasing values for each group of 10 confetti. To wait for the given time delay, we update our animation code block to first subtract dt from delay until it reaches below 0, after which our previously written animation code kicks in.
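The delay handling can be sketched as Python-flavored pseudocode; the names here are assumptions for illustration.

```python
# Each confetto burns down its delay before the movement update runs.
def step(c, dt):
    if c["delay"] > 0:
        c["delay"] -= dt   # still waiting inside the cannon
        return False       # skip the movement update this frame
    return True            # delay elapsed: run the regular animation code

c = {"delay": 0.01}
step(c, 1 / 60)            # first frame only consumes the delay
step(c, 1 / 60)            # from here on, the confetto moves
```

Because every group of 10 confetti shares a larger delay than the group before it, each group starts moving one "wave" later than the previous one.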

Now we have something that pretty much looks like the original version. But isn’t it a bit sad that a big part of our confetti is shooting off the horizontal edges of the screen without having a chance to travel across the whole vertical screen estate? It seems like a missed potential.

Containing the Confetti

Instead of letting our confetti escape the screen on the horizontal edges, let’s have them bounce back into action when that’s about to happen. To prevent this from making the confetti look like pieces of rubber macaroni bouncing back and forth, we need to use a good elasticity multiplier to determine how much of the initial velocity to keep after the collision.

When an x value is about to go outside the bounds of the screen, we reset it to the edge’s position and reverse the direction of xVel, reducing it by the elasticity multiplier at the same time:
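In Python-flavored pseudocode, the bounce might look like the following; ELASTICITY here is an assumed value, not the multiplier used in Arrive.

```python
ELASTICITY = 0.5  # assumed: fraction of horizontal velocity kept per bounce

def bounce(x, x_vel, screen_width):
    if x < 0:
        return 0, -x_vel * ELASTICITY             # clamp to left edge, reverse & damp
    if x > screen_width:
        return screen_width, -x_vel * ELASTICITY  # clamp to right edge, reverse & damp
    return x, x_vel                               # still on screen: unchanged

print(bounce(-5, -120.0, 375))  # → (0, 60.0)
```

An elasticity of 1 would give the rubber-macaroni effect described above, while a value near 0 would make confetti stick to the edges, so something in between keeps the motion paper-like.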

Adding a Cannon and a Dimension

We’re starting to feel done with our confetti, but let’s have a last bit of fun with it before shipping it off. What’s more fun than a confetti cannon shooting 2-dimensional confetti? The answer is obvious of course—it’s two confetti cannons shooting 3-dimensional confetti!

We should also consider cleaning up by deleting the confetti images and stopping the animation once we reach the bottom of the screen, but that’s not nearly as fun as the two additions above so we’ll leave that out of this blog post.

This is the result of adding the two effects above:

The final full code for this component is available in this gist.

Driving Native-level Animation with JavaScript

While it can take some time to get used to Reanimated’s seemingly arcane API, once you’ve played around with it for a bit, there should be nothing stopping you from implementing butter-smooth cross-platform animations in React Native, all without leaving the comfort of the JavaScript layer. The library has many more capabilities we haven’t touched on in this post, for example, the possibility to add user interactivity by mapping animations to touch gestures. Keep a lookout for future posts on this subject!

Continue reading

Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization

Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization

Shopify's involvement with TruffleRuby began half a year ago, with the goal of furthering the success of the project and the Ruby community. TruffleRuby is an alternative implementation of the Ruby language (where the reference implementation is CRuby, or MRI) developed by Oracle Labs. TruffleRuby has high potential in speed: it is nine times faster than CRuby on optcarrot, a NES emulator benchmark developed by the Ruby Core Team.

I’ll walk you through a simple feature I investigated and implemented. It showcases many important aspects of TruffleRuby and serves as a great introduction to the project!

Introduction to Ruby Lazy Initialization

Ruby developers tend to use the double pipe equals operator ||= for lazy initialization, likely somewhat like this:

Syntactically, the double pipe equals operator assigns the value only if the variable is currently unset.

The common use case of the operator isn’t so much “assign if” and more “assign once”.

This idiomatic usage is a subset of the operator’s syntactic meaning, so prioritizing that logic in the compiler can improve performance. For TruffleRuby, this leads to less machine code being emitted, as the logic flow is shortened.

Analyzing Idiomatic Usage

To confirm that this usage is common enough to be worth optimizing for, I ran static profiling on how many times this operator is used as lazy initialization.

For a statement to count as a lazy initialization for these profiling purposes, we had it match one of the following requirements:

  • The value being assigned is a constant (uses only literals of int, string, symbol, hash, array or is a constant variable). An example would be a ||= [2 * PI].
  • The statement with the ||= operator is in a function, an instance or class variable is being assigned, and the name of the instance variable contains the name of the function or vice versa. The function must accept no params. An example would be def get_a; @a ||= func_call.

These criteria are very conservative. Here are some examples of cases that won’t be considered a lazy initialization but probably still follow the pattern of “assign once”.

After profiling 20 popular open-source projects, I found 2082 usages of the ||= operator, 64% of them being lazy initialization by this definition.

Compiling Code with TruffleRuby

Before we get into optimizing TruffleRuby for this behaviour, here’s some background on how TruffleRuby compiles your code.

TruffleRuby is an implementation of Ruby that aims for higher performance through optimizing Just In Time (JIT) compilation (programs are compiled as they're being executed). It’s built on top of GraalVM, a modified JVM built by Oracle that provides Truffle, a framework used by TruffleRuby for implementing languages through building Abstract Syntax Tree (AST) interpreters. With Truffle, there’s no explicit step where JVM bytecode is created as with a conventional JVM language; rather, Truffle just uses the interpreter and communicates with the JVM to create machine code directly, using profiling and a technique called partial evaluation. This means that GraalVM can be advertised as magic that converts interpreters into compilers!

TruffleRuby also leverages deoptimization (more than other implementations of Ruby), which is a term for quickly moving from the fast JIT-compiled machine code back to the slow interpreter. One application of deoptimization is how the compiler handles monkey patching (e.g. replacing a class method at runtime). Since it’s unlikely that a method will be monkey patched, we deoptimize when it happens: execution falls back to the interpreter, which finds and executes the new method. The path for handling monkey patching never needs to be compiled or appear in the machine code. In practice, this use case is even better: instead of constantly checking whether a function has been redefined, we place the deoptimization at the point of redefinition and never need a check in compiled code.

In this case with lazy initialization, we make the deoptimization case the uncommon one where the variable needs to be assigned a value more than once.

Implementing the Deoptimization

Before this change, when TruffleRuby encountered the ||= operator, a Graal profiler would see that both sides had been used and compile the entire statement into machine code. Our knowledge of how Ruby is used in practice tells us that the right-hand side is unlikely to run again, so it doesn’t need to be compiled into machine code if it has never been executed or has been executed just once.

TruffleRuby uses little objects called nodes to represent each part of a Ruby program. We use an OrNode to handle the ||= operator, with the left side being the condition and the right side being the action to execute if the left side is false (in this case the action is an assignment). The creation of these nodes is implemented in Java.

To make this optimization, we swapped out the standard OrNode for an OrLazyValueDefinedNode in the BodyTranslator which translates the Ruby AST into nodes that Truffle can understand.

The basic OrNode executes like this:

The ConditionProfile is what counts how many times each branch is executed. With lazy initialization it counts both sides as used by default, so it compiles them both into the machine code.

The OrLazyValueDefinedNode only changes the else block. What I'm doing here is counting the number of times the else part is executed, and turning it into a deoptimization if it’s less than twice.
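The actual node is Java code built on the Truffle framework; purely as a loose Ruby model of the counting idea (not the real implementation), the else branch only graduates to compilation once it has executed at least twice:

```ruby
# Toy model of OrLazyValueDefinedNode's branch accounting: while the
# else branch (the assignment) has run fewer than two times, treat it
# as a deoptimization; afterwards, allow it to be compiled.
class LazyOrProfile
  COMPILE_THRESHOLD = 2

  def initialize
    @else_executions = 0
  end

  # Returns :deoptimize while the else branch is rare, :compile once
  # it has executed COMPILE_THRESHOLD or more times.
  def record_else_execution
    @else_executions += 1
    @else_executions < COMPILE_THRESHOLD ? :deoptimize : :compile
  end
end
```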

Benchmarking and Impact

Benchmarking isn’t a perfect measure of how effective this change is (benchmarking is arguably never perfect, but that’s a different conversation), as the results would be too noisy to observe in a large project. However, I can still benchmark some pieces of code to see the improvements. By doing the “transfer to interpreter and invalidate”, time and space are saved in creating machine code for everything related to the right side.

With our new optimization this piece of code compiles about 6% faster and produces about 63% less machine code by memory (about half the number of assembly instructions). Faster compilation means more time for your app to run, and smaller machine code means less usage of memory and cache. Producing less machine code more quickly improves responsiveness and should in turn make the program run faster, though it's difficult to prove.

Function foo without optimization


Above is a graph of the foo method in the sample code above, without the optimization, that vaguely represents the logic present in the machine code. I can look at the actual compiler graphs produced by Graal at various stages to understand how exactly our code is being compiled, but this is the overview.

Each of the nodes in this graph expands to more control flow and memory access, which is why this optimization can impact the amount of machine code so much. This graph represents the uncommon case where the checks and call to the calculate_foo method are needed, so for lazy initialization it’ll only need this flow once or zero times.

Function foo with optimization

The graph that includes the optimization is a bit less complex. The control flow doesn’t need to know anything about variable assignment or anything related to calling and executing a method.

What I've added is just an optimization, so if you:

  • aren’t using ||= to mean lazy initialization
  • need to run the right-hand-side of the expression multiple times
  • need it to be fast

then the optimization goes away and the code is compiled as it would have been before (you can revisit the OrLazyValueDefinedNode source above to see the logic for this).

This optimization shows the benefit of looking at codebases used in industry for patterns that aren’t visible in the language specification. It’s also worth noting that none of the code changes here were very complicated, and they modified the code in a very modular way: other than the creation of the new node, only one other line was touched!

Truffle is actually named after the chocolates, partially in reference to the modularity of a box of chocolates. Apart from modularity, TruffleRuby is also easy to develop on as it's primarily written in Ruby and Java (there's some C in there for extensions).

Shopify is leading the way in experimenting with TruffleRuby for production applications. TruffleRuby is currently mirroring storefront traffic. This has helped us work through some bugs, build better tooling for TruffleRuby, and can lead to faster browsing for customers.

We also contribute to CRuby/MRI and Sorbet as a part of our work on Ruby. We like desserts, so along with contributions to TruffleRuby and Sorbet, we maintain Tapioca! If you'd like to become a part of our dessert medley (or work on other amazing Shopify projects), send us an application!

Additional Information

Tangentially related things about Ruby and TruffleRuby

If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.


Refactoring Legacy Code with the Strangler Fig Pattern


Large objects are a code smell: overloaded with responsibilities and dependencies, as they continue to grow, it becomes more difficult to define what exactly they’re responsible for. Large objects are harder to reuse and slower to test. Even worse, they cost developers additional time and mental effort to understand, increasing the chance of introducing bugs. Unchecked, large objects risk turning the rest of your codebase into a ball of mud, but fear not! There are strategies for reducing the size and responsibilities of large objects. Here’s one that worked for us at Shopify, an all-in-one commerce platform supporting over one million merchants across the globe. 

As you can imagine, one of the most critical areas in Shopify’s Ruby on Rails codebase is the Shop model. Shop is a hefty class with well over 3000 lines of code, and its responsibilities are numerous. When Shopify was a smaller company with a smaller codebase, Shop’s purpose was clearer: it represented an online store hosted on our platform. Today, Shopify is far more complex, and the business intentions of the Shop model are murkier. It can be described as a God Object: a class that knows and does too much.

My team, Kernel Architecture Patterns, is responsible for enforcing clean, efficient, scalable architecture in the Shopify codebase. Over the past few years, we invested a huge effort into componentizing Shopify’s monolithic codebase (see Deconstructing the Monolith) with the goal of establishing well-defined boundaries between different domains of the Shopify platform.

Not only is creating boundaries at the component-level important, but establishing boundaries between objects within a component is critical as well. It’s important that the business subdomain modelled by an object is clearly defined. This ensures that classes have clear boundaries and well-defined sets of responsibilities.

Shop’s definition is unclear, and its semantic boundaries are weak. Unfortunately, this makes it an easy target for the addition of new features and complexities. As advocates for clean, well-modelled code, it was evident that the team needed to start addressing the Shop model and move some of its business processes into more appropriate objects or components.

Using the ABC Code Metric to Determine Code Quality

Knowing where to start refactoring can be a challenge, especially with a large class like Shop. One way to find a starting point is to use a code metric tool. It doesn’t really matter which one you choose, as long as it makes sense for your codebase. Our team opted to use Flog, which uses a score based on the number of assignments, branches and calls in each area of the code to understand where code quality is suffering the most. Running Flog identified a particularly disordered portion in Shop: store settings, which contains numerous “global attributes” related to a Shopify store.

Refactoring Shop with the Strangler Fig Pattern

Extracting store settings into more appropriate components offered a number of benefits, notably better cohesion and comprehension in Shop and the decoupling of unrelated code from the Shop model. Refactoring Shop was a daunting task—most of these settings were referenced in various places throughout the codebase, often in components that the team was unfamiliar with. We knew we’d potentially make incorrect assumptions about where these settings should be moved to. We wanted to ensure that the extraction process was well laid out, and that any steps taken were easily reversible in case we changed our minds about a modelling decision or made a mistake. Guaranteeing no downtime for Shopify was also a critical requirement, and moving from a legacy system to an entirely new system in one go seemed like a recipe for disaster.

What is the Strangler Fig Pattern?

The solution? Martin Fowler’s Strangler Fig Pattern. Don’t let the name intimidate you! The Strangler Fig Pattern offers an incremental, reliable process for refactoring code. It describes a method whereby a new system slowly grows over top of an old system until the old system is “strangled” and can simply be removed. The great thing about this approach is that changes can be incremental, monitored at all times, and the chances of something breaking unexpectedly are fairly low. The old system remains in place until we’re confident that the new system is operating as expected, and then it’s a simple matter of removing all the legacy code.

That’s a relatively vague description of the Strangler Fig Pattern, so let’s break down the 7-step process we created as we worked to extract settings from the Shop model. The following is a macro-level view of the refactor.

Macro-level view of the Strangler Fig Pattern

We’ll dive into exactly what is involved in each step, so don’t worry if this diagram is a bit overwhelming to begin with.

Step 1: Define an Interface for the Thing That Needs to Be Extracted

Define the public interface by adding methods to an existing class, or by defining a new model entirely

The first step in the refactoring process is to define the public interface for the thing being extracted. This might involve adding methods to an existing class, or it may involve defining a new model entirely. This first step is just about defining the new interface; we’ll depend on the existing interface for reading data during this step. In this example, we’ll be depending on an existing Shop object and will continue to access data from the shops database table.

Let’s look at an example involving Shopify Capital, Shopify’s finance program. Shopify Capital offers cash advances and loans to merchants to help them kick-start their business or pursue their next big goal. When a merchant is approved for financing, a boolean attribute, locked_settings, is set to true on their store. This indicates that certain functionality on the store is locked while the merchant is taking advantage of a capital loan. The locked_settings attribute is being used by the following methods in the Shop class:

We already have a pretty clear idea of the methods that need to be involved in the new interface based on the existing methods that are in the Shop class. Let’s define an interface in a new class, SettingsToLock, inside the Capital component.
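A sketch of what such an interface might look like; the method names are assumptions, and at this step it still delegates to the existing Shop attribute:

```ruby
module Capital
  # Public interface for Capital's locked-settings behaviour.
  # For now it reads from and writes to the existing Shop object.
  class SettingsToLock
    def initialize(shop)
      @shop = shop
    end

    def locked?
      shop.locked_settings
    end

    def lock
      shop.update!(locked_settings: true)
    end

    def unlock
      shop.update!(locked_settings: false)
    end

    private

    attr_reader :shop
  end
end
```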

As previously mentioned, we’re still reading from and writing to a Shop object at this point. Of course, it’s critical that we supply tests for the new interface as well.

We’ve clearly defined the interface for the new system. Now, clients can start using this new interface to interact with Capital settings rather than going through Shop.

Step 2: Change Calls to the Old System to Use the New System Instead

Replace calls to the existing “host” interface with calls to the new system instead

Now that we have an interface to work with, the next step in the Strangler Fig Pattern is to replace calls to the existing “host” interface with calls to the new system instead. Any objects sending messages to Shop to ask about locked settings will now direct their messages to the methods we’ve defined in Capital::SettingsToLock.

In a controller for the admin section of Shopify, we have the following method:

This can be changed to:

A simple change, but now this controller is making use of the new interface rather than going directly to the Shop object to lock settings.

Step 3: Make a New Data Source for the New System If It Requires Writing

New data source

If data is written as a part of the new interface, it should be written to a more appropriate data source. This might be a new column in an existing table, or may require the creation of a new table entirely.

Continuing on with our existing example, it seems like this data should belong in a new table. There are no existing tables in the Capital component relevant to locked settings, and we’ve created a new class to hold the business logic—these are both clues that we need a new data source.

The shops table currently looks like this in db/schema.rb

We create a new table, capital_shop_settings_locks, with a column locked_settings and a reference to a shop.
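A migration along these lines would create the table (column details beyond locked_settings and the shop reference are assumptions):

```ruby
# Illustrative Rails migration; not runnable outside a Rails app.
class CreateCapitalShopSettingsLocks < ActiveRecord::Migration[6.0]
  def change
    create_table :capital_shop_settings_locks do |t|
      t.references :shop, null: false, index: { unique: true }
      t.boolean :locked_settings, null: false, default: false
      t.timestamps
    end
  end
end
```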

The creation of this new table marks the end of this step.

Step 4: Implement Writers in the New Model to Write to the New Data Source

Implement writers in the new model to write data to the new data source while also writing to the existing data source

The next step in the Strangler Fig Pattern is a bit more involved. We need to implement writers in the new model to write data to the new data source while also writing to the existing data source.

It’s important to note that while we have a new class, Capital::SettingsToLock, and a new table, capital_shop_settings_locks, these aren’t connected at the moment. The class defining the new interface is a plain old Ruby object that solely houses business logic. We are aiming to create a separation between the business logic of store settings and the persistence (or infrastructure) logic. If you’re certain that your model’s business logic is going to stay small and uncomplicated, feel free to use a single Active Record model. However, you may find that starting with a Ruby class separate from your infrastructure is simpler and faster to test and change.

At this point, we introduce a record object at the persistence layer. It will be used by the Capital::SettingsToLock class to read data from and write data to the new table. Note that the record class will effectively be kept private to the business logic class.

We accomplish this by creating a subclass of ApplicationRecord. Its responsibility is to interact with the capital_shop_settings_locks table we’ve defined. We define a class Capital::SettingsToLockRecord, map it to the table we’ve created, and add some validations on the attributes.
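Sketched out (the validation details are assumptions), the record class might look like:

```ruby
module Capital
  # Persistence-only Active Record class, kept private to the
  # business logic class. Not runnable outside a Rails app.
  class SettingsToLockRecord < ApplicationRecord
    self.table_name = "capital_shop_settings_locks"

    validates :shop_id, presence: true, uniqueness: true
    validates :locked_settings, inclusion: { in: [true, false] }
  end
end
```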

Let’s add some tests to ensure that the validations we’ve specified on the record model work as intended:

Now that we have Capital::SettingsToLockRecord to read from and write to the table, we need to set up Capital::SettingsToLock to access the new data source via this record class. We can start by modifying the constructor to take a repository parameter that defaults to the record class:

Next, let’s define a private getter, record. It performs find_or_initialize_by on the record model, Capital::SettingsToLockRecord, using shop_id as an argument to return an object for the specified shop.
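Together, the constructor and getter might look like this (a sketch; SettingsToLockRecord stands in for the record class described above, and tests can inject a fake repository):

```ruby
module Capital
  class SettingsToLock
    # The repository defaults to the Active Record class but can be
    # swapped for a test double.
    def initialize(shop, repository: SettingsToLockRecord)
      @shop = shop
      @repository = repository
    end

    private

    attr_reader :shop, :repository

    # Finds or builds the persistence record for this shop.
    def record
      @record ||= repository.find_or_initialize_by(shop_id: shop.id)
    end
  end
end
```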

Now, we complete this step in the Strangler Fig Pattern by starting to write to the new table. Since we’re still reading data from the original data source, we’ll need to write to both sources in tandem until the new data source is written to and has been backfilled with the existing data. To ensure that the two data sources are always in sync, we’ll perform the writes within transactions. Let’s refresh our memories on the methods in Capital::SettingsToLock that are currently performing writes.

After duplicating the writes and wrapping these double writes in transactions, we have the following:
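A sketch of the doubled writes (assuming the illustrative names used throughout; in the real code the transaction would span the Active Record classes involved):

```ruby
module Capital
  class SettingsToLock
    def initialize(shop, repository: SettingsToLockRecord)
      @shop = shop
      @repository = repository
    end

    # Double-write: update the new table and the legacy Shop column
    # inside one transaction so the two sources cannot drift apart.
    def lock
      repository.transaction do
        record.update!(locked_settings: true)
        shop.update!(locked_settings: true)
      end
    end

    def unlock
      repository.transaction do
        record.update!(locked_settings: false)
        shop.update!(locked_settings: false)
      end
    end

    private

    attr_reader :shop, :repository

    def record
      @record ||= repository.find_or_initialize_by(shop_id: shop.id)
    end
  end
end
```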

The last thing to do is to add tests that ensure that lock and unlock are indeed persisting data to the new table. We control the output of SettingsToLockRecord’s find_or_initialize_by, stubbing the method call to return a mock record.

At this point, we are successfully writing to both sources. That concludes the work for this step.

Step 5: Backfill the New Data Source with Existing Data

Backfill the data

The next step in the Strangler Fig Pattern involves backfilling data to the new data source from the old data source. While we’re writing new data to the new table, we need to ensure that all of the existing data in the shops table for locked_settings is ported over to capital_shop_settings_locks.

In order to backfill data to the new table, we’ll need a job that iterates over all shops and creates record objects from the data on each one. Shopify developed an open-source iteration API as an extension to Active Job. It offers safer iteration over collections of objects and is ideal for a scenario like this. There are two key methods in the iteration API: build_enumerator specifies the collection of items to be iterated over, and each_iteration defines the actions to be taken on each object in the collection. In the backfill task, we specify that we’d like to iterate over every shop record, and each_iteration contains the logic for creating or updating a Capital::SettingsToLockRecord object given a store. The alternative is to make use of Rails’ Active Job framework and write a simple job that iterates over the Shop collection.

Some comments about the backfill task: the first is that we’re placing a pessimistic lock on the Shop object prior to updating the settings record object. This is done to ensure data consistency across the old and new tables in a scenario where a double write occurs at the same time as a row update in the backfill task. The second thing to note is the use of a logger to output information in the case of a persistence failure when updating the settings record object. Logging is extremely helpful in pinpointing the cause of persistence failures in a backfill task such as this one, should they occur.
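Assuming the job-iteration gem's API (and illustrative class names and logging details), the backfill task might look roughly like this:

```ruby
# Illustrative backfill job; not runnable outside a Rails app with
# the job-iteration gem.
class BackfillCapitalSettingsLocksJob < ActiveJob::Base
  include JobIteration::Iteration

  # Iterate over every shop, resuming from a cursor if interrupted.
  def build_enumerator(cursor:)
    enumerator_builder.active_record_on_records(Shop.all, cursor: cursor)
  end

  def each_iteration(shop)
    # Pessimistic lock so a concurrent double-write can't race this row.
    shop.with_lock do
      record = Capital::SettingsToLockRecord
        .find_or_initialize_by(shop_id: shop.id)
      record.locked_settings = shop.locked_settings
      unless record.save
        Rails.logger.warn(
          "Failed to backfill settings lock for shop #{shop.id}: " \
          "#{record.errors.full_messages.join(', ')}"
        )
      end
    end
  end
end
```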

We include some tests for the job as well. The first tests the happy path and ensures that we're creating and updating settings records for every Shop object. The other tests the unhappy path, in which a settings record update fails, and ensures that the appropriate logs are generated.

After writing the backfill task, we enqueue it via a Rails migration:

Once the task has run successfully, we celebrate that the old and new data sources are in sync. It’s wise to compare the data from both tables to ensure that the two data sources are indeed in sync and that the backfill hasn’t failed anywhere.

Step 6: Change the Methods in the Newly Defined Interface to Read Data from the New Source

Change the reader methods to use the new data source

The remaining steps of the Strangler Fig Pattern are fairly straightforward. Now that we have a new data source that is up to date with the old data source and is being written to reliably, we can change the reader methods in the business logic class to use the new data source via the record object. With our existing example, we only have one reader method:

It’s as simple as changing this method to go through the record object to access locked_settings:

Step 7: Stop Writing to the Old Source and Delete Legacy Code

Remove the now-unused, “strangled” code from the codebase

We’ve made it to the final step in our code strangling! At this point, all objects are accessing locked_settings through the Capital::SettingsToLock interface, and this interface is reading from and writing to the new data source via the Capital::SettingsToLockRecord model. The only thing left to do is remove the now-unused, “strangled” code from the codebase.

In Capital::SettingsToLock, we remove the writes to the old data source in lock and unlock and get rid of the getter for shop. Let’s review what Capital::SettingsToLock looks like.

After the changes, it looks like this:
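The class now reads from and writes to the new data source only; a sketch of the end state (names as assumed throughout):

```ruby
module Capital
  # After strangling: the legacy Shop column is gone and everything
  # goes through the new record.
  class SettingsToLock
    def initialize(shop_id, repository: SettingsToLockRecord)
      @shop_id = shop_id
      @repository = repository
    end

    def locked?
      record.locked_settings
    end

    def lock
      record.update!(locked_settings: true)
    end

    def unlock
      record.update!(locked_settings: false)
    end

    private

    attr_reader :shop_id, :repository

    def record
      @record ||= repository.find_or_initialize_by(shop_id: shop_id)
    end
  end
end
```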

We can remove the tests in Capital::SettingsToLockTest that assert that lock and unlock write to the shops table as well.

Last but not least, we remove the old code from the Shop model, and drop the column from the shops table.

With that, we’ve successfully extracted a store settings column from the Shop model using the Strangler Fig Pattern! The new system is in place, and all remnants of the old system are gone.


In summary, we’ve followed a clear 7-step process known as the Strangler Fig Pattern to extract a portion of business logic and data from one model and move it into another:

  1. We defined the interface for the new system.
  2. We incrementally replaced reads to the old system with reads to the new interface.
  3. We defined a new table to hold the data and created a record for the business logic model to use to interface with the database.
  4. We began writing to the new data source from the new system.
  5. We backfilled the new data source with existing data from the old data source.
  6. We changed the readers in the new business logic model to read data from the new table.
  7. Finally, we stopped writing to the old data source and deleted the remaining legacy code.

The appeal of the Strangler Fig Pattern is evident. It reduces the complexity of the refactoring journey by offering an incremental, well-defined execution plan for replacing a legacy system with new code. This incremental migration to a new system allows for constant monitoring and minimizes the chances of something breaking mid-process. With each step, developers can confidently move towards a refactored architecture while ensuring that the application is still up and tests are green. We encourage you to try out the Strangler Fig Pattern with a small system that already has good test coverage in place. Best of luck in future code-strangling endeavors!


The Evolution of Kit: Automating Marketing Using Machine Learning


For many Shopify business owners, whether they’ve just started their entrepreneurial journey or already have an established business, marketing is one of the essential tactics to build an audience and drive sales to their stores. At Shopify, we offer various tools to help them do marketing. One of them is Kit, a virtual employee that can create easy Instagram and Facebook ads, perform email marketing automation, and offer many other skills through powerful app integrations. I’ll talk about the engineering decisions my team made to transform Kit from a rule-based system to an artificially intelligent assistant that looks at a business owner’s products, visitors, and customers to make informed recommendations for the next best marketing move.

As a virtual assistant, Kit interacts with business owners through messages over various interfaces including Shopify Ping and SMS. Designing the user experience for messaging is challenging especially when creating marketing campaigns that can involve multiple steps, such as picking products, choosing audience and selecting budget. Kit not only builds a user experience to reduce the friction for business owners in creating ads, but also goes a step further to help them create more effective and performant ads through marketing recommendation.

Simplifying Marketing Using Heuristic Rules

Marketing can be daunting, especially when the number of different configurations in the Facebook Ads Manager can easily overwhelm its users.

Facebook Ads Manager screenshot

There is a long list of settings that need configuring including objective, budget, schedule, audience, ad format and creative. For a lot of business owners who are new to marketing, understanding all these concepts is already time consuming, let alone making the correct decision at every decision point in order to create an effective marketing campaign.

Kit simplifies the flow by only asking for the necessary information and configuring the rest behind the scenes. The following is a typical flow on how a business owner interacts with Kit to start a Facebook ad.

Screenshot of the conversation flow on how a business owner interacts with Kit to start a Facebook ad

Kit simplifies the workflow into two steps: 1) pick products as ad creative and 2) choose a budget. We use heuristic rules based on our domain knowledge and give business owners limited options to guide them through the workflow. For products, we identify several popular categories that they want to market. For budget, we offer a specific range based on the spending behavior of the business owners we want to help. For the rest of configurations, Kit defaults to best practices removing the need to make decisions based on expertise.

The first version of Kit was a standalone application that communicated with Shopify to extract information such as orders and products to make product suggestions and interacted with different messaging channels to deliver recommendations conversationally.

System interaction diagram for heuristic rules based recommendation. There are two major systems that Kit interacts with: Shopify for product suggestions; messaging channels for communication with business owners

Building Machine Learning Driven Recommendation

One of the major limitations of the existing heuristic rules-based implementation is that the range of budget is hardcoded into the application, so every business owner has the same options to choose from. The static list of budget ranges may not fit their needs; the more established owners with store traffic and sales may want to spend more. In addition, for many of the business owners who don’t have enough marketing experience, it’s a tough decision to choose the right amount in order to generate the optimal return.

Kit strives to automate marketing by reducing steps when creating campaigns. We found that budgeting is one of the most impactful criteria in contributing to successful campaigns. By eliminating the decision from the configuration flow, we reduced the friction for business owners to get started. In addition, we eliminated the first step of picking products by generating a proactive recommendation for a specific category such as new products. Together, Kit can generate a recommendation similar to the following:

Screenshot of a one-step marketing recommendation

To generate this recommendation, there are two major decisions Kit has to make:

  1. How much is the business owner willing to spend?
  2. Given the current state of the business owner, will the budget be enough for them to successfully generate sales?

Kit decided that for business owner Cheryl, she should spend about $40 for the best chance to make sales given the new products marketing opportunity. From a data science perspective, it’s broken down into two types of machine learning problems:

  1. Regression: given a business owner’s historic spending behavior, predict the budget range that they’re likely to spend.
  2. Classification: given the budget a business owner has with store attributes such as existing traffic and sales that can measure the state of their stores, predict the likelihood of making sales.

The heuristic rules-based system allowed Kit to collect enough data to make solving these machine learning problems possible. Using the data we collected, Kit can generate actionable marketing recommendations that give business owners the best chance of making sales based on their budget range and the state of their stores.

The second version of Kit had its first major engineering revision by implementing the proactive marketing recommendation in the app through the machine learning architecture in Google Cloud Platform:

Flow diagram on generating proactive machine learning driven recommendation in Kit

There are two distinct flows in this architecture:

Training flow: Training is responsible for building the regression and classification models that are used in the prediction flow.

  1. Aggregate all relevant features. This includes the historic Facebook marketing campaigns created by business owners through Shopify, and the store state (e.g. traffic and sales) at the time when they create the marketing campaign.
  2. Perform feature engineering, a process using domain knowledge to extract useful features from the source data that are used to train the machine learning models. For historic marketing features, we derive features such as past 30 days average ad spend and past 30 days marketing sales. For shop state features, we derive features such as past 30 days unique visitors and past 30 days total orders. We take advantage of Apache Spark’s distributed computation capability to tackle the large scale Shopify dataset.
  3. Train the machine learning models using Google Cloud’s ML Engine. ML Engine allows us to train models using various popular frameworks including scikit-learn and TensorFlow.
  4. Monitor the model metrics. Model metrics are methods to evaluate the performance of a machine learning model by comparing the predicted values against the ground truth. Monitoring validates the integrity of the feature engineering and model training by comparing the model metrics against their historic values. The source features in Step 1 can sometimes be broken, leading to inaccurate feature engineering results. Even when the feature pipeline is intact, it's possible that the underlying data distribution changes due to unexpected new user behavior, leading to deteriorating model performance. A monitoring process is important to keep track of historic metrics and ensure the model performs as expected before making it available for use. We employed two types of monitoring strategies: 1) threshold: alert when the model metric is beyond a defined threshold; 2) outlier detection: alert when the model metric deviates from its normal distribution. We use z-score to detect outliers.
  5. Persist the models for prediction flow.
Prediction flow: Prediction is responsible for generating the marketing recommendation by optimizing for the budget and determining whether or not the ad will generate sales given the existing store state.
  1. Generate marketing recommendations by making predictions using the features and models prepared in the training flow.
  2. Send recommendations to Kit through Apache Kafka.
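The two monitoring strategies from step 4 can be sketched in a few lines. The metric values below are hypothetical; the real pipeline would compare model metrics such as AUC or RMSE against their history:

```python
# Minimal sketch of the two monitoring strategies: a fixed threshold and
# z-score outlier detection. Metric values are made up for illustration.
from statistics import mean, stdev

def threshold_alert(metric: float, limit: float) -> bool:
    """Strategy 1: alert when the model metric is beyond a defined threshold."""
    return metric > limit

def zscore_alert(metric: float, history: list, max_z: float = 3.0) -> bool:
    """Strategy 2: alert when the metric deviates from its normal distribution."""
    z = (metric - mean(history)) / stdev(history)
    return abs(z) > max_z

# e.g. the classifier's daily AUC over the past six runs
history = [0.71, 0.73, 0.70, 0.72, 0.74, 0.71]
alert = zscore_alert(0.40, history)  # a sudden drop is flagged as an outlier
```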

At Shopify, we have a data platform engineering team to maintain the data services required to implement both the training and prediction flows. This allows the product team to focus on building the domain specific machine learning pipelines, prove product values, and iterate quickly.

Moving to Real Time Prediction Architecture

Looking back at our example featuring business owner Cheryl, Kit decided that she can spend $40 for the best chance of making sales. In marketing, making sales is often not the first step in a business owner’s journey, especially when they don’t have any existing traffic to their store. Acquiring new visitors to the store in order to build lookalike audiences that are more relevant to the business owner is a crucial step to expand the audience size in order to create more successful marketing campaigns afterward. For this type of business owner, Kit evaluates the budget based on a different goal and suggests a more appropriate amount to acquire enough new visitors in order to build the lookalike audience. This is how the recommendation looks:

Screenshot of a recommendation to build lookalike audience

To generate this recommendation, there are three major decisions Kit has to make:

  1. How many new visitors does the business owner need in order to create lookalike audiences?
  2. How much are they willing to spend?
  3. Given the current state of the business owner, will the budget be enough for them to acquire those visitors?

Decisions two and three are solved using the same machine learning architecture described previously. However, this recommendation introduces a new complexity: decision one must determine the required number of new visitors to build lookalike audiences. Since the traffic to a store can change in real time, the prediction flow needs to process the request at the time the recommendation is delivered to the business owner.

One major limitation of the Spark-based prediction flow is that recommendations are optimized in a batch manner rather than on demand, i.e., the prediction flow is triggered from Spark on a schedule rather than from Kit at the time the recommendation is delivered to business owners. With the Spark batch setting, it's possible that the budget recommendation is already stale by the time it's delivered to the business owner. To solve that problem, we built a real time prediction service to replace the Spark prediction flow.

Flow diagram on generating real time recommendation in Kit

One major distinction compared to the previous Spark-based prediction flow is that Kit is proactively calling into the real time prediction service to generate the recommendation.

  1. Based on the business owner’s store state, Kit decides that their marketing objective should be building lookalike audiences. Kit sends a request to the prediction service to generate a budget recommendation; the request reaches an HTTP API exposed by the web container component.
  2. Similar to the batch prediction flow in Spark, the web container generates marketing recommendations by making predictions using the features and models prepared in the training flow. However, there are several design considerations:
    1. We need to ensure efficient access to the features to minimize prediction request latency. Therefore, once features are generated during the feature engineering stage, they are immediately loaded into a key-value store, Google Cloud Bigtable.
    2. Model prediction can be computationally expensive, especially when the model architecture is complex. We use Google’s TensorFlow Serving, a flexible, high-performance serving system for machine learning models designed for production environments. TensorFlow Serving also provides out-of-the-box integration with TensorFlow models, so it can directly consume the models generated by the training flow with minimal configuration.
    3. Since the heavy-lifting CPU/GPU-bound model prediction operations are dedicated to TensorFlow Serving, the web container remains a lightweight application that holds the business logic to generate recommendations. We chose Tornado as the Python web framework. By using non-blocking network I/O, Tornado can scale to tens of thousands of open connections for model predictions.
  3. Model predictions are delegated to the TensorFlow Serving container.
  4. The TensorFlow Serving container preloads the machine learning models generated during the training flow and uses them to perform model predictions upon request.
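As a rough sketch of the web container's side of this exchange, the request it builds for TensorFlow Serving's REST API looks something like the following. The host, model name, and feature values are hypothetical; 8501 is TensorFlow Serving's default REST port, and the API expects a POST to /v1/models/&lt;name&gt;:predict with an "instances" list:

```python
# Sketch of building a predict request for TensorFlow Serving's REST API.
# Host, model name, and feature values are hypothetical.
import json

def build_predict_request(host: str, model_name: str, features: list):
    url = f"http://{host}:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [features]})
    return url, body

url, body = build_predict_request(
    "tf-serving", "budget_recommender",
    [40.0, 500, 10],  # e.g. budget, past 30 days visitors, past 30 days orders
)
```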

Powering One Third of All Kit Marketing Campaigns

Kit started as a heuristic rules-based application that uses common best practices to simplify and automate marketing for Shopify’s business owners. We progressively improved the user experience by building machine learning driven recommendations to further reduce user friction and to optimize budgets, giving business owners a higher chance of creating a more successful campaign. By first using a well-established Spark-based prediction process (one that’s well supported within Shopify), we showed the value of machine learning in driving user engagement and marketing results. This also allowed us to focus on productionalizing an end-to-end machine learning pipeline with both training and prediction flows that serve tens of thousands of business owners.

We learned that having a proper monitoring component in place is crucial to ensure the integrity of the overall machine learning system. We moved to an advanced real time prediction architecture to solve use cases that required time-sensitive recommendations. Although the real time prediction service introduced two additional containers (web and TensorFlow Serving) to maintain, we delegated the most heavy-lifting model prediction component to TensorFlow Serving, which is a well supported service by Google and integrates with Shopify’s existing cloud infrastructure easily. This ease of use allowed us to focus on defining and implementing the core business logic to generate marketing recommendation in the web container.

Moving to a machine learning driven implementation has proven valuable: one third of the marketing campaigns in Kit are powered by machine learning driven recommendations. Kit will continue to improve its marketing automation skills by optimizing for different marketing tactics and objectives in order to support business owners’ diverse needs.

We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Data Science & Engineering career page.


Creating Native Components That Accept React Native Subviews

React Native adoption has been steadily growing since its release in 2015, especially with its ability to quickly create cross-platform apps. A very strong open-source community has formed, producing great libraries like Reanimated and Gesture Handler that allow you to achieve native performance for animations and gestures while writing exclusively React Native code. At Shopify we are using React Native for many different types of applications, and are committed to giving back to the community.

However, sometimes there’s a native component you’ve already built for another app, or one that already exists on the platform, that you want to quickly port to React Native and can’t build cross-platform using exclusively React Native. The documentation for React Native has good examples of how to create a native module that exposes native methods or components, but what should you do if you want to use a component you already have and render React Native views inside of it? In this guide, I’ll show you how to make a native component that provides bottom sheet functionality to React Native and lets you render React views inside of it.

A simple example is the bottom sheet pattern from Google’s Material Design. It’s a draggable view which peeks up from the bottom of the screen and is able to expand to take up the full screen. It renders subviews inside of the sheet, which can be interacted with when the sheet is expanded.

This guide only focuses on an Android native implementation and assumes a basic knowledge of Kotlin. When creating an application, it’s best to make sure all platforms have the same feature parity.

Bottom sheet functionality


Setting Up Your Project

If you already have a React Native project set up for Android with Kotlin and TypeScript, you’re ready to begin. If not, you can run react-native init NativeComponents --template react-native-template-typescript in your terminal to generate a project that is ready to go.

As part of the initial setup, you’ll need to add some Gradle dependencies to your project.

Modify the root build.gradle (android/build.gradle) to include these lines:

Make sure to substitute your current Kotlin version in the place of 1.3.61.

This will add all of the required libraries for the code used in the rest of this guide.

You should use fixed version numbers instead of + for actual development.

Creating a New Package Exposing the Native Component

To start, you need to create a new package that will expose the native component. Create a file called NativeComponentsReactPackage.kt.

Right now this doesn’t actually expose anything new, but you’ll add to the list of View Managers soon. After creating the new package, go to your Application class and add it to the list of packages.

Creating The Main View

A ViewGroupManager<T> can be thought of as a React Native version of ViewGroup from Android. It accepts any number of children provided, laying them out according to the constraints of the type T specified on the ViewGroupManager.

Create a file called ReactNativeBottomSheet.kt containing a new ViewGroupManager.

The basic methods you have to implement are getName() and createViewInstance().

name is what you’ll use to reference the native class from React Native.

createViewInstance is used to instantiate the native view and do initial setup.

Inflating Layouts Using XML

Before you create a real view to return, you need to set up a layout to inflate. You can set this up programmatically, but it’s much easier to inflate from an XML layout.

Here’s a fairly basic layout file that sets up some CoordinatorLayouts with behaviours for interacting with gestures. Add this to android/app/src/main/res/layout/bottom_sheet.xml.

The first child is where you’ll put all of the main content for the screen, and the second is where you’ll put the views you want inside BottomSheet. The behaviour is defined so that the second child can translate up from the bottom to cover the first child, making it appear like a bottom sheet.

Now that there is a layout created, you can go back to the createViewInstance method in ReactNativeBottomSheet.kt.

Referencing The New XML File

First, inflate the layout using the context provided from React Native. Then save references to the children for later use.

If you aren’t using Kotlin Synthetic Properties, you can do the same thing with container = findViewById(R.id.container).

For now, this is all you need to initialize the view and have a fully functional bottom sheet.

The only thing left to do in this class is to manage how the views passed from React Native are actually handled.

Handling Views Passed from React Native To Native Android

By overriding addView you can change where the views are placed in the native layout. The default implementation is to add any views provided as children to the main CoordinatorLayout. However, that won’t have the effect expected, as they’ll be siblings to the bottom sheet (the second child) you made in the layout.

Instead, don’t make use of super.addView(parent, child, index) (the default implementation), but manually add the views to the layout’s children by using the references stored earlier.

The basic idea followed is that the first child passed in is expected to be the main content of the screen, and the second child is the content that’s rendered inside of the bottom sheet. Do this by simply checking the current number of children on the container. If you already added a child, add the next child to the bottomSheet.

The way this logic is written, any views passed after the first one will be added to the bottom sheet. You’re designing this class to only accept two children, so you’ll make some modifications later.

This is all you need for the first version of our bottom sheet. At this point, you can run react-native run-android, successfully compile the APK, and install it.

Referencing the New Native Component in React Native

To use the new native component in React Native you need to require it and export a normal React component. Also set up the props here, so it will properly accept a style and children.

Create a new component called BottomSheet.tsx in your React Native project and add the following:

Now you can update your basic App.tsx to include the new component.

This is all the code that is required to use the new native component. Notice that you're passing it two children. The first child is the content used for the main part of the screen, and the second child is rendered inside of our new native bottom sheet.

Adding Gestures

Now that there's a working native component that renders subviews from React Native, you can add some more functionality.

Being able to interact with the bottom sheet through gestures is our main use case for this component, but what if you want to programmatically collapse/expand the bottom sheet?

Since you’re using a CoordinatorLayout with behaviour to make the bottom sheet in native code, you can make use of BottomSheetBehaviour. Going back to ReactNativeBottomSheet.kt, we will update the createViewInstance() method.

By creating a BottomSheetBehaviour you can make more customizations to how the bottom sheet functions and when you’re informed about state changes.

First, add a native method which specifies what the expanded state of the bottom sheet should be when it renders.

This adds a prop to our component called sheetState which takes a string and sets the collapsed/expanded state of the bottom sheet based on the value sent. The string sent should be either collapsed or expanded.

We can adapt our TypeScript to accept this new prop like so:

Now, when you include the component, you can change whether it’s collapsed or expanded without touching it. Here’s an example of updating your App.tsx to add a button that updates the bottom sheet state.

Now, pressing the button expands the bottom sheet. However, when it’s expanded, the button disappears. If you drag the bottom sheet back down to a collapsed state, you'll notice that the button isn't updating its text. So you can set the state programmatically from React Native, but interacting with the native component isn't propagating the value of the bottom sheet's state back into React. To fix this, you will add more to the BottomSheetBehaviour you created earlier.

This code adds a state change listener to the bottom sheet, so that when its collapsed/expanded state changes, you emit a React Native event that you listen to in the React component. The event is called "BottomSheetStateChange" and has the same value as the states accepted in setSheetState().

Back in the React component, you listen to the emitted event and call an optional listener prop to notify the parent that our state has changed due to a native interaction.


Updating the App.tsx again

Now when you drag the bottom sheet, the state of the button updates with its collapsed/expanded state.

Native Code And Cross Platform Components

When creating components in React Native our goal is always to make cross-platform components that don’t require native code to perform well, but sometimes that isn’t possible or easy to do. By creating ViewGroupManager classes, we are able to extend the functionality of our native components so that we can take full advantage of React Native’s flexible layouts, with very little code required.

Additional Information

All the code included in the guide can be found at the react-native-bottom-sheet-example repo.

This guide is just an example of how to create native views that accept React Native subviews as children. If you want a complete implementation for bottom sheets on Android, check out the react-native wrapper for android BottomSheetBehavior.

You can follow the Android guideline for CoordinatorLayout and BottomSheetBehaviour to better understand what is going on. You’re essentially creating a container with two children.

If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.


Your Circuit Breaker is Misconfigured

Circuit breakers are an incredibly powerful tool for making your application resilient to service failure. But they aren’t enough. Most people don’t know that a slightly misconfigured circuit is as bad as no circuit at all! Did you know that a change in 1 or 2 parameters can take your system from running smoothly to completely failing?

I’ll show you how to predict how your application will behave in times of failure and how to configure every parameter for your circuit breaker.

At Shopify, resilient fallbacks integrate into every part of the application. A fallback is a backup behavior that activates when a particular component or service is down. For example, when Shopify’s Redis, which stores sessions, is down, the user doesn’t have to see an error. Instead, the problem is logged and the page renders with sessions soft disabled. This results in a much better customer experience. This behavior is achievable in many cases; however, it’s not as simple as catching exceptions that are raised by a failing service.

Imagine Redis is down and every connection attempt is timing out. Each timeout is 2 seconds long. Response times will be incredibly slow, since requests are waiting for the service to timeout. Additionally, during that time the request is doing nothing useful and will keep the thread busy.

Utilization, the percentage of a worker’s maximum available working capacity, increases indefinitely as the request queue builds up, resulting in a utilization graph like this:

Utilization during service outage

A worker that had a request processing rate of 5 requests per second can now only process half a request per second. That’s a tenfold decrease in throughput! With utilization this high, the service can be considered completely down. This is unacceptable for production level standards.
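That collapse can be sanity-checked with a back-of-the-envelope calculation. The 0.2 seconds of useful work per request is an assumption implied by the 5 requests/second figure:

```python
# Back-of-the-envelope check of the throughput collapse during a Redis outage.
useful_work = 0.2       # seconds of processing per request when Redis is healthy
redis_timeout = 2.0     # seconds each request now blocks waiting for Redis

healthy_throughput = 1 / useful_work                     # 5 requests/second
degraded_throughput = 1 / (useful_work + redis_timeout)  # ~0.45 requests/second
slowdown = healthy_throughput / degraded_throughput      # ~11x, roughly tenfold
```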

Semian Circuit Breaker

At Shopify, this fallback utilization problem is solved by Semian Circuit Breaker. This is a circuit breaker implementation written by Shopify. The circuit breaker pattern is based on a simple observation: if a timeout is observed for a given service one or more times, it’s likely to continue to timeout until that service recovers. Instead of hitting the timeout repeatedly, the resource is marked as dead and an exception is raised instantly on any call to it.

I'm looking at this from the configuration perspective of the Semian circuit breaker, but another notable circuit breaker library is Hystrix by Netflix. The core functionality of their circuit breaker is the same; however, it has fewer available parameters for tuning, which means, as you will learn below, it can completely lose its effectiveness for capacity preservation.

A circuit breaker can take the above utilization graph and turn it into something more stable.

Utilization during service outage with a circuit breaker

The utilization climbs for some time before the circuit breaker opens. Once open, the utilization stabilizes, so the user may only experience some slight request delays, which is much better.

Semian Circuit Breaker Parameters

Configuring a circuit breaker isn’t a trivial task. It’s seemingly trivial because there are just a few parameters to tune: 

  • name
  • error_threshold
  • error_timeout
  • half_open_resource_timeout
  • success_threshold

However, these parameters cannot just be assigned to arbitrary numbers or even best guesses without understanding how the system works in detail. Changes to any of these parameters can greatly affect the utilization of the worker during a service outage.

At the end, I'll show you a configuration change that drops the utilization requirement from 263% to 4%. That’s the difference between a complete outage and a slight delay. But before I get to that, let’s dive into detail about what each parameter does and how it affects the circuit breaker.


Name

The name identifies the resource being protected. Each name gets its own personal circuit breaker. Every different service type, such as MySQL, Redis, etc., should have its own unique name to ensure that excessive timeouts in one service only open the circuit for that service.

There is an additional aspect to consider here. The worker may be configured with multiple service instances for a single type. In certain environments, there can be dozens of Redis instances that a single worker can talk to.

We would never want a single Redis instance outage to cause all Redis connections to go down, so we must give each instance a different name.

For this example, see the diagram below. We will model a total of 3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”.

3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”

You must understand how many services your worker can talk to. Each failing service will have an aggregating effect on the overall utilization. When going through the examples below, the maximum number of failing services you would like to account for is defined by failing_services. For example, if you have 3 Redis instances, but you only need to know the utilization when 2 of those go down, failing_services should be 2.

All examples and diagrams in this post are from the reference frame of a single worker. None of the circuit breaker state is shared across workers so we can simplify things this way.

Error Threshold

The error_threshold defines the number of errors to encounter within a given timespan before opening the circuit and starting to reject requests instantly. If the circuit is closed and error_threshold number of errors occur within a window of error_timeout, the circuit will open.

The larger the error_threshold, the longer the worker will be stuck waiting for input/output (I/O) before reaching the open state. The following diagram models a simple scenario where we have a single Redis instance failure.

error_threshold = 3, failing_services = 1

A single Redis instance failure

3 timeouts happen one after the other for the failing service instance. After the third, the circuit becomes open and all further requests raise instantly.

3 timeouts must occur during the timespan before the circuit becomes open. Simple enough, 3 times the timeout isn’t so bad. The utilization will spike, but the service will reach steady state soon after. This graph is a real world example of this spike at Shopify:

A real world example of a utilization spike at Shopify

The utilization begins to increase when the Redis service goes down; after a few minutes, the circuit begins opening for each failing service and the utilization lowers to a steady state.

Furthermore, if there’s more than 1 failing service instance, the spike will be larger, last longer, and cause more delays for end users. Let’s come back to the example from the Name section with 3 separate Redis instances. Consider all 3 Redis instances being down. Suddenly the time until all circuits are open triples.

error_threshold = 3, failing_services = 3

3 failing services and each service has 3 timeouts before the circuit opens

There are 3 failing services and each service has 3 timeouts before the circuit opens. All the circuits must become open before the worker will stop being blocked by I/O.

Now, we have a longer time to reach steady state because each circuit breaker wastes utilization waiting for timeouts. Imagine 40 Redis instances instead of 3: a timeout of 1 second and an error_threshold of 3 means there’s a minimum time of around 2 minutes to open all the circuits.

The reason this estimate is a minimum is because the order that the requests come in cannot be guaranteed. The above diagram simplifies the scenario by assuming the requests come in a perfect order.
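The two-minute estimate follows directly from the parameters. It is a lower bound because it assumes the worst case where no timeouts for different instances overlap from the worker's point of view:

```python
# Minimum time for a single worker to open every circuit: each of the 40
# circuits must observe error_threshold timeouts of 1 second each.
instances = 40        # Redis instances, each with its own named circuit
error_threshold = 3   # timeouts needed to open one circuit
timeout = 1.0         # seconds per timed-out request

minimum_time_to_open_all = instances * error_threshold * timeout  # 120 seconds
```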

To keep the initial utilization spike low, the error_threshold should be reduced as much as possible. However, the probability of false positives must be considered. Blips can cause the circuit to open despite the service not being down. The lower the error_threshold, the higher the probability of a false-positive circuit open.

Assume a steady state timeout error rate of 0.1% within your time window of error_timeout. An error_threshold of 3 will then give you a 0.0000001% chance of getting a false positive:

100 × (probability_of_failure)^number_of_failures = 100 × (0.001)^3 = 0.0000001%
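The same arithmetic as code, using the 0.1% steady-state timeout rate and an error_threshold of 3 from the text:

```python
# Worked false-positive probability: the chance that error_threshold
# consecutive-window errors occur purely by chance.
probability_of_failure = 0.001  # 0.1% timeout error rate per request
error_threshold = 3             # errors needed within the window to open

false_positive_percent = 100 * probability_of_failure ** error_threshold
# ~1e-07 percent, i.e. 0.0000001%
```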

You must balance this probability with your error_timeout to reduce the number of false-positive circuit opens. When the circuit opens, it will instantly raise for every request made during error_timeout.

Error Timeout

The error_timeout is the amount of time until the circuit breaker will try to query the resource again. It also determines the period to measure the error_threshold count. The larger this value is, the longer the circuit will take to recover after an outage. The larger this value is, the longer a false positive circuit open will affect the system.

error_threshold = 3, failing_services = 1

The circuit will stay open for error_timeout amount of time

After the failing service causes the circuit to open, the circuit stays open for error_timeout amount of time. The Redis instance comes back to life and error_timeout amount of time passes, so requests start being sent to Redis again.

It’s important to consider the error_timeout in relation to half_open_resource_timeout. These 2 parameters are the most important for your configuration. Getting these right will determine the success of the circuit breaker’s resiliency mechanism in times of outage for your application.

Generally, we want to minimize the error_timeout because the higher it is, the longer the recovery time. However, the primary constraints come from its interaction with half_open_resource_timeout. I’ll show you that maximizing error_timeout will actually preserve worker utilization.

Half Open Resource Timeout

The circuit is in the half-open state when it’s checking to see if the service is back online. It does this by letting a real request through. The circuit becomes half-open after error_timeout amount of time has passed. When the protected service is completely down for an extended period of time, a steady-state behavior arises.

failing_services = 1

Circuit becomes half-open after error_timeout amount of time has passed

The error_timeout expires but the service is still down. The circuit becomes half-open and a request is sent to Redis, which times out. The circuit opens again and the process repeats as long as Redis remains down.

This flip flop between the open and half-open state is periodic which means we can deterministically predict how much time is wasted on timeouts.

By this point, you may already be speculating on how to reduce wasted utilization. The error_timeout can be increased to reduce the total time wasted in the half-open state! Awesome, but the higher it goes, the slower your application will be to recover. Furthermore, false positives will keep the circuit open for longer. Not good, especially if we have many service instances: 40 Redis instances with a timeout of 1 second is 40 seconds wasted on timeouts every cycle!

So how else do we minimize the time wasted on timeouts? The only other option is to reduce the service timeout. The lower the service timeout, the less time is wasted on waiting for timeouts. However, this cannot always be done. Adjusting this timeout is highly dependent on how long the service needs to provide the requested data. We have a fundamental problem here. We cannot reduce the service timeout because of application constraints and we cannot increase the error_timeout because the recovery time will be too slow.

Enter half_open_resource_timeout, the timeout for the resource when the circuit is in the half-open state. It’s used instead of the original timeout. Simple enough! Now we have another tunable parameter to help adjust utilization. To reduce wasted utilization, error_timeout and half_open_resource_timeout can be tuned together. The smaller half_open_resource_timeout is relative to error_timeout, the better the utilization will be.

If we have 3 failing services, our circuit diagram looks something like this:

failing_services = 3

A total of 3 timeouts before all the circuits are open

In the half-open state, each service has 1 timeout before the circuit opens. With 3 failing services, that’s a total of 3 timeouts before all the circuits are open. All the circuits must become open before the worker will stop being blocked by I/O.

Let’s solidify this example with the following timeout parameters:

error_timeout = 5 seconds
half_open_resource_timeout = 1 second

The total steady state period will be 8 seconds, with 5 of those seconds spent doing useful work and the other 3 wasted waiting on I/O. That’s 37.5% of total utilization wasted on I/O.
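Expressed as a quick calculation, assuming the three failing Redis instances from the earlier example and that the half-open probes don't overlap:

```python
# Steady-state half-open cycle: error_timeout seconds of useful work while the
# circuits are open, then one blocking probe per failing service.
error_timeout = 5.0               # seconds the circuits stay open between probes
half_open_resource_timeout = 1.0  # seconds each half-open probe blocks
failing_services = 3

cycle = error_timeout + failing_services * half_open_resource_timeout    # 8.0 s
wasted_fraction = failing_services * half_open_resource_timeout / cycle  # 0.375
```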

Note: Hystrix does not have an equivalent parameter for half_open_resource_timeout which may make it impossible to tune a usable steady state for applications that have a high number of failing_services.

Success Threshold

The success_threshold is the number of consecutive successes required for the circuit to close again, that is, to start accepting all requests to the circuit.

The success_threshold impacts the behavior during outages with an error rate of less than 100%. Imagine a resource with an error rate of 90%: with a success_threshold of 1, the circuit will flip flop between open and closed quite often. In this case, there’s a 10% chance of it closing when it shouldn’t. Flip flopping also adds additional strain on the system, since the circuit must waste time on I/O before it re-opens.

Instead, if we increase the success_threshold to 3, the likelihood of the circuit closing prematurely becomes significantly lower. Now, 3 successes must happen in a row to close the circuit, reducing the chance of flip flopping to 0.1% per cycle.
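The flip-flop arithmetic is simple: with a 10% per-request success rate, every one of the success_threshold consecutive probes must succeed for the circuit to close prematurely. A quick sketch:

```ruby
success_rate = 0.1  # the resource is failing 90% of the time

# Probability the circuit closes prematurely in a given half-open cycle:
# all success_threshold consecutive probes must succeed.
chance_of_premature_close = ->(success_threshold) { success_rate**success_threshold }

chance_of_premature_close.call(1)  # ~0.1   (10% per cycle)
chance_of_premature_close.call(3)  # ~0.001 (0.1% per cycle)
```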

Note: Hystrix does not have an equivalent parameter for success_threshold which may make it difficult to reduce the flip flopping in times of partial outage for certain applications.

Lowering Wasted Utilization

Each parameter affects wasted utilization in some way. Semian can easily be configured into a state where a service outage will consume more utilization than the capacity allows. To calculate the additional utilization required, I have put together an equation that models all of the parameters of the circuit breaker. Use it to plan for outages effectively.

The Circuit Breaker Equation


This equation applies to the steady state failure scenario in the last diagram where the circuit is continuously checking the half-open state. Additional threads reduce the time spent on blocking I/O, however, the equation doesn’t account for the time it takes to context switch a thread which could be significant depending on the application. The larger the context switch time, the lower the thread count should be.
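The equation itself appears as an image in the original post, so its exact form isn't reproduced here. As a sketch, the following simplified reading (a reconstruction consistent with the worked numbers later in this post, not necessarily the author's exact formula) ignores the context-switch overhead noted above:

```ruby
# Hypothetical helper: additional utilization required during steady-state
# failure, expressed as a fraction of the useful (circuit-open) time.
# Reconstructed from the worked examples in this post; ignores the cost of
# thread context switches.
def additional_utilization(failing_services:, half_open_resource_timeout:,
                           error_timeout:, worker_threads: 1)
  (failing_services * half_open_resource_timeout) /
    (worker_threads * error_timeout)
end

# The "Tuning Your Circuit" scenario below: 42 Redis circuits, 2 threads.
before = additional_utilization(failing_services: 42,
                                half_open_resource_timeout: 0.25,
                                error_timeout: 2.0, worker_threads: 2)
# ~2.625, i.e. roughly 263% extra utilization

after = additional_utilization(failing_services: 42,
                               half_open_resource_timeout: 0.05,
                               error_timeout: 30.0, worker_threads: 2)
# ~0.035, i.e. roughly 4% extra utilization
```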

I ran a live test to check the validity of the equation, and the utilization observed closely matched the utilization predicted by the equation.

Tuning Your Circuit

Let’s run through an example and see how the parameters can be tuned to match the application needs. In this example, I’m integrating a circuit breaker for a Rails worker configured with 2 threads. We have 42 Redis instances, each configured with its own circuit and a service timeout of 0.25s.

As a starting point, let’s go with the following parameters. Failing instances is 42 because we are judging behaviour in the worst case, when all of the Redis instances are down.

Parameter  Value
service timeout  0.25 seconds
error_timeout  2 seconds
half_open_resource_timeout  0.25 seconds (same as service timeout)

Plugging these into the Circuit Breaker Equation, we find we require an extra 263% of utilization. Unacceptable! Ideally, we should have something less than 30% to account for regular traffic variation.

So what do we change to drop this number?

From production observation metrics, I know that 99% of Redis requests have a response time of less than 50ms. With a value this low, we can safely drop the half_open_resource_timeout to 50ms and still be confident that the circuit will close when Redis comes back up from an outage. Additionally, we can increase the error_timeout to 30 seconds. This means a slower recovery time, but it reduces the worst-case utilization.

With these new numbers, the additional utilization required drops to 4%!

I use this equation as something concrete to relate back to when making tuning decisions. I hope this equation helps you with your circuit breaker configuration as it does with mine.

Author's Edit: "I fixed an error with the original circuit breaker equation in this post. success_threshold does not have an impact on the steady state utilization because it only takes 1 error to keep the circuit open again."

If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

Great Code Reviews—The Superpower Your Team Needs

There is a general consensus that code reviews are an important aspect of highly effective teams. This research paper is one of many exploring this subject. Most organizations undergo code reviews of some form.

However, it’s all too common to see code reviews that barely scratch the surface, or that offer feedback that is unclear or hard to act upon. This robs the team of the opportunity to speed up learning, share knowledge and context, and raise the quality bar on the resulting code.

At Shopify, we want to move fast while building for the long term. In our experience, having strong code review practices has a huge impact on the growth of our engineers and in the quality of the products we build.

A Scary Scenario

Imagine you join a new team and you’re given a coding task to work on. Since you’re new on the team, you really want to show what you’re made of. You want to perform. So, this is what you do:

  1. You work frantically on your task for 3 weeks.
  2. You submit a Pull Request for review with about 1000 new lines of code.
  3. You get a couple comments about code style and a question that shows the person has no clue what this work is about.
  4. You get approval from both reviewers after fixing the code style and answering the question.
  5. You merge your branch into master, eyes closed, shoulders tense, grinding your teeth. After a few minutes, CI completes. Master is not broken. Yet.
  6. You live in fear for 6 months, not knowing when and how your code will break.

You may have lived through some of the situations above, and hopefully you’ve seen some of the red flags in that process.

Let’s talk about how we can make it much better.

Practical Code Review Practices

At Shopify, we value the speed of shipping, learning, and building for the long term. These values - which sometimes conflict - lead us to experiment with many techniques and team dynamics. In this article, I have distilled a series of very practical techniques we use at Shopify to ship valuable code that can stand the test of time.

A Note about terminology: We refer to Pull Requests (PR) as one unit of work that's put forth for review before merging into the base branch. Github and Bitbucket users will be familiar with this term.

1. Keep Your Pull Requests Small

As simple as this sounds, this is easily the most impactful technique you can follow to level up your code review workflow. There are 2 fundamental reasons why this works:

  • It’s mentally easier to start and complete a review for a small piece. Larger PRs will naturally make reviewers delay and procrastinate examining the work, and they are more likely to be interrupted mid-review.
  • As a reviewer, it’s exponentially harder to dive deep if the PR is long. The more code there is to examine, the bigger the mental map we need to build to understand the whole piece.

Breaking up your work in smaller chunks increases your chances of getting faster and deeper reviews.

Now, it’s impossible to set one universal standard that applies to all programming languages and all types of work. Internally, for our data engineering work, the guideline is around 200-300 lines of code affected. If we go above this threshold, we almost always break up the work into smaller blocks.

Of course, we need to be careful about breaking up PRs into chunks that are too small, since this means reviewers may need to inspect several PRs to understand the overall picture.

2. Use Draft PRs

Have you heard the metaphor of building a car vs. drawing a car? It goes something like this:

  1. You’re asked to build a car.
  2. You go away for 6 months and build a beautiful Porsche.
  3. When you show it to your users, they ask about space for their 5 children and the surf boards.

Clearly, the problem here is that the goal is poorly defined, and the team jumped directly into the solution before gathering enough feedback. If after step 1 we had created a drawing of the car and shown it to our users, they would have asked the same questions, and we would have discovered their expectations and saved ourselves 6 months of work. Software is no different—we can make the same mistake and work for a long time on a feature or module that isn't what our users need.

At Shopify, it’s common practice to use Work In Progress (WIP) PRs to elicit early feedback aimed at validating direction (choice of algorithm, design, API, etc.). Early changes mean less wasted effort on details, polish, documentation, and so on.

As an author, this means you need to be open to changing the direction of your work. At Shopify, we try to embrace the principle of strong opinions, loosely held. We want people to make decisions confidently, but also be open to learning new and better alternatives, given sufficient evidence. In practice, we use Github’s Draft PRs—they clearly signal the work is still in flux, and Github prevents you from merging a Draft PR. Other tools may have similar functionality, but at the very least you can create normal PRs with a clear WIP label to indicate the work is early stage. This will help your reviewers focus on offering the right type of feedback.

3. One PR Per Concern

In addition to line count, another dimension to consider is how many concerns your unit of work is trying to address. A concern may be a feature, a bugfix, a dependency upgrade, an API change, etc. Are you introducing a new feature while refactoring at the same time? Fixing two bugs in one shot? Introducing a library upgrade and a new service?

Breaking down PRs into individual concerns has the following effects:

  • More independent review units and therefore better review quality
  • Fewer affected people, and therefore fewer domains of expertise to gather
  • Atomicity of rollbacks, the ability of rolling back a small commit or PR. This is valuable because if something goes wrong, it will be easier to identify where errors were introduced and what to roll back.
  • Separating easy stuff from hard stuff. Imagine a new feature that requires refactoring a frequently used API. You change the API, update a dozen call-sites, and then implement your feature. 80% of your changes are obvious and skimmable with no functional changes, while 20% are new code that needs careful attention to test coverage, intended behaviour, error handling, etc. and will likely go through multiple revisions. With each revision, the reviewer will need to skim through all of the changes to find the relevant bits. By splitting this in two PRs, it becomes easy to quickly land the majority of the work and to optimize the review effort applied to the harder work.

If you end up with a PR that includes more than one concern, you can break it down into individual chunks. Doing so will accelerate the iteration cycle on each individual review, giving a faster review overall. Often part of the work can land quickly, avoiding code rot and merge conflicts.

Breaking down PRs into individual concerns


In the example above, we’ve taken a PR that covered three different concerns and broke it up. You can see how each reviewer has strictly less context to go over. Best of all, as soon as any of the reviews is complete, the author can begin addressing feedback while continuing to wait for the rest of the work. In the most extreme cases, instead of completing a first draft, waiting several days (and shifting focus), and then eventually returning to address feedback, the author can work almost continuously on their family of PRs as they receive the different reviews asynchronously.

4. Focus on the Code, Not the Person

The “focus on the code, not the person” practice refers to communication styles and relationships between people. Fundamentally, it’s about trying to focus on making the product better, and avoiding the author perceiving a review as personal criticism.

Here are some tips you can follow:

  • As a reviewer, think, “This is our code, how can we improve on it?”
  • Offer positive remarks! If you see something done well, comment on it. This reinforces good work and helps the author balance suggestions for improvement.
  • As an author, assume best intention, and don’t take comments personally.

Below are a few examples of not-so-great review comments, and a suggestion on how we can reword to emphasize the tips above.

  • Less of these: “Move this to Markdown.” More of these: “How about moving this documentation into our Markdown README file? That way we can more easily share with other users.”
  • Less of these: “Read the Google Python style guidelines.” More of these: “We should avoid single-character variables. How about board_size or size instead?”
  • Less of these: “This feels too slow. Make it faster. Lightning fast.” More of these: “This algorithm is very easy to read but I’m concerned about performance. Let’s test this with a large dataset to gauge its efficiency.”
  • Less of these: “Bool or int?” More of these: “Why did you choose a list of bool values instead of integers?”

Ultimately, a code review is a learning and teaching opportunity and should be celebrated as such.

5. Pick the Right People to Review

It’s often challenging to decide who should review your work. Here are some questions you can use as guidance:

  • Who has context on the feature or component you’re building?
  • Who has strong skills in the language, framework, or tool you’re using?
  • Who has strong opinions on the subject?
  • Who cares about the result of what you’re doing?
  • Who should learn this stuff? Or if you’re a junior reviewing someone more senior, use this as an opportunity to ask questions and learn. Ask all the silly questions; a strong team will find the time to share knowledge.

Whatever rules your team might have, remember that it is your responsibility as an author to seek and receive a high-quality code review from a person or people with the right context.

6. Give Your Reviewers a Map

Last but definitely not least, the description on your PR is crucial. Depending on who you picked for review, different people will have different context. The onus is on the author to help reviewers by providing key information or links to more context so they can produce meaningful feedback.

Some questions you can include in your PR templates:

  • Why is this PR necessary?
  • Who benefits from this?
  • What could go wrong?
  • What other approaches did you consider? Why did you decide on this approach?
  • What other systems does this affect?

Good code is not only bug-free; it is also useful! As an author, ensure that your PR description ties your code back to your team’s objectives, ideally with link to a feature or bug description in your backlog. As a reviewer, start with the PR description; if it’s incomplete, send it back before attempting to judge the suitability of the code against undefined objectives. And remember, sometimes the best outcome of a code review is to realize that the code isn’t needed at all!

What’s the Benefit?

By adopting some of the techniques above, you can have a strong impact on the speed and quality of your software building process. But beyond that, there’s the potential for a cultural effect:

  • Teams will build a common understanding. The group understands your work better and you’re not the only person capable of evolving any one area of the codebase.
  • Teams will adopt a sense of shared responsibility. If something breaks, it’s not one person’s code that needs fixing. It’s the team’s work that needs fixing.

Any one person in a team should be able to take a holiday and disconnect from work for a number of days without risking the business or stressing about checking email to make sure the world didn’t end.

What Can I Do to Improve My Team’s Code Review Process?

If you lead teams, start experimenting with these techniques and find what works for your team.

If you’re an individual contributor, discuss with your lead why you think code review techniques are important, how they improve effectiveness, and how they help your team.

Bring it up in your next 1:1 or your next team sync.

The Importance of Code Reviews

To close, I’ll share some words from my lead, which summarizes the importance of Code Reviews:

“We could prioritize landing mediocre but working code in the short term, and we will write the same debt-ridden code forever, or we can prioritize making you a stronger contributor, and all of your future contributions will be better (and your career brighter).

An enlightened author should be delighted to have this attention.”

We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Data Science & Engineering career page.

Bug Bounty Year in Review 2019

For the third year in a row, we’ve taken time to reflect on our Bug Bounty program. This past year was an exciting one for us because we ran multiple experiments and made a number of process improvements to increase our program speed. 

2020 Program Improvements

Building on our program’s continued success in 2019, we’re excited to announce more improvements. 

Bounties Paid in Full Within 7 Days

As of today, we pay bounties in full within 7 days of a report being triaged. Paying our program minimum on triage has been a resounding success for us and our hackers. After having experimented with paying full bounties on triage in Shopify-Experiments (described below), we’ve decided to make the same change to our public program.

Maximum Bounty is Now $50,000

We are increasing our maximum bounty amount to $50,000. Beginning today, we are doubling the bounty amounts for valid reports of Arbitrary Code Execution, now $20K–$50K; SQL Injection, now $20K–$40K; and Privilege Escalation to Shop Owner, now $10K–$30K. Trust and security are our number one priority at Shopify, and these new amounts demonstrate our commitment to both.

Surfacing More Information About Duplicate Reports

Finally, we know how important it is for hackers to trust the programs they choose to work with. We value that trust. So, beginning today, anyone who files a duplicate report to our program will be added to the original report, when it exists within HackerOne. We're continuing to explore ways to share information about internally known issues with hackers and hope to have a similar announcement later this year.

Learning from Bug Bounty Peers

Towards the end of 2018, we reached out to other bug bounty programs to share experiences and lessons learned. This was amazing. We learned so much chatting with our peers and those conversations gave us better insight into improving our data analytics and experimenting with a private program.

Improving Our Analytics

At Shopify, we make data-informed decisions and our bug bounty program is no exception. However, HackerOne platform data only gives us insight into what hackers are reporting and when; it doesn’t tell us who is testing what and how often. Discussing this problem with other programs revealed how some had already tackled this obstacle; they were leveraging provisioned accounts to understand their program funnel, from invitation, to registration, to account creation, and finally testing. Hearing this, we realized we could do the same.

To participate in our bug bounty program, we have always required hackers to register for an account with a specific identifier (currently a @wearehackerone.com email address). Historically, we used that registration requirement for investigating reports of suspicious platform activity. However, we realized that the same data could tell us how often people are testing our applications. Furthermore, with improvements to the HackerOne API and the ability to export all of our report data regularly, we have all the data necessary to create exciting activity reports and program trends. It’s also given us more knowledge to share in our monthly program recap tweets.

Shopify-Experiments, A Private Bug Bounty Program

Chatting with other programs, we also shared ideas about what is and isn’t working. We heard about some having success running additional private programs. Naturally, we launched a private bug bounty program to test the return on investment. We started Shopify-Experiments in mid-2019 and invited high signal, high impact hackers who have reported to our program previously or who have a proven track record on the HackerOne platform. The program allowed us to run controlled experiments aimed at improving our public program. For example, in 2019, we experimented with:

  • expanding the scope to help us better understand the workload implications
  • paying bounties in full after validating and triaging a report
  • making report disclosure mandatory and adding hackers to duplicate reports
  • allowing for self-closing reports that were submitted in good faith, but were false positives
  • increasing opportunities to collaborate with Shopify third party developers to test their apps.

These experiments had immediate benefits for our Application Security Team and the Shopify public program. For instance, after running a controlled experiment with an expanded scope, we understood the workload it would entail in our public program. So, on September 11, 2019, we added all but a few Shopify-developed apps into the scope of our public program. Since then, we’ve received great reports about these new assets, such as Report 740989 from Vulnh0lic, which identified a misconfiguration in our OAuth implementation for the Shopify Stocky app. If you’re interested in being added to the program, all it takes is 3 resolved Shopify reports with an overall signal of 3.0 or more in our program.

Improving Response Times with Automation

In 2018, our average initial response time was 17 hours. In 2019, we wanted to do better. Since we use a dedicated Slack channel to manage incoming reports, it made sense to develop a chatbot and use the HackerOne API. In January last year, we implemented HackerOne API calls to change report states, assign reports, post public and private comments as well as suggest bounty amounts.

This immediately made it easier to respond to reports from mobile devices. However, our chosen syntax was difficult to remember. For example, changing a report state was done via the command hackerone change_state <report_id> <state>. Responding with an auto response was hackerone auto_respond <report_id> <state> <response_id>. To make things easier, we introduced shorthands and emoji responses. Now, instead of typing hackerone change_state 123456 not-applicable, we can use h1 change_state 123456 na. For common invalid reports, we react with emojis, which post the appropriate common response and close the report as not applicable.
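A shorthand layer like this can be a small lookup step in front of the chatbot's command dispatcher. The sketch below is hypothetical: only the change_state/na mapping appears in this post, and the other shorthands are purely illustrative.

```ruby
# Hypothetical shorthand expansion for chatbot commands. Only "na" is taken
# from this post; "info" and "dup" are illustrative placeholders.
STATE_SHORTHANDS = {
  "na"   => "not-applicable",
  "info" => "needs-more-info",
  "dup"  => "duplicate"
}.freeze

def expand_command(input)
  tokens = input.split
  # Accept "h1" as an alias for "hackerone".
  tokens[0] = "hackerone" if tokens[0] == "h1"
  # Expand state shorthands for change_state commands.
  if tokens[1] == "change_state" && STATE_SHORTHANDS.key?(tokens[3])
    tokens[3] = STATE_SHORTHANDS[tokens[3]]
  end
  tokens.join(" ")
end

expand_command("h1 change_state 123456 na")
# => "hackerone change_state 123456 not-applicable"
```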

2019 Bug Bounty Statistics

Knowing how important communication is to our hackers, we continue to pride ourselves on all of our response metrics being among the best on HackerOne. For another straight year, we reduced our communication times. Including weekends, our average time to first response was 16 hours compared to 1 day and 9 hours in 2018. This was largely a result of being able to quickly close invalid reports on weekends with Slack. We reduced our average time to triage from 3 days and 6 hours in 2018 to 2 days and 13 hours in 2019.

We were quicker to pay bounties and resolve bugs; our average time to bounty from submission was 7 days and 1 hour in 2019 versus 14 days in 2018. Our average resolution time from time of triage was down to 20 days and 3 hours from 48 days and 15 hours in 2018. Lastly, we thanked 88 hackers in 2019, compared to 86 in 2018.

Average Shopify Response Times - Hours vs. Year

We continued to request disclosure on our resolved bugs. In 2019, we disclosed 74 bugs, up from 37 in 2018. We continue to believe it’s extremely important that we build a resource library to enable ethical hackers to grow in our program. We strongly encourage other companies to do the same.

Reports Disclosed - Number of Reports vs. Year

In contrast to our speed improvements and disclosures, our bounty related statistics were down from 2018, largely a result of having hosted H1-514 in October 2018, which paid out over $130,000 to hackers. Our total amount paid to hackers was down to $126,100 versus $296,400 in 2018, despite having received approximately the same number of reports; 1,379 in 2019 compared to 1,306 in 2018.

Bounties Paid - Bounties Awarded vs. Year

Number of Reports by Year - Number of Reports vs. Year

Report States by Year - Number of Reports vs. Year

Similarly, our average bounty awarded was also down in 2019, $1,139 compared to $2,052 in 2018. This is partly attributed to the amazing bugs found at H1-514 in October 2018 and our decision to merge the Shopify Scripts bounty program, which had a minimum bounty of $100, to our core bounty program in 2019. We rewarded bounties to fewer reports; 107 in 2019 versus 182 in 2018.

After another successful year in 2019, we’re excited to work with more hackers in 2020. If you’re interested in helping to make commerce more secure, visit hackerone.com/shopify to start hacking or our careers page to check out our open Trust and Security positions.

Happy Hacking.
- Shopify Trust and Security

React Native is the Future of Mobile at Shopify

After years of native mobile development, we’ve decided to go full steam ahead building all of our new mobile apps using React Native. As I’ll explain, that decision doesn’t come lightly.

Each quarter, the majority of buyers purchase on mobile (with 71% of our buyers purchasing on mobile in Q3 of last year). Black Friday and Cyber Monday (together, BFCM) are the busiest time of year for our merchants, and buying activity during those days is a bellwether. During this year’s BFCM, Shopify merchants saw another 3% increase in purchases on mobile, an average of 69% of sales.

So why the switch to React Native? And why now? How does this fit in with our native mobile development? It’s a complicated answer that’s best served with a little background.

Mobile at Shopify Pre-2019

We have an engineering culture at Shopify of making specific early technology bets that help us move fast.

On the whole, we prefer to have few technologies as a foundation for engineering. This provides us multiple points of leverage:

  • we build extremely specific expertise in a small set of deep technologies (we often become core contributors)
  • every technology choice has quirks, but we learn them intimately
  • those outside of the initial team contribute, transfer and maintain code written by others
  • new people are onboarded more quickly.

At the same time, there are always new technologies emerging that provide an opportunity for a step change in productivity or capability. We experiment a lot for the chance to unlock order-of-magnitude improvements—but ultimately, we adopt few of these for our core engineering.

When we do adopt these early languages or frameworks, we make a calculated bet. Instead of shying away from the risk, we meticulously research, explore, and evaluate it based on our unique set of conditions. As is often the case, the biggest opportunities are hidden within the riskier, unexplored areas. We then think about how we can mitigate that risk:

  • what if a technology stops being supported by the core team?
  • what if we run into a bug we can’t fix?
  • what if the product goes in a direction against our interests?

Ruby on Rails was a nascent and obscure framework when Tobi (our CEO) first got involved as a core contributor in 2004. For years, Ruby on Rails was seen as a non-serious, non-performant technology choice. But that early bet gave Shopify the momentum to outperform the competition even though it was not a popular choice. By using Ruby on Rails, the team was able to build faster and attract a different set of talent by using something more modern and with a higher level of abstraction than traditional programming languages and frameworks. Paul Graham talks about his decision to use Lisp in building Viaweb to similar effect, and 6 of the 10 most valuable Y Combinator companies today use Ruby on Rails (even though, again, it remains largely unpopular). In contrast, none of the top 10 most valuable Y Combinator companies use Java, largely considered the battle-tested enterprise language.

Similarly, two years ago, Shopify decided to make the jump to Google Cloud. Again, a scary proposition for the 3rd largest US retail eCommerce site in 2019: not only migrating away from our own data centers, but also picking an early cloud contender. We saw the technology arc of value creation moving us toward focusing on what we’re good at (enabling entrepreneurship) and letting others (in this case, Google Cloud) focus on the undifferentiated heavy lifting of maintaining physical hardware, power, security, operating system updates, etc.

What is React Native?

In 2015, Facebook announced and open sourced React Native; it was already being used internally for their mobile engineering. React Native is a framework for building native mobile apps using React. This means you can use a best-in-class JavaScript library (React) to build your native mobile user interfaces.

At Shopify, the idea had its skeptics at the time (and still does), but many saw its promise. At the company’s next Hackdays the entire company spent time on React Native. While the early team saw many benefits, they decided that we couldn’t ship an app we’d be proud of using React Native in 2015. For the most part, this had to do with performance and the absence of first-class Android support. What we did learn was that we liked the Reactive programming model and GraphQL. Also, we built and open-sourced a functional renderer for iOS after working with React Native. We adopted these technologies in 2015 for our native mobile stack, but not React Native for mobile development en masse. The Globe and Mail documented our aspirations in a comprehensive story about the first version of our mobile apps.

Until now, the standard for all mobile development at Shopify was native mobile development. We built mobile tooling and foundations teams focused on iOS and Android helping accelerate our development efforts. While these teams and the resulting applications were all successful, there was a suspicion that we could be more effective as a team if we could:

  • bring the power of JavaScript and the web to mobile
  • adopt a reactive programming model across all client-side applications
  • consolidate our iOS and Android development onto a single stack.

How React Native Works

React Native provides a way to build native cross-platform mobile apps using JavaScript. Like React, it allows developers to create declarative user interfaces in JavaScript, from which it internally creates a hierarchy tree of UI elements (in React terminology, a virtual DOM). Whereas ReactJS targets a browser, React Native translates the virtual DOM into native mobile views using platform-native bindings that interface with the application logic in JavaScript. For our purposes, the target platforms are Android and iOS, but community-driven efforts have brought React Native to other platforms such as Windows, macOS, and Apple tvOS.

ReactJS targets a browser, whereas React Native can target mobile APIs.

When Will We Not Default to Using React Native?

There are situations where React Native would not be the default option for building a mobile app at Shopify. For example, if we have a requirement of:

  • deploying on older hardware (CPU <1.5GHz)
  • extensive processing
  • ultra-high performance
  • many background threads.

Reminder: low-level libraries, including many open-sourced SDKs, will remain purely native. And we can always create our own native modules when we need to be close to the metal.

Why Move to React Native Now?

There are three main reasons why now is a great time to take this stance:

  1. we learned from our 2018 acquisition of Tictail (a mobile-first company focused 100% on React Native) how far React Native had come, and we made three deep product investments in 2019
  2. Shopify uses React extensively on the web and that know-how is now transferable to mobile
  3. we see the performance curve bending upwards (think what’s now possible in Google Docs vs. desktop Microsoft Office) and we can long-term invest in React Native like we do in Ruby, Rails, Kubernetes and Rich Media.

Mobile at Shopify in 2019

We have many mobile surfaces at Shopify where buyers and merchants interact, both over the web and with our mobile apps. We spent time over the last year experimenting with React Native with three separate teams over three apps: Arrive, Point of Sale, and Compass.

From our experiments we learned that:

  • in rewriting the Arrive app in React Native, the team felt they were twice as productive as with native development—even just on one mobile platform
  • testing our Point of Sale app on low-power configurations of Android hardware let us set a lower CPU threshold than previously imagined (1.5GHz vs. 2GHz)
  • we estimated ~80% code sharing between iOS and Android, and were surprised by the extremely high levels we saw in practice—95% (Arrive) and 99% (Compass)

As an aside, even though we’re making the decision to build all new apps using React Native, that doesn’t mean we’ll automatically start rewriting our old apps in React Native.


Arrive

At the end of 2018, we decided to rewrite one of our most popular consumer apps, Arrive, in React Native. Arrive is no slouch: it’s a highly rated, high-performing app with millions of downloads on iOS. It was a good candidate because we didn’t have an Android version, and our efforts would help us reach all of the Android users who were clamoring for Arrive. It’s now React Native on both iOS and Android and shares 95% of the same code. We’ll do a deep dive into Arrive in a future blog post.

So far this rewrite resulted in:

  • fewer crashes on iOS than our native iOS app
  • an Android version launched
  • a team composed of both mobile and non-mobile developers.

The team also came up with this cool way to instantly test work-in-progress pull requests. You simply scan a QR code from an automated GitHub comment on your phone and the JavaScript bundle is updated in your app and you’re now running the latest code from that pull request. JML, our CTO, shared the process on Twitter recently.

Point of Sale

At the beginning of 2019, we did a 6-week experiment on our flagship Point of Sale (POS) app to see if it would be a good candidate for a rewrite in React Native. We learned a lot, including that our retail merchants expect almost 2x the responsiveness in our POS due to the muscle memory of using our app while also talking to customers.

In order to best serve our retail merchants and learn about React Native in a physical retail setting, we decided to build out the new POS natively for iOS and use React Native for Android.

We went ahead with 2 teams for the following reasons:

  1. we already had a team ramped up with iOS expertise, including many of the folks that built the original POS apps
  2. we wanted to be able to benchmark our React Native engineering velocity as well as app performance against the gold standard which is native iOS
  3. to meet the high performance requirements of our merchants, we felt that we’d need all of Facebook’s re-architecture updates to React Native before launch (as it turns out, they weren’t critical to our performance use cases). Having two teams on two platforms de-risked our ability to launch.

We announced a complete rewrite of POS at Unite 2019. Look for both the native iOS and React Native Android apps to launch in 2020!


Compass

The Start team at Shopify is tasked with helping folks new to entrepreneurship. Before the company-wide decision to write all mobile apps in React Native came about, the team did a deep dive into native, Flutter, and React Native as possible technology choices. They chose React Native and now have iOS and Android apps (in beta) live in the app stores.

The first versions of Compass (both iOS and Android) were launched within 3 months with ~99% of the code shared between iOS and Android.

Mobile at Shopify 2020+

We have lots in store for 2020.

Will we rewrite our native apps? No. That’s a decision each app team makes independently.

Will we continue to hire native engineers? Yes, LOTS!

We want to contribute to core React Native, build platform specific components, and continue to understand the subtleness of each of the platforms. This requires deep native expertise. Does this sound like you?

Partnering and Open Source

We believe that building software is a team sport. We have a commitment to the open web, open source and open standards.

We’re sponsoring Software Mansion and Krzysztof Magiera (co-founder of React Native for Android) in their open source efforts around React Native.

We’re working with William Candillon (host of Can It Be Done in React Native) for architecture reviews and performance work.

We’ll be partnering closely with the React Native team at Facebook on automation, 3rd party libraries and stewardship of some modules via Lean Core.

We are working with Discord to accelerate the open sourcing of FastList for React Native (a library which only renders list items that are in the viewport) and optimizing for Android.

Developer Tooling and Foundations for React Native

When you make a bet and go deep into a technology, you want to gain maximum leverage from that choice. In order for us to build fast and get the most leverage, we have two types of teams that help the rest of Shopify build quickly. The first is a tooling team that helps with engineering setup, integration and deployment. The second is a foundations team that focuses on SDKs, code reuse and open source. We’ve already begun spinning up both of these teams in 2020 to focus on React Native.

Our popular Shopify Ping app, which has enabled hundreds of thousands of customer conversations, is currently iOS-only. In 2020, we’ll be building the Android version using React Native out of our San Francisco office, and we’re hiring.

In 2019, Twitter released their desktop and mobile web apps using something called React Native Web. While this might seem confusing, it allows you to use the same React Native stack for your web app as well. As a result, Facebook promptly hired Nicolas Gallagher, the lead engineer on the project. At Shopify we’ll be doing some React Native Web experiments in 2020.

Join Us

Shopify is always hiring sharp folks in all disciplines. Given our particular stack (Ruby on Rails/React/React Native) we’ve always invested in people even if they don’t have this particular set of experiences coming in to Shopify. In mobile engineering (btw, I love this video about engineering opinions) we’ll continue to write mobile native code and hire native engineers (iOS and Android).

In addition we are looking for a Principal Mobile Developer to work with me directly across the mobile portfolio at Shopify. This person has a track record of excellence, can solve extremely complex technical challenges and can help Shopify to become an industry and technology leader in React Native. If this sounds like you, message me directly farhanATshopify.com!

Farhan Thawar is VP Engineering for Channels and Mobile at Shopify
Twitter: @fnthawar


Scaling Mobile Development by Treating Apps as Services


Scaling development without slowing down the delivery speed of new features is a problem that companies face when they grow. Speed can be achieved through better tooling, but the bigger the teams and projects, the more tooling they need. When projects and teams use different tools to solve similar problems, it gets harder for tooling teams to create one solution that works for everybody. Additionally, it complicates knowledge sharing and makes it difficult for developers to contribute to other projects. This lack of knowledge and developers is magnified during incident response because only a handful of people have enough context and understanding of the system to mitigate and fix issues.

At Shopify, we believe in highly aligned, but loosely coupled teams—teams working independently from each other while sharing the same vision and goals—that move fast and minimize slowdowns in productivity. To continue working towards this goal, we designed tools to share processes and best practices that ease collaboration and code sharing. With tools, teams ship code fast while maintaining quality and productivity. Tooling worked efficiently for our web services, but we lacked something similar for mobile projects. Tools enforce processes that increase quality, reliability and reproducibility. A few examples include using

  • continuous integration (CI) and automated testing
  • continuous delivery to release new versions of the software
  • containers to ensure that the software runs in a controlled environment.

Treating Apps as Services

Last year, the Mobile Tooling team shipped tools helping mobile developers be more productive, but we couldn’t enforce the usage of those tools. Moreover, checking which tools mobile apps used required digging into configuration files and scripts spread across different project repositories. We have several mobile apps available for download across Google Play and the App Store, so this approach didn’t scale.

Fortunately, Shopify has a tool that enforces tool usage and we extended it to our mobile projects. ServicesDB tracks all production services running at Shopify and has three major goals:

  1. keep track of all running services across Shopify
  2. define what it means to own a service, and what the expectations are for an owner
  3. provide tools for owners to improve the quality of the infrastructure around their services.

ServicesDB allows us to treat apps as services with an owner, and for which we define a set of expectations. We specify, in a configuration file, the information we need to codify best practices, which allows us to check for things such as

  • Service Ownership: each project must be owned by a team or an individual, and they must be responsible for its maintenance and development. A team is accountable for any issues or requests that might arise in regards to the app.
  • Contact Information: Slack channels people use if they need more information about a certain mobile app. We also use those channels to notify teams about their projects not meeting the required checks.
  • Testing and Deployment Configuration: CI and our mobile deployment tool, Shipit Mobile, are properly configured. This check is essential because we need to be able to release a new version of our apps at any time.
  • Versioning: Apps use the latest version of our internal tools. With this check we make sure that our dependencies don’t contain known security vulnerabilities.
  • Monitoring: Bug tracking services configured to check for errors and crashes that are happening in production.
ServicesDB performs checks on one of our mobile apps
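To make the idea concrete, here’s a minimal, hypothetical sketch of what one such check could look like in Ruby. The `Project` struct and `OwnershipCheck` class are illustrative stand-ins, not Shopify’s actual ServicesDB implementation:

```ruby
# Hypothetical sketch of a ServicesDB-style check (illustrative names only).
# A check inspects a project's configuration and reports pass/fail, with
# instructions for fixing a failure.

Project = Struct.new(:name, :owner, :slack_channel, :ci_configured, keyword_init: true)

class OwnershipCheck
  DESCRIPTION = "Every app must declare an owning team and a contact Slack channel."

  def run(project)
    if project.owner && project.slack_channel
      { status: :pass }
    else
      # A failing check carries the instructions used in the GitHub issue.
      { status: :fail, instructions: DESCRIPTION }
    end
  end
end

project = Project.new(name: "Arrive", owner: "arrive-team",
                      slack_channel: "#arrive", ci_configured: true)
result = OwnershipCheck.new.run(project)
puts result[:status]  # => pass
```

A real system would run a list of such checks against every registered app and open an issue for each failure.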

ServicesDB defines a contract with each development team through automatic checks for tooling requirements on mobile projects. These checks mitigate the problem of understanding how projects are configured and which tools they use, which keeps teams highly aligned, but loosely coupled. Now the Mobile Tooling team can see at a glance whether a project can use our tooling, and developers can understand why some tools don’t work with their projects and how to fix them, as every check provides a description of how to make it pass. Some common issues are using an outdated Ruby version or not having a bug tracking tool configured. If any check fails, we automatically create an issue on GitHub to notify the team that they aren’t meeting the contract.

GitHub issue created when a check fails. It contains instructions to fix the failure.

Abstracting Tooling and Configuration Away

If you want to scale development efficiently, you need to be opinionated about the tools supported. Through ServicesDB we detect misconfigured projects, notify their owners, and help them to fix those issues. At the end of the day, we don’t want our mobile developers to think about tooling and configurations. Our goal is to make commerce better for everyone, so we want people to spend time solving commerce problems that provide a better experience to both buyers and entrepreneurs.

At the moment, we’ve only implemented some basic checks, but in the future we plan to define service level objectives for mobile apps and develop better tools for easing the creation of new projects and reducing build times, all while being confident that they will work as long as the defined contract is satisfied.

Intrigued? Shopify is hiring and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.


How to Implement a Secure Central Authentication Service in Six Steps


As Shopify merchants grow in scale they will often introduce multiple stores into their organization. Previously, this meant that staff members had to be invited to multiple stores to setup their accounts. This introduced administrative friction and more work for the staff users who had to manage multiple accounts just to do their jobs.

We created a new service to handle centralized authentication and user identity management called, surprisingly enough, Identity. Having a central authentication service within Shopify was accomplished by building functionality on the OpenID Connect (OIDC) specification. Once we had this system in place, we built a solution to reliably and securely allow users to combine their accounts to get the benefit of single sign-on. Solving this specific problem involved a team comprising product management, user experience, engineering, and data science working together with members spread across three different cities: Ottawa, Montreal, and Waterloo.

The Shop Model

Shopify is built so that all the data belonging to a particular store (called a Shop in our data model) lives in a single database instance. The data includes core commerce objects like Products, Orders, Customers, and Users. The Users model represents the staff members who have access, with specific permissions, to the administration interface for a particular Shop.

Shop Commerce Object Relationships

User authentication and profile management belonged to the Shop itself, which worked as long as your use of Shopify never went beyond a single store. As soon as a merchant organization expanded to using multiple stores, the experience involved more overhead for both the person managing store users and the individual users. You had to sign into each store independently, as there was no single sign-on (SSO) capability: Shops don’t share any data with each other. Users also had to manage their profile data, password, and two-step authentication on each store they had access to.

Shop isolation of users

Modelling User Accounts Within Identity

User accounts modelled within our Identity service come in two important types: Identity accounts and Legacy accounts. A service or application that a user can access via OIDC is modelled as a Destination within Identity. Examples of destinations within Shopify would be stores, the Partners dashboard, or our Community discussion forums.

A Legacy account only has access to a single store and an Identity account can be used to access multiple destinations.

Legacy account model: one destination per account. Can only access Shops

We ensured that new accounts are created as Identity accounts and that existing users with legacy accounts can be safely and securely upgraded to Identity accounts. The big problem was combining multiple legacy accounts together. When a user used the same email to sign into several different Shopify stores, we combined these accounts into a single Identity account without blocking their access to any of the stores they used.

Combined account model: each account can have access to multiple destinations

There were six steps needed to get us to a single account to rule them all.

  1. Synchronize data from existing user accounts into a central Identity service.
  2. Have all authentication go through the central Identity service via OpenID Connect.
  3. Prompt users to combine their accounts together.
  4. Prompt users to enable a second factor (2FA) to protect their account.
  5. Create the combined Identity account.
  6. Prevent new legacy accounts from being created.

1. Synchronize Data From Existing User Accounts Into a Central Identity Service

We ensured that all user profile and security credential information was synchronized from the stores, where it's managed, into the centralized Identity service. This meant synchronizing data from the store to the Identity service every time one of the following user events occurred

  • creation
  • deletion
  • profile data update
  • security data update (password or 2FA).
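In a Rails codebase this synchronization would typically hang off model lifecycle callbacks; the plain-Ruby sketch below illustrates the idea. `IdentitySync` and `ShopUser` are hypothetical stand-ins for the real store models and synchronization client:

```ruby
# Plain-Ruby sketch: every user lifecycle event on a shop is forwarded to the
# central Identity service. `IdentitySync` is a hypothetical stand-in for the
# real synchronization client (in production it would enqueue a job that
# calls the Identity service).

class IdentitySync
  def self.events
    @events ||= []
  end

  def self.push(event, user)
    events << [event, user[:uuid]]
  end
end

class ShopUser
  def initialize(uuid:, email:)
    @attrs = { uuid: uuid, email: email }
    IdentitySync.push(:created, @attrs)          # creation
  end

  def update_profile(email:)
    @attrs[:email] = email
    IdentitySync.push(:profile_updated, @attrs)  # profile data update
  end

  def update_password(digest)
    @attrs[:password_digest] = digest
    IdentitySync.push(:security_updated, @attrs) # security data update
  end

  def destroy
    IdentitySync.push(:deleted, @attrs)          # deletion
  end
end

user = ShopUser.new(uuid: "abc-123", email: "a@example.com")
user.update_profile(email: "b@example.com")
puts IdentitySync.events.map(&:first).inspect  # => [:created, :profile_updated]
```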

2. Have All Authentication Go Through the Central Identity Service Via OpenID Connect (OIDC)

OpenID Connect is an identity layer built on top of the OAuth 2.0 protocol, and it’s the method used to delegate authentication from the Shop to the Identity service. Prior to this step, all password and 2FA verification was done within the core Shop application runtime. Given that Shopify shards the database for the core platform by Shop, all of the data associated with a given Shop is available on a single database instance.

One downside with having all authentication go through Identity is that when a user first signs into a Shopify service it requires sending the user’s browser to Identity to perform an OIDC authentication request (AuthRequest), so there is a longer delay on initial sign in to a particular store.

Users signing into Shopify got familiar with this loading spinner
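For reference, the AuthRequest the destination redirects the browser to is a standard OIDC authorization request. Here is a sketch of building one in Ruby; the endpoint host and client values are illustrative, while the parameters themselves are standard OpenID Connect:

```ruby
require "uri"

# Sketch of the OIDC authorization request ("AuthRequest") that a destination
# (e.g. a Shop) redirects the browser to. Hostnames and client IDs below are
# made up; the query parameters are the standard OIDC ones.

def authorization_url(client_id:, redirect_uri:, state:, nonce:)
  params = {
    "response_type" => "code",    # authorization code flow
    "scope"         => "openid",  # marks this as an OIDC request
    "client_id"     => client_id,
    "redirect_uri"  => redirect_uri,
    "state"         => state,     # CSRF protection, echoed back on the callback
    "nonce"         => nonce,     # binds the resulting ID token to this request
  }
  "https://identity.example.com/oauth/authorize?" + URI.encode_www_form(params)
end

url = authorization_url(
  client_id: "shop-123",
  redirect_uri: "https://shop-123.example.com/auth/callback",
  state: "opaque-state",
  nonce: "opaque-nonce",
)
puts url
```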

3. Prompt Users to Combine Their Accounts Together

Users with an email address that can sign into more than one Shopify service are prompted to combine their accounts into a single Identity account. When a legacy user signs into a Shopify product we interrupt the OIDC AuthRequest flow, after verifying they were authenticated but before sending them to their destination, to check whether they have accounts that can be upgraded.

There were two primary upgrade paths to an Identity account for a user: auto-upgrading a single legacy account or combining multiple accounts.

Auto-upgrading a single legacy account occurs when a user’s email address has only a single store association. In this case, we convert the single account into an Identity account, retaining all of their profile, password, and 2FA settings. Accounts in the Identity service are modelled using single table inheritance, with a type attribute specifying which class a particular record uses. Upgrading a legacy account in this case was as simple as updating the value of this type attribute. This required no other changes anywhere else within the Shopify system because the universally unique identifier (UUID) for the account didn’t change, and this is the value used to identify an account in other systems.
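A sketch of that auto-upgrade, using a plain-Ruby stand-in for the Active Record model (field values here are illustrative): because other systems identify the account by its UUID, flipping the type column is the entire migration.

```ruby
# Plain-Ruby stand-in for the single-table-inheritance upgrade described
# above. The UUID never changes, so no other system has to be touched.

Account = Struct.new(:uuid, :type, :email, keyword_init: true)

def auto_upgrade!(account)
  raise "only legacy accounts can be auto-upgraded" unless account.type == "LegacyAccount"
  account.type = "IdentityAccount"  # the only field that changes
  account
end

legacy   = Account.new(uuid: "3f7a-0000", type: "LegacyAccount", email: "a@example.com")
upgraded = auto_upgrade!(legacy)
puts upgraded.type                 # => IdentityAccount
puts upgraded.uuid == legacy.uuid  # => true (same identifier before and after)
```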

Combining multiple accounts is triggered when a user has more than one active account (legacy or Identity) that uses the same email address. We created a new session object, called a MergeSession, for this combining process to keep track of all the data required to create the Identity account. The MergeSession was associated to an individual AuthRequest which means that when the AuthRequest was completed, the session would no longer be active. If a user went through more than a single combining process we would have to generate a new MergeSession object for each one.

The prompt users saw when they had multiple accounts that could be combined

Shopify doesn't require users to verify their email address when creating a new store. This means it’s possible that someone could sign up for a trial using an email address they don’t have access to. Because of this we need to verify that you have access to the email address before we show a user information about other accounts with the same email or allow you to take any actions on those other accounts. This verification involves you requesting an email be sent to your address with a link.

If the user’s email address on the store they’re signing in to is verified, we list all of the other destinations where that email address is used. If the user hasn’t verified their email address for the account they’re authenticating into, we only indicate that there are other accounts, and they must verify their email address before proceeding with combining them.

The prompt users saw when they signed in with an unverified email address

If any of the accounts being combined use 2FA, the user has to provide a valid code for each required account. When someone uses SMS as a 2FA method, they can potentially save some time in this step: if they use the same phone number across multiple accounts, we only require a single code for all of the destinations that share that number. This was a secure convenience for our users, in an attempt to reduce time spent on this step. Individuals using an authenticator app (e.g. Google Authenticator, Authy, 1Password, etc.), however, had to provide a code per destination, because an authenticator app is configured per user account and there’s nothing associating the accounts with one another.

If a user couldn’t provide a 2FA code for any accounts other than the account they are signing into, they are able to exclude that account from being combined. Legitimate reasons why a person may be unable to provide a code include if the account uses an old SMS phone number that the person no longer has access to or the person no longer has an authenticator app configured to generate a code for that account.

The idea here is that any account which was excluded can be combined at a later date when the user re-gains access to the account.

Once the 2FA requirements for all accounts are satisfied we prompt the user to setup a new password for their combined account. We store the encrypted password hash on an object that is keeping track of state for this session.

4. Prompt Users to Enable a Second Factor to Protect Their Account

Having a user engaged in performing account maintenance was an excellent opportunity to expose them to the benefits of protecting their account with a second factor of security. We displayed a different flow to users who already had 2FA enabled on at least one of the accounts being combined, the assumption being that they didn’t require an explanation of what 2FA is, while someone who had never set it up most likely would.

5. Create the Combined Identity Account

Once a user had validated their 2FA configuration of choice, or opted out of setting it up, we performed the following actions:

Attach 2FA setup, if present, to an object that keeps track of the specific account combination session (MergeSession).

Merge session object with new password and 2FA configuration.

Inside a single database transaction, create the complete new account, associate destinations from legacy accounts to it, and delete the old accounts.

We needed to do this inside a transaction after getting all of the information from a user to prevent the potential for reducing the security of their accounts. If a user was using 2FA before starting this process and we created the Identity account immediately after the new password was provided, there exists a small window of time when their new Identity account would be less secure than their old legacy accounts. As soon as the Identity account exists and has a password associated with it, it could be used to access destinations with only knowledge of the password. Deferring account creation until both password and 2FA are defined means that the new account can be as secure as the ones being combined were.
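The property described above can be sketched in plain Ruby, with a fake `transaction` helper standing in for `ActiveRecord::Base.transaction` (all names here are illustrative): creation of the combined account, re-association of destinations, and deletion of the legacy accounts commit or roll back together.

```ruby
# Plain-Ruby sketch of the "create combined account" transaction. `FakeDB`
# stands in for ActiveRecord::Base.transaction semantics: if any step inside
# the block raises, every change made inside it is discarded.

class FakeDB
  attr_reader :accounts

  def initialize(accounts)
    @accounts = accounts
  end

  def transaction
    snapshot = Marshal.load(Marshal.dump(@accounts))  # deep copy
    yield
  rescue => e
    @accounts = snapshot  # roll back on any failure
    raise e
  end
end

db = FakeDB.new([
  { uuid: "legacy-1", type: "LegacyAccount", destinations: ["shop-a"] },
  { uuid: "legacy-2", type: "LegacyAccount", destinations: ["shop-b"] },
])

db.transaction do
  destinations = db.accounts.flat_map { |a| a[:destinations] }
  db.accounts.clear  # delete the old legacy accounts...
  db.accounts << {   # ...and create the combined account atomically
    uuid: "identity-1",
    type: "IdentityAccount",
    destinations: destinations,
    password_digest: "(hash)",
    otp_secret: "(2fa secret)",
  }
end

puts db.accounts.first[:destinations].inspect  # => ["shop-a", "shop-b"]
```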

Final state of combined account

Generate a session for the new account and use it to satisfy the AuthRequest that initiated this session in the first place.

Some of the more complex pieces of logic in this process included finding all of the related accounts for a given email address (along with the destinations they had access to), replacing the legacy accounts when creating the Identity account, and ensuring that the Identity account was set up with all of the required data defined correctly. For these parts of the solution we relied on a Ruby library called ActiveOperation. It’s a very small framework that lets you isolate and model business logic within your application as operation classes. Traditionally in a Rails application you end up putting logic either in your controllers or your models; in this case we were able to keep both controllers and models very small by defining the complex business logic as operations. These operations were easily testable because they were isolated and each class had a very specific responsibility.

There are other libraries for handling this kind of business logic process but we chose ActiveOperation because it was easy to use, made our code easier to understand, and had built-in support for the RSpec testing framework we were using.
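As a self-contained illustration of the pattern: the small `Operation` base class below imitates ActiveOperation’s input/execute style but is a stand-in, not the gem’s exact API, and the `FindRelatedAccounts` operation is a hypothetical example of the "find all related accounts for an email" logic mentioned above.

```ruby
# Self-contained sketch of the operation pattern (modelled loosely on
# ActiveOperation's style; this DSL is a stand-in, not the gem's real API).
# Business logic lives in a small, isolated, easily unit-tested class.

class Operation
  def self.input(name)
    (@inputs ||= []) << name
    attr_reader name
  end

  def self.inputs
    @inputs || []
  end

  def self.perform(**args)
    new(**args).execute
  end

  def initialize(**args)
    self.class.inputs.each { |n| instance_variable_set("@#{n}", args.fetch(n)) }
  end
end

# Hypothetical operation: find all accounts sharing an email address.
class FindRelatedAccounts < Operation
  input :email
  input :accounts

  def execute
    accounts.select { |a| a[:email].downcase == email.downcase }
  end
end

accounts = [
  { email: "a@example.com", destination: "shop-a" },
  { email: "A@example.com", destination: "shop-b" },
  { email: "c@example.com", destination: "shop-c" },
]
related = FindRelatedAccounts.perform(email: "a@example.com", accounts: accounts)
puts related.size  # => 2
```

Because the class has exactly one responsibility and declared inputs, a unit test only has to call `perform` with fixture data and assert on the return value.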

We added support for the new Web Authentication (WebAuthn) standard in our Identity service just as we were beginning to roll out the account combining flow to our users. This meant that we were able to allow users to use physical security keys as a second factor when securing their accounts rather than just the options of SMS or an authenticator app.

6. Prevent New Legacy Accounts From Being Created

We didn’t want any more legacy accounts created. There were two user scenarios that needed to be updated to use the Identity creation flow: signing up for a new trial store on shopify.com and inviting new staff members to an existing store.

When signing up for a new store, you enter your email address as part of the process, and that email address is used for the primary owner of the new store. With legacy accounts, even if the email address belonged to another store, we’d still create a new legacy account for the newly created store.

When inviting a new staff member to your store, you enter the email address for the new user, and an invite is sent to that address with a link to accept the invite and finish setting up their account. As with the store creation process, this would always create a new legacy account on each individual store.

In both cases with the new process we determine whether the email address belongs to an Identity account already and, if so, require the user to be authenticated for the account belonging to that email address before they can proceed.

Build New Experiences for Shopify Users That Rely on SSO Identity Accounts

As of this writing, over 75% of active user accounts have been auto-upgraded or combined into a single Identity account. Accounts that don’t require user interaction, such as those eligible for auto-upgrade, can be converted automatically without the user signing in. Accounts that require a user to prove ownership can only be combined when logging in. At some point in the future we will prevent users from signing into Shopify without an Identity account.

When product teams within Shopify can rely on our active users having Identity accounts we can start building new experiences for those users that delegate authentication and profile management to the Identity service. Authorization is still up to the service leveraging these Identity accounts as Identity specifically only handles authentication and knows nothing about the permissions within the services that the accounts can access.

For our users, it means that they don’t have to create and manage a new account when Shopify launches a new service that utilizes Identity for user sign in.

If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions. 


How Shopify Manages API Versioning and Breaking Changes


Earlier this year I took the train from Ottawa to Toronto. While I was waiting in line in the main hall of the station, I noticed a police officer with a detection dog. The police officer was giving the dog plenty of time at each bag or person as they worked and weaved their way back and forth along the lines. The dog would look to his handler for direction, receiving it with the wave of a hand or gesture towards the next target. That’s about the moment I began asking myself a number of questions about dogs… and APIs.

To understand why, you have to appreciate that the Canadian government recently legalized cannabis. Watching this incredibly well-trained dog work his way up and down the lines, it made me wonder, how did they “update” the dogs once the legislation changed? Can you really retrain or un-train a dog? How easy is it to implement this change, and how long does it take to roll out? So when the officer ended up next to me I couldn’t help but ask,

ME: “Excuse me, I have a question about your dog if that’s alright with you?”

OFFICER: “Sure, what’s on your mind?”

ME: “How did you retrain the dogs after the legalization of cannabis?”

OFFICER: “We didn’t. We had to retire them all and train new ones. You really can’t teach an old dog new tricks.”

ME: “Wow, seriously? How long did that take?”

OFFICER: “Yep, we needed a full THREE YEARS to retire the previous group and introduce a new generation. It was a ton of work.”

I found myself sitting on the train thinking about how simple it might have been for one layer of government, plotting out the changes, to completely underestimate the downstream impact on the K9 unit of the police services. To anyone who didn’t understand the system (dogs), the change sounds simple: simply detect substances in a set that is now n-1 in size. In reality, due to the way this dog-dependent system works, it required significant time and effort, and a three-year program to migrate from the old system to the new.

How We Handle API Versioning

At Shopify, we have tens of thousands of partners building on our APIs that depend on us to ensure our merchants can run their businesses every day. In April of this year, we released the first official version of our API. All consumers of our APIs require stability and predictability and our API versioning scheme at Shopify allows us to continue to develop the platform while providing apps with stable API behavior and predictable timelines for adopting changes.

The increasing growth of our API RPM quarter over quarter since 2017 overlaid with growth in active API clients


To ensure that we provide a stable and predictable API, Shopify releases a new API version every three months at the beginning of the quarter. Version names are date-based to be meaningful and semantically unambiguous (for example, 2020-01).

Shopify API Versioning Schedule



Although the Platform team is responsible for building the infrastructure, tooling, and systems that enforce our API versioning strategy at Shopify, there are 1,000+ engineers working across Shopify, each with the ability to ship code that can ultimately affect any of our APIs. So how do we think about versioning, and help manage changes to our APIs at scale?

Our general rule of thumb about versioning is that

API versioning is a powerful tool that comes with added responsibility. Break the API contract with the ecosystem only when there are no alternatives or it’s uneconomical to do otherwise.

API versions and changes are represented in our monolith through new frozen records: one file for versions, and one for changes. API changes are packaged together and shipped as part of a distinct version. API changes are initially introduced to the unstable version, and can optionally have a beta flag associated with them to prevent the change from being visible publicly. At runtime, our code can check whether a given change is in effect through an ApiChange.in_effect? construct. I’ll show you how this and other methods of the ApiChange module are used in examples later on.

Dealing With Breaking and Non-breaking Changes

As we continue to improve our platform, changes are necessary and can be split into two broad categories: breaking and non-breaking.

Breaking changes are more problematic and require a great deal of planning, care and go-to-market effort to ensure we support the ecosystem and provide a stable commerce platform for merchants. Ultimately, a breaking change is any change that requires a third-party developer to do any migration work to maintain the existing functionality of their application. Some examples of breaking changes are

  • adding a new validation, or modifying an existing one, on an existing resource
  • requiring a parameter that wasn’t required before
  • changing existing error response codes/messages
  • modifying the expected payload of webhooks and async callbacks
  • changing the data type of an existing field
  • changing supported filtering on existing endpoints
  • renaming a field or endpoint
  • adding a new feature that will change the meaning of a field
  • removing an existing field or endpoint
  • changing the URL structure of an existing endpoint.

Teams inside Shopify considering a breaking change conduct an impact analysis. They put themselves into the shoes of a third-party developer using the API and think through the changes that might be required. If there is ambiguity, our developer advocacy team can reach out to our partners to gain additional insight and gauge the impact of proposed changes. 

On the other hand, to determine if a change is non-breaking, a change must pass our forward compatibility test. Forward compatible changes are those which can be adopted and used by any merchant, without limitation, regardless of whether shops have been migrated or any other additional conditions have been met.

Forward compatible changes can be freely adopted without worrying about whether there is a new user experience or the merchant’s data is adapted to work with the change, etc. Teams will keep these changes in the unstable API version and if forward compatibility cannot be met, keep access limited and managed by protecting the change with a beta flag.

Every change is named in the changes frozen record mentioned above, so that it can be tracked, managed, and referenced by its name.
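As a hedged illustration, a named change might be looked up like this; the registry, change names, and fields below are all made up, and only loosely model the frozen-record idea:

```ruby
# Hypothetical sketch of a named-change registry, loosely modelled on the
# frozen-record approach described above; every name and field is invented.
ApiChangeRecord = Struct.new(:name, :version, :beta, keyword_init: true)

API_CHANGES = [
  ApiChangeRecord.new(name: "rename_total_spent", version: "2020-01", beta: false),
  ApiChangeRecord.new(name: "new_fulfillment_flow", version: "unstable", beta: true),
].freeze

# Look a change up by the name it was registered under.
def find_api_change(name)
  API_CHANGES.find { |c| c.name == name } ||
    raise(KeyError, "unknown API change: #{name}")
end

find_api_change("rename_total_spent").version # => "2020-01"
```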


Analyzing the Impact of Breaking Changes

If a proposed change is identified as a breaking change, and there is agreement amongst the stakeholders that it’s necessary, the next step is to enable our teams to figure out just how big the change’s impact is.

Within the core monolith, teams make use of our API change tooling methods mark_breaking and mark_possibly_breaking to measure the impact of a potential breaking change. These methods work by capturing request metadata and context specific to the breaking code path then emitting this into our event pipeline, Monorail, which places the events into our data warehouse.

The mark_breaking method is called when the request would break if everything else was kept the same, while mark_possibly_breaking would be used when we aren’t sure whether the call would have an adverse effect on the calling application. An example would be the case where a property of the response has been renamed or removed entirely:
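As a hedged sketch of how these markers might be used: the real methods capture request metadata and emit it to Monorail, while below an in-memory array stands in for the pipeline, and the module, event store, and field names are all illustrative.

```ruby
# Hypothetical stand-in for the API change tooling described above.
module ApiChangeTracking
  Event = Struct.new(:change_name, :severity, :metadata)
  EVENTS = [] # in-memory substitute for the real event pipeline

  def self.mark_breaking(change_name, metadata = {})
    EVENTS << Event.new(change_name, :breaking, metadata)
  end

  def self.mark_possibly_breaking(change_name, metadata = {})
    EVENTS << Event.new(change_name, :possibly_breaking, metadata)
  end
end

# A serializer where a response property was renamed from `total_spent` to
# `amount_spent` (illustrative field names). We can't tell whether a caller
# actually relies on the old name, so this is only *possibly* breaking.
def serialize_customer(customer, requested_fields)
  if requested_fields.include?("total_spent")
    ApiChangeTracking.mark_possibly_breaking("rename_total_spent",
                                             requested: requested_fields)
  end
  { "amount_spent" => customer[:amount_spent] }
end

serialize_customer({ amount_spent: "12.00" }, ["total_spent"]) # records one event
```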


Once shipped to production, teams can use a prebuilt impact assessment report to see the potential impact of their changes across a number of dimensions.

Measuring and Managing API Adoption

Once the change has shipped as a part of an official API version, we’re able to make use of the data emitted from mark_breaking and mark_possibly_breaking to measure adoption and identify shops and apps that are still at risk. Our teams use the ApiChange.in_effect? method (made available by our API change tooling) to create conditionals and manage support for the old and new behaviour in our API. A trivial example might look something like this:
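A hedged sketch of what such a conditional might look like; only the ApiChange.in_effect? idea comes from the text, while the release table, change name, and fields are made up:

```ruby
# Hypothetical sketch of ApiChange.in_effect?. The real method knows which
# API version shipped each change; a hard-coded table stands in for the
# frozen records here. Date-based version names ("2020-01") have the nice
# property of comparing correctly as plain strings.
module ApiChange
  RELEASES = { "rename_total_spent" => "2020-01" }.freeze

  def self.in_effect?(change_name, request_version:)
    request_version >= RELEASES.fetch(change_name)
  end
end

# Support both behaviours while older API versions remain in use:
def customer_payload(customer, request_version)
  if ApiChange.in_effect?("rename_total_spent", request_version: request_version)
    { "amount_spent" => customer[:spent] } # new behaviour
  else
    { "total_spent" => customer[:spent] }  # old behaviour for older versions
  end
end

customer_payload({ spent: "12.00" }, "2019-10") # => {"total_spent"=>"12.00"}
```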

The ApiChange module and the automated instrumentation it drives allow teams at Shopify to assess the current risk to the platform based on the proportion of API calls still on the breaking path, and assist in communicating these risks to affected developers.

At Shopify, our ecosystem’s applications depend on the predictable nature of our APIs. The functionality these applications provide can be critical for the merchant’s businesses to function correctly on Shopify. In order to build and maintain trust with our ecosystem, we consider any proposed breaking change thoroughly and gauge the impact of our decisions. By providing the tooling to mark and analyze API calls, we empower teams at Shopify to assess the impact of proposed changes, and build a culture that respects the impact our decisions have on our ecosystem. There are real people out there building software for our merchants, and we want to avoid ever having to ask them to replace all the dogs at once!

We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.

Continue reading

Successfully Merging the Work of 1000+ Developers

Successfully Merging the Work of 1000+ Developers

Collaboration with a large team is challenging, and even more so if it’s on a single codebase, like the Shopify monolith. We deploy Shopify around 40 times a day. We follow a trunk-based development workflow and merge around 400 commits to master daily. There are three rules that govern how we deploy safely, but they were hard to maintain at our growing scale. Soft conflicts broke master, slow deployments caused large drift between master and production, and the time to deploy emergency merges slowed due to a backlog of pull requests. To solve these issues, we upgraded the Merge Queue (our tool to automate and control the rate of merges going into master) so it integrates with GitHub, runs continuous integration (CI) before merging to master keeping it green, removes pull requests that fail CI, and maximizes deployment throughput of pull requests.

Our three essential rules about deploying safely and maintaining master:

  1. Master must always be green (passing CI). Important because we must be able to deploy from master at all times. If master is not green, our developers cannot merge, slowing all development across Shopify.
  2. Master must stay close to production. Drifting master too far ahead of what is deployed to production increases risk.
  3. Emergency merges must be fast. In case of emergencies, we must be able to quickly merge fixes intended to resolve the incident.

Merge Queue v1

Two years ago, we built the first iteration of the merge queue inside our open-source continuous deployment tool, Shipit. Our goal was to prevent master from drifting too far from production. Rather than merging directly to master, developers add pull requests to the merge queue which merges pull requests on their behalf.

Merge Queue v1: developers add pull requests to the merge queue, which merges them on their behalf

Pull requests build up in the queue rather than merging to master all at once. Merge Queue v1 controlled the batch size of each deployment and prevented merging when there were too many undeployed pull requests on master. It reduced the risk of failure and possible drift from production. During incidents, we locked the queue to prevent any further pull requests from merging to master, giving space for emergency fixes.

Merge Queue v1 browser extension

Merge Queue v1 used a browser extension allowing developers to send a pull request to the merge queue within the GitHub UI, but also allowed them to quickly merge fixes during emergencies by bypassing the queue.

Problems with Merge Queue v1

Merge Queue v1 kept track of pull requests, but we were not running CI on pull requests while they sat in the queue. On some unfortunate days—ones with production incidents requiring a halt to deploys—we would have upwards of 50 pull requests waiting to be merged. A queue of this size could take hours to merge and deploy. There was also no guarantee that a pull request in the queue would pass CI after it was merged, since there could be soft conflicts (two pull requests that pass CI independently, but fail when merged together) between pull requests in the queue.

The browser extension was a major pain point because it was a poor experience for our developers. New developers sometimes forgot to install the extension, resulting in accidental direct merges to master instead of going through the merge queue. This can be disruptive if the deploy backlog is already large, or if there is an ongoing incident.

Merge Queue v2

This year, we completed Merge Queue v2. We focused on optimizing our throughput by reducing the time that the queue is idle, and improving the user experience by replacing the browser extension with a more integrated experience. We also wanted to address the pieces that the first merge queue didn’t address: keeping master green and faster emergency merges. In addition, our solution needed to be resilient to flaky tests—tests that can fail nondeterministically.

No More Browser Extension

Merge Queue v2 came with a new user experience. We wanted an interface for our developers to interact with that felt native to GitHub. We drew inspiration from Atlantis, which we were already using for our Terraform setup, and went with a comment-based interface.

Merge Queue v2 went with a comment-based interface

A welcome message gets issued on every pull request with instructions on how to use the merge queue. Every merge now starts with a /shipit comment. This comment fires a webhook to our system to let us know that a merge request has been initiated. We check if Branch CI has passed and if the pull request has been approved by a reviewer before adding the pull request to the queue. If successful, we issue a thumbs up emoji reaction to the /shipit comment using the GitHub addReaction GraphQL mutation.

In the case of errors, such as invalid base branch, or missing reviews, we surface the errors as additional comments on the pull request.

Jumping the queue by merging directly to master is bad for overall throughput. To ensure that everyone uses the queue, we disable the ability to merge directly to master using GitHub branch protection programmatically as part of the merge queue onboarding process.


However, we still need to be able to bypass the queue in an emergency, like resolving a service disruption. For these cases, we added a separate /shipit --emergency command that skips any checks and merges directly to master. This helps communicate to developers that this action is reserved for emergencies only and gives us auditability into the cases where this gets used.

Keeping Master Green

In order to keep master green, we took another look at how and when we merged a change to master. If we run CI before merging to master, we ensure that only green changes merge. This improves the local development experience by eliminating the cases of pulling a broken master, and by speeding up the deploy process by not having to worry about delays due to a failing build.

Our solution here is to have what we call a “predictive branch,” implemented as a git branch, onto which pull requests are merged, and CI is run. The predictive branch serves as a possible future version of master, but one where we are still free to manipulate it. We avoid maintaining a local checkout, which incurs the cost of running a stateful system that can easily be out of sync, and instead interact with this branch using the GraphQL GitHub API.

To ensure that the predictive branch on GitHub is consistent with our desired state, we use a similar pattern as React’s “Virtual DOM.” The system constructs an in-memory representation of the desired state and runs a reconciliation algorithm we developed that performs the necessary mutations to the state on GitHub. The reconciliation algorithm synchronizes our desired state to GitHub by performing two main steps. The first step is to discard obsolete merge commits. These are commits that we may have created in the past, but are no longer needed for the desired state of the tree. The second step is to create the missing desired merge commits. Once these merge commits are created, a corresponding CI run will be triggered. This pattern allows us to alter our desired state freely when the queue changes and gives us a layer of resiliency in the case of desynchronization.
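The two reconciliation steps can be sketched as follows; plain arrays stand in for the GitHub GraphQL API, and the commit-naming scheme is invented purely for illustration:

```ruby
# Minimal sketch of the reconciliation idea: compute the desired chain of
# merge commits from the current queue, then diff it against what already
# exists on the predictive branch.
def reconcile(queue, existing_commits)
  desired    = queue.map { |pr| "merge-#{pr}" }
  to_discard = existing_commits - desired # step 1: obsolete merge commits
  to_create  = desired - existing_commits # step 2: missing merge commits
  { discard: to_discard, create: to_create, state: desired }
end

# PR 3 failed CI and left the queue: its merge commit is discarded, and a
# commit for the newly queued PR 4 is created (triggering a CI run).
reconcile([1, 2, 4], ["merge-1", "merge-2", "merge-3"])
# => {:discard=>["merge-3"], :create=>["merge-4"], :state=>["merge-1", "merge-2", "merge-4"]}
```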

Merge Queue v2 runs CI in the queue

To ensure our goal of keeping master green, we need to also remove pull requests that fail CI from the queue to prevent them from cascading failures to all pull requests behind them. However, like many other large codebases, our core Shopify monolith suffers from flaky tests. The existence of these flaky tests makes removing pull requests from the queue difficult because we lack certainty about whether failed tests are legitimate or flaky. While we have work underway to clean up the test suite, we have to be resilient to the situation we have today.

We added a failure-tolerance threshold, and only remove pull requests when the number of successive failures exceeds the failure tolerance. This is based on the idea that legitimate failures will propagate to all later CI runs, but flaky tests will not block later CI runs from passing. Larger failure tolerances will increase the accuracy, but at the tradeoff of taking longer to remove problematic changes from the queue. In order to calculate the best value, we can take a look at the flakiness rate. To illustrate, let’s assume a flakiness rate of 25%. These are the probabilities of a false positive based on how many successive failures we get.

Failure tolerance   Probability of false positive
0                   25%
1                   6.25%
2                   1.56%
3                   0.39%
4                   0.098%

From these numbers, it’s clear that the probability decreases significantly with each increase to the failure tolerance. The probability will never reach exactly 0%, but in this case, a value of 3 brings us sufficiently close. This means that on the fourth consecutive failure, we remove the first pull request failing CI from the queue.
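The numbers in the table fall out of a one-line calculation, assuming flaky failures are independent; the helper name below is made up:

```ruby
# With flakiness rate p, a pull request is removed only after
# (tolerance + 1) successive failures, so a removal caused purely by
# flakiness happens with probability p**(tolerance + 1).
def false_positive_probability(flake_rate, failure_tolerance)
  flake_rate**(failure_tolerance + 1)
end

# Reproduce the table for a 25% flakiness rate:
(0..4).each do |tolerance|
  pct = false_positive_probability(0.25, tolerance) * 100
  puts format("tolerance %d -> %.3f%%", tolerance, pct)
end
```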

Increasing Throughput

An important objective for Merge Queue v2 was to ensure we can maximize throughput. We should be continuously deploying and making sure that each deployment contains the maximum amount of pull requests we deem acceptable.

To continuously deploy, we make sure that we have a constant flow of pull requests that are ready to go. Merge Queue v2 affords this by ensuring that CI is started for pull requests as soon as they are added to the queue. The impact is especially noticeable during incidents when we lock the queue. Since CI is running before merging to master, we will have pull requests passing and ready to deploy by the time the incident is resolved and the queue is unlocked. From the following graph, the number of queued pull requests rises as the queue gets locked, and then drops as the queue is unlocked and pull requests get merged immediately.

The number of queued pull requests rises as the queue gets locked, and then drops as the queue is unlocked and pull requests get merged immediately

To optimize the number of pull requests for each deploy, we split the pull requests in the merge queue up into batches. We define a batch as the maximum number of pull requests we can put in a single deploy. Larger batches result in higher theoretical throughput, but higher risk. In practice, the increased risk of larger batches impedes throughput by causing failures that are harder to isolate, and results in an increased number of rollbacks. In our application, we went with a batch size of 8 as a balance between throughput and risk.

At any given time, we run CI on 3 batches worth of pull requests in the queue. Having a bounded number of batches ensures that we’re only using CI resources on what we will need soon, rather than the entire set of pull requests in the queue. This helps reduce cost and resource utilization.
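The batching arithmetic above can be sketched in a few lines; the batch size of 8 and the 3 active batches come from the text, while the helper name is made up:

```ruby
# With batches of 8 pull requests and CI running on at most 3 batches at a
# time, no more than 24 queued pull requests ever have CI in flight.
def batches_with_ci(queue, batch_size: 8, active_batches: 3)
  queue.each_slice(batch_size).take(active_batches)
end

batches_with_ci((1..30).to_a).map(&:size) # => [8, 8, 8]
```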


We improved the user experience, safety of deploying to production, and throughput of deploys through the introduction of the Merge Queue v2. While we accomplished our goals for our current level of scale, there will be patterns and assumptions that we’ll need to revisit as we grow. Our next steps will focus on the user experience and ensure developers have the context to make decisions every step of the way. Merge Queue v2 has given us flexibility to build for the future, and this is only the beginning of our plans to scale deploys.

We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.

Continue reading

Four Steps to Creating Effective Game Day Tests

Four Steps to Creating Effective Game Day Tests

At Shopify, we use Game Day tests to practice how we react to unpredictable situations. Game Day tests involve deliberately triggering failure modes within our production systems, and analyzing whether the systems handle these problems in the ways we expect. I’ll walk through a set of best practices that we use for our internal Shopify Game Day tests, and how you can apply these guidelines to your own testing.

Shopify’s primary responsibility is to provide our merchants with a stable ecommerce platform. Even a small outage can have a dramatic impact on their businesses, so we put a lot of work into preventing them before they occur. We verify our code changes rigorously before they’re deployed, both through automated tests and manual verification. We also require code reviews from other developers who are aware of the context of these changes and their potential impact to the larger platform.

But these upfront checks are only part of the equation. Inevitably, things will break in ways that we don’t expect, or due to forces that are outside our control. When this happens, we need to quickly respond to the issue, analyze the situation at hand, and restore the system back to a healthy state. This requires close coordination between humans and automated systems, and the only way to ensure that it goes smoothly is to practice it beforehand. Game Day tests are a great way of training your team to expect the unexpected.

1. List All the Things That Could Break

The first step to running a successful Game Day test is to compile a list of all the potential failure scenarios that you’re interested in analyzing. Collaborate with your team to take a detailed inventory of everything that could possibly cause your systems to go haywire. List all the problem areas you know about, but don’t stop there—stretch your imagination! 

  • What are the parts of your infrastructure that you think are 100% safe? 
  • Where are your blind spots?
  • What would happen if your servers started inexplicably running out of disk space? 
  • What would happen if you suffered a DNS outage or a DDOS attack? 
  • What would happen if all network calls to a host started timing out?
  • Can your systems support 20x their current load?

You’ll likely end up with too many scenarios to reasonably test during a single Game Day testing session. Whittle down the list by comparing the estimated impact of each scenario against the difficulty you’d face in trying to reasonably simulate it. Try to avoid weighing particular scenarios based on your estimates of the likelihood that those scenarios will happen. Game Day testing is about insulating your systems against perfect storm incidents, which often hinge on failure points whose danger was initially underestimated.

2. Create a Series of Experiments

At Shopify, we’ve found that we get the best results from our Game Day tests when we run them as a series of controlled experiments. Once you’ve compiled a list of things that could break, you should start thinking about how they will break, as a list of discrete hypotheses. 

  • What are the side effects that you expect will be triggered when you simulate an outage during your test? 
  • Will the correct alerts be dispatched? 
  • Will downstream systems manifest the expected behaviors?
  • When you stop simulating a problem, will your systems recover back to their original state?

If you express these expectations in the form of testable hypotheses, it becomes much easier to plan the actual Game Day session itself. Use a separate spreadsheet (in a tool like Google Sheets) to catalogue each of the prerequisite steps that your team will walk through to simulate a specific failure scenario. Below those steps, indicate the behaviors that you hypothesize will occur when you trigger that scenario, along with an indicator for whether each behavior occurs. Lastly, make sure to list the necessary steps to restore your system back to its original state.

Example spreadsheet for a Game Day test that simulates an upstream service outage. A link to this spreadsheet is available in the “Additional Resources” section below.

3. Test Your Human Systems Too

By this point, you’ve compiled a series of short experiments that describe how you expect your systems to react to a list of failure scenarios. Now it’s time to run your Game Day test and validate your experimental hypotheses. There are a lot of different ways to run a Game Day test, and one approach isn’t necessarily better than another. How you approach the testing should be tailored to the types of systems you’re testing, the way your team is structured and communicates, the impact your testing poses to production traffic, and so on. Whatever approach you take, just make sure that you track your experiment results as you go along!

However, there is one common element that should be present regardless of the specifics of your particular testing setup: team involvement. Game Day tests aren’t just about seeing how your automated systems react to unexpected pressures—you should also use the opportunity to analyze how your team handles these situations on the people side. Good team communication under pressure can make a huge difference when it comes to mitigating the impact of a production incident. 

  • What are the types of interactions that need to happen among team members as an incident unfolds? 
  • Is there a protocol for how work is distributed among multiple people? 
  • Do you need to communicate with anyone from outside your immediate team?

Make sure you have a basic system in place to prevent people from doing the same task twice, or incorrectly assuming that something is already being handled.

4. Address Any Gaps Uncovered

After running your Game Day test, it’s time to patch the holes that you uncovered. Your experiment spreadsheets should be annotated with whether each hypothesis held up in practice.

  • Did your off hours alerting system page the on-call developer? 
  • Did you correctly switch over to reading from the backup database? 
  • Were you able to restore things back to their original healthy state?

For any gaps you uncover, work with your team to determine why the expected behavior didn’t occur, then establish a plan for how to correct the failed behavior. After doing so, you should ideally run a new Game Day test to verify that your hypotheses are now valid with the new fixes in place.

This is also the opportunity to analyze any gaps in communication between your team, or problems that you identified regarding how people distribute work among themselves when they’re under pressure. Set aside some time for a follow up discussion with the other Game Day participants to discuss the results of the test, and ask for their input on what they thought went well versus what could use some improvement. Finally, make any necessary changes to your team’s guidelines for how to respond to these incidents going forward.

In Conclusion

Using these best practices, you should be able to execute a successful Game Day test that gives you greater confidence in how your systems—and the humans that control them—will respond during unexpected incidents. And remember that a Game Day test isn’t a one-time event: you should periodically update your hypotheses and conduct new tests to make sure that your team remains prepared for the unexpected. Happy testing!

Additional resources


Continue reading

Sam Saffron AMA: Performance and Monitoring with Ruby

Sam Saffron AMA: Performance and Monitoring with Ruby

Sam Saffron is a co-founder of Discourse and the creator of the mini_profiler, memory_profiler, mini_mime and mini_racer gems. He has written extensively about various performance topics on samsaffron.com and is dedicated to ensuring Discourse keeps running fast.

Sam visited Shopify in Ottawa and talked to us about Discourse’s approach to Ruby performance and monitoring. He also participated in an AMA and answered the top voted questions submitted by Shopifolk which we are sharing here.

Ruby has a bad reputation when it comes to performance. What do you think are the actual problems? And do you think the community is on the right track to fix this reputation?

Sam Saffron: I think there are a lot of members of the community that are very keen to improve performance. And this runs all the way from above. DHH is also very interested in improving performance of Ruby.

I think the big problem that we have is resources and focus. A lot of times, I can feel that as a community we’re not focusing necessarily on the right thing. It’s very tempting, in performance work, just to look at a microbenchmark. It’s easy to make something 20 times faster in a microbenchmark, but in the big scheme of things you may not be fixing the right thing, so it doesn’t make a big difference.

I think one area that Ruby can get better at, is finding the actual real production bottlenecks that people are seeing out there, and working towards solving them. And when I think about performance for us at Discourse, the biggest pain is memory, not CPU. When looking at adoption of Discourse, a lot of it depends on the people being able to run it on very cheap servers and they’re very constrained on memory. It’s a huge difference to adoption for us whether we can run on a 512MB system versus 1024MB. We see these memory issues in our hosting as well, our CPUs are usually doing okay, but memory is where we have issues. I wish the community would focus more on memory.

Just to summarize, I wish we looked at what big pain points consumers in the ecosystem are having and just set the agenda based on that. The other thing would be to spend more time on memory.

Are there any Ruby features or patterns that you generally avoid for performance reasons?

Sam Saffron: That’s an interesting question. Well, I’ll avoid ActiveRecord sometimes if I have something performance sensitive. For example, when I think of a user flow that I’m working on, it could be one that the user will visit once a month, or it could be an extremely busy route like the topic page. If I’m working on the topic page, a performance-sensitive area, then I may opt to skip ActiveRecord and just use MiniSql.

As for using Ruby patterns, I don’t go and write while loops just because I hate blocks and I know that blocks are a little bit slower. I like how wonderful Ruby looks and how wonderful it reads. So, I won’t be like, “Oh, yeah, I have to write C in Ruby now because I don’t want to use blocks anywhere.” I think there’s a balancing act with patterns, and I’ll only move away from them for two reasons. One is clarity. If the code will be clearer without some of these sophisticated patterns, I’ll just go for clear and dumb versus fancy, sophisticated and pretty. I prefer clear and dumb. An example of that is I hate using unless. It’s a pet peeve that I have; I won’t use the unless keyword because I find it harder to comprehend what the code means. And the second is performance, but only rarely, where I absolutely have to take the performance hit, will I do that.

Sam Saffron presenting at Shopify in Ottawa

What is the right moment to shift focus on the performance of a product, rather than on other features? Do you have any tripwires or metrics in place?

Sam Saffron: We’re constantly thinking about performance at Discourse. We’ve always got the monitoring in place and we’re always looking at our graphs to see how things are going. I don’t think performance is something that you forget about for two years and then go back and say, “Yeah, we’ll do a round of performance now.” I think there should be a culture of performance instilled day-to-day. It doesn’t mean performance is the only thing you should be thinking about, but it should be in the back of your mind as something you’re constantly trying to do.

There’s a balancing act. You want to ship new features, but as long as performance is something the team is constantly thinking about, then I think it’s safe. I would never consider shipping a new feature that is very slow just because I want to get the feature out there. I prefer to have the feature both correct and fast before shipping it.

What was one of the most difficult performance bugs you’ve found? How did you stay focused and motivated?

Sam Saffron: The thing that keeps me focused is having very clear goals. It’s important when you’re dealing with performance issues. You have a graph, it’s going a certain shape, and you want to change the shape of it. That’s your goal. You forget about everything else and it’s about taking that graph from this shape to that shape. When you can break a problem down from something that is impossible into something that is practical and easy to reason about, it’s at that point, you can attack these problems.

Particular war stories are hard—there’s nothing that screams out at me as the worst bug we’ve had. I guess memory leaks have traditionally been some of the hardest problems we’ve faced. Back in the old days we used TheRubyRacer, and it had a leak in the interop layer between Ruby and V8. It was a nightmare to find, because you’d have these processes that just keep climbing, and you don’t know what’s responsible for it. It’s something random that you’re doing, but how do you get to it? So we looked at that graph and started removing parts of the app; when you remove half of the app and the graph is suddenly stable, you put that half back in and slowly bisect until you find the problem area and start resolving it. Luckily, these days the tooling for debugging memory leaks is far more advanced, making it much easier to deal with issues like this.

Do you employ any kind of performance budgeting in your products and/or libraries? If you do, what metrics do you monitor and how do you decide on a budget?

Sam Saffron: Well, one constant budget I have is that any new dependency in our gem file has to be approved by me, and people have to justify its use. Dependencies are a big part of the performance budget: it’s easy to add dependencies, but removing them later is very hard. I need to make sure that every new dependency we add is one we agree we absolutely need.

I’m constantly thinking about our performance budget. We’ve got a budget on boot. I’m very proud that I can boot a Rails console in under two seconds on my laptop. So the boot budget is important to me, especially for dev work. If I want to open a Rails console, I just do it. I don’t have to think that I’m going to have to wait 20 seconds for this thing to boot up; I might as well go and browse the web.

We’ve got this constant budget on the high-profile pages. We can’t afford any regression there. So, one thing that we’re looking at adding is alerts. If the query count on a topic page is sitting at a median of 60 SQL queries and it goes up to 120, I want to get an alert saying, “There are 120 queries on this page, and there used to be only 60.” Then somebody will have a look at that, and it’ll open an alert topic on Discourse. So I definitely do want to get into more alerting that says, “Look, something happened at this point, look at it.”

What’s your take on the different Ruby runtimes out there? Is MRI still the “go to one” for a new project? If so, what do you think are the other ones missing to become real contenders?

Sam Saffron: We’ve always wanted Discourse to work on a wide array of platforms. That’s been a goal because when we started it was just about pure adoption. We didn’t care if people were paying us or not paying us, we just wanted the software to be adopted. So if it can run on JRuby, all power to JRuby—it makes adoption easier. The unfortunate thing that happened over the years is that we have never been able to run Discourse on JRuby; there have been attempts, but we are not quite there. Being able to host V8 in Java in JRuby is very, very hard. A lot of what we do is married to the C implementation. It’s extremely hard to move to another world. I want there to be diversity, but unfortunately the only option we have at the moment is MRI, and I don’t see any other feasible options popping up in the next couple of years.

Matz (Yukihiro Matsumoto) is saying that he wants Ruby 3 to be three times faster. Are you following the Ruby 3 development? Do you think they are going in the right direction?

Sam Saffron: I think there’s definitely a culture of performance at CRuby. There are a lot of improvements happening patch after patch, shaving this bit off and that bit off. CRuby itself is tracking well, but whether it’ll get three times faster or not, I don’t know. Where it gets complicated is that the ecosystem is tracking its own trajectory: there’s one trajectory for the engine, but another trajectory for the ecosystem.

If you look at things like Active Record, it’s not tracking three times faster for the next version of Rails, unfortunately. And that’s where all our pain is at the moment. When you look at what CRuby is doing, the goal is not making Active Record three times faster, because that’s not a goal that is even practical for them to take on. So they’re just dealing with little micro-benchmarks that may or may not help the situation; we don’t know.

Overall, do I think MRI is tracking well? Yes, MRI is tracking well, but I think we need to put a lot more focus on the ecosystem if we want the ecosystem to be 3x faster.

Is there any performance tooling that you think MRI is missing right now?

Sam Saffron: Yes. I’d say memory profiling is the big tooling piece that is missing. We have a bunch of tooling; for example, you can get full heap dumps. But the issue is how you’re going to analyze them. The tooling for analysis is woeful, to say the least. If you compare Ruby on Rails to what they have in Java or .NET, we’re worlds behind. In Java and .NET, when it comes to tooling for looking at memory, you can get backtraces of where something was allocated. In MRI, at best, you can get the call site of where something was allocated; you can’t get the full backtrace. Having the full backtrace gives you significantly more tools to figure out and pinpoint what it is.
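As a concrete sketch of what is available today, MRI’s bundled objspace extension can record the single call site of each allocation (the file and line, not the full backtrace discussed above):

```ruby
# MRI's stdlib "objspace" extension records, per object, the call site
# (file and line) where it was allocated -- but not a full backtrace.
require "objspace"

allocated = nil
ObjectSpace.trace_object_allocations do
  allocated = "x" * 32 # this allocation is recorded
end

file = ObjectSpace.allocation_sourcefile(allocated)
line = ObjectSpace.allocation_sourceline(allocated)
puts "#{file}:#{line}" # a single call site, not a backtrace
```

Heap dumps (`ObjectSpace.dump_all`) come from the same extension; as the interview notes, the hard part is analyzing the output, not producing it.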

So, I’d say there are some bits missing of raw information that you could opt in for, that would be very handy. And a lot of tooling around visualizing and analyzing what is going on, especially when it comes to the world between managed and unmanaged because it’s very murky.

People look at a process and the process is consuming one gig of memory, and they want to know why. If you were able, at Shopify for example, to see that picture of why immediately, you might say, well, maybe killing Unicorn workers is not what we need, because all the memory looks like this and it’s coming from here. Maybe we just rewrite this little component and we don’t have to kill these Unicorns anymore, because we’ve handled the root cause. I think that area is missing.

Intrigued about scaling using Ruby? Shopify is hiring and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.

Continue reading

Make Great Decisions Quickly with TOMASP

Make Great Decisions Quickly with TOMASP

As technical leaders and managers, our job is to make the right decision most of the time. Hiring, firing, technology choices, software architecture, and project prioritization are examples of high impact decisions that we need to make right if our teams are to be successful.

This is hard. As humans, we’re naturally bad at making those types of decisions. I’ll show you how you can consistently make great decisions quickly using a simple framework called TOMASP. I made up the acronym for the purpose of this blog post, but it’s inspired by many great books (see resources) as well as my personal experience of leading engineering teams for almost 15 years.

Let’s start with a concrete real world example:

Michelle, a technical lead for a popular mobile app is agonizing about whether or not she should direct her team to rewrite the app using Flutter, a new technology for building mobile apps.

Flutter has an elegant architecture that should make development much faster without compromising quality. It was created by Google and is already in use by several other reputable companies. If Flutter delivers on its promises, Michelle’s team has a good chance of achieving their goals which seem highly unlikely with the current tech stack.

But starting a big rewrite now will be hard. It’s going to be difficult to get buy-in from senior leadership, no one on the team has experience with Flutter, and Mike, one of the senior developers on the team, is really not interested in trying something new and will probably quit if she decides to move forward with Flutter.

Before reading further ask yourself, what is the right decision here? What would you advise Michelle to do? Should she rewrite the app using Flutter or not?

I have asked this question many times and I bet most of you have an opinion on the matter. Now think about it, how much do you really know about Michelle and her team? How much do you know about the app and the problem they’re trying to solve? We will get back to Michelle and her difficult decision by the end of this post but first a little bit of theory.

How We Make the Wrong Decisions

“The normal state of your mind is that you have feelings and opinions about almost everything that comes your way”

Daniel Kahneman - Thinking, Fast and Slow

This ability of our mind to form opinions very quickly and automatically is what enables us to make thousands of decisions every day, but it can get in the way of making the best decision when the decision is complex and the impact is high. This is just one of the ways our brain can trick us into making the wrong decision.

Here are some other examples:

  • We are highly susceptible to cognitive biases
  • We put too much weight on short term emotion
  • We are overconfident about how the future will unfold (when was the last time your project finished sooner than you anticipated?)

The good news here is that it’s possible, through deliberate practice, to counteract those biases and make great decisions quickly even in complex high impact situations.

“We can’t turn off our biases, but we can counteract them”

Chip Heath, Dan Heath - Decisive

What is a Great Decision?

Before I show you how to counteract your biases using TOMASP, we need to get on the same page as to what is a great decision. Let’s go through a couple of examples.

An example of a good decision:

In 2017 Shopify started to migrate its production infrastructure to Google Cloud ... 

Scaling up for BFCM used to take months, now it only takes a few days✌️.

In my experience this image is the mental model that most people have when they think of great decisions:

A decision has a direct link to the impact

In the previous example, the decision is to move to Google Cloud and the impact is the reduced effort to prep for BFCM.

Now let’s look at an example of a bad decision:

In 2017 Shopify started to migrate its production infrastructure to Google Cloud… 2 years later, Shopify is down for all merchants due to an outage in Google Cloud 😞.

Do you notice how the previous mental model is too simplistic? The same decision often leads to multiple outcomes.

Here is a better mental model for decisions:

A decision leads to execution, which leads to multiple impacts. Moreover, things outside of our control will also affect the outcomes

Some things are outside of our control and a single decision often has multiple outcomes. Moreover, we never know the alternative outcomes (i.e. what would have happened if we had taken a different decision).

Considering this, we have to recognize that a great decision is NOT about the outcomes. A great decision is about how the decision is made and implemented. More specifically, a great decision is timely, considers many alternatives, and recognizes biases and uncertainty.


To put that in practice, think TOMASP. TOMASP is an acronym to remember the specific behaviours you can adopt to counteract your biases and make better decisions.

Timebox (T) the Decision

Define ahead of time how much time this decision is worth.

You Need This If…

It’s unclear when the decision should be made and how much time you should spend on it.

How to Do It

If the decision is hard to reverse, aim to make it the same week; otherwise, aim for the same day. One week for a “hard to reverse” decision might sound like too little time, and it probably is. The intent here is to focus attention and to prioritize. In my experience this can lead to a few different outcomes:

  1. Most likely, this is actually not a hard to reverse decision and aiming to make it on the same week will lead you to focus on risk management and identify how you can reverse the decision if needed
  2. This is truly a hard to reverse decision and it shouldn’t be made this week; however, there are aspects that can be decided this week, such as how to go about making the decision (e.g., who are the key stakeholders, what needs to be explored)

Multiple decisions are often made at the same time, whenever this happens make sure you’re spending the most time on the most impactful decision.

This Helps Avoid

  • Analysis Paralysis: over-analyzing (or over-thinking) a situation so that a decision or action is never taken, in effect paralyzing the outcome
  • Bikeshedding: spending a disproportionate amount of time and effort on a trivial or unimportant detail of a system, such as the color of a bikeshed for a nuclear plant

Generate More Options (O)

Expand the number of alternatives you’re considering.

You Need This If…

You’re considering “whether or not” to do something.

How to Do It

Aim to generate at least 3 reasonable options.

Do the Vanishing Option Test: if you couldn’t do what you’re currently considering, what else would you do?

Describe the problem, not the solutions, and ask diverse people for ideas, not for feedback (yet).

This Helps Avoid

  • Narrow framing: making a decision without considering the whole context
  • Taking it personally: by truly considering more than 2 options you will become less personally attached to a particular course of action

Meta (M) Decision

Decide on how you want to make the decision

You Need This If…

It’s hard to build alignment with your team or your stakeholders on what is the right decision.

How to Do It

Ask: what should we be optimizing for?

Define team values or principles first and then use them to inform the decision.

Look for heuristics. For instance, at Shopify we have the following heuristic to quickly choose a programming language: if it’s server-side business logic, we default to Ruby; if it’s performance critical or needs to be highly concurrent, we use Go.

This Helps Avoid

  • Misalignment with your team or stakeholders: I have found it easier to agree on the criteria for making the decision, and the criteria can then be leveraged to quickly align on many decisions.

  • Poor implementation: having explicit decision-making criteria will make it a lot easier to articulate the rationale and to give the proper context for anyone executing on it.

Analyze (A) Your Options

Make a table to brainstorm and compare the “pros” and “cons” of each option.

You Need This If…

There is consensus very quickly or (if you’re making a decision on your own) you have very weak “pros” for all but one option.

How to Do It

Make your proposal look like a rough draft to make it easier for people to disagree.

Nominate a devil’s advocate, someone whose role is to argue for the opposite of what most people are leaning towards.

Make sure you have a diverse set of people analysing the options. I’ve gotten in trouble before when there were only developers in the room and we completely missed the UX trade-off of our decision.

For each option that you are considering, ask yourself what would have to be true for this to be the right choice.

This Helps Avoid

  • Groupthink

  • Confirmation bias

  • Status-quo bias

  • Blind spots

Step (S) Back

Hold off on making the decision until the conditions are more favorable.

You Need This If…

It’s the end of the day or the end of the week and emotions are high or energy is low.

How to Do It

Go have lunch, sleep on it, wait until Monday (or until after your next break if you don’t work Monday to Friday).

Do a 10/10/10 analysis: this is another trick I learned from the book Decisive (see resources). Ask yourself how you would feel about the decision 10 minutes later, 10 months later, and 10 years later. The long-term perspective is not necessarily the right one, but thinking about those different timescales helps put short-term emotion in perspective.

Ask yourself these two questions:

  1. What would you advise your best friend?
  2. What would your replacement do?

This Helps Avoid

  • Putting too much weight on short term emotions

  • Irrational decision making due to low energy or fatigue

Prepare (P) to be Wrong

Chances are, you’re over-confident about how the future will unfold.

You Need This If…

Always do this :-)

How to Do It

Set “tripwires”: systems that will snap you to attention when a decision is needed. For example, a development project can be split into multiple phases with clear target dates and deliverables. At Shopify, we typically split projects into think, explore, build, and release phases. The transition between each phase acts as a tripwire. For example, before moving to build, the team and stakeholders review the technical design (the deliverable for that phase) and have to make a conscious decision to continue the project or pause it.

Whenever a phase is expected to take over 4 weeks, I like to break it down further into milestones. Again, it’s essential that each milestone has a clear target date and deliverable (e.g., 50% of the tasks are completed by Oct 10th) so that it can act as a tripwire.

You can set up additional tripwires by doing a pre-mortem analysis: imagine the worst-case scenario, then brainstorm potential root causes. You now have leading indicators that you can monitor and use as tripwires.

This Helps Avoid

  • Reacting too slowly: setting tripwires will help you detect early when things are going off the rails.

TOMASP in Action

At the beginning of this post, I gave the following example:

Michelle, a technical lead for a popular mobile app is agonizing about whether or not she should direct her team to rewrite the app using Flutter, a new technology for building mobile apps.

Flutter has an elegant architecture that should make development much faster without compromising quality. It was created by Google and is already in use by several other reputable companies. If Flutter delivers on its promises, Michelle’s team has a good chance of achieving their goals which seem highly unlikely with the current tech stack.

But starting a big rewrite now will be hard. It’s going to be hard to get buy-in from senior leadership, no one on the team has experience with Flutter, and Mike, one of the senior developers on the team, is really not interested in trying something new and will probably quit if she decides to move forward with Flutter.

Here is how Michelle can use TOMASP to make a Great Decision Quickly:

  • Timebox (T):
    • This feels like a hard to reverse decision, so Michelle aims to make it by the end of the week.
  • Generate More Options (O):
    • Michelle uses the Vanishing Option Test to think of alternatives. If she couldn’t rewrite the whole app using Flutter what could she do?
    • Use a hybrid approach and only rewrite a section of the app in Flutter.
    • Have the iOS and Android developers systematically pair-program when implementing features.
    • Use another cross-platform framework such as React Native or Xamarin.
  • Meta (M) Decision:
    • What should Michelle optimize for? She comes up with the following hierarchy: 1) cross-platform consistency 2) performance 3) development speed
  • Analyze (A) Options:
    • Michelle concludes that for Flutter to be the right choice, a developer should be able to deliver the same level of quality in 50% or less of the time (to account for the risk and learning time of using a new technology).
  • Step (S) Back:
    • Michelle decides to make the decision first thing Friday morning and do a 10/10/10 analysis to ensure she’s not putting too much weight on short term emotion.
  • Prepare (P) to be Wrong:
    • Michelle decides to timebox a prototype: over the next 2 weeks she will pair with a developer on her team to build a section of the app using Flutter. She will then ask her team members to do a blind test and see if they can guess which part of the app has been rebuilt using Flutter.

That’s it! Even if Michelle ends up making the same decision, notice how much better she’s prepared to execute on it.

Thanks for reading, I hope you find this decision framework useful. I would be very interested in hearing how you’ve put TOMASP to use, please let me know by posting a comment below.

Some great resources:

We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.

Continue reading

Five Common Data Stores and When to Use Them

Five Common Data Stores and When to Use Them

An important part of any technical design is choosing where to store your data. Does it conform to a schema or is it flexible in structure? Does it need to stick around forever or is it temporary?

In this article, we’ll describe five common data stores and their attributes. We hope this information will give you a good overview of different data storage options so that you can make the best possible choices for your technical design.

The five types of data stores we will discuss are

  1. Relational database
  2. Non-relational (“NoSQL”) database
  3. Key-value store
  4. Full-text search engine
  5. Message queue

Relational Database

Databases are, like, the original data store. When we stopped treating computers like glorified calculators and started using them to meet business needs, we started needing to store data. And so we (and by we, I mean Charles Bachman) invented the first database management system in 1963. By the mid to late ‘70s, these database management systems had become the relational database management systems (RDBMSs) that we know and love today.

A relational database, or RDB, is a database which uses a relational model of data.

Data is organized into tables. Each table has a schema which defines the columns for that table. The rows of the table, which each represent an actual record of information, must conform to the schema by having a value (or a NULL value) for each column.

Each row in the table has its own unique key, also called a primary key. Typically this is an integer column called “ID.” A row in another table might reference this table’s ID, thus creating a relationship between the two tables. When a column in one table references the primary key of another table, we call this a foreign key.

Using this concept of primary keys and foreign keys, we can represent incredibly complex data relationships using incredibly simple foundations.
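The key relationship can be sketched in plain Ruby, modelling each table as an array of row hashes (the table and column names here are invented for illustration):

```ruby
# Two "tables": each row's :id is its primary key.
authors = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Grace" },
]
# :author_id is a foreign key referencing the authors table's primary key.
posts = [
  { id: 10, author_id: 1, title: "Relations 101" },
  { id: 11, author_id: 2, title: "Schemas in practice" },
]

# Resolving each post's foreign key to its author row is conceptually
# what a SQL JOIN does.
posts_with_authors = posts.map do |post|
  author = authors.find { |a| a[:id] == post[:author_id] }
  post.merge(author_name: author[:name])
end
```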

SQL, which stands for structured query language, is the industry standard language for interacting with relational databases.

At Shopify, we use MySQL as our RDBMS. MySQL is durable, resilient, and persistent. We trust MySQL to store our data and never, ever lose it.

Other features of RDBMSs are

  • Replicated and distributed (good for scalability)
  • Enforces schemas and atomic, consistent, isolated, and durable (ACID) transactions (leads to well-defined, expected behavior of your queries and updates)
  • Good, configurable performance (fast lookups, can tune with indices, but can be slow for cross-table queries)

When to Use a Relational Database

Use a database for storing your business critical information. Databases are the most durable and reliable type of data store. Anything that you need to store permanently should go in a database.

Relational databases are typically the most mature databases: they have withstood the test of time and continue to be an industry standard tool for the reliable storage of important data.

It’s possible that your data doesn’t conform nicely to a relational schema or your schema is changing so frequently that the rigid structure of a relational database is slowing down your development. In this case, you can consider using a non-relational database instead.

Non-Relational (NoSQL) Database

Computer scientists over the years did such a good job of designing databases to be available and reliable that we started wanting to use them for non-relational data as well: data that doesn’t strictly conform to some schema, or whose schema is so variable that it would be a huge pain to try to represent it in relational form.

These non-relational databases are often called “NoSQL” databases. They have roughly the same characteristics as SQL databases (durable, resilient, persistent, replicated, distributed, and performant) except for the major difference of not enforcing schemas (or enforcing only very loose schemas).

NoSQL databases can be categorized into a few types, but there are two primary types which come to mind when we think of NoSQL databases: document stores and wide column stores.

(In fact, some of the other data stores below are technically NoSQL data stores, too. We have chosen to list them separately because they are designed and optimized for different use cases than these more “traditional” NoSQL data stores.)

Document Store

A document store is basically a fancy key-value store where the key is often omitted and never used (although one does get assigned under the hood—we just don’t typically care about it). The values are blobs of semi-structured data, such as JSON or XML, and we treat the data store like it’s just a big array of these blobs. The query language of the document store will then allow you to filter or sort based on the content inside of those document blobs.

A popular document store you might have heard of is MongoDB.
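The “filter on document contents” idea can be sketched in a few lines of Ruby, treating the store as an array of parsed JSON blobs (this illustrates the concept only; it is not MongoDB’s actual query API):

```ruby
require "json"

# A toy document store: an array of schemaless JSON blobs.
store = [
  '{"name": "widget", "price": 10, "tags": ["sale"]}',
  '{"name": "gadget", "price": 25}',
  '{"title": "no price field at all"}',
].map { |blob| JSON.parse(blob) }

# A "query": filter on a field that only some documents happen to have.
cheap = store.select { |doc| doc["price"] && doc["price"] < 20 }
```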

Wide Column Store

A wide column store is somewhere in between a document store and a relational DB. It still uses tables, rows, and columns like a relational DB, but the names and formats of the columns can be different for various rows in the same table. This strategy combines the strict table structure of a relational database with the flexible content of a document store.

Popular wide column stores you may have heard of are Cassandra and Bigtable.

At Shopify, we use Bigtable as a sink for some streaming events. Other NoSQL data stores are not widely used at Shopify. We find that the majority of our data can be modeled in a relational way, so we stick to SQL databases as a rule.

When to use a NoSQL Database

Non-relational databases are most suited to handling large volumes of data and/or unstructured data. They’re extremely popular in the world of big data because writes are fast. NoSQL databases don’t enforce complicated cross-table schemas, so writes are unlikely to be a bottleneck in a system using NoSQL.

Non-relational databases offer a lot of flexibility to developers, so they are also popular with early-stage startups or greenfield projects where the exact requirements are not yet clear.

Key-Value Store

Another way to store non-relational data is in a key-value store.

A key-value store is basically a production-scale hashmap: a map from keys to values. There are no fancy schemas or relationships between data. No tables or other logical groups of data of the same type. Just keys and values, that’s it.

At Shopify, we use two key-value stores: Redis and Memcached.

Both Redis and Memcached are in-memory key-value stores, so their performance is top-notch.

Since they are in-memory, they (necessarily) support configurable eviction policies. We will eventually run out of memory for storing keys and values, so we’ll need to delete some. The most popular strategies are Least Recently Used (LRU) and Least Frequently Used (LFU). These eviction policies make key-value stores an easy and natural way to implement a cache.
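The LRU idea itself is small enough to sketch (this is the concept, not how Redis or Memcached actually implement eviction). Ruby hashes preserve insertion order, so the first key is always the least recently used one:

```ruby
# A toy LRU cache: re-inserting a key on every access keeps the hash
# ordered from least to most recently used.
class LRUCache
  def initialize(capacity)
    @capacity = capacity
    @store = {}
  end

  def read(key)
    return nil unless @store.key?(key)
    @store[key] = @store.delete(key) # move to the back (most recent)
  end

  def write(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.first[0]) if @store.size > @capacity # evict LRU
  end

  def keys
    @store.keys
  end
end
```

With capacity 2: writing `:a` and `:b`, reading `:a`, then writing `:c` evicts `:b`, because `:a` was touched more recently.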

(Note: There are also disk-based key-value stores, such as RocksDB, but we have no experience with them at Shopify.)

One major difference between Redis and Memcached is that Redis supports some data structures as values. You can declare that a value in Redis is a list, set, queue, hash map, or even a HyperLogLog, and then perform operations on those structures. With Memcached, everything is just a blob and if you want to perform any operations on those blobs, you have to do it yourself and then write it back to the key again.

Redis can also be configured to persist to disk, which Memcached cannot. Redis is therefore a better choice for storing persistent data, while Memcached remains only suitable for caches.

When to use a Key-Value Store

Key-value stores are good for simple applications that need to store simple objects temporarily. An obvious example is a cache. A less obvious example is to use Redis lists to queue units of work with simple input parameters.

Full-Text Search Engine

Search engines are a special type of data store designed for a very specific use case: searching text-based documents.

Technically, search engines are NoSQL data stores. You ship semi-structured document blobs into them, but rather than storing them as-is and using XML or JSON parsers to extract information, the search engine slices and dices the document contents into a new format that is optimized for searching based on substrings of long text fields.
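The core of that search-optimized format is an inverted index: instead of scanning every document for a word, map each token to the documents containing it. A minimal sketch (not Elasticsearch’s actual data structures):

```ruby
docs = {
  1 => "the quick brown fox",
  2 => "the lazy brown dog",
  3 => "a quick study",
}

# Build the inverted index: token => list of document ids.
index = Hash.new { |h, k| h[k] = [] }
docs.each do |id, text|
  text.downcase.scan(/\w+/).uniq.each { |token| index[token] << id }
end

# A search is now a lookup (plus an intersection for multi-word
# queries), not a scan over the full text of every document.
def search(index, query)
  query.downcase.scan(/\w+/).map { |t| index.fetch(t, []) }.reduce(:&)
end
```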

Search engines are persistent, but they’re not designed to be particularly durable. You should never use a search engine as your primary data store! It should be a secondary copy of your data, which can always be recreated from the original source in an emergency.

At Shopify we use Elasticsearch for our full-text search. Elasticsearch is replicated and distributed out of the box, which makes it easy to scale.

The most important feature of any search engine, though, is that it performs exceptionally well for text searches.

To learn more about how full-text search engines achieve this fast performance, you can check out Toria’s lightning talk from StarCon 2019.

When to use a Full-Text Search Engine

If you have found yourself writing SQL queries with a lot of wildcard matches (for example, “SELECT * FROM products WHERE description LIKE '%cat%'” to find cat-related products) and you’re thinking about brushing up on your natural-language processing skills to improve the results… you might need a search engine!

Search engines are also pretty good at searching and filtering by exact text matches or numeric values, but databases are good at that, too. The real value add of a full-text search engine is when you need to look for particular words or substrings within longer text fields.

Message Queue

The last type of data store that you might want to use is a message queue. It might surprise you to see message queues on this list because they are considered more of a data transfer tool than a data storage tool, but message queues store your data with as much reliability and even more persistence than some of the other tools we’ve discussed already!

At Shopify, we use Kafka for all our streaming needs. Payloads called “messages” are inserted into Kafka “topics” by “producers.” On the other end, Kafka “consumers” can read messages from a topic in the same order they were inserted in.

Under the hood, Kafka is implemented as a distributed, append-only log. It’s just files! Although not human-readable files.

Kafka is typically treated as a message queue, and rightly belongs in our message queue section, but it’s technically not a queue; it’s a distributed log. That means we can do things like set a data retention time of “forever”, or compact our messages by key (retaining only the most recent value for each key), and we’ve basically got a key-value document store!
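Compaction by key can be sketched in Ruby: given an append-only log of (key, value) messages, keep only the most recent value per key. (This shows the idea only, not Kafka’s actual on-disk format or compaction machinery.)

```ruby
# An append-only log of messages, oldest first. The keys here are
# invented for illustration.
log = [
  { key: "order:1", value: "placed" },
  { key: "order:2", value: "placed" },
  { key: "order:1", value: "shipped" },
]

# Compaction: replaying the log and letting the latest value per key
# win is exactly the behaviour of a key-value store.
compacted = log.each_with_object({}) { |msg, acc| acc[msg[:key]] = msg[:value] }
```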

Although there are some legitimate use cases for such a design, if what you need is a key-value document store, a message queue is probably not the best tool for the job. You should use a message queue when you need to ship some data between services in a way that is fast, reliable, and distributed.

When to use a Message Queue

Use a message queue when you need to temporarily store, queue, or ship data.

If the data is very simple and you’re just storing it for use later in the same service, you could consider using a key-value store like Redis. You might consider using Kafka for the same simple data if it’s very important data, because Kafka is more reliable and persistent than Redis. You might also consider using Kafka for a very large amount of simple data, because Kafka is easier to scale by adding distributed partitions.

Kafka is often used to ship data between services. The producer-consumer model has a big advantage over other solutions: because Kafka itself acts as the message broker, you can simply ship your data into Kafka, and the receiving service can poll for updates. If you tried to use something simpler, like Redis, you would have to implement some kind of notification or polling mechanism yourself, whereas Kafka has this built in.

In Conclusion

These are not the be-all-end-all of data stores, but we think they are the most common and useful ones. Knowing about these five types of datastores will get you on the path to making great design decisions!

What do you think? Do you have a favourite type of datastore that didn’t make it on the list? Let us know in the comments below.

We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.


How to Write Fast Code in Ruby on Rails

At Shopify, we use Ruby on Rails for most of our projects. For both Rails and Ruby, there exists a healthy amount of stigma toward performance. You’ll often find examples of individuals (and entire companies) drifting away from Rails in favor of something better. On the other hand, there are many who have embraced Ruby on Rails and found success, even at our scale, processing millions of requests per minute (RPM).

Part of Shopify’s success with Ruby on Rails is an emphasis on writing fast code. But, how do you really write fast code? Largely, that’s context sensitive to the problem you’re trying to solve. Let’s talk about a few ways to start writing faster code in Active Record, Rails, and Ruby.

Active Record Performance

Active Record is Rails’ default Object Relational Mapper (ORM). Active Record is used to interact with your database by generating and executing Structured Query Language (SQL). There are many ways to query large volumes of data poorly. Here are some suggestions to help keep your queries fast.

Know When SQL Gets Executed

Active Record evaluates queries lazily. So, to query efficiently, you should know when queries are executed. Finder methods, calculations, and association methods all cause queries to be evaluated. Here’s an example:

Here the code is appending a comment to a blog post and automatically saving it to the database. It isn’t immediately obvious that this executes a SQL INSERT to save the appended comment. These kinds of gotchas become easier to spot through reading documentation and experience.

Select Less Where Possible

Another way to query efficiently is to select only what you need. By default, Active Record selects all columns in SQL with SELECT *. Instead, you can leverage select and pluck to take control of your select statements:

Here, we’re selecting all IDs in a blog’s table. Notice select returns an Active Record Relation object (that you can chain query methods off of) whereas pluck returns an array of raw data.

Forget About The Query Cache

Did you know that if you execute the same SQL within the lifetime of a request, Active Record will only query the database once? Query Cache is one of the last lines of defense against redundant SQL execution. This is what it looks like in action:

In the example, subsequent blog SELECTs using the same parameters are loaded from cache. While this is helpful, depending on query cache is a bad idea. Query cache is stored in memory, so its persistence is short-lived. The cache can be disabled, so if your code will run both inside and outside of a request, it may not always be efficient.

Avoid Querying Unindexed Columns

Avoid querying unindexed columns; doing so often leads to unnecessary full table scans. At scale, these queries are likely to time out and cause problems. This is more of a database best practice that directly affects query efficiency.

The obvious solution to this problem is to index the columns you need to query. What isn’t always obvious is how to do it. Databases often lock writes to a table when adding an index. This means large tables can be write-blocked for a long time.

At Shopify, we use a tool called Large Hadron Migrator (LHM) to solve these kinds of scaling migration problems for large tables. On later versions of Postgres and MySQL, there is also concurrent indexing support.
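For example, on Postgres a concurrent index can be added from a Rails migration roughly like this (the table and column names are hypothetical); note that concurrent index builds must run outside the usual DDL transaction:

```ruby
class AddIndexToPostsOnBlogId < ActiveRecord::Migration[6.0]
  # Concurrent index creation can't run inside the migration's transaction
  disable_ddl_transaction!

  def change
    add_index :posts, :blog_id, algorithm: :concurrently
  end
end
```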

Rails Performance

Zooming out from Active Record, Rails has many other moving parts like Active Support, Active Job, Action Pack, etc. Here are some generalized best practices for writing fast code in Ruby on Rails.

Cache All The Things

If you can’t make something faster, a good alternative is to cache it. Things like complex view compilation and external API calls benefit greatly from caching. Especially if the resultant data doesn’t change often.

Taking a closer look at the fundamentals of caching, key naming and expiration are critical to building effective caches. For example:

In the first block, we cache all subscription plan names indefinitely (or until the key is evicted by our caching backend). The second block caches the JSON of all posts for a given blog. Notice how cache keys change in the context of a different blog or when a new post is added to a blog. Finally, the last block caches a global comment count for approved comments. The key will automatically be removed by our caching backend every five minutes after initial fetching.

Throttle Bottlenecks

But what about operations you can’t cache? Things like delivering an email, sending a webhook, or even logging in can be abused by users of an application. Essentially, any expensive operation that can’t be cached should be throttled.

Rails doesn’t have a throttling mechanism by default. So, gems like rack-attack and rack-throttle can help you throttle unwanted requests. Using rack-attack:
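A sketch of such an initializer; this belongs in config/initializers/rack_attack.rb of a Rails app with the rack-attack gem installed, and the throttle name is hypothetical:

```ruby
# config/initializers/rack_attack.rb (requires the rack-attack gem)
class Rack::Attack
  # Allow at most 10 POSTs to /admin/sign_in per IP in any 15-minute window
  throttle("admin/sign_in", limit: 10, period: 15.minutes) do |req|
    req.ip if req.path == "/admin/sign_in" && req.post?
  end
end
```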

This snippet limits a given IP’s POST requests to /admin/sign_in to 10 in 15 minutes. Depending on your application’s needs, you can also build solutions that throttle further up the stack inside your Rails app. Rack-based throttling solutions are popular because they allow you to throttle bad requests before they hit your Rails app.

Do It Later (In a Job)

A cornerstone of the request-response model we work with as web developers is speed. Keeping things snappy for users is important. So, what if we need to do something complicated and long-running?

Jobs allow us to defer work to another process through queueing systems often backed by Redis. Exporting a dataset, activating a subscription, or processing a payment are all great examples of job-worthy work. Here’s what jobs look like in Rails:

This is a trivial example of how you would write a CSV exporting job. Active Job is Rails’ job definition framework which plugs into specific queueing backends like Sidekiq or Resque.

Start Dependency Dieting

Ruby’s ecosystem is rich, and there are a lot of great libraries you can use in your project. But how much is too much? As a project grows and matures, dependencies often turn into liabilities.

Every dependency adds more code to your project. This leads to slower boot times and increased memory usage. Being aware of your project’s dependencies and making conscious decisions to minimize them help maintain speed in the long term.

Shopify’s core monolith, for example, has ~500 gem dependencies. This year, we’ve taken steps to evaluate our gem usage and remove unnecessary dependencies where possible. This led to removing unused gems, addressing tech debt to remove legacy gems, and using a dependency management service (e.g., Dependabot).

Ruby Performance

A framework is only as fast as the language it’s written in. Here are some pointers on writing performant Ruby code. This section is inspired by Jeremy Evans’s closing keynote on performance at RubyKaigi 2019.

Use Metaprogramming Sparingly

Changing a program’s structure at runtime is a powerful feature. In a highly dynamic language like Ruby, there are significant performance costs associated with metaprogramming. Let’s look at method definition as an example:
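A sketch of those three definition styles (pure Ruby; the class and method names are hypothetical):

```ruby
class Greeter
  # 1. A plain def: the fastest to call
  def hello_def
    "hello"
  end

  # 2. define_method with a block: metaprogrammed, and a bit slower to call
  define_method(:hello_define_method) do
    "hello"
  end

  # 3. class_eval on a string that itself uses def: evaluated at runtime,
  #    but the resulting method calls nearly as fast as a plain def
  class_eval <<~RUBY, __FILE__, __LINE__ + 1
    def hello_class_eval
      "hello"
    end
  RUBY
end

greeter = Greeter.new
```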

These are three common ways of defining a method in Ruby. The first and most common uses def. The second uses define_method to define a metaprogrammed method. The third uses class_eval to evaluate a string at runtime as source code (which defines a method using def).

This is the output of a benchmark that measures the speed of these three approaches using the benchmark-ips gem. Let’s focus on the lower half of the benchmark, which measures how many times Ruby could run each method in 5 seconds. The normal def method ran 10.9 million times, the define_method version ran 7.7 million times, and the class_eval-defined version ran 10.3 million times.

While this is a trivial example, we can conclude there are clear performance differences associated with how you define a method. Now, let’s look at method invocation:
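A sketch of the invocation example (pure Ruby, hypothetical names):

```ruby
class Invoker
  def invoke
    :invoked
  end

  # Called whenever a method that doesn't exist is invoked on this object
  def method_missing(name, *args)
    name == :invoke_missing ? :invoked_via_method_missing : super
  end

  def respond_to_missing?(name, include_private = false)
    name == :invoke_missing || super
  end
end

obj = Invoker.new

direct  = obj.invoke          # ordinary call: fastest
dynamic = obj.send(:invoke)   # dynamic dispatch via send: slower
missing = obj.invoke_missing  # falls through to method_missing: slowest
```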

This simply defines invoke and method_missing methods on an object named obj. Then, we call the invoke method normally, using the metaprogrammed send method, and finally via method_missing.

Unsurprisingly, a method invoked with send or method_missing is much slower than a regular method invocation. While these differences might seem minuscule, they add up fast in large codebases, or when methods are called many times recursively. As a rule of thumb, use metaprogramming sparingly to prevent unnecessary slowness.

Know the difference between O(n) and O(1)

O(n) and O(1) describe how an operation’s cost scales: an O(n) operation takes time proportional to the size of its input, while an O(1) operation takes constant time regardless of size. Consider this example:
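A sketch of the array-versus-hash lookup difference:

```ruby
array = (1..100_000).to_a
hash  = array.each_with_object({}) { |n, h| h[n] = true }

# O(n): include? scans elements one by one, so the worst case grows
# linearly with the size of the array.
found_in_array = array.include?(99_999)

# O(1): a hash lookup jumps straight to the key's bucket, so it takes
# roughly constant time no matter how many entries the hash holds.
found_in_hash = hash.key?(99_999)
```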

This becomes very apparent when finding a value in an array compared to a hash. With every element you add to an array, there’s more potential data to iterate through whereas hash lookups are always constant regardless of size. The moral of the story here is to think about how your code will scale with more data.

Allocate Less

Memory management is a complicated subject in most languages, and Ruby is no exception. Essentially, the more objects you allocate, the more memory your program consumes. High-level languages usually implement garbage collection to automate the removal of unused objects, making developers’ lives much easier.

Another aspect of memory management is object mutability. For example, if you need to combine two arrays together, do you allocate a new array or mutate an existing one? Which option is more memory efficient?

Generally speaking, fewer allocations are better. Rubyists often classify these kinds of self-mutating methods as “dangerous”. Dangerous methods in Ruby often (but not always) end with an exclamation mark. Here’s an example:
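A sketch of the uniq/uniq! example:

```ruby
symbols = %i[a b b c c]

# uniq allocates and returns a brand-new array; the receiver is untouched.
deduped = symbols.uniq

# uniq! mutates the receiver in place and returns it. (Ruby gotcha: when
# there is nothing to remove, uniq! returns nil instead of the array.)
result = symbols.uniq!
```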

The code above allocates an array of symbols. The first uniq call allocates and returns a new array with all redundant symbols removed. The second uniq! call mutates the receiver directly to remove redundant symbols and returns itself.

If used improperly, dangerous methods can lead to unwanted side effects in your code. A best practice to follow is to avoid mutating global state while leveraging mutation on local state.

Minimize Indirection

Indirection in code, especially through layered abstractions, can be described as both a blessing and a curse. In terms of performance, it’s almost always a curse.

Merb, a web application framework that was merged into Rails, has a motto: “No code is faster than no code.” This can be interpreted as “the more layers of complexity you add to something, the slower it will be.” While this isn’t necessarily true when optimizing code for performance, it’s still a good principle to remember when refactoring.

An example of necessary indirection is Active Resource, an ORM for interacting with web services. Developers don’t use it for better performance, they use it because manually crafting requests and responses is much more difficult (and error prone) by comparison.

Final Thoughts

Software development is full of tradeoffs. As developers, we have enough difficult decisions to make while juggling technical debt, code style, and code correctness. This is why optimizing for speed shouldn’t come first.

At Shopify, we treat speed as a feature. While it lends itself to better user experiences and lower server bills, it shouldn’t take precedence over the happiness of developers working on an application. Remember to keep your code fun while making it fast!



How Shopify Manages Petabyte Scale MySQL Backup and Restore

At Shopify, we run a large fleet of MySQL servers, with numerous replica-sets (internally known as “shards”) spread across three Google Cloud Platform (GCP) regions. Given the petabyte scale size and criticality of data, we need a robust and efficient backup and restore solution. We drastically reduced our Recovery Time Objective (RTO) to under 30 minutes by redesigning our tooling to use disk-based snapshots, and we want to share how it was done.

Challenges with Existing Tools

For several years, we backed up our MySQL data using Percona’s Xtrabackup utility, stored its output in files, and archived them on Google Cloud Storage (GCS). While pretty robust, it provided a significant challenge when backing up and restoring data. The amount of time taken to back up a petabyte of data spread across multiple regions was too long, and increasingly hard to improve. We perform backups in all availability regions to decrease the time it takes to restore data cross-region. However, the restore times for each of our shards was more than six hours, which forced us to accept a very high RTO.

While this lengthy restore time was painful when using backups for disaster recovery, we also leverage backups for day-to-day tasks, such as re-building replicas. Long restore times also impaired our ability to scale replicas up and down in a cluster for purposes like scaling our reads to replicas.

Overcoming Challenges

Since we run our MySQL servers on GCP’s Compute Engine VMs using Persistent Disk (PD) volumes for storage, we invested time in leveraging PD’s snapshot feature. Using snapshots was simple enough, conceptually. In terms of storage, each initial snapshot of a PD volume is a full copy of the data, whereas the subsequent ones are automatically incremental, storing only data that has changed.

In our benchmarks, an initial snapshot of a multi-terabyte PD volume took around 20 minutes and each incremental snapshot typically took less than 10 minutes. The incremental nature of PD snapshots allows us to snapshot disks very frequently, helps us with having the latest copy of data, and minimizes our Mean Time To Recovery.

Modernizing our Backup Infrastructure

Taking a Backup

We built our new backup tooling around the GCP API to invoke PD snapshots. The tooling takes into account the availability regions and zones, the role of each MySQL instance (replica or master), and other MySQL consistency variables. We deployed it in our Kubernetes infrastructure as CronJobs, which makes the jobs distributed and decouples them from individual MySQL VMs, so there is no coordination to handle if a host fails. The CronJob is scheduled to run every 15 minutes across all the clusters in all of our available regions, helping us avoid costs related to snapshot transfer across different regions.

Backup workflow selecting replica and calling disk API to snapshot, per cron schedule

The backup tooling creates snapshots of our MySQL instances nearly 100 times a day across all of our shards, totaling thousands of snapshots every day with virtually no failures.

Since we snapshot so frequently, it can easily cost thousands of dollars every day for snapshot storage if the snapshots aren’t deleted correctly. To ensure we only keep (and pay for) what we actually need, we built a framework to establish a retention policy that meets our Disaster Recovery needs. The tooling enforcing our retention policy is deployed and managed using Kubernetes, similar to the snapshot CronJobs. We create thousands of snapshots every day, but we also delete thousands of them, keeping only the latest two snapshots for each shard, plus dailies, weeklies, etc. in each region per our retention policy.

Backup retention workflow, listing and deleting snapshots outside of retention policy

Performing a Restore

Having a very recent snapshot always at the ready lets us clone replicas with the most recent data possible. Because restoring a snapshot is just a matter of exporting it to a new PD volume, which takes very little time, this has brought our RTO down to typically less than 30 minutes, including recovery from replication lag.

Backup restore workflow, selecting a latest snapshot and exporting to disk and attaching to a VM

Additionally, restoring a backup is now quite simple: we create new PD volumes with the latest snapshot as their source and start MySQL on top of them. Since our snapshots are taken while MySQL is online, after a restore the instance must go through InnoDB crash recovery, and within a few minutes it is ready to serve production queries.

Assuring Data Integrity and Reliability

While PD snapshot-based backups are obviously fast and efficient, we needed to ensure that they are reliable, as well. We run a backup verification process for all of the daily backups that we retain. This means verifying two daily snapshots per shard, per region.

In our backup verification tooling, we export each retained snapshot to a PD volume, attach it to a Kubernetes Job, and verify the following:

  • if a MySQL instance can be started using the backup
  • if replication can be started using MySQL Global Transaction ID (GTID) auto-positioning with that backup
  • if there is any InnoDB page-level corruption within the backup

Backup verification process, selecting daily snapshot, exporting to disk and spinning up a Kubernetes job to run verification steps

This verification process restores and verifies more than a petabyte of data every day utilizing fewer resources than expected.

PD snapshots are fast and efficient, but the snapshots created exist only inside of GCP and can only be exported to new PD volumes. To ensure data availability in case of catastrophe, we needed to store backups at an offsite location. We created tooling which backs up the data contained in snapshots to an offsite location. The tooling exports the selected snapshot to new PD volume and runs Kubernetes Jobs to compress, encrypt and archive the data, before transferring them as files to an offsite location operated by another provider.

Evaluating the Pros and Cons of Our New Backup and Restore Solution


Pros:

  • Using PD snapshots allows for faster backups compared to traditional file-based backup methods.
  • Backups taken using PD snapshots are faster to restore, as they can leverage the vast computing resources available to GCP.
  • The incremental nature of snapshots results in reduced backup times, making it possible to take backups more frequently.
  • The performance impact on the donors of snapshots is noticeably lower than on the donors of Xtrabackup-based backups.

Cons:

  • Using PD snapshots is more expensive for storage compared to traditional file-based backups stored in GCS.
  • The snapshot process itself doesn’t perform any integrity checks (for example, scanning for InnoDB page corruption or ensuring data consistency), which means additional tools may need to be built.
  • Because snapshots are not inherently stored as a conveniently packaged backup, it is more tedious to copy, store, or send them off-site.

We undertook this project at the start of 2019 and, within a few months, we had a very robust backup infrastructure built around Google Cloud’s Persistent Disk snapshot API. This tooling has been serving us well and has opened up new possibilities beyond disaster recovery, like quickly scaling replicas up and down for reads using these snapshots.

If database systems are something that interests you, we're looking for Database Engineers to join the Datastores team! Learn all about the role on our career page. 

Continue reading

How Shopify Scales Up Its Development Teams

How Shopify Scales Up Its Development Teams

Have you clicked on this article because you’re interested in how Shopify scales its development teams and the lessons we’re learning along the way? Well, cool, you’ve come to the right place. But first, a question.

Are you sure you need to scale your team?

Really, really sure?

Are You Ready to Scale Your Team?

Hiring people is relatively straightforward, but growing effective teams is difficult. And no matter how well you do it, there will be a short-term price to pay in terms of a dip in productivity before, hopefully, you realize a gain in output. So before embarking on this journey you need to make sure your current team is operating well. Would you say that your team:

  1. Ruthlessly prioritizes its work in line with a product vision so it concentrates on the most impactful features?
  2. Maximizes the time it spends developing product, and so minimizes the time it spends on supporting activities like documentation and debates?
  3. Has the tools and methods to ship code within minutes and uncover bugs quickly?

If you can’t answer these questions positively then you can get a lot more from your current team. Maybe you don’t need to add new folks after all.

But let’s assume you’re in a good place with all of these. As we consider how to scale up a development organization, it’s fundamentally important to remember that hiring new people, no matter how brilliant they are, is a means to an end. We are striving to have more teams, each working effectively towards clear goals. So scaling up is partly about hiring great people, but mostly about building effective teams.

Hiring Great People

At Shopify we build a product that is both broad and deep to meet the needs of entrepreneurs who run many different types of business. We’ve deconstructed this domain into problem spaces and mapped them to product areas. Then we’ve broken these down into product development teams of five to nine folks, each team equipped with the skills it needs to achieve its product goals. This means a team generally consists of a product manager, back-end developers, web developers, data specialists and UX designers.


Develop Close Relationships with Your Talent Acquisition Team

Software development at scale is a social activity. That’s a fact that’s often underappreciated by inexperienced developers and leaders. When hiring, evaluating the technical abilities of candidates is a given, but evaluating their fit with your culture and their ability to work harmoniously with their teammates is as important. At Shopify we have a well-defined multi-step hiring process that we continually review based on its results. Technical abilities are evaluated by having the candidate participate in problem-solving and pair-programming exercises with experienced developers, and cultural fit is assessed by having them meet with their prospective teammates and leaders. These steps are time consuming, so we need to filter the candidates to ensure that only the most likely hires reach this stage. To do that, we have built close working relationships between our developers and our Talent Acquisition (TA) specialists.

I can’t overemphasize how important it is to have TA specialists who understand the company culture and the needs of each team. They make sure we meet the best candidates, making effective use of our leads’ and developers’ time. So when scaling up, the first folks to recruit are the recruiters themselves, specialists who know your market. You must spend enough time with them so that they deeply understand what it takes to be a successful developer in your teams. They will be the face of your company to candidates. Even candidates whom you do not ultimately hire (in fact, especially those ones) should feel positive about the hiring experience. If they don’t, you may find the word gets around in your market and your talent pipeline dries up.

Aim for Diversity of Experience on Teams

We aim to have teams that are diverse in many dimensions, including experience. We’ve learned that on average it takes about a year at Shopify before folks have fully on-boarded and have the business context, product knowledge and development knowhow to make great decisions quickly. So, our rule-of-thumb is that in any team the number of on-boarded folks should be greater than or equal to the number of those still onboarding. We know that the old software development model where a single subject matter expert communicates product requirements to developers leads to poor designs. Instead, we seek to have every team member empathize with entrepreneurs and for the team to have a deep understanding of the business problem they are solving. Scaling up is about creating more of these balanced and effective teams, each with ownership of a well-defined product area or business problem.

Building Effective Teams

Let’s move on from hiring and consider some other aspects of building effective teams. When talking about software development effectiveness, it’s hard to avoid talking about process. So, process! Right, with that out of the way, let’s talk about setting high standards for the craft of coding, and the tools of the craft.

Start With a Baseline

For teams to be effective quickly, they need to have a solid starting point for how they will work, how they will plan their work and track their progress, and for the tools and technologies they will use. We have established many of these starting points based on our experience so having every new team start again from first principles would be a tremendous waste of time. That doesn’t prevent folks from innovating in these areas, but the starting baseline is clear.

Be Open About Technical Design and Code Changes

I mentioned previously the importance of having the right mix of onboarded vs. still-onboarding folks, and that’s partly about ensuring that in every team there is a deep empathy for our merchants and for what it means to ship code at Shopify scale. Beyond that, we seek to share that context across teams by being extremely open about technical designs and code changes. Our norm is that teams are completely transparent about what they are doing and what they are intending to do, and they can expect to receive feedback from almost anyone in the company who has context in their area. With this approach, there’s a risk of longer debates and yeah, that has been known to happen here, but we also have a shared set of values that help to prevent this. Specifically, we encourage behaviors that support “making good decisions quickly” and “building for the long term.” In this way, our standards are set by what we do and not by following a process.

Use Tooling to Codify Best Practices

Tooling is another effective way to codify best practices for teams, so we have folks dedicated to building a development pipeline for everyone, with dashboards that show how every team is doing. This infrastructure work is of great importance when scaling. Standards for code quality and testing are embedded in the toolset, so teams don’t waste time relearning the lessons of others; rather, they can concentrate on the job of building great products. Once you start to scale up beyond a single team, you’ll need to dedicate some folks to build and maintain this infrastructure. (I use the plural deliberately because you’ll never have just one developer assigned to anything critical, right?)

You can read more about our tooling here on this blog. The Merge Queue and our Deprecation Toolkit are great examples of codified best practices, and you can read about how we combine these into a development pipeline in Shopify’s Tech Stack.

As the new team begins its work, we must have feedback loops to reinforce the behaviors that produce the best outcomes. From a software perspective, this is why tooling is so important: it lets a team ship quickly and respond to the feedback of stakeholders and users.


Use Pairing to Share Experiences

Which brings me to pairing. The absolute best way for onboarded developers to share their experience with new folks is by coding together. Pairing is optional but actively encouraged in Shopify. We have pairing rooms in all our offices, and we hold retrospectives on the results to ensure it adds value. There’s an excellent article on our blog by Mihai Popescu that describes our approach: Pair Programming Explained.

Conduct Frequent Retrospectives

From a team effectiveness perspective, frequent retrospectives allow us to step back from the ongoing tasks to get a wider perspective on how well the team is doing. At Shopify, we encourage teams to have their retrospectives facilitated by someone outside the team to bring fresh eyes and ideas. In this way, a team is driven to continually improve its effectiveness.

At Shopify we understand that hiring is only a small step towards the goal of scaling up. Ultimately, we’re trying to create more teams and have them work more effectively. We’ve found that to scale development teams you need to have a baseline to build from, an openness around technical design, effective tooling, pair programming and frequent retrospectives.


Want to Improve UI Performance? Start by Understanding Your User

My team at Shopify recently did a deep dive into the performance of the Marketing section in the Shopify admin. Our focus was to improve the UI performance. This included a mix of improvements that affected load time, perceived load time, as well as any interactions that happen after the merchant has landed in our section.

It’s important to take the time to ask yourself what the user (in our case, merchant) is likely trying to accomplish when they visit a page. Once you understand this, you can try to unblock them as quickly as possible. We as UI developers can look for opportunities to optimize for common flows and interactions the merchant is likely going to take. This helps us focus on improvements that are user centric instead of just trying to make our graphs and metrics look good.

I’ll dive into a few key areas that we found made the biggest impact on UI performance:

  • Assessing your current situation and spotting areas that could be improved
  • Prioritizing the loading of components and data
  • Improving perceived loading performance by looking at how the design of loading states can influence the way users experience load time

Our team has always kept performance top of mind. We follow industry best practices like route-based bundle splitting and are careful not to include any large external dependencies. Nevertheless, it was still clear that we had a lot of room for improvement.

The front end of our application is built using React, GraphQL, and Apollo. The advice in this article aims to be framework agnostic, but there are some references to React specific tooling.

Assess Your Current Situation

Develop Merchant Empathy by Testing on Real-World Devices

In order to understand what needed to be improved, we had to first put ourselves in the shoes of the merchant. We wanted to understand exactly what the merchant is experiencing when they use the Marketing section. We should be able to offer merchants a quality experience no matter what device they access the Shopify admin from.

We think testing on real, low-end devices is important. It allows us to ensure that our application performs well enough for users who may not have the latest iPhone or MacBook Pro.

Moto G3 Device

We grabbed a Moto G3 and connected the device to Chrome developer tools via the remote devices feature. If you don’t have access to a real device to test with, you can make use of webpagetest.org to run your application on a real device remotely.

Capture an Initial Profile

Our initial performance profile captured using Chrome Developer tools.

After capturing our initial profile using the performance profiler included in the Chrome developer tools, we needed to break it down. This profile gives us a detailed timeline of every network request, JavaScript execution, and event that happens during our recording plus much, much more. We wanted to understand exactly what is happening when a merchant interacts with our section.

We ran the audit with React in development mode so we could take advantage of the user timings it provides. Running React in production mode would have performed better, but the user timings made it much easier to identify which components we needed to investigate.

React Profiler from React Dev Tools

We also took the time to capture a profile using the profiler provided by React dev tools. This tool allowed us to see React-specific details like how long it took to render a component or how many times that component had been updated. The React profiler was particularly useful when we sorted our components from slowest to fastest.

Get Our Priorities in Order

After reviewing both of these profiles, we were able to take a step back and gain some perspective. It became clear that our priorities were out of order.

We found that the components and data that are most crucial to merchants were being delayed by components that could have been loaded at a later time. There was a big opportunity here to rearrange the order of operations in our favor with the ultimate goal of making the page useful as soon as possible.

We know that the majority of visits to the Marketing section are incremental. This means that the merchant navigated to the Marketing section from another page in the admin. Because the admin is a single page app, these incremental navigations are all handled client side (in our case using React Router). This means that traditional performance metrics like time to first byte or first meaningful paint may not be applicable. We instead make use of the Navigation Timing API to track navigations within the admin.

When a merchant visits the Marketing section, the following events happen:

  • JavaScript required to render the page is fetched
  • A GraphQL query is made for the data required for the page
  • The JavaScript is executed and our view is rendered with our data

Any optimizations we do will be to improve one of those events. This could mean fetching less data and JavaScript, or making the execution of the JavaScript faster.

Deprioritize Non-Essential Components and Code Execution

We wanted the browser to do the least amount of work necessary to render our page. In our case, we were asking the browser to do work that did not immediately benefit the merchant. This low-priority work was getting in the way of more important tasks. We took two approaches to reducing the amount of work that needed to be done:

  • Identifying expensive tasks that are run repeatedly and memoizing (~caching) them.
  • Identifying components that are not immediately required and deferring them.

Memoizing Repetitive and Expensive Tasks

One of the first wins here was around date formatting. The React profiler was able to identify one component that was significantly slower than the rest of the components on the page.

React Profiler Identifying <StartEndDates /> Component is Significantly Slower

The <StartEndDates /> component stood out. This component renders a calendar that allows merchants to select a start and end date. After digging into this component, we discovered that we were repeating a lot of the same tasks over and over. We found that we were constructing a new Intl.DateTimeFormat object every time we needed to format a date. By creating a single Intl.DateTimeFormat object and referencing it every time we needed to format a date, we were able to reduce the amount of work the browser needed to do in order to render this component.

<StartEndDates /> after memoization of two other date formatting utilities

This, in combination with the memoization of two other date formatting utilities, drastically improved this component’s render time, taking it from ~64.7 ms down to ~0.5 ms.
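The pattern is language agnostic: hoist the expensive construction out of the hot path and reuse the object. A minimal Ruby sketch (a hypothetical `DateFormatter` class standing in for `Intl.DateTimeFormat`, not Shopify’s actual code), with a counter to show that the memoized path constructs the formatter only once:

```ruby
# Hypothetical illustration of the memoization pattern described above:
# constructing a formatter is expensive, so build it once and reuse it.
class DateFormatter
  @@constructions = 0

  def initialize(pattern)
    @@constructions += 1
    @pattern = pattern
  end

  def self.constructions
    @@constructions
  end

  def format(time)
    time.strftime(@pattern)
  end
end

# Before: a new formatter per call (slow when called for every date shown).
def format_date_naive(time)
  DateFormatter.new("%b %-d, %Y").format(time)
end

# After: build a single formatter and reference it on every call.
MEMOIZED_FORMATTER = DateFormatter.new("%b %-d, %Y")

def format_date_memoized(time)
  MEMOIZED_FORMATTER.format(time)
end
```

Both functions produce identical output; only the memoized one avoids paying the construction cost on every call.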

Defer Non-Essential Components

Async loading allows us to load only the minimum amount of JavaScript required to render our view. It is important to keep the JavaScript we do load small and fast as it contributes to how quickly we can render the page on navigation.

One example of a component that we decided to defer was our <ImagePicker />. This component is a modal that is not visible until the merchant clicks a Select image button. Since this component is not needed on the initial load, it is a perfect candidate for deferred loading.

By moving the JavaScript required for this component into a separate bundle that is loaded asynchronously, we were able to reduce the size of the bundle that contained the JavaScript that is critical to rendering our initial view.

Get a Head Start

Prefetching the image picker when the merchant hovers over the activator button makes it feel like the modal instantly loads.

Deferring the loading of components is only half the battle. Even though the component is deferred, it may still be needed later on. If we have the component and its data ready when the merchant needs it, we can provide an experience that really feels instant.

Knowing what a merchant is going to need before they explicitly request it is not an easy task. We do this by looking for hints the merchant provides along the way. This could be a hover, scrolling an element into the viewport, or common navigation flows within the Shopify admin.

In the case of our <ImagePicker /> modal, we do not need the modal until the Select image button is clicked. If the merchant hovers over the button, it’s a pretty clear hint that they will likely click. We start prefetching the <ImagePicker /> and its data so by the time the merchant clicks we have everything we need to display the modal.
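This hint-then-fetch pattern generalizes beyond the browser. A toy Ruby sketch (hypothetical names; a background thread stands in for the network fetch):

```ruby
# Toy sketch of hint-based prefetching: `hint` starts the fetch in the
# background; `get` returns the result, waiting only if it isn't ready yet.
class Prefetcher
  def initialize
    @in_flight = {}
  end

  # Called on a hint (e.g. hover): kick off the fetch without blocking.
  def hint(key, &fetch)
    @in_flight[key] ||= Thread.new(&fetch)
  end

  # Called on the explicit request (e.g. click): reuse the in-flight
  # fetch if one exists, otherwise fetch synchronously.
  def get(key, &fetch)
    thread = @in_flight.delete(key)
    thread ? thread.value : fetch.call
  end
end
```

On hover we would call something like `prefetcher.hint(:image_picker) { load_bundle_and_data }`; by the time the click arrives, `get` usually returns immediately.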

Improve the Loading Experience

In a perfect world, we would never need to show a loading state. In cases where we are unable to prefetch or the data hasn’t finished downloading, we fall back to the best possible loading state by using a spinner or skeleton content. We typically choose a skeleton if we have an idea of what the final content will look like.

Use Skeletons

Skeleton content has emerged as a best practice for loading states. When done correctly, skeletons can make the merchant feel like they have ‘arrived’ at the next state before the page has finished loading.

Skeletons are often not as effective as they could be. We found that it’s not enough to put up a skeleton and call it a day. Including static content that doesn’t rely on data from our API makes the page feel a lot more stable as data arrives from the server. The merchant feels like they have ‘arrived’ instead of being stuck in an in-between loading state.

Animation showing how adding headings helps the merchant understand what content they can expect as the page loads.

Small tweaks like adding headings to the skeleton go a long way. These changes give the merchant a chance to scan the page and get a feel for what they can expect once the page finishes loading. They also have the added benefit of reducing the amount of layout shift that happens as data arrives.

Improve Stability

When navigating between pages, there are often going to be several loading stages. This may be caused by data being fetched from multiple sources, or the loading of resources such as images or fonts.

As we move through these loading stages, we want the page to feel as stable as possible. Drastic changes to the page’s layout are disorienting and can even cause the user to make mistakes.

Using a skeleton to help improve stability by matching the height of the skeleton to the height of the final content as closely as possible.

Here’s an example of how we used a skeleton to help improve stability. The key is to match the height of the skeleton to the height of the final content as closely as possible.

Make the Page Useful as Quickly as Possible

Rendering the ‘Create campaign’ button while we are still in the loading state

In this example, you can see that we are rendering the Create campaign button while we are still in the loading state. We know this button is always going to be rendered, so there’s no sense in hiding it while we are waiting for unrelated data to arrive. By showing this button while still in the loading state, we unblock the merchant.

No Such Thing as Too Fast

The deep dive helped our team develop best practices that we can apply to our work going forward. It also helped us refine a performance mindset that encourages exploration. As we develop new features, we can apply what we’ve learned while always trying to improve on these techniques. Our focus on performance has spread to other disciplines like design and research, and we’re able to work together to build a clearer picture of the merchant’s intent so we can optimize for it.


Many of the techniques described by this article are powered by open source JavaScript libraries that we’ve developed here at Shopify.

The full collection of libraries can be found in our Quilt repo. Here you will find a large selection of packages that enable everything from preloading, to managing React forms, to using Web Workers with React.

We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.


Building Resilient GraphQL APIs Using Idempotency

A payment service that isn’t resilient could fail to complete a charge or even double-charge buyers. Worse, when a request returns an error, the client calling the API can’t be certain of the outcome, reducing trust in the payment methods provided by that service. Shopify’s new Payment Service, which centralizes payment processing for certain payment methods, uses API idempotency to prevent these situations from happening in the first place.

Shopify's New Payment Service

The new Payment Service is owned by the Money Infrastructure team, which is responsible for the code that moves money and that handles and records the interactions with various payment providers. The service provides a GraphQL interface that’s used by Shopify and our Billing system. The Billing system charges merchants based on monthly subscriptions and usage, and pays Shopify Partners and application developers.

The Issues With Non-resilient Payment Services

A payment API should offer an ‘exactly once’ model of resiliency. Payments should not happen twice, and clients should have a way to recover in the case of an error. When an API request can’t be re-attempted and an error happens during a payment attempt, the outcome is unknown.

For example, the Payment Service has a ChargeCreate mutation which creates a payment using the buyer’s chosen payment method. If this mutation is called by the client, and the request returns an error or times out, then without idempotency the client can’t discover what state this new payment is in.

If the error occurred before the payment was completed, and the client doesn’t retry the request, the merchant goes unpaid. If the error occurred after the payment was completed, and the client retries the request, the retry isn’t associated with the first attempt and the buyer is double charged.

Possible Solutions

The Money Infrastructure team chose API-level idempotency to create a resilient system, but there are other approaches to dealing with this:

  • Fix manually: Ship maintenance tasks, created one by one, to repair the data. This doesn’t scale.
  • Automatic reconciliation: Write code to detect cases where the payment state is unknown and repair them. This would require ongoing work, since introducing new payment methods and providers would require new reconciliation code, and API clients would have to react to these corrections as well to keep their data up to date.

What is API Idempotency?

An idempotent API is one where a repeated request with the same parameters is executed only once, no matter how many times it’s retried. This strategy gives clients the flexibility to retry API requests that may have failed due to connection issues, without causing duplication or conflicts in the API provider’s system.

Creating an Idempotent API

There are some requirements when creating an idempotent API. Note that if a remote service provider’s API is not idempotent, it will be very hard to implement an idempotent API on top of it.

Name the Request: Use Idempotency Keys

One of the parameters to every mutation is an idempotency-key, which is used to uniquely identify the request. We use a randomly generated universally unique identifier (UUID), but it could be any unique identifier.

Here is an example of a mutation and input which shows that the idempotency key is part of the input. The idempotency key is a ‘first class citizen’ of the API; we’re not using an HTTP header handled by middleware. This allows us to require the presence of the idempotency key using the same GraphQL parameter validation as the rest of the API, and to return any errors in the usual way, rather than outside the GraphQL mechanism.
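As a sketch of the idea (hypothetical input shape and field names, not the actual schema), the key is validated alongside the other arguments, so a missing key surfaces through the normal error path:

```ruby
# Sketch: the idempotency key is a required, first-class input field,
# validated like any other argument rather than read from a header.
def validate_charge_create_input(input)
  errors = []
  errors << "idempotencyKey can't be blank" if input[:idempotency_key].to_s.empty?
  errors << "amount must be positive" unless input[:amount].to_f > 0
  errors
end
```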

Lock the API Call: Lock on Client + Idempotency Key to Prevent Duplicate Simultaneous Requests

One way a request can fail is a dropped network connection. If this happens after the API server has received the request and begun processing it, the client can retry while the first attempt is still in flight. To prevent duplicate simultaneous requests, a lock around the API call, based on the client and idempotency key, allows the server to reject the retry with an HTTP code of 409, meaning that the client may try again shortly.
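A minimal sketch of that guard, with an in-process set standing in for the shared lock a real multi-server deployment would need (hypothetical names):

```ruby
require "set"

# Reject a request if another request with the same client + idempotency
# key is already being processed; a real service would use a shared lock
# (e.g. in a database or cache) instead of this in-process set.
IN_FLIGHT = Set.new
IN_FLIGHT_MUTEX = Mutex.new

def with_request_lock(client_id, idempotency_key)
  lock_key = [client_id, idempotency_key]
  acquired = IN_FLIGHT_MUTEX.synchronize { IN_FLIGHT.add?(lock_key) }
  return { status: 409, body: "retry shortly" } unless acquired

  begin
    { status: 200, body: yield }
  ensure
    IN_FLIGHT_MUTEX.synchronize { IN_FLIGHT.delete(lock_key) }
  end
end
```

A retry arriving while the first attempt is still running gets a 409; once the first attempt finishes, the lock is released and retries proceed normally.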

Track Requests: Store the Incoming Requests, Uniquely Identified By Client + Idempotency Key

The Payment Service keeps track of these requests by storing information about them in the database, using a model called IncomingRequest. Each model instance is uniquely identified by the client and idempotency key.

The existence of a saved IncomingRequest instance determines whether a request is new or a retry. If the IncomingRequest model instance is loaded instead of created, we know the request is a retry, and we can also determine whether the previous attempt completed. If it did, the previous response can be returned immediately.
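A minimal sketch of this bookkeeping, with an in-memory hash standing in for the IncomingRequest table (hypothetical shape):

```ruby
# Sketch: track each request by (client, idempotency key); on a retry of a
# completed request, return the stored response instead of re-executing.
REQUESTS = {} # stands in for the IncomingRequest table

def handle_request(client_id, idempotency_key)
  record = REQUESTS[[client_id, idempotency_key]] ||= { completed: false, response: nil }
  return record[:response] if record[:completed] # retry of a finished request

  response = yield
  record[:completed] = true
  record[:response] = response
  response
end
```

Calling the handler twice with the same key executes the payment once and returns the same response both times.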

Track Progress: The IncomingRequest Record Provides a Place to Track Progress for That Request

The IncomingRequest model includes a column where the progress for a request is stored as steps complete. The Payment Service breaks the work for a given mutation into named steps, or recovery points. The code in each step must be structured in a specific way, otherwise any errors will leave a given request in an unknown state.

Using Steps Explained

Using steps is a strategy for structuring code in a way that isolates the types of side effects a given function has. This isolation allows the progress to be recorded in a stepwise fashion, so that if an error occurs, the current state is known. There are three different kinds of side effects we need to be concerned with in this design:

  • No side effects: This step makes no HTTP calls or database writes. This is typically a qualifier function, i.e., resolving whether this handler can process these records in this way.
  • Local side effects: This step only writes to the database, and is wrapped in a database transaction so that any errors cause a rollback.
  • Remote side effects: Calls to service providers, loggers, analytics.

Each step is implemented as a ‘run’ function in a handler class, possibly paired with a ‘recover’ version of that function. A step may not need a recover function. For example, if the run step confirms that the handler is the appropriate handler, then on a retry the qualification already succeeded in the original request, and the recover function has nothing to do.

How steps are used:

  1. For each step completed in the request, record the successful completion. As the request handler successfully executes each step, the IncomingRequest record is updated with the name of that completed step.
  2. If the request is a retry of an incomplete request, recover the previously completed steps, then continue. The handler recovers the completed steps and then runs the rest of the steps. Every step may have both a ‘run’ and a ‘recover’ function.

The flow through the steps of the initial `run` and subsequent `recover` for the failed step 3

This diagram shows the flow through the steps of the initial `run`, versus a subsequent `recover`, after the initial run failed on step 3.

Here is the handler class implementation for the Sofort payment method. Each recovery_point is configured with a run function, an optional recover function, and a transactional boolean. The recovery points are configured in the order that they’re executed.

Ruby makes it easy to write an internal Domain Specific Language (DSL), which results in mutation handler implementations that are straightforward and clear. Separating the steps by side effect does force a certain coding approach, which gives a uniformity to the code.
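As an illustration only (a toy far simpler than the real handler classes, with hypothetical names), such a recovery-point DSL might look like:

```ruby
# Toy recovery-point DSL: declare ordered steps with `run` and optional
# `recover` blocks; `execute` records progress after each step and, on a
# retry, recovers already-completed steps before running the rest.
class StepwiseHandler
  Step = Struct.new(:name, :run, :recover)

  def initialize
    @steps = []
  end

  def recovery_point(name, run:, recover: nil)
    @steps << Step.new(name, run, recover)
  end

  # `progress` stands in for the progress column on the IncomingRequest record.
  def execute(progress)
    recovering = !progress[:last_completed_step].nil?
    @steps.each do |step|
      if recovering
        step.recover&.call
        recovering = false if step.name == progress[:last_completed_step]
      else
        step.run.call
        progress[:last_completed_step] = step.name
      end
    end
  end
end
```

On a fresh request every `run` executes in order; on a retry, steps up to the recorded recovery point invoke their `recover` blocks and the remaining steps run normally.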

Drawbacks of API Idempotency

Storing the progress of a request requires extra database writes, which adds overhead to every API call. The stepwise structure of the request handlers forces a specific coding style, which may feel awkward for developers who are new to it: each handler implementation must be approached by considering which type of side effects each piece of code has, and structuring it accordingly. Our team quickly learned this new style with a combination of short teaching sessions and example code.

Modifying the implementation of a mutation handler may change, add, or remove recovery points. If that happens, the developer must take extra steps to ensure that the implementation can still recover from any already-stored recovery points once the modified handler is deployed. We have a test suite for every handler that exercises every step, as well as the different recovery situations the code must handle. This helps us ensure that any modification is correct and will recover from the different failures.

Remembering the Side Effects is Fundamental

When considering how to implement an idempotent API in your project, start by partitioning the code of a given API implementation into steps according to the kind of side effects each has. This lets you see how the parts interact and gives you an opportunity to determine how each part can be recovered. This is the fundamental part of implementing an idempotent API.

There are always going to be trade-offs when adding idempotency to an API, in performance as well as in ease of implementation and maintenance. We believe that using the recovery-point strategy for our mutation handlers has resulted in code that’s clear, well structured, and easy to maintain, which is worth the overhead of this approach.



Living on the Edge of Rails

At Shopify, we make keeping our dependencies up to date a priority. Outdated dependencies expose your project to security issues and contribute to technical debt. Upgrading a large dependency like Rails can be overwhelming. If you’re lost and don’t know where to start, you can read our post explaining our journey to upgrade Rails from 4.2 to 5.0, and watch the Upgrading Rails at Scale recording from RailsConf 2018.

Our Core monolith, used by millions of users, has been running the unreleased version of Rails 6 in production since February 2019. Going forward, our application will continuously run the latest revision of the framework. There are multiple advantages, for us and for the community, to living on the Edge of Rails, which we’ll cover in this blog post.

The Edge of Rails is the Rails master branch which includes everything up to the very newest commit. Living on the Edge of Rails means that anytime a change is introduced upstream it becomes available in our application. We no longer need to wait multiple months for a new release.
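With Bundler, pointing an application at the Edge of Rails is a one-line Gemfile change; the optional `ref:` pin (the SHA below is a placeholder) is what a weekly bump would update:

```ruby
# Gemfile: track the Rails master branch instead of a released gem.
gem "rails", github: "rails/rails"

# Or pin to a specific revision, which a bot can bump weekly:
# gem "rails", github: "rails/rails", ref: "abc1234"
```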

Targeting the HEAD of Rails

Another advantage of targeting the HEAD of Rails is that it cuts down the time it takes us to upgrade. Continuously integrating Rails with a small weekly bump, instead of a big one every year, reduces the size of the diff. We also realized that developers are more inclined to contribute to Rails, and implement ideas they have, if they can use the features they wrote right away.

Rafael Franca, a member of the Rails team at Shopify, a Rails core contributor, and a release manager, runs our continuous integration (CI) against the framework before any new release or before merging an important change upstream. By running our massive test suite, composed of more than 130,000 tests, we’re able to discover edge cases, find improvements where needed, and propose patches upstream to make Rails better for everyone.

Xavier Noria @fxn tweeted - shout-out to byroot (a Shopify employee), in the last weeks he has focused on adapting Shopify to use Zeitwerk (which they have in production) providing extraordinary feedback about performance relevant to applications at their scale, thank you to his work this gem is better today for everyone
Xavier Noria gives Shopify a shout-out for helping improve Zeitwerk

We're already seeing the positive impact this has on the Ruby and Rails community. One example is our close contribution to Zeitwerk, the new autoloader that ships with Rails 6.

Updating to the Latest Revision

Solid Track, a bot that upgrades Rails to the latest upstream revision on a weekly basis

Targeting the HEAD of Rails means that we now need to periodically bump it to the latest revision. To avoid manual steps, we created Solid Track, a bot that upgrades Rails to the latest upstream revision on a weekly basis. The bot opens a pull request on GitHub and pings us with a diff of the changes introduced in the framework.

Every Monday, we receive this ping, go over the new commits merged upstream, and check whether anything our CI didn’t catch could break once in production.

If CI is green, it’s usually good to ship. It’s possible that our test suite didn’t catch an issue, but we mitigate the risk thanks to the way we deploy our application. Each time we deploy, only a subset of our servers gets the new changes. We call those servers “canaries”. If no new exceptions happen on the canaries for ten minutes, our shipping pipeline proceeds and deploys the changes to all remaining servers.

Solid Track bot triggering git bisect

However, if CI is red, our bot automatically triggers a git bisect to determine which change broke the tests. This step saves time by instantly identifying the problematic commit. Then we determine whether the change is legitimate or introduced a regression upstream.

Should My Application Target the HEAD of Rails?

If targeting the HEAD of Rails is something you’d be interested in doing with your application, keep in mind that using an unreleased version of any dependency comes with a stability trade-off. We evaluated the risk for our application and were confident in our tooling, test suite, and deployment process before making this decision.

Here are the questions we asked ourselves before moving forward, and our answers:

1. How much will you and your team benefit from targeting HEAD?

We’ll get a lot out of this. Not only will we be able to get all bug fixes and new features quickly, we’ll also save time and won’t have to dedicate a whole team and months of work to upgrading our application.

2. Do you have enough monitoring in place in case something goes wrong?

We have a lot of monitoring: exception reporting in Slack, Datadog metrics configured with thresholds for when a metric is too high or low, and a 24-hour on-call rotation.

3. Do you have a way to deploy your application on a small subset of servers?

We use canary deploys to put changes in production on only a small subset of servers.

4. Finally, how confident are you with your test suite?

Our test suite is large and coverage is good. There’s always room for improvement, but we’re confident in it.

Upgrading your dependencies is part of maintaining a sane codebase. The more outdated they are, the more technical debt you accumulate. If you have outdated dependencies, consider taking some time to upgrade them.



Pagination with Relative Cursors

When requesting multiple pages of records from a server, the simple implementation is to have an incremental page number in the URL: starting at page one, each subsequent request has a page number one greater than the previous. The problem is that incremental page numbers scale poorly: the bigger the page number, the slower the query. The solution is relative cursor pagination, which remembers where you were and continues from that point instead.

The Problem

A common activity for third-party applications on Shopify is to sync the full catalogue of products. Some shops have more than 100,000 products, and these can’t all be loaded in a single request as it would time out. Instead, the application makes multiple requests to Shopify for successive pages of products, which look like this:


This would generate a SQL query like this:
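As a sketch (values inferred from the 2,400-row discard discussed next, not the actual code from the original post), the paged request and its backing query would look roughly like `GET /admin/products.json?page=25&limit=100`, generating:

```ruby
# Sketch of offset pagination: page 25 at 100 per page means the database
# scans 2500 rows and discards the first 2400.
def products_page_sql(shop_id:, page:, per_page: 100)
  offset = (page - 1) * per_page
  "SELECT * FROM products WHERE shop_id = #{shop_id.to_i} " \
    "ORDER BY id ASC LIMIT #{per_page} OFFSET #{offset}"
end
```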

This query scales poorly because the bigger the offset, the slower the query. In the above example, the query needs to go through 2,500 records and then discard the first 2,400. Using a test shop with 14 million products, we ran experiments loading pages of products at various offsets. Taking the average time over five runs at each offset, here are the results:


[Table: average query time (ms) at each offset]
Omitted from the table are tests with the 1,000,000th offset and above since they consistently timed out.

Not only do queries take a long time when a large offset is used, but there’s also a limited number of queries that can be run concurrently. If too many requests with large page numbers are made at the same time, they can pile up faster than they can be executed. This leads to unrelated, quick queries timing out while waiting to be run because all of the database connections are in use by these slow, large-offset queries.

It’s particularly problematic on large shops when third-party applications load all records for a particular model, be it products, collects, orders, or anything else. Such usage has ramifications beyond the shop it runs on: since multiple shops share the same database instances, a moderate volume of large-offset queries causes unrelated queries from shops on the same instance to slow down or time out altogether. For the long-term health of our platform, we couldn’t allow this situation to continue unchecked.

What is Relative Cursor Pagination?

Relative cursor pagination remembers where you were, so each request after the first continues from where the previous request left off. The downside is that you can no longer jump to a specific page. The easiest way to implement it is to remember the id of the last record on the last page you’ve seen and continue from that record, which requires the results to be sorted by id. With a last id of 67890, the query would look like this:
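A sketch of what that query builder might look like (the 67890 value comes from the text; this is not the actual code):

```ruby
# Sketch of relative cursor pagination by id: nothing is scanned and
# discarded, the index seeks straight to the cursor position.
def products_after_id_sql(shop_id:, last_id:, per_page: 100)
  "SELECT * FROM products " \
    "WHERE shop_id = #{shop_id.to_i} AND id > #{last_id.to_i} " \
    "ORDER BY id ASC LIMIT #{per_page}"
end
```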

A good index can handle this query and performs much better than using an offset; in this example, it’s the primary index on id. Using the same test shop, here’s how long it takes to get the same pages of records, this time using the last id:


[Table: time using offset (ms) vs. time using last id (ms), with percentage improvement, at each offset]
With an offset of 100,000, it’s over 400 times faster to use the last id! And it doesn’t matter how many pages you request: the last page takes around the same amount of time as the first.

Sorting and Skipping Records

Sorting by something other than id is possible by remembering the last value of the field being sorted on. For example, if you’re sorting by title, the last value is the title of the last record on the page. If the sort value isn’t unique, using it alone would potentially skip records. For example, assume you have the following products:

Sorting by Title

Requesting a page size of two sorted by title would return the products with ids 3 and 2. To request the next page, querying by title > “Pants” alone would skip product 4 and start at product 1.

Sorting by Title - Product Skipped

Whatever the use case of the client requesting these records, it’s likely to have problems if records are sometimes skipped. The solution is to add a secondary sort on a unique column, like id, and remember both the last value and the last id. In that case the query for the second page would look like this:

Querying in this way results in getting the expected products on the second page.

Sorting by Title - No Skipped Product
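A runnable Ruby sketch of this compound cursor over the four example products (titles assumed from the figures; an in-memory array stands in for the table):

```ruby
# Example products: sorting by title alone skips a record when titles tie;
# adding id as a tiebreaker does not. In SQL the compound cursor is:
#   WHERE title > ? OR (title = ? AND id > ?) ORDER BY title, id LIMIT ?
PRODUCTS = [
  { id: 1, title: "Shirt" },
  { id: 2, title: "Pants" },
  { id: 3, title: "Boots" },
  { id: 4, title: "Pants" },
]

SORTED = PRODUCTS.sort_by { |p| [p[:title], p[:id]] }

# Naive cursor: filter on the sort value alone (skips ties).
def next_page_naive(last_title, per_page: 2)
  SORTED.select { |p| p[:title] > last_title }.first(per_page)
end

# Compound cursor: last value plus last id (no records skipped).
def next_page_cursor(last_title, last_id, per_page: 2)
  SORTED.select do |p|
    p[:title] > last_title || (p[:title] == last_title && p[:id] > last_id)
  end.first(per_page)
end
```

Page one is products 3 and 2; the naive second page skips product 4, while the compound cursor returns products 4 and 1 as expected.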

To ensure the query stays performant as the number of records increases, you need a database index on title and id. If an appropriate index isn’t set up, it could be even slower than using a page number.

Using the same test shop as before, here’s how long it takes to get the same pages of records but this time using both last value and last id:


Table: time to request each page using an offset, the last id, and the last value (in ms), with percentage improvements over the offset and last id approaches.
Overall, it’s slower than using a last id alone, but still orders of magnitude faster than using an offset when the offset grows large.

Making it Easy for Clients to Use Relative Cursors

The field being sorted on might not be included in the response. For example, in the Shopify API, pages of products can be requested sorted by total inventory. We don’t expose total inventory directly on the product, but it can be derived by adding up the inventory_quantity of the nested variants, which are included in the response. Rather than requiring clients to do this calculation themselves, we make it easy for them by generating URLs that can be used to request the next and previous pages, and including them in a Link header in the response. If there are both a next page and a previous page, it looks like this:
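A sketch of building such a header (the URL shape and the page_info cursor parameter below are assumptions for illustration, not the exact format Shopify emits):

```ruby
# Hypothetical construction of a Link response header in the RFC 8288
# style, carrying opaque cursor URLs tagged with rel="previous" and
# rel="next". URL shape and parameter names are assumptions.
def link_header(previous_url:, next_url:)
  parts = []
  parts << %(<#{previous_url}>; rel="previous") if previous_url
  parts << %(<#{next_url}>; rel="next") if next_url
  parts.join(", ")
end

puts link_header(
  previous_url: "https://shop.example.com/admin/products.json?page_info=abc123",
  next_url: "https://shop.example.com/admin/products.json?page_info=def456"
)
```

Because the URLs are opaque to the client, the server is free to change how cursors are encoded without breaking anyone.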

Conversion in Shopify

The problem of large offsets causing slow queries was well known within Shopify, as was the solution of using relative cursors, and our internal endpoints were already making liberal use of them. Rolling relative cursors out to external clients, however, was a much bigger effort. We had just added API versioning to our REST API, which made it reasonable to make a change as large as removing page numbers and switching everything to relative cursors.

As the responsibility for the different endpoints was spread across many teams, there was no clear owner of pagination as a whole. Though the problem wasn’t directly related to my team, Merchandising, our ownership of the products and collects APIs meant we were acutely aware of it: they’re two of the largest APIs in terms of both the volume of requests and the number of records they deal with.

I wanted to fix the problem and no one else was tackling it, so I put together a proposal on how we could fix it across our platform and sent it to my lead and senior engineering leadership. They agreed with my solution and I got the green light to work on it. A couple more engineers joined me and together we put together the patterns all endpoints were to follow, along with the common code they would use, and a guide for how to migrate their endpoints. We made a list of all the endpoints that would need to be converted and pushed it out to the teams who owned them. Soon we had dozens of developers across the company working on it.

As third-party developers must opt in to relative cursors for now, adoption is currently quite low and we don’t have many performance measures to share. Early usage of relative pagination on the /admin/products.json endpoint shows it to be about 11 times faster on average than comparable requests using a page number. By July 2020, no endpoints will support page numbers on any API version, and all clients will need to use relative pagination. We’ll have to wait until then to see the full results of the change.

We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.

Lessons from Leading a Remote Engineering Team

For my entire engineering management career, I’ve managed remote teams. At Shopify, I manage Developer Acceleration, a department with both colocated and remote teams, whose members are spread across four Canadian offices and six countries.

You may think that managing remote teams is hard, and it is, but there are real benefits that you can achieve by being open to remote employees and building a remote team. Let’s talk about the benefits of a remote team, how to build your remote team, and how to set your people up to succeed.

The Benefits of Remote

It’s not a matter of right and wrong between colocated and remote. Either configuration can work, and both provide benefits.

Some advantages of a remote team are: 

  • expanding to a global hiring pool
  • supporting a more diverse workforce
  • improving your ability to retain top employees
  • adding a location-based team capability

Expanding to a Global Hiring Pool

Hiring well is difficult and time consuming. Recruiters and hiring managers talk about filling the top of the funnel, which means finding suitable candidates to apply or to approach about your role. For specialized roles, like a mobile tooling developer, it can be hard to fill the funnel. A challenge with colocated teams is that your hiring pool is limited to those people who live in the same city as your office and those willing to relocate. A larger pool gives you access to more talent. On my team we’ve hired people in Cyprus, Germany, and the UK, none of whom could relocate to one of our offices in Canada.

More Diverse Workforce

A willingness to hire anywhere also gives access to a more diverse talent pool. There are people who are unwilling or unable to relocate. There are also those who need to work from home. I’ve hired people with mobility issues, people with dependents, such as young children or older parents, and people with strong ties to their communities. They are highly skilled and are excellent additions to our team but wouldn’t have been options had we required them to work out of one of our offices.

Ability to Retain Top Employees

A company invests in each employee it hires, and you want to retain good employees. By leading a remote team, I have retained people who decided to relocate for personal reasons, often outside of their direct control. In one case, a spouse had a location-dependent job opportunity that the family had decided to follow. In another, the person needed to be closer to their family for health reasons. I’ve successfully relocated people to Canada, France, the Netherlands, Poland, and the USA. Relocating these high-performing employees is much less expensive than hiring and training replacements.

Location-Based Team Capability

A team may also have specific requirements, like 24/7 support, that make it advantageous to distribute people rather than centralize them. My release engineering team supports our build and deploy pipeline for Shopify developers around the world and benefits from having a 24/7 on call schedule without needing people to be on call in the middle of the night.


Building a Remote Team

An engineering manager’s job is to create an effective team. They do this by assembling the right people, defining the problem to solve, and focusing on execution. There’s a key piece in “effective team” that’s often overlooked. A collection of people isn’t a team. A team functions well together and is more than the sum of its parts. This is one of the reasons we don’t hire jerks. Building a team requires the establishment of relationships and trust, which relies on really good communication. Neither relationships nor trust can be mandated. To build the team you need to create an environment and opportunity for people to interact with one another on more than a superficial level.

Shopify has two key cultural values that support remote work:

  1. Default to open internally 
  2. Charge your trust battery

Default to Open Internally

Defaulting to open is about inclusion, both in decisions and in results. At Shopify we encourage sharing investment plans, roadmaps, project updates, and tasks. This means writing a lot down and making information discoverable, which helps transfer knowledge to remote workers. It also means being deliberate about when to use asynchronous versus synchronous communication for discussions and decisions.

Asynchronous Communication

Asynchronous communication is a best practice and should be your default method of interaction, as it decouples each person’s availability from their ability to participate in discussions and decisions. People need to be able to disconnect without missing out on key decisions. Asynchronous communication frees people by giving them time to focus on their work and on their personal life. My team has discussions via email or GitHub issues. Longer-form ideas and technical design documents are written and reviewed in Google Docs. Once we start building, day-to-day tasks are kept in GitHub issues and Project Boards. Project updates and related decisions are captured in our internal project management system. I’ve listed a number of tools that we use, but tools won’t solve this problem for you. Your team needs to choose communication conventions and then support those conventions with tools.

Synchronous Communication

When building teams, there is also a place for synchronous communication. My teams each run a weekly check-in meeting on Google Hangouts. The structure of these meetings varies but typically includes demos or highlights of what was accomplished in the last week, short planning for the next week, and a team discussion about topics relevant that week. When managing a team across multiple time zones, common advice is to share the pain by moving the meeting time around. In my experience, the result is confusion and people regularly missing the meeting. Just pick one time that is acceptable to the people on the team. Set attendance as a requirement of being on the team, agreed with new people before they join, even if the meeting time is outside of their regular hours.

My teams are generally built so that everyone will be working at the same time during some portion of the day. These core working hours are an opportunity to have synchronous conversations on Slack or ad hoc video meetings as needed.

Charge Your Trust Battery

Shopify is a relationship-driven company. The Trust Battery is a concept that models the health of your relationships with your co-workers. Positive interactions, like open conversations, listening to others, and following through on commitments, charge the battery. Negative interactions, like being insensitive, demanding, or doing poor work, discharge the battery. This concept brings focus to developing relationships and pushes everyone to revisit their relationships on a regular basis.

Trusted relationships don’t just happen, but they can be given a push. Be open about yourself and encourage people to share details about themselves that you’d typically get from “water cooler” conversation. To facilitate this sharing, I set up Geekbot to prompt everyone on my team in Slack each Monday to answer an optional short list of questions such as:

  • What did you do this weekend?
  • What’s something that you’ve read in the last week?
  • Any upcoming travel/vacation/conferences planned?

Participation is pretty high, and I’ve learned quite a lot about the people I work with through this short list of questions. Personal details humanize the people on the other end of your chat window and give you a better, multidimensional view of the people on your team.

Lastly, get people together in person. Use this sparingly as it can be a big request for people to travel. Pick the times when your team will get together. If you have a head office, that is typically a good anchor point. If not, consider selecting different places to share the travel burden. Support people who need it to make these in person sessions possible. For example, if a support person is required for a team member to travel, the cost of their trip should include the cost of the support person. Respect people’s time and schedule by being clear about the outcome of the onsite. Relationship building should be a component of the trip but not the only component. On our team we use our two yearly onsites for alignment and to leave people inspired, appreciated, and recognized. We also carve out time that teams can use to plan and code together in person.


Setting People Up for Success

Remote workers benefit from support from their managers and company. Work with them to set up a healthy work environment, give them regular attention and information, and champion them whenever you can.

Healthy Work Environment

You want your people to be effective and do their best work, so work with them to ensure that they have a healthy working environment. Reinforce the benefits of having a good desk, chair, monitor, mouse and keyboard, and a reliable internet connection. Speak with them about identifying a place that they can designate as their “office” and how to create a separation between work and personal time when their office is in their home. Some people are good at separating these parts of their life. Others need a ritual, like walking around the block, as the separator. Establish their regular working hours so that you are both in agreement about the hours that they are working and when they are not.

Connection Through Communication

People outside of an office need help to maintain their connection to the company and to you. They’ll miss out on hallway chatter and other in-person conversations, like those that happen at lunch. I have a weekly one-on-one with each of my employees to provide them with a steady stream of information and keep them informed. I try to bring relevant information to all of my one-on-ones by preparing a shared agenda in advance. I also ping people regularly on Slack with timely information about the people on their teams and updates about our shared work, and to keep in touch throughout the week. If you do have an office, discuss whether spending some time there each year makes sense for them and seems like a worthwhile investment for you both. One person requested to be in the office for three weeks a year. To me, saying yes to this request was an investment in them and their future with the company.

Champion Your People

Remote people can fall into the trap of being out of sight and out of mind. Be a champion for your people. Use their names and highlight good work to your manager, your peers, and your team. Give them credit, recommend them for relevant opportunities, and speak up on their behalf. Coach them on how to be more visible. Building relationships and working with others takes time and effort. Ultimately, their visibility depends on both them and you, and it’s important for their career progression and long-term retention.

Remote Is Worth the Effort

Building and managing a remote team takes effort to keep your team engaged, provide opportunity, and ensure that each person and the team is set up for success. You need to define your methods of communication, and deliberately stay in touch throughout the week. If you’re willing to put in the work, you can benefit from the hiring, composition, retention, and strategic benefits that a remote team provides.


Componentizing Shopify’s Tax Engine

By Chris Inch and Vignesh Sivasubramanian

Reading Time: 8 minutes

At Shopify, we value building for the long term. This can come in many forms but within Engineering, we want to build things in a way that is easy to understand, modify, and deploy so we are confident to build without introducing bugs or unnecessary complexity. The tax engine that existed in Shopify’s codebase started out simple, but over years of development and incremental additions, it became a challenging part of code to work within. This article details how our Engineering team tackled the problems associated with complex code, and how we built for the long term by moving our tax engine to a componentized architecture within Shopify’s codebase. Oh… and we did all this without anyone noticing.

Tax Calculations: The Wild West

Tax calculations on orders are complex by nature. Many factors go into calculating the final amount charged in taxes on an order like product type, customer location, shipping origin, physical and economic nexus of a business. This complexity created a complicated system within our product where ownership of tax logic was spread far and wide to components that knew too much about how tax calculations worked. It felt like the Wild West of tax code.

Lucky for us, we have a well-defined componentization architecture at Shopify and we leveraged this architecture to implement a new tax component. Essentially, we needed to retain the complexity, but eliminate the complications. Here’s how we did it.

Educate the Team

The first step to making things less complicated was creating a team that would spend time gaining knowledge of the codebase around tax. We needed to fully understand which parts of Shopify were influencing tax calculations and how they were being used. And it’s not just code! Taxes are tricky in general. To create a tax component, one must not only understand the code involved, but also the tax domain itself. We worked with an in-house tax Subject Matter Expert (SME) to ensure we continued to support the many complexities of calculating taxes. We also employed different strategies to bring the team’s tax knowledge up to snuff, including weekly trivia questions on taxes around the world. This allowed us to learn the domain and have a bit of fun while doing so.

Do you know the difference between zero-rated taxes and no taxes? No? Neither did we, but with persistence and a tenacity for learning, the team leveled up on all the intricacies of taxation faced by Shopify merchants. We realized that if we wanted to make taxes an independent component in our system, we needed to be able to discern what proper tax calculations look like.

Understand Existing Tax Logic

The team figured out where tax logic was used by other systems and how it was consumed. This initial step took the most effort, as we used a lot of regular expressions, scripts, and manual processes to find all of the areas that touched taxes. We found that the best way to gain expertise quickly was to work on any known bugs relating to taxes. Some refactoring was beneficial to tackle up front, before componentization, but some of the tax logic was so intertwined with other systems that it would be easier to refactor once the larger componentization change was in place.

Tax Engine Structure Before Componentization

After a full understanding of the tax logic was achieved, the team devised the best strategy to isolate the tax logic into its own component. A component is an efficient way to organize large sections of code that change together, breaking a large codebase into meaningful, distinct parts, each with its own defined dependencies. After this, all communication becomes explicit across the component’s architectural boundaries. For example, one of the most complicated aspects of Shopify’s code is order creation. During the creation of an order, the tax engine is invoked by three distinct parts of Shopify: Cart → Checkout → Order. Each change of context brings more complexity to the system because each area uses taxes in its own selfish way, without consistency. When Checkout changed how it used taxes, it might have unknowingly broken how Cart was using them.

Creating a Tax Component

Define the Interface

In order to componentize the tax logic, we first had to define a clear interface and entry point for all the tax calls being made in Shopify’s codebase. Everything that requires tax information passes a set of defined parameters and expects a specific response when requesting tax rates. The tax request outlines the data it requires in a clear and understandable format. Each of the complex attributes is simply a collection of simple types; this way, the tax logic need not worry about the caller’s implementation.

The tax response schema is also composed of simple types that don't make any assumptions about the calling component.
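A sketch of what such a boundary might look like (the field names, and the flat 13% placeholder rate standing in for the real tax engine, are illustrative assumptions, not Shopify’s actual schema or logic):

```ruby
# Illustrative tax component boundary: a plain-data request goes in, a
# plain-data response comes out, and callers never reach past the single
# entry point. All names and the flat rate are assumptions.
TaxesRequest = Struct.new(
  :shipping_address, # simple hash, e.g. { country: "CA", province: "ON" }
  :line_items,       # array of { product_type:, price_cents:, quantity: }
  keyword_init: true
)

TaxesResponse = Struct.new(
  :tax_lines,        # array of { title:, rate:, amount_cents: }
  keyword_init: true
)

module TaxComponent
  # Single entry point into the component. The flat 13% here is a
  # placeholder for the real rate-lookup logic.
  def self.calculate(request)
    amount = request.line_items.sum do |item|
      (item[:price_cents] * item[:quantity] * 0.13).round
    end
    TaxesResponse.new(tax_lines: [{ title: "HST", rate: 0.13, amount_cents: amount }])
  end
end

response = TaxComponent.calculate(
  TaxesRequest.new(
    shipping_address: { country: "CA", province: "ON" },
    line_items: [{ product_type: "shirt", price_cents: 2000, quantity: 2 }]
  )
)
puts response.tax_lines.first[:amount_cents]
# => 520
```

Because both sides of the boundary are plain data, Cart, Checkout, and Order can all call the component the same way without knowing anything about its internals.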

Componentized Tax Engine

The diagram above shows how each component interacts cleanly with the tax engine using well-defined requests and responses, TaxesRequestSchema and TaxesResponseSchema. With the new interface, the flow of execution through the tax engine is much more streamlined and easier to understand.

Executing the Plan

Once we had defined a clean interface for tax requests, it was time to wrangle all the instances of tax-aware code throughout the entire Shopify codebase. We did this by moving every source file touching tax logic under the tax component. If taxes were the Wild West, then we were the sheriff coming to town: we couldn’t leave any rogue tax code outside of our tax component. Additionally, we wanted to make our changes future-proof so that other developers at Shopify couldn’t accidentally add new code reaching past our component boundaries. We added GitHub bot triggers to notify our team of any commits pushed against source files under the tax component, which assured us that no additional dependencies were added to the system while it was undergoing change.

Updating our Tax Testing Suites

Every line of code that we moved into the component was tested and cleaned. Existing unit tests were re-checked, and new integration tests were written. We added end-to-end scenarios for each consumer of the tax component until we were satisfied that the usage of tax logic was tested sufficiently; this was the best way to capture failures that may have been introduced to the system as a whole. The unit tests provided confidence that the individual units of our code produced the same functionality, and our integration tests provided confidence that our new component did not alter the macro functionality of the system.

Slowly but surely, we completed work on the tax component. Finally, it was ready, and there was just one thing left to do: start using it.


Our code cleanup work was complete, and the only task left was releasing it. We had high confidence in the changes we introduced through componentization of this logic. Even so, we needed to ensure we did not change the behavior of the existing system for the hundreds of thousands of merchants who rely on tax calculations within Shopify while we released it. Up to this point, the code paths into the component were not yet being used in production. For our team, it was paramount that the overall calculation of taxes remained unaffected, so we took a systematic, methodical, and measurable approach to releasing.

The Experimental Code Path

The first step to our release was to ensure that our shiny new component was calculating taxes the same way that our existing tax engine was already calculating these same taxes.

We accomplished this by running an “experiment” code path on the new component. When taxes were requested within our code, we allowed our old, gnarly code to run, but we simultaneously kicked off the same calculations through the new tax component, so that we could compare the results. Once compared, the results from the new component were discarded. Literally, we calculated taxes twice and measured any discrepancies between the two calculations. These result comparisons helped expose some of the more nuanced and intricate portions of code that we needed to modify or test further. Through iterations and minor revisions, we solidified the component and ensured that we didn’t introduce any new problems in the process. This also gave us the opportunity to add additional tests and increase our confidence.
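The shape of that experiment, sketched with placeholder names (the real implementation is more involved), is roughly:

```ruby
# Sketch of a dark-launch comparison: the old path remains the source of
# truth, the new path runs alongside it, mismatches are recorded, and
# the candidate result is discarded. All names here are illustrative.
def taxes_with_experiment(order, old_engine:, new_engine:, mismatches:)
  control = old_engine.call(order)       # the result actually used
  begin
    candidate = new_engine.call(order)   # computed, compared, discarded
    mismatches << order if candidate != control
  rescue StandardError
    mismatches << order                  # a crash also counts as a mismatch
  end
  control
end

# Two placeholder engines that happen to agree.
old_engine = ->(order) { (order[:subtotal_cents] * 0.13).round }
new_engine = ->(order) { (order[:subtotal_cents] * 0.13).round }

mismatches = []
result = taxes_with_experiment({ subtotal_cents: 1000 },
                               old_engine: old_engine,
                               new_engine: new_engine,
                               mismatches: mismatches)
puts result          # => 130
puts mismatches.size # => 0
```

Wrapping the candidate in a rescue is what makes this safe to run in production: a bug in the new path can never affect the result merchants see.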

Once there were no discrepancies between old and new, it was time to release the component and start using the new architecture. In order to perform this Indiana Jones-style swap, we rolled out the component to a small number of Shopify shops first, then tested, observed, and monitored. Once we were sure that things were behaving properly, we slowly scaled up the number of shops whose checkouts used the new tax component. Eventually, over the course of a few days, 100% of shops on Shopify were using the new tax component. The tax component is now the only path through the code that is being used to calculate taxes.

Benefits and Impact

Through the efforts of our tax Engineering team, we have added sustainability and extensibility to our tax engine. We did this with no downtime and no merchant impact.

Many junior developers are concerned only with building the required, correct behavior to complete their task. A software engineer needs to ensure that solutions not only deliver the correct behavior, but do so in a way that is easy to understand, modify, and deploy for years to come. Through these componentization efforts, the team organized the codebase in a way that is easy for all future developers to work within.

We constantly receive praise from other developers at Shopify, thanking us for the clean entry point into the Tax Component. Componentization like this reduces the cognitive load and abstract knowledge of the internals of tax calculations in our system.

Interested in learning more about Componentization? Check out cbra.info. It helped us define better interfaces, flow of data and software boundaries.


Implementing Android POS Receipt Printing on Shopify

Receipts are an essential requirement of every brick-and-mortar business. They’re the proof of purchase that allows buyers to get refunds or make returns and exchanges. In the last year alone, we estimate that millions of receipts were printed by merchants running Shopify Point of Sale (POS) for iOS. The feature was only available on iOS because Shopify POS was released first on that platform and is a few development cycles ahead of its Android counterpart. Merchants that strictly needed receipt printing had no choice but to switch to the iPad, but as of March 2019, merchants using an Android device have the option to provide printed receipts too.

The receipt generation process is unique because it’s affected by most features in Shopify POS (like discounts, tips, transactions, gift cards, and refunds) and leads to over 8 billion unique receipt content combinations! These combinations also keep growing as we expand to more countries and support newer payment methods. This article presents our approach to implementing receipt printing support, starting from our goals to an overview of all the challenges involved.

Receipt Printing Support Goals

These were the main goals the Payments & Hardware team had in mind for receipt printing support:

  1. Create a Pragmatic API: printing a receipt for an order should be as simple as a method call.
  2. Be adaptive: supporting printers from different vendors, models and paper sizes should be easily achievable.
  3. Be composable: a receipt is made out of sections, like header, footer, line items, transactions, etc. Adding new sections, in the future, should be a straightforward task.
  4. Be easy to maintain: adding or changing the content of a paper receipt should be as easy as UI development that every Android developer is familiar with.
  5. Be highly testable: almost every feature in the POS app affects the content of a receipt and the combination of content is endless. We should be very confident that the content generation logic is robust enough to cover a multitude of edge cases.

The Printing Pipeline

In order to achieve our goals, we first defined the Printing Pipeline by dividing the receipt printing process into multiple self-contained steps that are executed one after another:

The Printing Pipeline

During the Data Aggregation step, all the raw data models required to generate a receipt are gathered together. This includes information about the shop, the location where the sale is being made from, a list of payment transactions, gift cards used for payments (if applicable), etc.

In the Content Generation step, we extract all the meaningful data from the raw models to compose the receipt in a buyer-friendly way. Things that matter to the buyer, like the receipt language, date formats and currency formats are taken into account.

Now that we extracted all the meaningful data from the models, we move to the Sections Definition step. At this point, it’s time to split the receipt into smaller logical pieces that we call “receipt sections”.

Receipt Sections

After the sections are defined, the receipt is ready to be printed, so we move to the Print Request Creation step. This involves creating a print request out of the buyer-friendly receipt data and sections definition. A print request also includes other printer commands like paper cuts. Depending on the receipt being printed, there might be some paper cuts in it. For example, a gift card purchase requires paper cuts so the buyer can easily detach the printed gift card from the rest of the receipt.

Now that a print request is ready to be submitted to the printer, the Content Rendering step kicks in. It’s time to render images for each section of the receipt according to the paper size and printer resolution.

The Printing Pipeline is finalized by the Receipt Printing step. At this point, the receipt images are delivered to the printer vendor SDK and the merchant finally gets a paper receipt out of their printer.

Printing Pipeline Implementation 

Data Aggregation

The very first step is to collect all the raw models required to generate a receipt. We define an interface that asynchronously fetches all these models from either the local SQLite database or our GraphQL API.

Content Generation

After all the data models are collected by the Data Aggregation step, they go through the PrintableReceiptComposer class to be processed and transformed into a PrintableReceipt object, which is a dumb data class with pre-formatted receipt content that will be consumed down the pipeline.

In this context, the coroutine-based API for the Data Aggregation step presented earlier not only improves performance by running all requests in parallel, but also improves code readability.

The PrintableReceiptComposer class is where most of the business logic lives. The content of a receipt can drastically change depending on a lot of factors, like item purchased, payment type, credit card payment, payment gateway, custom tax rules, specific card brand certification requirements, exchanges, refunds, discounts, and tips. In order to make sure we are complying with all requirements and the proper display of all features on receipts, we took a heavily test-driven approach. By using test-driven development, we could write the requirements first in the form of unit tests and achieve confidence that data transformation covers not only all the features involved but also several edge cases.

Sections Definition

Now that we have all the data put together in its own receipt model, exactly as it will appear on paper, it’s time to define what sections the receipt is made of.

Sections are just regular Android views that are rendered in the Content Rendering step. In the Sections Definition step, we specify a list of ViewBinder-like classes, one per section, that are used during the rendering step. Section binders are implementations of a functional interface with a fun bind(view: View, receipt: PrintableReceipt) method. All these binders do is bind the PrintableReceipt data model to a given view, with little to no business logic, in an almost one-to-one view-to-content mapping.

Print Request Creation

A PrintRequest is a printer-agnostic class composed of a sequence of receipt printer primitives (like lazily-rendered images and cut-paper commands) to be executed by low-level printer integration code. It also contains the size of the paper to print on, which can be 2” or 3” wide. During this step, a PrintRequest will be created containing a list of section images and sent to our POS Hardware SDK, which integrates with every printer supported by the app.
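A minimal sketch of such a request type, under stated assumptions: only the name PrintRequest comes from the post, while PrinterPrimitive, PaperWidth, and their members are invented for illustration.

```kotlin
// Illustrative printer primitives: a lazily-rendered section image and a
// cut-paper command, both executed later by low-level integration code.
sealed class PrinterPrimitive {
    // Bytes are produced on demand, so nothing is rendered until printing.
    data class Image(val render: () -> ByteArray) : PrinterPrimitive()
    object CutPaper : PrinterPrimitive()
}

// The two supported paper sizes mentioned in the post.
enum class PaperWidth { TWO_INCH, THREE_INCH }

// Printer-agnostic request handed to the POS Hardware SDK.
data class PrintRequest(
    val paperWidth: PaperWidth,
    val primitives: List<PrinterPrimitive>,
)
```

Keeping the request a dumb sequence of primitives is what lets the same object be executed by any vendor's integration code.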

Content Rendering

During this step, we will render each section image defined in the PrintRequest. First, the rendering process will inflate a view for the corresponding section and use the section binder to bind the PrintableReceipt object to the inflated view. Then, this bound section view will be drawn to an in-memory Bitmap at a desired scale according to the printer resolution for that paper size.
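The "desired scale" step boils down to simple arithmetic. A sketch, using typical thermal-printer values (around 200 dpi) that the post does not confirm:

```kotlin
// Target bitmap width in pixels for a section, from the paper width and the
// printer's resolution. Example: 3.0 in × 203 dpi = 609 px.
fun sectionBitmapWidthPx(paperWidthInches: Double, printerDpi: Int): Int =
    (paperWidthInches * printerDpi).toInt()

// Scale factor to apply when drawing a view (laid out at viewWidthPx)
// onto the target bitmap.
fun renderScale(targetWidthPx: Int, viewWidthPx: Int): Float =
    targetWidthPx.toFloat() / viewWidthPx
```

Deriving the bitmap size from the printer's resolution, rather than hard-coding it, is what lets the same sections render correctly on 2” and 3” paper.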

Receipt Printing

The last step happens in the Hardware SDK where the section Bitmap objects generated in the previous step will be passed down to the printer-specific SDK. At this point, a receipt will come out of the printer.

Hardware SDK Pipeline
The Hardware SDK Pipeline

The POS app converts an Order object into a PrintRequest by executing all the aforementioned pipeline steps and then sends the PrintRequest to the ReceiptPrinterProcessManager in the POS Hardware SDK. From there, the PrintRequest is forwarded to a vendor-specific ReceiptPrinter implementation. Since a printer can have multiple connectivity interfaces (like Wi-Fi, Bluetooth, or USB), the currently active DeviceConnection then passes the PrintRequest down to the Printer Vendor SDK at the very last step.

The Hardware SDK is a collection of vendor-agnostic interfaces and their respective implementations that integrate with each vendor SDK. This abstraction enables us to easily add or remove support for different printers and other peripherals of different vendors in isolation, without having to change the receipt generation code.
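As a hedged sketch of what such a vendor-agnostic layer can look like: only ReceiptPrinter, DeviceConnection, and PrintRequest are names from the post; the member signatures and the fake vendor implementation are illustrative guesses.

```kotlin
// Minimal stand-in for the printer-agnostic request described earlier.
data class PrintRequest(val sectionImages: List<ByteArray>)

// One active connectivity interface (Wi-Fi, Bluetooth, USB, ...).
interface DeviceConnection {
    fun send(bytes: ByteArray)
}

// Implemented once per printer vendor, on top of that vendor's SDK.
interface ReceiptPrinter {
    fun print(request: PrintRequest, connection: DeviceConnection)
}

// A fake "vendor SDK" integration: forwards each section image as-is.
class FakeVendorPrinter : ReceiptPrinter {
    override fun print(request: PrintRequest, connection: DeviceConnection) {
        request.sectionImages.forEach(connection::send)
    }
}
```

Because the receipt generation code only ever sees the interfaces, a new vendor is supported by adding one implementation, and dropped by deleting it.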


Since receipt printing is affected by over 30 features, we wanted to make sure we had multi-step test coverage to enforce correctness, especially when more advanced features, such as tax overrides, come into play. In order to achieve that, we heavily relied on unit tests and test-driven development for the Data Aggregation and Content Generation steps. The latter, which is the most critical one of all, has over 80 test cases stressing a multitude of extraordinary receipt data arrangements, like combinations of different payment types on custom gateways, or transactions in different countries with different currencies and credit card certification rules. Whenever a bug was found, a new test case was introduced along with the fix.

The correctness of the Sections Definition and Content Rendering steps is enforced by screenshot tests. Our continuous integration (CI) infrastructure generates screenshots out of receipt bitmaps and compares them pixel by pixel with pre-recorded baseline ones to ensure receipts look as expected. The Sections Definition benefits from these tests by making sure that each section is properly rendered in isolation and that all of them are composed together in the right order. The Content Rendering step, on the other hand, benefits from having canvas transformations asserted, so that the receipt generation engine can easily adjust to any printer/paper resolution.

Screenshot Test Sample
Baseline screenshot diff on GitHub after changes made to the line items receipt section
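The comparison at the heart of these screenshot tests is conceptually simple. A simplified sketch, with bitmaps modeled as plain pixel arrays rather than Android Bitmap objects and without the diff-image output a real harness would produce:

```kotlin
// Pixel-by-pixel comparison of a candidate screenshot against a baseline.
// Any size mismatch or differing pixel fails the check.
fun bitmapsMatch(baseline: IntArray, candidate: IntArray): Boolean =
    baseline.size == candidate.size &&
        baseline.indices.all { baseline[it] == candidate[it] }
```

A real implementation would typically also emit a highlighted diff image for review, as in the GitHub comparison shown above.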

Having a componentized and reusable printing stack gives us the agility we need to focus on extending support for new printer models in the future, no matter what printing resolutions or paper sizes they operate with, and it can be done in just a couple of hours. Taking a test-driven approach not only ensures that multiple edge cases are properly handled, but also enforces a design-by-contract methodology in which the boundaries between steps in the pipeline are well-defined and easy to maintain.

If you like working on problems like these and want to be part of a retail transformation, Shopify is hiring and we’d love to hear from you. Please take a look at the open positions on the Shopify Engineering career page.

Continue reading

Mobile Release Engineering at Scale with Shipit Mobile

Mobile Release Engineering at Scale with Shipit Mobile

One of the most important phases of software development is releasing software to its final users. At Shopify, we've invested heavily in tooling for continuous deployment of our web apps. When a developer working on a web project wants to deploy their changes to production, the process is as simple as:

Merge → Build container → Run CI → Ship to production

In contrast, uploading a new version of one of our mobile apps to Google Play or the App Store involved several manual steps and a lot of human interaction that caused various problems. We wanted to provide the same level of convenience when releasing mobile apps as we do for web apps, and also take the opportunity to define a framework for all the mobile teams to adopt. For this reason, we developed a new tool: Shipit Mobile, a platform to create, view, and manage app releases.

The Issues With Mobile Releases

Automatic deploys and continuous delivery aren’t possible in mobile for several reasons including approval wait time; coordination between developers, designers, and product managers; and because our users need to update the app. If they don’t have automatic updates enabled, finding several updates of the same app multiple times a week, or even a day, is annoying.

Moreover, a new release isn’t just deploying the latest version of the code from our repository to our infrastructure. Third-party services (the app stores) are involved, and software approval and distribution is owned by them. This means that we can’t update our apps several times a day even if we wanted to.

Uploading a new version of our mobile apps to Google Play or the App Store was fraught with problems. Releasing new apps was error prone due to the high number of manual steps involved, like selecting the commit to release from or executing the script to upload the binary to the store. Different teams had different processes to release mobile apps. The release process wasn’t transferable—knowledge couldn’t be shared within the same organization and the process was inconvenient and complex. Each team had variants of their release scripts, and those scripts were complex and untested. Release version and build numbers had to be managed manually. Finally, there was a lot of responsibility and burden on the release manager. They had to make decisions, fix bugs found along the way, communicate with stakeholders, and coordinate other side tasks.

Our Solution: Shipit Mobile

Mobile Release Flow
Mobile Release flow

Releasing a new mobile app requires performing multiple steps:

  1. Pick a commit to release from
  2. Increment build and version numbers
  3. Run CI
  4. Manually test the app (QA)
  5. Iterate on testing until all the bugs and regressions are fixed
  6. Update screenshots and release notes
  7. Upload it to the store
  8. Wait for app submission to be approved
Our existing tools for releasing web apps weren't suitable for the mobile release process, so we decided to build something new, Shipit Mobile.

Shipit Mobile Home
Shipit Mobile Home

Creating a New Release in Shipit Mobile

Create New Release in Shipit Mobile
Creating a new release in Shipit Mobile

A release starts with a new branch in the repository. We follow a trunk-based development approach in which the release branch is branched off of master. We release from a dedicated branch rather than directly from master because it reduces the risk of mistakenly including code that doesn’t belong to the current release, which could happen if both new features and bug fixes were pushed to the branch where the release is taking place. Release branches are never merged back to the master branch; bug fixes are pushed to master and cherry-picked into the release branch.

Branching Model
The branching model

This branching model allows us to automatically manage the release branches and avoid merge conflicts when a release is completed.

Testing and Building the Release Candidate

Release Page in Shipit Mobile
Release page in Shipit Mobile

When a new release is created, a candidate is built. A candidate in Shipit Mobile corresponds to a releasable unit of work. Every new commit in the release branch creates a new candidate in Shipit Mobile, and for every candidate, the build number is incremented. A new candidate triggers two continuous integration (CI) pipelines: a test pipeline that ensures the app works as expected, and a pipeline that builds the app for release. We decoupled these two pipelines because, in an emergency, we want to allow developers to release the app even if the test pipeline hasn’t finished or has failed.

Distributing the App to Different Channels

Distributing the App to Different Channels
Distributing the App to Different Channels

Once the app has been built, tested, and CI is passing, it’s time to upload it to the store. In Shipit Mobile we have the concept of distribution channels. A distribution channel is a platform from which the app can be downloaded. The same release binary can be uploaded to different channels. This is useful when we want to publish the app to Google Play and also send the same version internally to our support team: when a support request comes in, they can quickly open the exact build our users downloaded from the store.

The two main channels we support are Google Play and the App Store. For Google Play, we use Google Play’s official API. For the App Store, Fastlane makes it easy to upload iOS apps to App Store Connect; we use it alongside Apple’s App Store Connect API to upload the apps and metadata and to check whether the apps have been approved and are ready to release.

Configuring Shipit Mobile

Every project that releases using Shipit Mobile needs a configuration file. This file tells Shipit Mobile some basic information about the project, such as the platform, the channels to which we want to upload the app, and optionally the Slack channels that need to be notified with the state of the release. We went with a simple configuration favouring convention over configuration. This can be seen in the way we manage the metadata (release notes, screenshots, app description…): if the configuration file tells Shipit Mobile to upload metadata to the app store, it knows where to find it and how to upload it, as long as it is located in the expected folder.
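The post doesn't show the configuration file's format, so the following is only a hypothetical sketch of the kind of information it carries; every field name here is illustrative, not Shipit Mobile's actual schema.

```yaml
# Hypothetical sketch — field names are invented for illustration.
platform: android
channels:
  - google_play
  - internal
metadata: true            # find release notes/screenshots in the expected folder
slack_channels:
  - "#team-android-releases"
```

The convention-over-configuration approach means a project only states *what* it wants (channels, notifications); Shipit Mobile already knows *how* and *where* from the expected folder layout.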

The Release Captain

A mobile release is a discrete process. Every release contains new features and bugfixes, and we need a person responsible for making decisions along the way. This person is the release captain.

Our previous release process was complex, and the role of the release captain wasn’t easy to transfer to others. Moreover, the release captain had to communicate the state of the release to all the people involved in it.

With Shipit Mobile we wanted to make the release captain role transferable. We made this a reality by building the platform to guide the release captain through the release process and notify the right people when the status of the release changes. It’s now easier for others to see who’s in charge of the release.

Useful Notifications

Providing useful notifications has been a request from our users since we started working on Shipit Mobile. Before Shipit Mobile, only the person in charge of the release knew its exact state and was responsible for communicating it to others. With Shipit Mobile, we offload this responsibility from the release captain by sending notifications to Slack channels, so every developer on the team knows the release status.

Shipit Mobile Notification
Shipit Mobile Notifications

Emergency Release and Rollout

At Shopify we have trust in our developers and their decisions. Although we strongly recommend a passing CI for both creating a release and uploading the app to the store, we have some mechanisms in place to bypass this restriction. A developer can decide to start a release or upload the app to the store without waiting for CI. This is useful if one of the services we rely on is misbehaving and build statuses aren’t being received. At the end of the day, we build tools to make our developers’ lives easier, not to get in their way.

At Shopify, we work on our infrastructure to build better products quickly. Last year, the Mobile Tooling team spent time working on a scalable CI system for Android and iOS, pioneering the use of new technologies like Anka.

Also, we worked on providing a reliable CI system resilient to test and infrastructure flakiness and built tools on top of this system to improve how our mobile developers test apps.

Defining standards through tools across mobile teams and bringing good practices and conventions makes it easier for developers to share knowledge and jump between projects. This is one goal of the Mobile Tooling team.

Shipit Mobile has been used in production for over six months now. During this time we have changed and evolved the platform to accommodate our users’ needs, add new release channels, and improve the user interface. It’s shown to be a useful and stable product our developers can trust to release their apps. It’s reduced the complexity incurred during a mobile app release, enabling us to speed up our release cadence from three weeks to one week, and we’ve seen more people take on the role of Release Captain for the first time.

Interested in Mobile Tooling? Shopify's Production Engineering team is hiring and we’d love to hear from you. Please take a look at the open positions on the Engineering career page.

Continue reading

Pair Programming Explained

Pair Programming Explained

At Shopify, we use pair programming to share knowledge with team members, create high quality solutions, and have fun doing it all! By pairing, we come up with new and innovative ways of solving problems and we can refine ideas into solutions. Involving pairing in onboarding helped us bring new team members up to speed and integrate them into existing teams.

With this article, we hope to help other teams share in the benefits we’ve seen from pair programming and accelerate the spread of pair programming knowledge in the industry.

What is Pair Programming?

Pair Programming is two people writing, debugging, or exploring code together. In essence, it’s simple, but getting the most out of your pairing session can take some work. Pair programming is a complex practice requiring skill to master. There are many resources on pair programming, but sometimes engineers are dropped into a pairing session with little preparation and expected to just figure it out.

Pair Programming Benefits

It’s important to understand why we pair and what we’re hoping to accomplish. Aside from making work more enjoyable (although it may take some practice), there are tangible benefits around the work itself. Pairing can:

  • remove knowledge silos to increase team resiliency
  • build collective code ownership
  • improve learning
  • increase efficiency
  • improve software design
  • improve software quality
  • reduce the incidence of bugs
  • increase satisfaction
  • help with team building
  • accelerate on-boarding
  • serve as a short feedback loop, similar to a code review happening live
  • increase safety and trust of the developers pairing
  • increase developer confidence

Terminology


Problem: What is being solved or explored by the pair.

Pair or 🍐:

  • (noun) Either two people pair programming or the pairing partner. E.g., “The pair can try alternating roles”, “Try spending some time asking your pair about their day before you start working on the problem“.
  • (verb) The act of pairing, E.g., “To pair or not to pair, that is the question.”

Pairer: A member of a pair.

Session: A stretch of time where two people are pairing. It ends when the pair splits up for longer than a short break.

Driver: The pairer at the keyboard typing. Despite what NCIS taught us, it’s not useful or practical to have more than one person typing at the same time.

Navigator (sometimes called the observer): The pairer that isn’t driving. Generally, the navigator figures out the strategy, and guides the driver in implementing that strategy. More details on this relationship later.

Development Environment/Station: The environment where the pairing is happening. At a minimum, it includes a desk, computer(s), keyboards, and an IDE. The environment also includes anything that might make the pairing session better or worse outside of the pair itself.

Solution: The approach currently being taken to either resolve the problem or explore it.

Expert: The member of the pair that is relatively more experienced or knowledgeable, especially within the particular problem domain. Not directly related to seniority.

Novice: The member of the pair that is relatively less experienced or knowledgeable especially within the particular problem domain. Not directly related to seniority.

Disengagement: When one member of the pair isn’t focusing on the work or engaging in the pairing. This wastes both pairers’ time and is the biggest pitfall of pairing, so it needs to be avoided and addressed.

Watch the master: Where the expert puts on a performance while the novice watches and doesn’t take part. This scenario lowers the novice’s satisfaction with the pairing session, and the expert misses out on the novice’s key contributions. Disengagement and a breakdown of the pairers’ relationship can follow. Sometimes watch the master is a valid mentoring approach, but it is not useful for pairing.

When Should You Pair Program?

Whenever! Pairing doesn’t need to be structured or planned. Ad hoc pairing works very effectively, especially if the pairers have already done a few sessions together. This isn’t to say that pairing should be done constantly or in every circumstance, it’s an individual decision. While some teams or pairs find full day pairing effective, others find some balance of pairing and solo work is better for them.

It’s important to understand that pairing is quite tiring. During a pairing session, the pair is always focused, and keeping up that intensity is exhausting. It’s good to take breaks, at least every two hours or so. Additionally, just because one pairer is up for continuing doesn’t mean the other has to be.

Pairer Combinations

This is a somewhat shallow and brief breakdown of the different pairer combinations, but the Dreyfus Squared model provides a more in-depth view.


Expert-Expert

Both parties are knowledgeable within the problem space.

Benefits:
  • Both members of the pair share knowledge and ownership of the solution increasing team resiliency
  • Both members of the pair learn new things if they have different approaches
  • Improved software design and quality because the review feedback loop is short

Downsides:
  • Can resolve the problem faster, but at the cost of more total man-hours if the pairers don’t challenge one another
  • Can lead to disengagement if both experts take the same approach and lack diversity of thought, or if the problem is simple

Expert-expert pairing works best when the two experts know different parts of the problem space; in that case, it is very effective.


Expert-Novice

The most traditional type of pairing, and the most beneficial in the greatest number of scenarios.

Benefits:

  • The novice brings insight and questions the status quo
  • The expert provides mentorship and can guide the flow

Downsides:
  • Can lead to a watch the master scenario


Novice-Novice

This type of pairing often happens when exploring a new area or working on something unusual, and involves a lot of learning and discovery. A novice-novice pair has many benefits, but also the most potential downsides.

Benefits:

  • Can be extremely satisfying
  • Lots of learning involved
  • Often leads to interesting and unique solutions

Downsides:
  • Can introduce and reinforce bad practices
  • Can lead to the pair getting stuck and wasting time
  • Can be very frustrating for the pair when they get stuck

Roles


The two roles, driver and navigator, serve different purposes. It’s often not clear who does what, and keeping the guidelines here in mind can be helpful.


The Driver

The driver is the clearest role, because it’s the person at the keyboard. The driver is responsible for the implementation. They keep their focus on what is happening right now and interpret what the navigator is saying. It’s best if the driver doesn’t focus too much on the broader design, instead making sure the work they are doing is high quality and error-free.

For example, while creating a new class as directed by the navigator, the driver would focus on:

  • extracting variables
  • extracting private methods
  • variable, method, and class names
  • other local refactoring and improvements.
  • method level logic
  • private methods
  • coding style
  • running the tests


The Navigator

The navigator focuses on the broader scope of the problem. Generally, the navigator sets the direction the code should go. For example, they might say something like “Maybe we can pull these shared methods into another class and pass it into the constructors”. The driver then takes this guidance and implements it, ensuring the code is clean and error-free.

The navigator is responsible for keeping the pairing on track. They ensure that rabbit holes are stepped out of in a reasonable time and that new approaches are tried as old ones fail to pan out. The navigator also keeps an eye out for any errors, typos, or refactoring opportunities that the driver misses; though this is not their primary role, it’s still important. The best way to think of the navigator: they keep the plan, and keep to the plan. Some examples of how this plays out:

  • keeping a list of tasks to be worked on
  • prioritizing that list
  • adding tangential work that is discovered to the task list
  • ensuring the pair doesn’t deviate too much from the task at hand
  • thinking strategically about the design
  • identifying code smells or design problems
  • adjusting the plan/list to address the new code smells
  • reviewing code as it’s written
  • reminding the driver to test first
  • choosing what test to write
  • helping with naming


It’s important to note that the driver isn’t just implementing what the navigator says; they are part of a dialogue. The idea is that the driver brings the “low level” perspective from the code in front of them and what will actually work, while the navigator brings a more “bird’s eye view” and tries to imagine how what is being written now will fit in with the greater design. By switching roles with some frequency, the pair ensures they have a good handle on both perspectives.

Switching Roles

Switching roles while pairing is essential to the process—it’s also one of the trickiest things to do correctly. The navigator and driver have very different frames of reference. Switching roles is a massive context change when the roles are followed effectively. As such, there needs to be some care involved.

The Wrong Way

Pairing is about working together. Anything that impedes one of the pairers from contributing or breaks their flow is bad. Two of the more obvious wrong ways are to “grab the keyboard” or “push the keyboard”.

Grabbing the keyboard: Sometimes when working as the navigator it's oh so tempting to just take the keyboard control away to quickly do something. This puts the current driver in a bad position. Not only are they now not contributing, but such a forceful role change is likely to lead to conflict.

Pushing the keyboard: Other times, the driver feels a strong need to direct the strategy. It’s very tempting to just “push” the keyboard to the navigator, forcing them to take the driver’s seat, and start telling them what to do. This sudden context switch can be jarring and confusing to the unsuspecting navigator. It can lead to resentment and conflict as the navigator feels invalidated or ignored.

Finally, even a consensual role switch can be jarring and confusing if done too quickly and without structure.

The Right Way

The first step to switching roles is always to ask. The navigator needs to ask if they can grab the keyboard before doing so. The driver needs to ask if the navigator is willing to drive before starting to direct them. Sometimes, switching without asking works out but these situations are the exception.

It’s important to take some time when switching as well. Both pairers need time to acclimatize to their new roles. This time can be reduced somewhat by having a structure around switching (e.g., ping-pong pairing), which allows the pairers to be mentally prepared for the switch to happen.

Addressing Pitfalls in Pairing

The greatest pitfalls to pairing are:
  • disengagement
  • watch the master
  • communication breakdown
  • conflict

The first two can usually be resolved by switching the driver and navigator more often. For some ways to facilitate this, take a look at the Techniques section, specifically Ping-pong pairing and Pomodoro pairing.

In a communication breakdown scenario, where the pairers aren’t able to understand one another or feel unheard, taking a break from typing and having a discussion can help. The Digging for Gold technique (more details in the Techniques section) often works very well to get the pair through a tough spot.

Conflict between the two pairers happens sometimes. Addressing this is tricky and this post won’t get into the details, but there are many conflict resolution techniques that can be helpful. More information on resolving conflict is in the other resources section.

Pairing Misconceptions

Here’s a brief overview of some pairing misconceptions.

Don’t Pair on Rote Code, Only Complex

One of the things that pair programming is great at is identifying code smells and opportunities for abstraction. Rote code is often a good source of both of these opportunities so pairing can be very effective here.

Pairing is Less Productive

As Martin Fowler points out in his blog post, pairing is meant to help with the hardest parts of programming. The entire purpose is to be more productive.

But I Don’t Like Pairing

Maybe that’s true. Possibly, it’s a matter of lack of familiarity. If you’ve tried pairing and found that it didn’t work, try a different partner; this makes a huge difference. If you feel lost or it’s still not working, try some of the techniques near the end of this article.

The Pairing Station

Before starting to pair, there should be a good space available that can facilitate and improve the pairing session.

The ideal pairing station criteria:

  • supports ease of screen reading for both pairers
  • makes it easy to switch the driver and navigator
  • makes it easy for both of the pairers to take the driver’s seat
  • creates a comfortable environment for either pairer
  • allows for direct communication
  • allows the pairers to communicate indirectly (e.g. visual cues)
  • includes a whiteboard for discussion
  • allows for in-person interaction, although remote pairing is possible if the environment can meet the above criteria.

Don’t let your environment be a barrier to pairing, if you can’t make/find a good space, ad-hoc over the shoulder pairing still works great! While meeting these criteria for the ideal pairing space is awesome, you’ll still get a lot of benefits even from pairing on a laptop at a table in the lunch area.

If you find that pairing in general isn’t working, physical environment can play a significant role. Try some of the suggestions below as these station modifications can help with a lot of pairing problems.


For in-person pairs, having two mirrored monitors allows both pairers to see the work without looking over each other’s shoulders. For remote pairs, either screen sharing or live code sharing can work effectively. While live code sharing helps with switching, it makes it difficult to track non-code work; some suggestions on how to handle this are in the sections related to communication.


For in-person pairs, having two keyboards and two mice works very well when combined with two mirrored monitors.

For remote pairs using some sort of live code sharing can make it simpler to switch drivers.

Making it Easier to Switch Drivers

This is the hardest part of pairing.

For in-person pairs, the keyboard(s) and mouse should be comfortable to use for both pairers. For example, if one member uses a left-handed mouse, the mouse they are using should work for them. Likewise, the keyboards should be in a layout that is familiar to the pairer using them.

Shortcuts and productivity commands should be comfortable for both pairers. The IDE, OS, and any installed apps play a big part in this. Consider especially developer tools like the type of shell; these have a high learning curve and can impede a driver.

The bottom line is: make sure the machine being paired on can support both pairers. Any time a pairer ends up driving because switching is too difficult, check that it isn’t because of the environment.

Remote pairs deal with all of the same problems as in-person pairs, plus a few remote-specific ones. Having the environment local to one pairer can make it harder for the other to drive, and if one of the pairers needs to drive on a system that is remote to them, input delays can result. Serious connectivity problems can kill a remote pairing session outright. Remote pairing also requires quite a bit more coordination when switching roles, which introduces additional delay and can impede the flow of work. Finally, make sure that shared resources, like local documentation or a config file, can be accessed by both parties.


IDEs and Editors

Editors are a constant source of friction and can cause a lot of tension for the pair. The machine being used should always accommodate the lowest common denominator. As an example, I’m not proficient in Emacs and only a little capable with Vim; if asked to pair on a machine with Emacs/Vim, I would struggle to drive. Many modern IDEs like RubyMine, VSCode, and Atom are much simpler to use and have a lower bar for entry. They also generally have Vim plugins, can quickly switch configurations, and have functionality lookups. It’s OK to switch IDEs when the driver switches, but it’s less than ideal because switching is less fluid.

Physically Comfortable

The area both of the pairers are working in should be comfortable. When working in person, both pairers should have the same amount of physical space, especially when driving. For example, curved corner desks can create a situation where one person is uncomfortably situated or has less space available, and a small desk that’s enough for one person is usually uncomfortable for two people to sit at. Remember to take into account all the normal considerations for comfort and accessibility in the workplace. Similarly, when remote pairing, both pairers need to be physically comfortable; for example, one (or both) pairers shouldn’t be cramped in a phone booth.

Direct Communication

The pair needs to be able to communicate directly, often verbally. If one or both of the pairers is a person with a disability that prevents verbal communication, modify for other forms of communication. The goal is to be able to keep a constant dialogue going.

The pairers need to be able to hear each other, which implies that the work environment allows for discussion and open communication, or that the pair gets their own room. Especially when working remotely, both pairers need to speak loudly enough to hear one another.

I Don’t Like Talking

It can be intimidating to think that you’ll be having a constant dialogue with someone. There are ways to make it easier to do while pairing. The first is the structure of pairing itself, the roles help smooth communication between individuals. When taking on a role there are guidelines on what to talk about and when in both the Roles section and in specific Techniques. The second is trying some of the techniques that are useful for pairing. These give you frameworks on when to switch roles, what kind of questions to ask, and even how to get over communication blockages.

Language Barriers

Sometimes, the pairers don’t natively speak the same language. This is a pretty significant barrier to pairing and can cause some people to shy away from it. Whenever you feel that your pair hasn’t understood you, don’t forge on ahead. Spend some time clarifying what they’ve understood and defining anything they don’t. Not only will this improve your pairing, but it can help strengthen language skills in both pairers!

Indirect Communication

A large part of communication is indirect, conveyed through subtle cues. If one or both of the pairers is visually impaired, modify the recommendations to use other indirect cues.

In-person pairs need to be able to see each other well enough to pick up on visual cues. Remote pairs need to be able to see each other while screen sharing.


Having a whiteboard or some paper to draw on can help facilitate discussion immensely.

For in-person pairs, an actual whiteboard or some paper works well.

For remote pairs, a piece of paper with the camera trained on it works, or even a virtual whiteboard.

Pair Programming Techniques

There are different techniques that a pair can try when pairing to improve the experience or learn something new. Some of these techniques can work in tandem, while others are mutually exclusive. Picking a technique depends on who the members of the pair are and how they work together.

Perform a Prospective

Spend some time before pairing talking about things not related to the problem at hand. This can set the tone for the session and make everyone feel comfortable. You might also discuss any action items you want to address during the session that aren’t part of the task directly.

Possible Action Items

  • Investigate action items from the Retrospective of the last pairing session
  • Define learning goals (e.g. I want to learn TDD, I want to keep track of what refactorings we use)
  • Evaluate techniques or pairing styles you may use
  • Define switching cadence
  • Examine the pairing station

Ping-pong Pairing

This is a pairing technique derived from Test Driven Development’s (TDD) red→green→refactor loop. It works best if one or both of the pairers are familiar with TDD but can also be a learning opportunity in the novice-novice combination. This technique is also useful when two experts are pairing or when dealing with a particularly challenging problem. Finally, this is a great way to learn or improve TDD skills.

How Does it Work?

We’ll call one pairer left and the other right. The steps to this technique:

  1. Red → left starts off driving by writing a failing test that tests just one isolatable part of the solution, right navigates by talking left through what the test should test and makes sure they follow the three rules of TDD
  2. Green → right picks up the driver’s seat and writes the implementation that passes the test, left navigates them through this and makes sure they follow the three rules of TDD
  3. Refactor → right continues in the driver’s seat and refactors both the production and test code, left keeps the broader picture in mind and navigates the refactor
  4. The pair starts again, but this time right is the driver and writes the test while left navigates; they then switch so that left writes the implementation

Pomodoro Pairing

This is a technique based on the pomodoro technique. It helps resolve a watch-the-master situation or, more generally, helps in an expert-novice pair by giving the novice enough time in both roles to feel comfortable and get into a flow.

What is the Pomodoro Technique?

This is a quick breakdown of the pomodoro technique, not pomodoro pairing.

  1. Pick a task
  2. Start a 25 min timer (this is a pomodoro)
  3. Work until the timer rings or you’ve completed the given task
  4. Take a 5 min break where you spend time not thinking about work (no emails)
  5. Repeat
  6. After your fourth pomodoro, take a 35 min break
  7. Start again

This leads to ~2.5 hours per cycle. Along the way jot down the following:

  • what task a given pomodoro is for
  • “hard” interruptions (anything that completely takes your attention away from the given task) usually denoted with a - on a piece of paper
  • “soft” interruptions (brief internal distractions) usually denoted with a . on a piece of paper

If there are enough interruptions, restart the pomodoro (usually without a break)

How to Apply The Pomodoro Technique in Pairing?

We’ll call one pairer left and the other right. The steps to this technique:

  1. Start a pomodoro
  2. For this pomodoro left is the driver and right is the navigator
  3. Take the 5 min break, step away from the pair and do something individually
  4. Start another pomodoro
  5. For this pomodoro right is the driver and left is the navigator

It’s ok to sometimes switch roles within the pomodoro but each of the pairers should stick to the given role for the majority of the pomodoro. During the pomodoros keep track of the same things mentioned in the pomodoro technique.

Strong Style

For an idea to go from your head into the computer it *must* go through someone else’s hands - Llewellyn Falco

In the Strong Style technique, the navigator has some idea how they want the solution to look. The driver drives (as described in roles) and any time the driver has an idea they want to try out, the roles switch.

Digging for Gold

Sometimes one of the pairers will have a hard time listening to the other. They might feel that the other pairer's ideas don’t have merit or are lacking expertise. In these cases it’s useful for the pairer that’s having trouble finding value in the pair to try digging for gold. This technique is also useful when the pair is stuck and can’t find a way forward.

How to Dig for Gold?

One of the pairers makes a naive or impractical suggestion for moving forward. The pair then takes this idea and looks for divergent ideas that could help solve the problem. The pair explores this solution and its divergent offshoots until they’ve either found one that works or they feel they’ve found all the gold there is. If you end up in a rabbit hole, recall the problem you’re trying to solve and bring the pair’s attention back to it.

Perform a Retrospective

Spend some time reflecting on the pairing session. This can be helpful by allowing the pair to clear the air of any tension or issues that came up during the session. It can also help address any slowdowns and improve flow, lets the pair set a learning goal for next time, and provides a chance to reflect on this session’s learning goals. Finally, it’s a good way to celebrate wins and leave everyone with a sense of forward progress.

When using the Pomodoro Technique, reflect on the number of distractions and what needs to improve.

Questions to Ask

  • “Do you feel that this was a balanced pairing session?”
  • “How can we improve our station?”
  • “How do you feel after our session?”
  • “What did you learn/teach during our session?”
  • “How do you feel about the progress we made?”
  • “How did you feel about the length of the session?”
  • “Is there anything that stood out to you during our pairing?”

Don’t Rush

When one of the pairers moves forward too quickly, it’s good for them to check in with their pair before forging ahead. Pairing requires that both members of the pair are on the same page and understand what’s happening. Some ways to check in:


  • “Do you know where I’m going with this?”
  • “What do you think of moving along with this solution?”
  • Step back and don’t write code for a while; discuss what you’ll be doing instead.
  • Don’t “just try this one thing” before explaining what you’re doing.

Additional Resources

Here are some other great resources to learn more about Pair Programming!

  • Pairer Combinations
  • Communication and conflict resolution



We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.


A New Kubectl Plugin for Kubernetes Ingress Controller ingress-nginx


Shopify makes extensive use of ingress-nginx, an open source Kubernetes Ingress controller built upon NGINX. Nearly every request Shopify serves is handled at one point by ingress-nginx, and we are active contributors to the project. Since we make use of ingress-nginx and its many features (like annotations and configmap) so heavily, the quality of its debugging experience is very important to us. While debugging ingress-nginx was always possible using a complex litany of kubectl commands, the experience was frequently frustrating. To help solve this issue, I recently contributed a kubectl plugin to the project. It provides a number of features that make ingress-nginx much easier to upgrade and debug, saving us time and increasing our confidence while working with it.

Easier Upgrades

ingress-nginx is a fast moving project. New releases happen every few weeks, and usually contain one or two breaking changes. When running a very large cluster, it can be difficult to know whether or not any configuration changes need to be made to remain compatible with the new version. Our usual process for upgrading ingress-nginx before the plugin existed was to read the CHANGELOG for the new version, export every single ingress in a cluster as YAML, then manually grep through those YAMLs to find anything that would be broken by the new version. With the plugin, it's as simple as running:
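A sketch of that invocation, based on the plugin’s documented `lint` subcommand (the flags shown are illustrative):

```shell
# Check every ingress in every namespace for deprecated annotations
# and other problems flagged for the target version
kubectl ingress-nginx lint --all-namespaces --verbose
```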

to get a nice, formatted list of everything you might need to change.

Improved Ingress Listing

When running the vanilla kubectl get ingresses, a fairly minimal amount of information is returned:
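For illustration, the default output looks roughly like this (the ingress name and address below are invented):

```shell
kubectl get ingresses
# NAME          HOSTS             ADDRESS        PORTS     AGE
# example-ing   app.example.com   203.0.113.10   80, 443   15d
```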


Often you don’t just care about the hosts and addresses of a whole ingress, you care about the individual paths inside the ingress, as well as the services they point to. Using the plugin, you can get a more detailed view of the contents of those ingresses without inspecting each one:
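A sketch using the plugin’s documented `ingresses` subcommand:

```shell
# One row per host/path pair: the backing service and port,
# endpoint count, and whether TLS is configured
kubectl ingress-nginx ingresses --all-namespaces
```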

Using the plugin, it is much easier to answer questions like “what service is this path hitting?” or “does this site have TLS configured?”.

Better Debugging

There are many common debugging strategies for ingress-nginx that can become tedious to carry out manually. Usually, you are required to find and select an ingress pod to inspect, and you are required to filter the output of whatever commands you run in order to find the information you’re looking for. The plugin provides convenient wrappers for many kubectl commands that make it quicker and easier to perform these tasks, selecting a single ingress pod automatically to run the command in. As an example:
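A sketch of the manual sequence (the namespace and generated pod name are illustrative):

```shell
# Find an ingress controller pod by hand…
kubectl get pods -n ingress-nginx
# …then read its logs, pasting in the generated pod name
kubectl logs -n ingress-nginx nginx-ingress-controller-7d98c8d5f-abcde
```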

can be replaced by a single command:
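Assuming the plugin’s documented `logs` subcommand:

```shell
# The plugin selects a controller pod automatically
kubectl ingress-nginx logs -n ingress-nginx
```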

Likewise, inspecting the internal state of the controller is much easier. In the case of reading the controller’s generated nginx.conf for a particular host:
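A sketch of the manual approach (the pod name is illustrative):

```shell
# Dump the entire generated config from one controller pod,
# then search it by eye for the server block of interest
kubectl exec -n ingress-nginx nginx-ingress-controller-7d98c8d5f-abcde \
  -- cat /etc/nginx/nginx.conf
```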

can be replaced by:
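Assuming the plugin’s `conf` subcommand with its `--host` filter (hostname illustrative):

```shell
# Print only the server block generated for this host
kubectl ingress-nginx conf -n ingress-nginx --host store.example.com
```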

ingress-nginx stores some of its configuration state dynamically, making use of the openresty lua-nginx-module to add additional request handling logic. The plugin can be used to inspect this state as well. As an example, if you are using the session affinity annotations to add a session cookie, but the cookie doesn’t seem to be applied to requests, you can first use the plugin to check if the controller is registering the annotation correctly:
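A sketch, assuming the plugin’s documented `backends` subcommand (the backend name format is illustrative):

```shell
# Dump the controller's dynamic backend configuration and check
# the sessionAffinityConfig for the backend in question
kubectl ingress-nginx backends -n ingress-nginx --backend default-store-svc-80
```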

This shows that the annotation is correctly reflected in the controller’s dynamic configuration.

My Experience

Since the addition of the plugin, I have found that the time it takes me to upgrade or debug ingress-nginx has decreased substantially. When I first arrived as an intern here at Shopify in January 2019, I was tasked with upgrading our ingress-nginx deployments to version 0.22.0. I spent a long time grepping through ingress manifest dumps looking for breaking changes. I spent time trying to come up with the kubectl invocation that would allow me to inspect nginx.conf inside the controller. I didn’t even know that the dynamic configuration information existed, let alone the arcane incantations that would allow me to read it. There existed no way at all to read certificate information. It took me days to fully roll out the new version.

Near the end of my internship, I upgraded our deployments to ingress-nginx version 0.24.1. Finding breaking changes required only a few invocations of the lint subcommand. Debugging the controller configuration was similarly quick. I had the confidence to ship the new version much more quickly, and did so in a fraction of the time it had taken me a few months ago. Much of this can be attributed to the fact that there was now a single tool that allowed me to easily perform every debugging function I had previously been doing, plus a few more that I hadn’t even known existed. In addition, having all of these previously unintuitive and little-documented debugging tricks collected together in one easily usable tool will make it far easier to get started with debugging ingress-nginx for those who are unfamiliar with the project, as I was.

It’s also true that much of this time difference is due to my growth in both confidence and competence at Shopify. I’ve learned a great deal about Kubernetes, especially the nuts and bolts of how requests to Shopify services get from client to server and back. I’ve had the privilege of being paid to work on an open-source project. I’ve learned the skills, both technical and interpersonal, to function as part of a team in a large organization. Changes that I’ve made, code that I’ve written, have helped to process literally billions of web requests during my time here. This has been an extremely productive four months.

The plugin was released as part of ingress-nginx version 0.24.0, but should be compatible with version 0.23.0 as well. You can find the full plugin documentation and install instructions in the ingress-nginx docs. To get involved with the ingress-nginx project, or to ask questions, drop into the #ingress-nginx channel on the Kubernetes Slack.

If solving problems like these interests you, consider interning at Shopify. The intern application window is now open and closes on Wednesday, May 15th at 9:00 AM EST.  Applications will not be accepted after this date.


Building Shopify’s Application Security Program


Shopify builds products for an industry based on trust. From product discovery to purchase, we act as a broker of trust between the 800,000+ merchants who run their business on our platform and their customers, who come from anywhere in the world. That’s why it’s critical that everyone at Shopify understands the importance of trust in everything we build.

Security is a non-negotiable priority, and we’ve purposefully built a security mindset into our culture. It gives our security team a huge advantage because we start with engaged, talented, and security-minded members across our product teams. But, we also know how important it is that every business on our platform has access to the latest and most innovative features to help them be successful. So, the question is: How do we build an application security program that encourages safety at high speed, removes complexities, and fosters an environment for creative problem solving so that everyone can focus on delivering amazing products to our merchants?

There are three parts to our program that I will outline in this post: scaling secure applications, scaling security teams, and scaling security interactions. When I started at Shopify 7 years ago, I was the lone employee focused on security. Since then, we have grown to a team of dozens of security engineers, covering the breadth of Shopify’s applications, infrastructure, integrations, and hardware platform.

Scaling Secure Applications

As your company grows, the number of different applications and services that will be deployed will inevitably increase. For a small team, it can be daunting to think about providing security for many more services than there are team members, but there are ways to wrangle this sprawl and set your company up with trust at scale.

The first recommendation is to work across R&D disciplines (engineering, data, and UX) and decide on a homogeneous technical baseline you’ll use for your services. There are a lot of non-security advantages to doing this, so the appetite for standardization should be present already. For Shopify, deciding that we would default to all of our products being built in Ruby on Rails meant that our security tooling could go deep on the security concerns for Rails, without thinking about any other web application frameworks. We made similar technical choices up and down the stack (databases, routing, caching, and configuration management), which simplified the developer experience and also allowed us to ignore security concerns anywhere other than in the things we knew we ran.

Knowing what you are running is a lot harder than it sounds, but it is key to achieving security success at high speed. The way this is done will look different in every organization, but the objective will be the same: visibility. When a new vulnerability is announced, you need visibility into what needs to be patched and the ability to notify the responsible team or automatically kickstart the patching process for every affected service. At Shopify, our security team joined our Production Engineering team’s service tracking project and got a massive head start into having observability of the services, dependencies, and code of everything running in our environment, including the ability to automatically update application dependencies.

Additionally, every new application gets to start with the best defaults we have come up with to this point because we have collectively started hundreds of new projects with the same framework, in the same environment, and using the same technology.

Scaling Security Teams

In a start-up, product direction must be fluid and adapt quickly based on the discovery of new information to keep the company growing. Unless security features are differentiating your product from competitors, investing in a security team isn’t usually the top growth priority. For me, it took over a year before we hired our second security team member. This meant I wore a lot of hats and used some of the tactics described above to ensure a security foundation was included in all new product development.

Growing our security team meant carving off specializations to the first few people we hired. Fraud, application security, infrastructure security, networking, and anti-abuse all started as one-person teams going deep into a particular aspect of the overall security program and feeding their lessons back into the teams across the company.

You also need to understand your options for targeted activities and where third-party services can be used to advance your security agenda. Things like penetration testing, bug bounty programs, and auditing can be used as external validation on a time- and budget-limited basis.

No matter the size of the security team, any security incident is everyone’s responsibility to respond to. Having relationships with teams across the organization will help get the right people quickly moving when you’re faced with an urgent situation or a high severity risk to mitigate. It should never happen that the security team is left with only their team members to fix high priority issues. But there are always ways that security priorities can be embedded within other projects being worked on. Maintaining a list of long-term security enhancements that are ready to be worked on is an invaluable way to make things better without the overhead of staffing an entire team.

Scaling Security Interactions

Security teams are renowned for being slow, inconsistent, and risk-averse. In trying to defeat each of those stereotypes, the path to success is to be fast, automated, and risk-aware. The way your security team interacts with the rest of the company is the most important part of consistently building secure products for the long-term.

Deploying security tripwires at the testing and code repository levels allows your team to define dangerous methods and detect unwanted patterns as they are committed. The time when a developer is writing code is the best time to course-correct towards a more secure implementation. To make this effective, flagging a security risk should be designed to be like any good production alert: timely, high-fidelity, actionable, and with a low false positive rate.

Helped by the success of all the approaches discussed so far, we can build these tactics once and deploy across all of our codebases. With these tactics in place, you gain confidence that even when an application is totally off your radar, you know that it’s being built in line with your security standards. An example of this approach at Shopify is how we handle html_safe. In Rails, html_safe is a confusing function that renders a given string as unescaped HTML, which can be quite unsafe and lead to cross-site scripting vulnerabilities. Our approach to solving this problem consists of renaming this method to dangerously_output_as_html so it’s clear what it does, adding a comment to any pull requests using this method that links to our training materials on mitigating XSS, and triggering an alert to our Application Security (appsec) team so they can review the proposed code change and suggest an alternative and safer approach. This allows our application security team to focus on the exponential benefit of automation rather than the linear benefit of human reviews.
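A minimal, self-contained sketch of the renaming idea; the stub below stands in for Rails’ real `html_safe` (which returns an `ActiveSupport::SafeBuffer`), and the alias name is the only detail taken from the post:

```ruby
# Sketch only: Shopify's real change patches Rails internals and
# also wires up PR comments and appsec alerts; none of that is shown.
class String
  # Stand-in for Rails' html_safe, stubbed so this runs without
  # ActiveSupport. Rails actually returns a SafeBuffer, not a String.
  def html_safe
    self
  end

  # The loudly named alias: call sites now read as dangerous, which
  # makes them easy for reviewers and automated tripwires to spot.
  def dangerously_output_as_html
    html_safe
  end
end

puts "<b>bold</b>".dangerously_output_as_html
```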

Finally, our best security interactions are the ones we don’t need to have. For example, by making risk decisions at the infrastructure level, we can provide a trustworthy security baseline with our built-in safeguards and tripwires to the teams deploying applications running in that infrastructure without them even knowing those protections are there.

These are just a few of the ways we are tackling the problem of security at scale. Our team is always on the lookout for new ideas and people to join our team to help protect the hundreds of thousands of businesses running on our platform. If these sound like the kinds of problems you want to solve, check out these available positions: Director of Security Engineering, Security Engineering Manager, and Lead Software Engineer - Mobile Security.


One Million Dollars in Bug Bounties


Today, we’re excited to announce that we’ve awarded over $1M USD in bounties through our bounty programs. At Shopify, bounty programs complement our security strategy and allow us to leverage a community of researchers who help secure our platform. They each bring their own perspective and specialties and can evaluate our platform from thousands of different viewpoints to create a better Shopify product and a better user experience for the 800,000+ businesses we safeguard. Our ongoing investment is a clear indication that we are committed to security and making sure commerce is secure for everyone.

Some Bug Bounty Stats

Shopify is the fifth public program, out of 176, to reach the $1M USD milestone on HackerOne, our bug bounty platform. We’ve had some amazing reports and worked with awesome hackers over the last four years; here are some stats to put it into perspective:

Shopify's Bug Bounty Program Stats: Highest Bounty Award $25K. Over 400+ Hackers Thanked. Over 950+ Bugs Resolved. 750+ Bounties Awarded. 375+ Public Disclosures. Held 2 Live Events
Statistics about Shopify's Bug Bounty Programs Since Inception

Top Three Interesting Bugs

Shopify is dedicated to publicly disclosing all vulnerability reports discovered through our program to propel industry education and we strongly encourage other companies to do the same. Three of our most interesting resolved bugs over the years are:

1. SSRF in Exchange leads to ROOT access in all instances - Bounty: $25,000 

Shopify infrastructure is isolated into subsets of infrastructure. @0xacb reported it was possible to gain root access to any container in one particular subset by exploiting a server-side request forgery bug in the screenshotting functionality of Shopify Exchange. Within an hour of receiving the report, we disabled the vulnerable service, began auditing applications in all subsets and remediating across all our infrastructure. The vulnerable subset did not include Shopify core. After auditing all services, we fixed the bug by deploying a metadata concealment proxy to disable access to metadata information. We also disabled access to internal IPs on all infrastructure subsets.

2. Shopify admin authentication bypass using partners.shopify.com - Bounty: $20,000

@uzsunny reported that by creating two partner accounts sharing the same business email, it was possible to be granted “collaborator” access to a store. We tracked down the bug to incorrect logic in a piece of code that was meant to automatically convert an existing normal user account into a collaborator account. The intention was that, when a partner had a valid user account on the store, their collaborator account request could be accepted automatically, with the user account converted into a collaborator account. We fixed this issue by properly verifying that the existing account is in fact a user account.

3. Stored cross site scripting in Shopify admin and partner pages - Bounty: $5,000

@bored-engineer found we were incorrectly sanitizing sales channel icon SVG files uploaded by Partner accounts. During our remediation, we noted the XSS would execute in partners.shopify.com and the Shopify admin panel, which increased the impact of this bug. The admin functionality was not required, so it was removed. Additionally, we verified that the bug had not been exploited by any other users.

Shopify x HackerOne H1-514

Having reached the $1M in awarded bounties, we’re still looking for ways to ensure our program remains competitive and attractive to hackers. This year we’ll be experimenting with new ways to drive hacker engagement and make Shopify’s bug bounty program more lucrative and attractive to hack on.

Happy Hacking!

If you’re interested in helping to make commerce more secure, visit Shopify on HackerOne to start hacking or our career page to check out our open Trust and Security positions.


Shopify Developers Share Lessons on Self-Advocacy and Dealing with Adversity in the Technology Industry


Behind The Code is an ongoing series where we share the stories of Shopify developers and how they’re solving meaningful problems at Shopify and beyond.

In celebration of International Women’s Day, we’re featuring three female developers from various backgrounds sharing their experiences navigating a career in the tech industry, as well as their work and accomplishments at Shopify.

Developer Stella Lee on The Importance of Female Mentorship and Dealing with Adversity in the Technology Industry

Stella Lee

Stella Lee is a developer working on the Shopify Checkout Experience team. The team is responsible for converting a merchant’s customers’ intent to purchase a product into a successful purchase, and for making the buying experience as smooth as possible. She primarily works in TypeScript and Ruby, building features that aim to eliminate friction for the buyer by allowing them to reuse their information and reducing the number of fields they need to fill out manually. When she’s not busy writing code, she’s the founder and co-lead of Shopify’s internal women’s Employee Resource Group (ERG), f(empower), which is supported by our Employee Experience, Diversity and Belonging team and has an executive sponsor. The group works with teams across the business to enable a work environment with equal access to opportunity and supports women in achieving their full potential. The ERG is open to all employees of Shopify and provides a safe space to vocalize the collective experiences and difficulties women in tech face.

When asked what International Women’s Day means to her, Stella mentions she never saw the value of needing a specific day to celebrate what should be celebrated every day, but she’s warmed up to it. “The reality is, we still have a long way to go and days like today give people the opportunity to celebrate and learn how each of us can make a positive impact for women. This year, I want to take the opportunity to celebrate the achievements of my fellow ladies in tech, reflect on the current state of gender parity in the industry, and outline concrete ways to advocate for women.”

Dealing with Imposter Syndrome and Learning the Importance of Self-Advocacy

One of the reasons Stella feels so strongly about empowering other women is because she grew up with a mother who she describes as a true inspiration. After leaving behind a great education and career, her mother uprooted the family from their native home in South Korea to Canada in hopes of providing a better life for her family. “She’s the most selfless, strong and caring person that I’ve ever met. She’s the only person I’ve ever met that acts in the best interest of her children without ever expecting anything (not even an ounce of recognition) in return. Her unrelenting positivity and resilience in the face of adversity time and time again is truly inspiring.”

Growing up with such a strong role model has played a part in her development and ability to navigate the workplace. She expresses the importance of women advocating for themselves in a way where they can achieve their career goals. “You can't expect anyone else to take charge of your professional development, so you need to own that and figure out what it is you need to grow.” She believes the best way to do this is to ask for the opportunities needed to develop your skills like asking to own a feature of a project, or even to champion one.

Aside from learning how to advocate for herself, Stella has had to learn how to maneuver through her feelings of imposter syndrome: an inability to internalize accomplishments and a persistent fear of being exposed as a fraud. Imposter syndrome is something that countless people face in many different professions, no matter how far along one is in their career. For Stella, these feelings began when she switched to computer science halfway through completing her Bachelor’s degree. “I didn’t have the full educational background nor the internship experiences that many of my colleagues or other developers had, and with the emphasis on gender parity and the importance of diversity in tech, there are times when I’ve wondered if the only reason I’ve occupied a space filled with male developers was because I was a woman.” She goes on to describe her experience attending various external engineering events where people assume she’s in a non-engineering role or comment that she doesn’t look like a developer. “Previously, I would just internalize these experiences as validation that I didn’t belong within this space. I definitely still struggle with imposter syndrome, but the consistent practice of recognizing and transforming my negative thinking patterns through thought work has helped immensely with my confidence.”

Transforming “Stupid” Questions into Good Questions

Learning how to ask good questions has been her bread and butter since working as a developer, but she believes you can only do this by asking stupid questions first because they’ll eventually become better questions. “Figure out what you don’t know by asking yourself what questions you have, set a timebox, and then take a stab at asking them to yourself or researching the question. I’m a very independent person and I find it hard to ask for help when I know that I’ll eventually find the answer, but I learned later on that doing this wasn’t the best use of my time.” A key to asking questions is to start by stating what you already know, then diving deeper into investigating how to solve that problem. “This shows you’ve done your homework, and it helps the person formulate a more intentional response that isn’t too basic or too advanced.”

Remote Developer Lead Helen Lin on Effective Ways to Manage People and Teams

Helen Lin

Helen Lin works remotely in Vancouver, BC as a Lead Web Developer on the Themes team for the Shopify Online Store. The Themes team is responsible for helping our product lines integrate features into the online store, using JavaScript, Node.js, Liquid, Ruby, and SQL. Aside from providing free themes for our merchants, her team also establishes standards for making features more accessible and improving web performance.

Managing People Through Alignment and Stakeholder Management

As one of the few remote technical leads at Shopify, Helen shares some insight into how she manages her team: “I think the makings of a good lead mainly lie in your ability to understand your different reports and effectively give direction as to how a project needs to be run. If you don’t understand the long term vision of your company and don’t know how to map out what tasks need to be accomplished, then misalignment can occur, leading to low team morale and poor communication.” She periodically flies down to see her team and connect with all of her reports and stakeholders on the current project she’s working on, allowing her to build strong relationships with the people she works with. She also explained that managing people is one of the most challenging things she’s ever done, but she finds it very rewarding. “I won’t say it hasn’t come with some difficulties, because it definitely has, but learning how to empower people and communicate with people of different personalities and disciplines has taught me the importance of staying connected and aligned.”

Self-Advocacy and Sharing Accomplishments

Some people struggle to get aligned with their managers about their personal development and career goals; advocating for themselves is a struggle in itself. Helen, who’s been working for a number of years now, feels that when advocating for yourself it’s important to directly ask for what it is that you want. For her, self-advocacy is when she can push past her boundaries and do things she thought impossible. “Fighting through that internal fear and challenge is far more powerful than anything I have ever experienced. Now, I take that experience and find ways to help others advocate for themselves by sharing my stories of perseverance in the face of adversity.” She also invests time helping people narrow down their aspirations and figure out what they’re passionate about. “When there is a clear vision of what you want, then self-advocacy becomes easy because it's what you want to do and not what other people want you to do.”

Asking Questions and Staying Curious

For those interested in pursuing a career as a developer, she stresses how important it is to understand that asking questions and staying curious is a positive thing. “One of the best pieces of advice I’ve gotten was that I should understand that as human beings we all make mistakes, period. Regardless of whether you’re a junior or senior developer, you're bound to mess up at some point. It’s not about trying not to make mistakes, but about how you can fix the mistake and move forward from it.” Building resiliency is a great muscle to flex, and not being afraid to ask questions and make mistakes will only make you a better developer, so don’t be afraid to speak up.

Web Developer Cathryn Griffiths on Making Career Pivots and Creating an Inclusive Space in The Technology Industry

Cathryn Griffiths

Cathryn is a web developer working on the Checkout Experience team, which is responsible for making a customer’s purchase experience as smooth as possible. She’s currently acting as a champion on a project and is in charge of making sure decisions are made, deadlines get met, and work gets done. This is her first time championing a technical project, so it comes with certain challenges, but she has a strong network within her team to help her navigate this new experience.

As someone who has pivoted careers a number of times, Cathryn knows all about how difficult it can be to switch careers and find what you’re passionate about. She’s gone from pursuing a career in academia to working in the private sector as a clinical trial project manager. “After realizing I didn’t have a passion for working in health sciences, I decided to go back to school and gave programming a shot after hearing about the exciting and challenging work that a programmer does. I enrolled in a Bachelor of Computer Science at Concordia University and by my third semester, I had secured a role as a Front-End Developer so I stopped pursuing my BCompSc and started working full-time.”

Finding Her Passion and Changing Careers

Maneuvering through different industries was not an easy thing to do, but she managed by always being open to discovering what work she actually enjoyed doing. However,  making the switch to the technology industry specifically comes with its own challenges, especially as a woman pursuing a career in a male-dominated field. “When I left my first programming job to go to my second one, I hesitated a lot about moving on to the new job because I was afraid I might get stuck in a toxic work environment where my gender would be a problem. That same feeling, that reluctance, hesitation, and nervousness happened again when I thought about leaving my previous job for Shopify. Luckily, in both cases, I ended up in fantastic, supportive work environments.”

We also asked Cathryn how the technology industry can make this space more inclusive for all, especially given her experience making the switch into tech. “On a larger scale: the more diverse our industry can be, the more inclusive it can be too. We have to hire more minorities, and have a workforce with a diverse array of races, ages, genders, sexualities, and ethnicities.” Diversity in the workplace has been proven to be very beneficial to companies, and many have initiatives in place to promote a more inclusive workplace and a more diverse workforce. When asked what companies can do on a day-to-day basis to promote inclusivity, she said “On a smaller scale, something I love that Shopify does is that every meeting room has a paper pyramid that sits right on the meeting table to help guide a more inclusive discussion. Each face of the pyramid poses a question or statement to ensure people’s voices are being heard during meetings, whether online or in person. For example, ‘Overtalking is real. Go back to the person who was interrupted and let them finish.’” These pyramids are about cultivating a space where people feel comfortable speaking up and become mindful of speaking too much. So someone like Cathryn, who is newer to the company, feels included in the conversation and supported by her fellow coworkers.

Advice for People Looking to Switch Careers

As someone who spent years establishing herself in different career paths, she began asking herself questions like, “What are my goals?”, “What do I want to accomplish at the end of the day?” and “Am I enjoying my work?”. Asking herself these types of questions was pivotal in helping her discover which career she was the most passionate about.

Final Thoughts

“Work hard and embrace feedback. Own your accomplishments, be proud of them, and don’t be afraid to tell others about them. Additionally, own your failures and don’t be afraid to acknowledge them — it’s only by acknowledging them that you can grow from them.”

At Shopify, we’re committed to designing a workplace that challenges and supports employees to hone their craft and make meaningful impact for entrepreneurs around the world. We know that in order to build a platform that will ‘make commerce better for everyone’, we need to have a diverse team building that product and are committed to fostering an inclusive work environment that harnesses differences and brings the best out of each and every individual.  

We’re always looking for awesome people of all backgrounds and experiences to join our team. Visit our Engineering career page to find out what we’re working on.

Continue reading

Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity

Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity

Shopify is one of the largest Ruby on Rails codebases in existence. It has been worked on for over a decade by more than a thousand developers. It encapsulates a lot of diverse functionality from billing merchants, managing 3rd party developer apps, updating products, handling shipping and so on. It was initially built as a monolith, meaning that all of these distinct functionalities were built into the same codebase with no boundaries between them. For many years this architecture worked for us, but eventually, we reached a point where the downsides of the monolith were outweighing the benefits. We had a choice to make about how to proceed.

Microservices surged in popularity in recent years and were touted as the end-all solution to all of the problems arising from monoliths. Yet our own collective experience told us that there is no one size fits all best solution, and microservices would bring their own set of challenges. We chose to evolve Shopify into a modular monolith, meaning that we would keep all of the code in one codebase, but ensure that boundaries were defined and respected between different components.

Each software architecture has its own set of pros and cons, and a different solution will make sense for an app depending on what phase of its growth it is in. Going from monolith to modular monolith was the next logical step for us.

Monolithic Architecture

According to Wikipedia, a monolith is a software system in which functionally distinguishable aspects are all interwoven, rather than containing architecturally separate components. What this meant for Shopify was that the code that handled calculating shipping rates lived alongside the code that handled checkouts, and there was very little stopping them from calling each other. Over time, this resulted in extremely high coupling between the code handling differing business processes.

Advantages of Monolithic Systems

Monolithic architecture is the easiest to implement. If no architecture is enforced, the result will likely be a monolith. This is especially true in Ruby on Rails, which lends itself nicely to building them due to the global availability of all code at an application level. Monolithic architecture can take an application very far since it’s easy to build and allows teams to move very quickly in the beginning to get their product in front of customers earlier. 

Maintaining the entire codebase in one place and deploying your application to a single place has many advantages. You’ll only need to maintain one repository, and be able to easily search and find all functionality in one folder. It also means only having to maintain one test and deployment pipeline, which, depending on the complexity of your application, may avoid a lot of overhead. These pipelines can be expensive to create, customize, and maintain because it takes concerted effort to ensure consistency across them all. Since all of the code is deployed in one application, the data can all live in a single shared database. Whenever a piece of data is needed, it’s a simple database query to retrieve it. 

Since monoliths are deployed to one place, only one set of infrastructure needs to be managed. Most Ruby applications come with a database, a web server, background job capabilities, and then perhaps other infrastructure components like Redis, Kafka, Elasticsearch and much more. Every additional set of infrastructure increases the amount of time you will have to spend with your DevOps hat on rather than your building hat. Additional infrastructure also increases the possible points of failure, decreasing your application’s resiliency and security.

One of the most compelling benefits of choosing the monolithic architecture over multiple separate services is that you can call into different components directly, rather than needing to communicate over web service APIs. This means you don’t have to worry about API version management, backward compatibility, or potentially laggy calls.
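To make the contrast concrete, here is a hypothetical toy example: in a monolith, shipping code calls tax code as a plain Ruby method call, with no HTTP client, no serialization, and no API version to manage. The module and method names below are invented for illustration, not Shopify's actual code.

```ruby
# Toy tax component: a lookup table standing in for real tax logic.
module Taxes
  def self.rate_percent_for(region)
    { "CA-ON" => 13 }.fetch(region, 0) # whole-percent rates
  end
end

# Toy shipping component calling the tax component directly, in-process.
module Shipping
  def self.quote_cents(base_cents, region)
    # A plain method call: no network hop, no backward-compatibility concerns.
    base_cents + base_cents * Taxes.rate_percent_for(region) / 100
  end
end

Shipping.quote_cents(10_000, "CA-ON") # => 11300
```

In a microservice world, that `Taxes.rate_percent_for` call would become a network request to a separately deployed service, bringing latency and versioning concerns with it.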

Disadvantages of Monolithic Systems

However, if an application reaches a certain scale or the team building it reaches a certain scale, it will eventually outgrow monolithic architecture. This occurred at Shopify in 2016 and was evident by the constantly increasing challenge of building and testing new features. Specifically, a couple of things served as tripwires for us.

The application was extremely fragile with new code having unexpected repercussions. Making a seemingly innocuous change could trigger a cascade of unrelated test failures. For example, if the code that calculates our shipping rate called into the code that calculates tax rates, then making changes to how we calculate tax rates could affect the outcome of shipping rate calculations, but it might not be obvious why. This was a result of high coupling and a lack of boundaries, which also resulted in tests that were difficult to write, and very slow to run on CI. 

Developing in Shopify required a lot of context to make seemingly simple changes. When new Shopifolk onboarded and got to know the codebase, the amount of information they needed to take in before becoming effective was massive. For example, a new developer who joined the shipping team should only need to understand the implementation of the shipping business logic before they can start building. However, the reality was that they would also need to understand how orders are created, how we process payments, and much more since everything was so intertwined. That’s too much knowledge for an individual to have to hold in their head just to ship their first feature. Complex monolithic applications result in steep learning curves.

All of the issues we experienced were a direct result of a lack of boundaries between distinct functionality in our code. It was clear that we needed to decrease the coupling between different domains, but the question was how.

Microservice Architecture

One solution that is very trendy in the industry is microservices. Microservices architecture is an approach to application development in which a large application is built as a suite of smaller services, deployed independently. While microservices would address the problems we experienced, they’d bring another whole suite of problems. 

We’d have to maintain multiple different test & deployment pipelines and take on infrastructural overhead for each service while not always having access to the data we need when we need it. Since each service is deployed independently, communicating between services means crossing the network, which adds latency and decreases reliability with every call. Additionally, large refactors across multiple services can be tedious, requiring changes across all dependent services and coordinating deploys.

Modular Monoliths

We wanted a solution that increased modularity without increasing the number of deployment units, allowing us to get the advantages of both monoliths and microservices without so many of the downsides.

Monolith vs Microservices by Simon Brown

A modular monolith is a system where all of the code powers a single application and there are strictly enforced boundaries between different domains.

Shopify’s Implementation of the Modular Monolith: Componentization

Once it was clear that we had outgrown the monolithic structure, and it was affecting developer productivity and happiness, a survey was sent out to all the developers working in our core system to identify the main pain points. We knew we had a problem, but we wanted to be data-informed when coming up with a solution, to ensure it was designed to actually solve the problem we had, not just the anecdotally reported one.

The results of that survey informed the decision to split up our codebase. In early 2017, a small but mighty team was put together to tackle this. The project was initially named “Break-Core-Up-Into-Multiple-Pieces”, and eventually evolved into “Componentization”.

Code Organization

The first issue they chose to address was code organization. At this time, our code was organized like a typical Rails application: by software concepts (models, views, controllers). The goal was to re-organize it by real-world concepts (like orders, shipping, inventory, and billing), in an attempt to make it easier to locate code, locate people who understand the code, and understand the individual pieces on their own. Each component would be structured as its own mini Rails app, with the goal of eventually namespacing them as Ruby modules. The hope was that this new organization would highlight areas that were unnecessarily coupled.
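A sketch of what that organization might look like, with a hypothetical component name and layout (not Shopify's actual tree): each component gets its own mini Rails app directory, and every class inside it is namespaced under the component's module.

```ruby
# Hypothetical layout for one component, structured as a mini Rails app:
#
#   components/shipping/app/models/shipping/rate.rb
#   components/shipping/app/controllers/shipping/rates_controller.rb
#   components/shipping/test/...
#
# Every class lives under the component's top-level module:
module Shipping
  class Rate
    attr_reader :carrier, :price_cents

    def initialize(carrier, price_cents)
      @carrier = carrier
      @price_cents = price_cents
    end
  end
end

rate = Shipping::Rate.new("CanadaPost", 1095)
rate.carrier # => "CanadaPost"
```

Organizing by real-world concept means the `Shipping::` prefix immediately tells a reader which domain, and which team, a class belongs to.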

Reorganization By Real World Concepts Before And After Snapshots

Coming up with the initial list of components involved a lot of research and input from stakeholders in each area of the company. We did this by listing every Ruby class (around 6,000 in total) in a massive spreadsheet and manually labeling which component it belongs in. Even though no code changed in this process, it still touched the entire codebase and was potentially very risky if done incorrectly. We achieved this move in one big-bang PR built by automated scripts. Since the changes introduced were just file moves, any failures would come from our code not knowing where to find object definitions, surfacing as runtime errors. Our codebase is well tested, so by running our tests locally and in CI without failures, as well as running through as much functionality as possible locally and on staging, we were able to ensure that nothing was missed. We chose to do it all in one PR so we’d disrupt developers as little as possible. An unfortunate downside of this change is that we lost a lot of our Git history in GitHub when file moves were incorrectly tracked as deletions and creations rather than renames. We can still trace the origins using the git `--follow` option, which follows history across file moves; however, GitHub doesn’t understand the move.

Isolating Dependencies

The next step was isolating dependencies, by decoupling business domains from one another. Each component defined a clean dedicated interface with domain boundaries expressed through a public API and took exclusive ownership of its associated data. While the team couldn’t achieve this for the whole Shopify codebase since it required experts from each business domain, they did define patterns and provide tools to complete the task. 

We developed a tool called Wedge in-house, which tracks the progress of each component towards its goal of isolation. It highlights any violations of domain boundaries (when another component is accessed through anything but its publicly defined API), and data coupling across boundaries. To achieve this, we wrote a tool to hook into Ruby tracepoints during CI to get a full call graph. We then sort callers and callees by component, selecting only the calls that are across component boundaries, and sending them to Wedge. Along with these calls, we send along some additional data from code analysis, like ActiveRecord associations and inheritance. Wedge then determines which of those cross-component things (calls, associations, inheritance) are ok, and which are violating. Generally:

  • Cross-component associations are always violating componentization
  • Calls are ok only to things that are explicitly public
  • Inheritance will be similar but isn’t yet fully implemented

Wedge then computes an overall score as well as lists violations per component.
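The tracepoint idea described above can be sketched in a few lines. This is a minimal illustration of the technique, not Wedge itself: maintain a call stack from `:call`/`:return` events and record an edge whenever a call crosses from one top-level module (standing in for a "component") into another.

```ruby
# Map a method's defining module to a "component" name: unwrap singleton
# classes like #<Class:Shipping> and take the top-level module.
component_of = lambda do |mod|
  mod.to_s.sub(/\A#<Class:(.*)>\z/, '\1').split("::").first
end

edges = Hash.new(0) # { [from_component, to_component] => call count }
stack = []

trace = TracePoint.new(:call, :return) do |tp|
  comp = component_of.call(tp.defined_class)
  if tp.event == :call
    from = stack.last
    edges[[from, comp]] += 1 if from && from != comp # cross-component call
    stack.push(comp)
  else
    stack.pop
  end
end

# Two toy "components" to exercise the tracer:
module Shipping
  def self.quote
    Taxes.rate + 5
  end
end

module Taxes
  def self.rate
    13
  end
end

trace.enable { Shipping.quote }
edges.each { |(from, to), n| puts "#{from} -> #{to}: #{n}" }
```

Running this records one `Shipping -> Taxes` edge. A real implementation would also attribute calls to components by file path and feed the aggregated edges to a reporting service, as the post describes.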

Shopify's Wedge - Tracking the Progress of Each Component Towards its Goal of Isolation

As a next step, we will graph score trends over time, and display meaningful diffs so people can see why and when the score changed.

Enforcing Boundaries

In the long term, we’d like to take this one step further and enforce these boundaries programmatically. This blog post by Dan Manges provides a detailed example of how one app team achieved boundary enforcement. While we are still researching the approach we want to take, the high-level plan is to have each component only load the other components that it has explicitly depended upon. This would result in runtime errors if it tried to access code in a component that it had not declared a dependency on. We could also trigger runtime errors or failing tests when components are accessed through anything other than their public API. 

We’d also like to untangle the domain dependency graph by removing accidental and circular dependencies. Achieving complete isolation is an ongoing task, but it’s one that all developers at Shopify are invested in and we are already seeing some of the expected benefits. As an example, we had a legacy tax engine that was no longer adequate for the needs of our merchants. Before the efforts described in this post, it would have been an almost impossible task to swap out the old system for a new one. However, since we had put so much effort into isolating dependencies, we were able to swap out our tax engine for a completely new tax calculation system.

In conclusion, no architecture is often the best architecture in the early days of a system. This isn’t to say don’t implement good software practices, but don’t spend weeks and months attempting to architect a complex system that you don’t yet know. Martin Fowler’s Design Stamina Hypothesis does a great job of illustrating this idea, by explaining that in the early stages of most applications you can move very quickly with little design. It’s practical to trade off design quality for time to market. Once the speed at which you can add features and functionality begins to slow down, that’s when it’s time to invest in good design. 

The best time to refactor and re-architect is as late as possible, as you are constantly learning more about your system and business domain as you build. Designing a complex system of microservices before you have domain expertise is a risky move that too many software projects fall into. According to Martin Fowler, “almost all the cases where I’ve heard of a system that was built as a microservice system from scratch, it has ended in serious trouble… you shouldn’t start a new project with microservices, even if you’re sure your application will be big enough to make it worthwhile”.

Good software architecture is a constantly evolving task and the correct solution for your app absolutely depends on what scale you’re operating at. Monoliths, modular monoliths, and Service Oriented Architecture fall along an evolutionary scale as your application increases in complexity. Each architecture will be appropriate for a different sized team/app and will be separated by periods of pain and suffering. When you do start experiencing many of the pain points highlighted in this article, that’s when you know you’ve outgrown the current solution and it’s time to move onto the next.

Thank you to Simon Brown for permission to post his Monolith vs Microservices image. For more information on modular monoliths, please check out Simon's talk from GOTO18.

We're always on the lookout for talent and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.

Continue reading

Unifying Our GraphQL Design Patterns and Best Practices with Tutorials

Unifying Our GraphQL Design Patterns and Best Practices with Tutorials


In 2015, Shopify began a journey to put mobile first. The biggest undertaking was rebuilding our native Shopify Mobile app and improving the tools and technology to accomplish this. We experimented with GraphQL to build the next generation of APIs that power our mobile apps and give our 600,000+ merchants the same seamless experience when using Shopify. There are currently hundreds of Shopify developers across teams and offices contributing to our GraphQL APIs, including our two public APIs: Admin and Storefront.

Continue reading

Engineering a Historic Moment: Shopify Gets Ready for Cannabis in Canada

Engineering a Historic Moment: Shopify Gets Ready for Cannabis in Canada

On October 17th, 2018, Canada ended a 95-year history of cannabis prohibition. For Shopify, the legalization of cannabis marked a new industry entering the Canadian retail market and we worked with governments and licensed sellers across the country to provide a safe, reliable and scalable platform for their business. For our engineering team, it meant significant changes to our platform to meet the strict requirements for this newly regulated industry.

The biggest change for Shopify was the requirement to store personal information in Canada. This required Canadian-specific infrastructure that we were able to develop through our recent move to the Cloud and Google Cloud Platform’s new region in Montreal. Using this platform as our foundation, we created a new instance of Shopify, in an entirely new region, to meet the needs of this industry. In our migration, we built several new Google Cloud Platform projects (all based in the Montreal region) which included key projects housing Shopify’s core infrastructure such as PCI compliant payment processing infrastructure and a regional data warehouse.

The core infrastructure, which runs on a mixture of Google Kubernetes Engine and Google Compute Engine, already existed in our other regions which meant adding another region was relatively straightforward. We used Terraform to declare and configure all parts of the underlying infrastructure, like networks and Kubernetes Engine clusters. We also took advantage of improved resiliency features in Google Cloud Platform, such as regional clusters. We structured our compute node clusters to segregate workloads, minimizing the noisy neighbour problem to ensure maximum stability and reliability. After a few months of building out this infrastructure, configuring and testing it, we had the first working version of this new regional infrastructure running test shops with a functional storefront and admin. That’s when we faced our next major challenge: scaling.

A major factor in our scaling work was a social one: determining the behavior of cannabis consumers, an area with little to no available research. Most existing research focused on cannabis producers, while Shopify needed to understand consumers. We modeled a number of different traffic scenarios and provisioned enough infrastructure to ensure we could handle the peak traffic from each one. Some of the possibilities we considered included:
  • A strong initial, worldwide surge of interest on storefront pages as curiosity about a government-run online cannabis store peaks
  • Waves of traffic based on multiple days of media coverage across the world, with local timezone spikes
  • Very strong initial sales in the first minutes and hours of store openings as Canadians rush to be one of the first to legally purchase recreational cannabis
  • Possible bursts of denial of service attacks from malicious actors 

We went through multiple cycles of load testing using a mix of different storefront traffic patterns, varying the relative percentages of search, product browsing, collection browsing and checkout actions to stress the system in different ways. Each cycle included different fixes and configuration changes to improve the performance and throughput of the system until we were satisfied that we would be able to handle all possible traffic scenarios. In addition, we modeled and tested different types of bot attacks to ensure our platform defenses were effective. Finally, we conducted multiple pre-mortem discussions and built out mitigation plans to address any scenario which would cause downtime for our merchants.

At the same time, we were solving how to keep personal information contained in Canada. This was extremely challenging as Shopify was built from day one with a number of storage and communications systems located outside of Canada, such as our data warehouse and network infrastructure. We examined each system for personal information to ensure that this information remains stored in Canada.

We ensured there were protections for regional storage in multiple places: inside the application, within the hosts, and at the network/infrastructure level. For our main Ruby on Rails application, we:

  • Built a library which captured network requests and verified the requested host belonged to a list of known safe endpoints.
  • Utilized strict network firewall rules and minimized interconnections to ensure that data wouldn’t accidentally traverse into other jurisdictions.
  • Deployed the containers which house the main application with the absolute minimum number of secrets necessary for the service to function in order to ensure that any service outside the jurisdiction reached in error would simply reject the request due to insufficient credentials.
  • Ensured the infrastructure used unique SSL certificates so data would not cross-pollinate between internal pieces of the system.
  • Deployed all these protections, in combination with monitoring and alerting, ensuring the teams involved were notified of potential issues. 
In the vast majority of cases, more than one of these protections applies to a particular piece of data or system, meaning there are multiple layers of protection in place to ensure that personal information does not migrate outside of Canada.
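The first bullet above, a library that captures network requests and verifies the requested host against known safe endpoints, can be sketched as follows. The allowlisted hostnames, module name, and error class are all invented for illustration; the real library would hook into the HTTP client layer (for example by wrapping `Net::HTTP`) rather than being called explicitly.

```ruby
require "uri"

# Hypothetical sketch of a regional request guard: outbound requests are
# only allowed to hosts on a known-safe list, so a service outside the
# jurisdiction can never be reached by accident.
module RegionGuard
  ALLOWED_HOSTS = ["api.shopify.internal.ca", "montreal.storage.example"].freeze

  class DisallowedHostError < StandardError; end

  # Verify a URL's host before the request is allowed out.
  def self.check!(url)
    host = URI.parse(url).host
    unless ALLOWED_HOSTS.include?(host)
      raise DisallowedHostError, "outbound request to #{host} blocked"
    end
    url
  end
end

RegionGuard.check!("https://api.shopify.internal.ca/orders") # allowed
# RegionGuard.check!("https://us-east.example.com/x")        # raises DisallowedHostError
```

Combined with the firewall rules and minimal credentials described above, a check like this gives the application-level layer of the defense in depth.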

As launch day neared, we reduced the amount of change we applied to the environment to minimize risk. While the merchants were in their final testing cycles, we continued to perform load testing to ensure that the environment was optimally configured and ready. Having a successful launch day was critical for our merchants and we decided to scale the environment to handle five times the traffic and sales volume projections for launch day. Internally, we ran a series of game days (a form of fault injection where we test our assumptions about the system by degrading its dependencies under controlled conditions) for core infrastructure teams to validate that system performance and alerting was sufficient.

On launch day, merchants chose to take full advantage of the excitement and opened their stores one minute after midnight in their local time zones. That meant we’d see both retail and online launches starting at 10:31 PM EDT on October 16th (Newfoundland and Labrador) and continue through every hour until 3:01 AM EDT on October 17th (British Columbia). And at 12:01 NST, the first legal sale of cannabis in Canada was made on Shopify’s point of sale in Newfoundland followed by successful launches in Prince Edward Island, Ontario and British Columbia — all with zero downtime, excellent performance and secure storage and transmission of personal information within Canada.

Being part of launching a new retail industry and acting as a trusted partner with multiple licensed sellers while building infrastructure with regional data storage requirements, all on a strict deadline, was quite a challenge which required coordination across all Shopify departments. We learned a lot about what it takes to support regulated industries and restricted markets, knowledge which will help us support similar markets in the future, both in Canada and throughout the world. A number of the technologies and processes we developed during this project will continue to be improved and reused to support future deployments with similar requirements. Overall, it was incredibly rewarding to be part of a historic launch by contributing to and supporting the success of licensed recreational cannabis retailers throughout the country.

Intrigued? Shopify is hiring and we’d love to hear from you. Please take a look at the Production Engineer and Senior Technical Security Analyst roles available.

Attracting Local Talent And Building Mobile Apps: A Developer Hiring Initiative

Shopify is a commerce platform that serves over 600,000 merchants and employs over 3,000 people across the globe. We’re always on the lookout for highly skilled individuals with diverse backgrounds to join our team, and that requires us to connect with them outside traditional recruitment channels. That’s why last year we ran the “Build Things, Show Shopify” initiative, inviting developers outside of Shopify to build an app and showcase their finished product to a multidisciplinary panel of Shopify employees as well as in front of hiring managers and VCs. The outcome? Not only did we build a local developer community in Ottawa, but we added a number of potential hires to our recruiting pipeline.

How Shopify Uses Recommender Systems to Empower Entrepreneurs

Authors: Dóra Jámbor and Chen Karako 

There is a good chance you have come across a “recommended for you” statement somewhere in our data-driven world. This may be while shopping on Amazon, hunting for new tracks on Spotify, looking to decide what restaurant to go to on Yelp, or browsing through your Facebook feed — ranking and recommender systems are an extremely important feature of our day-to-day interactions.

This is no different at Shopify, a cloud-based, multi-channel commerce platform that powers over 600,000 businesses of all sizes in approximately 175 countries. Our customers are merchants that use our platform to design, set up, and manage their stores across multiple sales channels, including web, mobile, social media, marketplaces, brick-and-mortar locations, and pop-up shops.

Shopify builds many different features in order to empower merchants throughout their entrepreneurial lifecycle. But with the diversity of merchant needs and the variety of features that Shopify provides, it can quickly become difficult for people to filter out what’s relevant to them. We use recommender systems to suggest personalized insights, actions, tools and resources to our merchants that can help their businesses succeed. Every choice a merchant makes has consequences for their business and having the right recommendation at the right time can make a big difference.

In this post, we’ll describe how we design and implement our recommender system platform.


Collaborative Filtering (CF) is a common technique to generate user recommendations for a set of items. For Shopify, users are merchants, and items are business insights, apps, themes, blog posts, and other resources and content that merchants can interact with. CF allows us to leverage past user-item interactions to predict the relevance of each item to a given user. This is based on the assumption that users with similar past behavior will show similar preferences for items in the future.

The first step of designing our recommender system is choosing the right representation for user preferences. One way to represent preferences is with user-item interactions, derived from implicit signals like the user’s past purchases, installations, clicks, views, and so on. For example, in the Shopify App Store, we could use 1 to indicate an app installation and 0 to represent an unobserved interaction with the given app.

User-item interaction

These user-item interactions can be collected across all items, producing a user preference vector.

User preference vector

This user preference vector allows us to see the past behavior of a given user across a set of items. Our goal is now to predict the relevance of items that the user hasn’t yet interacted with, denoted by the red 0s. A simple way of achieving our goal is to treat this as a binary classification problem. That is, based on a user’s past item interactions, we want to estimate the probability that the user will find an item relevant.

User preference (left) and predicted relevance (right)

We do this binary classification by learning the relationship between the item itself and all other items. We first create a training matrix of all user-item interactions by stacking users’ preference vectors. Each row in this matrix serves as an individual training example. Our goal is to reconstruct our training matrix in a way that predicts relevance for unobserved interactions.
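
To make the representation concrete, here is a minimal sketch of building such a training matrix; the app catalog and merchants are made up for the example:

```python
# Hypothetical catalog and merchants, for illustration only.
apps = ["reviews", "email", "shipping", "seo"]

def preference_vector(installed, catalog):
    """1 for an observed installation, 0 for an unobserved interaction."""
    return [1 if app in installed else 0 for app in catalog]

users = {
    "merchant_a": {"reviews", "email"},
    "merchant_b": {"email", "seo"},
    "merchant_c": {"reviews", "shipping"},
}

# Stacking the preference vectors gives the training matrix:
# one row per user, one column per item.
training_matrix = [preference_vector(inst, apps) for inst in users.values()]
print(training_matrix[0])  # → [1, 1, 0, 0]
```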

There are a variety of machine learning methods that can achieve this task including linear models such as Sparse Linear Methods (SLIM), linear method variations (e.g., LRec), autoencoders, and matrix factorization. Despite the differences in how these models recover item relevance, they can all be used to reconstruct the original training matrix.

At Shopify, we often use linear models because of the benefits they offer in real-world applications. For the remainder of this post, we’ll focus on these techniques.

Linear methods like LRec and its variations solve this optimization problem by directly learning an item-item similarity matrix. Each column in this item-item similarity matrix corresponds to an individual item’s model coefficients.

We put these pieces together in the figure below. On the left, we have all user-item interactions, our training matrix. In the middle, we have the learned item-item similarity matrix where each column corresponds to a single item. Finally, on the right, we have the predicted relevance scores. The animation illustrates our earlier discussion of the prediction process.

User-item interactions (left), item-item similarity (middle), and predicted relevance (right)

To generate the final user recommendations, we take the items that the user has not yet interacted with and sort their predicted scores (in red). The top scored items are then the most relevant items for the user and can be shown as recommendations as seen below.
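
The scoring and ranking step above can be sketched as follows. The similarity values here are hand-written stand-ins for learned model coefficients, not output from a real model:

```python
# Hypothetical items and item-item similarity matrix;
# item_similarity[i][j] plays the role of a learned coefficient
# relating item i to item j.
items = ["reviews", "email", "shipping", "seo"]

item_similarity = [
    [0.0, 0.4, 0.1, 0.2],
    [0.4, 0.0, 0.3, 0.5],
    [0.1, 0.3, 0.0, 0.1],
    [0.2, 0.5, 0.1, 0.0],
]

def recommend(user_vector, top_n=2):
    # Predicted relevance: user preference vector times each item's column.
    scores = [
        sum(u * item_similarity[i][j] for i, u in enumerate(user_vector))
        for j in range(len(items))
    ]
    # Rank only the items the user has not yet interacted with.
    unseen = [(scores[j], items[j]) for j, u in enumerate(user_vector) if u == 0]
    return [name for _, name in sorted(unseen, reverse=True)[:top_n]]

print(recommend([1, 1, 0, 0]))  # user has reviews + email → ['seo', 'shipping']
```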

Personalized app recommendations on the Shopify App Store

Linear methods and this simple binary framework are commonly used in industry as they offer a number of desired features to serve personalized content to users. The binary aspect of the input signals and classification allows us to maintain simplicity in scaling a recommender system to new domains, while also offering flexibility with our model choice.

Scalability and Parallelizability

As shown in the figure above, we train one model per item on all user-item interactions. While the training matrix is shared across all models, the models can be trained independently from one another. This allows us to run our model training in a task-parallel manner, while also reducing the time complexity of the training. Additionally, as the number of users and items grows, this parallel treatment favors the scalability of our models.
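
Because each item's model depends only on the shared training matrix, the per-item training loop parallelizes trivially. A toy sketch using Python's standard library, where the "model" is just a co-occurrence count standing in for a real linear solver:

```python
from concurrent.futures import ThreadPoolExecutor

# Shared training matrix: rows are users, columns are items.
training_matrix = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]

def train_item_model(item_index):
    # Placeholder "model": co-occurrence of this item with every other item.
    column = [row[item_index] for row in training_matrix]
    return [
        sum(a * b for a, b in zip(column, (row[j] for row in training_matrix)))
        for j in range(len(training_matrix[0]))
    ]

# Each column of the item-item matrix is computed independently, in parallel.
with ThreadPoolExecutor() as pool:
    item_item = list(pool.map(train_item_model, range(3)))
```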


When building recommender systems, it’s important that we can interpret a model and explain the recommendations. This is useful when developing, evaluating, and iterating on a model, but is also helpful when surfacing recommendations to users.

The item-item similarity matrix produced by the linear recommender provides a handy tool for interpretability. Each entry in this matrix corresponds to a model coefficient that reflects the learned relationship of two items. We can use this item-item similarity to derive which coefficients are responsible for a produced set of user recommendations.

Coefficients are especially helpful for recommenders that include other user features, in addition to the user-item interactions. For example, we can include merchant industry as a user feature in the model. In this case, the coefficient for a given item-user feature allows us to share with the user how their industry shaped the recommendations they see. Showing personalized explanations with recommendations is a great way of establishing trust with users.
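
A hedged sketch of that idea: given a recommended item, inspect which of the user's own signals carry the largest coefficients toward it. The weights and signal names below are hypothetical:

```python
# Hypothetical learned coefficients toward one recommended item ("seo" app).
coefficients = {
    "reviews": 0.2,
    "email": 0.5,
    "industry:apparel": 0.3,  # a user feature, as discussed above
}
user_signals = ["reviews", "email", "industry:apparel"]

# The signal with the largest coefficient explains the recommendation.
top_reason = max(user_signals, key=lambda s: coefficients[s])
print(f"Recommended because you use {top_reason}")  # → email
```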

For example, merchants’ home feeds, shown below, contain personalized insights along with explanations for why those insights are relevant to them.

Shopify Home Feed: showing merchants how their business is doing, along with personalized insights


Beyond explanations, user features are also useful for enriching the model with additional user-specific signals such as shop industry, location, product types, target audience and so on. These can also help us tackle cold-start problems for new users or items, where we don’t yet have much item interaction data. For example, using a user feature enriched model, a new merchant who has not yet interacted with any apps could now also benefit from personalized content in the App Store.


A recommender system must yield high-quality results to be useful. Quality can be defined in various ways depending on the problem at hand. There are several recommender metrics to reflect different notions of quality like precision, diversity, novelty, and serendipity. Precision can be used to measure the relevance of recommended items. However, if we solely optimize for precision, we might appeal to the majority of our users by simply recommending the most popular items to everyone, but would fail to capture subtleties of individual user preferences.
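
For instance, precision@k, one of the metrics mentioned above, measures the fraction of the top-k recommendations that are actually relevant:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# One of the top-2 recommendations ("b") is relevant → 0.5
print(precision_at_k(["a", "b", "c", "d"], {"b", "d"}, 2))  # → 0.5
```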

For example, the Shopify Services Marketplace, shown below, allows merchants to hire third-party experts to help with various aspects of their business.

Shopify Services Marketplace, where merchants can hire third-party experts

To maximize the chance of fruitful collaboration, we want to match merchants with experts who can help with their unique problems. On the other hand, we also want to ensure that our recommendations are diverse and fair to avoid scenarios in which a handful of experts get an overwhelming amount of merchant requests, preventing other experts from getting exposure. This is one example where precision alone isn’t enough to evaluate the quality of our recommender system. Instead, quality metrics need to be carefully selected in order to reflect the key business metric that we hope to optimize.

While recommendations across various areas of Shopify optimize different quality metrics, they’re ultimately all built with the goal of helping our merchants get the most out of our platform. Therefore, when developing a recommender system, we have to identify the metric, or a proxy for it, that allows us to determine whether the system is aligned with this goal.


Having a simple and flexible base model reduces the effort needed for Shopify Data Science team members to extend it into new domains of Shopify, freeing us to spend more time deepening our understanding of the merchant problems we are solving, refining key model elements, and experimenting with ways to extend the capabilities of the base model.

Moreover, having a framework of binary input signals and classification allows us to easily experiment with different models that enrich our recommendations beyond the capabilities of the linear model we presented above.

We applied this approach to provide recommendations to our merchants in a variety of contexts across Shopify. When we initially launched our recommendations through A/B tests, we observed the following results:

  • Merchants receiving personalized app recommendations on the Shopify App Store had a 50% higher app install rate compared to those who didn’t receive recommendations
  • Merchants with a personalized home feed were up to 12% more likely to report that the content of their feed was useful, compared to those whose feeds were ranked by a non-personalized algorithm.
  • Merchants who received personalized matches with experts in the Expert Marketplace had a higher response rate and had overall increased collaboration between merchants and third-party experts.
  • Merchants who received personalized theme recommendations on the Shopify Theme Store, seen below, were over 10% more likely to launch their online store, compared to those receiving non-personalized or no recommendations.

Shopify Theme Store: where merchants can select themes for their online store

This post was originally published on Medium.

This post was edited on Feb 6, 2019

We’re always working on challenging new problems on the Shopify Data team. If you’re passionate about leveraging data to help entrepreneurs, check out our open positions in Data Science and Engineering.

iOS Application Testing Strategies at Shopify

At Shopify, we use a monorepo architecture where multiple app projects coexist in one Git repository. With hundreds of commits per week, the fast pace of evolution demands a commitment to testing at all levels of an app in order to quickly identify and fix regression bugs.

This article presents the ways we test the various components of an iOS application: Models, Views, ViewModels, View Controllers, and Flows. For brevity, we ignore the details of the Continuous Integration infrastructure where these tests are run, but you can learn more from the Building a Dynamic Mobile CI System blog post.

Testing Applications, Like Building a Car

Consider the process of building a reliable car: base components like cylinders and pistons are individually tested to comply with design specifications (Model & View tests). Then these parts are assembled into an engine, which is also tested to ensure the components fit and function well together (View Controller tests). Finally, the major subsystems like the engine, transmission, and cooling systems are connected and the entire car is test-driven by a user (Flow tests).

The complexity and slowness of a test increases as we go from unit to manual tests, so it’s important to choose the right type and amount of tests for each component hierarchy. The image below shows the kind of tests we use for each type of app component; it reads bottom-up, so, for example, a Model is tested with regular unit tests.

Types of tests used for app components

Testing Models

A Model represents a business entity like a Customer, Order, or Cart. As the foundation of all other application constructs, it’s crucial to test that the properties and methods of a model conform to their business rules. The example below shows a unit test for the Customer model where we test the rule that, for a customer with multiple addresses, the billingAddress must be the first default address.

A Word on Extensions

Changing existing APIs in a large codebase is an expensive operation, so we often introduce new functionality as Extensions. For example, the function below enables two String Arrays to be merged without duplicates.

We follow a few conventions. Each test name follows a compact and descriptive format: test<Function><Goal>. Test steps are about 15 lines max; otherwise, the test is broken down into separate cases. Overall, each test is very simple and requires minimal cognitive load to understand what it’s checking.

Testing Views

Developers aim to implement exactly what the designers intend under various circumstances and avoid introducing visual regression bugs. To achieve this, we use Snapshot Testing to record an image of a view; subsequent tests compare the view with the recorded snapshot and fail if it differs.

For example, consider a UITableViewCell for Ping Pong players with the user’s name, country, and rank. What happens when the user has a very long name? Does the name wrap to a second line, truncate, or does it push the rank away? We can record our design decisions as snapshot tests so we are confident the view gracefully handles such edge cases.

UITableViewCell snapshot test

Testing View Models

A ViewModel represents the state of a View component and decouples business models from Views—it’s the state of the UI. So, they store information like the default value of a slider or segmented control and the validation logic of a Customer creation form. The example below shows the CustomerEntryViewModel being tested to ensure its taxExempt property is false by default, and that its state validation function works correctly given an invalid phone number.

Testing View Controllers

The ViewController sits at the top of the component hierarchy. It brings together multiple Views and ViewModels in one cohesive page to accomplish a business use case. So, we check whether the overall view meets the design specification and whether components are disabled or hidden based on Model state. The example below shows a Customer Details ViewController where the Recent orders section is hidden if a customer has no orders and the ‘edit’ button is disabled if the device is offline. To achieve this, we use snapshot tests as follows.

Snapshot testing the ViewController

Testing Workflows

A Workflow uses multiple ViewControllers to achieve a use case. It’s the highest level of functionality from the user’s perspective. Flow tests aim to answer specific user questions like: can I login with valid credentials?, can I reset my password?, and can I checkout items in my cart?

We use UI Automation Tests powered by the XCUITest framework to simulate a user performing actions like entering text and clicking buttons. These tests are used to ensure all user-facing features behave as expected. The process for developing them is as follows.

  1. Identify the core user-facing features of the app—features without which users cannot productively use the app. For example, a user should be able to view their inventory by logging in with valid credentials, and a user should be able to add products to their shopping cart and checkout.
  2. Decompose the feature into steps and note how each step can be automated: button clicks, view controller transitions, error and confirmation alerts. This process helps to identify bottlenecks in the workflow so they can be streamlined.
  3. Write code to automate the steps, then compose these steps to automate the feature test.

The example below shows a UI Test checking that only a user with valid credentials can login to the app. The testLogin() function is the main entry point of the test. It sets up a fresh instance of the app by calling setUpForFreshInstall(), then it calls the login() function which simulates the user actions like entering the email and password then clicking the login button.

Considering Accessibility

One useful side effect of writing UI Automation Tests is that they improve the accessibility of the app, and this is very important for visually impaired users. Unlike Unit Tests, UI Tests don’t assume knowledge of the internal structure of the app, so you select an element to manipulate by specifying its accessibility label or string. These labels are read aloud when users turn on iOS accessibility features on their devices. For more information about the use of accessibility labels in UI Tests, watch this Xcode UI Testing - Live Tutorial Session video.

Manual Testing

Although we aim to automate as many flow tests as possible, the tools available aren’t mature enough to completely eliminate manual testing. Issues like animation glitches and rendering bugs are only discovered through manual testing…some would even argue that as long as applications are built for users, manual user testing is indispensable. However, we are becoming increasingly dependent on UI Automation tests to replace manual tests.


Testing at all levels of the app gives us the confidence to release applications frequently. But each test also adds a maintenance liability. So, testing each part of an app with the right amount and type of test is important. Here are some tips to guide your decision.

  • The speed of executing a test decreases as you go from Unit to Manual tests.
  • The human effort required to execute and maintain a test increases from Unit tests to Manual tests.
  • An app has more subcomponents than major components.
  • Expect to write a lot more Unit tests for subcomponents and fewer, more targeted tests as you move up to UI Automation and Manual tests...a concept known as the Test Pyramid.

Finally, remember that tests are there to ensure your app complies with business requirements, but these requirements will change over time. So, developers must consistently remove tests for features that no longer exist, modify existing tests to comply with new business rules, and add new tests to maintain code coverage.

If you'd like to continue talking about application testing strategies, please find me on Medium at @u.zziah

If you are passionate about iOS development and excellent user experience, the Shopify POS team is hiring a Lead iOS Developer! Have a look at the job posting.

The Unreasonable Effectiveness of Test Retries: An Android Monorepo Case Study

At Shopify, we don't have a QA team; we have a QA culture, which means we rely on automated testing to ensure the quality of our mobile apps. A reliable Continuous Integration (CI) system allows our developers to focus on building a high-quality system, knowing with confidence that if something is wrong, our test suite will catch it. To create this confidence, we have extensive test suites that include integration tests, unit tests, screenshot tests, instrumentation tests, and linting. But every large test suite has an enemy: flakiness.

A flaky test can exhibit both a passing and a failing result with the same code, so we need a resilient system that can recover from those failures. Tests can fail for reasons that aren’t related to the test itself: network or infrastructure problems, bugs in the software that runs the tests, or even cosmic rays.

Last year, we moved our Android apps and libraries to a monorepo and increased the size of our Android team. This meant more people working in the same codebase and more tests executed when a commit merged to master (we only run the entire test suite on the master branch; for other branches, only the tests related to what has changed are run). It’s only logical that the pass rate of our test suites took a hit.

Let’s assume that every test we execute is independent of the others (events like network flakiness affect all tests, but we’re not taking that into account here) and passes 99.95% of the time. We execute pipelines that each contain 100 tests. Given this per-test pass rate, we can estimate that a pipeline will pass 0.9995^100 ≈ 95% of the time. However, the entire test suite is made up of 20 pipelines with the same pass probability, so it will pass 0.95^20 ≈ 35% of the time.

This wasn’t good and we had to improve our CI pass rate.

Developers lose trust in the test infrastructure when CI is red most of the time due to test flakiness or infrastructure issues. They’ll start assuming that every failure is a false positive caused by flakiness. Once this happens, we’ve lost the battle and gaining that developer’s trust back is difficult. So, we decided to tackle this problem in the simplest way: retrying failures.

Retries are a simple, yet powerful mechanism to increase the pass rate of our test suite. When executing tests, we believe in a fail-fast system: the earlier we get feedback, the faster we can move, and that’s our end goal. Using retries may sound counterintuitive, but almost always a slightly slower build is preferable to a user having to manually retry a build because of a flaky failure.

When failing tests are retried once, failing CI because of a single test requires that test to fail twice in a row. Using the same assumptions as before, the chance of that happening is 0.05% · 0.05% = 0.000025% for each test, which translates to a 99.999975% per-test pass rate. Performing the same calculation as before, each pipeline would pass 0.99999975^100 ≈ 99.9975% of the time, and the entire CI suite 0.999975^20 ≈ 99.95% of the time. Simply by retrying failing tests, the theoretical pass rate of our full CI suite increases from 35% to 99.95%.
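
This back-of-envelope math is easy to reproduce (rounding differs slightly from the quoted figures):

```python
# Pass rates for 100 tests per pipeline, 20 pipelines.
p_test = 0.9995                    # per-test pass rate, no retries
pipeline = p_test ** 100           # roughly 95%
suite = pipeline ** 20             # roughly a third of builds pass

p_retry = 1 - (1 - p_test) ** 2    # a test must now fail twice in a row
suite_retry = (p_retry ** 100) ** 20
print(f"{suite:.2%} -> {suite_retry:.4%}")
```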

In each of our builds, many different systems are involved and things can go wrong while setting up the test environment. Docker can fail to load the container, bundler can fail while installing some dependencies, and so can git fetch. All of those failures can be retried. We have identified some of them as retriable failures, which means they can be retried within the same job, so we don’t need to initialize the entire test environment again.

Some other failures aren’t as easy to retry in the same job because of their side effects. These are known as fatal failures, and for them we need to reload the test environment altogether. This is slower than a retriable failure, but it’s definitely faster than waiting for the developer to retry the job manually, or to spend time figuring out why a certain task failed only to realize that the solution was retrying.

Finally, we have test failures. As we have seen, a test can be flaky. They can fail for multiple reasons, and based on our data, screenshot tests are flakier than the rest. If we detect a failure in a test, that single test is retried up to three times.
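
A generic sketch of that retry loop, with hypothetical names: re-run a flaky operation up to a maximum number of attempts, and keep the earlier failures so a pass-after-retry can be surfaced rather than hidden:

```python
def run_with_retries(action, max_attempts=3):
    """Run action(), retrying on failure; return (result, earlier_failures)."""
    failures = []
    for _ in range(max_attempts):
        try:
            return action(), failures
        except Exception as exc:  # in CI this would be a test failure
            failures.append(exc)
    raise failures[-1]            # out of attempts: a real failure

calls = {"n": 0}

def flaky_test():
    calls["n"] += 1
    if calls["n"] < 2:            # fails on the first attempt only
        raise AssertionError("flaky failure")
    return "passed"

result, earlier_failures = run_with_retries(flaky_test)
print(result, len(earlier_failures))  # → passed 1
```

A non-empty `earlier_failures` list is what would trigger the “test passed after a retry” notification described above.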

The message displayed when a test fails and it’s retried.

Retries in general, and test retries in particular, aren’t ideal. They work, but they make CI slower and can hide reliability issues. At the end of the day, we want our developers to have a reliable CI while encouraging them to fix test flakiness where possible. For this reason, we detect all the tests that pass after a retry and notify the developers so the problem doesn’t go unnoticed. We think that a test that passes on a second attempt shouldn’t be treated like a failure, but as a warning that something can be improved. Besides retry mechanisms, these are the tips we recommend to reduce the flakiness of builds:

  • Don't depend on unreliable components in your builds. Try to identify the unreliable components of your system and don’t depend on them if possible. Unfortunately, most of the time this is not possible and we need those unreliable components.
  • Work on making the component more reliable. Try to understand why the component isn’t reliable enough for your use case. If that component is under your control, make changes to increase reliability.
  • Apply caching to invoke the unreliable component less often. We need to interact with external services for different reasons; a common case is downloading dependencies. Instead of downloading them for every build, we can build a cache to reduce our interactions with this external service, thereby gaining resiliency.

These tips are exactly what we did from an infrastructure point of view. When this project started, the pass rate in our Android app pipeline was 31%. After identifying and applying retry mechanisms to the sources of flakiness and adding some caching to the Gradle builds, we managed to increase it to almost 90%.
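
The caching tip can be illustrated with a toy memoization sketch; `fetch_dependency` is a hypothetical stand-in for a slow, unreliable download:

```python
from functools import lru_cache

calls = []  # records every actual "network" hit

@lru_cache(maxsize=None)
def fetch_dependency(name):
    calls.append(name)            # stand-in for a slow external download
    return f"{name}-1.0.0"

# Three builds request the same dependency...
for _ in range(3):
    fetch_dependency("okhttp")

print(len(calls))  # → 1: the external service was hit only once
```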

Pass rate plot from March to September

Something similar happened in our iOS repository. After improving our CI infrastructure, adding the previously discussed retry mechanisms and applying the tips to reduce flakiness, the pass rate grew from 67% to 97%.

It may sound counterintuitive, but thanks to retries we can move faster having slower builds.

We love to talk about mobile tooling. Feel free to reach out to us in the comments if you want to know more or share your solution to this problem.

Intrigued? Shopify is hiring and we’d love to hear from you. Please take a look at our open positions on the Engineering career page

Preparing Shopify for Black Friday and Cyber Monday

Making commerce better for everyone is a challenge we face on a daily basis. For our Production Engineering team, it means ensuring that our 600,000+ merchants have a reliable and scalable platform to support their business needs. We need to be able to support everything our merchants throw at us—including the influx of holiday traffic during Black Friday and Cyber Monday (BFCM). All of this needs to happen without an interruption in service. We’re proud to say that the effort we put in to deploying, scaling, and launching new projects on a daily basis gives our merchants access to a platform with 99.98% uptime.

Black Friday Cyber Monday 2018 by the numbers

To put the impact of this into perspective, Black Friday and Cyber Monday is what we refer to as our World Cup. Each year, our merchants push the boundaries of our platform to handle more traffic and more sales. This year alone, merchants sold over $1.5 billion USD in sales throughout the holiday weekend.

What people may not realize is that Shopify is made up of many different internal services and interaction points with third-party providers, like payment gateways and shipping carriers. The performance and reliability of each of these dependencies can potentially affect our merchants and buyers in different ways. That’s why our Production Engineering team’s preparations for BFCM run the entire gamut.

To increase the chances of success on BFCM Production Engineering run “game days” on our systems and their dependencies. Game days are a form of fault injection where we test our assumptions about the system by degrading its dependencies under controlled conditions. For example, we’ll introduce artificial latency into the code paths that interact with shipping providers to ensure that the system continues working and doing something reasonable. That could be, for instance, falling back to another third party or hard-coded defaults if a third party dependency were to become slow for any reason, or verifying that a particular service responds as expected to a problem with their main datastore.

Besides fault injection work, Production Engineering also run load testing exercises where volumes similar to what we expect during BFCM are created synthetically and sent to the different applications to ensure that the system and its components behave well under the onslaught of requests they’ll serve on BFCM.

At Shopify, we pride ourselves on continuous and fast deploys to deliver features and fixes as fast as we can; however, the rate of change on a system increases the probability of issues that can affect our users. During the ramp-up period for BFCM, we manage the normal cadence of the company by establishing both a feature freeze and a code freeze. The feature freeze starts several weeks before BFCM and means no meaningful changes to user-facing features are deployed, to prevent changes to merchants’ workflows. At that point in the year, changes, even improvements, can have an unacceptable learning curve for merchants that are diligently getting ready for the big event.

A few days before BFCM, and during the event itself, an actual code freeze is in effect, meaning that only critical fixes can be deployed and everything else must remain in stasis. The idea is to reduce the possibility of introducing bugs and unexpected system interactions that could compromise the service during the peak days of the holiday season.

Did all of our preparations work out? With BFCM in the rearview mirror, we can say yes. This BFCM weekend was a record breaker for Shopify. We saw nearly 11,000 orders created per minute and around 100,000 requests per second served for extended periods during the weekend. All in all, most system metrics followed a pattern of 1.8 times what they were in 2017.

The somewhat unsurprising conclusion is that running towards the risk by injecting faults, load testing, and role-playing possible disaster scenarios pays off. Reliability also goes beyond your “own” system: most complex platforms these days have to deal with third parties to provide the best service possible. We have learned to trust our partners, but we also understand that any system can have downtime, and in the end, Shopify is responsible to our merchants and buyers.


Bug Bounty Year in Review 2018


With 2018 coming to a close, we thought it a good opportunity to once again reflect on our Bug Bounty program. At Shopify, our bounty program complements our security strategy and allows us to leverage a community of thousands of researchers who help secure our platform and create a better Shopify user experience. This was the fifth year we operated a bug bounty program, the third on HackerOne and our most successful to date (you can read about last year’s results here). We reduced our time to triage by days, got hackers paid quicker, worked with HackerOne to host the most innovative live hacking event to date and continued contributing disclosed reports for the bug bounty community to learn from.

Our Triage Process

In 2017, our average time to triage was four days. In 2018, we shaved that down to 10 hours, despite largely receiving the same volume of reports. This reduction was driven by our core program commitment to speed. With 14 members on the Application Security team, we're able to dedicate one team member a week to HackerOne triage.

When someone is the dedicated “triager” for the week at Shopify, that becomes their primary responsibility with other projects becoming secondary. Their job is to ensure we quickly review and respond to reports during regular business hours. However, having a dedicated triager doesn't preclude others from watching the queue and picking up a report.

When we receive reports that aren't N/A or Spam, we validate before triaging and open an issue internally since we pay $500 when reports are triaged on HackerOne. We self-assign reports on the HackerOne platform so other team members know the report is being worked on. The actual validation process we use depends on the severity of the issue:

  • Critical: We replicate the behavior and confirm the vulnerability, page the on-call team responsible and triage the report on HackerOne. This means the on-call team will be notified immediately of the bug and Shopify works to address it as soon as possible.
  • High: We replicate the behavior and ping the development team responsible. This is less intrusive than paging but still a priority. Collaboratively, we review the code for the issue to confirm it's new and triage the report on HackerOne.
  • Medium and Low: We’ll either replicate the behavior and review the code, or just review the code, to confirm the issue. Next, we review open issues and pull requests to ensure the bug isn't a known issue. If there are clear security implications, we'll open an issue internally and triage the report on HackerOne. If the security implications aren't clear, we'll err on the side of caution and discuss with the responsible team to get their input about whether we should triage the report on HackerOne.

This approach allows us to quickly act on reports and mitigate critical and high impact reports within hours. Medium and Low reports can take a little longer, especially where the security implications aren't clear. Development teams are responsible for prioritizing fixes for Medium and Low reports within their existing workloads, though we occasionally check in and help out.


Shopify x HackerOne H1-514
H1-514 in Montreal

In October, we hosted our second live hacking event and it was the first hacking event in our office in Montreal, Quebec, H1-514. We welcomed over 40 hackers to our office to test our systems. To build on our program's core principles of responsiveness, transparency and timely payouts, we wanted to do things differently than other HackerOne live hacking events. As such, we worked with HackerOne to do a few firsts for live hacking events:

  • While other events opened submissions the morning of the event, we opened submissions when the target was announced to be able to pay hackers as soon as the event started and avoid a flood of reports
  • We disclosed resolved reports to participants during the event to spark creativity instead of leaving this to the end of the event when hacking was finished
  • We used innovative bonuses to reward creative thinking and hard work from hackers testing systems that are very important to Shopify (e.g. GraphQL, race conditions, oldest bug, regression bonuses, etc.) instead of awarding more money for the number of bugs people found
  • We gave hackers shell access to our infrastructure and asked them to report any bugs they found. While none were reported at the event, the experience and feedback informed a continued Shopify infrastructure bounty program and the Kubernetes product security team's exploration of their own bounty program.


When we signed on to host H1-514, we weren't sure what value we'd get in return since we run an open bounty program with competitive bounties. However, the hackers didn't disappoint and we received over 50 valid vulnerability reports, a few of which were critical. Reflecting on this, the success can be attributed to a few factors:

  • We ship code all the time. Our platform is constantly evolving so there's always something new to test; it's just a matter of knowing how to incentivize the effort for hackers (You can check the Product Updates and Shopify News blogs if you want to see our latest updates).
  • There were new public disclosures affecting software we use. For example, Tavis Ormandy's disclosure of a Ghostscript remote code execution affecting ImageMagick, which was used in a report during the event by hacker Frans Rosen.
  • Using bonuses to incentivize hackers to explore the more complex and challenging areas of the bounty program. Bonuses included GraphQL bugs, race conditions and the oldest bug, to name a few.
  • Accepting submissions early allowed us to keep hackers focused on eligible vulnerability types and avoid them spending time on bugs that wouldn't be rewarded. This helped us manage expectations throughout the two weeks, keep hackers engaged and make sure everyone was using their time effectively.
  • We increased our scope. We wanted to see what hackers could do if we added all of our properties into the scope of the bounty program and whether they'd flock to new applications looking for easier-to-find bugs. However, despite the expanded scope, we still received a good number of reports targeting mature applications from our public program.

H1-514 in Montreal. Photo courtesy of HackerOne

Stats (as of Dec 6, 2018)

2018 was the most successful year to date for our bounty program. Not including the stats from H1-514, we saw our average bounty increase again, this time to $1,790 from $1,100 in 2017. The total amount paid to hackers was also up $90,200 compared to the previous year, to $155,750 with 60% of all resolved reports having received a bounty. We also went from one five-figure bounty awarded in 2017, to five in 2018 marked by the spikes in the following graph.

Bounty Payouts by Date

As mentioned, the team committed to quick communication, recognizing how important it is to our hackers. We pride ourselves on all of our timing metrics being among the best in the category on HackerOne. While our initial response time slipped by 5 hours to 9 hours, our triage time dropped by over 3 days to 10 hours (it was 4 days in 2017). Both our time to bounty and our resolution time also fell: time to bounty to 30 days and resolution to 19 days, each down from about a month.

Response Time by Date

Report Submitted by Date

In 2018 we received 1,010 reports. 58.7% were closed as not applicable, compared to 63.1% in 2017. This was accompanied by an almost one percent increase in resolved reports: 11.3%, up from 10.5% in 2017. The drop in not-applicable reports and rise in informative ones (reports which contain useful information but don't warrant immediate action) is likely the result of the team's commitment to only close bugs as not applicable when the issue reported is in our tables of known issues and ineligible vulnerability types, or lacks evidence of a vulnerability.

Types of Bugs Closed

We also disclosed 24 bugs on our program, one less than the previous year, but we tried to maintain our commitment to requesting disclosure for every bug resolved in our program. We continue to believe it’s extremely important that we build a resource library to enable ethical hackers to grow in our program. We strongly encourage other companies to do the same.

Despite a very successful 2018, we know there are still areas to improve upon to remain competitive. Our total number of resolved reports was down again, 113 compared to 121 despite having added new properties and functionality to our program. We resolved reports from only 62 hackers compared to 71 in 2017. Lastly, we continue to have some low severity reports remain in a triaged state well beyond our target of 1-month resolution. The implications of this are mitigated for hackers since we changed our policy earlier in the year to pay the first $500 of a bounty immediately. Since low severity reports are unlikely to receive an additional bounty, most low-severity reports are paid entirely up-front. HackerOne also made platform changes to award the hackers their reputation when we triage reports versus when we resolve them, as was previously the case.

We're planning new changes, experiments and adding new properties in 2019 so make sure to watch our program for updates.

Happy hacking!

If you're interested in helping to make commerce more secure, visit Shopify on HackerOne to start hacking, or visit our career page to check out our open Trust and Security positions.


How an Intern Released 3 Terabytes Worth of Storage Before BFCM


Hi there! I’m Gurpreet and currently finishing up my second internship at Shopify. I was part of the Products team during both of my internships. The team is responsible for building and maintaining the products area of Shopify admin. As a developer, every day is another opportunity to learn something new. Although I worked on many tasks during my internship, today I will be talking about one particular problem I solved.

The Problem

As part of the Black Friday Cyber Monday (BFCM) preparations, we wanted to make sure our database was resilient enough to smoothly handle increased traffic during flash sales. After completing an analysis of our top SQL queries, we realized that the database was scanning a large number of fixed-size storage units, called InnoDB pages, just to return a single record. We identified the records, historically kept for reporting purposes, that caused this excess scanning. After talking with different teams and making sure that these records were safe to delete, the team decided to write a background job to delete them.

So how did we accomplish this task which could have potentially taken our database down, resulting in downtime for our merchants?

The Background Job

I built the Rails background job using existing libraries that Shopify built to avoid overloading the database while performing different operations, including deletion. A naive approach to deletion is to send either one batch delete query or one delete query per record. MySQL operations aren’t easy to interrupt, and the naive approach would easily overload the database with thousands of operations. The job-iteration library, one of the Shopify libraries I leveraged to overcome this, allows background jobs to run in iterations. The job runs in small chunks and can be paused between iterations to let other, higher-priority jobs run first or to perform certain checks. There are two parts to the job: the enumerator and the iterator. The enumerator fetches records in batches and passes one batch at a time to the iterator. The iterator then fetches the records in the given batch and deletes them. While this made sure that we weren’t deleting a large number of records in a single SQL query, we still needed to make sure we weren’t deleting the batches too fast. Deleting batches too fast causes high replication lag and can affect the availability of the database. Thankfully, we have an existing internal throttling enumerator, which I also leveraged when writing the job.
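The enumerator/iterator split can be sketched in plain Ruby. This is a simplified, framework-free stand-in: the real job uses Shopify's job-iteration library and ActiveRecord, and the class and method names below are invented for illustration.

```ruby
# Toy version of an iterating deletion job: the enumerator yields
# one small batch at a time, and the iterator deletes that batch.
class BatchedDeletionJob
  def initialize(store, batch_size: 3)
    @store = store           # a Hash standing in for the records table
    @batch_size = batch_size # small chunks, never one giant delete
  end

  # Enumerator: yields one batch of record ids at a time.
  def each_batch
    @store.keys.sort.each_slice(@batch_size) { |batch| yield batch }
  end

  # Iterator: deletes the records in a single batch.
  def delete_batch(batch)
    batch.each { |id| @store.delete(id) }
  end

  # Run in small chunks; between iterations a real job could pause,
  # yield to higher-priority jobs, or check database health.
  def perform
    each_batch { |batch| delete_batch(batch) }
  end
end
```

Because work happens in small, resumable steps rather than one monolithic query, the job can be interrupted between batches without leaving the deletion half-done in any single statement.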

After each iteration, the throttling enumerator checks whether we’re starting to overload the database. If so, it automatically pauses the job until the database is back in a healthy state. We ensured our fetch queries used proper indexes and that the enumerator used a proper cursor for batches to avoid timeouts. A cursor can be thought of as a flag on the last record of the previous batch: the next batch is fetched using the flagged record as the pivot, which avoids re-fetching previous records and includes only new ones in the current batch.
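The cursor and throttle ideas can be illustrated with a rough sketch, assuming a simple in-memory table in place of the real indexed SQL queries (conceptually `WHERE id > cursor ORDER BY id LIMIT n`). The function names and the lag probe are invented for the example.

```ruby
# Fetch the next batch after the cursor, and return the new cursor
# (the id of the last record in the batch).
def fetch_batch(records, cursor, limit)
  batch = records.select { |r| r[:id] > cursor }
                 .sort_by { |r| r[:id] }
                 .first(limit)
  next_cursor = batch.empty? ? cursor : batch.last[:id]
  [batch, next_cursor]
end

# Drain all records batch by batch, pausing whenever the database
# (here simulated by an `overloaded` probe) reports it is unhealthy.
def drain(records, limit: 2, overloaded: -> { false })
  cursor = 0
  deleted = []
  loop do
    sleep(0.01) while overloaded.call # throttle: wait for a healthy state
    batch, cursor = fetch_batch(records, cursor, limit)
    break if batch.empty?
    deleted.concat(batch.map { |r| r[:id] })
  end
  deleted
end
```

Because each fetch starts strictly after the cursor, no batch ever re-reads earlier records, which is what keeps the per-iteration queries cheap even as the job progresses through billions of rows.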

The Aftermath

We ran the background job approximately two weeks before BFCM. It was a big deal: not only did it free up three terabytes of storage, resulting in large cost savings, it also made our database more resilient to flash sales.

For example, after the deletion, as seen in the chart below, our database was scanning roughly 3x fewer pages to return a single record. Since the database was reading fewer pages per record, during flash sales it could serve an increased number of requests without getting overloaded by unnecessary page scans. This also meant we were making sure our merchants get the best BFCM experience, with minimal technical issues during flash sales.

Database Scanning After Deletion

Truth be told, I was very nervous watching the background job run, because if anything went wrong, that meant downtime for the merchants, which is the last thing we want, and what a horrible intern experience that would be. At the peak, we were deleting approximately six million records a minute. The Shopify libraries I leveraged helped make deleting over 🔥5 billion records🔥 look like a piece of cake 🎂.

5 billion Records Deleted

What I Learned

I learned so much from this project. I got vital experience with open source projects when using Shopify’s job-iteration library. I also did independent research to better understand MySQL indexes and how cursors work. For example, I didn’t know how composite index prefixes worked. MySQL will use the longest leftmost prefix of a composite index that matches predicates in the WHERE clause to evaluate the query. Suppose we have an index on (A,B,C). A query with predicates (A,C) in the WHERE clause will only use the key A from the index, but a query with predicates (A,B) will use the keys A and B. I also learned how to use SQL EXPLAIN to analyze SQL queries. It shows which indexes the database considered using, which index it ended up using, roughly how many rows it expects to examine, and a lot of other useful information. Apart from improving my technical skills, working on this project made me realize the importance of gathering as much context as possible before even attempting to solve a problem. My mentor helped me with cross-team communication. Overall, context gathering allowed me to identify possible complications ahead of time and make sure the background job ran smoothly.

Can you see yourself as one of our interns? Applications for the Summer 2019 term will be available at shopify.com/careers/interns from January 7, 2019. The deadline for applications is Monday, January 21, 2019, at 9:00 AM EST!


Director of Engineering, Lawrence Mandel Talks Road to Leadership, Growth, and Finding Balance.


Lawrence Mandel is a Director of Production Engineering leading Shopify’s Developer Acceleration team and has been at Shopify for over a year. He previously worked at IBM and Mozilla, where he started as a software developer before transitioning into leadership roles. Through all his work experience, he’s learned the meaning of time management and how to prioritize the most important things in his life: his family, health, and work.

