Implementing Equality in Ruby

Ruby is one of the few programming languages that get equality right. I often play around with other languages, but keep coming back to Ruby. This is largely because Ruby’s implementation of equality is so nice.

Nonetheless, equality in Ruby isn't straightforward. There are #==, #eql?, #equal?, #===, and more. Even if you’re familiar with how to use them, implementing them can be a whole other story.

Let's walk through all forms of equality in Ruby and how to implement them.

Why Properly Implementing Equality Matters

We check whether objects are equal all the time. Sometimes we do this explicitly, sometimes implicitly. Here are some examples:

  • Do these two Employees work in the same Team? Or, in code: denis.team == someone.team.
  • Is the given DiscountCode valid for this particular Product? Or, in code: product.discount_codes.include?(given_discount_code).
  • Who are the (distinct) managers for this given group of employees? Or, in code: employees.map(&:manager).uniq.

A good implementation of equality is predictable; it aligns with our understanding of equality.

An incorrect implementation of equality, on the other hand, conflicts with what we commonly assume to be true. Here is an example of what happens with such an incorrect implementation:
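Assuming a minimal Book class (the details here are illustrative):

```ruby
class Book
  def initialize(title)
    @title = title
  end
end

geb = Book.new("Gödel, Escher, Bach")
geb_also = Book.new("Gödel, Escher, Bach")

geb == geb_also # => false, even though both represent the same book
```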

The geb and geb_also objects should definitely be equal. The fact that the code says they’re not is bound to cause bugs down the line. Luckily, we can implement equality ourselves and avoid this class of bugs.

No one-size-fits-all solution exists for an equality implementation. However, there are two kinds of objects where we do have a general pattern for implementing equality: entities and value objects. These two terms come from domain-driven design (DDD), but they’re relevant even if you’re not using DDD. Let’s take a closer look.

Entities

Entities are objects that have an explicit identity attribute. Often, entities are stored in some database and have a unique id attribute corresponding to a unique id table column. The following Employee example class is such an entity:
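A sketch (the keyword-argument constructor is illustrative):

```ruby
class Employee
  attr_reader :id, :name

  def initialize(id:, name:)
    @id = id
    @name = name
  end
end
```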

Two entities are equal when their IDs are equal. All other attributes are ignored. After all, an employee’s name might change, but that does not change their identity. Imagine getting married, changing your name, and not getting paid anymore because HR has no clue who you are anymore!

ActiveRecord, the ORM that is part of Ruby on Rails, calls entities "models" instead, but they’re the same concept. These model objects automatically have an ID. In fact, ActiveRecord models already implement equality correctly out of the box!

Value Objects

Value objects are objects without an explicit identity. Instead, their value as a whole constitutes identity. Consider this Point class:
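```ruby
class Point
  attr_reader :x, :y

  # Constructor shape is illustrative.
  def initialize(x:, y:)
    @x = x
    @y = y
  end
end
```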

Two Points will be equal if their x and y values are equal. The x and y values constitute the identity of the point.

In Ruby, the basic value object types are numbers (both integers and floating-point numbers), characters, booleans, and nil. For these basic types, equality works out of the box:
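```ruby
1 == 1       # => true
2.5 == 2.5   # => true
true == true # => true
nil == nil   # => true
```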

Arrays of value objects are in themselves also value objects. Equality for arrays of value objects works out of the box—for example, [17, true] == [17, true]. This might seem obvious, but this isn’t true in all programming languages.

Other examples of value objects are timestamps, date ranges, time intervals, colors, 3D coordinates, and money objects. These are built from other value objects; for example, a money object consists of a fixed-decimal number and a currency code string.

Basic Equality (Double Equals)

Ruby has the == and != operators for checking whether two objects are equal or not:
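```ruby
"apple" == "apple" # => true
"apple" == "pear"  # => false
"apple" != "pear"  # => true
```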

Ruby’s built-in types all have a sensible implementation of ==. Some frameworks and libraries provide custom types, which will have a sensible implementation of ==, too. Here is an example with ActiveRecord:
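For instance, assuming an ActiveRecord-backed Product model:

```ruby
product = Product.find(12)
same_product = Product.find(12)

product == same_product # => true: same model class, same id
```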

For custom classes, the == operator returns true if and only if the two objects are the same instance. Ruby does this by checking whether the internal object IDs are equal. These internal object IDs are accessible using #__id__. Effectively, gizmo == thing is the same as gizmo.__id__ == thing.__id__.

This behavior is often not a good default, however. To illustrate this, consider the Point class from earlier.

The == operator will return true only when an instance is compared with itself:
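```ruby
point = Point.new(x: 1, y: 2)
other_point = Point.new(x: 1, y: 2)

point == point       # => true
point == other_point # => false, despite identical attributes
```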

After all, two points should be equal if (and only if) their x and y values are equal. The default behavior is undesirable both for value objects (such as Point) and for entities (such as the Employee class mentioned earlier).

The desired behavior for value objects and entities is as follows:

Figure: the desired equality checks for value objects (a), entities (b), and Ruby’s default (c), as described below.
  • For value objects (a), we’d like to check whether all attributes are equal.
  • For entities (b), we’d like to check whether the explicit ID attributes are equal.
  • By default (c), Ruby checks whether the internal object IDs are equal.

Instances of Point are value objects. With the above in mind, a good implementation of == for Point would look as follows:
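```ruby
class Point
  def ==(other)
    self.class == other.class &&
      x == other.x &&
      y == other.y
  end
end
```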

This implementation checks all attributes as well as the class of both objects. By checking the class, comparing a Point instance with an object of a different class returns false rather than raising an exception.

Checking equality on Point objects now works as intended:
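```ruby
point = Point.new(x: 1, y: 2)

point == Point.new(x: 1, y: 2) # => true
point == Point.new(x: 3, y: 4) # => false
point == "not a point"         # => false, rather than an exception
```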

The != operator works too:
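```ruby
Point.new(x: 1, y: 2) != Point.new(x: 3, y: 4) # => true
Point.new(x: 1, y: 2) != Point.new(x: 1, y: 2) # => false
```

Ruby derives != from == automatically, so no extra code is needed.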

A correct implementation of equality has three properties: reflexivity, symmetry, and transitivity.

Figure: the three properties of a correct equality implementation: reflexivity (a), symmetry (b), and transitivity (c), described below.
  • Reflexivity (a): An object is equal to itself: a == a
  • Symmetry (b): If a == b, then b == a
  • Transitivity (c): If a == b and b == c, then a == c

These properties embody a common understanding of what equality means. Ruby won’t check these properties for you, so you’ll have to be vigilant to ensure you don’t break these properties when implementing equality yourself.

IEEE 754 and violations of reflexivity

It seems natural that something would be equal to itself, but there is an exception. IEEE 754 defines NaN (Not a Number) as a value resulting from an undefined floating-point operation, such as dividing 0 by 0. NaN, by definition, is not equal to itself. You can see this for yourself:
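```ruby
nan = 0.0 / 0.0 # => NaN
nan == nan      # => false
```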

This means that == in Ruby is not universally reflexive. Luckily, exceptions to reflexivity are exceedingly rare; this is the only exception I am aware of.

Basic Equality for Value Objects

The Point class is an example of a value object. The identity of a value object, and thereby its equality, is based on all its attributes. That is exactly what the earlier Point#== implementation checks.

Basic Equality for Entities

Entities are objects with an explicit identity attribute, commonly @id. Unlike value objects, an entity is equal to another entity if and only if their explicit identities are equal.

Entities are uniquely identifiable objects. Typically, any database record with an id column corresponds to an entity. Recall the Employee entity class introduced earlier.

Other forms of ID are possible too. For example, books have an ISBN, and recordings have an ISRC. But if you have a library with multiple copies of the same book, then ISBN won’t uniquely identify your books anymore.

For entities, the == operator is more involved to implement than for value objects:
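```ruby
class Employee
  def ==(other)
    super || (
      self.class == other.class &&
        !id.nil? &&
        id == other.id
    )
  end
end
```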

This code does the following:

  • The super call invokes the default implementation of equality: Object#==. On Object, the #== method returns true if and only if the two objects are the same instance. This super call, therefore, ensures that the reflexivity property always holds.
  • As with Point, the implementation Employee#== checks class. This way, an Employee instance can be checked for equality against objects of other classes, and this will always return false.
  • If @id is nil, the entity is considered not equal to any other entity. This is useful for newly-created entities which have not been persisted yet.
  • Lastly, this implementation checks whether the ID is the same as the ID of the other entity. If so, the two entities are equal.

Checking equality on entities now works as intended:
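```ruby
employee = Employee.new(id: 5, name: "Denis")
employee_also = Employee.new(id: 5, name: "Denis, after a name change")

employee == employee_also # => true: same id, despite different names

unsaved = Employee.new(id: nil, name: "Sam")
unsaved == Employee.new(id: nil, name: "Sam") # => false: nil ids never match
unsaved == unsaved                            # => true, thanks to super
```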



Blog post of Theseus

Implementing equality on entity objects isn’t always straightforward. An object might have an id attribute that doesn’t quite align with the object’s conceptual identity.

Take a BlogPost class, for example, with id, title, and body attributes. Imagine creating a BlogPost, then halfway through writing the body for it, scratching everything and starting over with a new title and a new body. The id of that BlogPost will still be the same, but is it still the same blog post?

If I follow a Twitter account that later gets hacked and turned into a cryptocurrency spambot, is it still the same Twitter account?

These questions don’t have a proper answer. That’s not surprising, as this is essentially the Ship of Theseus thought experiment. Luckily, in the world of computers, the generally accepted answer seems to be yes: if two entities have the same id, then the entities are equal as well.

Basic Equality with Type Coercion

Typically, an object is not equal to an object of a different class. However, this isn’t always the case. Consider integers and floating-point numbers:
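```ruby
float_two = 2.0
integer_two = 2

float_two.class          # => Float
integer_two.class        # => Integer
float_two == integer_two # => true
```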

Here, float_two is an instance of Float, and integer_two is an instance of Integer. They are equal: float_two == integer_two is true, despite different classes. Instances of Integer and Float are interchangeable when it comes to equality.

As a second example, consider this Path class:
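A sketch, with an illustrative API:

```ruby
class Path
  def initialize(value)
    @value = value
  end

  # Append a path component, returning a new Path.
  def /(component)
    self.class.new(File.join(@value, component))
  end

  def to_str
    @value.dup
  end
end
```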

This Path class provides an API for creating paths:
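```ruby
path = Path.new("/usr") / "bin" / "ruby"
path.to_str # => "/usr/bin/ruby"
```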

The Path class is a value object, and implementing #== could be done just as with other value objects:
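```ruby
class Path
  def ==(other)
    self.class == other.class &&
      to_str == other.to_str
  end
end
```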

However, the Path class is special because it represents a value that could be considered a string. The == operator will return false when checking equality with anything that isn’t a Path:
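```ruby
path = Path.new("/usr") / "bin" / "ruby"

path == Path.new("/usr/bin/ruby") # => true
path == "/usr/bin/ruby"           # => false: a String is not a Path
```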

It can be beneficial for path == "/usr/bin/ruby" to be true rather than false. To make this happen, the == operator needs to be implemented differently:
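```ruby
class Path
  def ==(other)
    other.respond_to?(:to_str) &&
      to_str == other.to_str
  end
end
```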

This implementation of == coerces both objects to Strings, and then checks whether they are equal. Checking equality of a Path now works:
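```ruby
path == "/usr/bin/ruby"           # => true
path == Path.new("/usr/bin/ruby") # => true
path == 1234                      # => false
```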

This class implements #to_str, rather than #to_s. These methods both return strings, but by convention, the to_str method is only implemented on types that are interchangeable with strings.

The Path class is such a type. By implementing Path#to_str, the implementation states that this class behaves like a String. For example, it’s now possible to pass a Path (rather than a String) to File.open, and it will work because File.open accepts anything that responds to #to_str.

String#== also uses the to_str method. Because of this, the == operator is symmetric:
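```ruby
path == "/usr/bin/ruby" # => true
"/usr/bin/ruby" == path # => true, because String#== cooperates with #to_str
```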

Strict Equality

Ruby provides #equal? to check whether two objects are the same instance:
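```ruby
a = "hello"
b = "hello"

a.equal?(a) # => true
a.equal?(b) # => false: same content, distinct instances
a == b      # => true
```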

Here, we end up with two String instances with the same content. Because they are distinct instances, #equal? returns false, and because their content is the same, #== returns true.

Do not implement #equal? in your own classes. It isn’t meant to be overridden. It’ll all end in tears.

Earlier in this post, I mentioned that #== has the property of reflexivity: an object is always equal to itself. Here is a related property for #equal?:

Property: Given objects a and b. If a.equal?(b), then a == b.

Ruby won't automatically validate this property for your code. It’s up to you to ensure that this property holds when you implement the equality methods.
For example, recall the implementation of Employee#== from earlier in this article:
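```ruby
def ==(other)
  super || (
    self.class == other.class &&
      !id.nil? &&
      id == other.id
  )
end
```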

The call to super on the first line makes this implementation of #== reflexive. This super invokes the default implementation of #==, which delegates to #equal?. Therefore, I could have used #equal? rather than super:
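```ruby
def ==(other)
  equal?(other) || (
    self.class == other.class &&
      !id.nil? &&
      id == other.id
  )
end
```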

I prefer using super, though this is likely a matter of taste.

Hash Equality

In Ruby, any object can be used as a key in a Hash. Strings, symbols, and numbers are commonly used as Hash keys, but instances of your own classes can function as Hash keys too—provided that you implement both #eql? and #hash.

The #eql? Method

The #eql? method behaves similarly to #==:
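```ruby
"hello".eql?("hello") # => true
17.eql?(17)           # => true
17.eql?(18)           # => false
```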

However, #eql?, unlike #==, does not perform type coercion:
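```ruby
17 == 17.0    # => true: type coercion
17.eql?(17.0) # => false: no type coercion
```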

If #== doesn’t perform type coercion, the implementations of #eql? and #== will be identical. Rather than copy-pasting, however, we’ll put the implementation in #eql?, and let #== delegate to #eql?:
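```ruby
class Point
  def eql?(other)
    self.class == other.class &&
      x == other.x &&
      y == other.y
  end

  def ==(other)
    eql?(other)
  end
end
```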

I made the deliberate decision to put the implementation in #eql? and let #== delegate to it, rather than the other way around. If we were to let #eql? delegate to #==, there’s an increased risk that someone will update #== and inadvertently break the properties of #eql? (mentioned below) in the process.

For the Path value object, whose #== method does perform type coercion, the implementation of #eql? will differ from the implementation of #==:
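```ruby
class Path
  def eql?(other)
    self.class == other.class &&
      to_str == other.to_str
  end

  def ==(other)
    other.respond_to?(:to_str) &&
      to_str == other.to_str
  end
end
```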

Here, #== does not delegate to #eql?, nor the other way around.

A correct implementation of #eql? has the following two properties:

  • Property: Given objects a and b. If a.eql?(b), then a == b.
  • Property: Given objects a and b. If a.equal?(b), then a.eql?(b).

These two properties are not explicitly called out in the Ruby documentation. However, to the best of my knowledge, all implementations of #eql? and #== respect these properties.

Ruby will not automatically validate that these properties hold in your code. It’s up to you to ensure that these properties aren’t violated.

The #hash Method

For an object to be usable as a key in a Hash, it needs to implement not only #eql?, but also #hash. This #hash method will return an integer, the hash code, that respects the following property:

Property: Given objects a and b. If a.eql?(b), then a.hash == b.hash.

Typically, the implementation of #hash creates an array of all attributes that constitute identity and returns the hash of that array. For example, here is Point#hash:
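```ruby
class Point
  def hash
    # Including the class keeps Point hashes distinct from
    # other value objects with the same attribute values.
    [self.class, x, y].hash
  end
end
```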

For Path, the implementation of #hash will look similar:
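```ruby
class Path
  def hash
    [self.class, to_str].hash
  end
end
```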

For the Employee class, which is an entity rather than a value object, the implementation of #hash will use the class and the @id:
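```ruby
class Employee
  def hash
    [self.class, @id].hash
  end
end
```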

If two objects are not equal, the hash code should ideally be different, too. This isn’t mandatory, however. It’s okay for two non-equal objects to have the same hash code. Ruby will use #eql? to tell objects with identical hash codes apart.

Avoid XOR for Calculating Hash Codes

A popular but problematic approach for implementing #hash uses XOR (the ^ operator). Such an implementation would calculate the hash codes of each individual attribute, and combine these hash codes with XOR. For example:
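```ruby
class Point
  # Problematic: XOR is commutative, so Point(1, 2) and Point(2, 1)
  # end up with the same hash code.
  def hash
    x.hash ^ y.hash
  end
end
```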

With such an implementation, the chance of a hash code collision, which means that multiple objects have the same hash code, is higher than with an implementation that delegates to Array#hash. Hash code collisions will degrade performance and could potentially pose a denial-of-service security risk.

A better way, though still flawed, is to multiply the components of the hash code by unique prime numbers before combining them:
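```ruby
class Point
  def hash
    # Distinct prime multipliers break the symmetry, so swapped
    # coordinates no longer collide. The primes chosen here are
    # illustrative.
    (x.hash * 3) ^ (y.hash * 5)
  end
end
```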

Such an implementation has additional performance overhead due to the extra multiplications. It also requires mental effort to ensure the implementation is, and remains, correct.

An even better way of implementing #hash is the one I’ve laid out before—making use of Array#hash:
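```ruby
class Point
  def hash
    [self.class, x, y].hash
  end
end
```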

An implementation that uses Array#hash is simple, performs quite well, and produces hash codes with the lowest chance of collisions. It’s the best approach to implementing #hash.

Putting it Together

With both #eql? and #hash in place, the Point, Path, and Employee objects can be used as hash keys:
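```ruby
# Keys and values here are illustrative.
labels = {
  Point.new(x: 1, y: 2) => "home",
  Point.new(x: 3, y: 4) => "work",
}

labels[Point.new(x: 1, y: 2)] # => "home"
```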

Here, we use a Hash instance to keep track of a collection of Points. We can also use a Set for this, which uses a Hash under the hood, but provides a nicer API:
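```ruby
require "set"

points = Set.new
points << Point.new(x: 1, y: 2)

points.include?(Point.new(x: 1, y: 2)) # => true
```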

Objects used in Sets need to have an implementation of both #eql? and #hash, just like objects used as hash keys.

Objects that perform type coercion, such as Path, can also be used as hash keys, and thus also in sets:
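```ruby
paths = Set.new
paths << (Path.new("/usr") / "bin" / "ruby")

paths.include?(Path.new("/usr/bin/ruby")) # => true
```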

We now have an implementation of equality that works for all kinds of objects.

Mutability, Nemesis of Equality

So far, the examples for value objects have assumed that these value objects are immutable. This is with good reason: mutable value objects are far harder to deal with.

To illustrate this, consider a Point instance used as a hash key:
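Assume for a moment that Point has writers for its attributes (attr_accessor rather than attr_reader):

```ruby
point = Point.new(x: 1, y: 2)
collection = { point => "some value" }

collection[point] # => "some value"
```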

The problem arises when changing attributes of this point:
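```ruby
point.x = 3 # requires the attr_accessor assumed above

collection[point]      # => nil: the stored hash code is now stale
collection.key?(point) # => false
```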

Because the hash code is based on the attributes, and an attribute has changed, the hash code is no longer the same. As a result, collection no longer seems to contain the point. Uh oh!

There are no good ways to solve this problem except for making value objects immutable.

This isn’t a problem with entities. This is because the #eql? and #hash methods of an entity are solely based on its explicit identity—not its attributes.

So far, we’ve covered #==, #eql?, and #hash. These three methods are sufficient for a correct implementation of equality. However, we can go further to improve that sweet Ruby developer experience and implement #===.

Case Equality (Triple Equals)

The #=== operator, also called the case equality operator, isn’t really an equality operator at all. Rather, it’s better to think of it as a membership testing operator. Consider the following:
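```ruby
(1..10) === 5  # => true
(1..10) === 50 # => false
```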

Here, Range#=== checks whether a range covers a certain element. It’s also common to use case expressions to achieve the same:
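```ruby
case 5
when 1..10
  puts "in range"
else
  puts "out of range"
end
# prints "in range"
```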

This is also where case equality gets its name: triple-equals is called case equality because case expressions use it.

You never strictly need case: any case expression can be rewritten using if and ===. In general, though, case expressions tend to look cleaner. Compare:
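```ruby
number = 7

# With if and ===:
if (1..5) === number
  "low"
elsif (6..10) === number
  "high"
end

# With case:
case number
when 1..5  then "low"
when 6..10 then "high"
end
```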

The examples above all use Range#===, to check whether the range covers a certain number. Another commonly used implementation is Class#===, which checks whether an object is an instance of a class:
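```ruby
String === "hello" # => true
String === 5       # => false
Numeric === 5      # => true
```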

I’m rather fond of the #grep method, which uses #=== to select matching elements from an array. It can be shorter and sweeter than using #select:
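```ruby
mixed = ["apple", 3, "pear", 7.5]

mixed.grep(String)                # => ["apple", "pear"]
mixed.select { |e| String === e } # => ["apple", "pear"]
```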

Regular expressions also implement #===. You can use it to check whether a string matches a regular expression:
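```ruby
/[a-z]/ === "+491573abcde" # => true
/[a-z]/ === "+49157312345" # => false
```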

It helps to think of a regular expression as the (infinite) collection of all strings that can be produced by it. The set of all strings produced by /[a-z]/ includes the example string "+491573abcde". Similarly, you can think of a Class as the (infinite) collection of all its instances, and a Range as the collection of all elements in that range. This way of thinking clarifies that #=== really is a membership testing operator.

An example of a class that could implement #=== is a PathPattern class:
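A sketch of such a class:

```ruby
class PathPattern
  def initialize(pattern)
    @pattern = pattern
  end

  def ===(other)
    File.fnmatch(@pattern, other)
  end
end
```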

An example instance is PathPattern.new("/bin/*"), which matches anything directly under the /bin directory, such as /bin/ruby, but not /var/log.

The implementation of PathPattern#=== uses Ruby’s built-in File.fnmatch to check whether the pattern string matches. Here is an example of it in use:
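```ruby
pattern = PathPattern.new("/bin/*")

pattern === "/bin/ruby" # => true
pattern === "/var/log"  # => false
```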

Worth noting is that File.fnmatch calls #to_str on its arguments. This way, #=== automatically works on other string-like objects as well, such as Path instances:
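```ruby
pattern === (Path.new("/bin") / "ruby") # => true
```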

The PathPattern class implements #===, and therefore PathPattern instances work with case/when, too:
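```ruby
case Path.new("/bin") / "ruby"
when PathPattern.new("/bin/*") then "a binary"
when PathPattern.new("/var/log/*") then "a log file"
else "something else"
end
# => "a binary"
```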

Ordered Comparison

For some objects, it’s useful not only to check whether two objects are the same, but how they are ordered. Are they larger? Smaller? Consider this Score class, which models the scoring system of my university in Ghent, Belgium.
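A sketch (the 20-point scale and passing threshold are illustrative):

```ruby
class Score
  attr_reader :value

  def initialize(value)
    @value = value
  end

  def passing?
    @value >= 10
  end
end
```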

(I was a terrible student. I’m not sure if this was really how the scoring even worked — but as an example, it will do just fine.)

In any case, we benefit from having such a Score class. We can encode relevant logic there, such as determining the grade and checking whether or not a score is passing. For example, it might be useful to get the lowest and highest score out of a list:
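```ruby
scores = [Score.new(6), Score.new(12), Score.new(17)]

scores.min # should be the Score with value 6
scores.max # should be the Score with value 17
```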

However, as it stands right now, the expressions scores.min and scores.max will result in an error: comparison of Score with Score failed (ArgumentError). We haven’t told Ruby how to compare two Score objects. We can do so by implementing Score#<=>:
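```ruby
class Score
  def <=>(other)
    return nil unless other.is_a?(Score)

    value <=> other.value
  end
end
```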

An implementation of #<=> returns four possible values:

  • It returns 0 when the two objects are equal.
  • It returns -1 when self is less than other.
  • It returns 1 when self is greater than other.
  • It returns nil when the two objects cannot be compared.

The #<=> and #== operators are connected:

  • Property: Given objects a and b. If (a <=> b) == 0, then a == b.
  • Property: Given objects a and b. If (a <=> b) != 0, then a != b.

As before, it’s up to you to ensure that these properties hold when implementing #== and #<=>. Ruby won’t check this for you.

For simplicity, I’ve left out the implementation Score#== in the Score example above. It’d certainly be good to have that, though.

In the case of Score#<=>, we bail out if other is not a score, and otherwise, we call #<=> on the two values. We can check that this works: the expression Score.new(6) <=> Score.new(12) evaluates to -1, which is correct because a score of 6 is lower than a score of 12. (Did you know that the Belgian high school system used to have a scoring system where 1 was the highest and 10 was the lowest? Imagine the confusion!)

With Score#<=> in place, scores.max now returns the maximum score. Other methods such as #min, #minmax, and #sort work as well.

However, we can’t yet use operators like <. The expression scores[0] < scores[1], for example, will raise an undefined method error: undefined method `<' for #<Score:0x00112233 @value=6>. We can solve that by including the Comparable mixin:
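```ruby
class Score
  include Comparable

  def <=>(other)
    return nil unless other.is_a?(Score)

    value <=> other.value
  end
end
```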

By including Comparable, the Score class automatically gains the <, <=, >, and >= operators, which all call <=> internally. The expression scores[0] < scores[1] now evaluates to a boolean, as expected.

The Comparable mixin also provides other useful methods such as #between? and #clamp.

Wrapping Up

We talked about the following topics:

  • the #== operator, used for basic equality, with optional type coercion
  • #equal?, which checks whether two objects are the same instance
  • #eql? and #hash, which are used for testing whether an object is a key in a hash
  • #===, which isn’t quite an equality operator, but rather an “is kind of” or “is member of” operator
  • #<=> for ordered comparison, along with the Comparable module, which provides operators such as < and >=

You now know all you need to know about implementing equality in Ruby. For more information, the Ruby documentation is a good place to find out more: see in particular the docs for Object#==, Object#eql?, Object#hash, and the Comparable module.

Denis is a Senior Software Engineer at Shopify. He has made it a habit of thanking ATMs when they give him money, thereby singlehandedly staving off the inevitable robot uprising.




Lessons Learned From Running Apache Airflow at Scale

By Megan Parker and Sam Wheating

Apache Airflow is an orchestration platform that enables development, scheduling and monitoring of workflows. At Shopify, we’ve been running Airflow in production for over two years for a variety of workflows, including data extractions, machine learning model training, Apache Iceberg table maintenance, and DBT-powered data modeling. At the time of writing, we are currently running Airflow 2.2 on Kubernetes, using the Celery executor and MySQL 8.

Figure: Shopify’s Airflow architecture.

Shopify’s usage of Airflow has scaled dramatically over the past two years. In our largest environment, we run over 10,000 DAGs representing a large variety of workloads. This environment averages over 400 tasks running at a given moment and over 150,000 runs executed per day. As adoption increases within Shopify, the load incurred on our Airflow deployments will only increase. As a result of this rapid growth, we have encountered a few challenges, including slow file access, insufficient control over DAG (directed acyclic graph) capabilities, irregular levels of traffic, and resource contention between workloads, to name a few.

Below we’ll share some of the lessons we learned and solutions we built in order to run Airflow at scale.

1. File Access Can Be Slow When Using Cloud Storage

Fast file access is critical to the performance and integrity of an Airflow environment. A well-defined strategy for file access ensures that the scheduler can process DAG files quickly and keep your jobs up-to-date.

Airflow keeps its internal representation of its workflows up-to-date by repeatedly scanning and reparsing all the files in the configured DAG directory. These files must be scanned often in order to maintain consistency between the on-disk source of truth for each workload and its in-database representation. This means the contents of the DAG directory must be consistent across all schedulers and workers in a single environment (Airflow suggests a few ways of achieving this).

At Shopify, we use Google Cloud Storage (GCS) for the storage of DAGs. Our initial deployment of Airflow utilized GCSFuse to maintain a consistent set of files across all workers and schedulers in a single Airflow environment. However, at scale this proved to be a bottleneck on performance as every file read incurred a request to GCS. The volume of reads was especially high because every pod in the environment had to mount the bucket separately.

After some experimentation, we found that we could vastly improve performance across our Airflow environments by running an NFS (network file system) server within the Kubernetes cluster. We then mounted this NFS server as a read-write-many volume into the worker and scheduler pods. We wrote a custom script which synchronizes the state of this volume with GCS, so that users only have to interact with GCS for uploading or managing DAGs. This script runs in a separate pod within the same cluster. This also allows us to conditionally sync only a subset of the DAGs from a given bucket, or even sync DAGs from multiple buckets into a single file system based on the environment’s configuration (more on this later).

Altogether this provides us with fast file access as well as a stable, external source of truth, while maintaining our ability to quickly add or modify DAG files within Airflow. Additionally, we can use Google Cloud Platform’s IAM (identity and access management) capabilities to control which users are able to upload files to a given environment. For example, we allow users to upload DAGs directly to the staging environment but limit production environment uploads to our continuous deployment processes.

Another factor to consider when ensuring fast file access when running Airflow at scale is your file processing performance. Airflow is highly configurable and offers several ways to tune the background file processing (such as the sort mode, the parallelism, and the timeout). This allows you to optimize your environments for interactive DAG development or scheduler performance depending on the requirements.

2. Increasing Volumes Of Metadata Can Degrade Airflow Operations

In a normal-sized Airflow deployment, performance degradation due to metadata volume wouldn’t be an issue, at least within the first years of continuous operation.

However, at scale the metadata starts to accumulate pretty fast. After a while this can start to incur additional load on the database. This is noticeable in the loading times of the Web UI and even more so during Airflow upgrades, during which migrations can take hours.

After some trial and error, we settled on a metadata retention policy of 28 days, and implemented a simple DAG which uses ORM (object–relational mapping) queries within a PythonOperator to delete rows from any tables containing historical data (DagRuns, TaskInstances, Logs, TaskRetries, etc). We settled on 28 days as this gives us sufficient history for managing incidents and tracking historical job performance, while keeping the volume of data in the database at a reasonable level.
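A minimal sketch of such a cleanup DAG; the table list, column choices, and schedule are illustrative assumptions rather than our exact implementation:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagRun, Log, TaskInstance
from airflow.operators.python import PythonOperator
from airflow.utils.session import provide_session

RETENTION = timedelta(days=28)

@provide_session
def delete_old_metadata(session=None):
    cutoff = datetime.utcnow() - RETENTION
    # Each pair is (model, timestamp column used to judge row age).
    for model, column in [
        (DagRun, DagRun.execution_date),
        (TaskInstance, TaskInstance.start_date),
        (Log, Log.dttm),
    ]:
        session.query(model).filter(column < cutoff).delete(
            synchronize_session=False
        )

with DAG(
    dag_id="airflow_metadata_cleanup",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="delete_old_metadata",
        python_callable=delete_old_metadata,
    )
```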

Unfortunately, this means that features of Airflow which rely on durable job history (for example, long-running backfills) aren’t supported in our environment. This wasn’t a problem for us, but it may cause issues depending on your retention period and usage of Airflow.

As an alternative approach to a custom DAG, Airflow has recently added support for a db clean command which can be used to remove old metadata. This command is available in Airflow version 2.3.

3. DAGs Can Be Hard To Associate With Users And Teams

When running Airflow in a multi-tenant setting (and especially at a large organization), it’s important to be able to trace a DAG back to an individual or team. Why? Because if a job is failing, throwing errors, or interfering with other workloads, administrators can quickly reach out to the appropriate users.

If all of the DAGs were deployed directly from one repository, we could simply use git blame to track down the job owner. However, since we allow users to deploy workloads from their own projects (and even dynamically generate jobs at deploy-time), this becomes more difficult.

In order to easily trace the origin of DAGs, we introduced a registry of Airflow namespaces, which we refer to as an Airflow environment’s manifest file.

The manifest file is a YAML file where users must register a namespace for their DAGs. In this file they include information about the jobs’ owners and source GitHub repository (or even source GCS bucket), as well as define some basic restrictions for their DAGs. We maintain a separate manifest per environment and upload it to GCS alongside the DAGs.
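An illustrative manifest entry; the field names are assumptions, not our exact schema:

```yaml
namespaces:
  - name: marketing_analytics
    owners:
      team: Marketing Data
      slack_channel: "#marketing-data"
    source_repository: github.com/example-org/marketing-dags
    constraints:
      celery_queue: marketing
      pools:
        - marketing_pool
      kubernetes_namespaces:
        - marketing
```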

4. DAG Authors Have A Lot Of Power

By allowing users to directly write and upload DAGs to a shared environment, we’ve granted them a lot of power. Since Airflow is a central component of our data platform, it ties into a lot of different systems and thus jobs have wide-ranging access. While we trust our users, we still want to maintain some level of control over what they can and cannot do within a given Airflow Environment. This is especially important at scale as it becomes unfeasible for the Airflow administrators to review all jobs before they make it to production.

In order to create some basic guardrails, we’ve implemented a DAG policy which reads configuration from the previously mentioned Airflow manifest, and rejects DAGs which don’t conform to their namespace’s constraints by raising an AirflowClusterPolicyViolation.

Based on the contents of the manifest file, this policy will apply a few basic restrictions to DAG files, such as:

  • A DAG ID must be prefixed with the name of an existing namespace, for ownership.
  • Tasks in a DAG must only enqueue tasks to the specified celery queue—more on this later.
  • Tasks in a DAG can only be run in specified pools, to prevent one workload from taking over another’s capacity.
  • Any KubernetesPodOperators in this DAG must only launch pods in the specified namespaces, to prevent access to other namespaces’ secrets.
  • Tasks in a DAG can only launch pods into specified sets of external Kubernetes clusters.

This policy can be extended to enforce other rules (for example, only allowing a limited set of operators), or even mutate tasks to conform to a certain specification (for example, adding a namespace-specific execution timeout to all tasks in a DAG).

Here’s a simplified example demonstrating how to create a DAG policy which reads the previously shared manifest file, and implements the first three of the controls mentioned above:
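A simplified, illustrative sketch; the manifest loading and the DAG naming convention are assumptions (this would live in airflow_local_settings.py):

```python
import yaml  # requires PyYAML

from airflow.exceptions import AirflowClusterPolicyViolation

with open("/path/to/manifest.yaml") as f:
    manifest = yaml.safe_load(f)

NAMESPACES = {entry["name"]: entry for entry in manifest["namespaces"]}

def dag_policy(dag):
    # 1. The DAG ID must be prefixed with a registered namespace.
    namespace = next(
        (n for n in NAMESPACES if dag.dag_id.startswith(f"{n}_")), None
    )
    if namespace is None:
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} has no registered namespace prefix"
        )

    constraints = NAMESPACES[namespace]["constraints"]
    for task in dag.tasks:
        # 2. Tasks may only enqueue to the namespace's celery queue.
        if task.queue != constraints["celery_queue"]:
            raise AirflowClusterPolicyViolation(
                f"{task.task_id} uses disallowed queue {task.queue!r}"
            )
        # 3. Tasks may only run in the namespace's pools.
        if task.pool not in constraints["pools"]:
            raise AirflowClusterPolicyViolation(
                f"{task.task_id} uses disallowed pool {task.pool!r}"
            )
```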

These validations provide us with sufficient traceability while also creating some basic controls which reduce DAGs’ ability to interfere with each other.

5. Ensuring A Consistent Distribution Of Load Is Difficult

It’s very tempting to use an absolute interval for your DAG’s schedule_interval: simply set the DAG to run every timedelta(hours=1) and you can walk away, safely knowing that your DAG will run approximately every hour. However, this can lead to issues at scale.

When a user merges a large number of automatically-generated DAGs, or writes a python file which generates many DAGs at parse-time, all the DAGRuns will be created at the same time. This creates a large surge of traffic which can overload the Airflow scheduler, as well as any external services or infrastructure which the job is utilizing (for example, a Trino cluster).

After a single schedule_interval has passed, all these jobs will run again at the same time, thus leading to another surge of traffic. Ultimately, this can lead to suboptimal resource utilization and increased execution times.

While crontab-based schedules won’t cause these kinds of surges, they come with their own issues. Humans are biased towards human-readable schedules, and thus tend to create jobs which run at the top of every hour, every hour, every night at midnight, etc. Sometimes there’s a valid application-specific reason for this (for example, every night at midnight we want to extract the previous day’s data), but often we have found users just want to run their job on a regular interval. Allowing users to directly specify their own crontabs can lead to bursts of traffic which can impact SLOs and put uneven load on external systems.

As a solution to both these issues, we use a deterministically randomized schedule interval for all automatically generated DAGs (which represent the vast majority of our workflows). This is typically based on a hash of a constant seed such as the dag_id.

The below snippet provides a simple example of a function which generates deterministic, random crontabs that yield constant schedule intervals. Unfortunately, this limits the range of possible intervals, since not all intervals can be expressed as a single crontab. We have not found this restricted choice of schedule intervals to be a problem in practice, and in cases when we really need to run a job every five hours, we just accept that there will be a single four-hour interval each day.
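A simple illustrative version; the hashing scheme is an assumption:

```python
import hashlib

def deterministic_random_crontab(dag_id: str, interval_hours: int) -> str:
    """Build a crontab firing every `interval_hours` hours, with the
    minute and hour offset derived deterministically from the dag_id."""
    digest = int(hashlib.md5(dag_id.encode("utf8")).hexdigest(), 16)
    minute = digest % 60
    hour_offset = digest % interval_hours
    # Only intervals that divide 24 yield a perfectly constant gap;
    # e.g. a five-hour interval leaves one four-hour gap each day.
    return f"{minute} {hour_offset}-23/{interval_hours} * * *"

deterministic_random_crontab("my_dag", 4)  # e.g. "37 2-23/4 * * *"
```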

Thanks to our randomized schedule implementation, we were able to smooth the load out significantly. The below image shows the number of tasks completed every 10 minutes over a twelve-hour period in our single largest Airflow environment.

Figure: tasks executed per 10-minute interval in our production Airflow environment.

6. There Are Many Points of Resource Contention

There are a lot of possible points of resource contention within Airflow, and it’s really easy to end up chasing bottlenecks through a series of experimental configuration changes. Some of these resource conflicts can be handled within Airflow, while others may require some infrastructure changes. Here are a couple of ways we handle resource contention within Airflow at Shopify:

Pools

One way to reduce resource contention is to use Airflow pools. Pools are used to limit the concurrency of a given set of tasks. These can be really useful for reducing disruptions caused by bursts in traffic. While pools are a useful tool to enforce task isolation, they can be a challenge to manage because only administrators have access to edit them via the Web UI.

We wrote a custom DAG which synchronizes the pools in our environment with the state specified in a Kubernetes ConfigMap via some simple ORM queries. This lets us manage pools alongside the rest of our Airflow deployment configuration and allows users to update pools via a reviewed pull request without needing elevated access.
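A sketch of the sync logic; the ConfigMap mount path and file format are illustrative assumptions:

```python
import json

from airflow.models import Pool
from airflow.utils.session import provide_session

@provide_session
def sync_pools(session=None):
    # The ConfigMap is mounted into the pod as a plain JSON file.
    with open("/etc/airflow/pools.json") as f:
        desired = json.load(f)  # e.g. {"reporting": 32, "ml_training": 8}

    for name, slots in desired.items():
        pool = session.query(Pool).filter_by(pool=name).one_or_none()
        if pool is None:
            pool = Pool(pool=name)
        pool.slots = slots
        session.merge(pool)
```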

Priority Weight

The priority_weight argument allows you to assign a higher priority to a given task. Tasks with a higher priority will float to the top of the pile to be scheduled first. Although not a direct solution to resource contention, priority_weight can be useful for ensuring that latency-sensitive, critical tasks run before lower-priority tasks. However, given that priority_weight is an arbitrary scale, it can be hard to determine the actual priority of a task without comparing it to all other tasks. We use it to ensure that our basic Airflow monitoring DAG (which emits simple metrics and powers some alerts) always runs as promptly as possible.

It’s also worthwhile to note that by default, the effective priority_weight of a task used when making scheduling decisions is the sum of its own weight and that of all its downstream tasks. What this means is that upstream tasks in large DAGs are often favored over tasks in smaller DAGs. Therefore, using priority_weight requires some knowledge of the other DAGs running in the environment.

Celery Queues and Isolated Workers

If you need your tasks to execute in separate environments (for example, dependencies on different python libraries, higher resource allowances for intensive tasks, or differing level of access), you can create additional queues which a subset of jobs submit tasks to. Separate sets of workers can then be configured to pull from separate queues. A task can be assigned to a separate queue using the queue argument in operators. To start a worker which runs tasks from a different queue, you can use the following command:

airflow celery worker --queues <comma-delimited list of queues>
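
For instance, routing a task to a hypothetical high-memory queue via the queue argument:

```python
from airflow.operators.bash import BashOperator

heavy_task = BashOperator(
    task_id="train_model",
    bash_command="python train.py",
    queue="high_memory",  # only workers subscribed to this queue run it
)
```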

This can help ensure that sensitive or high-priority workloads have sufficient resources, as they won’t be competing with other workloads for worker capacity.

Any combination of pools, priority weights and queues can be useful in reducing resource contention. While pools allow for limiting concurrency within a single workload, a priority_weight can be used to make individual tasks run at a lower latency than others. If you need even more flexibility, worker isolation provides fine-grained control over the environment in which your tasks are executed.

It’s important to remember that not all resources can be carefully allocated in Airflow—scheduler throughput, database capacity and Kubernetes IP space are all finite resources which can’t be restricted on a workload-by-workload basis without the creation of isolated environments.

Going Forward…

There are many considerations that go into running Airflow with such high throughput, and any combination of solutions can be useful. We’ve learned a ton and we hope you’ll remember these lessons and apply some of our solutions in your own Airflow infrastructure and tooling.

To sum up our key takeaways:

  • A combination of GCS and NFS allows for both performant and easy to use file management.
  • Metadata retention policies can reduce degradation of Airflow performance.
  • A centralized metadata repository can be used to track DAG origins and ownership.
  • DAG Policies are great for enforcing standards and limitations on jobs.
  • Standardized schedule generation can reduce or eliminate bursts in traffic.
  • Airflow provides multiple mechanisms for managing resource contention.

What’s next for us? We’re currently working on applying the principles of scaling Airflow in a single environment as we explore splitting our workloads across multiple environments. This will make our platform more resilient, allow us to fine-tune each individual Airflow instance based on its workloads’ specific requirements, and reduce the reach of any one Airflow deployment.

Got questions about implementing Airflow at scale? You can reach out to either of the authors on the Apache Airflow slack community.

Megan has worked on the data platform team at Shopify for the past 9 months where she has been working on enhancing the user experience for Airflow and Trino. Megan is located in Toronto, Canada where she enjoys any outdoor activity, especially biking and hiking.

Sam is a Senior developer from Vancouver, BC who has been working on the Data Infrastructure and Engine Foundations teams at Shopify for the last 2.5 years. He is an internal advocate for open source software and a recurring contributor to the Apache Airflow project.




Asynchronous Communication is the Great Leveler in Engineering

In March 2020—the early days of the pandemic—Shopify transitioned to become a remote-first company. We call it being Digital by Design. We are now proud to employ Shopifolk around the world.

Not only has being Digital by Design allowed our staff the flexibility to work from wherever they work best, it has also increased the amount of time that they are able to spend with their families, friends, or social circles. In the pre-remote world, many of us had to move far away from our hometowns in order to get ahead in our careers. However, recently, my own family has been negotiating with the reality of aging parents, and remote working has allowed us to move back closer to home so we are always there for them whilst still doing the jobs we love.

However, being remote isn’t without its challenges. Much of the technology industry has spent decades working in colocated office space. This has formed habits in all of us that aren’t compatible with effective remote work. We’re now on a journey of mindfully unraveling these default behaviors and replacing them with remote-focused ways of working.

If you’ve worked in an office before, you’ll be familiar with synchronous communication and how it forms strong bonds between colleagues: a brief chat in the kitchen whilst getting a coffee, a discussion at your desk that was prompted by a recent code change, or a conversation over lunch.

With the layout of physical office spaces encouraging spontaneous interactions, context could be gathered and shared through osmosis with little specific effort—it just happened.

There are many challenges that you face in engineering when working on a globally distributed team. Not everyone is online at the same time of the day, meaning that it can be harder to get immediate answers to questions. You might not even know who to ask when you can’t stick your head up and look around to see who is at their desks. You may worry about ensuring that the architectural direction that you’re about to take in the codebase is the right one when building for the long term—how can you decide when you’re sitting at home on your own?

Teams have had to shift to using a different toolbox of skills now that everyone is remote. One such skill is the shift to more asynchronous communication: an essential glue that holds a distributed workforce together. It’s inclusive of teams in different time zones, it leaves an audit trail of communication and decision making, encourages us to communicate concisely, and enables everybody the same window into the company, regardless of where they are in the world.

However, an unstructured approach can be challenging, especially when teams are working on establishing their communication norms. It helps to have a model with which to reason about how best to communicate for a given purpose and to understand what side-effects of that communication might be.

The Spectrum of Synchronousness

When working remotely, we have to adapt to a different landscape. The increased flexibility of working hours, the ability to find flow and do deep work, and the fact that our colleagues are in different places means we can’t rely on the synchronous, impromptu interactions of the office anymore. We have to navigate a continuum of communication choices between synchronous and asynchronous, choosing the right way to communicate for the right group at the right time.

It’s possible to represent different types of communication on a spectrum, as seen in the diagram below.

Figure: the spectrum of communication types, ranging from synchronous, impermanent, and connected to asynchronous, permanent, and disconnected.

Let’s walk the spectrum from left to right—from synchronous to asynchronous—to understand the kinds of choices that we need to make when communicating in a remote environment.

  • Video calls and pair programming are completely synchronous: all participants need to be online at the same time.
  • Chats are written and can be read later, but due to their temporal nature have a short half-life. Usually there’s an expectation that they’ll be read or replied to fairly quickly, else they’re gone.
  • Recorded videos are more asynchronous; however, they’re typically used as a way of broadcasting some information or complementing a longer document, and their relevance can fade rapidly.
  • Email is archival and permanent and is typically used for important communication. People may take many days to reply or not reply at all.
  • Written documents are used for technical designs, in-depth analysis, or cornerstones of projects. They may be read many years after they were written but need to be maintained and often represent a snapshot in time.
  • Wikis and READMEs are completely asynchronous, and if well-maintained, can last and be useful forever.

Shifting to Asynchronous

When being Digital by Design, we have to be intentionally more asynchronous. It’s a big relearning of how to work collaboratively. In offices, we could get by synchronously, but there was a catch: colleagues at home, on vacation, or in different offices would have no idea what was going on. Now we’re all in that position, we have to adapt in order to harness all of the benefits of working with a global workforce.

By treating everyone as remote, we typically write as a primary form of communication so that all employees can have access to the same information wherever they are. We replace meetings with asynchronous interactions where possible so that staff have more flexibility over their time. We record and rebroadcast town halls so that staff in other timezones can experience the feeling of watching them together. We document our decisions so that others can understand the archeology of codebases and projects. We put effort into editing and maintaining our company-wide documentation in all departments, so that all employees have the same source of truth about teams, the organization chart, and projects.

This shift is challenging, but it’s worthwhile: effective asynchronous collaboration is how engineers solve hard problems for our merchants at scale, collaborating as part of a global team. Whiteboarding sessions have been replaced with the creation of collaborative documents in tools such as Miro. In-person Town Halls have been replaced with live streamed events that are rebroadcast on different time zones with commentary and interactions taking place in Slack. The information that we all have in our heads has needed to be written, recorded, and documented. Even with all of the tools provided, it requires a total mindset shift to use them effectively.

We’re continually investing in our developer tools and build systems to enable our engineers to contribute to our codebases and ship to production any time, no matter where they are. We’re also investing in internal learning resources and courses so that new hires can autonomously level up their skills and understand how we ship software. We have regular broadcasts of show and tell and demo sessions so that we can all gather context on what our colleagues are building around the world. And most importantly, we take time to write regular project and mission updates so that everyone in the company can feel the pulse of the organization.

Asynchronous communication is the great leveler: it connects everyone together and treats everyone equally.

Permanence in Engineering

In addition to giving each employee the same window into our culture, asynchronous communication also has the benefit of producing permanent artifacts. These could be written documents, pull requests, emails, or videos. As per our diagram, the more asynchronous the communication, the more permanent the artifact. Therefore, shifting to asynchronous communication means that not only are teams able to be effective remotely, but they also produce archives and audit trails for their work.

The whole of Shopify uses a single source of truth—an internal archive of information called the Vault. Here, Shopifolk can find all of the information that they need to get their work done: information on teams, projects, the latest news and video streams, internal podcasts and blog posts. Engineers can quickly find architecture diagrams and design documents for active projects.

When writing design documents for major changes to the codebase, a team produces an archive of their decisions and actions through time. By producing written updates to projects every week, anyone in the company can capture the current context and where it has been derived from. By recording team meetings and making detailed minutes, those that were unable to attend can catch up later on-demand. A shift to asynchronous communication implies a shift to permanence of communication, which is beneficial for discovery, reflection, and understanding.

For example, when designing new features and architecture, we collaborate asynchronously on design documents via GitHub. New designs are raised as issues in our technical designs repository, which means that all significant changes to our codebase are reviewed, ratified and archived publicly. This mirrors how global collaboration works on the open source projects we know and love. Working so visibly can be intimidating for those that haven’t done it before, so we ensure that we mentor and pair with those that are doing it for the first time.

Establishing Norms and Boundaries

Yet, multiple mediums of communication incur many choices in how to use them effectively. When you have the option to communicate via chat, email, collaborative document or GitHub issue, picking the right one can become overwhelming and frustrating. Therefore we encourage our teams to establish their preferred norms and to write them down. For example:

  • What are the response time expectations within a team for chat versus email?
  • How are online working hours clearly discoverable for each team member?
  • How is consensus reached on important decisions?
  • Is a synchronous meeting ever necessary?
  • What is the correct etiquette for “raising your hand” in video calls?
  • Where are design documents stored so they’re easily accessible in the future?

By agreeing upon the right medium to use for given situations, teams can work out what’s right for them in a way that supports flexibility, autonomy, and clarity. If you’ve never done this in your team, give it a go. You’ll be surprised how much easier it makes your day to day work.

The norms that our teams define bridge both synchronous and asynchronous expectations. At Shopify, my team members ensure that they make the most of the windows of overlap that they have each day, setting aside time to be interruptible for pair programming, impromptu chats and meetings, and collaborative design sessions. Conversely, the times of the day when teams have less overlap are equally important. Individuals are encouraged to block out time in the calendar, turn on their “do not disturb” status, and find the space and time to get into a flow state and be productive.

A natural extension of these communication norms covers writing and shipping code. Given that our global distribution of staff can potentially incur delays when it comes to reviewing, merging, and deploying, teams are encouraged to explore and define how they reach alignment and get things shipped. This can range from prioritizing the review of pull requests created in other timezones first thing in the morning, before getting on with your own work, through to finding additional engineers who are external to the team but in the same timezone to lean in and offer support, review, and pairing.

Maintaining Connectivity

Once you get more comfortable with working ever more asynchronously, it can be tempting to want to make everything asynchronous: stand-ups on Slack, planning in Miro, all without needing anyone to be on a video call at all. However, if we look back at the diagram one more time, we’ll see that there’s an important third category: connectivity. Humans are social beings, and feeling connected to other, real humans—not just avatars—is critical to our wellbeing. This means that when shifting to asynchronous work we also need to ensure that we maintain that connection. Sometimes having a synchronous meeting can be a great thing, even if it’s less efficient—the ability to see other faces, and to chat, can’t be taken for granted.

We actively work to ensure that we remain connected to each other at Shopify. Pair programming is a core part of our engineering culture, and we love using Tuple to solve problems collaboratively, share context about our codebases, and provide an environment to help hundreds of new engineers onboard and gain confidence working together with us.

We also strongly advocate for plenty of time to get together and have fun. And no, I’m not talking about generic, awkward corporate fun. I’m talking about hanging out with colleagues and throwing things at them in our very own video game: Shopify Party (our internal virtual world for employees to play games or meet up). I’m talking about your team spending Friday afternoon playing board games together remotely. And most importantly, I’m talking about at least twice a year, we encourage teams to come together in spaces we’ve invested in around the world for meaningful and intentional moments for brainstorming, team building, planning, and establishing connections offline.

Asynchronous brings efficiency, and synchronous brings connectivity. We’ve got both covered at Shopify.

James Stanier is Director of Engineering at Shopify. He is also the author of Become an Effective Software Manager and Effective Remote Work. He holds a Ph.D. in computer science and runs theengineeringmanager.com.




Double Entry Transition Tables: How We Track State Changes At Shopify

Recently we launched Shopify Balance, a money management account and card that gives Shopify merchants quick access to their funds with no fees. After the beta launch of Shopify Balance, the Shopify Data team was brought in to answer the question: how do we reliably count the number of merchants using Balance? In particular, how do we count this historically?

While this sounds like a simple question, it’s foundationally critical to knowing if our product is a success and if merchants are actually using it. It’s also more complicated than it seems to answer.

To be considered as using Shopify Balance, a merchant has to have both an active Shopify Balance account and an active Shopify account. This means we needed to build something to track the state changes of both accounts simultaneously, and make that tracking robust and reliable over time. Enter double entry transition tables. While very much an “invest up front and save a ton of time in the long run” strategy, double entry transition tables give us the flexibility to see the individual inputs that cause a given change. It does all of this while simplifying our queries and reducing long term maintenance on our reporting.

In this post, we’ll explore how we built a data pipeline using double entry transition tables to answer our question: how many Shopify merchants are using Shopify Balance? We’ll go over how we designed something that scales as our product grows in complexity, the benefits of using double entry transition tables—from ease of use to future proofing our reporting—and some sample queries using our new table.

What Are Double Entry Transition Tables?

Double entry transition tables are essentially a data presentation format that tracks changes in attributes of entities over time. At Shopify, one of the first double entry transition tables we built tracked the state of merchants using the platform, allowing us to report on how many merchants have active accounts. In comparison to a standard transition table that has from and to columns, double entry transition tables output two rows for each state change, along with a new net_change column. They can also combine many individual tracked attributes into a single output.

It took me a long time to wrap my head around this net_change column, but it essentially works like this: if you want to track the status of something over time, every time the status changes from one state to another, the table records two entries:

  1. net_change = -1: this row is the previous state
  2. net_change = +1: this row is the new state
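For example, a merchant activating Shopify Balance would produce two rows like these (values illustrative):

account_id  transition_at        balance_status  net_change
1           2021-03-14 09:00:00  not_on_balance  -1
1           2021-03-14 09:00:00  active          +1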

Double entry transition tables have many advantages including:

  • The net_change column is additive: this is the true benefit of using this type of table. This allows you to quickly get the number of entities that are in a certain state by summing up net_change while filtering for the state you care about.
  • Identifying cause of change: for situations where you care about an overall status (one that depends on several underlying statuses), you can go into the table and see which of the individual attributes caused the change.
  • Preserving all timing information: the output preserves all timing information, and even correctly orders transitions that have identical timestamps. This is helpful for situations where you need to know something like the duration of a given status.
  • Easily scaled with additional attributes: if the downstream dependencies are written correctly, you can add additional attributes to your table as the product you’re tracking grows in complexity. The bonus is that you don’t have to rewrite any existing SQL or PySpark, all thanks to the additive nature of the net_change column.

For our purpose of identifying how many merchants are using Shopify Balance, double entry transition tables allow us to track state changes for both the Shopify Balance account and the Shopify account in a single table. It also gives us a clean way to query the status of each entity over time. But how do we do this?

Building Our Double Entry Transition Pipelines

First, we need to prepare individual attribute tables to be used as inputs for our double entry transition data infrastructure. We need at least one attribute, but it can scale to any number of attributes as the product we’re tracking grows.

In our case, we created individual attribute tables for both the Shopify Balance account status and the Shopify account status. An attribute input table must have a specific set of columns:

  • a partition key that’s common across attributes, which in our case is an account_id
  • a sort key, generally a transition_at timestamp and an index
  • an attribute you want to track.

Starting from a standard transition table, we can convert it to an attribute table with a simple PySpark job, sketched below.
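The original job isn’t reproduced here, but a minimal sketch of it might look like the following (the standard_transitions DataFrame and its column names are assumptions for illustration):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Tiebreak duplicate (account_id, transition_at) pairs by transition_id.
window = Window.partitionBy("account_id", "transition_at").orderBy("transition_id")

balance_status_attribute = (
    standard_transitions  # a standard transition table with from/to columns
    .withColumn("index", F.row_number().over(window))
    .select(
        "account_id",     # partition key
        "transition_at",  # sort key: timestamp
        "index",          # sort key: tiebreaker
        F.col("to_status").alias("balance_status"),  # the attribute we track
    )
)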

Note the index column. We created this index using a row number window function, ordering by the transition_id any time we have duplicate account_id and transition_at sets in our original data. While simple, it serves as a tiebreaker should there be two transition events with identical timestamps. This ensures we always have a unique account_id, transition_at, index set in our attribute for correct ordering of events. The index plays a key role later on when we create our double entry transition table, ensuring we’re able to capture the order of our two states.

Our Shopify Balance status attribute table showing a merchant that joined and left Shopify Balance.

Now that we have our two attribute tables, it’s time to feed these into our double entry transition pipelines. This system (called build merge state transitions) takes our individual attribute tables and first generates a combined set of unique rows using a partition_key (in our case, the account_id column), and a sort_key (in our case, the transition_at and index columns). It then creates one column per attribute, and fills in the attribute columns with values from their respective tables, in the order defined by the partition_key and sort_key. Where values are missing, it fills in the table using the previous known value for that attribute. Below you can see two example attributes being merged together and filled in:

Two example attributes merged into a single output table.
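The internal implementation isn’t shown here, but for intuition, the fill-forward step can be approximated in PySpark roughly like this (table and column names are assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Union the attribute tables, then carry each attribute's last known value
# forward within each account, ordered by the sort key.
w = (
    Window.partitionBy("account_id")
    .orderBy("transition_at", "index")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

merged = (
    balance_status_attribute
    .unionByName(shopify_status_attribute, allowMissingColumns=True)
    .withColumn("balance_status", F.last("balance_status", ignorenulls=True).over(w))
    .withColumn("shopify_status", F.last("shopify_status", ignorenulls=True).over(w))
)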

This table is then run through another process that creates our net_change column and assigns a +1 value to all current rows. It also inserts a second row for each state change with a net_change value of -1. This net_change column now represents the direction of each state change as outlined earlier.

Thanks to our pipeline, setting up a double entry transition table is a very simple PySpark job:
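The actual job isn’t shown here, and the pipeline’s interface is internal, but a hypothetical invocation might look like this (the function and argument names are illustrative):

accounts_transition_facts = build_merge_state_transitions(  # hypothetical interface
    attributes={
        "balance_status": balance_status_attribute,
        "shopify_status": shopify_status_attribute,
    },
    partition_key="account_id",
    sort_keys=["transition_at", "index"],
    # Default values fill in the initial nulls before an attribute's first transition.
    defaults={"balance_status": "not_on_balance", "shopify_status": "inactive"},
)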

Note in the code above we’ve specified default values. These are used to fill in the initial null values for the attributes. Now below is the output of our final double entry transition table, which we call our accounts_transition_facts table. The table captures both a merchant’s Shopify and Shopify Balance account statuses over time. Looking at the shopify_status column, we can see they went from inactive to active in 2018, while the balance_status column shows us that they went from not_on_balance to active on March 14, 2021, and subsequently from active to inactive on April 23, 2021:

A merchant that joined and left Shopify Balance in our accounts_transition_facts double entry transition table.

Using Double Entry Transition Tables

Remember how I mentioned that the net_change column is additive? This makes working with double entry transition tables incredibly easy. The ability to sum the net_change column significantly reduces the SQL or PySpark needed to get counts of states. For example, using our new accounts_transition_facts table, we can identify the total number of active accounts on Shopify Balance, using both the Shopify Balance status and Shopify status. All we have to do is sum our net_change column while filtering for the attribute statuses we care about:
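In PySpark, that query might look something like this sketch (column names as assumed above):

from pyspark.sql import functions as F

active_balance_count = (
    accounts_transition_facts
    .where((F.col("balance_status") == "active") & (F.col("shopify_status") == "active"))
    .agg(F.sum("net_change").alias("active_balance_accounts"))
)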

Add in a grouping on a date column and we can see the net change in accounts over time:
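A sketch of the same query grouped by day (again, column names are assumed):

from pyspark.sql import functions as F

daily_net_change = (
    accounts_transition_facts
    .where((F.col("balance_status") == "active") & (F.col("shopify_status") == "active"))
    .groupBy(F.to_date("transition_at").alias("day"))
    .agg(F.sum("net_change").alias("net_change"))
)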

We can even use the output in other PySpark jobs. Below is an example of a PySpark job consuming the output of our accounts_transition_facts table. In this case, we are adding the daily net change in account numbers to an aggregate daily snapshot table for Shopify Balance:
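A rough sketch of such a job, reusing daily_net_change from the previous example (the snapshot table and its columns are assumptions):

from pyspark.sql import functions as F

balance_daily_snapshot = (
    existing_daily_snapshot  # hypothetical aggregate daily snapshot table keyed by day
    .join(daily_net_change, on="day", how="left")
    .withColumn("net_account_change", F.coalesce(F.col("net_change"), F.lit(0)))
    .drop("net_change")
)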

There are many ways you can achieve the same outputs using SQL or PySpark, but having a double entry transition table in place significantly simplifies the code at query time. And as mentioned earlier, if you write the code using the additive net_change column, you won’t need to rewrite any SQL or PySpark when you add more attributes to your double entry transition table.

We won’t lie, it took a lot of time and effort to build the first version of our accounts_transition_facts table. But thanks to our investment, we now have a reliable way to answer our initial question: how do we count the number of merchants using Balance? It’s easy with our double entry transition table! Grouping by the status we care about, we simply sum net_change and voilà, we have our answer.

Not only does our double entry transition table simply and elegantly answer our question, but it also easily scales with our product. Thanks to the additive nature of the net_change column, we can add additional attributes without impacting any of our existing reporting. This is just the beginning for our accounts_transition_facts table. In the coming months, we’ll be evaluating other statuses that change over time, and adding those that make sense for Shopify Balance into our table. Next time you need to reliably count multiple states, try exploring double entry transition tables.

Justin Pauley is a Data Scientist working on Shopify Balance. Justin has a passion for solving complex problems through data modeling, and is a firm believer that clean data leads to better storytelling. In his spare time he enjoys woodworking, building custom Lego creations, and learning new technologies. Justin can be reached via LinkedIn.


Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.

If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.


Shopify Invests in Research for Ruby at Scale

Shopify is continuing to invest in Ruby on Rails at scale. We’ve taken that further recently by funding high-profile academics to focus their work towards Ruby and the needs of the Ruby community. Over the past year we have given nearly half a million dollars in gifts to influential researchers that we trust to make a significant impact on the Ruby community for the long term.

Shopify engineers and researchers at a recent meetup in London

We want developments in programming languages and their implementations to be explored in Ruby, so that support for Ruby’s unique properties is built in from the start. For example, Ruby’s prevalent metaprogramming motivated a whole new kind of inline caching, developed and presented as a paper at one of the top programming language conferences, and Ruby’s unusually loose C extension API motivated a new kind of C interpreter to run virtualized C. These innovations wouldn’t have happened if academics weren’t looking at Ruby.

We want programming language research to be evaluated against the workloads that matter to companies using Ruby. We want researchers to understand the scale of our code bases, how frequently they're deployed, and the code patterns we use in them. For example, a lot of VM research over the last couple of decades has traded off a long warmup optimization period for better peak performance, but this doesn't work for companies like Shopify where we're redeploying very frequently. Researchers aren't aware of these kinds of problems unless we partner with them and guide them.

We think that working with academics like this will be self-perpetuating. With key researchers thinking and talking about Ruby, more early career researchers will consider working with Ruby and solving problems that are important to the Ruby community.

Let’s meet Shopify’s new research collaborators.

Professor Laurence Tratt

Professor Laurence Tratt describes his vision for optimizing Ruby

Professor Laurence Tratt is the Shopify and Royal Academy of Engineering Research Chair in Language Engineering at King’s College London. Jointly funded by Shopify, the Royal Academy, and King’s College, Laurie is looking at the possibility of automatically generating a just-in-time compiler from the existing Ruby interpreter through hardware meta-tracing and basic-block stitching.

Laurie has an eclectic and influential research portfolio, and extensive writing on many aspects of improving dynamic languages and programming. He has context from the Python community and the groundbreaking work towards meta-tracing in the PyPy project. Laurie also works to build the programming language implementation community for the long term by co-organising a summer school series for early career researchers, bringing them together with experienced researchers from academia and industry.

Professor Steve Blackburn

Professor Steve Blackburn is building a new model for applied garbage collection

Professor Steve Blackburn is an academic at the Australian National University and Google Research. Shopify funded his group’s work on MMTk, the memory management toolkit, a general library for garbage collection that brings together proven garbage collection algorithms with a framework for research into new ideas for garbage collection. We’re putting MMTk into Ruby so that Ruby can get the best current collectors today and future garbage collectors can be tested against Ruby.

Steve is a world-leading expert in garbage collection, and Shopify’s funding is putting Ruby’s unique requirements for memory management into his focus.

Dr Stefan Marr

Dr Stefan Marr is an expert in benchmarking dynamic language implementations

Dr Stefan Marr is a Senior Lecturer at the University of Kent in the UK and a Royal Society Industrial Fellow. With the support of Shopify, he’s examining how we can make interpreters faster and improve interpreter startup and warmup time.

Stefan has a distinguished reputation for benchmarking techniques, differential analysis between languages and implementation techniques, and dynamic language implementation. He co-invented a new method for inline caching that has been instrumental for improving the performance of Ruby’s metaprogramming in TruffleRuby.

Shopify engineers and research collaborators discuss how to work together to improve Ruby

We’ve been bringing together the researchers that we’re funding with our senior Ruby community engineers to share their knowledge of what’s already possible and what could be possible, combining our understanding of how Ruby and Rails are used at scale today and what the community needs.

These external researchers are all in addition to our own internal teams doing publishable research-level work on Ruby, through YJIT, TruffleRuby, and other efforts.

Part of Shopify’s Ruby and Rails Infrastructure Team listening to research proposals

We look forward to sharing more about our investments in Ruby research over the coming years in blog posts and academic papers.

Chris Seaton has a PhD in optimizing Ruby and works on TruffleRuby, a highly optimizing implementation of Ruby, and research projects at Shopify.




Maestro: The Orchestration Language Powering Shopify Flow

Adagio misterioso

Shopify recently unveiled a new version of Shopify Flow. Merchants extensively use Flow’s workflow language and associated execution engine to customize Shopify, automate tedious, repetitive tasks, and focus on what matters. Flow comes with a comprehensive library of templates for common use cases, and detailed documentation to guide merchants in customizing their workflows.

For the past couple of years my team has been working on transitioning Flow from a successful Shopify Plus App into a platform designed to power the increasing automation and customization needs across Shopify. One of the main technical challenges we had to address was the excessive coupling between the Flow editor and engine. Since they shared the same data structures, the editor and engine couldn't evolve independently, and we had limited ability to tailor these data structures for their particular needs. This problem was significant because editor and engine have fundamentally very different requirements.

The Flow editor provides a merchant-facing visual workflow language. Its language must be declarative, capturing the merchant’s intent without dealing with how to execute that intent. The editor concerns itself mainly with usability, understandability, and interactive editing of workflows. The Flow engine, in turn, needs to efficiently execute workflows at scale in a fault-tolerant manner. Its language can be more imperative, but it must have good support for optimizations and at-least-once execution semantics that ensure workflow executions recover from crashes. However, editor and engine also need to play together nicely. For example, they need to agree on the type system, which is used to find user errors and to support IDE-like features, such as code completion and inline error reporting within the visual experience.

We realized it was important to tackle this problem right away, and it was crucial to get it right while minimizing disruptions to merchants. We proceeded incrementally.

First, we designed and implemented a new domain-specific orchestration language that addressed the requirements of the Flow engine. We call this language Maestro. We then implemented a new, horizontally scalable engine to execute Maestro orchestrations. Next, we created a translation layer from the original Flow workflow data structures into Maestro orchestrations. This allowed us to execute existing Flow workflows with the new engine. Lastly, we slowly migrated all Flow workflows to the new engine, and by BFCM 2020 essentially all workflows were executing on it.

We were then finally in a position to deal with the visual language. So we implemented a brand new visual experience, including a new language for the Flow editor. This language is more flexible and expressive than the original, so any of the existing workflows could be easily migrated. The language also can be translated into Maestro orchestrations, so it could be directly executed by the new engine. Finally, once we were satisfied with the new experience, we started migrating existing Flow workflows, and by early 2022, all Flow workflows had been migrated to use the new editor and new engine.

In the remainder of this post I want to focus on the new orchestration language, Maestro. I’ll give you an overview of its design and implementation, and then focus on how it neatly integrates with and addresses the requirements of the new version of Shopify Flow.

 

A Sample of Maestro

Allegro grazioso

Let’s take a quick tour to get a taste of what Maestro looks like and what exactly it does. Maestro isn’t a general purpose programming language, but rather an orchestration language focused solely on coordinating the sequence in which calls to functions in some host language are made, while capturing which data is passed between those function calls. For example, suppose you want to implement code that calls a remote service to fetch some relevant customers and then deletes those customers from the database. The Maestro language can’t implement the remote service call or the database call themselves, but it can orchestrate those calls in a fault-tolerant fashion. The main benefit of using Maestro is that the state of the execution is precisely captured and can be made durable, so you can observe the progression and, after a crash, restart where you left off.

The following Maestro code, slightly simplified for presentation, implements an orchestration similar to the example above. It first defines the shape of the data involved in the orchestration: an object type called Customer with a few attributes. It then defines three functions. Function fetch_customers takes no parameters and returns an array of Customers. Its implementation simply performs a GET HTTP request to the appropriate service. The delete_customer function, in this example, simulates the database deletion by calling the print function from the standard library. The orchestration function represents the main entry point. It uses the sequence expression to coordinate the function calls: first call fetch_customers, binding the result to the customers variable, then map over the customers calling delete_customer on each.

Maestro functions declare interfaces to encapsulate expressions: the bodies of fetch_customers and delete_customer are call expressions, and the body of orchestration is a sequence expression that composes other expressions. But at some point we must yield to the host language to implement the actual service request, database call, print, and so on. This is accomplished by a function whose body is a primitive expression, meaning it binds to the host language code registered under the declared key. For example, these are the declarations of the get and print functions from the standard library of our Ruby implementation:

We can now use the Maestro interpreter to execute the orchestration function. This is one possible simplified output from the command line:

The output contains the result of calling print twice, once for each of the customers returned by the fetch service. The interesting aspect here is that the -c flag instructed the interpreter to also dump checkpoints to the standard output.

Checkpoints are what Maestro uses to store execution state. They contain enough information to understand what has already happened in the orchestration and what wasn’t completed yet. For example, the first checkpoint contains the result of the service request that includes a JSON object with the information about customers to delete. In practice, checkpoints are sent to durable storage, such as Kafka, Redis, or MySQL. Then, if the interpreter stops for some reason, we can restart and point it to the existing checkpoints. The interpreter can recover by skipping expressions for which a checkpoint already exists. If we crash while deleting customers from the database, for example, we wouldn’t re-execute the fetch request because we already have its result.

The checkpoints mechanism allows Maestro to provide at-least-once semantics for primitive calls, exactly what’s expected of Shopify Flow workflows. In fact, the new Flow engine, at a high level, is essentially a horizontally scalable, distributed pool of workers that execute the Maestro interpreter on incoming events for orchestrations generated by Flow. Checkpoints are used for fault tolerance as well as to give merchants feedback on each execution step, current status, and so on.

Flow and Maestro Ensemble

Presto festoso

Now that we know what Maestro is capable of, let’s see how it plays together with Flow. The following workflow, for example, shows a typical Flow automation use case. It triggers when orders in a store are created and checks for certain conditions in that order, based on the presence of discount codes or the customer’s email. If the condition predicate matches successfully, it adds a tag to the order and subsequently sends an email to the store owner to alert of the discount.

Screenshot of the Flow app showing the visualization of creating a workflow based on conditions
A typical Flow automation use case

Consider a merchant using the Flow App to create and execute this workflow. There are four main activities involved:

  1. navigating the possible tasks and types to use in the workflow
  2. validating that the workflow is correct
  3. activating the workflow so it starts executing on events
  4. monitoring executions.

Catalog of Tasks and Types

The Flow Editor displays a catalog of tasks for merchants to pick from. Those are triggers, conditions, and actions provided both by Shopify and Shopify Apps via Shopify Flow Connectors. Furthermore, Flow allows merchants to navigate Shopify’s GraphQL Admin API objects in order to select relevant data for the workflow. For example, the Order created trigger in this workflow conceptually brings an Order resource that represents the order that was just created. So, when the merchant is defining a condition or passing arguments to actions, Flow assists in navigating the attributes reachable from that Order object. To do so, Flow must have a model of the GraphQL API types and understand the interface expected and provided by tasks. Flow achieves this by building on top of Maestro types and functions, respectively.

Flow models types as decorated Maestro types: the structure is defined by Maestro types, but Flow adds information, such as field and type descriptions. Most types involved in workflows come from APIs, such as the Shopify GraphQL Admin API. Hence, Flow has an automated process to consume APIs and generate the corresponding Maestro types. Additional types can be defined, for example, to model data included in the events that correspond to triggers, and model the expected interface of actions. For instance, the following types are simplified versions of the event data and Shopify objects involved in the example:

Flow then uses Maestro functions and calls to model the behavior of triggers, conditions, and actions. The following Maestro code shows function definitions for the trigger and actions involved in the workflow above.

Actions are mapped directly to Maestro functions that define the expected parameters and return types. An action used in a workflow is a call to the corresponding function. A trigger, however, is mapped to a data hydration function that takes event data, which often includes only references by IDs, and loads additional data required by the workflow. For example, the order_created function takes an OrderCreatedTrigger, which contains the order ID as an Integer, and performs API requests to load an Order object, which contains additional fields like name and discountCode. Finally, conditions are currently a special case in that they’re translated to a sequence of function calls based on the predicate defined for the condition (more on that in the next section).

Workflow Validation

Once a workflow is created, it needs validation. For that, Flow composes a Maestro function representing the whole workflow. The parameter of the workflow function is the trigger data since it’s the input for its execution. The body of the function corresponds to the transitions and configurations of tasks in the workflow. For example, the following function corresponds to the example:

The first call in the sequence corresponds to the trigger function that’s used to hydrate objects from the event data. The next three steps correspond to the logical expression configured for the condition. Each disjunction branch becomes a function call (to eq and ends_with, respectively), and the result is computed with or. A Maestro match expression is used to pattern match on the result. If it’s true, the control flow goes to the sequence expression that calls the functions corresponding to the workflow actions.

Flow now can rely on Maestro static analysis to validate the workflow function. Maestro will type check, verify that every referred variable is in scope, verify that object navigation is correct (for example, that order.customer.email is valid), and so on. Then, any error found through static analysis is mapped back to the corresponding workflow node and presented in context in the Editor. In addition to returning errors, static analysis results contain symbol tables for each expression indicating which variables are in scope and what their types are. This supports the Editor in providing code completion and other suggestions that are specific for each workflow step. The following screenshot, for example, shows how the Editor can guide users in navigating the fields present in objects available when selecting the Add order tags action.

Flow App Editor screenshot showing how the Editor can guide users in navigating the fields present in objects available when selecting the Add order tags action.
Shopify Flow App editor

Note that transformation and validation run while a Flow workflow is being edited, either in the Flow Editor or via APIs. This operation is synchronous and, thus, must be very fast since merchants are waiting for the results. This architecture is similar to how modern IDEs send source code to a language service that parses the code into a lower level representation and returns potential errors and additional static analysis results.

Workflow Activation

Once a workflow is ready, it needs to be activated to start executing. The process is initially similar to validation in that Flow generates the corresponding Maestro function. However, there are a few additional steps. First, Maestro performs a static usage analysis: for each call to a primitive function it computes which attributes of the returned type are used by subsequent steps. For example, the call to shopify::admin::order_created returns a tuple (Shop, Order), but not all attributes of those types are used. In particular, order.customer.name isn’t used by this workflow. Not only would it be inefficient to hydrate that value; in the presence of recursive definitions (such as an Order having a Customer who has Orders), it would be impossible to determine where to stop digging into the type graph. The result of usage analysis is then passed at runtime to the host function implementation. The runtime can use it to tailor how it computes the values it returns, for instance, by optimizing the queries to the Admin GraphQL API.

Second, Maestro performs a compilation step. The idea is to apply optimizations, removing anything unnecessary for the runtime execution of the function, such as type definitions and auxiliary functions that aren’t invoked by the workflow function. The result is a simplified, small, and efficient Maestro function. The compiled function is then packaged together with the result of usage analysis and becomes an orchestration. Finally, the orchestration is serialized and deployed to the Flow Engine that observes events and runs the Maestro interpreter on the orchestration.

Monitoring Executions

As the Flow Engine executes orchestrations, the Maestro interpreter emits checkpoints. As we discussed before, checkpoints are used by the engine when restarting the interpreter to ensure at-least-once semantics for actions. Additionally, checkpoints are sent back to Flow to feed the Activity page, which lists workflow executions. Since checkpoints have detailed information about the output of every primitive function call, they can be used to map back to the originating workflow step and offer insight into the behavior of executions.

A screenshot of the Flow run log from the Activity Page. It displays the results of the workflow execution at each step
Shopify Flow Run Log from the Activity page

For instance, the image above shows a Run Log for a specific execution of the example, which can be accessed from the Activity page. Note that Flow highlights the branch of the workflow that executed and which branch of the condition disjunction actually evaluated to true at runtime. All this information comes directly from interpreting checkpoints and mapping back to the workflow.

Outro: Future Work

Largo maestoso

In this post I introduced Maestro, a domain-specific orchestration language we developed to power Shopify Flow. I gave a sample of what Maestro looks like and how it neatly integrates with Flow, supporting features of both the Flow Editor as well as the Flow Engine. Maestro has been powering Flow for a while, but we are planning more, such as:

  • Improving the expressiveness of the Flow workflow language, making better use of all the capabilities Maestro offers. For example, allowing the definition of variables to bind the result of actions for subsequent use, support for iteration, pattern matching, and so on.
  • Implementing additional optimizations on deployment, such as merging Flow workflows as a single orchestration to avoid redundant hydration calls for the same event.
  • Using the Maestro interpreter to support previewing and testing of Flow workflows, employing checkpoints to show results and verify assertions.

If you are interested in working with Flow and Maestro or building systems from the ground up to solve real-world problems, visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by design.


React Native Skia—For Us, For You, and For Fun

Right now, you are likely reading this content on a Skia surface. It powers Chrome, Android, Flutter, and others. Skia is a cross-platform drawing library that provides a set of drawing primitives that you can trust to run anywhere: iOS, Android, macOS, Windows, Linux, the browser, and now React Native.

Our goal with this project is twofold. First, we want to provide React Native, which is notorious for its limited graphical capabilities, with a set of powerful 2D drawing primitives that are consistent across iOS, Android, and the Web. Second, we want to bridge the gap between graphic designers and React Native by providing the same UI capabilities as a tool like Figma, so that everyone can speak the same language.

Skia logo image. The background is black and Skia is displayed in cursive rainbow font in the middle of the screen
React Native Skia logo

To bring the Skia library to React Native, we needed to rely on the new React Native architecture, JavaScript Interface (JSI). This new API enables direct communication between JavaScript and native modules using C++ instead of asynchronous messages between the two worlds. JSI allows us to expose the Skia API directly in the following way:

We are making this API virtually 100% compatible with the Flutter API, allowing us to do two things:

  1. Leverage the completeness and conciseness of their drawing API
  2. Eventually provide react-native-web support for Skia using CanvasKit, the Skia WebAssembly build used by Flutter for its web apps.

React is all about declarative UIs, so we are also providing a declarative API built on top of the imperative one. The example above can also be written as:

This API allows us to provide an impressive level of composability to express complex drawings, and it allows us to perform declarative optimizations. We leverage the React Reconciler to perform the work of diffing the internal representation states, and we pass the differences through to the Skia engine.

React Native Skia offers a wide range of APIs such as advanced image filters, shaders, SVG, path operations, vertices, and text layouts. The demo below showcases a couple of drawing primitives previously unavailable in the React Native ecosystem. Each button contains a drop shadow and an inner shadow, the progress bar is rendered with an angular gradient, and the bottom sheet uses a backdrop filter to blur the content below it.

Below is an example of mesh gradients done using React Native Skia

Reanimated 2 (a project also supported by Shopify) brought to life the vision of React Native developers writing animations directly in JavaScript code by running it on a dedicated thread. Animations in React Native Skia work the same way. Below is an example of animation in Skia:

Example of the Breathe code animated

If your drawing animation depends on an outside view, like a React Native gesture handler, for instance, we also provide a direct connector to Reanimated 2.

With React Native Skia, we expect to be addressing a big pain point of the React Native community. And it is safe to say that we are only getting started. We are working on powerful new features which we cannot wait to share with you in the upcoming months. We also cannot wait to see what you build with it. What are you waiting for!? npm install @shopify/react-native-skia.

Christian Falch has been involved with React Native since 2018, both through open source projects and his Fram X consultancy. He has focused on low-level React Native coding integrating native functionality with JavaScript and has extensive experience writing C++ based native modules.

William Candillon is the host of the “Can it be done in React Native?” YouTube series, where he explores advanced user-experiences and animations in the perspective of React Native development. While working on this series, William partnered with Christian to build the next-generation of React Native UIs using Skia.

Colin Gray is a Principal Developer of Mobile working on Shopify’s Point of Sale application. He has been writing mobile applications since 2010, in Objective-C, RubyMotion, Swift, Kotlin, and now React Native. He focuses on stability, performance, and making witty rejoinders in engineering meetings. Reach out on LinkedIn to discuss mobile opportunities at Shopify!




Data Is An Art, Not Just A Science—And Storytelling Is The Key

People often equate data science with statistics, but it’s so much more than that. When data science first emerged as a craft, it was a combination of three different skill sets: science, mathematics, and art. But over time, we’ve drifted. We’ve come to prioritize the scientific side of our skillset and have lost sight of the creative part.

A venn diagram with three circles: Math, Art, and Science. In the middle is Data Science
A Venn diagram of the skills that make up the craft of data science

One of the most neglected, yet arguably most important, skills from the artistic side of data science is communication. Communication is key to everything we do as data scientists. Without it, our businesses won’t be able to understand our work, let alone act on it.

Being a good data storyteller is key to being a good data scientist. Storytelling captures your stakeholders’ attention, builds trust with them, and invites them to fully engage with your work. Many people are intimidated by numbers. By framing a narrative for them, you create a shared foundation they can work from. That’s the compelling promise of data storytelling.

Data science is a balancing act—math and science have their role to play, but so do art and communication. Storytelling can be the binding force that unites them all. In this article, I’ll explore how to tell an effective data story and illustrate with examples from our practice at Shopify. Let’s dive in.

What Is Data Storytelling?

When you Google data storytelling, you get definitions like: “Data storytelling is the ability to effectively communicate insights from a dataset using narratives and visualizations.” And while this isn’t untrue, it feels anemic. There’s a common misconception that data storytelling is all about charts, when really, that’s just the tip of the iceberg.

Even if you design the most perfect visualization in the world—or run a report, or create a dashboard—your stakeholders likely won’t know what to do with the information. All of the burden of uncovering the story and understanding the data falls back onto them.

At its core, data storytelling is about taking the step beyond the simple relaying of data points. It’s about trying to make sense of the world and leveraging storytelling to present insights to stakeholders in a way they can understand and act on. As data scientists, we can inform and influence through data storytelling by creating personal touch points between our audience and our analysis. As with any good story, you need the following key elements:

  1. The main character: Every story needs a hero. The central figure or “main character” in a data story is the business problem. You need to make sure to clearly identify the problem, summarize what you explored when considering the problem, and provide any reframing of the problem necessary to get deeper insight.
  2. The setting: Set the stage for your story with context. What background information is key to understanding the problem? You're not just telling the story; you're providing direction for the interpretation, ideally in as unbiased a way as possible. Remember that creating a data story doesn’t mean shoe-horning data into a preset narrative—as data scientists, it’s our job to analyze the data and uncover the unique narrative it presents.
  3. The narrator: To guide your audience effectively, you need to speak to them in a way they understand and resonate with. Ideally, you should communicate your data story in the language of the receiver. For example, if you’re communicating to a non-technical audience, try to avoid using jargon they won’t be familiar with. If you have to use technical terms or acronyms, be sure to define them so you’re all on the same page.
  4. The plot: Don’t leave your audience hanging—what happens next? The most compelling stories guide the reader to a response and data can direct the action by providing suggestions for next steps. By doing this, you position yourself as an authentic partner, helping your stakeholders figure out different approaches to solving the problem.

Here’s how this might look in practice on a sample data story:

 

  • Main Character: Identify the business question you’re trying to solve. Ex. Why aren’t merchants using their data to guide their business decisions?
  • Setting: What background information is key to understanding the problem? Ex. How are they using existing analytic products and what might be preventing use?
  • Narrator: Ensure you’re communicating in a way that your audience will understand. Ex. Our audience is busy execs who prefer short bulleted lists in a Slack message.
  • Plot: Use data to direct the action by providing next steps. Ex. Data shows merchants spend too much time going back and forth between their Analytics and Admin pages. We recommend surfacing analytics right within their workflow.

 

With all that in mind, how do you go about telling effective data stories? Let me show you.

1. Invest In The Practice Of Storytelling

In order to tell effective data stories, you need to invest in the right support structures. First of all, that means laying the groundwork with a strong data foundation. The right foundation ensures you have easy access to data that is clean and conformed, so you can move quickly and confidently. At Shopify, our data foundations are key to everything we do—they not only support effective data storytelling, but also enable us to move purposefully during unprecedented moments.

For instance, we’ve seen the impact data storytelling can have while navigating the pandemic. In the early days of COVID-19, we depended on data storytelling to give us a clear lens into what was happening, how our merchants were coping, and how we could make decisions based on what we were seeing. This is a story that has continued to develop and one that we still monitor to this day.

Since then, our data storytelling approach has continued to evolve internally. The success of our data storytelling during the pandemic was the catalyst for us to start institutionalizing data storytelling through a dedicated working group at Shopify. This is a group for our data scientists, led by data scientists—so they fully own this part of our craft maturity.

Formalizing this support network has been key to advancing our data storytelling craft. Data scientists can drop in or schedule a review of a project in process. This group also provides feedback and informed guidance on how to improve the story that the analysis is trying to tell, so that communications back to stakeholders are as impactful as possible. The goal is to push our data scientists to take their practice to the next level—by providing context, explaining what angles they already explored, offering ways to reframe the problem, and sharing potential next steps.

Taking these steps to invest in the practice of data storytelling ensures that when our audience receives our data communications, they’re equipped with accurate data and useful guidance to help them choose the best course of action. By investing in the practice of data storytelling, you too can ensure you’re producing the highest quality work for your stakeholders—establishing you as a trusted partner.

2. Identify Storytelling Tools And Borrow Techniques From The Best

Having the right support systems in place is key to making sure you’re telling the right stories—but how you tell those stories is just as important. One of our primary duties as data scientists is decision support. This is where the art and communication side of the practice comes in. It's not just a one-and-done, "I built a dashboard, someone else can attend to that story now." You’re committed to transmitting the story to your audience. The question then becomes, how can you communicate it as effectively as possible, both to technical and non-technical partners?

At Shopify, we’ve been inspired by and have adopted design studio Duarte’s Slidedocs approach. Slidedocs is a way of using presentation software like PowerPoint to create visual reports that are meant to be read, not presented. Unlike a chart or a dashboard, what the Slidedoc gives you is a well-framed narrative. Akin to a “policy brief” (like in government), you can pack a dense amount of information and visuals into an easily digestible format that your stakeholders can read at their leisure. Storytelling points baked into our Slidedocs include:

  • The data question we’re trying to answer
  • A description of our findings
  • A graph or visualization of the data 
  • Recommendations based on our findings
  • A link to the in-depth report
  • How to contact the storyteller
A sample slidedoc example from Shopify Data. It highlights information for a single question by describing findings and recommendations. It also links out to the in-depth report
An example of how to use Slidedocs for data storytelling

Preparing a Slidedoc is a creative exercise—there’s no one correct way to present the data, it’s about understanding your audience and shaping a story that speaks to them. What it allows us to do is guide our stakeholders as they explore the data and come to understand what it’s communicating. This helps them form personal touchpoints with the data, allowing them to make a better decision at the end.

While the Slidedocs format is a useful method for presenting dense information in a digestible way, it’s not the only option. For more inspiration, you can learn a lot from teams who excel at effective communication, such as marketing, PR, and UX. Spend time with these teams to identify their methods of communication and how they scaffold stories to be consumed. The important thing is to find tools that allow you to present information in a way that’s action-oriented and tailored for the audience you’re speaking to.

3. Turn Storytelling Into An Experience

The most effective way to help your audience feel invested in your data story is to let them be a part of it. Introducing interactivity allows your audience to explore different facets of the story on demand, in a sense, co-creating the story with you. If you supply data visualizations, consider ways that you can allow your audience to filter them, drill into certain details, or otherwise customize them to tell bigger, smaller, or different stories. Showing, not telling, is a powerful storytelling technique.

A unique way we’ve done this at Shopify is through a product we created for our merchants that lets them explore their own data. Last fall, we launched the BFCM 2021 Notebook—a data storytelling experience for our merchants with a comprehensive look at their store performance over Black Friday and Cyber Monday (BFCM).

While we have existing features for our merchants that show, through reports and contextual analytics, how their business is performing, we wanted to take it to the next level by giving them more agency and a personal connection to their own data. That said, we understand it can be overwhelming for our merchants (or anyone!) to have access to a massive set of data, but not know how to explore it. People might not know where to start or feel scared that they’ll do it wrong.

Example BFCM 2021 Notebook. The notebook shows a graph of the sales over time during BFCM weekend 2021
Shopify’s BFCM Notebook

What the BFCM Notebook provided was a scaffold to support merchants’ data exploration. It’s an interactive visual companion that enables merchants to dive into their performance data (e.g. total sales, top-performing products, buyer locations) during their busiest sale season. Starting with total sales, merchants could drill into their data to understand their results based on products, days of the week, or location. If they wanted to go even deeper, they could click over the visualizations to see the queries that powered them—enabling them to start thinking about writing queries of their own.

Turning data storytelling into an experience has given our merchants the confidence to explore their own data, which empowers them to take ownership of it. When you’re creating a data story, consider: Are there opportunities to let the end user engage with the story in interactive ways?

Happily Ever After

Despite its name, data science isn’t just a science; it’s an art too. Data storytelling unites math, science, art, and communication to help you weave compelling narratives that help your stakeholders comprehend, reflect on, and make the best decisions about your data. By investing in your storytelling practice, using creative storytelling techniques, and including interactivity, you can build trust with your stakeholders and increase their fluency with data. The creative side of data science isn’t an afterthought—it’s absolutely vital to a successful practice.

Wendy Foster is the Director of Engineering & Data Science for Core Optimize at Shopify. Wendy and her team are focused on exploring how to better support user workflows through product understanding, and building experiences that help merchants understand and grow their business.


Are you passionate about data discovery and eager to learn more? We’re always hiring! Visit our Data Science and Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by design.


Building a Business System Integration and Automation Platform at Shopify

Companies organize and automate their internal processes with a multitude of business systems. Since companies function as a whole, these systems need to be able to talk to one another. At Shopify, we took advantage of Ruby, Rails, and our scale with these technologies to build a business system integration solution.

The Modularization of Business Systems

In step with software design’s progression from monolithic to modular architecture, business systems have proliferated over the past 20 years, becoming smaller and more focused. Software hasn’t only targeted the different business domains like sales, marketing, support, finance, legal, and human resources, but the niches within or across these domains, like tax, travel, training, documentation, procurement, and shipment tracking. Targeted applications can provide the best experience by enabling rapid development within a small, well defined space.

The Gap

The transition from monolithic to modular architecture doesn’t remove the need for interaction between modules. Maintaining well-defined, versioned interfaces and integrating with other modules is one of the biggest costs of modularization. In the business systems space, however, it doesn’t always make sense for vendors to take responsibility for integration, or do it in the same way.

Business systems are built on different tech stacks with different levels of competition and different customer requirements. This landscape leads to business systems with asymmetric interfaces (from SOAP to FTP to GraphQL) and integration capabilities (from complete integration platforms to nothing). Businesses are left with a gap between their systems and no clear, easy way to fill it.

Organic Integration

Connecting these systems on an as-needed basis leads to a hacky hodgepodge of:

  • ad hoc code (often running on individuals’ laptops)
  • integration platforms like Zapier
  • users downloading and uploading CSVs
  • third party integration add ons from app stores
  • out of the box integrations
  • custom integrations built on capable business systems.

Frequently, data doesn’t go from the source system directly to the target system, but has multiple layovers in whatever systems it could integrate with. The only determining factors are the skillsets and creativity of the people involved in building the integration.

When a company is small this can work, but as companies scale and the number of integrations grows, it becomes unmanageable: data flows are convoluted, raising security questions and making business-critical automation fragile. Just like with monolithic architecture, it can become too terrifying and complex to change anything, paralyzing the business systems and preventing them from adapting and scaling to support the company.

Integration Platform as a Service

The solution, as validated by the existence of numerous Integration Platform as a Service (IPaaS) solutions like Mulesoft, Dell Boomi, and Zapier, is yet another piece of software that’s responsible for integrating business systems. The consistency provided by using one application for all integration can solve the issues of visibility, fragility, reliability, and scalability.

Mulesoft

At Shopify, we ran into this problem, created a small team of business system integration developers and put them to work building on Mulesoft. This was an improvement, but, because Shopify is a software company, it wasn’t perfect.

Isolation from Shopify Development

Shopify employs thousands of developers. We have infrastructure, training, and security teams. We maintain a multitude of packages and have tons of Slack channels for getting help, discussing ideas, and learning about best practices. Shopify is intentionally narrow in the technologies it uses (Ruby, React, and Go) to benefit from this scale.

Mulesoft is a proprietary platform leveraging XML configuration for the Java virtual machine. This isn’t part of Shopify’s tech stack, so we missed out on many of the advantages of developing at Shopify.

Issues with Integrating Internal Applications

Mulesoft’s cloud runtime takes care of infrastructure for its users, a huge advantage of using the platform. However, Shopify has a number of internal services, like shipment tracking, as well as infrastructure, like Kafka, that for security reasons can only be used from within Shopify’s cloud. This meant that we would need to build infrastructure skills on our team to host Mulesoft on our own cloud.

Although using Mulesoft initially seemed to lower the costs of connecting business systems, due to our unique situation, it had more drawbacks than developing on Shopify’s tech stack.

Building on Shopify’s Stack

Unless performance is paramount, in which case we use Go, Ruby is Shopify’s choice for backend development. Generally Shopify uses the Rails framework, so if we’re going to start building business system integrations on Shopify’s tech stack, Ruby on Rails is our choice. The logic for choosing Ruby on Rails within the context of development at Shopify is straightforward, but how do we use it for business system integration?

The Design Priorities

When the platform is complete, we want to build reliable integrations quickly. To turn that idea into a design, we need to look at the technical aspects of business system integration that differentiate it from the standard application development Rails is designed around.

Minimal

Generally applications are structured around a domain and get to determine the requirements: the data they will and won’t accept. An integration, however, isn’t the source of truth for anything. Any validation we introduce in an integration will be, at best, a duplication of logic in the target application. At worst, our logic will create spurious errors.

I did this the other day with a Sorbet Struct. I was using it to organize data before posting it. Unfortunately, a field was required in the struct that wasn’t required in the target system. This resulted in records failing in transit when the target system would have accepted them.

Transparent

Many business systems are highly configurable. Changes in their configuration can lead to changes in their APIs, affecting integrations.

Airtable, for example, uses the column names as the JSON keys in their API, so changing a column name in the user interface can break an integration. We need to provide visibility into exactly what integrations are doing to help system admins avoid creating errors and quickly resolve them when they arise.

Flexible

Business systems are diverse, created at different times by different developers using different technologies and design patterns. For integration work, this most importantly leads to a wide variety of interfaces and formats: FTP, REST, SOAP, JSON, XML, and GraphQL. If we want a centralized, standardized place to build integrations, it needs to support whatever requirements are thrown at it.

Secure

Integrations deal with sensitive information: personally identifiable information (PII), compensation, and anything else that needs to move between business systems. We need to make sure that we aren’t exposing this data.

Reusable

Small, point-to-point integrations are the most reliable and maintainable, but this design has the potential to create a lot of duplicate code and infrastructure. If we want to build integrations quickly, we need to reuse as much as possible.

Implementation

Those are some nice high-level design priorities. How did we implement them?

Documentation

From the beginning of the project, documentation has been a priority. We document

  • decisions that we’re making, so they’re understood and challenged in the future as needs change
  • the integrations living on our platform
  • the clients we’ve implemented for connecting to different systems and how to use them
  • how to build on the platform as a whole

Initially we were using GitHub’s built-in wiki, but we moved our documentation into the repository: version controlling it and committing updates alongside the code made changes easier to trace and ensured the documentation was kept up to date. Fortunately, Shopify’s infrastructure makes it very easy to add a static site to a git repository.

Design priorities covered: transparent, reusable

Language Features

Ruby is a mature, feature-rich language. Beyond being Turing complete, over the years it’s added a plethora of features to make programming simpler and more concise. It also has an extensive package ecosystem thanks to Ruby’s wide usage, long life, and generous community. In addition to reusing our own code, we’re able to leverage other developers’ and organizations’ code. Many business systems have great, well-maintained gems, so integrating with them is as simple as adding the gem and credentials.

Design priorities covered: reusable

Rails Engines

We reused Shopify Core’s architecture, designing our application as a modular monolith made up of Rails Engines. Initially, the application didn’t take advantage of Rails Engines and simply used namespaces within the app directory. It quickly became apparent that this model made tracking down an individual integration’s code difficult: you had to go through every one of the app subdirectories (controllers, helpers, and more) to see whether an integration’s namespace was present.

After a lot of research and a few conversations with my Shopify engineering mentor, I began to understand Rails Engines. They’re a great fit for our platform because integrations have relatively obvious boundaries, so it’s easy and advantageous to modularize them.

This design enabled us to reuse the same infrastructure for all our integrations. It also enabled us to share code across integrations by creating a common Rails Engine, without the overhead of packaging it up into rubygems or duplicating it. This reduces both development and maintenance costs.
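
As a sketch (the engine name is hypothetical), each integration is an engine that the host application mounts:

    # engines/payroll_sync/lib/payroll_sync/engine.rb
    module PayrollSync
      class Engine < ::Rails::Engine
        # Keep the integration's models, controllers, and routes
        # namespaced away from the rest of the monolith.
        isolate_namespace PayrollSync
      end
    end

    # config/routes.rb in the host application
    Rails.application.routes.draw do
      mount PayrollSync::Engine, at: "/payroll_sync"
    end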

In addition, this architecture benefitted transparency by keeping all of the code in one place and modularizing it. It’s easy to know what integrations exist and what code belongs to them.

Design priorities covered: reusable, transparent

Eliminating Data Storage

Our business system integration platform won’t be the source of truth for any business data. The business data comes from other business systems and passes through our application.

If we start storing data in our application it can become stale, out of sync with the source of truth. We could end up sending stale data to other systems and triggering the wrong processes. Tracking this all down requires digging through databases, logs, and timestamps in multiple systems, some without good visibility.

Data storage adds complexity, hurts transparency, and introduces security and compliance concerns.

Design priorities covered: transparent, minimal, secure

Actions

Business system integration consists almost entirely of business logic. In Rails, there are multiple places this could live, but they generally involve abstractions designed around building standalone applications, not integrations. Using one of these abstractions would add complexity and obfuscate the logic.

Actions were floating around Shopify as a potential home for business logic. They have the same structure as Active Jobs: one public method, perform, and no references to any other Actions. The Action concept provides consistency, making all integration logic easy to find. It also provides transparency by putting all business logic in one place, so it’s only necessary to look at one Action to understand a data flow.
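
A minimal sketch of the shape (the class and client here are hypothetical):

    # One public method, perform, and no calls to other Actions.
    class SyncEmployeeAction
      def perform(employee_payload)
        record = transform(employee_payload)
        PayrollClient.new.post_employee(record)
      end

      private

      def transform(payload)
        # map the source system's shape to the target system's shape
      end
    end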

One of the side effects of Actions is code duplication. This was a trade-off we accepted: given that integrations should act independently, we would rather duplicate some code than tightly couple integrations.

Design priorities covered: transparent, minimal

Embracing Hashes

Dataflows are the purpose of our application. In every integration we are dealing with at least two API abstractions of complex systems. Introducing our own abstractions on top of these abstractions can quickly compound complexity. If we want the application to be transparent, it needs to be obvious what data is flowing through it and how the data is being modified.

Most of the data we’re working with is JSON. In Ruby, JSON is represented as a hash, so working with hashes directly often provides the best transparency with the least room for introducing errors.

I know, I know. We all hate to see strings in our code, but hear me out. You receive a JSON payload. You need to transform it and send out another JSON payload with different keys. You could map the original payload to an object, map that object to another object, and map the final object back to JSON. If you want to track that transformation, though, you need to track it through three transformations. On the other hand, you could use a hash and a transform function and have the mapping clearly displayed.

Using hashes leads to more transparency than abstracting them away, but it can also lead to typos and therefore errors, so it’s important to be careful. If you’re using a string key multiple times, turn it into a constant.
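
A minimal sketch of that approach (the field names are hypothetical); the whole mapping is visible in one place, and the repeated key lives in a constant:

    WORK_EMAIL = "Work Email"

    def transform(source)
      {
        "employeeName" => "#{source["First Name"]} #{source["Last Name"]}",
        "email" => source[WORK_EMAIL],
      }
    end

    transform({ "First Name" => "Ada", "Last Name" => "Lovelace", WORK_EMAIL => "ada@example.com" })
    # => { "employeeName" => "Ada Lovelace", "email" => "ada@example.com" }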

Design priorities covered: transparent, minimal

Low-level Mocking

At Shopify, we generally use Mocha for mocking, but for our use case we default to WebMock. WebMock mocks at the request level, so you see the URL (including query parameters), headers, and request body explicitly in tests. This makes it easy to work directly with a business system’s API documentation, because that’s the level at which the API is documented, and it allows us to understand exactly what our integrations are doing.
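
For instance, a stub might look like this (URL and payload hypothetical), with everything the integration sends spelled out at the request level:

    require "webmock/minitest"

    stub_request(:post, "https://api.example.com/employees")
      .with(
        headers: { "Content-Type" => "application/json" },
        body: { "employeeName" => "Ada Lovelace" }.to_json
      )
      .to_return(status: 201, body: "", headers: {})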

There are some cases, though, where we use Mocha, for example with SOAP. Reading a giant XML text string doesn’t provide useful visibility into what data is being sent. WebMock tests also become complex when many requests are involved in the integration. We’re working on improving the testing experience for complex integrations with common factories and prebuilt WebMocks.

Design priorities covered: transparent

Shopify

Perhaps most importantly, we’ve been able to tap into development at Shopify by leveraging our:

  • infrastructure, so all we have to do to stand up an application or add a component is run dev runtime
  • training team to help onboard our developers
  • developer pipeline for hiring
  • observability through established logging, metrics, and tracing setups
  • internal shipment tracking service
  • security team standards and best practices

The list could go on forever.

Design priorities covered: reusable, secure

It’s been a year since work on our Rails integration platform began. Now, we have 18 integrations running, have migrated all our Mulesoft apps to the new platform, have doubled the number of developers from one to two, and have other teams building integrations on the platform. The current setup enables us to build simple integrations, the majority of our use case, quickly and securely with minimal maintenance. We’re continuing to work on ways to minimize and simplify the development process, while supporting increased complexity, without harming transparency. We’re currently focused on improving test mock management and the onboarding process and, of course, building new integrations.

Will is a Senior Developer on the Solutions Engineering Team. He likes building systems that free people to focus on creative, iterative, connective work by taking advantage of computers' scalability and consistency.


Möbius: Shopify’s Unified Edge

While working on improvements to how the Shopify platform handles traffic, we identified that, over time, each change became more and more challenging. When deploying a feature, if we wanted the whole platform to benefit from it, we couldn’t simply build it “one way fits all”—we already had more than six different ways for traffic to reach us. To list a few, we had a traffic path for:

  • “general population” (GenPop) Shopify Core: the monolith serving shops that’s already using an edge provider
  • GenPop Shopify applications and services
  • GenPop Shopify applications and services that are already using an edge provider
  • Personally identifiable information (PII) restricted Shopify Core: where we need to make sure that the traffic and related data are kept in a specific area of the world
  • PII restricted Shopify applications and services
  • publicly accessible APIs that require Mutual Transport Layer Security (mTLS) authentication.
LOTR meme: "How many ways are there for traffic to reach Shopify?"

We had to choose which traffic paths could use those new features or build the same feature in more than six different ways. Moreover, having many different traffic paths means that observability gets blurred: when we receive requests and something doesn’t work properly, figuring out why and how to fix it requires more time. It also takes longer to onboard new team members to all of those possibilities and how to distinguish between them.

LOTR meme: One edge to reach them all, one edge to secure them; One edge to bring them all and in the clusters bind them.

This isn’t the way to build a highway to new features and improvements. I’d like to tell you why and how we built the one edge to front our services and systems.

One Does Not Simply Know What “The Edge” Stands For


The most straightforward definition of the edge, or network edge, is the point at which an enterprise-owned network connects to a third-party network. With cloud computing, the lines are slightly blurred, as we use third parties to provide us with servers and networks (even more so when using a provider to front your network, as we do at Shopify). But in both cases, as long as those servers and networks are used and controlled by Shopify, they’re considered part of our network.

The edge of Shopify is where requests from outside our network are made to reach our network.

The Fellowship of the Edge

Unifying our edge became our next objective, and two projects were born to make this possible: Möbius, which, as the name taken from the “Möbius strip” suggests, was to be the one edge of Shopify; and Shopify Front End (SFE), the routing layer that receives traffic from Möbius and dispatches it to where it needs to go.

Möbius’s traffic path takes requests from the internet to the routing layer and then sends traffic to the application’s clusters for traffic to be served. Purple entities are on the traffic path for PII restricted traffic, while the beige ones are for the GenPop traffic.

About a year before starting Möbius, we already had a small number of applications handled through our edge, but we saw limitations in how to properly automate such an approach at scale, even though the gains to the platform justified the monetary costs of reaching them. We designed SFE and Möbius together, leading to a better separation of concerns between the edge and the routing layers.

The Shopify Front End

SFE is designed to provide a unified routing layer behind Möbius. Deployed in many different regions, routing clusters can receive any kind of web traffic from Möbius, whether for Shopify Core or Applications. Those clusters are mainly nginx deployments with custom Lua code to handle the routing according to a number of criteria, including but not limited to the IP address a client connected to and the domain that was used to reach Shopify. For the PII restricted requirements, parallel deployments of the same routing clusters code are deployed in the relevant regions.

To handle traffic for applications and services, SFE uses a centralized API that receives requests from Kubernetes controllers deployed in every cluster running such applications and services. This allows linking the domain names declared by an application to the clusters where the application is deployed. We also use this to provide active/active (two instances of a given service can receive requests at the same time) or active/passive (only a single instance of a given service can receive requests) load balancing.

Providing load balancing at the routing layer instead of through DNS allows for near-instantaneous traffic changes, instead of depending on the Time to Live (TTL) of DNS records as described in my previous post. It avoids those decisions being made on the client side and thus provides us with better command and control over the traffic.

Möbius

Möbius’s core concerns are simple: we grab the traffic from outside of Shopify and make sure it makes its way inside of Shopify in a stable, secure, and performant manner. Outside of Shopify is any client connecting to Shopify from outside a Shopify cluster. Inside of Shopify is, as far as Möbius is concerned, the routing cluster with the lowest latency to the receiving edge’s point-of-presence (PoP).

Möbius is responsible for TLS and TCP termination with clients, and for doing that termination as close as possible to the client. This brings faster requests and better DDoS protection, and it allows us to filter malicious requests before the traffic even reaches our clusters. This was already done for our GenPop Shopify Core traffic, but Möbius now standardizes it. On top of handling the certificates for the shops, we added an automated path to handle certificates for application domains.

Configuration of the edge with Möbius and SFE. Domains updates are intercepted to update the edge provider’s domains and certificates store, making sure that we’re able to terminate TCP and TLS for those domains and let the request follow its path.

SFE already needs to be aware of the domains that applications respond to, so instead of building the same logic a second time to configure the edge, we piggybacked on the work the SFE controller was already doing. We added handlers in the centralized API to configure those domains at the edge, through API requests to our vendor, indicating that we expect to receive traffic on them and that requests should be forwarded to SFE. Our API handler takes care of any DNS challenge needed to validate that we own a domain, both so that traffic can start flowing and to obtain a valid certificate.

Prior to Möbius, if application owners wanted to take advantage of the edge, they had to configure their domain manually at the edge (validating ownership, obtaining a certificate, setting up the routing). Möbius provides full automation of that setup, allowing application owners to simply configure their ingress and DNS and reap the benefits of the edge right away.

Finally, it’s never easy to have many systems migrate to a new one, so we aimed to make the change as easy as possible for application owners. With automation deploying everything that was required, the last step was a simple DNS change for application domains, from targeting a direct-to-cluster record to targeting Möbius. We deliberately kept that change manual so that application owners own the process and can make sure that nothing gets broken.

Example dashboard for the shopify-debug.com application (accessible publicly and used for debugging connectivity issues with merchants). On the dashboard, we can find a link to its edge logs, see that the domains of the application are properly configured at the edge to receive traffic, and provide a TLS certificate. A test link also allows us to simulate a connection to the platform using that domain so the response can be verified manually.

To make sure all is fine for an application before (and after!) migration, we also added observability in the form of easy:

  • access to the logs for a given application at the edge
  • identification of which domains an application has configured at the edge
  • understanding of the status of those domains.

This allows owners of applications and services to immediately identify if one of their domains isn’t configured or behaving as expected.

Our Precious Edge

A drawing of Gollum's face (from Lord of the Rings) staring at a Mobius strip like it's the ring

On top of all the direct benefits that Möbius provides right away, it allows us to build the future of Shopify’s edge. Different teams are already working on improvements to the way we do caching at the edge, for instance, or on ways to use other edge features that we’re not already taking advantage of. We also have ongoing projects that take advantage of SFE to handle cluster-to-cluster communications, keeping that traffic from going out through the edge and coming back to our clusters.

Using new edge features and standardizing internal communications is possible because we unified the edge. There are exceptions, though: applications and services that Möbius or SFE depend on to function can’t themselves be onboarded, to avoid cross-dependency. If we onboarded them, whenever an issue happened we would be in a crash-loop situation: Möbius/SFE requires that application to work, but that application requires Möbius/SFE to work.

It’s now way easier to explain to new Shopifolk how traffic reaches us and what happens between a client and Shopify. There’s no need for as many conditionals in those explanations, nor as many whiteboards… but we might need more of those to explain all that we do as we grow the capabilities on our now-unified edge!

Raphaël Beamonte holds a Ph.D. in Computer Engineering, specializing in systems performance analysis and tracing, and sometimes gives lectures to future engineers at Polytechnique Montréal about Distributed Systems and Cloud Computing.


Code Ranges: A Deeper Look at Ruby Strings

Contributing to any of the Ruby implementations can be a daunting task. A lot of internal functionality has evolved over the years or been ported from one implementation to another, and much of it is undocumented. This post is an informal look at what makes encoding-aware strings in Ruby functional and performant. I hope it'll help you get started digging into Ruby on your own or provide some additional insight into all the wonderful things the Ruby VM does for you.

Ruby has an incredibly flexible, if unusual, string representation. Ruby strings are generally mutable, although the core library has both immutable and mutable variants of many operations. There’s also a mechanism for freezing strings that makes String objects immutable on a per-object or per-file basis. If a string literal is frozen, the VM will use an interned version of the string. Additionally, strings in Ruby are encoding-aware, and Ruby ships with 100+ encodings that can be applied to any string. This is in sharp contrast to other languages that use one universal encoding for all their strings or that prevent the construction of invalid strings.

Depending on the context, different encodings are applied when creating a string without an explicit encoding. By default, the three primary ones used are UTF-8, US-ASCII, and ASCII-8BIT (aliased as BINARY). The encoding associated with a string can be changed with or without validation. It is possible to create a string with an underlying byte sequence that is invalid in the associated encoding.
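
For example (a quick sketch):

    s = "café"
    s.encoding                    # => #<Encoding:UTF-8>

    # Change the encoding without validation...
    s.force_encoding("US-ASCII")

    # ...and the string's bytes are now invalid for its encoding.
    s.valid_encoding?             # => false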

The Ruby approach to strings allows the language to adapt to many legacy applications and esoteric platforms. The cost of this flexibility is the runtime overhead necessary to consider encodings in nearly all string operations. When two strings are appended, their encodings must be checked to see if they're compatible. For some operations, it's critical to know whether the string has valid data for its attached encoding. For other operations, it's necessary to know where the character or grapheme boundaries are.

Depending on the encoding, some operations are more efficient than others. If a string contains only valid ASCII characters, each character is one byte wide. Knowing each character is only a byte allows operations like String#[], String#chr, and String#downcase to be very efficient. Some encodings are fixed width—each "character" is exactly N bytes wide. (The term "character" is vague when it comes to Unicode. Ruby strings (as of Ruby 3.1) have methods to iterate over bytes, characters, code points, and grapheme clusters. Rather than get bogged down in the minutiae of each, I'll focus on the output from String#each_char and use the term "character" throughout.) Many operations with fixed-width encodings can be implemented efficiently because character offsets are trivial to calculate. In UTF-8, the default internal string encoding in Ruby (and many other languages), characters are variable width, requiring 1 to 4 bytes each. That generally complicates operations because it's not possible to determine character offsets, or even the total number of characters in the string, without scanning all of the bytes in the string. However, UTF-8 is backwards compatible with ASCII: if a UTF-8 string consists of only ASCII characters, each character will be one byte wide, and if the runtime knows that, it can optimize operations on such strings the same as if the string had the simpler ASCII encoding.

Code Ranges

In general, the only way to tell if a string consists of valid characters for its associated encoding is to do a full scan of all the bytes. This is an O(n) process, and while not the least efficient operation in the world, it is something we want to avoid. Languages that don't allow invalid strings only need to do the validation once, at string creation time. Languages that ahead-of-time (AOT) compile can validate string literals during compilation. Languages that only have immutable strings can guarantee that once a string is validated, it can never become invalid. Ruby has none of those properties, so its solution to reducing unnecessary string scans is to cache information about each string in a field known as a code range.

There are four code range values:

  • ENC_CODERANGE_UNKNOWN
  • ENC_CODERANGE_7BIT
  • ENC_CODERANGE_VALID
  • ENC_CODERANGE_BROKEN

The code range occupies an odd place in the runtime. As a place for the runtime to record profile information, it's an implementation detail. There is no way to request the code range directly from a string. However, since the code range records information about validity, it also impacts how some operations perform. Consequently, a few String methods allow you to derive the string's code range, allowing you to adapt your application accordingly.

The mappings are:

Code range               Ruby code equivalent
ENC_CODERANGE_UNKNOWN    No representation*
ENC_CODERANGE_7BIT       str.ascii_only?
ENC_CODERANGE_VALID      str.valid_encoding? && !str.ascii_only?
ENC_CODERANGE_BROKEN     !str.valid_encoding?

Table 1: Mapping of internal code range values to public Ruby methods.

* – Code ranges are lazily calculated in most cases. However, when requesting information about a property that a code range encompasses, the code range is calculated on demand. As such, you may pass strings around that have an ENC_CODERANGE_UNKNOWN code range, but asking for information about a string’s validity, or calling other methods that require the code range (such as a string’s character length), will calculate and cache the code range before returning a value to the caller.

Given its odd standing as somewhat implementation detail, somewhat not, every major Ruby implementation associates a code range with a string. If you ever work on a Ruby implementation’s internals or a native extension involving String objects, you’ll almost certainly end up working with, and potentially managing, the code range value.

Semantics

In MRI, the code range value is stored as an int in the object header, with bitmask flags representing the values. Each of the values is mutually exclusive. This is important to note because, logically, every string that has an ASCII-compatible encoding and consists of only ASCII characters is a valid string. However, such a string will never have a code range value of ENC_CODERANGE_VALID. You should use the ENC_CODERANGE(obj) macro to extract the code range value and then compare it against one of the defined code range constants, treating the code range constants essentially the same as an enum (e.g., if (cr == ENC_CODERANGE_7BIT) { ... }).

If you try to use the code range values as bitmasks directly, you’ll get confusing, difficult-to-debug results. Due to the way the masks are defined, if a string is annotated as being both ENC_CODERANGE_7BIT and ENC_CODERANGE_VALID, it will appear to be ENC_CODERANGE_BROKEN. Conversely, if you try to branch on a combined mask like if (cr & (ENC_CODERANGE_7BIT | ENC_CODERANGE_VALID)) { ... }, that will include ENC_CODERANGE_BROKEN strings. This is because the four values are represented by only two bits in the object header. The compact representation makes efficient use of the limited space in the object header but can be misleading to anyone used to working with bitmasks to match and set attributes.

To help illustrate the point a bit better, I've ported some of the relevant C code to Ruby (see Listing 1):
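
In MRI the two flag bits live at FL_USER8 and FL_USER9 in the object header; this sketch shifts them down to the low two bits for readability:

    # The four code range values fit in two bits. BROKEN is the other
    # two masks combined, which is why the constants can't be treated
    # as independent bitmasks.
    ENC_CODERANGE_UNKNOWN = 0                                        # 0b00
    ENC_CODERANGE_7BIT    = 1                                        # 0b01
    ENC_CODERANGE_VALID   = 2                                        # 0b10
    ENC_CODERANGE_BROKEN  = ENC_CODERANGE_7BIT | ENC_CODERANGE_VALID # 0b11
    ENC_CODERANGE_MASK    = ENC_CODERANGE_7BIT | ENC_CODERANGE_VALID

    # The ENC_CODERANGE(obj) macro: extract the code range from the
    # header flags.
    def enc_coderange(flags)
      flags & ENC_CODERANGE_MASK
    end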

Listing 1: MRI's original C code range representation recreated in Ruby.

JRuby has a very similar implementation to MRI, storing the code range value as an int compactly within the object header, occupying only two bits. In TruffleRuby, the code range is represented as an enum and stored as an int in the object’s shape. The enum representation takes up additional space but prevents the class of bugs that comes from misapplying bitmasks.

String Operations and Code Range Changes

The object's code range is a function of both its sequence of bytes and the encoding associated with the object to interpret those bytes. Consequently, when either the bytes change or the encoding changes, the code range value has the potential to be invalidated. When such an operation occurs, the safest thing to do is to perform a complete code range scan of the resulting string. To the best of our ability, however, we want to avoid recalculating the code range when it is not necessary to do so.

MRI avoids unnecessary code range scans via two primary mechanisms. The first is to simply scan for the code range lazily by changing the string's code range value to ENC_CODERANGE_UNKNOWN. When an operation is performed that needs to know the real code range, MRI calculates it on demand and updates the cached code range with the new result. If the code range is never needed, it's never calculated. (MRI will calculate the code range eagerly when doing so is cheap. In particular, when lexing a source file, MRI already needs to examine every byte in a string and be aware of the string's encoding, so taking the extra step to discover and record the code range value is rather cheap.)

The second way MRI avoids code range scans is to reason about the code range values of any strings being operated on and how an operation might result in a new code range. For example, when working with strings with an ENC_CODERANGE_7BIT code range value, most operations can preserve the code range value since all ASCII characters stay within the 0x00 - 0x7f range. Whether you take a substring, change the casing of characters, or strip off whitespace, the resulting string is guaranteed to also have the ENC_CODERANGE_7BIT value, so performing a full code range scan would be wasteful. The code in Listing 2 demonstrates some operations on a string with an ENC_CODERANGE_7BIT code range and how the resulting string always has the same code range.
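
For example (a sketch using the public methods from Table 1):

    str = "shopify"
    str.ascii_only?          # => true (ENC_CODERANGE_7BIT)

    str.upcase.ascii_only?   # => true
    str[0, 4].ascii_only?    # => true
    str.strip.ascii_only?    # => true
    # Every result is guaranteed to be 7BIT, so no byte scan is needed.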

Listing 2: Changing the case of a string with an ENC_CODERANGE_7BIT code range will always result in a string that also has an ENC_CODERANGE_7BIT code range.

Sometimes the code range value on its own is insufficient for a particular optimization, in which case MRI will consider additional context. For example, MRI tracks whether a string is "single-byte optimizable." A string is single-byte optimizable if its code range is ENC_CODERANGE_7BIT or if the associated encoding uses characters that are only one-byte wide, such as is the case with the ASCII-8BIT/BINARY encoding used for I/O. If a string is single-byte optimizable, we know that String#reverse must retain the same code range because each byte corresponds to a single character, so reversing the bytes can't change their meaning.

Unfortunately, the code range is not always easily derivable, particularly when the string’s code range is ENC_CODERANGE_VALID or ENC_CODERANGE_BROKEN, in which case a full code range scan may prove to be necessary. Operations performed on a string with an ENC_CODERANGE_VALID code range might result in an ENC_CODERANGE_7BIT string if the source string’s encoding is ASCII-compatible; otherwise, they result in a string with an ENC_CODERANGE_VALID code range. (We’ve deliberately set aside the case of String#setbyte, which could cause a string to have an ENC_CODERANGE_BROKEN code range value. Generally, string operations in Ruby are well-defined and won’t result in a broken string.) In Listing 3, you can see some examples of operations performed against a string with an ENC_CODERANGE_VALID code range resulting in strings with either an ENC_CODERANGE_7BIT code range or an ENC_CODERANGE_VALID code range.
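
For example (again, a sketch):

    str = "café"
    str.valid_encoding?            # => true
    str.ascii_only?                # => false (ENC_CODERANGE_VALID)

    str.sub("é", "e").ascii_only?  # => true  (the result is 7BIT)
    str.upcase.ascii_only?         # => false (the result is still VALID)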

Listing 3: Changing the case of a string with an ENC_CODERANGE_VALID code range might result in a string with a different code range.

Since the source string may have an ENC_CODERANGE_UNKNOWN value and the operation may not need the resolved code range, such as String#reverse called on a string with the ASCII-8BIT/BINARY encoding, it's possible to generate a resulting string that also has an ENC_CODERANGE_UNKNOWN code range. That is to say, it's quite possible to have a string that is ASCII-only but which has an unknown code range that, when operated on, still results in a string that may need to have a full code range scan performed later on. Unfortunately, this is just the trade-off between lazily computing code ranges and deriving the code range without resorting to a full byte scan of the string. To the end user, there is no difference because the code range value will be computed and be accurate by the time it is needed. However, if you're working on a native extension, a Ruby runtime's internals, or are just profiling your Ruby application, you should be aware of how a code range can be set or deferred.

TruffleRuby and Code Range Derivations

As a slight digression, I'd like to take a minute to talk about code ranges and their derivations in TruffleRuby. Unlike other Ruby implementations, such as MRI and JRuby, TruffleRuby eagerly computes code range values so that strings never have an ENC_CODERANGE_UNKNOWN code range value. The trade-off that TruffleRuby makes is that it may calculate code range values that are never needed, but string operations are simplified by never needing to calculate a code range on-demand. Moreover, TruffleRuby can derive the code range of an operation's result string without needing to perform a full byte scan in more situations than MRI or JRuby can.

While eagerly calculating the code range may seem wasteful, it amortizes very well over the lifetime of a program due to TruffleRuby's extensive reuse of string data. TruffleRuby uses ropes as its string representation, a tree-based data structure where the leaves look like a traditional C-style string, while interior nodes represent string operations linking other ropes together. (If you go looking for references to "rope" in TruffleRuby, you might be surprised to see they're mostly gone. TruffleRuby still very much uses ropes, but the TruffleRuby implementation of ropes was promoted to a top-level library in the Truffle family of language implementations, which TruffleRuby has adopted. If you use any other language that ships with the GraalVM distribution, you're also using what used to be TruffleRuby's ropes.) A Ruby string points to a rope, and a rope holds the critical string data.

For instance, on a string concatenation operation, rather than allocate a new buffer and copy data into it, with ropes we create a "concat rope" with each of the strings being concatenated as its children (see Fig.1). The string is then updated to point at the new concat rope. While that concat rope does not contain any byte data (delegating that to its children), it does store a code range value, which is easy to derive because each child rope is guaranteed to have both a code range value and an associated encoding object.

Figure 1: A sample rope for the result of “Hello ” + “François”

Moreover, rope metadata are immutable, so getting a rope's code range value will never incur more overhead than a field read. TruffleRuby takes advantage of that property to use ropes as guards in inline caches for its JIT compiler. Additionally, TruffleRuby can specialize string operations based on the code ranges for any argument strings. Since most Ruby programs never deal with ENC_CODERANGE_BROKEN strings, TruffleRuby's JIT will eliminate any code paths that deal with that code range. If a broken string does appear at runtime, the JIT will deoptimize and handle the operation on a slow path, preserving Ruby's full semantics. Likewise, while Ruby supports 100+ encodings out of the box, the TruffleRuby JIT will optimize a Ruby application for the small number of encodings it uses.

A String By Any Other Name

Often string performance discussions are centered around web template rendering or text processing. While important use cases, strings are also used extensively within the Ruby runtime. Every symbol or regular expression has an associated string, and they're consulted for various operations. The real fun comes with Ruby's metaprogramming facilities: strings can be used to access instance variables, look up methods, send messages to objects, evaluate code snippets, and more. Improvements (or degradations) in string performance can have large, cascading effects.

Backing up a step, I don't want to oversell the importance of code ranges for fast metaprogramming. They are an ingredient in a somewhat involved recipe. The code range can be used to quickly disqualify strings known not to match, such as those with the ENC_CODERANGE_BROKEN code range value. In the past, the code range was used to fail fast when particular identifiers were only allowed to be ASCII-only. While not currently implemented in MRI, such a check could be used to dismiss strings with the ENC_CODERANGE_VALID code range when all identifiers are known to be ENC_CODERANGE_7BIT, and vice versa. However, once a string passes the code range check, there's still the matter of seeing if it matches an identifier (instance variable, method, constant, etc.). With TruffleRuby, that check can be satisfied quickly because its immutable ropes are interned and can be compared by reference. In MRI and JRuby, the equality check may involve a linear pass over the string data as the string is interned. Even that process gets murky depending on whether you're working with a dynamically generated string or a frozen string literal. If you're interested in a deep dive on the difficulties and solutions to making metaprogramming fast in Ruby, Chris Seaton has published a paper about the topic and I've presented a talk about it at RubyKaigi.

Conclusion

More so than many other contemporary languages, Ruby exposes functionality that is difficult to optimize but which grants developers a great deal of expressivity. Code ranges are a way for the VM to avoid repeated work and optimize operations on a per-string basis, guiding away from slow paths when that functionality isn't needed. Historically, that benefit has been most keenly observed when running in the interpreter. When integrated with a JIT with deoptimization capabilities, such as TruffleRuby, code ranges can help eliminate generated code for the types of strings used by your application and the VM internally.

Knowing what code ranges are and what they're used for can help you debug issues, both for performance and correctness. At the end of the day, a code range is a cache, and like all caches, it may contain the wrong value. While such instances within the Ruby VM are rare, they're not unheard of. More commonly, a native extension manipulating strings may fail to update a string's code range properly. Hopefully, with a firm understanding of code ranges, you’ll find Ruby's handling of strings less daunting.

Kevin is a Staff Developer on the Ruby & Rails Infrastructure team where he works on TruffleRuby. When he’s not working on Ruby internals, he enjoys collecting browser tabs, playing drums, and hanging out with his family.


Leveraging Shopify’s API to Build the Latest Marketplace Kit

In February, we released the second version of Marketplace Kit: a collection of boilerplate app code and documentation allowing third-party developers to integrate with Shopify, build selected commerce features and launch a world-class marketplace in any channel.

Previously, we used the node app generated by the Shopify command-line interface (CLI) as a foundation. However, this approach came with two drawbacks: changes to the Shopify CLI would affect our code and documentation, and we had limited control over best practices because we were locked into the CLI's node app dependencies.

Since then, we've decoupled code from the Shopify CLI and separated the Marketplace Kit sample app into two separate apps: a full-stack admin app and a buyer-facing client app. For these, we chose dependencies that were widely used, such as Express and NextJS, to appeal to the largest number of partners possible. Open-sourced versions of the apps are publicly available for anyone to try out.

Here are a few ways we leveraged Shopify’s APIs to create the merchant-facing Marketplace admin app for version 2.0.

Before We Get Started

Here’s a brief overview of app development at Shopify. The most popular server-side technology used with the Shopify CLI to ease the development of apps is Node JS, a server-side JavaScript runtime. That’s why we used it for the Marketplace Kit’s sample admin app. With Node JS, we use a web framework library called Express JS, chosen for reasons such as ease of use and worldwide popularity.

On the client side of the admin and buyer apps, we use the main JavaScript frontend library at Shopify: React JS. In the buyer-side app, we chose Next JS, a framework for React JS that mainly provides structure to the application, as well as built-in features like server-side rendering and TypeScript support. When sharing data between frontend and backend apps, we use GraphQL, along with the helper libraries Apollo Client and Apollo Server for ease of integration.

It’s also helpful to have familiarity with some key web development concepts, such as JSON Web Token (JWT) authentication, and with vanilla JavaScript, the plain scripting language of web pages, used without any frameworks on top.

JWT Authentication with App Bridge

Let’s start with how we chose to handle authentication in our apps as a working example of using an internal library to ease development within Shopify. App Bridge is a standalone library offering React component wrappers for some app actions. It provides an out-of-the-box solution for embedding your app inside of the Shopify admin, Shopify POS, and Shopify Mobile. Since we’re using React for our embedded channel admin app, we leveraged additional App Bridge imports to handle authentication. Here is a client-side example from the channels admin app:
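
A sketch of that wrapper (simplified; the reauthorization header names follow @shopify/shopify-api’s conventions, and app comes from the useAppBridge hook in the calling component):

    import { authenticatedFetch } from "@shopify/app-bridge-utils";
    import { Redirect } from "@shopify/app-bridge/actions";

    export function userLoggedInFetch(app) {
      const fetchFunction = authenticatedFetch(app);

      return async (uri, options) => {
        const response = await fetchFunction(uri, options);

        // If the session is no longer valid, send the merchant back
        // through the OAuth flow instead of failing silently.
        if (response.headers.get("X-Shopify-API-Request-Failure-Reauthorize") === "1") {
          const authUrl = response.headers.get("X-Shopify-API-Request-Failure-Reauthorize-Url");
          Redirect.create(app).dispatch(Redirect.Action.APP, authUrl || "/auth");
          return null;
        }

        return response;
      };
    }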

The app object, returned by the useAppBridge hook, is used to pass contextual information about the embedded app. We chose to wrap the authenticatedFetch function inside a function with custom auth redirecting. Notice that authenticatedFetch does many things under the hood. Notably, it adds two HTTP headers: Authorization, with a JWT session token created on demand, and X-Requested-With, set to XMLHttpRequest (which basically narrows down the request type and improves security).

This is the server-side snippet that handles the session token. It resides in our main server file, where we define our spec-compliant GraphQL server, using an Express app as middleware. Within the configuration of our ApolloServer’s context property, you'll see how we handle the auth header:
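
A sketch of that configuration (assuming @shopify/shopify-api’s Shopify.Utils helpers; typeDefs and resolvers are defined elsewhere):

    import Shopify from "@shopify/shopify-api";
    import { ApolloServer } from "apollo-server-express";

    const graphQLServer = new ApolloServer({
      typeDefs,
      resolvers,
      context: async ({ req }) => {
        // Decode the App Bridge JWT from the Authorization header...
        const token = req.headers.authorization.replace("Bearer ", "");
        const payload = Shopify.Utils.decodeSessionToken(token);

        // ...then load the session it refers to, which holds the
        // store's access token.
        const shop = payload.dest.replace("https://", "");
        const session = await Shopify.Utils.loadOfflineSession(shop);

        return { shop, accessToken: session.accessToken };
      },
    });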

Notice how we leverage Shopify’s Node API to decode the session token and then to load the session data, providing us with the store’s access token. Fantastic!

Quick tip: To add more stores, you can switch out the store value in .env and run the Shopify CLI's shopify app serve command!

Serving REST & GraphQL With Express

In our server-side code, we use the apollo-server-express package instead of simply using apollo-server:
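
A sketch of the wiring (again with typeDefs and resolvers defined elsewhere):

    import express from "express";
    import { ApolloServer } from "apollo-server-express";

    const app = express();
    const graphQLServer = new ApolloServer({ typeDefs, resolvers });

    await graphQLServer.start();

    // Serve GraphQL as middleware on the same Express app that also
    // serves our REST routes and webhooks.
    graphQLServer.applyMiddleware({ app });

    app.listen(8081);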

The setup for a GraphQL server using the express-specific package is quite similar to how we would do it with the barebones default package. The difference is that we apply the Apollo Server instance as middleware to an Express HTTP instance with graphQLServer.applyMiddleware({ app }) (or whatever you named your instance).

If you look at the entire file, you’ll see that the webhooks and routes for the Express application are added after starting the GraphQL server. The advantage of using the apollo-server-express package over apollo-server is being able to serve REST and GraphQL at the same time using Express. Serving GraphQL within Express allows us to use Node middleware for common concerns like rate limiting, security, and authentication. The trade-off is a little more boilerplate, but since apollo-server is itself a wrapper around the Express-specific package, there’s no noticeable performance difference.

Check out the Apollo team’s blog post Using Express with GraphQL to read more.

Custom Client Wrappers

Here’s an example of custom API clients for data fetching from Shopify’s Node API, applying GraphQL and REST:
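
A sketch (the helper names are hypothetical, and USER_AGENT_PREFIX is assumed from @shopify/shopify-api’s context options):

    import Shopify from "@shopify/shopify-api";
    import { version } from "../package.json";

    // Tag every API request with a project-specific User-Agent,
    // including the npm package version.
    Shopify.Context.initialize({
      // ...API key, secret, scopes, host, and so on
      USER_AGENT_PREFIX: `marketplace-kit-admin/${version}`,
    });

    // Hypothetical helpers returning configured clients.
    export function graphqlClient(session) {
      return new Shopify.Clients.Graphql(session.shop, session.accessToken);
    }

    export function restClient(session) {
      return new Shopify.Clients.Rest(session.shop, session.accessToken);
    }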

This allows for easier control of our request configuration, like adding custom User-Agent headers with a unique header title for the project, including its npm package version.

Although Shopify generally encourages using GraphQL, sometimes it makes sense to use REST. In the admin app, we used it for one call: getting the product listings count. There was no need to create a specific query when an HTTP GET request contains all the information required; using GraphQL would offer no advantage. It’s also a good example of using REST in the application, ensuring developers who use the admin app as a starting point see an example that takes advantage of both ways to fetch data, depending on what’s best for the situation.

Want to Challenge Yourself?

For full instructions on getting started with Marketplace Kit, check out our official documentation. To give you an idea, here are screenshots of the embedded admin app and the buyer app, in that order, upon completion of the tutorials in the docs.

For articles aimed at partners, check out the Shopify Partners blog, where we cover more content related to app development at Shopify.

Kenji Duggan is a Backend Developer Intern at Shopify, working on the Strategic Partners team under Marketplace Foundation. Recently, when he’s not learning something new as a web developer, he is probably working out or watching anime.


The Magic of Merlin: Shopify’s New Machine Learning Platform

Shopify's machine learning platform team builds the infrastructure, tools and abstracted layers to help data scientists streamline, accelerate and simplify their machine learning workflows. There are many different kinds of machine learning use cases at Shopify, internal and external. Internal use cases are being developed and used in specialized domains like fraud detection and revenue predictions. External use cases are merchant and buyer facing, and include projects such as product categorization and recommendation systems.

At Shopify we build for the long term, and last year we decided to redesign our machine learning platform. We need a machine learning platform that can handle different (often conflicting) requirements, inputs, data types, dependencies and integrations. The platform should be flexible enough to support the different aspects of building machine learning solutions in production, and enable our data scientists to use the best tools for the job.

In this post, we walk through how we built Merlin, our magical new machine learning platform. We dive into the architecture, working with the platform, and a product use case.

The Magic of Merlin

Our new machine learning platform is based on an open source stack and technologies. Using open source tooling end-to-end was important to us because we wanted to both draw from and contribute to the most up-to-date technologies and their communities, as well as retain the agility to evolve the platform to our users’ needs.

Merlin’s objective is to enable Shopify's teams to train, test, deploy, serve and monitor machine learning models efficiently and quickly. In other words, Merlin enables:

  1. Scalability: robust infrastructure that can scale up our machine learning workflows
  2. Fast Iterations: tools that reduce friction and increase productivity for our data scientists and machine learning engineers by minimizing the gap between prototyping and production
  3. Flexibility: users can use any libraries or packages they need for their models

For the first iteration of Merlin, we focused on enabling training and batch inference on the platform.

Merlin Architecture

A high level diagram of Merlin’s architecture

Merlin gives our users the tools to run their machine learning workflows. Typically, large scale data modeling and processing at Shopify happens in other parts of our data platform, using tools such as Spark. The data and features are then saved to our data lake or Pano, our feature store. Merlin uses these features and datasets as inputs to the machine learning tasks it runs, such as preprocessing, training, and batch inference.

With Merlin, each use case runs in a dedicated environment that can be defined by its tasks, dependencies and required resources — we call these environments Merlin Workspaces. These dedicated environments also enable distributed computing and scalability for the machine learning tasks that run on them. Behind the scenes, Merlin Workspaces are actually Ray clusters that we deploy on our Kubernetes cluster, and they’re designed to be short-lived for batch jobs, as processing only happens for a certain amount of time.

We built the Merlin API as a consolidated service to allow the creation of Merlin Workspaces on demand. Our users can then use their Merlin Workspace from Jupyter Notebooks to prototype their work, or orchestrate it through Airflow or Oozie.

Merlin’s architecture, and Merlin Workspaces in particular, are enabled by one of our core components—Ray.

What Is Ray?

Ray is an open source framework that provides a simple, universal API for building distributed systems and tools to parallelize machine learning workflows. It has a large ecosystem of applications, libraries, and tools dedicated to machine learning, such as distributed scikit-learn, XGBoost, TensorFlow, and PyTorch.

When using Ray, you get a cluster that enables you to distribute your computation across multiple CPUs and machines. In the following example, we train a model using Ray:
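
Something along these lines (a sketch using the xgboost_ray integration; the dataset path and parameters are hypothetical):

    import ray
    from xgboost_ray import RayDMatrix, RayParams, train

    # Start a local Ray runtime, or connect to an existing cluster.
    ray.init()

    # Features and labels read from a (hypothetical) Parquet dataset.
    train_set = RayDMatrix("features.parquet", label="label")

    # Distribute training across 4 Ray actors, each using 8 CPUs.
    booster = train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        train_set,
        ray_params=RayParams(num_actors=4, cpus_per_actor=8),
    )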

We start by importing the Ray package. We call ray.init() to start a new Ray runtime that can run either on a laptop/machine or connect to an existing Ray cluster locally or remotely. This enables us to seamlessly take the same code that runs locally, and run it on a distributed cluster. When working with a remote Ray cluster, we can use the Ray Client API to connect to it and distribute the work.

In the example above, we use the integration between Ray and XGBoost to train a new model and distribute the training across a Ray cluster by defining the number of Ray actors for the job and the resources each Ray actor will use (CPUs, GPUs, etc.).

For more information, details and examples for Ray usage and integrations, check out the Ray documentation.

Ray In Merlin

At Shopify, machine learning development is usually done using Python. We chose to use Ray for Merlin's distributed workflows because it enables us to write end-to-end machine learning workflows with Python, integrate it with the machine learning libraries we use at Shopify and easily distribute and scale them with little to no code changes. In Merlin, each machine learning project comes with the Ray library as part of its dependencies, and uses it for distributed preprocessing, training and prediction.

Ray makes it easy for data scientists and machine learning engineers to move from prototype to production. Our users start by prototyping on their local machines or in a Jupyter Notebook. Even at this stage, their work can be distributed on a remote Ray cluster, allowing them to run the code at scale from an early stage of development.

Ray is a fast evolving open source project. It has short release cycles and the Ray team is continuously adding and working on new features. In Merlin, we adopted capabilities and features such as:

  • Ray Train: a library for distributed deep learning which we use for training our TensorFlow and PyTorch models
  • Ray Tune: a library for experiment execution and hyperparameter tuning
  • Ray Kubernetes Operator: a component for managing deployments of Ray on Kubernetes and autoscaling Ray clusters

Building On Merlin

A diagram of the user’s development journey in Merlin

A user’s first interaction with Merlin usually happens when they start a new machine learning project. Let’s walk through a user’s development journey:

  1. Creating a new project: The user starts by creating a Merlin Project where they can place their code and specify the requirements and packages they need for development
  2. Prototyping: Next, the user will create a Merlin Workspace, the sandbox where they use Jupyter notebooks to prototype in a distributed and scalable environment
  3. Moving to Production: When the user is done prototyping, they can productionize their project by updating their Merlin Project with the updated code and any additional requirements
  4. Automating: Once the Merlin Project is updated, the user can orchestrate and schedule their workflow to run regularly in production
  5. Iterating: When needed, the user can iterate on their project by spinning up another Merlin Workspace and prototyping with different models, features, parameters, etc.

Let's dive a little deeper into these steps.

Merlin Projects

The first step of each machine learning use case on our platform is creating a dedicated Merlin Project. Users can create Merlin Projects for machine learning tasks like training a model or performing batch predictions. Each project can be customized to fit its needs by specifying the system-level packages or Python libraries required for development. From a technical perspective, a Merlin Project is a Docker container with a dedicated virtual environment (e.g., Conda or pyenv), which isolates code and dependencies. As the project requirements change, the user can update their Merlin Project to fit their new needs. Our users can leverage a simple-to-use command line interface that allows them to create, define, and use their Merlin Project.

Below is an example of a Merlin Project file hierarchy:
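
A sketch of the layout (the file names under src are hypothetical):

    product_categorization/
    ├── config.yml        # dependencies and machine learning libraries
    └── src/
        ├── preprocess.py
        ├── train.py
        └── predict.py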

The config.yml file allows users to specify the different dependencies and machine learning libraries that they need for their use case. All the code relevant to a specific use case is stored in the src folder.

Once users push their Merlin Project code to their branch, our CI/CD pipelines build a custom Docker image.

Merlin Workspaces

Once the Merlin Project is ready, our data scientists can use the centralized Merlin API to create dedicated Merlin Workspaces in prototype and production environments. The interface abstracts away all of the infrastructure-related logic (e.g. deployment of Ray clusters on Kubernetes, creation of ingress, service accounts) so they can focus on the core of the job.

A high level architecture diagram of Merlin Workspaces

Merlin Workspaces also allow users to define the resources required for running their project. While some use cases need GPUs, others might need more memory and additional CPUs or more machines to run on. The Docker image that was created for a Merlin Project will be used to spin up the Ray cluster in a dedicated Kubernetes namespace for a Merlin Workspace. The user can configure all of this through the Merlin API, which gives them either a default environment or allows them to select the specific resource types (GPUs, memory, machine types, etc.) that their job requires.

Here’s an example of a payload that we send the Merlin API in order to create a Merlin Workspace:
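
A sketch of what such a payload could look like; the field names and image path are hypothetical, but the values match the workspace described below:

```json
{
  "name": "product-categorization",
  "image": "registry.internal/merlin/product-categorization:latest",
  "workers": 20,
  "max_workers": 30,
  "resources_per_worker": {
    "cpu": 10,
    "memory": "30Gi",
    "gpu": {"type": "nvidia-tesla-t4", "count": 1}
  }
}
```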

Using this payload will result in a new Merlin Workspace which will spin up a new Ray cluster with the specific pre-built Docker image of one of our models at Shopify—our product categorization model, which we’ll dive into more later on. This cluster will use 20 Ray workers, each one with 10 CPUs, 30GB of memory and 1 nvidia-tesla-t4 GPU. The cluster will be able to scale up to 30 workers.

After the job is complete, the Merlin Workspace can be shut down, either manually or automatically, and return the resources back to the Kubernetes cluster.

Prototyping From Jupyter Notebooks

Once our users have their Merlin Workspace up and running, they can start prototyping and experimenting with their code from Shopify’s centrally hosted JupyterHub environment. This environment allows them to spin up a new machine learning notebook using their Merlin Project's Docker image, which includes all their code and dependencies that will be available in their notebook.

An example of how our users can create a Merlin Jupyter Notebook

From the notebook, the user can access the Ray Client API to connect remotely to their Merlin Workspaces. They can then run their remote Ray Tasks and Ray Actors to parallelize and distribute the computation work on the Ray cluster underlying the Merlin Workspace.

This method of working with Merlin minimizes the gap between prototyping and production by providing our users with the full capabilities of Merlin and Ray right from the beginning.

Moving to Production

Once the user is done prototyping, they can push their code to their Merlin Project. This will kick off our CI/CD pipelines and create a new version of the project's Docker image.

Merlin was built to be fully integrated with the tools and systems that we already have in place to process data at Shopify. Once the Merlin Project's production Docker image is ready, the user can build the orchestration around their machine learning flows using declarative YAML templates or by configuring a DAG (directed acyclic graph) in our Airflow environment. The jobs can be scheduled to run periodically; they call the production Merlin API to spin up Merlin Workspaces and run Merlin jobs on them.

A simple example of an Airflow DAG running a training job on Merlin

The DAG in the image above demonstrates a training flow, where we create a Merlin Workspace, train our model on it and—when it’s done—delete the workspace and return the resources back to the Kubernetes cluster.
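
A minimal sketch of such a DAG, assuming Airflow 2's PythonOperator; the merlin_client helpers are hypothetical stand-ins for calls to the production Merlin API:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

import merlin_client  # hypothetical wrapper around the Merlin API

with DAG(
    dag_id="product_categorization_training",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    create_workspace = PythonOperator(
        task_id="create_workspace",
        python_callable=merlin_client.create_workspace,
    )
    train_model = PythonOperator(
        task_id="train_model",
        python_callable=merlin_client.run_training_job,
    )
    delete_workspace = PythonOperator(
        task_id="delete_workspace",
        python_callable=merlin_client.delete_workspace,
    )

    # Create the workspace, train on it, then return the resources to Kubernetes.
    create_workspace >> train_model >> delete_workspace
```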

We also integrated Merlin with our monitoring and observability tools. Each Merlin Workspace gets its own dedicated Datadog dashboard which allows users to monitor their Merlin job. It also helps them understand more about the computation load of their job and the resources it requires. On top of this, each Merlin job sends its logs to Splunk so that our users can also debug their job based on the errors or stacktrace.

At this point, our user's journey is done! They created their Merlin Project, prototyped their use case on a Merlin Workspace, and scheduled their Merlin jobs using one of the orchestrators we have at Shopify (e.g. Airflow). Later on, when the data scientist needs to update their model or machine learning flow, they can go back to their Merlin Project and start the development cycle again from the prototype phase.

Now that we've explained Merlin's architecture and our user journey, let's dive into how we onboarded a real-world algorithm to Merlin—Shopify's product categorization model.

Onboarding Shopify’s Product Categorization Model to Merlin

A high level diagram of the machine learning workflow for the Product Categorization model

Recently we rebuilt our product categorization model to ensure we understand what our merchants are selling, so we can build the best products that help power their sales. This is a complex use case that requires several workflows for its training and batch prediction. Onboarding this use case to Merlin early on enabled us to validate our new platform, as it requires large scale computation and includes complex machine learning logic and flows. The training and batch prediction workflows were migrated to Merlin and converted using Ray.

Migrating the Training Code

To onboard the product categorization model training stage to Merlin, we integrated its TensorFlow training code with Ray Train to distribute training across a Ray cluster. With Ray Train, supporting distributed training required only a few code changes: the original logic stayed the same, and the core changes are described in the example below.

The following is an example of how we integrated Ray Train with our TensorFlow training code for this use case:
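
A condensed sketch, assuming the Ray 1.x Trainer API; build_model and build_dataset are hypothetical helpers containing the unchanged single-node TensorFlow logic:

```python
import ray
import tensorflow as tf
from ray.train import Trainer

def train_func(config):
    # Ray Train sets TF_CONFIG on every worker, so the standard
    # MultiWorkerMirroredStrategy picks up the cluster topology.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_model()  # hypothetical: the original model code, unchanged
    dataset = build_dataset(config["batch_size"])  # hypothetical data loading
    model.fit(dataset, epochs=config["epochs"])

ray.init(address="auto")  # connect to the Merlin Workspace's Ray cluster
trainer = Trainer(backend="tensorflow", num_workers=20, use_gpu=True)
trainer.start()
trainer.run(train_func, config={"batch_size": 64, "epochs": 10})
trainer.shutdown()
```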

The TensorFlow logic for the training step stays the same, but is separated out into its own function. The primary change is adding Ray logic to the main function. Ray Train allows us to specify the job configuration, with details such as number of workers, backend type, and GPU usage.

Migrating Inference

The inference step in the product categorization model is a multi-step process. We migrated each step separately, using the following method. We used Ray ActorPool to distribute each step of batch inference across a Ray cluster. Ray ActorPool is similar to Python's `multiprocessing.Pool` and allows scheduling Ray tasks over a fixed pool of actors. Using Ray ActorPool is straightforward and allows easy configuration for parallelizing the computation.

Here’s an example of how we integrated Ray ActorPool with our existing inference code to perform distributed batch predictions:
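
A condensed sketch, assuming Ray's ActorPool API; load_model and the partitions variable are hypothetical placeholders:

```python
import ray
from ray.util import ActorPool

@ray.remote
class Predictor:
    def __init__(self):
        # Load the product categorization model once per actor.
        self.model = load_model()  # hypothetical model loading

    def predict(self, partition):
        # Run predictions on one partition of the product dataset.
        return self.model.predict(partition)

ray.init(address="auto")
num_cpus = int(ray.available_resources()["CPU"])
pool = ActorPool([Predictor.remote() for _ in range(num_cpus)])

# Map every dataset partition onto the pool; results return as they're ready.
predictions = list(pool.map(lambda actor, part: actor.predict.remote(part), partitions))
```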

We first create our Predictor class (a Ray Actor), which includes the logic for loading the product categorization model and performing predictions on product datasets. In the main function, we use the size of the cluster (ray.available_resources()["CPU"]) to create all the Actors that will run in the ActorPool. We then send all of our dataset partitions to the ActorPool for prediction.

While this method works for us at the moment, we plan to migrate to Ray Dataset Pipelines, which provide a more robust way to distribute the data load and perform batch inference across the cluster with less dependence on the number of data partitions or their size.

What's Next for Merlin

As Merlin and its infrastructure mature, we plan to continue growing and evolving to better support the needs of our users. Our aspiration is to create a centralized platform that streamlines our machine learning workflows in order to enable our data scientists to innovate and focus on their craft.

Our next milestones include:

  • Migration: Migrate all of Shopify’s machine learning use cases and workflows to Merlin, and add a low-code framework for onboarding new use cases
  • Online inference: Support real time serving of machine learning models at scale
  • Model lifecycle management: Add model registry and experiment tracking
  • Monitoring: Support monitoring for machine learning

While Merlin is still a new platform at Shopify, it’s already empowering us with the scalability, fast iteration and flexibility that we had in mind when designing it. We're excited to keep building the platform and onboarding new data scientists, so Merlin can help enable the millions of businesses powered by Shopify.

Isaac Vidas is a tech lead on the ML Platform team, focusing on designing and building Merlin, Shopify’s machine learning platform. Connect with Isaac on LinkedIn.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Continue reading

Best-in-Class Developer Experience with Vite and Hydrogen

Best-in-Class Developer Experience with Vite and Hydrogen

Hydrogen is a framework that combines React and Vite for creating custom storefronts on Shopify. It maximizes performance for end-users and provides a best-in-class developer experience for you and your team. Since it focuses on evergreen browsers, Hydrogen can leverage modern capabilities, best practices, and the latest tooling in web development to bring the future of ecommerce closer.

Creating a framework requires a lot of choices for frontend tooling. One major part of it is the bundler. Traditionally, developers had no native way to organize their code in JavaScript modules. Therefore, to minimize the amount of code and waterfall requests in the browser, new frontend tools like Webpack started to appear, powering projects such as Next.js and many more.

Bundling code became the de facto practice over the last decade, especially when using view libraries like React or Vue. While these tools successfully solved the problem, they quickly became hard to understand and maintain due to the increasing complexity of the modern web. On top of that, the development process started to slow down because bundling and compiling are inherently slow: the more files in a project, the more work the tool needs to do. Repeat this process for every change made in a project during active development, and one can quickly notice how the developer experience (DX) tanks.

Diagram showing bundle-based dev server. Modules are bundled and compiled to be server ready
Bundle-based dev server image from Vite.js docs

Thanks to the introduction of ES Modules (a native mechanism to author JavaScript modules) and its support in browsers, some new players like Snowpack and Parcel appeared and started shaping up the modern web development landscape.

Image showing use of native ES Modules to minimize the amount of bundling required during development
Native ESM-based dev server from Vite.js docs

This new generation of web tooling aims to improve the DX of building apps. Whereas Webpack needs complex configuration even for simple things due to its high flexibility, these new tools provide sensible but configurable defaults. Furthermore, they leverage native ES Modules to minimize the amount of bundling required during development. In particular, they tend to bundle and cache only third-party dependencies to keep network connections low (the number of files downloaded by the browser); some dependencies may have dozens or hundreds of files, but they don't need to be updated often. User code, on the other hand, is served to the browser unbundled, speeding up refresh rates when making changes.

Enter Vite. With its evergreen and modern philosophy, we believe Vite aligns perfectly with Hydrogen. Featuring a lightning-fast development server with hot module replacement, a rich plugin ecosystem, and clever default configurations that make it work out of the box for most apps, Vite was among the top options to power Hydrogen's development engine.

Why Vite?

Vite is French for "quick", and the Hydrogen team can confirm: it's really fast. From the installation and setup to its hot reloading, things that used to be a DX pain are (mostly) gone. It’s also highly configurable and simple to use.

Partially, this is thanks to the two magnificent tools that power it: ESBuild, a Golang-based, lightning-fast compiler for JavaScript, and Rollup, a flexible and intelligible bundler. However, Vite is much more than the sum of these parts.

Ease of Use

In Vite, the main entry point is a simple index.html file, making it a first-class citizen instead of an afterthought asset. Everything else flows from there via stylesheet and script tags. Vite crawls and analyzes all of the imported assets and transforms them accordingly.

Thanks to its default values, most flavors of CSS and JavaScript, including JSX, TypeScript (TS), and PostCSS, work out of the box.

Let me reiterate this: it just works™. No painful configuration is needed to get those new CSS prefixes or the latest TS type checking working. It even lets you import WebAssembly or SVG files from JavaScript just like that. Also, since Vite's main target is modern browsers, it’s prepared to optimize the code and styles by using the latest supported features by default.

We value the simplicity Vite brings to Hydrogen and share it with our users. It all adds up to saving a lot of time configuring your tooling compared to other alternatives.

A Proven Plugin System

Rollup has been around for a much longer time than Vite. It does one thing and does it very well: bundling. The key here is that Vite can tell it what to bundle.

Furthermore, Rollup has a truly rich plugin ecosystem that is fully compatible with Vite. With this, Vite provides hooks during development and building phases that enable advanced use cases, such as transforming specific syntax like Vue files. There are many plugins out there that use these hooks for anything you can imagine: Markdown pages with JSX, SSR-ready icons, automatic image minification, and more.

In Hydrogen, we found these Vite hooks easier to understand and use than those in Webpack, and they allow us to write more maintainable code.

Speed

A common task that tends to slow down web development is compiling JavaScript flavors and new features to older and widely supported code. Babel, a compiler written in JavaScript, has been the king in this area for a long time.

However, new tools like ESBuild started to appear recently with a very particular characteristic: they use a machine-compiled language to transform JavaScript instead of using JavaScript itself. In addition, and perhaps more importantly, they also apply sophisticated algorithms to avoid repeating AST parsing and parallelize work, thus establishing a new baseline for speed.

Apart from using ESBuild, Vite applies many other optimizations and tricks to speed up development. For instance, it pre-bundles some third-party dependencies and caches them in the filesystem to enable faster startups.

All in all, we can say Vite is one of the fastest alternatives out there when it comes to local development, and this is something we also want our users to benefit from in Hydrogen.

ESM and HMR

Along with Snowpack and Parcel, Vite is one of the first tools to embrace ECMAScript Modules (ESM) and inject JavaScript into the browser using script tags with type=module.

This, paired with hot-module replacement (HMR), means that changes to files on the local filesystem are updated instantly in the browser.

Vite is also building for the future of the web and the NPM ecosystem. While most third-party libraries are still on CommonJS (CJS) style modules (native in Node.js), the new standard is ESM. Vite performs an exhaustive import analysis of dependencies and transforms CJS modules into ESM automatically, letting you always import code in a modern fashion. And this is not something to take lightly: CJS and ESM interoperability is one of the biggest headaches web developers have faced in recent years.

As app developers ourselves on Hydrogen, it's a relief that we can focus on coding without wasting time on this issue. Someday, hopefully, most packages will follow the ESM standard. Until that day, Vite has us covered.

Server-Side Rendering

Server-side rendering (SSR) is a critical piece of modern frameworks like Hydrogen and is another place where Vite shines. Vite extends Rollup hooks to provide SSR information, thus enabling many advanced use cases.

For example, it is possible to transform the same imported file in different ways depending on the running environment (browser or server). This is key to supporting some advanced features we need in Hydrogen, such as React Server Components, which to date was only available in Webpack.

Vite can also load front-end code on the server by converting dependencies to a Node-compatible runtime and modules to CJS. Think of simply importing a React application in Node. It greatly eases the way SSR works and is something Hydrogen leverages to remove extra dependencies and simplify code.

Community

Last but not least, Vite has a large and vibrant community around it.

Many projects in addition to Hydrogen are relying on and contributing to Vite, such as Vitest, SvelteKit, Astro, Storybook, and many more.

And it's not just about the projects, but also the people behind them who are incredibly active and always willing to help in Vite's Discord channel. From Vite's creator, @youyuxi, to many other contributors and maintainers such as @patak_dev, @alecdotbiz, or @antfu7.

Hydrogen is also a proud sponsor of Vite. We want to support the project to ensure it stays up to date with the latest DX improvements to make web developers’ life easier.

How Hydrogen Uses Vite

Our goal when building Hydrogen on top of Vite was to keep things as “close to the metal” as possible and not reinvent the wheel. CLI tools can rely on Vite commands internally, and most of the required configuration is abstracted away.

Creating a Vite-powered Hydrogen storefront is as easy as adding the @shopify/hydrogen/plugin plugin to your vite.config.js:
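
A minimal sketch of that configuration (plugin options omitted):

```js
// vite.config.js
import { defineConfig } from 'vite';
import hydrogen from '@shopify/hydrogen/plugin';

export default defineConfig({
  plugins: [hydrogen()],
});
```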

Behind the scenes, we invoke four different plugins:

  • hydrogen-config: This is responsible for altering the default Vite config values for Hydrogen projects. It helps ensure bundling for both Node.js and Worker runtimes work flawlessly, and that third-party packages are processed properly.
  • react-server-dom-vite: It adds support for React Server Components (RSC). We extracted this plugin from Hydrogen core and made it available in the React repository.
  • hydrogen-middleware: This plugin is used to hook into Vite’s dev server configuration and inject custom behavior. It allows us to respond to SSR and RSC requests while leaving the asset requests to Vite’s default web server.
  • @vitejs/plugin-react: This is an official Vite plugin that adds some goodies for React development such as fast refresh in the browser.

With just this, Hydrogen is able to support server components, streaming requests, clever caching, and more. By combining this with all the features Shopify already provides, you can unlock unparalleled performance and best-in-class DX for your storefront.

Choosing the Right Tool

There are still many advanced use cases where Webpack is a good fit since it is very mature and flexible. Many projects and teams, such as React’s, rely heavily on it for their day-to-day development.

However, Vite makes building modern apps a delightful experience and empowers framework authors with many tools to make development easier. Storefront developers can enjoy a best-in-class DX while building new features at a faster pace. We chose Vite for Hydrogen and are happy with that decision so far.

Fran works as a Staff Software Engineer on the Hydrogen team at Shopify. Located in Tokyo, he's a web enthusiast and an active open source contributor who enjoys all things tech and all things coconut. Connect with Fran on Twitter and GitHub.


If you’re passionate about solving complex problems at scale, and you’re eager to learn more, we're hiring! Reach out to us or apply on our careers page.

Continue reading

10 Books Shopify’s Tech Talent Think You Should Read

10 Books Shopify’s Tech Talent Think You Should Read

How we think, absorb information, and maximize time—these are the topics Shopify developers and engineers are reading up on.

We have a book bar of the company’s favorite reads and make sure any employee who wants a copy of any title can get one. So we thought we’d flip the script and ask 10 of our technical minds to tell us the books they think everyone in tech should read this year.

Many of their choices were timeless, suggesting a clear desire to level up beyond hard skills. There are a couple of deep dives into software design and computing systems, but many of the titles on this reading list are guides for reframing personal habits and patterns: taking notes, receiving feedback, sharing knowledge, and staying focused amid myriad distractions.

The Talent Code by Daniel Coyle

(Bantam Books)

I received my copy of The Talent Code shortly before uprooting my life to attend a front-end bootcamp. The school sent a copy to every student about to start their nine-week program. Coyle’s thesis is “Greatness isn’t born. It’s grown.” He highlights areas that allow us to become great at almost anything: deep practice, passion, and master coaching. The book made me rethink whether I’m destined to be bad at some things. One example for me was softball, but a more pressing use case was my upcoming immersion in coding. Coyle’s lessons helped me thrive during my course’s long hours, but I haven’t applied the same lessons to softball, yet.

Carys Mills, Staff Front End Developer

The 5 Elements of Effective Thinking by Edward B. Burger and Michael Starbird

(Princeton University Press)

I’ve always followed the adage of “work smarter, not harder,” but in knowledge work, how do we “think smarter, not harder”? The 5 Elements of Effective Thinking presents an answer, packaged in a framework that’s applicable in work and life more broadly. The book is short and pithy. I keep it near my desk. The elements of the book include how to understand a topic, how to think about failure, how to generate good questions, and how to follow those questions. I won’t spoil the fifth element for you; you’ll have to read about it yourself!

Ash Furrow, Senior Staff Developer


Thanks for the Feedback: The Science and Art of Receiving Feedback Well by Sheila Heen and Douglas Stone

(Viking Adult)

As developers, we give and receive feedback all the time—every code review, tech review, and, of course, feedback on our foundational and soft skills too. There’s a lot of focus on how to do a good code review—how to give feedback, but there’s also an art of receiving feedback. Sheila Heen and Douglas Stone’s Thanks for the Feedback: The Science and Art of Receiving Feedback Well does an excellent job of laying out the different layers involved in receiving feedback and the different kinds there are. Being able to identify the kind of feedback I’m getting (beyond "constructive")—appreciation or encouragement, coaching or evaluative—has helped me leverage even poorly delivered feedback to positively impact my personal and professional growth.

Swati Swoboda, Development Manager, Shipping

How to Take Smart Notes by Sönke Ahrens

(Self-published)

Occasionally there are books that will totally flip how you think about doing something. How to Take Smart Notes is one of those. The title is about notes, but the book is about taking a totally different approach to learning and digesting information. Even if you choose not to follow the exact note taking technique it describes, the real value is in teaching you how to think about your own methods of absorbing and integrating new information. It’s completely changed the approach I take to studying nonfiction books.

Rose Wiegley, Staff Software Engineer

Extreme Ownership: How U.S. Navy SEALs Lead and Win by Jocko Willink and Leif Babin

(Echelon Front)

The book that I'd recommend people read, if they haven't read it before, is actually a book we recommend internally at Shopify: Extreme Ownership by Jocko Willink and Leif Babin. Don't let the fact that it's about the Navy SEALs put you off. There are so many generally applicable lessons that are critical as our company continues to grow at a rapid pace. Success in a large organization—especially one that is globally distributed—is about decentralized leadership from teams and individuals: we all have the autonomy and permission to go forth and build amazing things for our merchants, so we should do just that whilst setting great examples for others to follow.

James Stanier, Director of Engineering, Core

The Elements of Computing Systems: Building a Modern Computer from First Principles by Noam Nisan and Shimon Schocken

(The MIT Press)

Curious how tiny hardware chips become the computers we work on? I highly recommend The Elements of Computing Systems for any software developer wanting a more well-rounded understanding of a computer’s abstraction layers—not just at the level you’re most comfortable with, but even deeper. This workbook guides you through building your own computer from the ground up: hardware chip specifications, assembly language, programming language, and operating system. The authors did a great job of including the right amount of knowledge to not overwhelm readers. This book has given me a stronger foundation in computing systems while working at Shopify. Don’t like technical books? The authors also have lectures on Coursera available for free.

Maple Ong, Senior Developer

A Philosophy of Software Design by John Ousterhout

(Yaknyam Press)

A Philosophy of Software Design tackles a complicated topic: how to manage complexity while building systems. And, surprisingly, it’s an easy read. One of Stanford computer science professor John Ousterhout’s insights I strongly agree with is that working code isn’t enough. Understanding the difference between tactical vs strategic coding helps you level up—specifically, recognizing when a system is more complex than it needs to be is a crucial yet underrated skill. I also like how Ousterhout likens software to playing a team sport, and when he explains why our work as developers isn’t only writing code that works, but also creating code and systems that allow others to work easily. Read with an open mind. A Philosophy of Software Design offers a different perspective from most books on the subject.

Stella Miranda, Senior Developer

Living Documentation by Cyrille Martraire

(Addison-Wesley Professional)

Living Documentation isn’t so much about writing good documentation as about transmitting knowledge, which is the real purpose of documentation. In the tech world, where the code is the source of truth, we often rely on direct interactions when sharing context, but this is a fragile process: knowledge can be diluted from one person to another and even lost when people leave a company. On the other side of the spectrum lies traditional documentation. It’s more perennial but requires significant effort to keep relevant, which is the main reason documentation is the most overlooked task in the tech world. Living Documentation is an attempt at bridging the gap between these two extremes by applying development tools and methods to documentation in an incremental way, ensuring knowledge transmission in a 100-year company.

Frédéric Bonnet, Staff Developer

Uncanny Valley by Anna Wiener

(MCD Books)

Sometimes you need to read something that’s both resonant and entertaining in addition to job or specific skill-focused books. In the memoir Uncanny Valley, Anna Wiener vividly describes her journey from working as a publishing assistant in New York to arriving in the Bay Area and befriending CEOs of tech unicorns. At a time when tech is one of the biggest and most influential industries in the world, her sharp observations and personal reflections force those of us working in the sector to look at ourselves with a critical eye.

Andrew Lo, Staff Front End Developer

Deep Work: Rules For Focused Success in a Distracted World by Cal Newport

(Grand Central Publishing)

I've found that the most impactful way to tackle hard problems is to first get into a flow state. Having the freedom to work uninterrupted for long blocks of time has often been the differentiator in discovering creative solutions. Once you've experienced it, it's tough going back to working any other way. Most of the activities we do as knowledge workers benefit from this level of attention and focus. And if you've never tried working in long, focused time blocks, Deep Work should convince you to give it a shot. A word of warning though: make sure you have a bottle of water and some snacks handy. It's easy to completely lose track of time and skip meals. Don't do that!

Krishna Satya, Development Manager

For more book recommendations, check out this Twitter thread from last year’s National Book Lovers Day.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Continue reading

How We Built the Add to Favorite Animation in Shop

How We Built the Add to Favorite Animation in Shop

I just want you to feel it

Jay Prince, from the song Feel It

I use the word feeling a lot when working on animations and gestures. For example, animations or gestures sometimes feel right or wrong. I think about that word a lot because our experiences using software are based on an intuitive understanding of the real world. When you throw something in real life, it influences how you expect something on screen to behave after you drag and release it.

By putting work, love, and care into UI details and designs, we help shape the experience and feeling users have when using an app. All the technical details and work are in service of the user's experiences and feelings. The user may not consciously notice the subtle animations we create, but if we do our job well, the tiniest gesture will feel good to them.

The team working on Shop, our digital shopping assistant, recently released a feature that allows buyers to favorite products and shops. By pressing a heart button on a product, buyers can save those products for later. When they do, the product image drops into the heart icon (containing a list of favorite products) in the navigation tab at the bottom.

In this post, I’ll show you how I approached implementing the Add to Favorite animation in Shopify’s Shop app. Specifically, we can look at the animation of the product image thumbnail appearing, then moving into the favorites tab bar icon:

Together, we'll learn:

  • How to sequence animations.
  • How to animate multiple properties at the same time.
  • What interpolation is.

Getting Started

When I start working on an animation from a video provided by a designer, I like to slow it down so I can see what's happening more clearly:

If a slowed video isn’t provided, you can record the animation using Monosnap or QuickTime, which also lets you slowly scrub through the video. Fortunately, we have this great motion spec to work with as well:

As you can see, the motion spec defines the sequence of animations. Based on the spec, we can determine:

  • which properties are animating
  • what values to animate to
  • how long each animation will take
  • the easing curve of the animation
  • the overall order of the animations

Planning the Sequence

Firstly, we should recognize that there are two elements being animated:

  • the product thumbnail
  • the favorites tab bar icon

The product thumbnail animates first, then the Favorites tab bar icon. Let's break it down step by step:

1. Product thumbnail fades in from 0% to 100% opacity. At the same time, it scales from 0 to 1.2.
2. Product thumbnail scales from 1.2 to 1
(A 50 ms pause where nothing happens)
3. Product thumbnail moves down, then disappears instantly at the end of this step.
4. The Favorite tab bar icon moves down. At the same time, it changes color from white to purple.
5. The Favorite tab bar icon moves up. At the same time, it changes color from purple to white.
6. The Favorite tab bar icon moves down.
7. The Favorite tab bar icon moves up to its original position.


Each of the above steps is an animation with a duration and an easing curve, as specified in the motion spec provided by the motion designer. The motion spec defines easing curves that describe how each property changes over time.

Coding the Animation Sequence

Let's write code! The Shop app is a React Native application and we use the Reanimated library to implement animations.

For this animation sequence, multiple properties are animated at times. However, these animations happen together, driven by the same timings and curves, so we can use a single shared value for the whole sequence. That shared progress value drives the animations for each step by moving from 1 to 2 to 3, and so on.

So the progress value tells us which step of the animation we're in, and we can set the animated properties accordingly. As you can see, this sequence of steps matches the steps we wrote down above, along with each step's duration and easing curves, including a delay at step 3:
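
A minimal sketch of that sequence, assuming Reanimated 2's API; the durations and easings below are hypothetical placeholders for the values in the motion spec:

```tsx
import {
  Easing,
  useSharedValue,
  withDelay,
  withSequence,
  withTiming,
} from 'react-native-reanimated';

function useFavoriteAnimation() {
  const progress = useSharedValue(0);

  const start = () => {
    progress.value = withSequence(
      withTiming(1, {duration: 150, easing: Easing.out(Easing.quad)}), // step 1: fade in, scale to 1.2
      withTiming(2, {duration: 100}), // step 2: scale back down to 1
      withDelay(50, withTiming(3, {duration: 200})), // 50 ms pause, then move down
      withTiming(4, {duration: 120}), // step 4: icon down, heart fills
      withTiming(5, {duration: 120}), // step 5: icon up, heart unfills
      withTiming(6, {duration: 100}), // step 6: icon down again
      withTiming(7, {duration: 100}), // step 7: icon back to rest
    );
  };

  return {progress, start};
}
```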

We can now start mapping the progress value to the animated properties!

Product Thumbnail Styles

First let's start with the product thumbnail fading in:
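
Inside the component, a sketch using Reanimated's useAnimatedStyle and interpolate:

```tsx
import {Extrapolate, interpolate, useAnimatedStyle} from 'react-native-reanimated';

const thumbnailStyle = useAnimatedStyle(() => ({
  // Map progress 0 to 1 onto opacity 0 to 1, clamped beyond that range.
  opacity: interpolate(progress.value, [0, 1], [0, 1], Extrapolate.CLAMP),
}));
```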

What does interpolate mean?

Interpolating maps a value from an input range to an output range. For example, if the input range is [0, 1] and the output range is [0, 10], then as the input increases from 0 to 1, the output increases correspondingly from 0 to 10. In this case, we're mapping the progress value from [0, 1] to [0, 1] (so no change in value).

In the first step of the animation, the progress value changes from 0 to 1 and we want the opacity to go from 0 to 1 during that time so that it fades in. “Clamping” means that when the input value is greater than 1, the output value stays at 1 (it restricts the output to the maximum and minimum of the output range). So the thumbnail will fade in during step 1, then stay at full opacity for the next steps because of the clamping.

However, we also want the thumbnail to disappear instantly at step 3. In this case, we don't use interpolate because we don't want it to animate a fade-out. Instead, we want an instant disappearance:
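
Roughly like this, using the same imports as above:

```tsx
const thumbnailStyle = useAnimatedStyle(() => ({
  // No interpolation past step 3: the opacity snaps straight to 0.
  opacity:
    progress.value >= 3
      ? 0
      : interpolate(progress.value, [0, 1], [0, 1], Extrapolate.CLAMP),
}));
```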

Now the item is fading in, but it also has to grow in scale and then shrink back a bit:
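
Something like this sketch:

```tsx
const thumbnailScaleStyle = useAnimatedStyle(() => ({
  transform: [
    // Steps 0 to 1: grow from 0 to 1.2; steps 1 to 2: settle back down to 1.
    {scale: interpolate(progress.value, [0, 1, 2], [0, 1.2, 1], Extrapolate.CLAMP)},
  ],
}));
```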

This interpolation is saying that from step 0 to 1, we want scale to go from 0 to 1.2. From step 1 to 2, we want the scale to go from 1.2 to 1. After step 2, it stays at 1 (clamping).

Let's do the final property, translating it vertically:
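
A sketch of the vertical translation:

```tsx
const thumbnailTranslateStyle = useAnimatedStyle(() => ({
  transform: [
    // Slide from -60 down to -34 between steps 2 and 3, clamped elsewhere.
    {translateY: interpolate(progress.value, [2, 3], [-60, -34], Extrapolate.CLAMP)},
  ],
}));
```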

So we're moving from position -60 to -34 (halfway behind the tab bar) between steps 2 and 3. After step 3, the opacity becomes 0 and it disappears! Let's test the above code:

Nice, it fades in while scaling up, then scales back down, then slides down halfway under the tab bar, and then disappears.

Tab Bar Icon Styles

Now we just need to write the Favorite tab bar icon styles!

First, let's handle the heart becoming filled (turning purple), then unfilled (turning white). I did this by positioning the filled heart icon over the unfilled one, then fading in the filled one over the unfilled one. Therefore, we can use a simple opacity animation where we move from 0 to 1 and back to 0 over steps 3, 4 and 5:
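
A sketch of that opacity animation:

```tsx
const filledHeartStyle = useAnimatedStyle(() => ({
  // Fade the filled (purple) heart in over the unfilled one, then back out.
  opacity: interpolate(progress.value, [3, 4, 5], [0, 1, 0], Extrapolate.CLAMP),
}));
```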

For the heart bouncing up and down we have:
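
A sketch of the bounce; the pixel offsets are hypothetical placeholders for the spec's values:

```tsx
const heartBounceStyle = useAnimatedStyle(() => ({
  transform: [
    // Down, up, down again, then back to rest across steps 3 to 7.
    {translateY: interpolate(progress.value, [3, 4, 5, 6, 7], [0, 4, 0, 2, 0], Extrapolate.CLAMP)},
  ],
}));
```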

From steps 3 to 7, this makes the icon move up and down, creating a bouncing effect. Let's see how it looks!

Nice, we now see the tab bar icon react to having a product move into it.

Match Cut

By using a single shared value, we ensured that the heart icon moves down immediately when the thumbnail disappears, creating a match cut. A “match cut” is a cinematic technique where the movement of an item immediately cuts to the movement of another item during a scene transition. The downward movement the user’s eye expects from the product thumbnail cuts to a matching downward movement of the heart icon. This creates an association between the item and the Favorites section in the user’s mind.

In another approach, I tried using setTimeout to start the tab bar icon animation after the thumbnail one. I found that when the JS thread was busy, this delayed the second animation, which ruined the match cut transition: the delay made it feel wrong. So I abandoned that approach. Using withDelay from Reanimated would have avoided this issue by keeping the timer on the UI thread.

When I started learning React Native, the animation code was intimidating. I hope this post helps make implementing animations in React Native more fun and approachable. When done right, they can make user interactions feel great!

You can see this animation by favoriting a product in the Shop app!

Special thanks to Amber Xu for designing these animations, providing me with great specs and videos to implement them, and answering my many questions.

Andrew Lo is a Staff Front End Developer on the Shop's Design Systems team. He works remotely from Toronto, Canada. 


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

A Data Scientist’s Guide To Measuring Product Success

A Data Scientist’s Guide To Measuring Product Success

If you’re a data scientist on a product team, much of your work involves getting a product ready for release. You may conduct exploratory data analyses to understand your product’s market, or build the data models and pipelines needed to power a new product feature, or design a machine learning model to unlock new product functionality. But your work doesn’t end once a product goes live. After a product is released, it’s your job to help identify if your product is a success.

Continue reading

Using Terraform to Manage Infrastructure

Using Terraform to Manage Infrastructure

Large applications are often a mix of code your team has written and third-party applications your team needs to manage. These third-party applications could be things like AWS or Docker. In my team’s case, it’s Twilio TaskRouter.

The configuration of these services may not change as often as your app code does, but when it does, the process is fraught with the potential for errors. This is because there is no way to write tests for the changes or easily roll them back, two things we depend on as developers when shipping our application code.

Using Terraform improves your infrastructure management by allowing users to implement engineering best practices in what would otherwise be a GUI with no accountability, tests, or revision history.

On the Conversations team, we recently implemented Terraform to manage a piece of our infrastructure to great success. Let’s take a deeper look at why we did it, and how.

My team builds Shopify’s contact center. When a merchant or partner interacts with an agent, they are likely going through a tool we’ve built. Our app suite contains applications we’ve built in-house and third-party tools. One of these tools is Twilio TaskRouter.

TaskRouter is a multi-channel skill-based task routing API. It handles creating tasks (voice, chat, etc.) and routing them to the most appropriate agent, based on a set of routing rules and agent skills that we configure.

As our business grows and becomes more complex, we often need to make changes to how merchants are routed to the appropriate agent.

Someone needs to go into our Twilio console and use the graphical user interface (GUI) to update the configuration. This process is fairly straightforward and works well for getting off the ground quickly. However, the complexity quickly becomes too high for one person to understand in its entirety.

In addition, the GUI doesn’t provide a clear history of changes or a way to roll them back.

As developers, we are used to viewing a commit history, reading PR descriptions and tests to understand why changes happened, and rolling back changes that are not working as expected. When working with Twilio TaskRouter, we had none of these.

Using Terraform to Configure Infrastructure

Terraform is an open source tool for configuring infrastructure as code.

It is a state machine for infrastructure that brings all the benefits of the engineering best practices listed above to infrastructure that was previously only manageable via a GUI.

Terraform requires three things to work:

  1. A reliable API. When using Terraform, we stop using the GUI and rely on Terraform to make our changes for us via the API. Anything you can’t change with the API, you won’t be able to manage with Terraform.
  2. A Go client library. Terraform is written in Go and requires a client library for the API you’re targeting written in Go. The client library makes HTTP(S) calls to your target app.
  3. A Terraform provider. The core Terraform software uses a provider to interact with the target API. Providers are written in Go using the Terraform Plugin SDK.

With these three pieces, you can manage just about any application with Terraform!

Image from: https://learn.hashicorp.com/img/terraform/providers/core-plugins-api.png

A Terraform provider adds a set of resources Terraform can manage. Providers are not part of Terraform’s code. They are created separately to manage a specific application. Twilio did not have a provider when we started this project, so we made our own.

Since launching this project, Twilio has developed its own Terraform provider, which can be found here.

At its core, a provider enables Terraform to perform CRUD operations on a set of resources. Armed with a provider, Terraform can manage the state of the application.

Creating a Provider

Note: If you are interested in setting up Terraform for a service that already has a provider, you can skip to the next section.

Here is the basic structure of a Terraform provider:
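
A hypothetical layout matching the description below:

```
terraform-provider-twilio/
├── go.mod
├── go.sum
├── Makefile
├── examples/
│   └── main.tf        # example configuration for local development
└── twilio/            # the provider itself
    ├── provider.go
    └── resource_twilio_activity.go
```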

This folder structure contains your Go dependencies, a Makefile for running commands, an example file for local development, and a directory called twilio. This is where our provider lives.

A provider must contain a resource file for every type of resource you want to manage. Each resource file contains a set of CRUD instructions for Terraform to follow: you’re basically telling Terraform how to manage this resource.

Here is the function defining what an activity resource is in our provider:
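
A condensed sketch of that function, assuming the Terraform Plugin SDK v2; the schema matches the parameters described below:

```go
package twilio

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceTwilioActivity() *schema.Resource {
	return &schema.Resource{
		CreateContext: resourceTwilioActivityCreate,
		ReadContext:   resourceTwilioActivityRead,
		UpdateContext: resourceTwilioActivityUpdate,
		DeleteContext: resourceTwilioActivityDelete,
		// Importer lets Terraform adopt activities that already exist in Twilio.
		Importer: &schema.ResourceImporter{
			StateContext: schema.ImportStatePassthroughContext,
		},
		Schema: map[string]*schema.Schema{
			"friendly_name": {Type: schema.TypeString, Required: true},
			"available":     {Type: schema.TypeBool, Optional: true},
			"workspace_sid": {Type: schema.TypeString, Required: true},
		},
	}
}
```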

Note: Go is a strongly typed language, so the syntax might look unusual if you’re not familiar with it. Luckily you do not need to be a Go expert to write your own provider!

This file defines what Terraform needs to do to create, read, update, and destroy activities in TaskRouter. Each of these operations is defined by a function in the same file.

The file also defines an Importer function, a special type of function that allows Terraform to import existing infrastructure. This is very handy if you already have infrastructure running and want to start using Terraform to manage it.

Finally, the function defines a schema: these are the parameters provided by the API for performing CRUD operations. In the case of TaskRouter activities, the parameters are friendly_name, available, and workspace_sid.

To round out the example, let’s look at the create function we wrote:
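
A sketch of the create function following the steps described below; the Twilio client types and method names are illustrative, not the real client library's API:

```go
// (also uses "context" and the SDK's diag package:
// github.com/hashicorp/terraform-plugin-sdk/v2/diag)
func resourceTwilioActivityCreate(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
	client := m.(*TwilioClient) // hypothetical client set up by the provider

	// Activities all live under a single TaskRouter workspace.
	workspace := client.Workspace(d.Get("workspace_sid").(string))

	// Format the parameters defined in our schema.
	params := ActivityParams{ // hypothetical params struct
		FriendlyName: d.Get("friendly_name").(string),
		Available:    d.Get("available").(bool),
	}

	activity, err := workspace.CreateActivity(ctx, params)
	if err != nil {
		return diag.FromErr(err)
	}

	// Record the new resource's sid so Terraform can manage it from now on.
	d.SetId(activity.Sid)
	return nil
}
```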

Note: Most of this code is boilerplate Terraform provider code which you can find in their docs.

The function is passed context, a schema resource, and an empty interface.

We instantiate the Twilio API client and find our workspace (Task Router activities all exist under a single workspace).

Then we format our parameters (defined in our Schema in the resourceTwilioActivity function) and pass them into the create method provided to us by our API client library.

Because this function creates a new resource, we set the id (SetId) to the sid of the result of our API call. In Twilio, a sid is a unique identifier for a resource. Now Terraform is aware of the newly created resource and its unique identifier, which means it can make changes to the resource.

Using Terraform

Once you have created your provider or are managing an app that already has a provider, you’re ready to start using Terraform.

Terraform uses a DSL for managing resources. The good news is that this DSL is more straightforward than the Go code that powers the provider.

The DSL is simple enough that, with some instruction, non-developers should be able to make changes to your infrastructure safely, but more on that later.

Here is the code for defining a new Task Router activity:
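
A hypothetical resource block; the names and the workspace reference are placeholders matching the explanation below:

```hcl
resource "twilio_activity" "offline" {
  friendly_name = "Offline"
  available     = false
  workspace_sid = twilio_workspace.main.sid

  depends_on = [twilio_workspace.main]
}
```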

Yup, that’s it!

We create a block declaring the resource type and what we want to call it. In that block, we pass the variables defined in the Schema block of our resourceTwilioActivity, and any resources that it depends on. In this case, activities need to exist within a workspace. So we pass in the workspace resource in the depends_on array. Terraform knows it needs this resource to exist or to create it before attempting to create the activity.

Now that you have defined your resource, you’re ready to start seeing the benefits of Terraform.

Terraform has a few commands, but plan and apply are the most common. Plan prints out a text-based representation of the changes you’re about to make:
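
An illustrative, abridged example of what that output could look like for the activity above:

```
Terraform will perform the following actions:

  # twilio_activity.offline will be created
  + resource "twilio_activity" "offline" {
      + available     = false
      + friendly_name = "Offline"
      + id            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
```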

Terraform makes visualizing the changes to your infrastructure very easy. At this planning step you may uncover unintended changes: if there was already an offline activity, the plan step would show you an update instead of a create. In that case, all you need to do is change your resource block’s name and run terraform plan again.

When you are satisfied with your changes, run terraform apply to make the changes to your infrastructure. Now Terraform will know about the newly created resource, and its generated id, allowing you to manage it exclusively through Terraform moving forward.

To get the full benefit of Terraform (PRs, reviews, etc.), we use an additional tool called Atlantis to manage our GitHub integration.

This allows people to make pull requests with changes to resource files and have Atlantis add a comment to the PR with the output of terraform plan. Once the review process is done, we comment atlantis apply -p terraform to make the change. Then the PR is merged.

We have come a long way from managing our infrastructure with a GUI in a web app! We have a Terraform provider communicating via a Go API client to manage our infrastructure as code. With Atlantis plugged into our team’s GitHub, we now have many of the best practices we rely on when writing software–reviewable PRs that are easy to understand and roll back if necessary, with a clear history that can be scanned with a git blame.

How was Terraform Received by Other Teams?

The most rewarding part of this project was how it was received by other teams. Instead of business and support teams making requests and waiting for developers to change Twilio workflows, Terraform empowered them to do it themselves. In fact, some people’s first PRs were changes to our Terraform infrastructure!

Along with freeing up developer time and making the business teams more independent, Terraform provides visibility to infrastructure changes over time. Terraform shows the impact of changes, and the ease of searching GitHub for previous changes makes it easy to understand the history of changes our teams have made.

Building great tools will often require maintaining third-party infrastructure. In my team’s case, this means managing Twilio TaskRouter to route tasks to support agents properly.

As the needs of your team grow, the way you configure your infrastructure will likely change as well. Tracking these changes and being confident in making them is very important but can be difficult.

Terraform makes these changes more predictable and empowers developers and non-developers alike to use software engineering best practices when making these changes.

Jeremy Cobb is a developer at Shopify. He is passionate about solving problems with code and improving his serve on the tennis court.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Creating a React Library for Consistent Data Visualization

Creating a React Library for Consistent Data Visualization

At Shopify, we tell a lot of stories through data visualization. This is the driving force behind business decisions—not only for our merchants, but also for teams within Shopify.

With more than 10,000 Shopify employees, though, it is only natural that different teams started using different tools to display data, which is great—after all, creative minds create diverse solutions, right? The problem is that it led to a lot of inconsistencies, like these two line charts that used to live in the Shopify admin—the page you see after logging in to Shopify, where you can set up your store, configure your settings, and manage your business—for example:

Let’s play Spot the Difference: line widths, dashed line styles, legend styles, background grids, one has labels on the X axis, the other doesn’t... This isn’t just a “visual styles” problem: because the two charts used different libraries, one was accessible to screen readers and the other wasn’t; one was printable, the other not.

To solve this problem, the Insights team has been working on creating a React data visualization library—Polaris Viz—that other teams can rely on to quickly implement data visualization without having to solve the same problems over and over again.

But first things first, if you haven’t yet, I recommend you start by reading my co-worker Miru Alves’ amazing blog post where she describes how we used Delta-E and Contrast Ratio calculations to create a color matrix with a collection of colors we can choose from to safely use without violating any accessibility rules.

This post is going to focus on the process of implementing the light and dark themes in the library, as well as allowing library consumers to create their own themes, since not all Shopify brands like Shop, Handshake, or Oberlo use the same visual identity.

Where Did the Inconsistencies Come From?

When we started tackling this issue, the first thing we noticed was that even in places that were already using only Polaris Viz, we had visual inconsistencies. This is because our original components API looked like this:
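
A hypothetical illustration of that old API, with visual styles spread across individual props:

```jsx
<LineChart
  data={data}
  lineWidth={2}
  dashStyle="dashed"
  legendPosition="bottom"
  gridColor="#EBEBEB"
  showXAxisLabels
  tooltipBackground="#1F1F25"
/>
```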

As you can see, changing the appearance of a chart involved many different options spread across different props, and you either had to create a wrapping component that holds all the correct values or pass the props over and over again to each instance. OK, this explains a lot.

Ideally, all charts in the admin should use either the default dark or light themes that the UX team created, so we should make it easy for developers to choose light or dark without all this copyin’ && pasta.

Implementing Themes

To cover the use cases of teams that used the default dark or light themes, we removed all the visual style props and introduced a new theme prop to all chart components:

  • The theme prop accepts the name of a theme defined in a record of Themes.
  • The Theme type contains all visual configurations like colors, line styles, spacing, and if bars should be rounded or not.

These changes allow consumers to have all the good styles by default—styles that match our visual identity, take accessibility into consideration, and have no accidental discrepancies—and they just have to pass in theme=’Light’ if they want to use the Light theme instead of the Dark one.

This change should cover the majority of use cases, but we still need to support other visual identities. Putting back all those style props would lead to the same problems for whoever wasn’t using the default styles. So how could we make it easy to specify a different visual identity?

Introducing the PolarisVizProvider

We needed a way to allow consumers to define what their own visual identity looks like in a centralized manner so all charts across their applications would just use the correct styles. So instead of having the chart components consume the themes record from a const directly, we introduced a context provider that stores the themes:
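
A simplified sketch, assuming a React context; DEFAULT_THEMES is a hypothetical constant holding the built-in Default and Light themes:

```jsx
import React, {createContext} from 'react';

export const PolarisVizContext = createContext({themes: DEFAULT_THEMES});

export function PolarisVizProvider({themes = DEFAULT_THEMES, children}) {
  return (
    <PolarisVizContext.Provider value={{themes}}>
      {children}
    </PolarisVizContext.Provider>
  );
}
```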

By having the provider accept a themes prop, we allow consumers to overwrite the Default and Light themes or add their own. This implementation could cause some problems though: what happens if a user overwrites the Default theme but doesn’t provide all the properties necessary to render a chart? For example, what if they forget to pass the tooltip background color?

To solve this, we first implemented a createTheme function:
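
A sketch of the idea, assuming a flat Theme record for simplicity (the real themes are nested, which would call for a deep merge):

```ts
export function createTheme(partialTheme: Partial<Theme>): Theme {
  // Any property missing from the partial theme falls back to the default.
  return {...DEFAULT_THEME, ...partialTheme};
}
```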

createTheme allows you to pass in a partial theme and obtain a complete theme. All the properties that are missing in the partial theme will just use the library’s default values.

Next, we implemented a createThemes function. It guarantees that even if properties are overwritten, the theme record will always contain the Default and Light themes:
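
A sketch of createThemes built on top of createTheme:

```ts
export function createThemes(
  themes: Record<string, Partial<Theme>> = {},
): Record<string, Theme> {
  // Start from complete Default and Light themes, then layer in (or
  // overwrite with) whatever the consumer provided.
  const complete: Record<string, Theme> = {
    Default: createTheme(themes.Default ?? {}),
    Light: createTheme(themes.Light ?? {}),
  };
  for (const [name, partial] of Object.entries(themes)) {
    complete[name] = createTheme(partial);
  }
  return complete;
}
```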

With both of these in place, we just needed to update the PolarisVizProvider implementation:
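
Which, in this sketch, amounts to running the incoming themes through createThemes:

```jsx
export function PolarisVizProvider({themes, children}) {
  // Default and Light always exist, even if only partially overwritten.
  return (
    <PolarisVizContext.Provider value={{themes: createThemes(themes)}}>
      {children}
    </PolarisVizContext.Provider>
  );
}
```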

Overwriting the Default Theme

From a consumer perspective, this means that you could wrap your application with a PolarisVizProvider, define your Default theme, and all charts will automagically inherit the correct styles. For example:
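
Something like this, where chartContainer and backgroundColor are illustrative property names:

```jsx
<PolarisVizProvider
  themes={{
    Default: {
      chartContainer: {backgroundColor: 'blue'},
    },
  }}
>
  <App />
</PolarisVizProvider>
```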

All charts inside of <App/> will have a blue background by default:

It hurts my eyes, but IT WORKS!

Creating Multiple Themes

You can also define multiple extra themes in the PolarisVizProvider. Each top level key in this object is used as a theme name that you can pass to individual charts later on. For example:
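
A sketch with illustrative property names:

```jsx
<PolarisVizProvider
  themes={{
    AngryRed: {
      seriesColors: {single: ['black']},
      chartContainer: {backgroundColor: 'red'},
    },
    HappyGreen: {
      seriesColors: {single: ['black']},
      chartContainer: {backgroundColor: 'green'},
    },
  }}
>
  <LineChart theme="AngryRed" data={data} />
  <LineChart theme="HappyGreen" data={data} />
</PolarisVizProvider>
```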

The first chart uses a theme named AngryRed and the second HappyGreen.

We did have to repeat the definition of the single series color twice though (seriesColors.single = [‘black’]); it would be even more annoying if we had multiple properties shared by both themes and only wanted to overwrite some of them. We can make this easier by changing the implementation of the createTheme function to accept an optional baseTheme, instead of always using the default from the library:
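
In the sketch from before, that's a one-line change:

```ts
export function createTheme(
  partialTheme: Partial<Theme>,
  baseTheme: Theme = DEFAULT_THEME,
): Theme {
  return {...baseTheme, ...partialTheme};
}
```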

With those changes in place, as a consumer I can just import createTheme from the library and use AngryRed as the base theme when creating HappyGreen:
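
Roughly like this (the import path is illustrative):

```ts
import {createTheme} from '@shopify/polaris-viz';

const AngryRed = createTheme({
  seriesColors: {single: ['black']},
  chartContainer: {backgroundColor: 'red'},
});

// HappyGreen starts from AngryRed and only swaps the background color.
const HappyGreen = createTheme(
  {chartContainer: {backgroundColor: 'green'}},
  AngryRed,
);
```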

Making Colors Change According to the Data Set

Another important feature we had in the library and didn’t want to lose was to change the series colors according to the data.

In this example, we’re applying a green gradient to the first chart to highlight the highest values as having more ordered items—more sales—is a good thing! In the second chart though, we’re applying a red gradient to highlight the highest values, since having more people return what they ordered isn’t such a good thing.

It would be super cumbersome to create extra themes any time we wanted a specific data series to use a different color, so we changed our DataSeries type to accept an optional color that can overwrite the series color coming from the theme:
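
A sketch of the type; the gradient stop shape is illustrative:

```ts
interface DataSeries {
  name: string;
  data: {key: string; value: number}[];
  // Optional override for the series color coming from the theme.
  color?: string | {offset: number; color: string}[];
}
```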

So for the example above, we could have something like:
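
For the returns chart, a sketch with hypothetical gradient stops:

```jsx
<LineChart
  data={[
    {
      name: 'Returned items',
      data: returnedItems,
      // Red gradient: the highest values read as a warning.
      color: [
        {offset: 0, color: '#FFD3D3'},
        {offset: 100, color: '#C43256'},
      ],
    },
  ]}
/>
```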

Next Steps

Polaris Viz will be open source soon! If you want to get access to the beta version of the library, help us test, or suggest features that might be useful for you, reach out to us at polaris-viz-feedback@shopify.com

Krystal is a Staff Developer on the Visualization Experiences team. When she’s not obsessing over colors, shapes and animation she’s usually hosting karaoke & billiards nights with friends or avoiding being attacked by her cat, Pluma.

Continue reading

Test Budget: Time Constrained CI Feedback

Test Budget: Time Constrained CI Feedback

At Shopify we run more than 170,000 tests in our core monolith. Naturally, we're constantly exploring ways to make this faster, and the Test Infrastructure team analyzed the feasibility of introducing a test budget: a fixed amount of time for tests to run. The goal is to speed up the continuous integration (CI) test running phase by accepting more risk. To achieve that goal we used prioritization to reorder the test execution plan in order to increase the probability of a fast failure. Our analysis provided insights into the effectiveness of executing prioritized tests under a time constraint. The single most important finding was that we were able to find failures after we had run only 70% of the test-selection suite.

The Challenge

Shopify’s codebase relies on CI to avoid regressions before releasing new features. As the code submission rate grows along with the development team size, so does the size of the test pool and the time between code check-ins and test result feedback. As seen in the figure below, developers will occasionally get late CI feedback, while at other times the CI builds complete in under 10 minutes. This non-normal cadence of receiving CI feedback leads to more frequent context switches.

The feedback time varies

Various techniques exist to speed up CI such as running tests in parallel or reducing the number of tests to run with test selection. Balancing the cost of running tests against the value of running them is a fundamental topic in test selection. Furthermore, if we think of the value as a variable then we can make the following observations for executing tests:

  • No amount of tests can give us complete confidence that no production issue will occur.
  • The risk of production issues is lower if we run all the tests.
  • As the complexity of the system increases, the value of testing any individual component decreases.
  • Not all tests increase our confidence level the same way.

The Approach

It’s important to note first the difference between test selection and test prioritization. Test selection deterministically selects all tests that correspond to the given changes using a call graph. Test prioritization, on the other hand, orders the tests with the goal of discovering failures fast. That sorted set won’t always be the same for the same change, since the prioritization techniques use historical data.

The system we built produces a prioritized set of tests on top of test selection and constrains the execution of those tests using a predetermined time budget. Having established that there’s a limited time to execute the tests, the next step is to determine the best point at which to stop executing tests and to enforce it.

The time constraint, or budget (which is where the name Test Budget comes from), is the predetermined time at which we terminate test execution, while aiming to find as many failures as possible within that period.

System Overview

The guiding principle we used to build the Test Budget was: we can't be sure there will be no bugs in production that affect the users after running our test suite in any configuration.

To identify the most valuable tests to run within an established time budget, the following steps must be performed:

  1. identify prioritization criteria and compute the respective prioritized sets of tests
  2. compute the metrics for all criteria and analyze the results to determine the best criteria
  3. further analyze the data to pick a time constraint for running the tests

The image below gives a structural overview of the test prioritization system we built. First, we compute the prioritized sets of tests using historical test results for every prioritization criterion (for example, the failure_rate criterion has its own prioritized set of tests). Then, given some commit and the test-selection set that corresponds to that commit, we execute the prioritized tests as a CI build. These prioritized tests are a subset of the test-selection suite.

Test Prioritization System

First, the system obtains the test result data needed by the prioritization techniques. The data is ingested into a Rails app that’s responsible for the processing and persistence. It exposes the test results through an HTTP API and a GUI. For persistence, we chose Redis, not only because of the unstructured nature of our data, but also because of the Redis Sorted Sets data structure that enables us to query for ordered sets of tests in O(log n) time, where n is the number of elements in the set.
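As a rough sketch of how that lookup works (the key names, scores, and test names here are made up for illustration), each prioritization criterion gets its own sorted set, with a test’s rating as its score:

    require "redis"

    redis = Redis.new

    # One sorted set per criterion; the score is the test's rating
    # (for example, its historical failure rate).
    redis.zadd("prioritized_tests:failure_rate", 0.12, "ProductTest#test_checkout")
    redis.zadd("prioritized_tests:failure_rate", 0.03, "CartTest#test_add_item")

    # Fetch the prioritized order, highest-rated tests first.
    redis.zrevrange("prioritized_tests:failure_rate", 0, -1)
    # => ["ProductTest#test_checkout", "CartTest#test_add_item"]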

The goal of the next step is to select a subset of tests given the changes of the committed code. We created a pipeline that’s being triggered for a percentage of the builds that contain failures. We execute this pipeline with a specific prioritization each time and calculate metrics based on it.

Modeling Risk

During the CI phase, the risk of not finding a fault can be thought of as a numbers game. How certain are we that the application will be released successfully if we have tested all the flows? What if we test the same flows 1,000 times? We leaned on test prioritization to order the tests in such a way that early faults are found as soon as possible, which encouraged the application of heuristics as the prioritization criteria. This section explores how to measure the risk of not detecting faults under the time budget when tests are skipped not at random, but according to the best heuristics.

Prioritization Criteria

We built six test prioritization criteria that produced a rating for every test in the codebase:

  • failure_rate: how frequently a test fails based on historical data.
  • avg_duration: how fast a test executes. Executing faster tests allows us to execute more tests in a short amount of time.
  • churn: a file that’s changing too much could be more brittle.
  • coverage: how much of the source code is executed when running a test.
  • complexity: based on the lines of code per file.
  • default: this is the random order set.

Evaluation Criteria

After we get the prioritized tests, we need to evaluate the results of executing the test suite following the prioritized order. We chose two groups of metrics to evaluate the criteria:

  1. The first includes the Time to First Failure (TTFF) which acts as a tripwire since if the time to first failure is 10 minutes then we can’t enforce a lower time constraint than 10 minutes.
  2. The second group of metrics includes the Average Percentage of Faults Detected (APFD) and the Convergence Index. We needed to start thinking of the test execution timing problem on a risk scale, which would open the way for us to run fewer tests by tweaking how much risk we’re willing to accept.

The APFD is a measure of how early a particular test suite execution detects failures. APFD is calculated using the following formula:

APFD = 1 - (F1 + F2 + … + Fm) / (n × m) + 1 / (2n)

The equation tells us that to calculate the APFD we subtract from 1 the sum of the positions of the first tests that expose each fault, normalized by the product of the suite size and the number of faults, and add a small correction term of 1/(2n). In the equation above:

  • n is the number of test cases in the test suite
  • m is the total number of failures in the test suite
  • Fi is the position, in the prioritized order, of the first test that exposes fault i.

The APFD values range from 0 to 1, where higher APFD values imply a better prioritization.

For example, for the test suites (produced by different prioritization algorithms) T1 and T2 that each have a total number of tests (n) = 100 and total number of faults (m) = 4, we get the following matrix:

Fault | T1 | T2
F1    |  1 |  4
F2    | 10 | 20
F3    | 30 | 60
F4    | 60 | 61

And we calculate their APFD values:

APFD(T1) = 1 - (1 + 10 + 30 + 60) / (100 × 4) + 1 / (2 × 100) = 0.7525
APFD(T2) = 1 - (4 + 20 + 60 + 61) / (100 × 4) + 1 / (2 × 100) = 0.6425

The first prioritization has a better APFD rating (0.7525 versus 0.6425).
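As a sanity check, here’s a small Ruby sketch of the formula that reproduces the numbers above:

    # fault_positions holds the position of the first test exposing each fault.
    def apfd(fault_positions, suite_size)
      faults = fault_positions.size
      1.0 - fault_positions.sum.to_f / (suite_size * faults) + 1.0 / (2 * suite_size)
    end

    apfd([1, 10, 30, 60], 100) # => 0.7525 (T1)
    apfd([4, 20, 60, 61], 100) # => 0.6425 (T2)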

The Convergence Index tells us when to stop testing within a time constrained environment because a high convergence indicates we’re running fewer tests and finding a big percentage of failures.

Convergence Index = Percentage of faults detected / Percentage of tests executed

The formula to calculate the Convergence Index is the percentage of faults detected divided by the percentage of tests executed.
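In Ruby terms, it’s a simple ratio (the sample values below come from the failure_rate results discussed later):

    def convergence_index(faults_detected_pct, tests_executed_pct)
      faults_detected_pct.to_f / tests_executed_pct
    end

    # Finding 80% of the failures after executing only 60% of the tests
    # converges well (a value above 1 means failures are found early).
    convergence_index(80, 60) # => ~1.33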

Data Analysis

For each build, we created and instrumented a prioritized pipeline to produce artifacts for building the prioritization sets and emit test results to Kafka topics.

The prioritization pipeline in Buildkite

We ran the prioritized pipeline multiple times to apply statistical analysis to our results. Finally, we used Python Notebooks to combine all the measurements and easily visualize the percentiles. For APFD and TTFF we decided to use boxplots to visualize possible outliers and skewness of the data.

When Do We Find the First Failing Test?

We used the TTFF metric to quantify how fast we could know that the CI will eventually fail. Finding the failure within a time window is critical because the goal is to enforce that window and stop the test execution when the time window ends.

TTFF

In the figure above we present the statistical distributions for the prioritization criteria using boxplots. The median time to find a failure is less than five minutes for all the criteria. Complexity, churn, and avg_duration have the worst third quartile results with a maximum of 16 minutes. On the other hand, default and failure_rate gave more promising results with a median of less than three minutes.

Which Prioritization Criteria Have the Best Failure Detection Rates?

We used the APFD metric to compare the prioritization criteria. A higher APFD value indicates a better failure detection rate.

APFD scores

The figure above presents the boxplots of APFD values for all the prioritization criteria. We notice that there isn’t a significant difference between the churn and complexity prioritization criteria. Both of these have median values close to zero, which makes them very inappropriate for prioritizing the tests. We also see that failure_rate has the best detection rate, marginally better than the random (default) one.

Which Prioritization Criteria Has the Quickest Convergence Time?

The rate at which new test failures are detected decreases as we execute more tests. This is what we visualized with the convergence index data using a step chart. In all the convergence graphs the step is 10% of the test suite executed.

Mean convergence index

The above figure indicates that, in the mean case, all the criteria find a substantial percentage of faults after running only 50% of the test suite, with the default and failure_rate prioritization criteria standing out.

For the mean case, executing 50% of the test suite finds 50% of the failures using the default prioritization and 60% using failure_rate. The failure_rate criterion is able to detect 80% of the failures after running only 60% of the test suite.

How Much Can We Shrink the Test Suite Given a Time Constraint?

The p20 and p5 visualizations of the convergence quantify how reliably we could detect faults within the time budget. We use the p20 and p5 visualizations because a higher value of convergence is better. The time budget is an upper bound. The CI system executes the tests up to that time bound.

Convergence index p20

For example, after looking at the p20 (80% of builds) plot (the above figure), we need to execute 60% of the test-selection tests (the test-selection suite is 40% of the whole test suite at the median) to detect an acceptable number of failures. The time budget is then the time it takes to execute 60% of the selected tests.

Convergence index p5

Looking at the 5th percentile (95% of builds) plot in the figure above, we notice that we could execute 70% of the already-reduced test-selection suite to detect 50% of the failures.

The Future of Test Budget Prioritization

Looking at our convergence and TTFF results, if we want to emphasize the discovery of a faulty commit (that is, the first failure), we could execute less than 70% of the test-selection suite.

The results of the data analysis suggest several alternatives for future work. First, deep learning models could utilize the time budget as a constraint while they are building the prioritized sets. Prioritizing tests using a feedback mechanism could be the next prioritization to explore, where tests that never run could be automatically deleted from the codebase, or failures that result in problems during production testing could be given a higher priority.

Finally, one potential application of a Test Budget prioritization system lies outside the scope of the continuous integration environment: the development environment. Another way of looking at the ordered sets is that the first tests are more impactful or more susceptible to failures. We could use such data to inform developers, during the development phase, that parts of the codebase are more likely to have failing tests in CI. A message such as “this part of the codebase is covered by a high priority test which breaks in 1% of the builds” would give feedback to developers immediately while they’re writing the code. It would shift testing to the left by giving code suggestions during development, and eventually reduce the costs and time of executing tests in the CI environment.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Continue reading

Adding the V8 CPU Profiler to v8go

Adding the V8 CPU Profiler to v8go

V8 is Google’s open source high-performance JavaScript and WebAssembly engine written in C++. v8go is a library written in Go and C++ allowing users to execute JavaScript from Go using V8 isolates. Using Cgo bindings allows us to run JavaScript in Go at native performance.

The v8go library, developed by Roger Chapman, aims to provide an idiomatic way for Go developers to interface with V8. As it turns out, this can be tricky. For the past few months, I’ve been contributing to v8go to expose functionality in V8. In particular, I’ve been adding support to expose the V8 CPU Profiler.

From the start, I wanted this new API to be:

  • easy for the library's Go users to reason about
  • easy to extend for other profiler functionality eventually
  • aligned closely with the V8 API
  • as performant as possible.

The point about performance is especially interesting. I theorized that my first iteration of the implementation was less performant than a proposed alternative. Without benchmarking them, I proceeded to rewrite. That second implementation was merged, and I moved on with my life. Then I thought, “Hey! I should write a post about the PR and benchmark the results,” only to actually see the benchmarks and reconsider everything.

If you’re interested in API development, Go/Cgo/C++ performance or the importance of good benchmarks, this is a story for you.

Backing Up to the Starting Line: What Was My Goal?

The goal of adding the V8 CPU Profiler to v8go was to let users of the library measure the performance of any JavaScript being executed in a given V8 context. Besides providing insight on the code being executed, the profiler returns information about the JavaScript engine itself, including garbage collection cycles, compilation and recompilation, and code optimization. While virtual machines and the like can run web applications incredibly fast, code should still be performant, and it helps to have data to understand when it’s not.

If we have access to a CPU profiler, we can ask it to start profiling before we start executing any code. The profiler samples the CPU stack frames at a preconfigured interval until it's told to stop. Sufficient sampling helps show the hot code paths whether that be in the source code or in the JavaScript engine. Once the profiler has stopped, a CPU profile is returned. The profile comes in the form of a top-down call tree composed of nodes. To walk the tree, you get the root node and then follow its children all the way down.

Here’s an example of some JavaScript code we can profile:

Using v8go, we start by creating the V8 isolate, context, and CPU profiler. Before running the above code, the profiler is told to start profiling:

After the code has finished running, the profiling is stopped and the CPU profile returned. A simplified profile in a top-down view for this code looks like:

Each of these lines corresponds to a node in the profile tree. Each node comes with plenty of details including:

  • name of the function (empty for anonymous functions)
  • id of the script where the function is located
  • name of the script where the function originates
  • number of the line where the function originates
  • number of the column where the function originates
  • whether the script where the function originates is flagged as being shared cross-origin
  • count of samples where the function was currently executing
  • child nodes of this node
  • parent node of this node
  • and more found in the v8-profiler.h file.

For the purposes of v8go, we don’t need to have opinions about how the profile should be formatted, printed, or used since this can vary. Some may even turn the profile into a flame graph. It’s more important to focus on the developer experience of trying to generate a profile in a performant and idiomatic way.

Evolving the API Implementation

Given the focus on performance and an idiomatic-to-Go API, the PR went through a few different iterations. These iterations can be categorized into two distinct rounds: the first where the profile was lazily loaded and the second where the profile was eagerly loaded. Let’s start with lazy loading.

Round 1: Lazy Loading

The initial approach I took aligned v8go with V8's API as closely as possible. This meant introducing a Go struct for each V8 class we needed and their respective functions (that is, CPUProfiler, CPUProfile, and CPUProfileNode).

This is the Go code that causes the profiler to stop profiling and return a pointer to the CPU profile:

This is the corresponding C++ code that translates the request in Go to V8's C++:

With access to the profile in Go, we can now get the top-down root node:

The root node exercises this C++ code to access the profiler pointer and its corresponding GetTopDownRoot() method:

With the top-down root node, we can now traverse the tree. Each call to get a child, for instance, is its own Cgo call as shown here:

The Cgo call exercises this C++ code to access the profile node pointer and its corresponding GetChild() method:

The main differentiator of this approach is that to get any information about the profile and its nodes, we have to make a separate Cgo call. For a very large tree, this makes at least kN more Cgo calls where k is the number of properties queried, and N is the number of nodes. The value for k will only increase as we expose more properties on each node.

How Go and C Talk to Each Other

At this point, I should explain more clearly how v8go works. v8go uses Cgo to bridge the gap between Go and V8's C code. Cgo allows Go programs to interoperate with C libraries: calls can be made from Go to C and vice versa.

If you do some research about Cgo’s performance, you’ll find Sean Allen’s GopherCon 2018 talk where he made the following recommendation:

“Batch your CGO calls. You should know this going into it, since it can fundamentally affect your design. Additionally once you cross the boundary, try to do as much on the other side as you can. So for go => “C” do as much as you can in a single “C” call. Similarly for “C” => go do as much as you can in a single go call. Even more so since the overhead is much higher.”

Similarly, you’ll find Dave Cheney’s excellent “cgo is not go” that explains the implications of using cgo: 

“C doesn’t know anything about Go’s calling convention or growable stacks, so a call down to C code must record all the details of the goroutine stack, switch to the C stack, and run C code which has no knowledge of how it was invoked, or the larger Go runtime in charge of the program.

The take-away is that the transition between the C and Go world is non trivial, and it will never be free from overhead.”

When we talk about “overhead,” the actual cost can vary by machine, but benchmarks run by another v8go contributor (Dylan Thacker-Smith) show an overhead of about 54 nanoseconds per operation (ns/op) for Go-to-C calls and 149 ns/op for C-to-Go calls:

Given this information, the concern about lazy loading is justified: when a user needs to traverse the tree, they’ll make many more Cgo calls, incurring the overhead cost each time. After reviewing the PR, Dylan suggested building the entire profile graph in C code and then passing a single pointer back to Go, so that Go could rebuild the same graph using Go data structures loaded with all the information, ready to hand to the user. This dramatically reduces the number of Cgo calls, and it brings us to round #2.

Round 2: Eager Loading

To build out a profile for visualization, users will need access to most if not all of the nodes of the profile. We also know that, for performance, we want to limit the number of C calls needed to do so. So we move the heavy lifting of getting the entire call graph inside our C++ function StopProfiling, so that the pointer we return to the Go code is to a call graph fully loaded with all the nodes and their properties. Our Go CPUProfile and CPUProfileNode objects match V8’s API in that they have the same getters, but now, internally, they just return the values from the structs’ private fields instead of reaching back into the C++ code.

This is what the StopProfiling function in C++ does now: once the profiler returns the profile, the function can traverse the graph starting at the root node and build out the C data structures so that a single pointer to the profile can be returned to the Go code that can traverse the graph to build corresponding Go data structures.

The corresponding function in Go, StopProfiling, uses Cgo to call the above C function (CPUProfilerStopProfiling) to get the pointer to our C struct CPUProfile. By traversing the tree, we can build the Go data structures so the CPU profile is completely accessible from the Go side:

With this eager loading, the rest of the Go calls to get profile and node data are as simple as returning the values from the private fields on the struct.

Round 3 (Maybe?): Lazy or Eager Loading

There’s the potential for a variation where both of the above implementations are options. This means allowing users to decide whether they want to lazily or eagerly load everything on the profile. It’s another reason why, in the final implementation of the PR, the getters were kept instead of just making all of the Node and Profile fields public. With the getters and private fields, we can change what’s happening under the hood based on how the user wants the profile to load.

Speed is Everything, So Which One's Faster?

Comparing lazy and eager loading required a test that executed some JavaScript program with a decently sized tree so we could exercise a number of Cgo calls on many nodes. We would measure if there was a performance gain by building the tree eagerly in C and returning that complete call graph as a pointer back to Go.

For quite a while, I ran benchmarks using the JavaScript code from earlier. From those tests, I found that:

  1. When lazy loading the tree, the average duration to build it is ~20 microseconds.
  2. When eagerly loading the tree, the average duration to build it is ~25 microseconds.

It’s safe to say these results were unexpected. As it turns out, the eager approach wasn’t more optimal than lazy loading at this tree size; in fact, it was the opposite, even though lazy loading relied on more Cgo calls.

However, because these results were unexpected, I decided to try a much larger tree using the Hydrogen starter template. From testing this, I found that:

  1. When lazy loading the tree, the average duration to build it is ~90 microseconds.
  2. When eagerly loading the tree, the average duration to build it is ~60 microseconds.

These results aligned better with our understanding of the performance implications of making numerous Cgo calls. For a tiny tree, traversing it three times (twice to eagerly load information and once to print it) doesn’t cost less than a single printing walk that makes numerous Cgo calls. The true cost only shows itself on a much larger tree, where the upfront traversal to build the graph greatly benefits the eventual walkthrough of the large tree to be printed. If I hadn’t tried a different-sized input, I would never have seen that the value of eager loading eventually shows itself. If I drew the growth curves of the respective approaches on a graph, it would look something like:

Time to build the profile (y axis) versus size of the JavaScript (x axis), with the lazy line plotted above the eager line

Looking Back at the Finish Line

As a long-time Go developer, I take plenty of things for granted about memory management and performance. Working on the v8go library has forced me to learn about Cgo and C++ in such a way that I can understand where the performance bottlenecks might be, how to experiment around them, and how to find ways to optimize for them. Specifically, contributing the CPU profiling functionality to the library reminded me that:

  1. I should benchmark code when performance is critical rather than just going with my (or another’s) gut. It absolutely takes time to flesh out a sufficient alternative code path to do fair benchmarking, but chances are you’ll make discoveries along the way.
  2. Designing a benchmark matters. If the variables in the benchmark aren’t reflective of the average use case, then the benchmarks are unlikely to be useful and may even be confusing.

Thank you to Cat Cai, Oliver Fuerst, and Dylan Thacker-Smith for reviewing, clarifying, and generally just correcting me when I'm wrong.

About the Author:

Genevieve is a Staff Developer at Shopify, currently working on Oxygen.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Continue reading

RubyConf 2021: The Talks You Might Have Missed

RubyConf 2021: The Talks You Might Have Missed

Shopify loves Ruby and opportunities to get together with other engineers who love Ruby to learn, share, and build relationships. In November, Rubyists from Shopify’s Ruby and Rails infrastructure teams gathered in Denver at RubyConf 2021 to immerse themselves in all things Ruby with a community of their peers. If you weren’t there or want to revisit the content, we’ve compiled a list of the talks from our engineers. 

A History of Compiling Ruby by Chris Seaton

Love Ruby compilers? Chris does.

“Why is it worth looking at Ruby compilers? Why is it worth looking at compilers at all? Well, I think compilers are fascinating. I’ve been working on them for a couple of decades. I think one of the great things about compilers, you can talk to anyone who’s a developer about compilers, because we all use compilers. Everyone’s got an opinion on how the languages should be designed. You can have conversations with anyone at every level about compilers, and compilers are just really fun. They may seem like a deeply technical topic, but they’re conceptually fairly simple. They take a file as input, they do something internally, and they produce a file as output.”

In this talk, Chris dives into the history of Ruby compilers, the similarities and differences, and what we can learn from them.

Learn more about Chris’ work on TruffleRuby: https://shopify.engineering/understanding-programs-using-graphs

Some Assembly Required by Aaron Patterson 

In typical Aaron style, this talk is filled with puns and humor while being educational and thought-provoking. Aaron shares why he wrote a JIT compiler for Ruby. Why did he write a JIT compiler? 

To see if he could.

“I wanted to see if I could build this thing. For me, programming is a really creative and fun endeavor. I love to program. And many times I’ll just write a project just to see if I can do it. And this is one of those cases. So, I think maybe people are asking, ‘does this thing actually work?’” 

Watch Aaron’s talk to find out if it does work and learn how to build a JIT compiler in pure Ruby. 

Learn more about TenderJIT on GitHub

Building a New JIT Compiler Inside CRuby by Maxime Chevalier-Boisvert

In this talk, Maxime talks about YJIT, an open-source project led by a small team of developers at Shopify to incrementally build a new JIT compiler inside CRuby. She discusses the key advantages of YJIT, the approach the team is taking to implement YJIT, and early performance results.

“The objective is to produce speedups on real-world software. For us, real-world software means large web workloads, such as Ruby on Rails. The benefits of our approach is we’re highly compatible with all existing Ruby code and we’re able to support all of the latest Ruby features.”

Check out YJIT in Ruby 3.1!

Learn more about YJIT:

Gradual Typing in Ruby–A Three Year Retrospective by Ufuk Kayserilioglu and Alexandre Terrasa 

Ufuk and Alexandre share a retrospective of adopting Sorbet at Shopify, why you don’t have to go full-in on types out of the gate, and why gradual typing might be a great middle-ground for your team. They also share lessons learned from a business and technical perspective. 

“You shouldn’t be getting in the way of people doing work. If you want adoption to happen, you need to ramp up gently. We’re doing gradual type adoption. And because this is gradual-type adoption, it’s totally okay to start slow, to start at the lowest strictness levels, and to gradually turn it up as people are more comfortable and as you are more comfortable using the tools.”

Check out the following posts from Ufuk and Alexandre to learn more about static typing for Ruby and adopting Sorbet at scale at Shopify.

Building Native Extensions. This Could Take A While... by Mike Dalessio 

At RubyKaigi 2021, Mike did a deep dive into the techniques and toolchain used to build and ship native C extensions for Ruby. In his latest talk at RubyConf 2021, Mike expands upon the conversation to explore why Nokogiri evolved to use more complex techniques for compilation and installation over the years and touches upon human trust and security. 

“Nokogiri is web-scale now. Since January (2021), precompiled versions of Nokogiri have been downloaded 60 million times. It’s a really big number. If you do back of the envelope power calculations, assuming some things about your core, 2.75 megawatts over 10 months have been saved.”

Mike has provided companion material to the talk on GitHub.

Parsing Ruby by Kevin Newton

Kevin digs into the topic of Ruby parsers with a thorough deep dive into the technical details and tradeoffs of different tools and implementations. While parsing is a technically challenging topic, Kevin delivers a talk that speaks to junior and senior developers, so there’s something for everyone! 

“Parser generators are complicated technologies that use shift and reduce operations to build up syntax trees. Parser generators are difficult to maintain across implementations of languages. They’re not the most intuitive of technologies and it’s difficult to maintain upstream compatibility. It’s a good thing that Ruby is going to slow down on syntax and feature development because it’s going to give an opportunity for all the other Ruby implementations to catch up.”

Problem Solving Through Pair Programming by Emily Harber

We love pair programming at Shopify. In this talk, Emily explores why pair programming is a helpful tool for getting team members up to speed and writing high-quality code, allowing your team to move faster and build for the long term. Emily also provides actionable advice to get started to have more productive pairing sessions.

“Pair programming is something that should be utilized at all levels and not exclusively as a part of your onboarding or mentorship processes. Some of the biggest benefits of pairing carry through all stages of your career and through all phases of development work. Pairing is an extremely high fidelity way to build and share context with your colleagues and to keep your code under constant review and to combine the strengths of multiple developers on a single piece of a shared goal.”


Achieving Fast Method Metaprogramming: Lessons from MemoWise by Jemma Issroff

In this talk, Jemma and Jacob share the journey of developing MemoWise, Ruby’s most performant memoization gem. The presentation digs into benchmarking, unexpected object allocations, performance problems common to Ruby metaprogramming, and their experimentation to develop techniques to overcome these concerns.

“So we were really critically concerned with optimizing our performance as much as possible. And like any good scientist, we followed the scientific method to ensure this happens. So four steps: Observation, hypothesis, experiment, and analysis. Benchmarks are one of the best ways to measure performance and to an experiment that we can use over and over again to tell us exactly how performant our code is or isn’t.” 

Programming with Something by Tom Stuart

In this talk, Tom explores how to store executable code as data in Ruby and write different kinds of programs that process it. He also tries to make “fasterer” and “fastererer” words, but we’ll allow it because he shares a lot of great content.

“A simple idea like the SECD machine is the starting point for a journey of iterative improvement that lets us eventually build a language that’s efficient, expressive, and fast.”

If you are interested in exploring the code shown in Tom’s talk, it’s available on GitHub.

The Audacious Array by Ariel Caplan

Do you love Arrays? In this talk, Ariel explores the “powerful secrets” of Ruby arrays by using…cats! Join Ariel on a journey through his game, CatWalk, which he uses to discuss the basics of arrays, adding and removing elements, creating randomness, interpretation, arrays as sets, and more. 

“When we program, many of the problems that we solve fall into the same few categories. We often need to create constructs like a randomizer, a 2D representation of data like a map, some kind of search mechanism, or data structures like stacks and queues. We might need to take some data and use it to create some kind of report, And sometimes we even need to do operations that are similar to those we do on a mathematical set. It turns out, to do all of these things, and a whole lot more, all we need is a pair of square brackets. All we need is one of Ruby’s audacious arrays.” 

If you want to explore the code for Ariel’s “nonsensical” game, CatWalk, check it out on GitHub

Ruby Archaeology by Nick Schwaderer

In this talk, Nick “digs” into Ruby archeology to run old code, exploring Ruby history and interesting gems from the past, and shares insights from these experiments into what works and what’s changed.

“So why should you become a Ruby archeologist? There are hundreds of millions, if not billions, of lines of valid code, open source for free, on the internet that you can access today. In the Ruby community today, sometimes it feels like we’re converging.”

Keeping Developers Happy With a Fast CI by Christian Bruckmayer

As a member of Shopify’s test infrastructure team, Christian ensures that the continuous integration (CI) systems are scalable, robust, and usable. In this talk, Christian shares techniques such as monitoring, test selection, timeouts, and the 80/20 rule to speed up test suites. 

“The reason we have a dedicated team is just the scale of Shopify. So the Rails core monolith has approximately 2.8 million lines of code, over a thousand engineers work on it, and in terms of testing we have 210,000 Ruby tests. If you execute them it would take around 40 hours. We run around 1,000 builds per day, which means we run around 100 million test runs per day. So that’s a lot.”

Read more about keeping development teams happy with fast CI on the blog.

Note: The first 1:40 of Christian’s talk has minor audio issues, but don’t bail on the talk because the audio clears up quickly, and it’s worth it!

Parallel Testing With Ractors–Putting CPU's to Work by Vinicius Stock

Vini talks about using Ractors to parallelize test execution, builds a test framework on Ractors, compares current solutions, and discusses the advantages and limitations.

“Fundamentally, tests are just pieces of code that we want to organize and execute. It doesn’t matter if in Minitest they are test methods and in RSpec they are Ruby blocks, they’re just blocks of code that we want to run in an organized manner. It then becomes a matter of how fast we can do it in order to reduce the feedback loop for our developers. Then we start getting into strategies for parallelizing the execution of tests.”

Optimizing Ruby's Memory Layout by Peter Zhu & Matt Valentine-House

Peter and Matt discuss how their variable width allocation project can move system heap memory into Ruby heap memory, reducing system heap allocations, and providing finer control of the memory layout to optimize for performance.

“We’re confident about the stability of variable width allocation. Variable width allocation passes all tests on CI on Shopify’s Rails monolith, and we ran it for a small portion of production traffic of a Shopify service for a week, where it served over 500 million requests.”

Bonus: Meet Shopify's Ruby and Rails Infrastructure Team (AMA)

There were a LOT of engineers from the Ruby and Rails teams at Shopify at RubyConf 2021. Attendees had the opportunity to sit with them at a meet and greet session to ask questions about projects, working at Shopify, “Why Ruby?”, and more.

Jennie Lundrigan is a Senior Engineering Writer at Shopify. When she's not writing nerd words, she's probably saying hi to your dog.


We want your feedback! Take our reader survey and tell us what you're interested in reading about this year.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Continue reading

How to Get an Engineering Internship at Shopify: A Complete Guide

How to Get an Engineering Internship at Shopify: A Complete Guide

An important component of being an engineer is getting hands-on experience in the real world, which internships can provide. Engineering internships in particular are critical for helping students develop real-life skills they can’t learn in the classroom. Sadly, the internship market has suffered heavily since the pandemic, with internship opportunities dropping by 52%, according to Glassdoor.

The silver lining is that many companies are now transitioning to incorporate virtual internships like we did. Whether you are a student, recent graduate, career switcher, bootcamp graduate, or another type of candidate, our virtual engineering internships are designed to kickstart your career and impact how entrepreneurs around the world do business.

What is it like to be a Shopify intern? During our latest intern satisfaction survey, 98% of respondents said they would recommend the program to friends. There are many opportunities to build your career, make an impact, and gain real-world experience at Shopify. But don’t just take our word for it! Keep reading to see what our interns had to say about their experience and learn how you can apply.

How to Get an Internship at Shopify

If you’re looking to jumpstart your personal growth and start an internship that can help lay the foundation for a successful career, we can help. Interning at Shopify allows you to work on real projects, solve hard problems, and gain practical feedback along the way. We provide the tools you need to succeed and trust you to take ownership and make great decisions. Here are the steps to getting started.

Step 1: Review Available Opportunities

At Shopify, our engineering internships vary in length from three to eight months, with disciplines such as front-end and back-end development, infrastructure engineering, data engineering, mobile development, and more. Currently we run three intern application cycles a year. Applicants for the Fall 2022 cohort will be able to apply in May of 2022. Join our Shopify Early Talent Community, and we’ll notify you. We also list available internships on our Early Careers page; these include a variety of three, four, and eight-month paid programs.

Step 2: Apply Online

Getting a Shopify engineering internship starts with an online application. We’ll ask you for your resume, cover letter, contact information, education status, LinkedIn profile, and personal website. You’ll also be asked to complete an Intern Challenge to demonstrate your interest in the internship topic. This is a great place to show off your love for engineering. Perhaps you built your own site using Ruby on Rails. We’d love to hear about it!

Step 3: Get Ready for the Skills Challenge

Depending on your specialization, you may be asked to submit a personal project like a GitHub link so that the recruiter can test your skills. Challenges differ by category, but you might be asked to design a Shopify store or to use a coding language like Python or Ruby on Rails to solve a problem. We want to see that you care about the subject, so be specific and put effort into your challenges to make your skills stand out.

Step 4: Prepare for the Interview Process

Shopify's interview process is divided into two phases. Our first stage allows us to get to know you better. Our conversation is called the Life Story, and it's a two-sided conversation that presents both your professional and personal experiences so far. Our second stage is used to assess your technical skills. A challenge will be presented to you, and you will be asked to propose a technical solution.

Top Skills for Engineering Interns

In a series of recent Twitter discussions from August and January, we asked about the most important skills for an engineering intern. More than 100 hiring managers, engineering professionals, and thought leaders responded. Here’s a summary of the skills they look for, along with how our very own interns have learned and applied them.

Top skills for engineering interns: collaboration, lifelong learning, curiosity, GitHub experience, remote work experience, communication, interviewing, and accountability

Collaboration

When you are working with a team, as most interns do, you need to be able to work together smoothly and effectively. Collaboration can encompass several characteristics, including communication, group brainstorming, emotional intelligence, and more. According to one follower on Twitter: “tech is a small part of software engineering, the valuable part is working well in teams.”

Our interns collaborate with talented people around the world. Emily Liu, a former intern and upcoming UX designer at Shopify, said her core team was spread out across five countries. The time differences didn’t stop them from collaborating to achieve a common goal. “Teamwork makes the dream work,” says Emily.

Lifelong Learning

Being a constant learner is one of Shopify's values and is considered a measure of success. As one Twitter follower pointed out, this is especially important in engineering since you should "always be willing to learn, to adapt, and to accept help" and that “even the most senior staff developer can learn something from an intern.”

This is echoed by former intern Andrea Herscovich, who says “if you are looking to intern at a company which values impact over everything, lifelong learning, and entrepreneurship, apply to Shopify!”

Curiosity

Without curiosity, an intern might become stagnant and not stay on top of the latest tools and technologies. A lack of curiosity can hinder an intern’s career in engineering, where technological developments are rapid. One hiring manager responded that curiosity is one of the key things he looks for in engineering interns, but that it’s hard to find.

Andrea Herscovich also says she was encouraged to "be curious." This curiosity allowed her to build her own path for the internship. A particularly memorable project involved contributing to Polaris, Shopify's open-source design system, says Andrea. When working on adding a feature to a component in Polaris, Andrea learned how to develop for a more general audience.   

GitHub Experience

GitHub is an essential tool for collaborating with other developers in most engineering environments. As one Twitter user says: “I don't care if you got an A+ or C- in compilers; I'm going to look at your GitHub (or other public work) to see if you've been applying what you learned.” At Shopify, GitHub plays an important role in collaboration.

Using GitHub, former Shopify intern Kelly Ma says that her mentor provided a list of challenges instead of clearly-defined work for her to solve. During this time, Kelly had a chance to ask questions and learn more about the work of her team. As a result, she interacted with Shopifolk outside of her team and forged new relationships.

Remote Work Experience

A growing number of engineers are now working remotely, and due to COVID-19 the trend is likely to continue well into the future. As an intern, you’ll have the opportunity to gain experience working remotely, which can prepare you for the growing virtual workforce. Perhaps you’re wondering whether a remote internship can deliver the same experience as an in-person one.

One former Shopify intern, Alex Montague, was anxious about how a remote internship would work. After completing the program, he told us, "working from home was pretty typical for a normal day at work" and that the tools he used made remote work easy, and he was "just as productive, if not more so, than if I was in the office." Alex is now a front-end developer on our App Developer Experience team, which provides insights and tools to help partners and merchants build and maintain apps.

Communication 

Today, communication is one of the most important skills engineers can have—and one that they sometimes lack. As one Twitter follower puts it: "nothing in CS you learn will be more important than how to communicate with humans." Fortunately, as an intern, you get the chance to improve on these skills even before you enter the workforce.

Meeting over Google Hangouts, pair programming on Tuple, brainstorming together on Figma, communicating through Slack, and discussing on GitHub are just a few ways that Shopify interns communicate, says Alex Montague. Interns can take advantage of these opportunities to develop core communication skills such as visual communication, written communication, and nonverbal communication.

Interviewing

“Practice interviewing. This is a skill,” one Twitter follower advises. Interviewing well is key to a successful internship search, and it can set you apart from other candidates. At Shopify, the interview process is divided into two different phases. We begin with a Life Story to learn more about you, what motivates you, and how we can help you grow. Our later rounds delve into your technical skills.

As part of his preparation for his Life Story internship interview, Elio Hasrouni noted all the crucial events in his life that have shaped who he is today, starting from his childhood. Among other things, he mentioned his first job, his first coding experience, and what led him into Software Engineering. Elio is now a full-time developer within our Retail and Applications division, which helps power our omnichannel commerce.

Accountability

Accountability involves taking responsibility for actions, decisions, and failures. For an engineering intern, accountability might mean accepting responsibility for your mistakes (and you’ll make plenty of them) and figuring out how to improve. Acknowledging your mistakes helps you demonstrate self-awareness that enables you to identify the problem, address it, and avoid repeating it.

How do you keep yourself accountable? Kelly Ma credits stretch goals, which are targets that are designed to be difficult to achieve, as a way to remain accountable. Other ways she keeps accountable include exploring new technological frontiers and taking on new challenges. One way that Shopify challenged her to be accountable was by asking her to own a project goal for an entire cycle (six weeks). This process included bringing stakeholders together via ad hoc meetings, updating GitHub issues to convey the state of the goal, and learning how to find the right context.

Tips to Help You Succeed

As you might expect, great internship opportunities like this are highly competitive. Applicants must stand out in order to increase their chances of being selected. In addition to the core skills discussed above, there are a few other things that can make you stand out from the crowd.

Practice Our Sample Intern Challenges

It is likely that we will ask you to participate in an Intern Challenge to showcase your skills and help us better understand your knowledge. To help you prepare, you can practice some of our current and previous intern challenges below.

Showcase Your Past Projects

You don’t need prior experience to apply as a Shopify intern, but if you compile all your previous projects relevant to the position you're applying for, your profile will certainly stand out. Your portfolio is the perfect place to show us what you can do. 

Research the Company

It's a good idea to familiarize yourself with Shopify before applying. Take the time to learn about our product, values, mission, vision, and find a connection with them. In order to achieve success, our goals and values should align with your own. 

Additional Resources

Want to learn more about Shopify's Engineering intern program? Check out these posts:

Want to learn more about the projects our interns work on? Check out these posts:

About the Author:

Nathan Quarrie is a Digital Marketing Lead at Shopify based in Toronto, Ontario. Before joining Shopify, he worked in the ed-tech industry where he developed content on topics such as Software Engineering, UX & UI, Cloud Computing, Data Analysis, and Web Development. His content and articles have been published by more than 30 universities including Columbia, Berkeley, Northwestern, and University of Toronto.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

Continue reading

Changing a polymorphic_type in Rails

Changing a polymorphic_type in Rails

In this post I'm going to share how my teammates and I redefined the way we store one of the polymorphic associations in the Shopify codebase. I am part of the newly formed Payment Flexibility team. We work on features that empower merchants to better manage their payments and receivables on Shopify.

Code at Shopify is organized in components. As a new team, we decided to take ownership over some existing code and to move it under the component we’re responsible for (payment flexibility). This resulted in moving classes (including models) from one module to another, meaning their namespace had to change. While thinking about how we were going to move certain classes under different modules, we realized we may benefit from changing the way Rails persists a polymorphic association to a database. Our team had not yet entirely agreed on the naming of the modules and classes. We wanted to facilitate name changes during the future build phase of the project.

We decided to stop storing class names as a polymorphic type for certain records. By default, Rails stores class names as polymorphic types. We decided to instead use an arbitrary string. This article is a step by step representation of how we solved this challenge. I say representation because the classes and data used for this article are not taken from the Shopify codebase. They’re a practical example of the initial situation and the solution we applied.

I’m going to start with a short and simple reminder of what polymorphism is, then move on to a description of the problem, and finish with a detailed explanation of the solution we chose.

What is Polymorphism?

Polymorphism means that something has many forms (from the Greek “polys” for many and “morphē” for form).

A polymorphic relationship in Rails refers to a type of Active Record association that lets you attach a model to another model of varying types while defining only one association.

For the purpose of this post, I’ll take the example of a Vehicle that has_one :key and the Key belongs_to :vehicle.

A Vehicle can be a Car or a Boat.

You can see here that Vehicle has many forms. The relationship between Key and Vehicle is polymorphic.

The foreign key stored on the child object (the Key record in our example) points to a single object (Vehicle) that can have different forms (Car or Boat). The form of the parent object is stored on the child object under the polymorphic_type column. The value of the polymorphic_type is equal to the class name of the parent object, "Car" or "Boat" in our example.

The code block below shows how a polymorphic association is stored in Rails.
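Here’s a minimal sketch, assuming standard Rails model definitions for our example:

    class Key < ApplicationRecord
      belongs_to :vehicle, polymorphic: true
    end

    class Car < ApplicationRecord
      has_one :key, as: :vehicle
    end

    class Boat < ApplicationRecord
      has_one :key, as: :vehicle
    end

    car = Car.create!
    key = Key.create!(vehicle: car)
    key.vehicle_type # => "Car"
    key.vehicle_id   # => car.id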

The Issue

As I said initially, our vehicle classes had to move under another module, a change in module results in a different namespace. For this example I’ll pretend I want to change how our code is organized and put Car under the Garage module.

I go ahead and move the Car and Boat models under the new module Garage:

I’m now running into the following:

The vehicle_type column now contains "Garage::Car", which means we’ll have vehicle_type: "Car" and vehicle_type: "Garage::Car" both stored in our database.

Having these two different vehicle_type values means the Key records with vehicle_type: "Car" won’t be returned when calling a_vehicle.key. The Active Record association has to be aware of all the possible values for vehicle_type in order to find the associated record:

Both these vehicle_type values should point towards the updated model Garage::Car for our polymorphic ActiveRecord association to continue to work. The association is broken in both directions. Calling #vehicle on a Key record that has vehicle_type: "Car" won’t return the associated record:

The Idea

Once we realized changing a namespace was going to introduce complexity and a set of tasks (see next paragraph), one of my teammates said to me, “Let's stop storing class names in the database altogether. By going from a class name to an arbitrary string we could decrease the coupling between our codebase and our database. This means we could more easily change class names and namespaces if we need to in the future.” For our example, instead of storing "Garage::Car" or "Garage::Boat" why don't we just store "car" or "boat"?

To go forward with a module and class name change without modifying the way Active Record stores a polymorphic association, we would have had to add the ability to read from several polymorphic types when setting the Active Record association. We also would have had to update existing records so they point to the new namespace. Going back to our example, records with vehicle_type: "Car" would have had to point towards the new Garage::Car model until we could perform a backfill of the column with the updated model class name.

In Practice: Going From Storing a Class Name to an Arbitrary String

Rails has a way to override the writing of a polymorphic_type value. It’s done by redefining the polymorphic_name method. The code below is taken from the Rails gem source code:

Let's redefine the source code above for our Garage::Car example:
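A sketch of that override, assuming a CLASS_MAPPING hash whose exact shape is an illustration rather than the code we shipped:

    module Garage
      class Car < ApplicationRecord
        has_one :key, as: :vehicle

        # Map class names to the arbitrary strings stored in the database.
        CLASS_MAPPING = { "Garage::Car" => "car" }.freeze

        # Rails calls polymorphic_name when writing the vehicle_type column.
        def self.polymorphic_name
          CLASS_MAPPING[name] || super
        end
      end
    end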

When creating a Key record we now have the following:

Now we have both "Car" (the class name) and "car" (the arbitrary string) stored as vehicle_type. Having two possible values for vehicle_type brings another problem. In a polymorphic association, the target (associated record) is looked up using the single value returned by .polymorphic_name, and this is where the limitation lies. The association is only able to look for one vehicle_type value: the value returned by polymorphic_name when the record was created.

An example of this limitation:

Look closely at the SQL expression, and you’ll see that we’re only looking for keys with vehicle_type = "car" (the arbitrary string). The association won’t find the Key for vehicles created before we started our code change (keys where vehicle_type = "Car"). We have to redefine our association scope so it can look for keys with a vehicle_type of "Car" or "car":
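A sketch of the rescoped association, assuming both values can be present until the data is cleaned up:

    module Garage
      class Car < ApplicationRecord
        # Drop the default vehicle_type condition and accept both the legacy
        # class name and the new arbitrary string.
        has_one :key,
          -> { unscope(where: :vehicle_type).where(vehicle_type: ["Car", "car"]) },
          as: :vehicle
      end
    end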

Our association now becomes the following SQL expression:

The association is now looking up keys with either "car" or "Car" as vehicle_type.

Now that we can read from both the class name and the new arbitrary string as a vehicle_type for our association, we can go ahead and clean up our database to only have arbitrary strings stored as vehicle_type. At Shopify, we use MaintenanceTasks. You could run a migration or a script like the one below to update your records.
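
A minimal backfill sketch (a MaintenanceTasks task would wrap similar logic; the mapping is illustrative):

{
  "Car"          => "car",
  "Garage::Car"  => "car",
  "Boat"         => "boat",
  "Garage::Boat" => "boat",
}.each do |old_type, new_type|
  Key.where(vehicle_type: old_type).in_batches do |batch|
    batch.update_all(vehicle_type: new_type)
  end
end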

Once the cleanup is complete, we only have arbitrary strings stored as vehicle_type. We can go ahead and remove the .unscope from the Garage::Car and Garage::Boat associations.

But Wait, All This for What?

The main benefit of this patch is that we reduced the coupling between our codebase and our database.

Not storing class names as polymorphic types means you can move your classes, rename your modules and classes, without having to touch your existing database records. All you have to do is update the class names used as keys and values in the three CLASS_MAPPING hashes. The value stored in the database will remain the same unless you change the arbitrary strings these classes and class names resolve to.

Our solution adds complexity, and it’s probably not worth it for most use cases. For us it was a good trade-off, since we knew the naming of our modules and classes could change in the near future.

The solution I explained isn’t the one we initially adopted; we initially went down an even more complex route. This post describes the solution we wish we had found when we started looking into the idea of changing how a polymorphic association is stored. After a bit of research and experimentation, I came to this simplified version and thought it was worth sharing.

Diego is a software engineer on the Payment Flexibility Team. Living in the Canadian Rockies.



We Want Your Feedback for the Shopify Engineering Blog

Update (March 11, 2022): The reader survey is now closed. Thanks to all who provided feedback. Keep in touch with us on Twitter at @ShopifyEng.

Hello Shopify Engineering readers,

We’re conducting a survey so we can get a better sense of the stories you’re interested in reading. We want to learn more about you, your likes and dislikes, so we can create the best content possible, from deeply technical guides to pieces on developer culture.

The survey will take five minutes to complete. Your responses will be used for the purpose of improving our content and tailoring newsletters to better reflect your interests.*

Thank you for your feedback—and for reading!

Sincerely,

Anita Clarke
Senior Managing Editor

*Your responses will be analyzed in aggregate and used for research purposes; some aggregated data may be shared externally. Your data will be treated in accordance with Shopify's privacy policy, which can be found here.

Shopify's Playbook for Scaling Machine Learning

Five years ago, my team and I launched the first machine learning product at Shopify. We were determined to build an algorithm-powered product that solved a merchant problem and made a real impact on the business. Figuring out where to start was the hardest part. There was (and still is!) a lot of noise out there on best practices. 

Fast forward to today, and machine learning is threaded into many aspects of Shopify. How did we get to this point? Through our experience building our first few models, we carved out a pragmatic step-by-step guide that has enabled us to successfully scale machine learning across our organization.

Our playbook is tech-independent and can be applied in any domain, no matter what point you’re at in your machine learning journey—which is why we’re sharing it with you today.

Starting From Zero

The first few problems your team chooses to solve with machine learning have a disproportionate impact on the success and growth of your machine learning portfolio. But knowing which problem to start with can be difficult.

1. Identify A Problem Worth Solving

You want to pick a problem that your users care about. This will ensure your users use your solution day in and day out and enable you to gather feedback quickly. Having a deep understanding of your problem domain will help you achieve this. Nothing surpasses the ability to grasp your business goals and user needs. This context will guide your team on what your product is trying to achieve and what data-driven solutions will have a real impact on these priorities.

One way Shopify achieves this is by embedding our data scientists into our various product and commercial lines. This ensures that they have their finger on the pulse and are partners in decision making. With this domain context, we were able to identify a worthy first problem—order fraud detection. Sales are the main goal of our merchants, so we knew this problem impacted every merchant. We also knew the existing solution was a pain point for them (more on this later).

Screenshot of the Fraud Analysis screen in the Shopify admin showing the details for an order that's ranked Low
Order fraud detection in the Shopify admin.

2. Ensure You Have Enough Data

Good, accessible data is half the battle for building a successful model. Many organizations collect data, but you need to have a high degree of confidence in it; otherwise, you have to start collecting it anew. And it needs to be accessible. Is your data loaded into a medium that your data scientists can use? Or do you have to call someone in operations to move data from an S3 bucket?

In our case, we had access to 10 years of transaction data that could help us understand the best inputs and outputs for detecting fraudulent orders. We have an in-house data platform, and our data scientists have easy access to data through tools like Trino (formerly Presto). But the technology doesn’t matter; all that matters is that whatever problem you choose, you have trustworthy and accessible data to help you understand the problem you’re trying to solve.

3. Identify Your Model’s Downstream Dependencies

Keep in mind that any problem you pick won’t be an abstract, isolated problem—there are going to be things downstream that are impacted by it. Understanding your user’s workflow is important as it should influence the conditions of your target.

For example, in order fraud, we know that fulfillment is a downstream dependency. A merchant won’t want to fulfill an order if it’s at high risk of fraud. With this dependency in mind, we realized that we needed to detect fraud before an order is fulfilled: detecting it afterward would render our prediction useless.

4. Understand Any Existing Solutions

If the problem you’re trying to solve has an existing solution, dig into the code and data, talk to the domain experts and fully understand how that solution is performing. If you’re going to add machine learning, you need to identify what you’re trying to improve. By understanding the existing solution, you’ll be able to identify benchmarks for your new solution, or decide if adding machine learning is even needed.

When we dug into the existing rule-based solution for detecting order fraud, we uncovered that it had a high false positive rate. For example, if the billing and shipping address on an order differed, our system would flag that order. Every time an order was flagged, our merchants had to investigate and approve it, which ate up precious time they could be spending focused on growing their business. We also noticed that the high false positive rate was causing our merchants to cancel good orders. Lowering the false positive rate became a tangible benchmark for our new solution.

5. Optimize For Product Outcomes

Remember, this is not an exercise in data science—your product is going to be used by real people. While it’s tempting to optimize for scores such as accuracy, precision and recall, if those scores don’t improve your user experience, is your model actually solving your problem?

A venn diagram showing two circles with Product Outcome and User Trust, the overlap is real world success
In order for a model to have real world success, you need to optimize for product outcome and user trust, not just model scores.

For us, helping merchants be successful (i.e. make valid sales) was our guiding principle, which influenced how we optimize our models and where we put our thresholds. If we optimized our model to ensure zero fraud, then our algorithm would simply flag every order. While our merchants would sell nothing, we would achieve our result of zero fraud. Obviously this isn’t an ideal experience for our merchants. So, for our model, we optimized for helping merchants get the highest number of valid sales.

While you might not pick the same problem, or have the same technology, by focusing on these steps you’ll be able to identify where to add machine learning in a way that drives impact. For more tips from Shopify on building a machine learning model, check out this blog.

Zero to One

So you’ve built a model, but now you’re wondering how to bring it to production? Everything data science builds rests on the foundation of a strong platform, and understanding that will help you find your answer. To bring your models to production in a way that can scale, you need to begin investing in good data engineering practices.

1. Create Well-Defined Pipelines

In order to confidently bring your model to production, you need to build well-defined pipelines for all stages of predictive modeling. For your training pipeline, you don’t want to waste time trying to keep track of your data and asking, “Did I replace the nulls with zeros? Did my colleagues do the same?” If you don’t trust your training, you’ll never get to the point where you feel comfortable putting your model into production. In our case, we created a clean pipeline by clearly labeling our input data, transformations and the features that go into our model.

You’ll want to do the same with your verification and testing pipeline. Building a pipeline that captures rich metadata around which model or dataset was used in your training will enable you to reproduce metrics and trace bugs. With these good data engineering practices in place, you’ll remove burdensome work and be able to establish model trust with your stakeholders.

The model lifecycle: Model Building, Model Evaluation, Productionize Model, Testing, Deployment, Monitoring & Observability
Model lifecycle

2. Decide How to Deploy Your Model

There are a lot of opinions on this, but the answer really depends on the problem and product context. Regardless of which decision you make, there are two key things to consider:

  • What volume will your model experience? Is your model going to run for every user? Or only a select group? Planning for volume means you’ll make better choices. In our case, we knew that our deployment choice had to be able to deal with varying order volumes, from normal traffic days to peak sales moments like Black Friday. That consideration influenced us to deploy the model on Shopify’s core tech stack—Ruby on Rails—because those services are used to high-volume and have resources dedicated to keeping them up and running.
  • What is the commitment between the user and the product? Understand what the user expects or what the product needs because these will have implications on what you can build. For example, our checkout is the heartbeat of our platform and our merchants expect it to be fast. In order to detect fraud as soon as an order is made, our system would have to do a real-time evaluation. If we built an amazing model, but it slowed down our checkout, we would solve one problem, but cause another. You want to limit any unnecessary product or user strain.

By focusing on these steps, we were able to quickly move our order fraud detection model into production and demonstrate if it actually worked—and it did! Our model beat the baseline, which is all we could have asked for. What we shipped was a very simple logistic regression model, but that simplicity allowed us to ship quickly and show impact. Today, the product runs on millions of orders a day and scales with the volume of Shopify. 

Our first model became the stepping stone that enabled us to implement more models. Once your team has one successful solution in production, you now have an example that will evangelize machine learning within your organization. Now it’s time to scale.

One to One Hundred

Now that you have your first model in production, how do you go from one model to multiple models? Whether that’s in the same product or bringing machine learning to other existing products? You have to think about how you can speed up and scale your model building workflows.

1. Build Trust In Your Models

While deploying your first model you focused on beginning to build good engineering practices. As you look to bring models to new products, you need to solidify those practices and build trust in your models. After we shipped our order fraud detection model, we implemented the following key processes into our model lifecycle to ensure our models are trustworthy, and would remain trustworthy:

  • Input and output reconciliation: Ensure the data sets that you use during training match the definition and the measurements of what you see at the time of inference. You’ll also want to reconcile the outcomes of the model to make sure that for the same data you’re predicting the same thing. It seems basic, but we’ve found a lot of bugs this way.
  • Production backtesting: Run your model in shadow for a cohort of users, as if it’s going to power a real user experience. Running backtests for our order fraud detection model allowed us to observe our model predictions, and helped us learn the intricacies of how what we’d built functioned with real world data. It also gave us a deployment mechanism for comparing models.
  • Monitoring: Conditions that once made a feature true may change over time. As your models become more complex, keeping on top of these changes becomes difficult. For example, early on in Shopify’s history, mobile transactions were highly correlated with fraud. However, we passed a tipping point in ecommerce where mobile orders became the primary way of shopping, making our correlation no longer true. You have to make sure that as the world changes, as features change, or distributions change, there are either systems or humans in place to monitor these changes.

2. Encode Best Practices In Your Platform

Now that you’ve solidified some best practices, how do you scale that as your team grows? When your team is small and you only have 10 data scientists, it’s relatively straightforward to communicate standards. You may have a Slack channel or a Google Doc. But as both your machine learning portfolio and team grow, you need something more unifying. Something that scales with you. 

A good platform isn’t just a technology enabler—it’s also a tool you can use to encode culture and best practices. That’s what we did at Shopify. For example, as we outlined above, backtesting is an important part of our training pipeline. We’ve encoded that into our platform by ensuring that if a model isn’t backtested before it goes into production, our platform will fail that model.

While encoding best practices will help you scale, it’s important that you don’t abstract too early. We took the best practices we developed while deploying our order fraud protection model, and a few other models implemented in other products, and refined them through trial and error. Only after taking a few years to see what worked did we encode these practices into our platform.

3. Automate Things!

If on top of building the foundations, our team had to monitor, version, and deploy our models every single day, we’d still be tinkering with our very first model. Ask yourself, “How can I scale far beyond the hours I invest?” and begin thinking in terms of model operations—scheduling runs, automatic checks, model versioning, and, one day, automatic model deployment. In our case, we took the time to build all of this into our infrastructure. It all runs on a schedule every day, every week, for every merchant. Of course, we still have humans in the loop to dig into any anomalies that are flagged. By automating the more operational aspects of machine learning, you’ll free up your data scientists’ time, empowering them to focus on building more awesome models.

Shopify's Order Fraud Pipeline that goes from Python to PMML via Apache Airflow and then to Rails
Shopify’s automated order fraud detection pipeline. Models are built in Python, then PMML (predictive modeling markup language) serializes the models to become language independent, enabling us to deploy in our production system which runs on Ruby. Everything runs on a scheduler with Apache Airflow.

These last three steps have enabled us to deploy and retrain models fast. While we started with one model that sought to detect order fraud, we were able to apply our learnings to build various other models for products like Shopify Capital, product categorization, the Shopify Help Center search, and hundreds more products. If you’re looking to go from one to one hundred, follow these steps, wash, rinse and repeat and you’ll have no problem scaling.

This Is a Full-Stack Problem

Now you have a playbook to scale machine learning in your organization. And that’s where you want to end up—in a place where you’re delivering more value for the business. But even with all of these steps, you won’t truly be able to scale unless your data scientists and data engineers work together. Building data products is a full-stack problem. Regardless of your organization structure, data scientists and data engineers are instrumental to the success of your machine learning portfolio. As my last piece of wisdom, ensure your data scientists and data engineers work in alignment, off of the same road map, and towards the same goal. Here’s to making products smarter for our users!

Solmaz is the Head of Commerce Intelligence and VP of Data Science and Engineering at Shopify. In her role, Solmaz oversees Data, Engineering, and Product teams responsible for leveraging data and artificial intelligence to reduce the complexities of commerce for millions of businesses worldwide.



Hydrogen & Tailwind: The Perfect Match for Building Beautiful Storefronts

Let’s get this out of the way: I really, really like Tailwind. It's my preferred way to style websites, and it enables developers to build beautiful storefronts quickly with Hydrogen, our React-based framework for building custom storefronts. If you’re not familiar with Hydrogen and want to give it a quick spin, visit https://hydrogen.new.

To add Tailwind to a new Hydrogen app, you don’t have to do anything. It’s the default option. It’s literally there the moment you run npx create-hydrogen-app@latest. We bundled Tailwind with the Hydrogen starter template because we think it’s a really powerful and customizable set of tools to get building quickly.

So what’s the best way to use Tailwind in your project? Let’s start with componentization. I consider it one of the most effective ways to work with Tailwind.

Componentization with Tailwind

The first thing you’ll notice about Tailwind is that you use a bunch of CSS classes (often called “utility classes”) to build your website. That’s it—you don’t need to write CSS inside a dedicated CSS file if you don’t want to.
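
For example, a heading styled entirely with utility classes might look like this (an illustrative snippet, not the exact markup from the original post):

<div class="text-center mb-16 font-extrabold text-5xl md:text-7xl">
  Hello, Hydrogen
</div>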

To decipher the code you see above:

  • text-center is the equivalent of setting “text-align: center;”
  • mb-16 indicates that there should be a good amount of margin at the bottom of the div
  • font-extrabold is to assign a font-weight that’s heavier than bold, but not as heavy as black
  • text-5xl is a way to say make this text pretty large
  • md:text-7xl indicates that, at the medium breakpoint, the text should be even larger. (Yes, you read that correctly: you can define responsive styles using class names instead of needing to write `@media` rules in a stylesheet! You can’t do that with regular inline styles.)

The abundance of CSS classes catches people off guard the first time they see a Tailwind website. I was one of these people, too.

One important thing to consider is that most websites are built with components these days. If you’re building a new website, it’s probably componentized on the server (think WordPress files or Rails partials) or componentized on the client (think React or Vue).

Hydrogen is built with React. This means you can use Tailwind classes within each component, and then reuse those components throughout your Hydrogen storefront without having to copy and paste a bunch of CSS classes.
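
Here’s a simplified sketch of such a component (the real starter template markup differs):

export default function Navigation() {
  return (
    <nav className="hidden lg:block">
      <ul className="flex items-center justify-center">
        <li>
          <a className="hover:opacity-80" href="/collections/freestyle">
            Shop
          </a>
        </li>
      </ul>
    </nav>
  );
}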

This example is adapted from Hydrogen’s starter template. It represents a navigation that should be hidden at small breakpoints but displayed at larger breakpoints (hidden lg:block). It outputs an unordered list which displays its items in a centered way using flexbox (flex items-center justify-center). When the navigation links are hovered, their opacity changes to 80% (hover:opacity-80).

Here’s what the navigation looks like at a larger breakpoint:

A screenshot of the Hydrogen Starter Template homepage. The navigation is centered at the top of the screen and separated from the content by a gradient blue bar.
Hydrogen starter template homepage

You can check out the /src/components folder to see a bunch of examples of using Tailwind classes in different components in the Hydrogen starter template.

You might be asking yourself, “What’s the difference between building React components with Tailwind and building React components with something like Bootstrap or my own custom CSS framework?”

At the end of the day, you’re still building a component-based system, just like you would in Bootstrap or a custom framework. The difference is that the classes you apply to your components in a Bootstrap world have names that are tightly coupled to the function of each component.

This makes for a more brittle system. You can imagine that I have a custom framework where I’ve designed a product card that contains a product title, image, and description:
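
Its styles might be written with component-scoped class names like these (illustrative, echoing the BEM convention mentioned below):

.product-card { padding: 1rem; border-radius: 0.5rem; }
.product-card__title { font-weight: 800; }
.product-card__image { width: 100%; }
.product-card__description { color: #4b5563; }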

Screenshot of a Product Card of a brown nike shoe. The title is above the photo and a description is below it.
Product card

Now, let’s pretend that I really like this design. I have some blog posts on my landing page, and I want to use this same card layout for those too. I also want to show an author avatar between my title and my image on those blog posts.

Unfortunately, my class names are tightly-coupled to the product component. My options are:

  • Just re-use my product component and grimace every time I see it being used for the wrong thing
  • Rename my product class names to be more generic, like “card”
  • Duplicate all the class definitions to a new set of classes prefixed with blog-card

I’m not faced with this same dilemma when I’m using Tailwind, since I’m using utility classes that aren’t bound to the semantic meaning of their original use: product-*. I’m free to copy and paste my Tailwind and HTML markup to a new component called <BlogCard> without having to update CSS classes or jump to a stylesheet. I can also easily extract a subset of inner markup to a dedicated component that is shared between <BlogCard> and <ProductCard> without having to deal with renaming BEM-style product-card__title classes.

What About the Learning Curve?

Another question you might have: “Why do I effectively have to learn a new language in order to be productive in Tailwind?”

It’s a fair question. The learning curve for Tailwind can be steep, especially for folks who haven’t touched CSS before. In order to be effective, you still need to have at least some knowledge of how CSS works—when to use margin, when to use padding, and how to leverage flexbox and CSS grid for layouts.

Thankfully, Tailwind’s docs are amazing. They have autocomplete search, logical grouping of CSS topics, and lots of examples. Whenever you’re using Tailwind, you’ll likely have their docs open in another browser tab. Also, Tailwind’s VSCode extension is a must-have. It makes working with Tailwind a brilliant experience in the editor because CSS classes are autocompleted along with their style representations, and you get inline swatch previews for properties like background color.

In my experience, the best way to learn Tailwind is to use it in a real project. This forces you to learn the design patterns and memorize commonly-used Tailwind classes. After working on a project for a couple hours and building up muscle memory, I found myself being way more productive using the framework than I ever was writing custom CSS.

What’s the Deal with All of These Classes?

So you’re off and running with Hydrogen and Tailwind, but maybe one thing is rubbing you the wrong way: why are there so many CSS classes? Isn’t this just like writing inline styles?

Thankfully, no, it’s not like writing inline styles. One huge benefit of Tailwind is enforced consistency and constraints. As a developer who isn’t super great at design, I know that if I’m given a blank canvas with no constraints, it’s likely that I’ll create something that is very meh. Hey, I’m trying to get better! But if I have too many options, or put another way, not enough constraints, my design leads to inconsistent choices. This manifests itself as wonky spacing between elements, subpar typography decisions, and a wild gradient of colors that mimics the result of a toddler getting unsupervised access to their parent’s makeup bag.

Tailwind offers spacing and color stops that enforce a consistent visual look:
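
For example, the default spacing scale maps each class step to a fixed stop (illustrative values from the default theme): p-1 is 0.25rem, p-2 is 0.5rem, p-4 is 1rem, and p-8 is 2rem. The same goes for color stops:

<div class="p-4 bg-gray-100 text-gray-700">Consistently spaced and colored</div>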

As a developer who struggles with analysis paralysis, Tailwind’s constraints are a breath of fresh air. This is how my brain works:

  • Need a little padding? Use p-1.
  • A little more padding? OK, use p-2.
  • Gosh, just a little bit more? Ahh, p-4 should do the trick.

I don’t need to think about pixels, ems, rems, or percentages. And I don’t need to double check that my other hundred components adhere to the same convention since Tailwind enforces it for me. Hydrogen’s developer experience is rooted in this philosophy as well: we don’t want developers to have to think about the nitty-gritty boilerplate, so we provide it for them.

This doesn’t mean you’re absolutely constrained to the stops Tailwind has defined! You can override Tailwind’s design system to define your own values. You can also write arbitrary values as Tailwind classes.
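
For example, you might extend the default scale in tailwind.config.js, or reach for an arbitrary value inline (a sketch):

// tailwind.config.js
module.exports = {
  theme: {
    extend: {
      spacing: { 18: '4.5rem' }, // enables p-18, m-18, and friends
    },
  },
};

// In markup, arbitrary values also work:
// <div class="top-[117px]">…</div>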

Composability

Tailwind is built in a way that it can be composed into a set of components that fit your design system. These design systems are portable.

Since Tailwind leverages utility classes, this means you can copy examples from really smart developers and designers on the Internet and paste them into your website as a starting point. This is really tough to do if you’re not using Tailwind or another utility CSS framework. Without Tailwind, you’d need to:

  • copy one or more CSS files
  • place them in whatever structure you’ve defined for your website’s CSS files
  • paste the HTML into your website
  • update the CSS classes everywhere to conform to your website’s style convention.

You can get a head start by purchasing Tailwind UI, which is a product by Tailwind Labs, the creators of Tailwind. They offer an e-commerce kit with a bunch of really useful components for building custom storefronts. You can also check out other cool Tailwind component collections like Tailwind Starter Kit, HyperUI, and daisyUI.

Because of this composability, copy and paste is actually a feature of Tailwind! You can browse something like Tailwind UI, copy something that strikes your fancy, and paste it into your storefront to customize without any other changes or manual CSS file updates.

Working with a Team

Maybe you work as a solo developer, but working with other developers is fun, too. You should try it! When you work on a team, everybody who edits the codebase needs to be familiar with how things are supposed to be done. Otherwise, it’s easy for a codebase to get out of hand with lots of inconsistencies between each developer’s individual choices.

Tailwind is gold for working with teams. Everyone has access to Tailwind’s docs (I’ve mentioned they’re great, by the way). Once team members get accustomed to Tailwind’s classes, they can look at any component and instantly know how the component is styled at each breakpoint. They don’t need to jump between stylesheets and component markup. They don’t need to spend a few minutes figuring out how the Sass partials work together or style mixins function. In order to be productive, they just read and write CSS classes! This is great news not only for teams but also for open-source projects.

There are so many unique choices we make as individuals that don’t necessarily contribute to a team project in a good way. One example of this is ordering CSS properties in a typical CSS file. Another example of this is naming things. Oh, this actually brings up a great point…

Not Having to Name Things is By Far the Best Part About Using Tailwind, Period

If there’s one thing you take away from this post, let it be this: I’ve spent so many hours of my life as a developer trying to decide what to name things. When I use Tailwind, I don’t have to use that time naming things. Instead, I go for a walk outside. I spend time with my family. I keep writing the screenplay I’ve been putting off for so long.

It’s a hard thing to understand unless you’ve spent some time using Tailwind, not naming things. Plus, when you’re working with other people, you don’t have to quibble over naming conventions in PRs or accrue technical debt when a component’s scope changes slightly and its class names no longer make sense. Granted, you’ll still have to name some things—like components—in your codebase. However, Tailwind’s utility classes grant you the mental freedom from having to assign semantic class names that represent a chunk of styles.

Hydrogen and Tailwind: A Perfect Match

I think you’ll enjoy using Tailwind inside Hydrogen. I didn’t even find an adequate place to mention the fact that Tailwind allows you to use dark mode out of the box! Or that the Tailwind team built a complementary JavaScript library called HeadlessUI that helps you create accessible interactive experiences with any CSS styles, not just Tailwind.

If you finished reading this post, and you still don’t like Tailwind—that’s fine! I don’t think I’ll convince you with this single blog post. But I’d encourage you to give it a shot within the context of a Hydrogen storefront, because I think Tailwind and Hydrogen make for a good combination. Tailwind’s utility classes lend themselves to encapsulation inside Hydrogen’s commerce components. Developers get the best of both worlds with ready-made starter components along with composable styles. Tailwind lets you focus on what is important: building out a Hydrogen storefront and selling products to your customers.

Josh Larson is a Senior Staff Developer at Shopify working on the Hydrogen team. He works remotely from Des Moines, Iowa. Outside of work, he enjoys spending time with his wife, son, and dogs.


React Server Components Best Practices You Can Use with Hydrogen

When my team and I started experimenting with React Server Components (RSC) while building Hydrogen, our React-based framework for building custom storefronts, I was incredibly excited. Not only for the impact this would have on Hydrogen, and the future of ecommerce experience (goodbye large bundle sizes, hello improved buying experiences!), but also for the selfish reason that many of us developers have when encountering new tech: this is going to be fun.

And, indeed, it was… but it was also pretty challenging. RSC is a paradigm shift and, personally, it took some getting used to. I started out building way too many client components and very few server components. My client components were larger than they needed to be and contained logic in them that really had no business existing on the client. Eventually, after months of trial and error and refactoring, it clicked for me. I found it (dare I say it?) easy to build server and client components!

In this post, I’m going to dive into the patterns and best practices for RSC that both myself and my team learned while building Hydrogen. My goal is to increase your understanding of how to approach writing components in an RSC application and cut down your trial-and-error time. Let’s go!

Default to Shared Components

When you need to build a component from scratch in a RSC application, start out with a shared component. Shared components’ entire functionality can execute in both server and client contexts without any issues. They’re a natural middle ground between client and server components and a great starting point for development.

Starting in the middle helps you ask the right questions that lead you to build the right type of component. You’ll have to ask yourself: “Can this bit of code run only on the client?” and, similarly, “Should this bit of code execute on the client?” The next section identifies some of the questions that you should ask.

In our experience, the worst approach you can take in a RSC application is to default to always building client components. While this will get you up and running quickly, your application ends up with a larger than necessary bundle size, containing too many client components that are better suited as server components.

Pivot to a Client Component in Rare Cases

The majority of the components in your RSC application should be server components, so you’ll need to analyze the use case carefully when determining if a client component is even necessary.

In our experience, there are very specific use cases in which a shared component should be pivoted to a client component. Generally, it’s not necessary to convert the entire component into a client component; only the logic necessary for the client needs to be extracted out into a client component. These use cases include

  • incorporating client side interactivity
  • using useState or useReducer
  • using lifecycle rendering logic (for example, useEffect)
  • making use of a third-party library that doesn’t support RSC
  • using browser APIs that aren’t supported on the server.

An important note on this: don’t just blindly convert your whole shared component into a client component. Rather, intentionally extract just the specific functionality you need into a client component. This helps keep your client component and bundle size as small as possible. I’ll show you some examples at the end of this post.

Pivot to a Server Component as Often as Possible

If the component doesn’t include any of the client component use cases, then it should be pivoted to a server component if it’s one of the following use cases:

  • The component includes code that shouldn’t be exposed on the client, like proprietary business logic and secrets.
  • The component won’t be used by a client component.
  • The code never executes on the client (to the best of your knowledge).
  • The code needs to access the filesystem or databases (which aren’t available on the client).
  • The code fetches data from the storefront API (in Hydrogen-specific cases).

If the component is used by a client component, dig into the use cases and implementation. It’s likely you could pass the component through to the client component as a child instead of having the client component import it and use it directly. This eliminates the need to convert your component into a client component, since client components can use server components when they’re passed into them as children.

Explore Some Examples

These are a lot of things to keep in mind, so let’s try out some examples with the Hydrogen starter template.

Newsletter Sign-up

Our first example is a component that allows buyers to sign up to my online store’s newsletter. It appears in the footer on every page, and it looks like this:

Screenshot of the footer Newsletter signup. It has a text box for email and a Sign Me Up button
Newsletter sign-up component

We’ll start with a shared component called NewsletterSignup.jsx:
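
A sketch of the starting point (the markup and sign-up logic are illustrative):

import {useState} from 'react';

export default function NewsletterSignup() {
  const [email, setEmail] = useState('');

  function handleSubmit(event) {
    event.preventDefault();
    // send `email` to the newsletter service (omitted)
  }

  return (
    <div>
      <h2>Sign up for our newsletter</h2>
      <form onSubmit={handleSubmit}>
        <input
          type="email"
          value={email}
          onChange={(event) => setEmail(event.target.value)}
        />
        <button type="submit">Sign Me Up</button>
      </form>
    </div>
  );
}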

In this component, we have two pieces of client interactivity (the input field and the submit button), which indicates that this component, as currently written, can’t be a shared component.

Instead of fully converting this into a client component, we’re going to extract just the client functionality into a separate NewsletterSignupForm.client.jsx component:
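
A sketch of the extracted client component:

import {useState} from 'react';

export default function NewsletterSignupForm() {
  const [email, setEmail] = useState('');

  function handleSubmit(event) {
    event.preventDefault();
    // send `email` to the newsletter service (omitted)
  }

  return (
    <form onSubmit={handleSubmit}>
      <input
        type="email"
        value={email}
        onChange={(event) => setEmail(event.target.value)}
      />
      <button type="submit">Sign Me Up</button>
    </form>
  );
}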

And then update the NewsletterSignup component to use this client component:
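
The wrapper keeps only the static markup (a sketch):

import NewsletterSignupForm from './NewsletterSignupForm.client';

export default function NewsletterSignup() {
  return (
    <div>
      <h2>Sign up for our newsletter</h2>
      <NewsletterSignupForm />
    </div>
  );
}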

It would be tempting to stop here and keep the NewsletterSignup component as a shared component. However, I know for a fact that I want this component to only be used in the footer of my online store, and my footer component is a server component. There’s no need for this to be a shared component and be part of the client bundle, so we can safely change this to a server component by simply renaming it to NewsletterSignup.server.jsx.

And that’s it! You can take a look at the final Newsletter sign-up product on Stackblitz.

Product FAQs

For the next example, let’s add a product FAQ section to product pages. The content here is static and will be the same for each product in my online store. The interaction from the buyer can expand or collapse the content. It looks like this:

Screenshot of a collapsible Product FAQ content. The question has a toggle to hide the answers
Product FAQ content

Let’s start with a shared ProductFAQs.jsx component:
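
A sketch with illustrative content:

export default function ProductFAQs() {
  return (
    <section>
      <h2>FAQs</h2>
      <h3>What is the return policy?</h3>
      <p>Unworn items can be returned within 30 days.</p>
    </section>
  );
}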

Next, we’ll add it to our product page. The ProductDetails.client component is used for the main content of this page, so it’s tempting to turn the ProductFAQs into a client component so that the ProductDetails component can use it directly. However, we can avoid this by passing the ProductFAQs through to the product/[handle].server.jsx page:
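
A sketch of the page (import paths are illustrative):

import ProductDetails from '../../components/ProductDetails.client';
import ProductFAQs from '../../components/ProductFAQs';

export default function Product() {
  return (
    <ProductDetails>
      <ProductFAQs />
    </ProductDetails>
  );
}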

And then update the ProductDetails component to use the children:
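
Roughly:

export default function ProductDetails({children}) {
  return (
    <div>
      {/* ...main product content... */}
      {children}
    </div>
  );
}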

Next, we want to add the client interactivity to the ProductFAQs component. Again, it would be tempting to convert the ProductFAQ component from a shared component into a client component, but that isn't necessary. The interactivity is only for expanding and collapsing the FAQ content—the content itself is hardcoded and doesn’t need to be part of the client bundle. What we’ll do instead is extract the client interactivity into an exclusively client component, Accordion.client.jsx:
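
A sketch of the accordion, holding only the expand/collapse state:

import {useState} from 'react';

export default function Accordion({title, children}) {
  const [open, setOpen] = useState(false);

  return (
    <div>
      <button onClick={() => setOpen(!open)}>{title}</button>
      {open && children}
    </div>
  );
}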

We’ll update the ProductFAQs component to use the Accordion:
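
Something like this:

import Accordion from './Accordion.client';

export default function ProductFAQs() {
  return (
    <section>
      <h2>FAQs</h2>
      <Accordion title="What is the return policy?">
        <p>Unworn items can be returned within 30 days.</p>
      </Accordion>
    </section>
  );
}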

At this point, there’s no reason for the ProductFAQs component to remain a shared component. All the client interactivity is extracted out and, similar to the NewsletterSignup component, I know this component will never be used by a client component. All that’s left now is to:

  • rename the file from ProductFAQs.jsx to ProductFAQs.server.jsx
  • update the import statement in product/[handle].server.jsx
  • add some nice styling to it via Tailwind.

You can view the final Product FAQ code on Stackblitz.

React Server Components are a paradigm shift, and writing a component for an RSC application can take some getting used to. Keep the following in mind while you’re building:

  • Start out with a shared component.
  • Extract functionality into a client component in specific cases.
  • Pivot to a server component if the code never needs to or never should execute on the client.

Happy coding!

Cathryn is a Staff Front End Developer on Shopify’s Checkout team and a founding member of Hydrogen. She works remotely in Montreal, Canada. When not coding, she’s usually playing with her dog, crafting, or reading.


Rapid Development with Hydrogen: Building a Product Page

Last year we released Hydrogen, our React-based framework for building custom storefronts. Hydrogen allows developers to build fast, dynamic commerce experiences by leveraging streaming server-side rendering, React Server Components, and caching APIs. Hydrogen is currently in developer preview and I'm excited to show you how you can rapidly build out a simple product page by leaning on Hydrogen's components.

Sample Snowdevil Product Display Page showing an image of a snowboard, the name, price, variant picker, and Add to cart button
We’ll be using Hydrogen to build a product display page.

Previously, constructing a custom storefront required developers to manually manipulate data and create custom components for each page. Hydrogen accelerates this process by offering Shopify-specific commerce components, hooks, and utilities that allow developers to focus on building unique storefront experiences.

Getting Started

To get started, generate a starter app by heading to hydrogen.new.

Most of the files you’ll work with are located in the /src directory. This directory contains a set of boilerplate components and pages, the main app component (App.server.jsx), and the client/server entry points. For an in-depth overview, see the getting started guide.

The starter app is connected to a demo store which contains a few products. You can find the details in /shopify.config.js.

Hydrogen connects to the Storefront API (with the access token already configured) and allows us to fetch data from descendant server components using the useShopQuery hook.

Hydrogen's starter app comes with a convenient out-of-the-box product page. To get hands-on with the components and understand the moving parts, we'll put this page aside and build our own.

Our goal is to set up a Product Context that allows us to render product details with ease:
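
In other words, markup along these lines (a sketch using the developer preview’s component names):

<Product product={product}>
  <Product.Title />
  <Product.Price />
</Product>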

From there, we'll create a variant picker and allow the customer to add the product to cart.

Creating a Product Page

In your app, create a new page /src/pages/featured.server.jsx with the following code:
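
A minimal starting point might look like this (a sketch; the Layout import path follows the starter app’s structure):

import Layout from '../components/Layout.server';

export default function Featured() {
  return (
    <Layout>
      {/* product details will go here */}
    </Layout>
  );
}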

Hydrogen's file-based routing kicks in, and a new route is created at /featured. This code will be rendered on the server because the filename ends in .server.jsx. Our page is wrapped in the starter app's Layout component that has a cart ready to go.

Sample Snowdevil Product Display Page that's missing the image of a snowboard, name, price, variant picker, and Add to cart button
An empty product page wrapped in a Layout component.

We'll be using a ProductProvider component (alias: Product) to set up a context with product details. This allows descendants to use components like Product.Title (rendering the product title) and hooks like useProductOptions (keeping track of the selected variant). Our product page requires client-side state, so we'll create a /src/components/FeaturedProductDetails.client.jsx client component to house product details.

You'll notice ProductProvider requires a Product object to be passed through as a prop. Instead of manually writing a query to fetch all of the required fields, many Hydrogen components (including this one) have GraphQL fragments that you can use instead. It's important to note that this fragment includes variables that you’ll need to provide values for when performing the query. Joining these pieces together, the /src/pages/featured.server.jsx file looks like this:
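
A sketch of how the pieces fit together (the fragment import, query variables, and product handle follow the developer preview’s conventions and are illustrative):

import gql from 'graphql-tag';
import {useShopQuery, ProductProviderFragment} from '@shopify/hydrogen';
import Layout from '../components/Layout.server';
import FeaturedProductDetails from '../components/FeaturedProductDetails.client';

const QUERY = gql`
  query FeaturedProduct($handle: String!, $numProductVariants: Int = 250) {
    product: productByHandle(handle: $handle) {
      ...ProductProviderFragment
    }
  }
  ${ProductProviderFragment}
`;

export default function Featured() {
  const {data} = useShopQuery({
    query: QUERY,
    variables: {handle: 'the-full-stack'},
  });

  return (
    <Layout>
      <FeaturedProductDetails product={data.product} />
    </Layout>
  );
}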

And the /src/components/FeaturedProductDetails.client.jsx file looks like this:
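
And a sketch of the client component (the import path is from the developer preview and may differ):

import {Product} from '@shopify/hydrogen/client';

export default function FeaturedProductDetails({product}) {
  return (
    <Product product={product}>
      <Product.Title />
    </Product>
  );
}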

The featured product page now shows a product title wrapped in the starter app layout.

Sample Snowdevil Product Display Page showing the title but still missing an image of a snowboard, price, variant picker, and Add to cart button
The product title appears on the product page.

With the context in place, it's a breeze to build out the rest of the page—try adding a product description, price, or custom component like handle (hint: you'll need to import Hydrogen's useProduct hook!):
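
For instance, a hypothetical custom component for the handle could look like this (a sketch):

import {useProduct} from '@shopify/hydrogen/client';

function ProductHandle() {
  const {handle} = useProduct();
  return <span>{handle}</span>;
}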

Customizing Components

Hydrogen components are customized using passthrough and render props. Using passthrough props, you pass attributes as props to the component that passes them through to the rendered HTML tag:

<Product.Price className="font-bold" aria-live="polite" />

And the output is:

<div class="font-bold" aria-live="polite">$749.95</div>

Using render props, you pass a function that returns JSX as a child to the Hydrogen component:

<Product.Price>
  {({ amount, currencySymbol }) => `Fancy price: ${currencySymbol}${amount}`}
</Product.Price>

And the output is:

<div>Fancy price: $749.95</div>

Adding a Variant Picker

Merchants add variants to products that come in more than one option, such as color. To accommodate these, we'll need a variant picker dropdown. We can use the useProduct hook to retrieve the list of variants and a getter/setter for the selected variant. A simple dropdown is created by mapping over the variants. When a selection is made, setSelectedVariant updates the state:
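
A sketch of such a dropdown (property names follow the developer preview’s useProduct hook):

const {variants, selectedVariant, setSelectedVariant} = useProduct();

<select
  value={selectedVariant.id}
  onChange={(event) =>
    setSelectedVariant(variants.find((variant) => variant.id === event.target.value))
  }
>
  {variants.map((variant) => (
    <option key={variant.id} value={variant.id}>
      {variant.title}
    </option>
  ))}
</select>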

To set the initial variant, pass the product provider an initialVariantId property. This is a good place to use the flattenConnection utility that transforms a connection object from the Storefront API into a flat array of nodes:
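
A sketch:

import {flattenConnection} from '@shopify/hydrogen';

const variants = flattenConnection(product.variants);

<Product product={product} initialVariantId={variants[0].id}>
  {/* ... */}
</Product>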

A close-up of the Variant Picker showing three options a user can select
The variant picker.

Adding the Add to Cart Button

The Product.SelectedVariant.AddToCartButton component knows which variant is selected and takes care of adding the item to cart. We can expand on this further by toggling a disabled state and changing the text based on a variant's availability:
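
Something along these lines (a sketch; availableForSale comes from the selected variant):

const {selectedVariant} = useProduct();
const available = selectedVariant.availableForSale;

<Product.SelectedVariant.AddToCartButton disabled={!available}>
  {available ? 'Add to cart' : 'Sold out'}
</Product.SelectedVariant.AddToCartButton>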

Finishing Touches

With a functioning product page in place, we now add a variant image (you guessed it, there’s a component for that too) and tidy up the styling:
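
For the image, a one-liner does the trick (a sketch; the component name follows the Product.SelectedVariant convention used above):

<Product.SelectedVariant.Image className="w-full rounded-lg" />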

The final code is found on StackBlitz.

Sample Snowdevil Product Display Page showing an image of a snowboard, the name, price, variant picker, and Add to cart button
The final product

Hydrogen Enables Rapid Development

Taking advantage of these components, hooks, utilities, and fragments allows you to skip many of the repetitive and mundane parts of building a custom storefront, speeding up the development process.

For more code examples, be sure to examine the starter template which provides a full purchase journey out-of-the-box. You can find these files in the /src directory of your project.

This post just scratched the surface. Check out all of the components and take Hydrogen for a spin on your next project!

Scott’s a Developer Advocate at Shopify, located on the east coast of Australia and formerly a Shopify app developer and developer bootcamp coach. When he's not tinkering with code, you'll find him checking the surf or hanging out with his family.



How We Fixed the Dependency Confusion Vulnerability in Over 600 Ruby Applications

Shopify has grown significantly over the years, and our success makes us an attractive target for malicious actors. We take the safety of our merchants seriously, so we have a good reason to continuously improve the security at Shopify. 

I’ll share how the Ruby Conventions team, which focuses on creating conventions to make Ruby services sustainable, used an iterative approach to solve complex problems at scale while responding to shifting circumstances: in particular, how we solved the dependency confusion vulnerability in over 600 Ruby applications, developed tooling that allows us to do large-scale migrations with ease, and made the Ruby community a bit safer.

Understanding the Dependency Confusion Problem

Shopify runs a bug bounty program where we pay people to find vulnerabilities on our platform and learn what we have to improve on. One such report showed that we were vulnerable to a dependency confusion vulnerability that could give an attacker access to our local, continuous integration/continuous deployment (CI/CD), and production environments.

The vulnerability leverages the ambiguity of a package source to install malicious dependencies. If an external package is created with a higher version number under the same name as an internal Shopify package, the external dependency is resolved instead of the internal dependency. 

In Ruby, developers use Bundler to manage their dependencies and make their environments reproducible. Bundler resolves dependencies so that you use the correct versions and sources for each gem. The Bundler team fixed the issue by introducing a new Gemfile.lock file format that’s created by a fresh install or an update. The new format assigns each gem to an explicit source:
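
An abbreviated sketch of the new lockfile format (the internal remote is illustrative):

GEM
  remote: https://rubygems.org/
  specs:
    rake (13.0.6)

GEM
  remote: https://gems.internal.example.com/
  specs:
    internal-gem (1.2.3)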

However, at that time, the new format required a full upgrade: Bundler would update every dependency in the lockfile, which meant vetting each update and testing the application for regressions in behavior.

Identifying the Impact

We didn’t know how many applications were susceptible to the dependency confusion vulnerability, which made it hard to assess the impact of the problem. Our first step was to disambiguate the situation, so we could understand the problem better.

Disambiguating unknowns doesn’t need to be fancy, and it’s better to have some insight than none. In our case, we defined a cron job in our CI system to get the Bundler version information from all repositories into our data lake. It turned out that around 600 Ruby applications were susceptible to the dependency confusion vulnerability.

Having that data also allowed us to create a metric of outstanding migrations and measure progress towards solving our problem. It’s also a great way of detaching the solution from the goal, which is less constraining.

Changing Assumptions Through Experimentation

As developers, our solution has to take quite a few constraints into account. When developing software iteratively, we try to change some of those constraints and reevaluate our solution quickly. Making those changes as soon as possible surfaces unknowns, increasing the likelihood of a successful project.

In our case, having over 600 repositories to migrate meant that manually migrating every application would be too time-consuming. Requiring teams to do it themselves would be tedious and error-prone because the Gemfile.lock file couldn’t be automatically updated while keeping the current gem versions. In that case, developers would need to modify the lockfile to revert the version updates in order to prevent regressions from being introduced.

If we were able to update a Gemfile.lock to the new format without updating dependencies, it would enable us to automate rolling this upgrade out to all Ruby applications in Shopify. We would only rely on the application owners to deploy the changes.

We experimented with building a Bundler plugin (a gem that extends Bundler’s functionality) to automate the upgrade. It updated the Gemfile.lock file to the new format without updating dependencies. The plugin boiled down to the following steps, sketched in code after the list:

  1. Initializing the specification for a given Gemfile.lock file that contains information about the gems such as the name, the version, and remote.
  2. Updating the Gemfile.lock file to the new lockfile format that updates all gems in the process. We minimize updates by only permitting patch version updates.
  3. Replacing the versions in the updated Gemfile.lock file with the gem versions from the old Gemfile.lock file.
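
A drastically simplified sketch of those steps (the real plugin worked against Bundler’s internals and handled many more edge cases):

require "bundler"

# Step 1: capture the gem versions from the existing lockfile.
lockfile = Bundler.read_file("Gemfile.lock")
old_versions = Bundler::LockfileParser.new(lockfile).specs.to_h do |spec|
  [spec.name, spec.version.to_s]
end

# Step 2: let Bundler rewrite the lockfile in the new format,
# permitting only conservative patch-level updates.
system("bundle", "update", "--all", "--patch") or abort("bundle update failed")

# Step 3: pin the rewritten lockfile back to the original versions.
updated = Bundler.read_file("Gemfile.lock")
old_versions.each do |name, version|
  updated.gsub!(/^(    #{Regexp.escape(name)}) \(\d[^)]*\)$/, "\\1 (#{version})")
end
File.write("Gemfile.lock", updated)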

This approach wasn’t a perfect solution, but it worked well enough to run Bundler migrations. It allowed us to proceed to the next problem area of migrating large numbers of applications.

Running Migrations at Scale

One of the biggest challenges in running large-scale migrations is handling edge cases. Rather than exploring how migrations can go wrong beforehand, it’s more effective to migrate a handful of applications and discover the actual problems. The other benefit is that we can identify and migrate the subset of applications with issues that have known solutions while resolving the edge cases at the same time. This approach allows us to constantly deliver on our goals and put ourselves in a better spot each day.

Our Bundler plugin migrated the lockfile without dependency updates, and then we could start migrating applications. We started out running the plugin on a handful of applications that weren’t merchant-facing. This went smoothly, and we decided to run it on a larger batch for non-critical repositories. However, we noticed issues arising from inconsistent build setups, Ruby versions, and other configurations in the larger batches of migrations.

Some of our tooling didn’t support the latest Bundler version, and we had to work with our deployment, CI, and local environments teams to update them. Our collaborations were particularly fruitful when we:

  • investigated the issue first
  • tried to solve it
  • shared the context with the team. 

Most people want to help and making it easy for them benefits everyone.

Some of our Docker images are built with Heroku’s Ruby buildpack that didn’t support the required Bundler version. This situation rendered a percentage of applications unable to migrate. To solve this issue, we worked with the Heroku Buildpack team to adopt the latest Bundler version. They released a new version with the bundler update, making it broadly available in the Ruby community.

Another critical element was raising awareness with project owners and setting a deadline to deprecate the old Bundler version. Being upfront with owners and communicating the impact of the change allowed teams to prioritize and work with us to update their projects.

The Bundler migration plugin was run locally, but scalability issues arose. It became too complicated to manage different Ruby versions, parallelize them, and address failures. Instead of wasting time on building a solution that would have solved all eventualities at the start, we used the migration plugin to its breaking point, investigated the problem areas, and implemented improvements. 

As a response to our scaling issues, we built a command-line interface (CLI) tool on top of our CI system to set up the right environment for a repository, run commands on it, and open a pull request (PR) based on the changes made. Having an environment per repository worked great because we didn’t run into misconfiguration problems anymore. Using our CI system also allowed us to parallelize the execution, which in turn sped the process up. Furthermore, migration failures were easier to recover from and track.

Preventing Future Problems

Part of iteratively solving a problem means focusing on current problems rather than future concerns. However, it doesn’t mean ignoring future concerns altogether. It’s important to distinguish between critical concerns and ones that can be figured out later on.

One example was preventing a Gemfile.lock file from regressing to its previous format that would make us vulnerable. We were aware of the possibility of regressions, but we also knew that we could build tooling to solve this issue. Instead of investing time in tackling the problem upfront, we decided to wait and start working on it once we migrated most applications. This approach also allowed us to gauge the magnitude of the problem rather than wasting resources working out hypotheticals.

We encountered a handful of regressions during the migration and were a bit concerned. We investigated each one manually to see if bigger problems were present. Since we didn’t find anything suggesting deeper issues, we carried on and kept monitoring, knowing that if we ran into more regressions, we’d have the information needed to change course and face the new reality.

We investigated the lockfile regression problem and shared what we learned with the Bundler team. They enhanced the tool to prevent these cases from occurring in the future. As a result, we didn’t need to implement special tooling to prevent regressions, which saved us a lot of work and time. We only had to make sure that all applications were using the correct Bundler version.

Because we staggered the migration to make continuous progress, most of our applications were initially migrated to a Bundler version that didn’t yet prevent regressions. But since we had battle-tested our migration tooling and resolved most configuration issues, we were able to migrate all of our applications to the latest Bundler version in less than a day.

Rather than waiting for the perfect solution, making iterative changes improved our tooling to the point where changes that used to be hard became easy. This de-risked the deployment.

To prevent the installation of malicious gems, we changed our local environment tooling to always default to the recommended Bundler version, so an individual developer machine isn’t susceptible to running malicious code from the dependency confusion vulnerability. We also started failing CI whenever it encountered an out-of-date Bundler version, ensuring that no code change that could reintroduce the vulnerability gets merged. Since most of our other automated processes require CI to execute, we rely on CI to catch vulnerable Bundler versions.
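As an illustration, a version gate like this fits in a few lines of Ruby. This is a minimal sketch, not our actual CI code: the minimum version and the failure message are assumptions, and the script simply reads the BUNDLED WITH section that Bundler writes at the end of every Gemfile.lock.

    # check_bundler_version.rb — hedged sketch of a CI gate on the lockfile's
    # Bundler version. MINIMUM is illustrative; use the first version you trust.
    MINIMUM = Gem::Version.new("2.2.22")

    bundled_with = File.read("Gemfile.lock")[/^BUNDLED WITH\n\s+([\d.]+)/, 1]

    if bundled_with.nil? || Gem::Version.new(bundled_with) < MINIMUM
      abort "Gemfile.lock was generated with Bundler #{bundled_with || '(unknown)'}; " \
            "run `gem install bundler` and `bundle update --bundler` to use #{MINIMUM}+."
    end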

Sharing What We've Done with the Community

We love open source at Shopify, and we like giving back to the community. When contributing, it’s quite valuable to share the purpose as well as the solution; it leads to insightful conversations that result in a better outcome. Contributions often aren’t solely PRs: providing context on investigative work, bringing problems to someone’s attention, or testing another contributor’s prototypes are just as valuable.

Our plugin worked well for us, so we created a proposal in Bundler to fix the issue for the whole Ruby community. These changes would allow Bundler to update the Gemfile.lock file without upgrading gems in the process. Our proposal didn’t make it in, but it led to a conversation resulting in an alternative approach that shipped in Bundler 2.2.21. We helped test that approach on our applications to catch as many edge cases as possible and minimize the potential burden on the community.

We also ran into issues where developers using an insecure version of Bundler could accidentally revert to the old lockfile format. The problem was that the latest Bundler version (at the time) still resolved the old Gemfile.lock file on `bundle install`, which made it very easy to regress to the old format. We created a prototype to prevent that from happening, which sparked another conversation with the Bundler maintainers and brought the issue to their attention. They released Bundler 2.2.22, which prevents these regressions and makes everybody in the community more secure.

We set out to fix the dependency confusion vulnerability in every Ruby project at Shopify and succeeded. This wouldn’t have been possible without an iterative approach that allowed us to make steady progress while taking shifting circumstances into account. We developed tooling that allows us to do large-scale migrations, which has already come in handy for other uses. We also aggregated Bundler version data on our Ruby projects to track adoption and make future decision-making easier. Lastly, we worked closely with the Bundler team to improve the base functionality, leveraging Shopify’s scale to find edge cases, fix bugs, and make Bundler better for everyone in the Ruby community.

Frederik is a production engineer at Shopify and part of the Ruby & Rails infrastructure team. He contributed to massively scaling Shopify’s CI/CD system and making Ruby services more secure across Shopify and the Ruby community.


Cloud, Load, and Modular Code: What 2022 Looks Like for Shopify

You may have heard that 2021 was Shopify’s biggest Black Friday Cyber Monday (BFCM) ever. This four-day period was monumental for both Shopify’s merchants and our engineering teams.

Last year’s numbers capture a moment in time but can also help us predict what’s to come in the year ahead. On our cloud in 2021, our peak BFCM traffic surpassed 32 million app server requests per minute (RPM). In the same time period our load balancers peaked at more than 34 million RPM. To put that in perspective, this means that the equivalent of Texas’s total population hit our load balancers in a given minute. One flash sale—a short-lived sale that exceeds our checkout per minute threshold—even generated enough load to use over 20% of our total computing capacity at its peak.

During BFCM 2021, we also:

  • sent nearly 145 million emails
  • averaged 30 TB per minute of egress network traffic
  • handled 42 billion API calls and delivered 13 billion webhooks
  • wrote 3.18 PB and read 15 PB of data from our storefront caching infrastructure
  • performed over 11 million queries per second and delivered 11 TB per second of read I/O with our MySQL database fleet

The year ahead poses even bigger challenges for our engineers, data scientists, user experience designers, and product managers. More BFCM sales are happening on mobile devices. More people are shopping on social media. Commerce is happening across a growing array of platforms and buyers expect a fast and consistent experience. If the metaverse becomes a reality, there will be commerce opportunities within that world that need to be explored. What does a flash sale look like in the metaverse and how does that play out?

Infographic: Shopify’s technical stats from BFCM 2021

If the data and trends above tell us anything, it's that there’s no getting around the fact that flash sales, huge floods of web traffic, and many different buying environments are a big part of the future of commerce. The questions for me are: What are the enduring challenges for the engineering teams working to enable this incredible growth in the next five to ten years? How do we build scalable products and infrastructure so millions of merchants can go from zero to IPO—and beyond? Engineering at Shopify is about solving challenges and building resilient systems so merchants can focus on their business instead of technology. 

Here are a few things we’re planning on doing in 2022 to work quickly in a world that’s growing rapidly, becoming more global, and at the same time moving closer to where merchants do business and where buyers are shopping.

We are building more modular code. Shopify famously runs one of the world’s largest Rails monolith codebases. We’ve been actively changing the architecture of the monolith to a majestic, modular monolith for several years. And more recently, we’ve been changing our architectural patterns as we deconstruct parts of the monolith for better developer productivity.

As an example, we split out our storefront rendering process from the modular monolith repo to make sure merchants (and their customers) get the fastest online shopping experience possible. When we were done with the split and some code refactoring work, the results were four times faster cache fill rates and five times faster page render times. Also, pulling the storefront renderer out means it can now be deployed in geographies around the planet without having to deploy our full Rails monolith. The closer we can render the storefront to the buyer, the fewer round-trips between the store and the browser need to be made, again improving overall storefront performance. In 2022, we’re going to continue exploring majestic monoliths. We see that engineers working on repos that directly improve merchant performance, like storefront rendering, iterate and deploy quickly. This model also allows us to put our developer experience first and provide a simpler setup with tighter coupling with our debugging and resiliency tools. 

We are leveraging new cloud development platforms to work more efficiently on a global scale. This year, we’ll spend a lot of time making sure developers can create impact fast—in minutes not hours. We’re moving the majority of our developers into our cloud development environment, called Spin. Devs can spin up (pun intended) a full development environment in seconds as opposed to minutes. You can even have multiple environments for experimentation to share work-in-progress with teammates. (We plan to share more about Spin in the future.)

Another big part of this year will be about building on this cloud development platform foundation to make our developer workflow faster and even smoother. We also moved all of our engineers to Apple M1 MacBook Pro laptops, and these powerful devices, combined with Spin, are already making developers much more productive. Spin creates opportunities for us to build much-improved IDE and browser extensions for enhanced productivity and delight, and an exciting opportunity to explore new ways of solving developer problems at scale that just weren’t possible in our previous local development environment paradigm.

We are making load testing a more natural part of the development process. To prepare for BFCM 2021, we began load testing in July and ran the highest load test in Shopify’s history: a load balancer peak of 50.7 million RPM. But flash sales that spike in minutes aren’t as predictable in their load requirements as a seasonal growth pattern like BFCM. To help prepare our infrastructure and products for larger and spikier scale, we’re continuing to improve our load testing. These load tests, built in-house, help our teams understand how products handle platform-wide surge scenarios. Our load testing covers product sales whether they happen exclusively online, in person using our retail POS products, or through a combination of both. Automating and combining load tests as part of our product development process is absolutely critical to avoiding performance issues as we scale alongside our merchants.

These are a few ways we’re making it as easy as possible for developers to do the best work of their lives. We want to have the right tools so we can be creative about commerce—not “How do I set up my environment?” or “How does my code get built?” Engineers want to work at scale, ship impactful changes on a regular cadence, and work with a great team.

Speaking of great teams, a team of engineers from Shopify and GitHub built YJIT, a new just-in-time (JIT) compiler that was merged into Ruby 3.1. It’s 31% faster than interpreted CRuby and 26% faster than MJIT, reaching near-peak performance after a single iteration of any benchmark. It’s having a huge impact on the Ruby community inside and outside of Shopify and is speeding up execution times for lots of production code.

What isn’t changing in 2022: We remain opinionated about our tech stack. We’re all in on Rails and doubling down on React Native for mobile. We are going to continue to make big bets on our infrastructure, on building delightful developer environments, and making sure that we’re building for the success of all of our merchants. BFCM 2022? Bring it on.

Allan Leinwand is Chief Technology Officer at Shopify leading the engineering and data teams. Allan was previously SVP of Engineering at Slack and CTO at ServiceNow. He co-founded and held senior leadership positions at multiple companies, has authored books, and ventured to the dark side as a venture capital investor for seven years. He’s passionate about helping Shopify be the best commerce platform for everyone!


Search at Shopify—Range in Data and Engineering is the Future

One thing I’ve always appreciated about Shopify is the emphasis on range: the ability to navigate across expertise. Range isn’t just a book we love at Shopify, it’s built into our entire outlook. If you’re a developer at Shopify, you could start your career building data science infrastructure, but decide a few years later to pivot to Ruby internals.

The emphasis on range inspires me. In my coding journey, I’ve loved ranging. I started building AppleBasic programs in 4th grade. Years later, my high school friends and I would try to one-up each other, obsessed with the math behind 3D games.

What does any of this have to do with search?

While most would see search and discovery as a deep specialty, it actually requires an intense amount of range. Many search teams focus too much on specialists—in the words of my former colleague Charlie Hull, teams always want to hire “magical search unicorns” that often don’t exist. Instead, they tend to silo the data scientists and engineers working on search.

I’ve taken these painful experiences to heart when helping build Shopify’s search team. I want to share why range is a core team principle that separates us from the herd and sets us up for long-term success. (And, of course, why you should join, even if you’re not a magical search unicorn!)

Lack of Range: Dysfunction between Data and Engineering 

In reality, nobody on our search team is just an “engineer” or a “data scientist”. Instead, they have the range to be both at the same time. In fact, most of the team has a wide range when it comes to past jobs or hobbies, from linguists to physicists! After all, good decisions require fitting both data science and engineering skills into one brain.

Why? Because of the trade-offs.

Pure data scientists or engineers waste time making poor decisions because they lack full context. They won’t see the other competency’s constraints. That’s why generalizing beyond our expertise is a major part of how Shopifolk work on every project. And that’s precisely why we’ve brought this value to the search domain.

Consider life in the data silo: without engineering context, data scientists can easily chase bleeding-edge machine learning research without considering how to deliver it to production. They develop a new model, decide shipping to production isn’t their job, and instead hand the model to engineers to translate.

In the engineering silo, engineers don’t have the context needed to make the important tradeoffs. Can they know where to tweak the model to remove bloat without hurting relevance? Can pure engineers make the dozens of minute-by-minute decisions needed to optimize relevance, performance, and stability? Without the data context in their brains, they’ll fail, leading to suboptimal solutions!

Great engineering is about making the best decision given the constraints. So when an engineer lacks one crucial piece of know-how—data and relevance—they won’t arrive at the optimal balance between relevance, performance, stability, and other product factors. They’ll blindly implement the model, unsure where to tweak, leading to disastrous results in one of these dimensions.

That leads me to the other end of the trade-off spectrum: the data team creates a reasonable solution, but the infrastructure won’t bend. Unfortunately, the engineers, specifically skilled in performance and reliability, might not see the full search quality spectrum of relevance, experience, and performance. Their incentives focus on questions like: Does search satisfy its service-level agreement? Does it keep me from being woken up at 3 a.m. when I’m on call? With only those constraints, why would an engineer care to build a complicated-looking search relevance model that only risks creating more complexity and instability?

Coordination between two groups—each with only half of the skills needed to make decisions—creates dysfunction. It adds needless time to production deployment and creates politics. 

Silos like these only lead to the dark side.

The solution? RANGE

Range: The Solution to Dysfunction between Data and Engineering

At Shopify, we have one team with members from both competencies. We draw very few lines between “data” and “engineering” work. Instead we have “search” work.

Engineers on our team must grow data science skills: they learn to build and run experiments, think scientifically, and evaluate the quality of a model. Data scientists find themselves pushed to become good engineers. They must write high-quality, performant, and testable code. When they build a model, it’s not just a random idea in a notebook; it’s on them to get it to production and create a maintainable system.

Why does this matter? Because search, like all software development, requires making dozens of deeply intricate tradeoffs between correctness, scalability, performance, and maintainability. Good decisions require fitting both data science and engineering skills in one brain. An elegant solution to a problem is the simplest one that satisfies all of the constraints. If you can only fit half the constraints in your head, you’ll fail to see the best solution that makes search smart, fast, and scalable.

A close partnership between data and engineering organizations makes this possible. Management on both sides has experience and commitment to close collaboration and partnership. At the level of individual contributors, we don’t think of ourselves as two teams. We’re one team, with individuals that report to a few different leads. We organize, plan, and execute together. We don’t carve out territorial fiefdoms.

Data and Engineering Range is the Future

When you look at the problems of tomorrow, they’ll increasingly be less about point-and-click interactivity. They’ll frequently include some “smart” user interaction. The user wants to:

  • talk to the system
  • start with a curated set of possibilities tailored to them, and fine-tune them with their preferences
  • be given options, or be taken on a journey, that filters out the obvious paths they won’t care about

This isn’t just the cool stuff people add on to an existing application; it’s increasingly the core of what’s being built.

I see search and discovery at Shopify as just the beginning. The more personalized or conversational products we build, like those listed above, the more engineers must have the range to push into data (and vice versa). The future isn’t specialization within data science and engineering—it’s having the range to move between both.

Doug Turnbull is a Sr. Staff Engineer at Shopify working on search and discovery. Doug wrote Relevant Search and contributed to AI Powered Search. Doug also blogs heavily at Shopify and his personal site. Currently Doug’s passion includes incubating search and discovery skills at Shopify, planning technical initiatives in search and discovery, and collaborating with peers to make commerce better for everyone through search!


That Old Certificate Expired and Started an Outage. This is What Happened Next

In distributed systems, there are plenty of occasions for things to go wrong. This is why resiliency and redundancy are important. But no matter the systems you put in place, and whether or not you touched your deployments, issues might arise. That makes it critical to acknowledge the near misses: the situations where something could have gone wrong, and the situations where something did, but could have been worse. When was the last time it happened to you? For us at Shopify, it was on September 30th, 2021, when the expiration of Let’s Encrypt’s (old) root certificate almost led to a global outage of our platform.

In April 2021, Let’s Encrypt announced that its old root certificate was expiring. We’ve used Let’s Encrypt as our public certificate provider since becoming a sponsor in 2016, so we made sure that Shopify’s edge infrastructure was up to date with the different requirements and wouldn’t stop serving traffic to all of (y)our beloved shops. As always, Let’s Encrypt did their due diligence with communications and by providing a cross-signing of their new root certificate by the old one. This means that while clients didn’t trust the new root certificate yet, they trusted the old one that signed it, and would transfer their trust to the new one. Also, the period between the announcement and the expiration was long enough for any Let’s Encrypt-emitted certificate, which expires after three months, to be signed by the new cross-signed root certificate and be considered valid using either the old or the new root certificate. We didn’t expect anything bad to happen on September 30th, 2021, when the old root certificate was set to expire at 10:00 a.m. Eastern Standard Time.

At 10:15 a.m. that same day, our monitors started complaining about certificate errors—not at Shopify’s edge, between the public and Shopify, but between our services. As a member of Shopify’s traffic team, which handles a number of matters related to bringing traffic safely and reliably into Shopify’s platform (including the provisioning and handling of certificates), I joined the incident response to help figure out what was happening. Our first response was to lock the deployments of the Shopify monolith (using spy, our chatops) while some of us connected to containers to investigate. In the meantime, we looked at the deployments that were happening when the issue started manifesting. Nothing made sense, as those changes had nothing to do with the way services interconnect, nor with certificates. This is when the Let’s Encrypt root certificate expiry started clicking in our minds: an incident showing certificate validity errors right after the expiry date couldn’t be a coincidence. Yet we couldn’t reproduce the error in our browsers, or even using curl. Using openssl, we could, however, observe the certificate expiry for the old root certificate.
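A minimal Ruby sketch of an equivalent check, using the stdlib openssl bindings (the host name here is a placeholder), might look like this:

    require "openssl"
    require "socket"

    # Open a TLS connection and print the expiry of every certificate in the
    # chain the server presents. VERIFY_NONE lets us inspect an expired chain.
    host = "shop.example.com" # placeholder host
    ctx = OpenSSL::SSL::SSLContext.new
    ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE

    socket = OpenSSL::SSL::SSLSocket.new(TCPSocket.new(host, 443), ctx)
    socket.hostname = host # SNI
    socket.connect

    socket.peer_cert_chain.each do |cert|
      status = cert.not_after < Time.now ? " (EXPIRED)" : ""
      puts "#{cert.subject} — not after #{cert.not_after}#{status}"
    end
    socket.close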

The error was related to the client being used for those connections, and we saw the errors appearing in multiple services across Shopify using different configurations and libraries. For a number of those services, the errors were bubbling up from an internally-built library that allows services to check people’s authentication to Shopify. While Faraday is the library we generally use for HTTP connections, our internal library has dependencies on rack-oauth2 and openid_connect. Looking at the dependency chains for both gems, we saw the following:

Both rack-oauth2 (directly) and openid_connect (indirectly) depend on httpclient, which, according to the GitHub repository of the library, “gives something like the functionality of libwww-perl (LWP) in Ruby.”
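As a hedged illustration (this isn’t the tooling we used during the incident), RubyGems’ own APIs can confirm such a chain by walking a gem’s runtime dependencies:

    require "rubygems"

    # Recursively search a gem's runtime dependency tree for a target gem.
    def depends_on?(gem_name, target, seen = {})
      return false if seen[gem_name]
      seen[gem_name] = true
      Gem::Specification.find_by_name(gem_name).runtime_dependencies.any? do |dep|
        dep.name == target || depends_on?(dep.name, target, seen)
      end
    rescue Gem::MissingSpecError
      false
    end

    puts depends_on?("rack-oauth2", "httpclient")    # => true (direct dependency)
    puts depends_on?("openid_connect", "httpclient") # => true (indirect dependency)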

From other service errors, we identified that google-api-client was also failing. Using the same process, we pinpointed the same library as a dependency:

And so we took a closer look at httpclient and...


Code snippet from httpclient/nahi

Uh-oh, that doesn’t look good. httpclient is heavily used, whether directly or through indirect exposure via the dependency chain. Like web browsers, httpclient embeds a version of the root certificate store. The main difference is that the version of the store in the library was six years old (!!), while reference root certificate stores are generally updated every few months. So even with Let’s Encrypt’s due diligence, a stale client store that didn’t trust the new root certificate directly, and no longer trusted the old one once it expired, was sufficient to cause internal issues.

Our emergency fix was simple. We forked the Git repository, created a branch that overrode cacert.pem with the most recent root certificate bundle, and started using that branch in our deployments to make things work. After confirming the fix was working as expected in our canaries, the problem was solved for the monolith. Then automation helped create pull requests for all our affected repositories.
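In Gemfile terms, pointing an application at such a fork is a one-line change; the repository URL and branch below are placeholders, not our actual fork:

    # Gemfile — use a fork of httpclient whose cacert.pem has been refreshed.
    gem "httpclient", git: "https://github.com/example/httpclient", branch: "updated-cacert"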

Overriding cacert.pem with a more recent one is a temporary fix. However, following a solve-fast approach, it was the one we knew would work automatically for all our deployments without any other changes. To support this fix and make sure a similar issue doesn’t happen again soon, we put systems in place to track changes to the root certificates and automatically update them in our fork when needed. A better long-term approach could be to use the system root certificate store, for instance, which we can adopt after reviewing the root certificate stores across all of our runtime environments.

We wondered why it took about 15 minutes for us to start seeing the effects of the certificate expiry. The answer is actually in the trigger: we started seeing the issue on the Shopify monolith when a deployment happened. HTTP has a notion of persistent connections, also called HTTP keep-alive, that keeps a connection alive as long as it’s being used and only closes it after a short idle period. Also, TLS validation—the check of the validity of certificates—is only performed while initializing the connection, and the trust is maintained for the duration of that connection. Given the traffic on Shopify, our services kept connections to other systems alive, and the only reason those connections were broken was Kubernetes pods being recreated to deploy the new version, leading to new HTTP connections and the failure of TLS validation—hence the 15-minute discrepancy.
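A small Ruby sketch makes this concrete (the host is a placeholder): validation happens once, at connection setup, and every request reusing that connection skips it.

    require "net/http"

    # Net::HTTP.start opens one persistent (keep-alive) connection: the TLS
    # handshake—and therefore certificate validation—happens only here.
    Net::HTTP.start("shop.example.com", 443, use_ssl: true) do |http|
      3.times { http.get("/") } # requests reuse the already-validated connection
    end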

Besides our Ruby applications having (indirect!) dependencies on httpclient, a few of our other systems were affected by the same problem. In particular, services powered by data were left hanging because the application providing them with data was affected by the disruption. For instance, product recommendations weren’t shown during that time, marketing campaigns were temporarily throttled, and, more visibly to our merchants’ customers, order confirmations were delayed for a short period because risk analysis couldn’t be performed.

For the Shopify monolith, however, only the canaries—the servers to which we roll out changes first to test their effect in production before rolling them out to the rest of the fleet—were affected. Our incident response’s initial action of locking deployments also stops any deployment process in its current state. This simple action allowed us to avoid cycling Kubernetes pods for the monolith and keep the current version running, protecting us from a global outage of Shopify—making September 30th, 2021 that one time an outage could have been way worse.

Raphaël is a Staff Production Engineer and the tech lead of the Traffic team at Shopify, taking care of the interfaces between Shopify and the outside world and providing reliable and scalable systems for configuring the edge of Shopify’s ever-growing applications. He holds a Ph.D. in Computer Engineering, specializing in systems performance analysis and tracing, and sometimes lectures future engineers at Polytechnique Montréal about distributed systems and cloud computing.


Nerd Out on 10 of Our Favorite Posts From 2021

Shopify engineers not only work on challenging and impactful projects, but they also take the time to share their craft expertise at events and on the blog. As we close out the year, here are ten of our favorite posts of 2021, a curated selection spanning technologies and engineering disciplines, as well as a few from previous years that many of you still love, read, and share. 

While pulling this list together, the song My Favorite Things kept repeating in my mind. And with that, I apologize in advance for my cringeworthy poetic interpretation. It isn’t easy to rhyme with CRuby.

Building apps with and without Rails libraries
GraphQL how-tos not one but a series
Upgrading MySQL makes your heart sing
These are a few of your favorite things

Building a JIT compiler for CRuby
Making apps faster with caching is groovy
Hydrogen-powered storefronts give you wings
These are a few of your favorite things

1. How to Build a Web App With and Without Rails Libraries

An illustrated factory producing Ruby Gems

Maple Ong, Senior Developer, wrote our most-read post of 2021, an in-depth tutorial that asks you to forget that Rails exists for a minute (whaaaat?!) and takes you on a journey to build a web application only using the standard Ruby libraries.

“With just a little bit of familiarity with the Rails framework, you’re able to build a web application—that’s the magic of Rails. Building your own Ruby web application from scratch, however, isn’t only educational—I’d argue that it’s also a rite of passage in a Ruby and Rails developer’s career!”

Want to learn more about Maple? Follow her on Twitter.

2. Building Blocks of High Performance Hydrogen-powered Storefronts

A structure built with different shapes and colors

Shopify Unite created a lot of buzz around Hydrogen, a React-based framework for building custom and creative storefronts. Ilya Grigorik, Principal Engineer at Shopify, takes us behind the scenes and shares how Hydrogen is built and optimized to power personalized, contextual, and dynamic commerce.

Fun fact: Hydrogen is in developer preview in case you want to take it out for a spin.

3. YJIT: Building a New JIT Compiler for CRuby

A cartoon Ruby Gem levelling up, jumping to higher pipes

Shopify engineers worked on many cool and impactful projects in 2021. Exhibit A: YJIT. Created by a small team led by Staff Engineer Maxime Chevalier-Boisvert, YJIT is a new JIT (just-in-time) compiler built inside CRuby that has now been merged upstream and is part of the Ruby 3.1.0 release. In this post, Maxime writes about the importance of the project to Shopify and Ruby developers worldwide.

For extra credit and additional reading, check out Noah Gibbs’ YJIT posts.

4. A Five-Step Guide for Conducting Exploratory Data Analysis

A person working on graphs at their desk

Cody Mazza-Anthony, Shopify Data Scientist, explains how to use exploratory data analysis (EDA) for answering important business questions and walks through five tips for performing an effective EDA. Here are the key takeaways:

  • Missing values can plague your data
  • Provide a basic description of your features and categorize them
  • Understand your data by visualizing its distribution
  • Your features have relationships! Make note of them
  • Outliers can dampen your fun only if you don’t know about them

5. Upgrading MySQL at Shopify

A cartoon server working out at the gym

Yi Qing Sim, Senior Production Engineer, shares how the Database Platform team performed the most recent MySQL upgrade at Shopify. She also discusses the roadblocks they encountered during rollback testing, the internal tooling built to aid in upgrading and scaling our fleet in general, and guidelines for approaching upgrades going forward.

6. Apache Beam for Search: Getting Started by Hacking Time

An image of a wristwatch

You might know Doug Turnbull, Senior Staff Engineer, for writing the book “Relevant Search”, contributing to “AI-Powered Search”, and creating relevance tooling for Solr and Elasticsearch like Splainer, Quepid, and the Elasticsearch Learning to Rank plugin. We also know him as the author of this excellent post about using Apache Beam for search at Shopify.

7. Understanding GraphQL for Beginners

An illustration of three hamburgers, varying in sizes and toppings

OK, I might be cheating here because this is a three-part series, but it’s been a top read in 2021, prompting our team to refer to these tutorials as the “Everybody Loves Raymond” series. In this hands-on tutorial, Raymond Chung, Technical Educator on the Dev Degree team, teaches you all about GraphQL and digs into the difference between REST and GraphQL.

8. Keeping Developers Happy with a Fast CI

A car races, blurred by the speed

Christian Bruckmayer, Senior Production Engineer, is part of the Test Infrastructure team responsible for ensuring Shopify’s CI systems are scalable, robust, and usable. In this post, Christian shares how the team reduced the p95 of Shopify’s core monolith from 45 minutes to 18, allowing developers to spend less time waiting and ship faster.

9. Rate Limiting GraphQL APIs by Calculating Query Complexity

Illustrated cereal boxes with various GraphQL references

Guilherme Vieira, Senior Developer on the API Patterns team, explores Shopify’s rate-limiting system for the GraphQL Admin API and how it addresses some limitations of methods commonly used in REST APIs. He also describes how we calculate query costs that adapt to the data that clients need while providing a more predictable load on servers.

10. Building an App Clip with React Native

A cartoon phone genie with the Shop App opened on the screen emitting from a lamp

Sebastian Ekström, Senior Developer on the Shop team, recounts what it was like being the first to build an App Clip in React Native that would be surfaced to millions of users each day. 

“We approached this project with a lot of unknowns—the technology was new and new to us. We were trying to build an App Clip with React Native, which isn’t typical! Our approach (to fail fast and iterate) worked well. Having a developer with native iOS development was very helpful because App Clips—even ones written in React Native—involve a lot of Apple’s tooling.”

Bonus: Older Posts You Still Love

The following posts are still very popular on the blog, so it felt wrong to leave them off the list just because they are a couple of years old. Considering the number of people who have read Building a Data Table Component in React, I’m assuming there are thousands of data table components built with React out there, thanks to this post.

  1. Building a Data Table Component in React
  2. Under Deconstruction: The State of Shopify’s Monolith
  3. How to Write Fast Code in Ruby on Rails

Jennie Lundrigan is a Senior Engineering Writer at Shopify. When she's not writing nerd words, she's probably saying hi to your dog.


Shopify’s Unique Data Science Hierarchy Of Needs

You’ve probably seen the “Data Science Hierarchy of Needs” (pictured below) before. Inspired by Maslow, it shows the tooling you would use at different levels in data science—from logging and user-generated content at the bottom, to AI and deep learning at the very top.

While hierarchies like this one can serve as helpful guides, at Shopify we don’t think it always captures the whole picture. For one, it emphasizes particular tools over finding the best solution to a given problem. Plus, it has a tendency to prioritize more “advanced” solutions when a simple one would do.

Data Science hierarchy of needs showing pyramid from top to bottom in ascending order of importance: AI, Learn, Aggregate/Label, Explore/Transform, Move/Store, and Collect
The Data Science Hierarchy of Needs

That’s why we’ve chosen to take a different approach. We’ve created our own Data Science Hierarchy of Needs to reflect the various ways we as a data team create impact, not only for Shopify, but also for our merchants and their customers. In our version, each level of the hierarchy represents a different way we deliver value—not better or worse, just different. 

Our philosophy is much more tool-agnostic, and it emphasizes trying simple solutions before jumping to more advanced ones. This enables us to make an impact faster, then iterate with more complex solutions, if necessary. We see the pinnacle of data science not as machine learning or AI, but in the impact that we’re able to have, no matter the technology we use. Above all, we focus on helping Shopify and our merchants make great decisions, no matter how we get there.  

Below, we’ll walk you through our Data Science Hierarchy of Needs and show you how our tool-agnostic philosophy was the key to navigating the unprecedented COVID-19 pandemic for Shopify and our merchants. 

Tackling The Pandemic With Data

During the pandemic, we depended on our data to give us a clear lens into what was happening, how our merchants were coping, and what we could do to support them. Our COVID-19 impact analysis—a project we launched to understand the impact of the pandemic and support our merchants—is a great example of how our Data Science Hierarchy of Needs works. 

For context—at Shopify, data scientists are embedded in different business units and product teams. When the pandemic hit, we were able to quickly launch a task force with data science representatives from each area of the business. The role of these data scientists was to surface important insights about the effects of the pandemic on our merchants and help us make timely, data-informed decisions to support them.

At every step of the way, we relied on our Data Science Hierarchy of Needs to support our efforts. With the foundations we had built, we were able to quickly ship insights to all of Shopify that were used to inform decisions on how we could best help our merchants navigate these challenging times. Let’s break it down. 

Shopify Data Science hierarchy of needs pyramid showing from top to bottom in increasing size:  Influence, prescribe, predict/infer, describe, collect and model
Shopify’s Data Science Hierarchy of Needs

1. Collecting And Modeling Data To Create A Strong Foundation

The base of our hierarchy is all about building a strong foundation that we can use to support our efforts as we move up the pyramid. After all, we can’t build advanced machine learning models or provide insightful and impactful analysis if we don’t have the data accessible in a clean and conformed manner.  

Activities At The Collect & Model Level

  • Data generation
  • Data platform
  • Acquisition
  • Pipeline build
  • Data modeling
  • Data cleansing

At Shopify, we follow the dimensional modeling methodology developed by Ralph Kimball—a specific set of rules around organizing data in our data warehouse—to ensure all of our data is structured consistently. Since our team is familiar with how things are structured in the foundation, it’s easy for them to interact with the data and start using the tools at higher levels in the pyramid to analyze it.

It’s important to note that even though these foundational practices, by necessity, precede activities at the higher levels, they’re not “less than”—they’re critical to everything we do as data scientists. Having this groundwork in place was absolutely critical to the success of our COVID-19 impact analysis. We weren’t scrambling to find data—it was already clean, structured, and ready to go. Knowing that we had put in the effort to collect data the right way also gave us confidence that we could trust the insights that came out of our analysis.

2. Describing The Data To Gain A Baseline Understanding Of The Business

This next level of the hierarchy is about leveraging the data we’ve collected to describe what we observe happening within Shopify. With a strong foundation in place, we’re able to report metrics and answer questions about our business. For instance, for every product we release, we create associated dashboards to help understand how well the product is meeting merchants’ needs. 

At this phase, we’re able to start asking key questions about our data. These might be things like: What was the adoption of product X over the last three months? How many products do merchants add in their first week on the platform? How many buyers viewed our merchants’ storefronts? The answers to these questions offer us insight into particular business processes, which can help illuminate the steps we should take next—or, they might establish the building blocks for more complex analysis (as outlined in steps three and four). For instance, if we see that the adoption of product X was a success, we might ask, Why? What can we learn from it? What elements of the product launch can we repeat for next time?

Activities At The Describe Level

During our COVID-19 impact analysis, we were interested in discovering how the pandemic was affecting Shopify and our merchants’ businesses: What does COVID-19 mean for our merchants’ sales? Are they being affected in a positive or negative way, and why? This allowed us to establish a baseline understanding of the situation. While for some projects it might have been possible to stop the analysis here, we needed to go deeper—to be able to predict what might happen next and take the right actions to support our merchants. 

3. Predicting And Inferring The Answers To Deeper Questions With More Advanced Analytical Techniques

At this level, the problems start to become more complex. With a strong foundation and clear ability to describe our business, we can start to look forward and offer predictions or inferences as to what we think may happen in the future. We also have the opportunity to start applying more specialized skills to seek out the answers to our questions. 

Activities At The Predict / Infer Level 

These questions might include things like: What do we think sales will be like in the future? What do we think caused the adoption of a particular product? Once we have the answers, we can start to explain why certain things are happening—giving us a much clearer picture of our business. We’re also able to start making predictions about what is likely to happen next.

Circling back to our COVID-19 impact analysis, we investigated what was happening globally and conducted statistical analysis to predict how different regions we serve might be affected. An example of the kinds of questions we asked includes: Based on what we see happening to our merchants in Italy as they enter lockdown, what can we predict will happen in the U.S. if they were to do the same? Once we had a good idea of what we thought might happen, we were able to move on to the next level of the pyramid and decide what we wanted to do about it. 

4. Using Insights To Prescribe Action

At this level, we’re able to take everything from the underlying levels of the hierarchy to start forming opinions about what we should do as a business based on the information we’ve gathered. Within Shopify, this means offering concrete recommendations internally, as well as providing guidance to our merchants. 

Activities At The Prescribe Level

When it came to our COVID-19 impact analysis, our research at the lower levels helped provide the insights to pivot our product roadmap and ship products that we knew could support our merchants. For example:

  • We observed an increase of businesses coming online due to lockdowns, so we offered an extended 90-day free trial to all new merchants
  • Knowing the impact lockdowns would have on businesses financially, we expanded Shopify Capital (our funding program for merchants), then only available in the U.S., to Canada and the UK
  • With the increase of online shopping and delays in delivery, we expanded our shipping options, adding local delivery and the option to buy online, pick up in-store
  • Observing the trend of consumers looking to support local businesses, we made gift cards available for all Shopify plans and added a new feature to our shopping app, Shop, that made it easier to discover and buy from local merchants

By understanding what was happening in the world and the commerce industry, and how that was impacting our merchants and our business, we were able to take action and create a positive impact—which is what we’ll delve into in our next and final section. 

5. Influencing The Direction Of Your Business 

This level of the hierarchy is the culmination of the work below and represents all we should strive to achieve in our data science practice. With a strong foundation and a deep understanding of our challenges, we’ve been able to put forward recommendations—and now, as the organization puts our ideas into practice, we start to make an impact.

Activities At The Influence Level

  • Analytics
  • Machine learning
  • Artificial intelligence
  • Deep dives
  • Whatever it takes! 

It’s critical to remember that the most valuable insights don’t necessarily have to come from using the most advanced tools. Any insight can be impactful if it helps us inform a decision, changes the way we view something, or (in our case) helps our merchants.

Our COVID-19 impact analysis didn’t actually involve any artificial intelligence or machine learning, but it nevertheless had wide-reaching positive effects. It helped us support our merchants through a challenging time and ensured that Shopify also continued to thrive. In fact, in 2020, our merchants made a total of $119.6 billion, an increase of 96% over 2019. Our work at all the prior levels ensured that we could make an impact when it mattered most. 

Delivering Value At Every Level

In practice, positive influence can occur as a result of output at any level of the hierarchy—not just the very top. The highest level represents something that we should keep in mind as we deliver anything, whether it be a model, tool, data product, report analysis, or something else entirely. The lower levels of the hierarchy enable deeper levels of inquiry, but this doesn’t make them any less valuable on their own. 

Using our Data Science Hierarchy of Needs as a guide, we were able to successfully complete our COVID-19 impact analysis. We used the insights we observed and put them into action to support our merchants at the moment they needed them most, and guided Shopify’s overarching business and product strategies through an unprecedented time. 

No matter what level in the hierarchy we’re working at, we ensure we’re always asking ourselves about the impact of our work and how it is enabling positive change for Shopify and our merchants. Our Data Science Hierarchy of Needs isn’t a rigid progression—it’s a mindset.

Phillip Rossi is the Head of Expansion Intelligence at Shopify. He leads the teams responsible for using data to inform decision making for Shopify, our merchants, and our partners at scale.


Building a Real-time Buyer Signal Data Pipeline for Shopify Inbox

By Ashay Pathak and Selina Li

Tens of thousands of merchants use Shopify Inbox as a single business chat app for all online customer interactions and staff communications. Over four million conversations were exchanged on Shopify Inbox in 2020, and 70 percent of Shopify Inbox conversations are with customers making a purchasing decision. This prompted the Shopify Data team to ask ourselves, “How can we help merchants identify and convert those conversations to sales?” 

We built a real-time buyer signal data pipeline to surface relevant customer context—including active cart activities and order completion information—to merchants while they’re chatting with their customers. With these real-time, high-intent customer signals, merchants know where the buyers are in their shopping journey—from browsing products on online stores to placing orders. Merchants can ask more direct questions, better answer customer inquiries, and prioritize conversations that are more likely to convert. 

Animation showing cart events displaying prompts in the merchant’s chat window

We’ll share how we designed our pipeline, along with how we uncovered insights on merchant behaviors through A/B testing. We’ll also discuss how we address the common problems of streaming solutions, tackle complex use cases by leveraging various Apache Beam functions and measure success using an experiment.

Overview

Buyers can message merchants from many different channels like Online Store Chat, Facebook Messenger, and Apple Business Chat. Shopify Inbox allows merchants to manage customer conversations from different messaging channels within a single business chat app. While it’s a great tool for managing customer conversations, we wanted to go one step further by helping merchants optimize sales opportunities in existing conversations and prioritize conversations as they grow.

The majority of Shopify Inbox conversations are with customers making a purchasing decision. We needed to identify signals that represent buyers’ purchase intent and surface them at the right time during a conversation. We achieve this with a real-time Apache Beam pipeline that surfaces high-intent buyer signals in Shopify Inbox.

When a buyer has an active conversation with a merchant, we currently share two buyer signals with the merchant: 

  1. Cart action event: Provides information on buyers’ actions on the cart, product details, and the current status of the cart. 
  2. Order completion event: Provides information on the recent purchase a buyer has made, including an order number URL that enables merchants to view order details in the Shopify admin (where merchants login to manage their business).

These signals are shared in the form of conversation events (as shown in the image below). Conversation events are the means for communicating context or buyer behavior that’s relevant at the time of the conversation. They’re inserted in chronological order within the message flow of the conversation without adding extensive cognitive load for merchants.

An image of a Shopify Inbox chat window on a mobile phone showing conversation events from the cart and order completion event
Example of conversation events—cart and order completion event in Shopify Inbox

In general, the cart and order completion events are aggregated and shared based on the following characteristics:

  • Pre-conversation events: Events that happen up to 14 days before a conversation is initiated.
  • Post-conversation events: Events that happen after a conversation is initiated. The conversation has a life cycle of seven days, and we maintain events in state until the conversation expires. 

Architecture

To deliver quality information to merchants on time, there are two main requirements our system needs to fulfill: low latency and high reliability. We do so by leveraging three key technologies:

  • Apache Kafka 
  • Apache Beam 
  • Google Cloud Dataflow
A system diagram showing the flow from Apache Kafka to Apache Beam to Google Cloud Dataflow
Diagram of system architecture

Message Queues with Apache Kafka

For this pipeline we use two different forms of Kafka events: Monorail and Change Data Capture.

Monorail

Monorail is an abstraction layer developed internally at Shopify that adds structure to raw Kafka events before producing them to Kafka. The structure comes with support for versioning, meaning that if a schema changes upstream, events are produced to the updated version while the Kafka topic remains the same. Having version control is useful in our case, as it helps ensure data integrity.

Change Data Capture (CDC)

CDC uses binlogs and Debezium to create a stream of events from changed data in MySQL databases and supports large record delivery. Some of the inputs to our pipeline aren’t streams by nature, so CDC allows us to read such data by converting it to a stream of events.

Real-time Streaming Processing with Apache Beam 

Apache Beam is a unified batch and stream processing system. Instead of using a batch system to aggregate months of old data and a separate streaming system to process live user traffic, Apache Beam keeps these workflows together in one system. For our specific use case, where the events are transactional in nature, it’s important for the system to be robust and handle all behaviors in a way that keeps the results accurate at all times. To make this possible, Apache Beam provides a variety of features like windowing, timers, and stateful processing.

Google Cloud Dataflow for Deploying Pipeline

We chose Google Cloud Dataflow as the runner for our Apache Beam pipeline. Using a managed service lets us concentrate on the logical composition of our data processing job without worrying too much about the physical orchestration of parallel processing.

High Level System Design

Diagram of the real-time buyer signal system design

The pipeline ingests data from CDC and Monorail while the sink only writes to a Monorail topic. We use Monorail as the standardized communication tool between the data pipeline and dependent service. The downstream consumer processes Monorail events that are produced from our model, structuring those events and sending them to merchants in Shopify Inbox.

The real-time buyer signals pipeline includes the following two main components:

  • Events Filtering Jobs: The cart and checkout data are transactional and include snapshots on every buyer interactions on cart and checkout. Even during non-peak hours, there are tens of thousands of events we read from the cart and checkout source every second. To reduce the workloads of the pipeline and optimize resources, this job only keeps mission-critical events (that is, only relevant transactional events of Shopify Inbox users).
  • Customer Events Aggregation Job: This job hosts the heavy lifting logic for our data pipeline. It maintains the latest snapshot of a buyer’s activities in an online store, including the most recent conversations, completed orders, and latest actions with carts. To make sure this information is accessible at any point of time, we rely on stateful processing with Timers and Global Window in Apache Beam. The event-emitting rule is triggered when a buyer starts a conversation.

The customer events aggregation job is the core of our real-time pipeline, so let’s dive into the design of this job.

Customer Events Aggregation Job

Diagram of the Customer Events Aggregation Job

As shown in the diagram above, the customer events aggregation job ingests three input collections: filtered conversation, checkout, and cart events. All input elements are keyed by _shopify_y, the unique identifier of a buyer on an online store (see our policy on what information Shopify collects from visitors’ devices), and grouped with the CoGroupByKey operator. This allows us to group all input elements into a single tuple collection for easier downstream processing. To ensure we have access to historical information, we leverage Beam state, which stores values per key and window, to access the last seen events. Since state expires when a window ends, we maintain each key over a Global Window—which is unbounded and contains a single window—to allow access to state at any time. We maintain three separate states for each customer event stream: conversation, checkout, and cart. Upon the arrival of new events, a processing-time trigger emits the current data of a window as a pane. Next, we process the last seen events from state and the new events from the pane through logic defined in a PTransform.
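Beam pipelines are usually written in Java or Python, so rather than pseudo-Beam, here is a plain-Ruby sketch of the state this job maintains per buyer; all names and fields are illustrative stand-ins for Beam’s per-key state and the emitting rule described below, not our production code.

    # Plain-Ruby sketch of the per-buyer state kept by the aggregation job.
    # In the real pipeline this lives in Beam per-key state over a Global Window.
    BuyerState = Struct.new(:conversation, :checkout, :cart, keyword_init: true)

    class CustomerEventsAggregator
      def initialize
        # Keyed by the buyer identifier (_shopify_y); a stand-in for
        # Beam's per-key-and-window state.
        @states = Hash.new { |hash, key| hash[key] = BuyerState.new }
      end

      def process(event)
        state = @states[event[:buyer_token]]
        case event[:type]
        when :conversation then state.conversation = event
        when :checkout     then state.checkout = event
        when :cart         then state.cart = event
        end
        # Emitting rule: only surface signals once a conversation is active.
        emit(state) if state.conversation
      end

      def emit(state)
        puts "signal: cart=#{state.cart.inspect}, order=#{state.checkout.inspect}"
      end
    end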

In this stage of the system, upon receiving new events from a buyer, we try to answer the following questions:

1. Does this buyer have an active conversation with the merchant?

This question determines whether our pipeline should emit any output or just process the cart/checkout events and store them in their corresponding states. The business logic of our pipeline is to emit events only when the buyer has started a conversation with the merchant through Shopify Inbox.

2. Do these events occur before or after a conversation is started?

This question relates to how we aggregate the incoming events. We aggregate events based on the two characteristics we mentioned above:

  • Pre-conversation events: We show transactional data on buyers’ activities that occur before a conversation is initiated.
  • Post-conversation events: We show transactional data on buyers’ activities that occur after a conversation is initiated. Using the same scenario mentioned above, we show a cart addition event and an order completion event to the merchant.
Examples of a pre-conversation event (left) versus a post-conversation event (right)

3. What is the latest interaction of a buyer on an online store?

This question reflects the key design principle of our pipeline: the information we share with merchants should be up to date and always relevant to a conversation. The nature of how streaming data arrives at a pipeline, combined with the interconnected process between cart and checkout, introduces the main problems we need to solve in our system.

Here are a few of the challenges we faced when designing the pipeline, and how we solved them.

Interdependency of Cart and Checkout

Cart to checkout is a closely connected process in a buyer’s shopping journey. For example, when a buyer places an order and returns to the online store, the cart should be empty. The primary goal of this job is to mirror this process in the system to ensure the cart and checkout status is reflected correctly at any time. The challenge is that cart and checkout events come from different Monorail sources but depend on each other. Using a single PTransform function allows us to access all mutable states and create dynamic logic based on them. For example, we clear the cart state when receiving a checkout event with the same user token.

Handling Out-of-Order Events

As the information we share in the event is accumulative (for example, total cart value), sharing the buyer signal events in the correct sequence is critical to a positive merchant experience. The output event order should be based on the chronological order of buyers’ interactions with the cart and chat. For example, removal of an item should always come after an item addition. However, one of the common problems with streaming data is that we can’t guarantee events across data sources are read and processed in order. On top of that, the action on the cart isn’t explicitly stated in the source, so we rely on comparing quantity changes between transactional events to extract the cart action.

This problem can be solved by leveraging stateful processing in Apache Beam. A state is a buffer that stores values per key and window. It’s mutable and evolves with time and new incoming elements. State allows us to access previous buyer activity snapshots and identify any out-of-order events by comparing the event timestamp of new events against the events from state. This ensures no outdated information is shared with merchants.
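
In code, the guard can be as simple as a timestamp comparison against the snapshot held in state. Here’s a sketch; occurred_at is a hypothetical event-time field, and state is a per-key Beam state handle like the ones in the aggregation job above.

def keep_if_newer(state, event):
    # Write the event to state unless a newer snapshot is already there,
    # so stale, out-of-order arrivals never overwrite fresh data.
    last_seen = state.read()
    if last_seen is not None and event["occurred_at"] <= last_seen["occurred_at"]:
        return last_seen  # out-of-order arrival: keep the newer snapshot
    state.write(event)
    return event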

Garbage Collection 

To ensure we’re not overloading states with data, we use a Timer to manually clean up expired or irrelevant per-key-and-window values in states. The timer is set to use the event-time domain to manage the state lifecycle per key and window. We use this to accommodate the extendable lifespan of a cart.
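
A sketch of such a timer in the Beam Python SDK follows. The two-week cart lifespan is a made-up value; each new event pushes the expiry forward, matching the extendable lifespan described above.

import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)

class CartStateWithExpiry(beam.DoFn):
    CART = ReadModifyWriteStateSpec("cart", PickleCoder())
    EXPIRY = TimerSpec("expiry", TimeDomain.WATERMARK)  # event-time domain
    CART_TTL_SECONDS = 14 * 24 * 3600  # hypothetical cart lifespan

    def process(
        self,
        element,
        timestamp=beam.DoFn.TimestampParam,
        cart=beam.DoFn.StateParam(CART),
        expiry=beam.DoFn.TimerParam(EXPIRY),
    ):
        token, event = element
        cart.write(event)
        # Each new event extends the cart's life by resetting the timer.
        expiry.set(timestamp + self.CART_TTL_SECONDS)
        yield token, event

    @on_timer(EXPIRY)
    def expire(self, cart=beam.DoFn.StateParam(CART)):
        cart.clear()  # garbage-collect the per-key state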

Sharing Buyer Context Across Conversations

Conversations and cart cookies have different lifespans. The problem we had was that the characteristics of events can evolve over time. For example, a post-conversation cart event can be shared as a pre-conversation event once a conversation expires. To address this, we introduced a dynamic tag in states to indicate whether the events have been shared in a conversation. Whenever the timer for the conversation state fires, it resets this tag in the cart and checkout states.

Testing Our Pipeline

Through this real-time system, we expect the conversation experience to be better for our merchants by providing them with these intelligent insights about the buyer journey. We carried out an experiment and measured the impact on our KPIs to validate the hypothesis. The experiment had a conventional A/B test setup where we divided the total audience (merchants using Shopify Inbox) into two equal groups: control and treatment. Merchants in the control group continued to have the old behavior, while merchants in the treatment group saw the real-time buyer signal events in their Shopify Inbox client app. We tracked the merchant experience using the following metrics:

  • Response Rate: Percent of buyer conversations that got merchant replies. We observed a significant increase of two percentage points.
  • Response Time: Time between the first buyer message and the first merchant response. We observed no significant change in response time, even though merchants were replying to more conversations.
  • Conversion Rate: Percent of buyer conversations attributed to a sale. We observed a significant increase of 0.7 percentage points.

Our experiments showed that with these new buyer signals shown to merchants in real time, they’re able to better answer customer queries because they know where buyers are in their shopping journey. Even better, they’re able to prioritize conversations by responding first to buyers who are already in the checkout process, helping those buyers convert quicker. Overall, we observed a positive impact on all the above metrics.

Key Takeaways of Building a Real-Time Buyer Signals Pipeline

Building a real-time buyer signal data pipeline to surface relevant customer context was a challenging process, but one that makes a real impact on our merchants. To quickly summarize the key takeaways:

  • Apache Beam is a useful system for transactional use cases like carts, as it provides functionality such as state management and timers.
  • Handling out-of-order events is very important for such use cases, and doing so requires robust state management.
  • Controlled experiments are an effective approach to measure the true impact of major feature changes and derive valuable insights on users’ behaviors.

Ashay Pathak is a Data Scientist working on Shopify’s Messaging team. He is currently working on building intelligence in conversations and improving the chat experience for merchants. Previously, he worked on an intelligent product that delivered proactive marketing recommendations to merchants using ML. Connect with Ashay on LinkedIn to chat.

Selina Li is a Data Scientist on the Messaging team. She is currently working to build intelligence in conversations and improve merchant experiences in chat. Previously, she was with the Self Help team, where she contributed to delivering better search experiences for users in the Shopify Help Center and Concierge. Check out her last blog post on Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms. If you would like to connect with Selina, reach out on LinkedIn.


Interested in tackling challenging problems that make a difference? Visit our Data Science & Engineering career page to browse our open positions.

Scaling Shopify’s BFCM Live Map: An Apache Flink Redesign

    By Berkay Antmen, Chris Wu, and Dave Sugden

    In 2017, various teams at Shopify came together to build an external-facing live-streamed visualization of all the sales made by Shopify merchants during the Black Friday and Cyber Monday (BFCM) weekend. We call it the Shopify BFCM live map.

    Shopify’s BFCM live map is a visual signal of the shift in consumer spending towards independent businesses and our way to celebrate the power of entrepreneurship. Over the years, it’s become a tradition for different teams within Shopify to iterate on the live map to see how we can better tell this story. Because of our efforts, people all over the world can watch our merchant sales in real-time, online, broadcast on television, and even in Times Square.

    This year, the Shopify Data Platform Engineering team played a significant role in the latest iteration of the BFCM live map. Firstly, we sought to explore what new insights we could introduce and display on the live map. Secondly, and most importantly, we needed to figure out a way to scale the live map. Last year we had more than 1 million merchants. That number has grown to over 1.7 million. With just weeks left until BFCM, we were tasked with not only figuring out how to address the system’s scalability issues but also challenging ourselves to do so in a way that would help us create patterns we could repeat elsewhere in Shopify.

    We’ll dive into how our team, along with many others, revamped the data infrastructure powering our BFCM live map using Apache Flink. In a matter of weeks, we created a solution that displayed richer insights and processed a higher volume of data at a higher uptime—all with no manual interventions.

    Last Year’s Model Had Met Its Limit

    Last year’s live map drew a variety of transaction data and metadata types from our merchants. The live map looked amazing and did the job, but now with more than 1.7 million merchants on our platform, we weren’t confident that the backend architecture supporting it would be able to handle the volume predicted for 2021.

    With just weeks until BFCM, Shopify execs challenged us to “see if we know our systems” by adding new metrics and scaling the live map.

    In this ask, the Shopify Data Platform Engineering team saw an opportunity. We have an internal consulting team that arose organically to assist Shopify teams in leveraging our data stack. Lately, they'd been helping teams adopt stateful stream processing technologies. Streaming is still a developing practice at Shopify, but we knew we could tap this team to help us use this technology to scale the BFCM live map. With this in mind, we met with the Marketing, Revenue, UX, Product, and Engineering teams, all of whom were equally invested in this project, to discuss what we could accomplish in advance of BFCM.

    Deconstructing Last Year’s Model

    We started by taking stock of the system powering the 2020 live map. The frontend was built with React and a custom 3D visualization library. The backend was a home-grown, bespoke stateful streaming service we call Cricket, built in Go. Cricket processes messages from relevant Kafka topics and broadcasts metrics to the frontend via Redis.

    2020 BFCM live map system diagram

    Our biggest concern was that this year Cricket could be overloaded with the volume coming from the checkout Kafka topic. To give you an idea of what that volume looked like, at the peak we saw roughly 50,000 messages per second during the 2021 BFCM weekend. On top of volume concerns, our Kafka topic contains more than just the subset of events that we need, and those events contain fields we didn’t intend to use.

    Shopify’s 2020 Black Friday Cyber Monday Live Map (Nov 27, 2020), showing sales per minute ($1,541,390), orders per minute (15,875), and carbon offset (254,183 tonnes)

    Another challenge we faced was that the connection between Cricket and the frontend had a number of weaknesses. The original authors were aware of these, but there were trade-offs they’d made to get things ready in time. We were using Redis to queue up messages and broadcast our metrics to browsers, which was inefficient and relatively complex. The metrics displayed on our live map have more relaxed requirements around ordering than, say, chat applications where message order matters. Instead, our live map metrics:

    • Can tolerate some data loss: If you take a look at the image above of last year’s live map, you’ll see arc visuals that represent where an order is made and where it’s shipping to. These visualizations are already sampled because we’re not displaying every single order in the browser (it would be too many!). So it’s okay if we lose some of the arc visuals, since we can’t draw every arc on the screen anyway.
    • Only require the latest value: While Cricket relays near real-time updates, we’re only interested in displaying the latest statistics for our metrics. Last year those metrics included sales per minute, orders per minute, and our carbon offset. Queuing up and publishing the entire broadcasted history for these metrics would be excessive.

    This year, on top of the metrics listed above, we sought to add in:

    • Product trends: Calculated as the top 500 categories of products with the greatest change in sale volume over the last six hours.
    • Unique shoppers: Calculated as unique buyers per shop, aggregated over time.

    In our load tests, we observed that Redis would quickly become a bottleneck due to the increase in the number of published messages and subscribers or connections. This would sometimes cause the browser long polling to hang for too long, making the live map arc visuals momentarily disappear until a response arrived. We needed to address this because we forecasted that this year there would be more data to process. After talking to the teams who built last year’s model and evaluating what existed, we developed a plan and started building our solution.

    The 2021 Solution

    At a minimum, we knew that we had to deliver a live map that scaled at least as well as last year’s, so we were hesitant to go about changing too much without time to rigorously test it all. In a way this complicated things because while we might have preferred to build from scratch, we had to iterate upon the existing system.

    2021 BFCM live map system diagram

    In our load tests, with 1 million checkout events per second at peak, the Flink pipeline was able to operate well under high volume. We decided to put Flink on the critical path to filter out irrelevant checkout events and resolve the biggest issue—that of Cricket failing to scale. By doing this, Cricket was able to process one percent of the event volume to compute the existing metrics, while relying on Flink for the rest.

    Due to our high availability requirements for the Flink jobs, we used a mixture of cross-region sharding and cross-region active-active deployment. Deduplication was handled in Cricket. For the existing metrics, Cricket continued to be the source of computation, and for the new metrics, computed by Flink, Cricket acted as a relay layer.

    For our new product trends metric, we leveraged our product categorization algorithm and emitted the 500 product categories with the greatest sales quantity changes every five minutes. For a given product, the sales quantity percentage change was computed with the following formula:

    change = SUM(prior 1hr sales quantity) / MEAN(prior 6hr sales quantity) - 1

    At a high level, the product trends job computes this change for each product, aggregates the changes by category, and emits the top categories every five minutes.
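
    To make the formula concrete, here’s a small worked example in Python; the quantities are made up for illustration, with the six-hour window split into hourly buckets.

    def trend_change(last_hour_quantities, six_hour_quantities):
        # change = SUM(prior 1hr sales quantity) / MEAN(prior 6hr sales quantity) - 1
        mean_6h = sum(six_hour_quantities) / len(six_hour_quantities)
        return sum(last_hour_quantities) / mean_6h - 1

    # 120 units sold in the last hour against a six-hour mean of 100 units
    # means the product is trending up 20 percent.
    assert round(trend_change([120], [90, 95, 100, 105, 110, 100]), 2) == 0.20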

    So How Did It Do?

    Pulling computation out of Cricket into Flink proved to be the right move. Those jobs ran with 100 percent uptime throughout BFCM without backpressure and required no manual intervention. To mitigate risk, we also implemented the new metrics as batch jobs on our time-tested Spark infrastructure. While these jobs ran well, we ended up not relying on them because Flink met our expectations.

    Here’s a look at what we shipped:

    Shopify’s 2021 Black Friday Cyber Monday Live Map with new data points including unique shoppers and product trends

    In the end, user feedback was positive, and we processed significantly more checkout events, as well as produced new metrics.

    However, not everything went as smoothly as planned. The method that we used to fetch messages from Redis and serve them to the end users caused high CPU loads on our machines. This scalability issue was compounded by Cricket producing metrics at a faster rate and our new product trends metric clogging Redis with its large memory footprint.

    A small sample of users noticed a visual error: some of the arc visuals would initiate, then blip out of existence. With the help of our Production Engineering team, we dropped some of the unnecessary Redis state and quickly unclogged it within two hours.

    Despite the hiccup, the negative user impact was minimal. Flink met our high expectations, and we took notes on how to improve the live map infrastructure for the next year.

    Planning For Next Year

    With another successful BFCM behind us, the internal library we built for Flink has enabled our teams to assemble sophisticated pipelines for the live map in a matter of weeks, proving that we can run mission-critical applications on this technology.

    Beyond BFCM, what we’ve built can be used to improve other Shopify analytic visualizations and products. These products are currently powered by batch processing and the data isn’t always as fresh as we’d like. We can’t wait to use streaming technology to power more products that help our merchants stay data-informed.

    As for the next BFCM, we’re planning to simplify the system powering the live map. And, because we had such a great experience with it, we’re looking to use Flink to handle all of the complexity.

    This new system will enable us to:

    • no longer have to maintain our own stateful stream processor
    • remove the bottleneck in our system
    • only have to consider back pressure at a single point (versus having to handle back pressure in our streaming jobs, in Cricket, and between Cricket and Web).

    We are exploring a few different solutions, but the following is a promising one:

    Potential future BFCM live map system diagram

    The above design is relatively simple and satisfies both our scalability and complexity requirements. All of the metrics would be produced by Flink jobs and periodically snapshotted in a database or key-value store. The Web tier would then periodically synchronize its in-memory cache and serve the polling requests from the browsers.
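
    Here’s a minimal sketch of that read path, in Python for brevity: a background refresher keeps an in-memory cache in sync with the snapshot store, and a handler serves the polling requests. The store client and its latest_snapshots method are hypothetical.

    import json
    import time
    from http.server import BaseHTTPRequestHandler

    METRICS_CACHE = {}  # latest metric values, refreshed from the snapshot store

    def refresh_cache(store, interval_seconds=5):
        # Run in a background thread: periodically pull the latest
        # Flink-produced snapshots into memory.
        while True:
            METRICS_CACHE.update(store.latest_snapshots())  # hypothetical client
            time.sleep(interval_seconds)

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Browsers poll this endpoint; only the latest values are served.
            body = json.dumps(METRICS_CACHE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)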

    Overall, we’re pleased with what we accomplished and excited that we have such a head start on next year’s design. Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Want to help us scale and make commerce better for everyone? Join our team.

    Berkay Antmen leads the Streaming Capabilities team under Data Platform Engineering. He’s interested in computational mathematics and distributed systems. His current Shopify mission is to make large-scale near real-time processing easy. Follow Berkay on Twitter.

    Chris Wu is a Product Lead who works on the Data Platform team. He focuses on making great tools to work with data. In his spare time he can be found buying really nice notebooks but never actually writing in them.

    Dave Sugden is a Staff Data Developer who works on the Customer Success team, enabling Shopifolk to onboard to streaming technology.


    Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.

    Upgrading MySQL at Shopify

    In early September 2021, we retired our last Shopify database virtual machine (VM) that was running Percona Server 5.7.21, marking the complete cutover to 5.7.32. In this post, I’ll share how the Database Platform team performed the most recent MySQL upgrade at Shopify. I’ll talk about some of the roadblocks we encountered during rollback testing, the internal tooling that we built out to aid upgrading and scaling our fleet in general, and our guidelines for approaching upgrades going forward, which we hope will be useful for the rest of the community.

    Why Upgrade and Why Now?

    We were particularly interested in upgrading due to the replication improvements that would preserve replication parallelism in a multi-tier replication hierarchy via transaction writesets. However, in a general sense, upgrading our version of MySQL was on our minds for a while and the reasons have become more important over time as we’ve grown:

    • We’ve transferred more load to our replicas over time, and without replication improvements, high load could cause replication lag and a poor merchant and buyer experience.
    • Due to our increasing global footprint, to maintain efficiency, our replication topology can be up to four “hops” deep, which increases the importance of our replication performance.
    • Without replication improvements, in times of high load such as Black Friday/Cyber Monday (BFCM) and flash sales, there’s a greater likelihood of replication lag that in turn heightens the risk to merchants’ data availability in the event of a writer failure.
    • It’s industry best practice to stay current with all software dependencies to receive security and stability patches.
    • We expect to eventually upgrade to MySQL 8.0. Building the upgrade tooling required for this minor upgrade helps us prepare for that.

    To the last point, one thing we definitely wanted to achieve as a part of this upgrade was—to put it in the words of my colleague Akshay—“Make MySQL upgrades at Shopify a checklist of tasks going forward, as opposed to a full-fledged project.” Ideally, by the end of the project, we have documentation with steps for how to perform an upgrade that can be followed by anyone on the Database Platform team that takes on the order of weeks, rather than months, to complete.

    Database Infrastructure at Shopify

    Core

    Shopify’s Core database infrastructure is horizontally sharded by shop, spread across hundreds of shards, each consisting of a writer and five or more replicas. These shards run on Google Compute Engine virtual machines (VMs) and run the Percona Server fork of MySQL. Our backup system makes use of Google Cloud’s persistent disk snapshots. While we run the upstream versions of Percona Server, we maintain an internal fork and build pipeline that allows us to patch it as necessary.

    Mason

    Without automation, there’s a non-trivial amount of toil involved in just the day-to-day operation of our VM fleet due to its sheer size. VMs can go down for many reasons, including failed GCP live migrations, zone outages, or just run-of-the-mill VM failures. Mason was developed to respond to VMs going down by spinning up a VM to replace it—a task far more suited to a robot rather than a human, especially in the middle of the night.

    Mason was developed as a self-healing service for our VM-based databases that was borne out of a Shopify Hack Days project in late 2019.

    Healing Isn’t All That’s Needed

    Shopify’s query workload can differ vastly from shard to shard, which necessitates maintenance of vastly different configurations. Our minimal configuration is six instances: three instances in Google Cloud’s us-east1 region and three instances in us-central1. However, each shard’s configuration can differ in other ways:

    • There may be additional replicas to accommodate higher read workloads or to provide replicas in other locations globally.
    • The VMs for the replicas may have a different number of cores or memory to accommodate differing workloads.

    With all of this in mind, you can probably imagine how it would be desirable to have automation built around maintaining these differences—without it, a good chunk of the manual toil involved in on-call tasks would be simply provisioning VMs, which isn’t an enviable set of responsibilities.

    Using Mason to Upgrade MySQL

    Upgrades at our scale are extremely high effort, as the current count of our VM fleet numbers in the thousands. We decided that building additional functionality onto Mason would be the way forward to automate our MySQL upgrade, and called it the Declarative Database Topologies project. Where Mason was previously a solely reactive tool that only maintained a hardcoded default configuration, we envisioned its next iteration as a proactive tool: one that allows us to define a per-shard topology and do the provisioning work that reconciles its current state to a desired state. Doing this would allow us to automate the provisioning of upgraded VMs, removing much of the toil involved in upgrading a large fleet, and to automate scale-up provisioning for events such as BFCM and other high-traffic occurrences.

    The Project Plan

    We had approximately eight months before BFCM preparations would begin, in which to achieve the following:

    • pick a new version of MySQL
    • benchmark and test the new version for any regressions or bugs
    • perform rollback testing and create a rollback plan so we can safely downgrade if necessary
    • finally, perform the actual upgrade.

    At the same time, we also needed to evolve Mason to:

    • increase its stability
    • move from a global hardcoded configuration to a dynamic per-shard configuration
    • have it respond to scale-ups when the configuration changed
    • have it care about Chef configuration, too
    • … do all of that safely.

    One of the first things we had to do was pick a version of Percona Server. We wanted to maximize the gains that we would get from an upgrade while minimizing our risk. This led us to choose the highest minor version of Percona Server 5.7, which was 5.7.32 at the start of the project. By doing so, we benefited from the bug and security fixes made since we last upgraded; in the words of one of our directors, “incidents that never happened” because we upgraded. At the same time, we avoided some of the larger risks associated with major version upgrades.

    Once we had settled on a version, we made changes in Chef to have it handle an in-place upgrade. Essentially, we created a new Chef role with the existing provisioning code but with the new version specified for the MySQL server version variable and modified the code so that the following happens:

    1. Restore a backup taken from a 5.7.21 VM on a VM with 5.7.32 installed.
    2. Allow the VM and MySQL server process to start up normally.
    3. Check the contents of the mysql_upgrade_info file in the data directory. If the version differs from that of the MySQL server version installed, run mysql_upgrade (via a wrapper script that accounts for unexpected behaviour of the mysql_upgrade script, which exits with the return code 2, instead of the typical return code of 0, when an upgrade wasn’t required; see the sketch after this list).
    4. Perform the necessary replication configuration and proceed with the rest of the MySQL server startup.
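
    For illustration, the exit-code handling in such a wrapper could look something like this (a Python sketch; the actual script isn’t shown in this post):

    import subprocess
    import sys

    # mysql_upgrade exits with 2 when no upgrade was required; treat that
    # the same as a successful run (0) and propagate any other exit code.
    result = subprocess.run(["mysql_upgrade"])
    sys.exit(0 if result.returncode in (0, 2) else result.returncode)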

    After this work was completed, all we had to do to provision an upgraded version was to specify that the new VM be built with the new Chef role.

    Preparing for the Upgrade

    Performing the upgrade is the easy part, operationally. You can spin up an instance with a backup from the old version, let mysql_upgrade do its thing, have it join the existing replication topology, optionally take backups from this instance with the newer version, populate the rest of the topology, and then perform a takeover. Making sure the newer version performs the way we expect and can be safely rolled back to the old version, however, is the tricky part.

    During our benchmarking tests, we didn’t find anything anomalous, performance-wise. However, when testing the downgrade from 5.7.32 back to 5.7.21, we found that the MySQL server wouldn’t properly start up, with the error logs showing errors about the column lengths of the InnoDB statistics tables and a lengthy recalculation of transient statistics.

    When we allowed the calculation of transient stats at startup to run to completion, it took over a day due to a lengthy table analyze process on some of our shards—not great if we needed to roll back more urgently than that.

    A cursory look at the Percona Server source code revealed that the table_name column in the innodb_index_stats and innodb_table_stats tables changed from VARCHAR(64) in 5.7.21 to VARCHAR(199) in 5.7.32. We patched mysql_system_tables_fix.sql in our internal Percona Server fork so that the column lengths were set back to the values that 5.7.21 expected, and re-tested the rollback. This time, we didn’t see the errors about the column lengths; however, we still saw the analyze table process causing full table rebuilds, again leading to an unacceptable startup time. It became clear to us that we had merely addressed a symptom of the problem by fixing these column lengths.

    At this point, while investigating our options, it occurred to us that one of the reasons this analyze table process might be happening is that we run ALTER TABLE commands as part of the MySQL server start: a startup script sets a minimum AUTO_INCREMENT value on tables (this is due to the auto_increment counter not being persisted across restarts, a long-standing bug that is addressed in MySQL 8.0).

    Investigating the Bug

    Once we had our hypothesis, we started to test it. This culminated in a group debugging session where a few members of our team found that the following steps reproduced the bug that resulted in the full table rebuild:

    1. On 5.7.32: A backup previously taken from 5.7.21 is restored.
    2. On 5.7.32: An ALTER TABLE that should just be an instantaneous metadata change is run, for example, ALTER TABLE t AUTO_INCREMENT=n. The table is changed instantaneously, as expected.
    3. On 5.7.32: A backup is taken.
    4. On 5.7.21: The backup taken from 5.7.32 in the previous step is restored.
    5. On 5.7.21: The MySQL server is started up, and mysql_upgrade performs the in-place downgrade.
    6. On 5.7.21: A similar ALTER TABLE statement to the one in step 2 is performed. A full rebuild of the table is performed, unexpectedly and unnecessarily.

    Stepping through the above steps with the GNU Debugger (GDB), we found the place in the MySQL server source code where it’s incorrectly concluded that indexes have changed in a way that requires a table rebuild (in Percona Server 5.7.21, the has_index_def_changed function in sql/sql_table.cc).

    While inspecting in GDB, we saw that the flags for the old version of the table (table_key->flags) didn’t match those of the new version of the table (new_key->flags), despite the fact that only a metadata change was applied.

    Digging deeper, we found past attempts to fix this bug. In the 5.7.23 release notes, there’s the following:

    “For attempts to increase the length of a VARCHAR column of an InnoDB table using ALTER TABLE with the INPLACE algorithm, the attempt failed if the column was indexed. If an index size exceeded the InnoDB limit of 767 bytes for COMPACT or REDUNDANT row format, CREATE TABLE and ALTER TABLE did not report an error (in strict SQL mode) or a warning (in nonstrict mode). (Bug #26848813)”

    A fix was merged for the bug; however, we saw that there was a second attempt to fix this behaviour. In the 5.7.27 release notes, we see:

    “For InnoDB tables that contained an index on a VARCHAR column and were created prior to MySQL 5.7.23, some simple ALTER TABLE statements that should have been done in place were performed with a table rebuild after an upgrade to MySQL 5.7.23 or higher. (Bug #29375764, Bug #94383)”

    A fix was merged for this bug as well, but it didn’t fully address the issue of some ALTER TABLE statements that should be simple metadata changes instead leading to a full table rebuild.

    My colleague Akshay filed a bug against this, however the included patch wasn’t ultimately accepted by the MySQL team. In order to safely upgrade past this bug, we still needed MySQL to behave in a reasonable way on downgrade, and we ended up patching Percona Server in our internal fork. We tested our patched version successfully in our final rollback tests, unblocking our upgrade.

    What are “Packed Keys” Anyway?

    The PACK_KEYS feature of the MyISAM storage engine allows keys to be compressed, thereby making indexes much smaller and improving performance. This feature isn’t supported by the InnoDB storage engine as its index layout and expectations are completely different. In MyISAM, when indexed VARCHAR columns are expanded past eight bytes, thus converting from unpacked keys to packed keys, it (rightfully) triggers an index rebuild.

    However, we can see that in the first attempt to fix the bug in 5.7.23, the same type of change triggered the same behaviour in InnoDB, even though packed keys aren’t supported. To remedy this, from 5.7.23 onwards, the HA_PACK_KEY and HA_BINARY_PACK_KEY flags weren’t set if the storage engine didn’t support them.

    That, however, meant that if a table was created prior to 5.7.23, these flags were unexpectedly set even on storage engines that didn’t support them. So upon upgrade to 5.7.23 or higher, any metadata-only ALTER TABLE command executed on an InnoDB table incorrectly concluded that a full index rebuild was necessary. This brings us to the second attempt to fix the issue, in which the flags were removed entirely if the storage engine didn’t support them. Unfortunately, that second bug fix didn’t account for the case where the flags might have changed but the difference should be ignored when evaluating whether the indexes need to be rebuilt in earlier versions, and that’s what we addressed in our proposed patch. In our patch, during downgrade, if the old version of the table (from 5.7.32) didn’t specify the flag but the new version of the table (in 5.7.21) does, we bypass the index rebuild.

    Meanwhile, in the Mason Project… 

    While all of this rollback testing work was in progress, another part of the team was hard at work shipping new features in Mason to let it handle the upgrades. These were some of the requirements we had that guided the project work:

    • The creation of a “priority” lane—self-healing should always take precedence over a scale-up related provisioning request.
    • We needed to throttle the scale-up provisioning queue to limit how much work was done simultaneously.
    • Feature flags were required to limit the number of shards to release the scale-up feature to, so that we could control which shards were provisioned and release the new features carefully.
    • A dry-run mode for scale-up provisioning was necessary to allow us to test these features without making changes to the production systems immediately.

    Underlying all of this was an abundance of caution in shipping the new features. Because of our large fleet size, we didn’t want to risk provisioning a lot of VMs we didn’t need or VMs in the incorrect configuration that would cost us either way in terms of GCP resource usage or engineering time spent in decommissioning resources.

    In the initial stages of the project, stabilizing the service was important since it played a critical role in maintaining our MySQL topology. Over time, it had turned into a critical component of our infrastructure that significantly improved our on-call quality of life. Some of the early tasks that needed to be done were simply making it a first-class citizen among the services that we owned. We stabilized the staging environment it was deployed into, created and improved existing monitoring, and started using it to emit metrics to Datadog indicating when the topology was underprovisioned (in cases where Mason failed to do its job).

    Another challenge was that Mason itself talks to many disparate components in our infrastructure: the GCP API, Chef, the Kubernetes API, ZooKeeper, Orchestrator, as well as the database VMs themselves. It was often a challenge to anticipate failure scenarios—often, the failure experienced was completely new and wouldn’t have been caught in existing tests. This is still an ongoing challenge, and one that we hope to address through improved integration testing.

    Later on, as we onboarded new people to the project and started introducing more features, it also became obvious that the application was quite brittle in its current state; adding new features became more and more difficult due to the existing complexity, especially when they were being worked on concurrently. It brought to the forefront the importance of breaking down streams of work that have the potential to become hard blockers, and highlighted how much a well-designed codebase can decrease the chances of this happening.

    We faced many challenges, but ultimately shipped the project on time. Now that the project is complete, we’re dedicating time to improving the codebase so it’s more maintainable and developer-friendly.

    The Upgrade Itself

    Throughout the process of rollback testing, we had already been running 5.7.32 for a few months on several shards reserved for canary testing. A few of those shards are load tested on a regular basis, so we were reasonably confident that this, along with our own benchmarking tests, made it ready for our production workload.

    Next, we created a rollback plan in case the new version was unstable in production for unforeseen reasons. One of the early suggestions for risk mitigation was to maintain a 5.7.21 VM per-shard and continue to take backups from them. However, that would have been operationally complex and also would have necessitated the creation of more tooling and monitoring to make sure that we always have 5.7.21 VMs running for each shard (rather toilsome when the number of shards reaches the hundreds in a fleet). Ultimately, we decided against this plan, especially considering the fact that we were confident that we could roll back to our patched build of Percona Server, if we had to.

    Our intention was to do everything we could to de-risk the upgrade by performing extensive rollback testing, but ultimately we preferred to fix forward whenever possible. That is, the option to roll back was expected to be taken only as a last resort.

    We started provisioning new VMs with 5.7.32 in earnest on August 25th using Mason, after our tooling and rollback plan were in place. We decided to stagger the upgrades by creating several batches of shards. This allowed the upgraded shards to “bake” and not endanger the entire fleet in the event of an unforeseen circumstance. We also didn’t want to provision all the new VMs at once due to the amount of resource churn (at the petabyte-scale) and pressure it would put on Google Cloud.

    On September 7th, the final shards were completed, marking the end of the upgrade project.

    What Did We Take Away from This Upgrade? 

    This upgrade project highlighted the importance of rollback testing. Without the extensive testing that we performed, we would have never known that there was a critical bug blocking a potential rollback. Even though needing to rebuild the fleet with the old version to downgrade would have been toilsome and undesirable, patching 5.7.21 gave us the confidence to proceed with the upgrade, knowing that we had the option to safely downgrade if it became necessary.

    Mason, the tooling that we relied on, also became more important over time. In the past, Mason was considered a lower-tier application, and simply turning it off was a band-aid solution when it behaved in unexpected ways. Fixing it often wasn’t a priority when bugs were encountered. However, as time has gone by, we’ve recognized how large a role it plays in toil mitigation and maintaining healthy on-call expectations, especially as the size of our fleet has grown. We’ve invested more time and resources into it by improving test coverage and refactoring key parts of the codebase to reduce complexity and improve readability. We also have future plans to improve the local development environments and streamline its deployment pipeline.

    Finally, investing in the documentation and easy repeatability of upgrades has been a big win for Shopify and for our team. When we first started planning for this upgrade, finding out how upgrades were done in the past was a bit of a scavenger hunt and required a lot of institutional knowledge. By developing guidelines and documentation, we paved the way for future upgrades to be done faster, more safely, and more efficiently. Rather than an intense and manual context-gathering process every time that pays no future dividends, we can now treat a MySQL upgrade as simply a series of guidelines to follow using our existing tooling.

    Next up: MySQL 8!

    Yi Qing Sim is a Senior Production Engineer and brings nearly a decade of software development and site reliability engineering experience to the Database Backend team, where she primarily works on Shopify’s core database infrastructure.


    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Remote Rendering: Shopify’s Take on Extensible UI

    Shopify is one of the world's largest e-commerce platforms. With millions of merchants worldwide, we support an increasingly diverse set of use cases, and we wouldn't be successful at it without our developer community. Developers build apps that add immense value to Shopify and its merchants, and solve problems such as marketing automation, sales channel integrations, and product sourcing.

    In this post, we will take a deep dive into the latest generation of our technology that allows developers to extend Shopify’s UI. With this technology, developers can better integrate with the Shopify platform and offer native experiences and rich interactions that fit into users' natural workflow on the platform.

    3rd party extension adding a post-purchase page directly into the Shopify checkout

    To put the technical challenges into context, it's important to understand our main objectives and requirements:

    • The user experience of 3rd party extensions must be consistent with Shopify's native content in terms of look & feel, performance, and accessibility features.
    • Developers should be able to extend Shopify using standard technologies they are already familiar with.
    • Shopify needs to run extensions in a secure and reliable manner, and prevent them from negatively impacting the platform (naively or maliciously).
    • Extensions should offer the same delightful experience across all supported platforms (web, iOS, Android).

    With these requirements in mind, it's time to peel the onion.

    Remote Rendering

    At the heart of our solution is a technique we call remote rendering. With remote rendering, we separate the code that defines the UI from the code that renders it, and have the two communicate via message passing. This technique fits our use case very well because extensions (code that defines UI) are typically 3rd party code that needs to run in a restricted sandbox environment, while the host (code that renders UI) is part of the main application.

    Separating extensions (3rd party code) from the host (1st party code)

    Communication between an extension and a host is done via a MessageChannel. Using message passing for all communication means that hosts and extensions are completely agnostic of each other’s implementation and can be implemented using different languages. In fact, at Shopify, we have implemented hosts in JavaScript, Kotlin, and Swift to provide cross-platform support.

    The remote-ui Library

    Remote rendering gives us the flexibility we need, but it also introduces non-trivial technical challenges such as defining an efficient message-passing protocol, implementing function calls using message passing (aka remote procedure call), and applying UI updates in a performant way. These challenges (and more) are tackled by remote-ui, an open-source library developed at Shopify.

    Let's take a closer look at some of the fundamental building blocks that remote-ui offers and how these building blocks fit together.

    RPC

    At the lower level, the @remote-ui/rpc package provides a powerful remote procedure call (RPC) abstraction. The key feature of this RPC layer is the ability for functions to be passed (and called) across a postMessage interface, supporting the common need for passing event callbacks.

    Making remote procedure calls using endpoint.call (script1.js) and endpoint.expose (script2.js)

    @remote-ui/rpc introduces the concept of an endpoint for exposing functions and calling them remotely. Under the hood, the library uses Promise and Proxy objects to abstract away the details of the underlying message-passing protocol.

    It’s also worth mentioning that remote-ui’s RPC has very smart automatic memory management. This feature is especially useful when rendering UI, since properties (such as event handlers) can be automatically retained and released as UI components mount and unmount.

    Remote Root

    After RPC, the next fundamental building block is the RemoteRoot which provides a familiar DOM-like API for defining and manipulating a UI component tree. Under the hood, RemoteRoot uses RPC to serialize UI updates as JSON messages and send them to the host.

    UI is defined with a DOM-like API and gets converted to a JSON message

    For more details on the implementation of RemoteRoot, see the documentation and source code of the @remote-ui/core package.

    Remote Receiver

    The "opposite side" of a RemoteRoot is a RemoteReceiver. It receives UI updates (JSON messages sent from a remote root) and reconstructs the remote component tree locally. The remote component tree can then be rendered using native components.

    Basic example setting up a RemoteRoot and RemoteReceiver to work together (host.jsx and extension.js)

    With RemoteRoot and RemoteReceiver we are very close to having an implementation of the remote rendering pattern. Extensions can define the UI as a remote tree, and that tree gets reconstructed on the host. The only missing thing is for the host to traverse the tree and render it using native UI components.

    DOM Receiver

    remote-ui provides a number of packages that make it easy to convert a remote component tree to a native component tree. For example, a DomReceiver can be initialized with minimal configuration and render a remote root into the DOM. It abstracts away the underlying details of traversing the tree, converting remote components to DOM elements, and attaching event handlers.

    For example, a receiver can be configured to render the remote tree inside a DOM element with the id container, converting Button and LineBreak remote components to button and br DOM elements, respectively, and automatically converting any prop starting with on into an event listener.

    For more details, check out this complete standalone example in the remote-ui repo.

    Integration with React

    The DomReceiver provides a convenient way for a host to map between remote components and their native implementations, but it’s not a great fit for our use case at Shopify. Our frontend application is built using React, so we need a receiver that manipulates React components (instead of manipulating DOM elements directly).

    Luckily, the @remote-ui/react package has everything we need: a receiver (that receives UI updates from the remote root), a controller (that maps remote components to their native implementations), and the RemoteRenderer React component to hook them up.

    There’s nothing special about the component implementations passed to the controller; they are just regular React components.

    However, there's a part of the code that is worth taking a closer look at:

    // Run 3rd party script in a sandbox environment
    // with the receiver as a communication channel ...

    Sandboxing

    When we introduced the concept of remote rendering, our high-level diagram included only two boxes, extension and host. In practice, the diagram is slightly more complex.

    The sandbox is an additional layer of indirection between the host and the extension

    The sandbox, an additional layer of indirection between the host and the extension, provides platform developers with more control. The sandbox code runs in an isolated environment (such as a web worker) and loads extensions in a safe and secure manner. In addition to that, by keeping all boilerplate code as part of the sandbox, extension developers get a simpler interface to implement.

    Let's look at a simple sandbox implementation that allows us to run 3rd party code and acts as “the glue” between 3rd party extensions and our host.

    The sandbox allows a host to load extension code from an external URL. When the extension is loaded, it will register itself as a callback function. After the extension finishes loading, the host can render it (that is, call the registered callback).

    Arguments passed to the render function (from the host) provide it with everything it needs. remoteChannel is used for communicating UI updates with the host, and api is an arbitrary object containing any native functionality that the host wants to make available to the extension.

    When a host uses this sandbox, it can expose native functionality to extensions through the api object. For example, the host can make a setTitle function available, and the corresponding extension script simply calls it. Notice that 3rd party extension code isn’t aware of any underlying aspects of RPC. It only needs to know that the api (that the host will pass) contains a setTitle function.

    Implementing a Production Sandbox

    The implementation above can give you a good sense of our architecture. For the sake of simplicity, we omitted details such as error handling and support for registering multiple extension callbacks.

    In addition to that, our production sandbox restricts the JavaScript environment where untrusted code runs. Some globals (such as importScripts) are made unavailable and others are replaced with safer versions (such as fetch, which is restricted to specific domains). Also, the sandbox script itself is loaded from a separate domain so that the browser provides extra security constraints.

    Finally, to have cross-platform support, we implemented our sandbox on three different platforms using web workers (web), web views (Android), and JavaScriptCore (iOS).

    What’s Next?

    The technology we presented in this blog post is relatively new and is currently used to power two types of extensions, product subscriptions and post-purchase, in two different platform areas.

    We are truly excited about the potential we’re unlocking, and we also know that there’s a lot of work ahead of us. Our plans include improving the experience of 3rd party developers, supporting new UI patterns as they come up, and making more areas of the platform extensible.

    If you are interested in learning more, you might want to check out the remote-ui comprehensive example and this recent React Summit talk.

    Special thanks to Chris Sauve, Elana Kopelevich, James Woo, and Trish Ta for their contribution to this blog post.

    Joey Freund is a manager on the core extensibility team, focusing on building tools that let Shopify developers extend our platform to make it a perfect fit for every merchant.


    Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

    Building an App Clip with React Native

    When the App Clip was introduced in iOS 14, we immediately realized that it could be a big opportunity for the Shop app. An App Clip is designed to be a lightweight version of an app that users can download on the fly, and we wanted to investigate what that could mean for us. Being able to instantly show users the power of the Shop app, without making them download it from the App Store and go through onboarding, was something we thought had huge growth potential.

    One of the key features, and restrictions, of an App Clip is the size limitation. To make things even more interesting, we wanted to build it in React Native, something that, to our knowledge, had never been done at this scale before.

    Being the first to build an App Clip in React Native that was going to be surfaced to millions of users each day proved to be a challenging task.

    What’s an App Clip?

    App Clips are a miniature version of an app that’s meant to be lightweight and downloadable “on the go.” To provide a smooth download experience, the App Clip can’t exceed 10MB in size. For comparison, the iOS Shop app is 51MB.

    An App Clip can’t be downloaded from the App Store—it can only be “invoked”. An invocation means that a user performs an action that opens the App Clip on their phone: scanning a QR code or an NFC tag, clicking a link in the Messages app, or tapping a Smart App Banner on a webpage. After the invocation is made, iOS displays a prompt asking the user to open the App Clip, while the App Clip binary is downloaded in the background, allowing it to launch instantly. The invocation URL is passed on to the App Clip, which enables you to provide a contextual experience for the user.

    What Are We Trying to Solve?

    The Shop app helps users track all of their packages in one place with ease. When a buyer installs the app, their order is automatically imported, and they’re kept up to date about its status without having to ask the seller.

    However, we noticed a big drop-off of users in the funnel between the “Thank you” page and opening the app. Despite the Shop app having a 4.8 star rating, the few added steps of going through an App Store meant some buyers chose not to complete the process. The App Clip would solve all of this.

    When the user landed on the “Thank you” page on their computer and invoked the App Clip by scanning a QR code, or for mobile checkouts by simply tapping the Open button, they would instantly see their order tracked. No App Store, no onboarding, just straight into the order details with the option to receive push notifications for the whole package journey.

    Why React Native?

    React Native apps aren’t famous for being small in size, so we knew building an App Clip that was below 10MB in size would pose some interesting challenges. However, being one of the most popular apps on the app stores, and champions of React Native, we really wanted to see if it was possible.

    Since the Shop app is built in React Native, all our developers could contribute to the App Clip—not just Swift developers—and we would potentially be able to maintain code sharing and feature parity with the App Clip as we do across Android and iOS.

    In short, it was an interesting challenge that aligned with our technology choices and our values about building reusable systems designed for the long-term.

    Building a Proof of Concept–Failing Fast

    Since the App Clip was a very new piece of technology, there was a huge list of unknowns. We weren’t sure if it was going to be possible to build it with React Native and go below the 10MB limit. So we decided to set up a technical plan where if we failed, we would fail fast.

    The plan looked something like this:

    1. Build a “Hello World” App Clip in React Native and determine its size
    2. Build a very scrappy, not even functional, version of the actual App Clip, containing all the code and dependencies we estimated we would need and determine its size
    3. Clean up the code, make everything work

    We also wanted to fail fast product-wise. App Clips are a brand new technology that few people have been exposed to. We weren’t sure if our App Clip would benefit our users, so our goal was to get an App Clip out for testing, and get it out fast. If it proved successful, we would go back and iterate.

    Hello World

    When we started building the App Clip, there were a lot of unknowns. So to determine if this was even possible, we started off by creating a “Hello World” App Clip using just React Native’s <View /> and <Text /> components.

    The “Hello World” App Clip weighed in at a staggering 28MB. How could a barebones App Clip be this big? We investigated and realized that the App Clip was including all the native dependencies that the Shop app used, even though it only needed a subset of the React Native ones. We realized that we had to explicitly define exactly which native dependencies the App Clip needed in the Podfile.

    Defining dependencies was done by looking through React Native’s node_modules/react-native/scripts/react_native_pods to determine the bare minimum native dependencies that React Native needed. After determining the list, we calculated the App Clip size. The result was 4.3MB. This was good news, but we still didn’t know if adding all the features we wanted would push us beyond the 10MB limit.

    Building a Scrappy Version

    Building an App Clip with React Native is almost identical to building a React Native app, with one big difference: we needed to explicitly define the App Clip’s dependencies in the Podfile. Autolinking wouldn’t work in this case, since it scans all the installed packages for the ones compatible with autolinking and adds them all; we needed to cherry-pick only the pods used by the App Clip.

    The process was pretty straightforward: add a dependency in a React component, and if it had a native dependency, add it to the “Shop App Clip” target in the Podfile. But the consequences of this would be quite substantial later on.

    So the baseline size was 4.3MB; now it was time to start adding the functionality we needed. Since we were still exploring the design in this phase, we didn’t know exactly what the end result would be (other than displaying information about the user's order), but we could make some assumptions. For one, we wanted to share as much code with the app as possible. The Shop app has a very robust UI library that we wanted to leverage, as well as a lot of business logic that handles user and order creation. Secondly, we knew that we needed basic functionality like:

    • Network calls to our GraphQL service
    • Error reporting
    • Push notifications

    Since we only wanted to determine the build size, and in the spirit of failing fast, we implemented these features without them even working. The code was added, as well as the dependencies, but the App Clip wasn’t functional at all.

    We calculated the App Clip size once again, and the result was 6.5MB. Even though it was a scrappy implementation to say the least, and there were still quite a few unknowns regarding the functionality, we knew that building it in React Native was theoretically possible and something we wanted to pursue.

    Building the App Clip

    We knew that building our App Clip with React Native was possible: our proof of concept was 6.5MB, giving us some leeway for unknowns. And with a React Native App Clip, there sure were a lot of unknowns. Would sharing code between the app and the App Clip affect its size or cause any other issues? What would we do if the App Clip required a dependency that pushed us over the 10MB limit?

    Technology Drives Design

    Given the very rigid constraints, we decided that unlike most projects where the design leads the technology, we would approach this from the opposite direction. While developing the App Clip, the technology would drive the design. If something caused us to go over, or close to, the 10MB limit we would go back to the drawing board and find alternative solutions.

    Code Sharing Between Shop App and App Clip

    With the App Clip, we wanted to give the user a quick overview of their order and the ability to receive shipping updates through push notifications. We were heavily inspired by the order view in Shop app, and the final App Clip design was a reorganized version of that.

    A screenshot showing the App Clip order page on the left and the Shop App order page on the right. Order details are more front and center in the App Clip version.
    App Clip versus Shop App

    The Shop app is structured to share as much code as possible, and we wanted to carry that over to the App Clip. Sharing code between the two makes sense, especially since the App Clip had functionality similar to the order view in the app.

    Our first exploration was to see if it was viable to share all the code in the order view between the app and the App Clip, and modify the layout with props passed from the App Clip.

    A flow diagram showing that App Clip and Shop App share all the code for the <OrderView /> component and therefore share <ProductRow /> and <OrderHeader /> as a result.
    App Clip and Shop App share all the code in the <OrderView /> component

    We quickly realized this wasn’t viable. For one, it would add too much complexity to the order view, but mainly, any change to the order view would affect the App Clip. If a developer adds a feature to the order view, with a big dependency, the 10MB App Clip limit could be at risk.

    For a small development team, it might have been a valid approach, but not at our scale. Making every developer responsible for the App Clip’s size limit whenever they changed the app’s main order view would go against our values around autonomy.

    We then considered building our own version of the order view in the App Clip, but sharing its subcomponents. This could be a viable compromise where all the logic-heavy code would live in the <OrderView /> but the simple presentational components could still be shared.

    A flow diagram showing that App Clip and Shop App share subcomponents from the <OrderView />: <ProductRow /> and <OrderHeader />.
    App Clip and Shop App share subcomponents of the <OrderView /> component

    The first component we wanted to import into the App Clip was <ProductRow />, whose job is to display the product title, price, and image:

    An image showing <ProductRow />, its job is to display the product title, variant, price and image
    <ProductRow /> displaying product title, price and image

    The code for this component looks like this (simplified):
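
    Here’s a minimal sketch; the shared component imports and prop names are assumptions for illustration, not the original source:

        // ProductRow.tsx: simplified, illustrative sketch
        import React from 'react';
        // Hypothetical path to the Shop app's shared Restyle-based components.
        import {Box, Text, Image} from 'shared/components';

        interface ProductRowProps {
          title: string;
          price: string;
          imageUrl: string;
        }

        export function ProductRow({title, price, imageUrl}: ProductRowProps) {
          return (
            <Box flexDirection="row" alignItems="center" padding="m">
              {/* <Image /> wraps react-native-fast-image, a native dependency */}
              <Image source={{uri: imageUrl}} style={{width: 64, height: 64}} />
              <Box flex={1} marginLeft="m">
                <Text variant="body">{title}</Text>
                <Text variant="caption">{price}</Text>
              </Box>
            </Box>
          );
        }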

    But when we imported this component into the App Clip, it crashed. After some digging, we realized that our <Image /> component uses a library called react-native-fast-image, a native Swift library we use to display large lists of images performantly. As mentioned previously, to keep the App Clip size down we need to explicitly define all of its native dependencies in the Podfile. We hadn’t defined the native dependency for react-native-fast-image, and therefore it crashed. The fix was easy though: adding the pod to the App Clip target enabled us to use the <ProductRow /> component.

    However, our proof-of-concept App Clip weighed in at 6.5MB, meaning we only had 3.5MB to spare. We knew we only wanted to add the absolutely necessary dependencies, and since the App Clip would only display a handful of images, we didn’t deem this library an absolute necessity.

    With this in mind, we briefly went through all the components we wanted to share with the order view: maybe this was just a one-time thing we could create a workaround for? We discovered that the majority of the subcomponents of the <OrderView /> had a native dependency somewhere down the line. Upon analyzing how they would affect the App Clip size, we discovered that they would push the App Clip far north of 10MB, with one single dependency weighing in at a staggering 2.5MB.

    Standing at a Crossroad

    We now realized that sharing components between the order view in the app and the App Clip wasn’t possible. Was that true for all code? At this stage we were standing at a crossroads. Did we want to duplicate everything? Some things? Nothing?

    To answer this question we decided to base the decision on the following principles:

    • The App Clip is an experiment: we didn’t know if it would be successful or not, so we wanted to validate the idea as fast as possible.
    • Minimal impact on other developers: we were a small team working on the App Clip, and we didn’t want to add any responsibility for it to the rest of the developers working on the Shop app.
    • Easy to delete: given the many unknowns about the success of the experiment, we wanted to double down on writing code that was easy to delete.

    With this in mind, we decided to treat any similarities between the order view in the app and the App Clip as purely coincidental. This change of mindset helped us move forward very quickly.

    Build Phase

    Building the App Clip was very similar to building any other React Native app; the only real difference was that we constantly needed to keep track of its size. Since checking the size of the App Clip was very time consuming (around 25 minutes each time on our local machines), we decided to only do this when new dependencies were added, along with some ad hoc checks from time to time.

    All the components for the App Clip were created from scratch, except for our shared components and functions within the Shop app. Inside our shared/ directory there are a lot of powerful foundational tools we wanted to use in the App Clip: <Box />, <Text />, and a few others that we rely on heavily to structure our UI in the Shop app with the help of our Restyle library. We also wanted to reuse the shared hooks for setting up push notifications, creating a user, and so on. As mentioned earlier, sharing code between the app and the App Clip could potentially cause issues. If a developer decides to add a new native dependency to <Box /> or <Text />, they would, often unknowingly, affect the App Clip as well.

    However, we deemed these shared components mature enough that no large changes were likely to be made to them. To catch any new dependencies being added to these shared components, we wrote a CI script to detect them and notify the pull request author.

    The script did three things (a simplified sketch follows the list):

    1. Go through the Podfile and create a list of all the native dependencies.
    2. Traverse through all imports the App Clip made and create a list of the ones that have native dependencies.
    3. Finally, compare the two lists. If they don’t match, the CI job fails with instructions on how to proceed.
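
    A rough sketch of what such a check could look like, assuming a Node-based CI step in TypeScript; the file paths, the podspec heuristic, and the assumption that pod names match package names are ours, not the actual script:

        import {existsSync, readdirSync, readFileSync} from 'fs';
        import {join} from 'path';

        // 1. Collect the pods declared under the App Clip target in the Podfile.
        //    For this sketch we assume pod names match package names.
        function podsFromPodfile(podfilePath: string): Set<string> {
          const pods = new Set<string>();
          for (const line of readFileSync(podfilePath, 'utf8').split('\n')) {
            const match = line.match(/^\s*pod\s+['"]([^'"]+)['"]/);
            if (match) pods.add(match[1]);
          }
          return pods;
        }

        // 2. A package ships native code if its root contains a .podspec file.
        function hasNativeCode(pkg: string): boolean {
          const dir = join('node_modules', pkg);
          return existsSync(dir) && readdirSync(dir).some((f) => f.endsWith('.podspec'));
        }

        // 3. Compare the two lists and fail the CI job on a mismatch.
        export function check(podfilePath: string, appClipImports: string[]): void {
          const declared = podsFromPodfile(podfilePath);
          const missing = appClipImports.filter((pkg) => hasNativeCode(pkg) && !declared.has(pkg));
          if (missing.length > 0) {
            console.error(`Native dependencies missing from the App Clip target: ${missing.join(', ')}`);
            process.exit(1); // the failure message tells the author how to proceed
          }
        }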

    A few times we stumbled upon issues with dependencies, either shared or external ones, adding weight to the App Clip. This could be a third-party library for animations, async storage, or getting information about the device. With our “technology drives design” principle in mind, we often removed the dependencies for non-critical features, as with the animation library.

    We now felt more confident about how to approach building an App Clip, and we moved fast, continuously creating and merging pull requests.

    Support Invocation URLs in the App

    The app always takes precedence over the App Clip, meaning that if you invoke the App Clip by scanning a QR code but already have the app installed, the app opens instead of the App Clip. We had to build support for invocations in the app as well, so that scanning the QR code would automatically import the order even when the user has the app installed.

    React Native enables us to do this through the Linking module:
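
    A minimal sketch, assuming a recent React Native version; the handleInvocationUrl helper is illustrative, not Shop’s actual code:

        import {useEffect} from 'react';
        import {Linking} from 'react-native';

        function useInvocationUrl(handleInvocationUrl: (url: string) => void) {
          useEffect(() => {
            // URL that launched the app (cold start), e.g. from a scanned QR code.
            Linking.getInitialURL().then((url) => {
              if (url) handleInvocationUrl(url);
            });

            // URLs received while the app is already running.
            const subscription = Linking.addEventListener('url', ({url}) => {
              handleInvocationUrl(url);
            });
            return () => subscription.remove();
          }, [handleInvocationUrl]);
        }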

    The module allowed us to fetch the invocation URL inside the app and create the order for already existing app users. With this, we now supported importing an order by scanning a QR code both in the App Clip and the app.

    Smooth Transition to the App

    The last feature we wanted to implement was a smooth transition to the app. If the user decided to upgrade from the App Clip to the full app experience, we wanted to provide a simpler onboarding experience and magically have their order ready for them in the app. Apple provides a very nice solution to this with shared data containers, which both the App Clip and the app have access to.

    Now we can store user data in the App Clip that the app has access to, providing an optimal onboarding experience if the user decides to upgrade.

    Testing the App Clip

    Throughout the development and launch of the App Clip, testing was difficult. Apple provides a great way to mock an invocation of the App Clip by hard coding the invocation URL in Xcode, but there was no way to test the full end-to-end flow of scanning the QR code, invoking the App Clip, and downloading the app. This wasn’t possible on either our local machines or TestFlight. To verify that the flow would work as expected, we decided to release a first version of the App Clip extremely early. With the help of beta flags, we made sure the App Clip could only be invoked by the team. This early release had no functionality; it only verified that the App Clip received the invocation URL and passed the proper data along to the app for a great onboarding experience. Once this flow was working, and we could trust that our local mockups behaved the same as production, testing the App Clip got a lot easier.

    After extensive testing, we felt ready to release the App Clip. The release process was very similar to a regular release since the App Clip is bundled into the app; the only extra step was providing copy and image assets in App Store Connect for the invocation modal.

    Screenshot of App Store Connect screen for uploading copy and image assets.
    App Store Connect

    We approached this project with a lot of unknowns—the technology was new, and new to us. We were trying to build an App Clip with React Native, which isn’t typical! Our approach, to fail fast and iterate, worked well. Having a developer with native iOS development experience was very helpful because App Clips—even ones written in React Native—involve a lot of Apple’s tooling.

    One challenge we didn’t anticipate was how difficult it would be to share code. It turned out that sharing code introduced too much complexity into the main application, and we didn’t want to impact the development process for the entire Shop team. So we copied code where it made sense.

    Our final App Clip size was 9.1MB, just shy of the 10MB limit. Having such a hard constraint was a fun challenge. We managed to build most of what we initially had in mind, and there are further optimizations we can still make.

    Sebastian Ekström is a Senior Developer based in Stockholm who has been with Shopify since 2018. He’s currently working in the Shop Retention team.



    Five Tips for Growing Your Engineering Career

    The beginning stages of a career in engineering can be daunting. You’re trying to make the most of the opportunity at your new job and learning as much as you can, and as a result, it can be hard to find time and energy to focus on growth. Here are five practical tips that can help you grow as you navigate your engineering career.


    Using Propensity Score Matching to Uncover Shopify Capital’s Effect on Business Growth

    By Breno Freitas and Nevena Francetic

    Five years ago, we introduced Shopify Capital, our data-powered product that enables merchants to access funding from right within the Shopify platform. We built it using a version of a recurrent neural network (RNN)—analyzing more than 70 million data points across the Shopify platform to understand trends in merchants’ growth potential and offer cash advances that make sense for their businesses. To date, we’ve provided our merchants with over $2.7 billion in funding.

    But how much of an impact was Shopify Capital having, really? Our executives wanted to know—and as a Data team, we were invested in this question too. We were interested in validating our hypothesis that our product was having a measurable, positive impact on our merchants.

    We’ve already delved into the impact of the program in another blog post, Digging Through the Data: Shopify Capital's Effect on Business Growth, but today, we want to share how we got our results. In this post, we’re going behind the scenes to show you how we investigated whether Shopify Capital does what we intended it to do: help our merchants grow.

    The Research Question

    What’s the impact on future cumulative gross merchandise value (that is, sales) of a shop after they take Shopify Capital for the first time?

    To test whether Shopify merchants who accepted Capital were more successful than those who didn’t, we needed to compare their results against an alternative future (the counterfactual) in which merchants who desired Capital didn’t receive it. In other words, an A/B test.

    Unfortunately, in order to conduct a proper A/B test, we would need to randomly and automatically reject half of the merchants who expressed interest in Capital for some period of time in order to collect data for proper analysis. While this makes for good data collection, it would be a terrible experience for our users and undermine our mission to help merchants grow, which we were unwilling to do.

    With Shopify Capital only being active in the US in 2019, an alternative solution would be to use Canadian merchants who didn’t yet have access to Shopify Capital (Capital launched in Canada and the UK in Spring 2020) as our “alternate reality.” We needed to seek out Canadian shops who would have used Shopify Capital if given the opportunity, but weren’t able to because it wasn’t yet available in their market.

    We can do this comparison through a method called “propensity score matching” (PSM).

    Matchmaker, Matchmaker, Make Me a Match

    In the 1980s, researchers Rosenbaum and Rubin proposed PSM as a method to reduce bias in the estimation of treatment effects with observational data sets. The method has become increasingly popular in medical trials and social studies, particularly where a proper randomized trial isn’t possible. A propensity score is defined as the likelihood of a unit being assigned to the treatment group. In this case: what are the chances of a merchant accepting Shopify Capital if it were offered to them?

    It works like this: After propensity scores are estimated, the participants are matched with similar counterparts on the other set, as depicted below.

    Depiction of matching performed on two sets of samples based on their propensity scores.

    We’re looking for a similarity score on the likelihood of taking treatment, analyzing only samples in the two sets that are close enough (that get a match) while respecting any other constraints imposed by the selected matching methodology. This means we could even drop samples from the treatment group during matching if their scores fall outside the parameters we’ve set.

    Once matched, we’ll be able to determine the difference in gross merchandise value (GMV), that is, sales, between the control and treatment groups in the six months after they take Shopify Capital for the first time.

    Digging into the Data Sets

    As previously discussed, in order to do the matching, we needed two groups of participants in the experiment: the treatment group and the control group. We decided to set our experiment over a six-month period starting in January 2019, to remove any confounding effect of COVID-19.

    We segment our two groups as follows:

  • Treatment Group: American shops that were first-time Capital adopters in January 2019, on the platform for at least three months prior (to ensure they were established), and still Shopify customers in April 2020.
  • Control Group: Canadian shops that had been customers for at least three months prior to January 2019 and pre-qualified for Capital in Canada when we launched it in April 2020.

    Ideally, we would have recreated the underwriting criteria from January 2019 to see which Canadian shops would have pre-qualified for Capital at that time. To proxy for this, we looked at shops that remained stable until at least April 2020 in the US and Canada, and then went backwards to analyze their 2019 data.

    Key assumptions:

  • Shops in Canada didn’t take an offer for the sole reason that Capital didn’t exist in Canada at that time.
  • Shops in the US and Canada have equal access to external financing sources we can’t control (for example, small business loans).
  • The environments that Canadian and US merchants operate in are more or less the same.

    Matchmaking Methodology

    We began our matching process with approximately 8,000 control shops and about 600 treated shops. At the end of the day, our goal was to make the distributions of the propensity scores for each group of shops match as closely as possible.

    Foundational Setup

    For the next stage in our matching, we set up some features, using characteristics from within the Shopify platform to describe a shop. The literature says there’s no right or wrong way to pick characteristics—just use your discernment to choose whichever ones make the most sense for your business problem.

    We opted to use merchants’ (which we’ll refer to as shops) sales and performance data in Shopify. While we have to keep the exact characteristics secret for privacy reasons, we can say that some of them are the same ones the model uses to generate a Shopify Capital offer.

    At this stage, we also logarithmically transformed many of the covariates. We did this because of the wild extremes in variance on some of the features we were using. Transforming them to logarithmic space shrinks the variances and makes the linear regressions behave better (for example, by shrinking large disparities in revenue). This helps minimize skew.

    It’s a Match!

    There are many ways we could match the participants on both sets—the choice of algorithm depends on the research objectives, desired analysis, and cost considerations. For the purpose of this study, we chose a caliper matching algorithm.

    A caliper matching algorithm is basically a nearest neighbors (NN) greedy matching algorithm where, starting from the largest score, the algorithm tries to find the closest match on the other set. It differs from a regular NN greedy algorithm as it only allows for matches within a certain threshold. The caliper defines the maximum distance the algorithm is allowed to have between matches—this is key because if the caliper is infinite, you’ll always find a neighbor, but that neighbor might be pretty far away. This means not all shops will necessarily find matches, but the matches we end up with will be fairly close. We followed Austin’s recommendation to choose our caliper width.
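
    To make the procedure concrete, here’s a minimal illustrative sketch of greedy NN matching with a caliper (in TypeScript, not the code we actually used; scores are assumed to be on a comparable scale):

        // Greedy nearest-neighbor matching with a caliper, without replacement.
        // Returns pairs of [treatedIndex, controlIndex].
        function caliperMatch(treated: number[], control: number[], caliper: number): Array<[number, number]> {
          const matches: Array<[number, number]> = [];
          const available = new Set(control.map((_, i) => i));

          // Start from the largest treated score, as described above.
          const order = [...treated.keys()].sort((a, b) => treated[b] - treated[a]);

          for (const t of order) {
            let best: number | null = null;
            let bestDistance = Infinity;
            for (const c of available) {
              const distance = Math.abs(treated[t] - control[c]);
              if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
              }
            }
            // Accept the neighbor only if it falls within the caliper;
            // otherwise this treated unit goes unmatched and is dropped.
            if (best !== null && bestDistance <= caliper) {
              matches.push([t, best]);
              available.delete(best); // matching without replacement
            }
          }
          return matches;
        }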

    After computing the caliper and running the greedy NN matching algorithm, we found a match for all but one US first-time Capital adopter among the Canadian counterparts.

    Matching Quality

    Before jumping in to evaluate the impact of Capital, we needed to determine the quality of our matching. We used the following three techniques to assess balance:

    1. Standardized mean differences: This methodology compares the averages of the distributions for the covariates for the two groups. When close to zero, it indicates good balance. Several recommended thresholds have been published in the literature with many authors recommending 0.1. We can visualize this using a “love plot,” like so:

      Love plot comparing feature absolute standardized differences before and after matching.
    2. Visual Diagnostics: Visual diagnostics such as empirical cumulative distribution plots (eCDF), quantile-quantile plots, and kernel density plots can be used to see exactly how the covariate distributions differ from each other (that is, where in the distribution are the greatest imbalances). We plot their distributions to check visually how they look pre and post matching. Ideally, the distributions are superimposed on one another after matching.

      Propensity score plots before matching: less overlap, indicating fewer matches found between groups.
      Propensity score plots after matching: increased overlap, indicating good matches between groups.
    3. Variance Ratios: The variance ratio is the ratio of the variance of a covariate in one group to that in the other. Variance ratios close to 1 indicate good balance because they imply the variances of the samples are similar, whereas numbers close to 2 are sometimes considered extreme.

    In our case, only one covariate hit the 0.1 threshold in the standardized mean differences method, the visual comparison (see above) showed great improvement and good alignment in covariate distributions for the matched sets, and all of our variance ratios were below 1.3.

    These checks cover most of the steps recommended in the literature for verifying that a matching is sound enough for further analysis. While we could have kept tweaking covariates and testing different methods until a perfect matching was achieved, that would risk introducing bias and wouldn’t guarantee the assumptions would be any stronger. So, we decided to proceed with assessing the treatment effect.

    How We Evaluated Impact

    At this point, the shops were matched, we had the counterfactual and treatment groups, and we knew the matching was balanced. We’d arrived at the real question: Is Shopify Capital impacting their sales? What’s the difference in GMV between shops who did and didn’t receive Shopify Capital?

    In order to assess the effect of the treatment, we set up a simple binary regression: y’ = β₀ + β₁ * T.

    Here, T is a binary indicator of treatment (1 for a US shop in the treated group, 0 for a Canadian shop in the control group), β₀ is the intercept for the regression, and β₁ is the coefficient that shows how being in treatment influences our target on average. The target, y', is the logarithm of the cumulative six-month GMV, from February to July 2019, plus one (that is, a log1p transform of six-month sales).
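
    With a single binary regressor, the OLS estimate of β₁ reduces to the difference in group means of the log-transformed target, which back-transforms to a ratio of geometric means. A small illustrative sketch (not our actual analysis code):

        // Illustrative only: estimate the relative lift from two groups'
        // cumulative six-month GMV values.
        const log1p = (x: number) => Math.log(1 + x);
        const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

        function estimatedLift(treatedGmv: number[], controlGmv: number[]): number {
          // beta1 = difference in means of the log1p-transformed targets.
          const beta1 = mean(treatedGmv.map(log1p)) - mean(controlGmv.map(log1p));
          // Back-transforming gives (approximately) the ratio of geometric
          // means, so exp(beta1) - 1 is the relative lift: 0.36 means +36%.
          return Math.exp(beta1) - 1;
        }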

    Using this methodology, we found that US merchants on average had a 36% higher geometric average of cumulative six-month GMV after taking Capital for the first time than their peers in Canada.

    How Confident Are We in Our Estimated Treatment Effect? 

    In order to make sure we were confident in the treatment effect we calculated, we ran several robustness checks. We won’t get into the details, but we used the margins package, simulated an A/A test to validate our point estimate, and followed Greifer’s proposed method for bootstrapping.

    Cumulative geometric average of sales between groups before and after taking their first round of Capital.

    Our results show that the 95% confidence interval for the average increase in the target, after taking Capital for the first time, is between 13% and 65%. The most important takeaway is that the lower bound is positive—so we can say with high confidence that Shopify Capital has a positive effect on merchants’ sales.

    Final Thoughts

    With high statistical significance, backed by robustness checks, we concluded that the average difference in the geometric mean of GMV in the following six months after adopting Shopify Capital for the first time is +36%, bounded by +13% and +65%. We can now say with confidence that Shopify Capital does indeed help our merchants—and not only that, but it validates the work we’re doing as a data team. Through this study, we were able to prove that one of our first machine learning products has a significant real-world impact, making funding more accessible and helping merchants grow their businesses. We look forward to continuing to create innovative solutions that help our merchants achieve their goals.

    Breno Freitas is a Staff Data Scientist working on Shopify Capital Data and a machine learning researcher at Federal University of Sao Carlos, Brazil. Breno has worked with Shopify Capital for over four years and currently leads a squad within the team. Currently based in Ottawa, Canada, Breno enjoys kayaking and working on DIY projects in his spare time.

    Nevena Francetic is a Senior Data Science Manager for Money at Shopify. She’s leading teams that use data to power and transform financial products. She lives in Ottawa, Ontario and in her spare time she spoils her little nephews. To connect, reach her on LinkedIn.



    Building Blocks of High Performance Hydrogen-powered Storefronts

    The future of commerce is dynamic, contextual, and personalized. Hydrogen is a React-based framework for building custom and creative storefronts giving developers everything they need to start fast, build fast, and deliver the best personalized and dynamic buyer experiences powered by Shopify’s platform and APIs. We’ve built and designed Hydrogen to meet the three needs of commerce:

    1. fast user experience: fast loading and responsive
    2. best-in-class merchant capabilities: personalized, contextual, and dynamic commerce
    3. great developer experience: easy, maintainable, and fun.
    A visualization of a .tsx file showing the ease of adding an Add to Cart button to a customized storefront
    Hydrogen provides optimized React components enabling you to start fast.

    These objectives have an inherent tension that’s important to acknowledge. You can achieve fast loading through static generation and edge delivery, but you must then forgo personalization or make it a client-side concern, which results in a deferred display of critical content. Conversely, rendering dynamic responses from the server implies a slower initial render but, when done correctly, can deliver a better commerce and shopping experience. However, delivering efficient streaming server-side rendering for React-powered storefronts, and smart server and client caching, is a non-trivial and unsolved developer experience hurdle for most teams.

    Hydrogen is built and optimized to power personalized, contextual, and dynamic commerce. Fast and efficient server-side rendering with high-performance storefront data access is the prerequisite for such experiences. To optimize the user experience, we leverage a collection of strategies that work together:

    • streaming server-side rendering
    • React Server Components
    • efficient data fetching, colocation, and caching
    • combining dynamic and edge serving

    There’s a lot to unpack here, so let’s take a closer look at each one.

    Streaming Server-side Rendering

    Consider a product page that contains a significant amount of buyer personalized content: a localized description and price for a given product, a dynamic list of recommended products powered by purchase and navigation history, a custom call to action (CTA) or promotion banner, and the assignment to one or several multivariate A/B tests.

    A client-side strategy would, likely, result in a fast render of an empty product page skeleton, with a series of post-render, browser-initiated fetches to retrieve and render the required content. These client-initiated roundtrips quickly add up to a subpar user experience.

    A visualization showing the differences between Client-side Rendering and Server-side Rendering
    Client-side rendering vs. server-side rendering

    The client-side rendering (CSR) strategy typically results in a delayed display of critical page content—that is, a slow largest contentful paint (LCP). An alternative strategy is to server-side render (SSR)—fetch the data on the server and return it in the response—which helps eliminate round trips (RTTs) and allows the first and largest contentful paints to fire close together, but at the cost of a slow time-to-first-byte (TTFB) because the server is blocked on the data. This is where and why streaming SSR is a critical optimization.

    A visualization showing how Streaming Server-side Rendering unlocks critical performance benefits.
    Streaming server-side rendering unlocks fast, non-blocking first render

    Hydrogen adopts the new React 18 alpha streaming SSR API powered by Suspense that unlocks critical performance benefits:

    • Fast TTFB: the browser streams the HTML page shell without blocking the server-side data fetch. This is in contrast to “standard” SSR where TTFB is blocked until all data queries are resolved.
    • Progressive hydration: as server-side data fetches are resolved, the data is streamed within the HTML response, and the React runtime progressively hydrates the state of each component, all without extra client round trips or blocking on rendering the full component tree. This also means that individual components can show custom loading states as the page is streamed and constructed by the browser.

    The ability to stream and progressively hydrate and render the application unlocks fast TTFB and eliminates the client-side waterfall of CSR—it’s a perfect fit for the world of dynamic and high-performance commerce.
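
    To illustrate the pattern, here’s a minimal sketch (component names are illustrative, not Hydrogen’s actual API): the shell and the fallback stream immediately, while the slow section streams in once its data resolves.

        import React, {Suspense} from 'react';

        function ProductDetails() {
          // Critical content: rendered and streamed right away.
          return <h1>Product title, price, and buy button</h1>;
        }

        function RecommendationsSkeleton() {
          return <p>Loading recommendations…</p>;
        }

        function ProductRecommendations() {
          // In a real app this component would suspend on a server-side
          // data fetch; it stands in for any slow, personalized section.
          return <ul>{/* personalized recommendations */}</ul>;
        }

        export default function ProductPage() {
          return (
            <main>
              <ProductDetails />
              <Suspense fallback={<RecommendationsSkeleton />}>
                <ProductRecommendations />
              </Suspense>
            </main>
          );
        }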

    React Server Components

    “Server Components allow developers to build apps that span the server and client, combining the rich interactivity of client-side apps with the improved performance of traditional server rendering.”
        —RFC: React Server Components

    Server components are another building block that we believe (and have been collaborating on with the React core team) is critical to delivering high-performance storefronts. RSC enables a separation of concerns between client and server logic and components, which unlocks a host of downstream benefits:

    • server-only code that never ships to the browser, reducing bundle sizes
    • server-side access to custom and private server-side data sources
    • seamless integration and well-defined protocol for server+client components
    • streaming rendering and progressive hydration
    • subtree and component-level updates that preserve client-state
    • server and client code sharing where appropriate.
    A home.server.jsx file that has been highlighted to show where code sharing happens, the server-side data fetch, and the streaming server-side response.

    Server components are a new building block for most React developers and have a learning curve, but, after working with them for the last ten months, we’re confident in the architecture and performance benefits that they unlock. If you haven’t already, we encourage you to read the RFC, watch the overview video, and dive into Hydrogen docs on RSC.

    Efficient Data Fetching, Colocation, and Caching

    Delivering fast server-side responses requires fast and efficient first party (Shopify) and third party data access. When deployed on Oxygen—a distributed, Shopify hosted V8 Isolate-powered worker runtime—the Hydrogen server components query the Storefront API with localhost speed: store data is colocated and milliseconds away. For third party fetches, the runtime exposes the standard Fetch API enhanced with smart cache defaults and configurable caching strategies (sketched after the list below):

    • smart default caching policy: key generation and cache TTLs
    • ability to override and customize cache keys, TTLs, and caching policies
    • built-in support for asynchronous data refresh via stale-while-revalidate.
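
    As a rough illustration of what a configurable caching policy looks like (this is not Hydrogen’s actual API; the option names and header handling here are only meant to convey the concept):

        interface CachePolicy {
          maxAge: number;               // seconds the response is considered fresh
          staleWhileRevalidate: number; // seconds a stale response may be served
                                        // while it's refreshed asynchronously
        }

        // A third-party fetch wrapped with a caching policy. In a worker
        // runtime, the policy is typically applied via the Cache-Control
        // header before the response is stored in the cache.
        async function cachedFetch(url: string, policy: CachePolicy): Promise<Response> {
          const response = await fetch(url);
          const headers = new Headers(response.headers);
          headers.set(
            'Cache-Control',
            `public, max-age=${policy.maxAge}, stale-while-revalidate=${policy.staleWhileRevalidate}`
          );
          return new Response(response.body, {status: response.status, headers});
        }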

    To learn more, see our documentation on useShopQuery for accessing Shopify data, and fetch policies and options for efficient data fetching.

    Combining the Best of Dynamic and Edge Serving

    Adopting Hydrogen doesn’t mean all data must be fetched from the server. On the contrary, it’s good practice to defer or lazy-load non-critical content from the client. Below-the-fold or non-critical content can be loaded on the client using regular React patterns and browser APIs, for example using IntersectionObserver to determine when content is on (or soon to be on) screen and load it on demand, as sketched below.
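
    A minimal sketch of that pattern (the element id and loader function are illustrative):

        const section = document.querySelector('#recommendations');

        if (section) {
          const observer = new IntersectionObserver(
            (entries) => {
              for (const entry of entries) {
                if (entry.isIntersecting) {
                  loadRecommendations(); // fetch and render the deferred content
                  observer.disconnect(); // load once, then stop observing
                }
              }
            },
            {rootMargin: '200px'} // start loading shortly before it scrolls into view
          );
          observer.observe(section);
        }

        function loadRecommendations(): void {
          // e.g. client-side fetch of below-the-fold recommendations
        }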

    Similarly, there’s no requirement that all requests are server-rendered. Pages and subrequests with static or infrequently updated content can be served from the edge. Hydrogen is built to give developers the flexibility to deliver the critical personalized and contextual content, rendered by the server, with the best possible performance while still giving you full access to the power of client-side fetching and interactivity of any React application.

    The important consideration isn’t which architecture to adopt, but when you should be using server-side rendering, client-side fetching, and edge delivery to provide the best commerce experience—a decision that can be made at a page and component level.

    For example, an about or marketing page that’s typically static can and should be safely cached, served directly from the CDN edge, and asynchronously revalidated with the help of a stale-while-revalidate strategy. Opting in to edge serving is a few keystrokes away for any response on a Hydrogen storefront. This capability, combined with granular and optimized subrequest caching (powered by the fetch API we covered above), gives full control over data freshness and the revalidation strategy.

    Putting It All Together

    Delivering a high-performance, dynamic, contextual, and personalized commerce experience requires optimizations at each layer of the stack. Historically, this has been the domain of a few well-resourced engineering teams. The goal of Hydrogen and Oxygen is to level the playing field:

    • the framework abstracts all the streaming
    • the components are tuned to speak to Shopify APIs
    • the Oxygen runtime colocates and distributes rendering around the globe.

    Adopting Hydrogen and Oxygen should, we hope, enable developers to focus on building amazing commerce experiences, instead of the undifferentiated technology plumbing and production operations to power a modern and resilient storefront.

    Take Hydrogen out for a spin, read the docs, leave feedback. Let’s build.

    Ilya Grigorik is a Principal Engineer at Shopify and author of High Performance Browser Networking (O'Reilly), on a mission to supercharge commerce and empower entrepreneurs around the world.



    The Vitality of Core Web Vitals

    In 2020, Google introduced unified guidance for great user experience (UX) on the web called Core Web Vitals. It proposes specific metrics and numerical thresholds for a multifaceted discipline. The current metrics focus on loading, interactivity, and visual stability. You might think, “Nice stuff, thank you, Google. I’ll save this to my bookmarks and look into it once the time comes for nice-to-have investigations!” But before deciding, take a closer look at the following: this year, Google made the Core Web Vitals metrics one of the factors in its search ranking algorithm. To be precise, the rollout of page experience in ranking systems began in mid-June of 2021 and completed at the end of August.

    Does that mean we should expect a completely different ranking of Google Search results in September already? Or, the horror case, our websites showing up on the s-e-c-o-n-d Search Engine Results Page (SERP)? Drastic changes won’t appear overnight, but the update will undoubtedly influence the future of ranking. First of all, the usability of web pages is only one factor that influences ranking. The meaning of the query, the relevance of a page, the quality of sources, context, and settings are other big influencers on the final results. Secondly, most websites are in the same boat, getting “not great, not terrible” grades. According to a Google Core Web Vitals study from April 2021, only four percent of all studied websites are prepared for the update, with a good rating in all three metrics. It’s good timing for companies to invest effort in the necessary improvements and easily stand out among other websites. Lastly, user expectations continue to rise, and Google has a responsibility to help users reach relevant results. At the same time, Google pushes the digital community to prioritize UX, because that helps keep users on their websites. Google’s study shows that visitors are 24% less likely to abandon websites that meet the proposed metrics thresholds.

    Your brain is most likely filled with dopamine from thinking about possible UX improvements to your website. Let’s use that momentum and dig deeper into each metric of Core Web Vitals.

    Core Web Vitals Metrics

    Core Web Vitals is a subset of Web Vitals, Google’s unified guidance for indicating great UX. The core metrics are the ones that matter most. Metrics are not written in stone! They represent the best available indicators developers have today, so be ready for future improvements or additions.

    The current set of metrics is largest contentful paint (LCP), first input delay (FID), and cumulative layout shift (CLS).

    An image showing the three Core Web Vitals and the four other Web Vitals
    Listed metrics of Web Vitals: mobile-friendly, safe browsing, HTTPS, no intrusive interstitials, loading, visual stability, and interactivity. The last three are the Core Web Vitals.

    Largest Contentful Paint

    LCP measures the time to render the largest element in the currently viewed part of a page. The purpose is to measure how quickly the main content is ready for the user. Discussions and research established that the main content is considered to be the largest image or text block in the viewport. The elements considered are:

    • <img>
    • <image> inside an <svg> (note: <svg> itself currently isn’t considered a candidate)
    • <video>
    • an element with a background image loaded via the url() CSS function
    • block-level elements containing text nodes or other inline-level text element children.

    During page load, the largest element in the viewport is detected as the LCP candidate. It might change until the page is fully loaded. In example A below, the candidate changed three times as larger elements were found. Commonly, the LCP element is the last loaded element, but that’s not always the case. In example B below, the paragraph of text is the largest element, displayed before the page loads an image. Comparing the two, example B has a better LCP score than example A.

    An image depicting the differences between LCP being the last loaded element and LCP occurring before the page is fully loaded.
    LCP detection in two examples: LCP is the last loaded element on a page (A), LCP occurs before the page is fully loaded (B).

    Websites should keep LCP at 2.5 seconds or less to score as good UX. But… why 2.5? The inspiration was taken from studies by Stuart K. Card and Robert B. Miller, which found that a user will wait roughly 0.3 to 3 seconds before losing focus. In addition, data gathered about top-performing sites across the Web showed that such a limit is consistently achievable for well-optimized sites.

    Metric   Good        Poor
    LCP      <= 2.5s     > 4s
    FID      <= 100ms    > 300ms
    CLS      <= 0.1      > 0.25

    The thresholds of “good” and “poor” Core Web Vitals scores. Scores in between are considered “needs improvement”.

    First Input Delay

    FID quantifies the user’s first impression of the responsiveness and interactivity of a page. To be precise, it measures how long a browser takes to become available to respond to the user’s first interaction on a page. For instance, the time between when the user clicks the “open modal” button and when the browser is ready to trigger the modal opening. You may wonder, shouldn’t the code be executed immediately after the user’s action? Not necessarily: during page load, the browser’s main thread is super busy parsing and executing loaded JavaScript (JS) files, so incoming events might have to wait to be processed.

    A visualization of a browser loading a webpage showing that FID is the time between the user's first interaction and when they can respond
    FID represents the time between when a browser receives the user’s first interaction and when it can respond to it.

    FID measures only the delay in event processing. The time to process the event and update the UI afterwards was deliberately excluded to avoid workarounds like moving event processing logic to asynchronous callbacks. Such a workaround would improve the metric score because it separates processing from the task associated with the event. Sadly, it wouldn’t bring any benefit to the user—likely the opposite.

    A user’s interaction requires the main thread to be idle even when no event listener is registered. For example, the main thread might delay the user’s interaction with the following HTML elements until it completes ongoing tasks:

    • text fields, checkboxes, and radio buttons (<input>, <textarea>)
    • select dropdowns (<select>)
    • links (<a>).

    Jakob Nielsen described in the Usability Engineering book: “0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result”. Despite being first described in 1993, the same limit is considered good in Core Web Vitals nowadays.

    Cumulative Layout Shift

    CLS measures visual stability and counts how much the visible content shifts around. Layout shifts occur when existing elements change their start position, as defined by the Layout Instability API. Note that when a new element is added to the DOM or an existing element changes size, it doesn’t count as a layout shift!

    The metric is named “cumulative” because the score of each shift is summed. In June 2021, the duration of CLS was improved for long-lived pages (for example SPAs and infinite scroll apps) by grouping layout shifts and ensuring the score doesn’t grow unbounded.

    Are all layout shifts bad? No. CLS focuses only on unexpected ones. Expected layout shifts occur within 500 milliseconds after a user’s interaction (that is, clicking on a link, typing in a search box, and so on). Such shifts are excluded from CLS score calculations. This knowledge may encourage creating extra space immediately after the user’s input, with a loading state for tasks that take longer to complete.

    Let’s use a tiny bit of math to calculate the layout shift score of the following example:

    1. Impact fraction describes the amount of space an unstable element takes up of a viewport. When an element covers 60% of a viewport, its impact fraction is 0.6.
    2. Distance fraction defines the amount of space that an unstable element moves from the original to the final position. When an element moves by 25% of a viewport height, its distance fraction is 0.25.
    3. Using the formula layout_shift_score = impact_fraction * distance_fraction, the layout shift score of this example is 0.6 * 0.25 = 0.15.
    A visualization of two mobile screens showing the 0.15 layout shift score. The second mobile screen shows the page after the layout shift
    The example of 0.15 layout shift score in a mobile view.

    A good CLS score is considered to be 0.1 or less for a page. Evaluating real-world pages revealed that shifts at such scores are still detectable but not excessively disruptive, while shifts scoring 0.15 and above consistently are.
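
    You can watch these shifts yourself with the Layout Instability API. A simplified sketch (no session windowing, so it slightly overstates CLS on long-lived pages; the entry fields are cast because they aren’t in the default TypeScript DOM types):

        // Accumulate a CLS-style score from layout-shift entries.
        let cumulativeScore = 0;

        const observer = new PerformanceObserver((entryList) => {
          for (const entry of entryList.getEntries()) {
            const shift = entry as unknown as {value: number; hadRecentInput: boolean};
            // Shifts within 500ms of user input are expected: skip them.
            if (!shift.hadRecentInput) {
              cumulativeScore += shift.value;
              console.log('layout shift:', shift.value, 'cumulative:', cumulativeScore);
            }
          }
        });

        observer.observe({type: 'layout-shift', buffered: true});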

    How to Measure the Score of My Web Page?

    There are many different tools to measure Core Web Vitals for a page, reflecting two main measurement techniques: in the lab or in the field.

    In the Lab

    Lab data, also known as synthetic data, is collected from a simulated environment without a user. Measurements in such an environment can be tested before features are released to production. Be aware that FID can’t be measured this way! Lab data doesn’t contain the required real user input. As an alternative, it’s suggested to track its proxy: Total Blocking Time (TBT).

    Tooling:

    • Lighthouse: I think it’s the most comprehensive tool using lab data. It can be run on public or authenticated web pages. The generated report indicates the scores and suggests personalized opportunities to improve performance. The best part is that Chrome users already have this tool available under DevTools. One drawback I noticed while using the tool: the page must remain visible on screen during the measurement process, so the same browser can’t analyze several pages in parallel. Lastly, Lighthouse can be incorporated into continuous integration workflows via Lighthouse CI.
    • WebPageTest: The tool can perform analyses for public pages. I was tricked by it the first time I provided a URL of an authenticated page. I got results. Better than I expected. Just before patting myself on the back, I decided to dig deeper into the waterfall view. The view showed clearly that the authenticated page was never reached; the test had been redirected to a public login page. Despite that, the tool has handy options to test against different locations, browsers, and device emulators. It might help to identify which countries or regions struggle the most and start you thinking about a Content Delivery Network (CDN). Finally, be aware that the report includes detailed analyses but doesn’t provide advice for improvements.
    • Web Vitals extension: It's the most minimal tool of all. It contains only metrics and scores for the currently viewed page. In addition, the tool shows how it calculates scores in real time. For example, FID is shown as “disabled” until your interaction happens on a page.

    In the Field 

    A site’s performance can vary dramatically based on a user’s personalized content, device capabilities, and network conditions. Real User Monitoring (RUM) captures the reality of page performance, including the mentioned differences. Monitoring data shows the performance experienced by a site’s actual users. On the other hand, there’s a way to check the real-world performance of a site without a RUM setup. Chrome User Experience Report gathers and aggregates UX metrics across the public web from opted-in users. Such findings power the following tools:

    • Chrome UX Report Compare Tool (CRUX): As the name dictates, the tool is meant for pages’ comparison. The report includes metrics and scores of selected devices’ groups: desktop, tablet, or mobile. It is a great option to compare your site with similar pages of your competitors.
    • PageSpeed Insights: The tool provides detailed analyses for URLs that are known by Google’s web crawlers. In addition, it highlights the opportunities for improvements.
    • Search Console: The tool reports performance data per page, including historical data. Before using it—verification of ownership is mandatory.
    • Web Vitals extension: The tool was mentioned among the lab toolings, but there’s one more feature to reveal. For pages whose field data is available via the Chrome UX Report, lab data (named “local” in the extension) is combined with real-user data from the field. This integration might indicate how similar your individual experience is to that of other website users.

    CRUX-based tools are great, quick starters for investigations. That said, your own RUM data can provide more detailed and immediate feedback. Setting up RUM for a website might look scary at first, but usually it takes these steps (a minimal snippet follows the list):

    1. In order to send data from a website, a developer implements a RUM Javascript snippet to the source code.
    2. Once the user interacts or leaves the website, the data about an experience is sent to a collector. This data is processed and stored in a database that anyone can view via convenient dashboards.
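
    As a minimal example of the first step, here’s what such a snippet could look like with Google’s open source web-vitals library (assuming its v2 API; the /rum collector endpoint is hypothetical):

        import {getCLS, getFID, getLCP} from 'web-vitals';

        function sendToCollector(metric: {name: string; value: number; id: string}) {
          const body = JSON.stringify(metric);
          // sendBeacon survives the page unload; fall back to fetch if needed.
          if (!navigator.sendBeacon('/rum', body)) {
            fetch('/rum', {method: 'POST', body, keepalive: true});
          }
        }

        getCLS(sendToCollector);
        getFID(sendToCollector);
        getLCP(sendToCollector);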

    How to Improve

    Core Web Vitals provides insights into what’s hurting the UX. For example, setting up RUM even for a few hours can reveal where the most significant pain points are. The worst-scoring metrics and pages indicate where to start searching for improvements, and the other tools mentioned in the previous section can suggest how to fix specific issues. The great thing is that changes applied to improve one metric will likely increase all the scores.

    Many of these indications and bits of advice may sound like coins in a Super Mario game, hanging there for you to grab. That isn’t the case. The hard and sweaty work remains on your table! Not all opportunities are straightforward to implement. Some might involve big, long-lasting refactoring that can’t be done in one go, or that requires preparation. Here are several strategies to start exploring:

    1. Update third-party libraries. After reviewing your application’s libraries, you might find that some are no longer used, or that lighter alternatives (covering the same use case) exist. Next, sometimes only part of an included library is actually used, which leads to a portion of JS code being loaded without any purpose. Tree-shaking can solve this issue: it loads only the specific features registered from a library instead of loading everything. Be aware that not all libraries support tree-shaking yet, but it’s getting more and more popular. Updating application dependencies may sound like small help, but let’s lead by example. During Shopify’s internal Hack Days, my team executed these updates for our dropshipping app Oberlo. It decreased the compressed bundle size of the application by 23%! How long did it take for research and development? Less than three days.
      This improves FID and LCP.
    2. Preload critical assets. The loading process might be extended because the browser discovers crucial page resources late. By declaring which resources should be fetched as soon as possible, loading can be improved drastically. For example, Shopify saw a 50% (1.2 seconds) improvement in time-to-text-paint by preloading Web Fonts.
      This improves FID and LCP.
    3. Review your server response time. If you’re experiencing severe delays, you may try the following:
      a) use a dedicated server instead of a shared one for web hosting
      b) route the user to a nearby CDN
      c) cache static assets
      d) use service workers to reduce the amount of data users need to request from a server (a minimal service worker sketch also follows this list).
      This improves FID and LCP.
    4. Optimize heavy elements. First, shorten the loading and rendering of critical resources by implementing a lazy-loading strategy: it defers the loading of large elements, like images below the page viewport, until the user actually needs them. Do not lazy-load elements in the initial viewport, because the LCP element should be loaded as fast as possible! Second, compress images so there are fewer bytes to download. Images don’t always require high quality or resolution and can be downgraded intentionally without the user noticing. Lastly, provide dimensions as width and height attributes or aspect-ratio boxes, so the browser can allocate the correct amount of space in the document while the image loads and avoid unexpected layout shifts (see the HTML sketch after this list).
      This improves FID, LCP, and CLS.
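
    To make strategy 1 concrete, here’s a hedged sketch of the import pattern that enables tree-shaking. It assumes a bundler such as webpack or Rollup, with lodash and its ES-module build lodash-es standing in for any library:

        // The pattern to avoid: it pulls the entire library into the bundle,
        // even if only debounce is ever used.
        //   import _ from 'lodash';

        // Tree-shakeable alternative: with a modern bundler, everything
        // except debounce is dropped from the bundle.
        import { debounce } from 'lodash-es';

        const onResize = debounce(() => console.log('resized'), 250);
        window.addEventListener('resize', onResize);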
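
    For strategies 2 and 4, an HTML sketch (the file names and dimensions are placeholders):

        <head>
          <!-- Strategy 2: ask the browser to fetch a critical font right away.
               Preloaded fonts must be marked crossorigin. -->
          <link rel="preload" href="/fonts/heading.woff2" as="font"
                type="font/woff2" crossorigin>
        </head>
        <body>
          <!-- The LCP hero image: explicit dimensions, never lazy-loaded. -->
          <img src="/hero.jpg" width="1200" height="600" alt="Hero">

          <!-- Strategy 4: below-the-fold images load lazily, with space
               reserved up front so the layout doesn't shift. -->
          <img src="/gallery-1.jpg" loading="lazy" width="600" height="400"
               alt="Gallery">
        </body>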
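
    And for strategy 3d, a deliberately minimal cache-first service worker; the cache name and asset paths are illustrative:

        // sw.js — register it from the page with:
        //   navigator.serviceWorker.register('/sw.js');
        const CACHE = 'static-v1';
        const ASSETS = ['/styles.css', '/app.js', '/logo.png'];

        self.addEventListener('install', (event) => {
          // Pre-cache the static assets while the worker installs.
          event.waitUntil(
            caches.open(CACHE).then((cache) => cache.addAll(ASSETS))
          );
        });

        self.addEventListener('fetch', (event) => {
          // Serve from the cache when possible; otherwise hit the network.
          event.respondWith(
            caches.match(event.request).then((cached) => cached || fetch(event.request))
          );
        });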

    To sum everything up, Google introduced Core Web Vitals to help us improve the UX of websites. In this article, I’ve clarified each core metric, the reasoning behind the score thresholds, the tools for measuring the UX scores of your web pages, and strategies for improvement. Loading, interactivity, and visual stability are the metrics highlighted today; future research and analyses might reveal different Core Web Vitals to focus on. Be prepared!

    Meet the author of this article—Laura Silvanavičiūtė. Laura is a Web Developer who is making drop-shipping better for everyone together with the Oberlo app team. Laura loves Web technologies and is thrilled to share the most exciting parts with others via tech talks at conferences or articles on medium.com/@laurasilvanavi.


    Always on Calibration with a Quarterly Summary

    Being always-on means we don’t wait until the end of a review period to reach alignment around expectations and performance. We’re always calibrating and aligning. This removes any surprise about how an individual on the team is doing and recognizes the specific impact they had on the business. It also lets us find and act on specific, just-in-time growth opportunities for each individual.

    Today, continuous deployment is an accepted and common best practice for building and releasing software. But where’s the continuous deployment for individuals? Just like the software we write, we’re iterating on ourselves and deploying new versions all the time. We learn, grow, and try new ideas, concepts, and approaches in our work and interactions. But how are we supposed to iterate and grow when, generally, we calibrate so infrequently? Take a minute and think about your experiences:

    • What do calibrations look like for you? 
    • When was the last time you got feedback? 
    • How often do you have performance conversations? Do you have them at all? 
    • Do you always know how your work connects to your job expectations? Do you know how your work contributes to the organizational goals?
    • Are you ever surprised by your performance evaluation? 
    • Were you able to remember, in detail, all of your contributions for the performance review window? 

    If you receive feedback and have performance reviews but answered "no" or "I don’t know" to any of the above questions, then unfortunately, you’re not alone. I’ve answered "no" or "I don’t know" to each of these questions at one point or another in my career. I went years without any sort of feedback or review. I’ve also worked on teams where performance reviews were yearly, and only at the end of the year did we try to collect a list of the feedback and contributions made over that year.

    I have found great value in performance reviews with my team, but only when they’re continuous and always-on. Anything short of always-on leads to surprises, missed opportunities for growth and for timely corrective feedback, and an incomplete view of individuals’ contributions. This culminates in missed opportunities for the individual when it comes to their performance, promotion, and compensation reviews and conversations.

    Ye Old Performance Review

    Ye Old performance reviews are plagued by the same problems that surrounded software development before continuous deployment. Once every X months, we’d gather all the latest features, build release candidates, and try to ship our code to production. These infrequent releases never worked out and were always plagued with issues. Knowing this didn’t work well for software deployments, why do so many still apply the same thinking to the individuals in their company and on their team?

    The Goal of Performance Reviews 

    Performance reviews are meant to calibrate with an individual (that is, capture and align) on the impact they had on the business over a given period, reward them for that impact, provide feedback, and stretch them into new opportunities (when they’re ready).

    Some Problems with Ye Old Performance Reviews 

    I don’t know about you, but I can’t recall what I had for dinner a week ago, let alone what I worked on months ago. Sure, I can give you the high-level details, just like I can for that dinner last week: I probably had protein and some vegetables. Which ones? Who knows. The same applies to impact conversations. How are we supposed to have a meaningful impact conversation if we can’t recall what we did and the impact it had? How are we supposed to provide feedback, so the next meal turns out better, if we don’t even recall what we ate? The specifics around those contributions are important, and those specifics are available now.

    Always-on calibration reduces surprises and disconnects. It helps ensure specificity when reviewing contributions, and it helps individuals grow into better versions of themselves. What if, after that meal I made last week, I’d gotten feedback right away from my partner? That meal was good, but it could have used a little more spice. Great. The next time I make that meal, I can add some extra spice. If she didn’t tell me for a year, then every time I made it that year, it would be lacking that spice, and I’d have missed the opportunity to refine my palate and grow. Let’s look more closely at some of the problems that arise with infrequent calibrations.

    Contributions

    When reviews are infrequent or non-existent, we’re unable to know the full scope and impact of the work performed by the individual. Their work isn’t fully realized, captured, and rewarded. If we do try to capture these contributions at the end of a review window, we run into a few problems:

    • Our memory fades and we forget our contributions and the impact we had. 
    • Our contributions often end up taking the form of general statements that lack specificity.
    • We suffer from a recency bias that colours how we see the contributions over the entire review period. This results in more weight being given to recent contributions and less to those that happened closer to the start of the review period.
    • Managers and individuals can have a different or an incomplete view of the contributions made.

    If we are unable to see the full scope of work contributed by the individual, we’re missing opportunities to reward them for this effort and grow them towards their long-term goals.

    Growth

    When reviews are infrequent, we’re missing out on the opportunity to grow the individuals on our team. With reviews come calibration, feedback, and alignment, which surface areas for growth: new opportunities and feedback on where to improve. If we try to capture these things at the end of review windows, we have a few problems:

    • We’re unable to grow in our careers as quickly because 
      • We can’t quickly try out new things and receive early and frequent feedback.
      • We’re infrequently looking for new opportunities that benefit our career growth. 
    • We won’t be able to provide early feedback or support when something isn’t working. 
    • There’s a lack of specificity on how team members can achieve the next level in their careers.
    • Individuals can overstay on a team when there’s a clear growth opportunity for them elsewhere in the organization.

    Frequent calibrations allow individuals to grow faster as they can find opportunities when they are ready, iterate on their current skills, and pivot towards a more successful development and contribution path. 

    The Quarterly Calibration Document

    Every quarter, each member of my team makes a copy of this quarterly objectives template. At present, this template consists of six sections:

    1. Intended Outcomes: what do they intend to accomplish going into this quarter?
    2. Top Accomplishments: what are their most impactful accomplishments this quarter?
    3. Other Accomplishments: what other impactful work did they deliver this quarter?
    4. Opportunities for the next three to six months: what opportunities have we identified for them that aren’t yet available to work on but will be in upcoming quarters?
    5. Feedback: what feedback did we receive from coworkers?
    6. Quarterly Review: a table that connects the individual’s specific impact for a given quarter to the organization’s expectations for their role and level.

    As you can probably guess from these sections, this is a living document that’s updated over the quarter. Each individual creates a new one at the start of the quarter and outlines their intended outcomes. We then discuss and align on those intended outcomes during our first one-on-one of the quarter. After we’ve aligned, the individual updates the document every week before our one-on-one so that it contains their accomplishments and any feedback they’ve received from coworkers. With this document and the organization’s role expectations, we can calibrate weekly and state where they’re meeting, exceeding, or missing our expectations for their role. We can also call out areas for development and look for opportunities in current and upcoming work where they can develop. There’s never any surprise about how they’re performing or what opportunities are coming up for their growth.

    A Few Key Points

    There are a few details I’ve learned to pay close attention to with these calibrations. 

    Review Weekly During Your One-on-one

    I’m just going to assume you’re doing weekly one-on-ones; they’re a great opportunity for coaching and mentorship. For always-on calibrations to be successful, you need to dedicate time to them as part of these weekly meetings. Calibrating weekly lets you:

    • Provide feedback on their work and its impact
    • Recognize their contributions regularly
    • Show them the growth opportunity available to them that connects to their development plan
    • Identify deviations from the agreed upon intended outcomes and take early action to ensure they have a successful quarter

    Who Drives and Owns the Quarterly Calibration Document?

    These calibration documents are driven by the individual while they’re mentored and coached by their manager. Putting ownership of the document on the individual means they see how their objectives align with the expectations your organization has for someone at their role and level. They know what work they completed and the impact of that work. They also have a development plan and know what they’re working towards. That’s not to say their manager doesn’t have a place in this document; we’re there to mentor and coach. If the intended outcomes aren’t appropriate for their level or are unrealistic for the period, we give them that feedback and help them craft appropriate intended outcomes and objectives. We also know about upcoming work and how it might interest them or grow them towards their long-term goals.

    Always Incorporate Team Feedback

    Waiting to collect and share feedback on an infrequent basis results in vague, non-actionable, non-specific feedback that hinders the growth of the team. Infrequently collected feedback often takes the form of "She did great on this project" or "They’re a pleasure to work with." This isn’t helpful to anyone’s growth. Feedback needs to be candid, specific, timely, and focused on the behaviour. Most importantly, it needs to come from a place of caring. By discussing feedback in the quarterly document and during weekly one-on-ones, you can:

    • Collect highly specific and timely feedback.
    • Identify timely growth opportunities.
    • Provide a reminder to each individual on the importance of feedback to everyone’s growth. 

    During our weekly one-on-ones, we discuss any feedback they’ve received from coworkers during the previous week. I also use this time to solicit feedback about their teammates, which they may or may not have already shared with those coworkers. We take time to break down this feedback together and discuss the specifics. If the feedback is positive, we make sure to note it, as it’s useful in future promotion and compensation conversations. If the feedback is constructive, we discuss the specifics and highlight future opportunities to apply what was learned. Where appropriate, we also incorporate new intended outcomes into the quarterly calibration document.

    What Managers Can Do When Things Aren’t on Track

    This topic deserves its own post, but the short of it is: hard conversations don’t get easier with time, they get harder. Let’s say we don’t appear to be on track to reach our intended outcomes, and the reason is performance-related. This is where we need to act immediately. Don’t wait. Do it now. The longer you wait, the harder these difficult conversations become. For me, this is akin to howlers in Harry Potter. For those who don’t know the books, howlers are letters from parents to seemingly misbehaving students that yell at the student about whatever occurred. If you don’t open the letter right away and get the yelling over with, it gets worse and worse: it smokes in the corner, and the yelling is far louder when you eventually do open it. This is what I think of whenever I have difficult feedback to provide. I know it’s going to be difficult, but all parties benefit when it’s given early, so the recipient has a chance to course correct. The good news is that you’re having weekly calibration sessions, not yearly ones, so the individual has plenty of time to correct any performance issues before they become serious problems. But only if the manager jumps in.

    What Happens When We Aren’t Hitting Our Intended Outcomes and Objectives 

    First, it’s important to be clear about which intended outcomes aren’t being hit. Are they specific to personal development, or to their current role and level?

    Personal Development Intended Outcomes

    In addition to meeting the expectations of their role and level, they’re hopefully working towards their long-term aims (see The AWARE Development Plan for more on this topic). Working with your development plan, you’ve worked back from your long-term aims to set a series of short-term intended outcomes that move you towards those goals. Missing these intended outcomes delays reaching your long-term aim, but it doesn’t affect your performance in your current role and level. When personal development aims are missed, we should reaffirm the value of these intended outcomes and, if they’re still appropriate, prioritize them in the future. We need to discuss and acknowledge the delay that missing these outcomes has on their long-term aims, so all parties remain aligned on development goals.

    Role or Level Intended Outcomes

    At the start of a review period, we agree on a set of intended outcomes for an individual based on their role and level. Once we learn that the intended outcomes for the role or level aren’t going to be achieved, we need to understand why. If the reason is that priorities have changed, we refocus our efforts elsewhere: we acknowledge the work and impact they’ve had to date and set new intended outcomes appropriate for the time remaining in the review period. If, on the other hand, they’re unable to meet the intended outcomes because they aren’t up to the task, we may need to set up a performance improvement plan to help them gain the skills needed to execute at the level expected of them.

    The Quarterly Review

    We roll up and calibrate our impact every quarter. This period can be adjusted to suit your organization, but I’d recommend something shorter than a year and longer than a month. Waiting a year is too long, and there’s too much data to work with when creating your evaluation. Breaking yearly reviews down into smaller windows has a few advantages. It:

    • Allows you to highlight impactful windows of contribution (you may have a great quarter and just an ok year, breaking it down by quarter gives you the chance to celebrate that great period).
    • Allows you to snapshot smaller windows (this is your highlight reel for a given period and when looking back over the year you can look at these snapshots).
    • Allows you to assign a performance rating for this period and course correct early.

    If we want a true representation of each individual’s contributions to an organization, to ensure those contributions meet the organization’s expectations for their role and level, and to provide the right growth opportunities for each individual, then we need to constantly track and discuss the specifics of their work, its impact, and where they can grow. It’s not enough to look back at the end of the year and collect feedback and a list of accomplishments. Your list will, at best, be incomplete; more importantly, you’ll have missed out on magnifying the growth of each individual. Worse yet, that missed growth compounds year over year, so the impact you lose by not continuously calibrating with your team is huge.

    David Ward is a Development Manager who has been with Shopify since 2018 and is working to develop: Payment Flexibility, Draft Orders, and each member of the team. Connect with David on Twitter, GitHub and LinkedIn.

